International Journal of Advanced Technology in Engineering and Science
www.ijates.com, ISSN (online): 2348-7550, Volume No. 03, Special Issue No. 01, April 2015

AN ANALYSIS OF RESIDUALS IN MULTIPLE REGRESSIONS

Ahmad A. Suleiman, Usman A. Abdullahi, Umar A. Ahmad
Postgraduate Students, Department of Mathematics/Statistics, Sharda University, Greater Noida, India

ABSTRACT

This paper concentrates on residual analysis as a way to check the assumptions of a multiple linear regression model by graphical methods. Specifically, we plot the residuals and the standardized residuals given by the model against the predicted values of the dependent variable, and we examine the normal probability plot, the histogram of the residuals and the quantile plot of the residuals. Finally, we explain the concept of heteroscedasticity, which we use to check the assumption that the residuals in the regression model have the same variance. As an example, a formal method to detect the presence of heteroscedasticity, the Breusch-Pagan test, is carried out in EViews.

I. INTRODUCTION

The main aim of regression modelling and analysis is to develop a good predictive relationship between the dependent (response) and independent (predictor) variables. Regression diagnostics play a vital role in finding and validating such a relationship. In this study, we discuss issues that arise in the development of a multiple linear regression model. Consider the standard multiple linear regression model

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon

where Y is the response variable, the X_j are the predictor variables, the \beta_j are the (regression) parameters to be estimated from the data, and \varepsilon is the error or residual. The validity of the inference methods depends on the error term \varepsilon satisfying these assumptions:

- Independence: the observations (and hence the residuals) are statistically independent.
- Normality: the residuals are normally distributed with zero mean.
- Homoscedasticity: all observations (and hence residuals) have the same variance.
- No multicollinearity: there is no linear correlation between the independent variables.

II. METHOD AND ANALYSIS

Here is a hypothetical data set on consumption, export and GDP:

CONSUMPTION   EXPORT     GDP
50.35718      1.436314   35.06
50.44603      1.414639   35.66
57.87973      1.529996   37.83
72.30876      1.746588   41.4
77.65894      1.801414   43.11
80.01789      1.808723   44.24
103.396       2.092189   49.42
111.4546      2.158299   51.64
126.315       2.292468   55.1
138.3544      2.384188   58.03
149.9102      2.462792   60.87
164.6521      2.56228    64.26
187.6525      2.71842    69.03
195.7883      2.746364   71.29
214.2388      2.846271   75.27
241.5957      2.994864   80.67
288.8777      3.24181    89.11
301.7072      3.274088   92.15
303.5576      3.245564   93.53
292.6464      3.148417   92.95
271.2281      2.991046   90.68
291.739       3.068353   95.08
305.7957      3.105471   98.47
329.3367      3.186307   103.36
366.2961      3.320908   110.3
388.546       3.379249   114.98
433.4515      3.52285    123.04
503.9317      3.740863   134.71
513.4575      3.732336   137.57
505.2289      3.663468   137.91
583.8524      3.874784   150.68
582.0886      3.827768   152.07
635.0129      3.940019   161.17
716.5901      4.115023   174.14
634.7767      3.855544   164.64
706.7427      4.004662   176.48

2.1 Regression Diagnostics

Saving residuals for diagnosis: there are many diagnostics we can perform on the residuals. Here are the most important ones.

Normal Probability Plot. To diagnose whether the errors are normally distributed, we draw a normal probability plot of the residuals. The residuals should fall approximately on a diagonal straight line in this plot.
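As an illustration, the following is a minimal Python sketch of this diagnostic using statsmodels; the paper itself works in EViews, and the file name consumption_export_gdp.csv and the column names are our own labels for the data table above, not something fixed by the paper.

```python
# Fit the regression of consumption on export and GDP, save the
# residuals, and draw a normal probability (Q-Q) plot of them.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical file holding the data table above.
df = pd.read_csv("consumption_export_gdp.csv")

X = sm.add_constant(df[["EXPORT", "GDP"]])   # predictors plus intercept
model = sm.OLS(df["CONSUMPTION"], X).fit()   # multiple linear regression

residuals = model.resid                      # e = y - y_hat
fitted = model.fittedvalues                  # predicted values, used below

# Points close to the 45-degree line support the normality assumption.
sm.qqplot(residuals, line="45", fit=True)
plt.title("Normal probability plot of residuals")
plt.show()
```

The same residuals and fitted values feed the histogram, quantile plot and residual-versus-predicted plots discussed next.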
Histogram of the Residuals. We can also plot a histogram of the residuals to see whether they pile up symmetrically in the middle with thin tails. From our histogram, the residuals appear to be normally distributed.

Quantile Plot of Residuals. We can make a quantile plot in which the residuals are plotted against their percentage (empirical cumulative probability) points, from 0 to 1; if normality holds, this plot should have an S-shape. The S-shape of our curve seems to suggest that the normality assumption is satisfied.

To examine whether the errors are independent, we look at the plot of the residuals against the estimated values and check that the residuals are randomly scattered above and below the zero horizontal line.

[Figure: residuals plotted against predicted values]

Although the mean of the residuals may be accepted to be zero at each x-value, the variance seems to increase with the x-value, suggesting a possible violation of the homoscedasticity assumption.

Standardized Residuals. The standardized residuals are more useful since they are easier to interpret. We plot the standardized residuals against the estimated values to identify outliers in the dependent-variable space. Large values (greater than 2 or 3 in absolute magnitude) indicate possible problems.

[Figure: standardized residuals plotted against predicted values]

Case no. 34, with a standardized residual of 2.246, can be considered an outlier. If we delete this case from the data set and recompute the regression, the fit becomes better.

2.2 Removal of Heteroscedasticity from the Regression Model

We want to remove heteroscedasticity from the regression model because heteroscedasticity is not desirable; that is, the residuals should be homoscedastic. There are many ways to remove heteroscedasticity from a model. One of the most popular is to convert all the variables into logarithms, which is known as a log transformation. Consider again the regression with the three variables consumption, export and gross domestic product (GDP), where consumption is the dependent variable and the other two are independent variables. If we see heteroscedasticity in the model after estimation, we convert all three variables into logs: consumption -> log(consumption), export -> log(export), gdp -> log(gdp). Once we run the model with the log variables, the heteroscedasticity is removed and homoscedasticity appears which, as we know, is desirable.

Our hypotheses are:
Null hypothesis: homoscedasticity.
Alternative hypothesis: heteroscedasticity.

The basic software output is shown below.

[Figure: EViews regression output]

Now we check whether this model exhibits heteroscedasticity.
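The paper performs this check in EViews. As a rough equivalent, the Breusch-Pagan test is also available in statsmodels; the sketch below continues from the model and df objects of the earlier sketch and uses the paper's 5% significance level.

```python
# Breusch-Pagan test for heteroscedasticity; H0: homoscedasticity.
# Continues the earlier sketch, where model and df were defined.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(model.resid, model.model.exog)
print(f"LM statistic = {lm_stat:.4f}, p-value = {lm_pval:.4f}")

if lm_pval < 0.05:
    print("Reject H0: the residuals appear heteroscedastic.")
    # The paper's remedy: re-estimate with every variable log-transformed.
    log_X = sm.add_constant(np.log(df[["EXPORT", "GDP"]]))
    log_model = sm.OLS(np.log(df["CONSUMPTION"]), log_X).fit()
    _, p_after, _, _ = het_breuschpagan(log_model.resid, log_model.model.exog)
    print(f"p-value after log transformation = {p_after:.4f}")
else:
    print("Fail to reject H0: the residuals appear homoscedastic.")
```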
Using the Breusch-Pagan-Godfrey test, the p-value corresponding to the observed R^2 is 0.0394, which is less than 5%; hence we reject the null hypothesis and accept the alternative. This means the model suffers from heteroscedasticity: the residuals are heteroscedastic, so the model is not desirable. After transforming the variables, the Breusch-Pagan-Godfrey test shows that the p-value corresponding to the observed R^2 is 0.7207, which is more than 5%. This means we cannot reject the null hypothesis; the model is now homoscedastic, which is desirable.

2.3 Multicollinearity Problem and the Regression Model

How does multicollinearity affect an estimated regression model? Here is our model:

Y = C + X_1 + X_2 + X_3 + X_4 + X_5 + X_6    (1.1)

where Y is the dependent variable, C is the constant term, and the rest are independent variables. After estimating Model (1.1), we saw that only X_5 is significant while the others are not. We suspect that there is a problem of multicollinearity in (1.1), which is why most of the variables have become insignificant; one common way of confirming such a suspicion is sketched below.
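The paper does not spell out a numerical multicollinearity check, so the following is only a hypothetical sketch using variance inflation factors (VIF), a standard diagnostic that the paper itself does not prescribe; the file name and the column names X1..X6 are illustrative.

```python
# Hypothetical multicollinearity check for Model (1.1) via variance
# inflation factors (VIF); this procedure is not taken from the paper.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("model_1_1_data.csv")  # hypothetical file
exog = sm.add_constant(data[["X1", "X2", "X3", "X4", "X5", "X6"]])

# A VIF far above 10 is a common rule-of-thumb sign that a predictor
# is close to a linear combination of the other predictors.
for i, name in enumerate(exog.columns):
    if name != "const":
        print(f"{name}: VIF = {variance_inflation_factor(exog.values, i):.2f}")
```

Predictors flagged in this way are candidates for the dropping of highly correlated variables mentioned in the conclusion below.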
III. CONCLUSION

1. We estimated the regression residual for each value of y,

\hat{\varepsilon} = y - \hat{y},

that is, the observed value minus the predicted (or expected) value.

2. We made sure multicollinearity was removed, by dropping the appropriate highly correlated independent variables, before studying the residuals.

3. Another problem that can adversely affect a study of multiple regression is heteroscedasticity. In the GDP example above, we detected and removed heteroscedasticity by the method suggested by Breusch and Pagan.

4. In conclusion, we have dealt with:
- estimates of the regression coefficients;
- their standard errors, confidence intervals and tests of their significance;
- the analysis of variance for regression, which produces an overall test of significance and an estimate of the error variance as the residual (error) mean square;
- the multiple correlation coefficient, its square, and its adjusted value, which measure how much of the variation has been captured by the predictor variables and hence how useful the regression is.

Using the saved residuals, we can:
- make suitable plots to examine the assumption of normality;
- carry out a formal test of significance for normality;
- make suitable plots to examine the assumption of homoscedasticity;
- make suitable plots to examine the assumption of independence.

Thus some basic diagnosis of the validity of the assumptions under which a standard regression analysis is carried out can be performed using this regression output.

IV. ACKNOWLEDGEMENTS

Our infinite gratitude goes first and foremost to Almighty Allah for sparing our lives till this moment. Many special thanks and much appreciation go to our honourable supervisor and esteemed lecturer, Prof. U.V. Balakrishnan, whose suggestions, corrections and encouragement made this work a reality. We are also grateful to all the lecturers in the Department of Statistics, in particular Dr. N.M. Chahda, Dr. Sweta Srivastav, Dr. Krushidalam and others, from whose wealth of experience and knowledge we have benefited.

REFERENCES

[1]. Ashish Sen and Muni Srivastava, Regression Analysis: Theory, Methods, and Applications, Springer-Verlag, New York, 1990, p. 92.
[2]. D. A. Belsley, E. Kuh, and R. E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, John Wiley & Sons, New York, 1980. ISBN 0-471-05856-4.
[3]. C. R. Rao, Linear Statistical Inference and Its Applications, John Wiley & Sons, New York, 1965, p. 258.
[4]. D. E. Farrar and R. R. Glauber, "Multicollinearity in Regression Analysis: The Problem Revisited," Review of Economics and Statistics, vol. 49, 1967, pp. 92–107.
[5]. Douglas Montgomery and Elizabeth Peck, Introduction to Linear Regression Analysis, John Wiley & Sons, New York, 1982, pp. 289–290.
[6]. H. Glejser, "A New Test for Heteroscedasticity," Journal of the American Statistical Association, vol. 64, 1969, pp. 316–323.
[7]. R. Carter Hill and Lee C. Adkins, "Collinearity," in Badi H. Baltagi (ed.), A Companion to Theoretical Econometrics, Blackwell, 2001, pp. 256–278. doi:10.1002/9780470996249.ch13. ISBN 0-631-21254-X.
[8]. J. T. Webster, "Regression Analysis and Problems of Multicollinearity," Communications in Statistics A, vol. 4, no. 3, 1975, pp. 277–292.
[9]. John Johnston, Econometric Methods, 2nd ed., McGraw-Hill, New York, 1972, pp. 159–168.
[10]. Jan Kmenta, Elements of Econometrics, 2nd ed., Macmillan, New York, 1986, pp. 430–442. ISBN 0-02-365070-2.
[11]. G. S. Maddala and Kajal Lahiri, Introduction to Econometrics, 4th ed., Wiley, Chichester, 2009, pp. 279–312. ISBN 978-0-470-01512-4.
[12]. R. Koenker, "A Note on Studentizing a Test for Heteroscedasticity," Journal of Econometrics, vol. 17, 1981, pp. 107–112.
[13]. R. F. Gunst and R. L. Mason, "Advantages of Examining Multicollinearities in Regression Analysis," Biometrics, vol. 33, 1977, pp. 249–260.
[14]. T. Breusch and A. Pagan, "A Simple Test for Heteroscedasticity and Random Coefficient Variation," Econometrica, vol. 47, 1979, pp. 1287–1294.