The document discusses simple linear regression analysis of sales data from 200 markets. Advertising budgets for TV, radio, and newspaper are used as predictors of sales. Separate regression models were fitted for each predictor. TV had the strongest relationship with sales (R² = 0.62), followed by radio (R² = 0.33) and newspaper (R² = 0.05).
• The Advertising data set consists of the sales (in thousands of units) of a particular product in 200 different markets.
• It also contains the advertising budgets (in thousands of dollars) for the product in each of the markets for three different media: TV, radio, and newspaper.
• Our objective is to check the association between advertising budgets and sales.

Plot of Advertising Data Set

Notation
• Output variable / response variable / dependent variable: sales (Y).
• Input variables / predictors / independent variables / features: TV budget (X₁), radio budget (X₂), newspaper budget (X₃).

Important Questions for an Effective Market Plan
1. Is there any relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict future sales?

Simple Linear Regression (SLR)
• Y: response; X: predictor.
• In SLR, we assume that there is approximately a linear relationship between X and Y.
• Thus the SLR model is given by
  Y = β₀ + β₁X + ε,    (1)
  where ε is a mean-zero random error term.
• For example, if X represents the TV advertising budget and Y represents sales, then the SLR model is
  sales = β₀ + β₁ × TV + ε.

Prediction
• In Equation (1), β₀ and β₁ are two unknown constants that represent the intercept and slope terms in the linear model.
• Together they are called the model coefficients or parameters.
• Let β̂₀ and β̂₁ denote the estimates of β₀ and β₁ based on the training data.
• Now, given a TV advertising budget, we can predict future sales by computing
  ŷ = β̂₀ + β̂₁ × TV,
  where ŷ denotes the predicted sales.

How to Get the Estimates of the Coefficients?
• In the Advertising data set, we have data on the TV advertising budget and product sales in n = 200 different markets.
• Our job is to find the coefficients in such a way that the resulting line is as close as possible to the n = 200 data points.
• How should we measure closeness?
• The most common approach involves minimizing the least squares criterion.
• Data: (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ).
• Let ŷᵢ = β̂₀ + β̂₁xᵢ be the prediction for Y based on the i-th value of X.
• Residual: eᵢ = yᵢ − ŷᵢ.
• We define the residual sum of squares (RSS) as
  RSS = e₁² + e₂² + ⋯ + eₙ².
• We minimize RSS to get the estimates of the coefficients.
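To make the least squares step concrete, here is a minimal sketch in Python of the closed-form estimates β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂₀ = ȳ − β̂₁x̄, applied to the TV budget. The file name Advertising.csv and the column names TV and sales are assumptions about how the data are stored, not part of the slides.

```python
import numpy as np
import pandas as pd

# Assumed file and column names for the Advertising data.
ads = pd.read_csv("Advertising.csv")
x = ads["TV"].to_numpy()      # TV advertising budget (thousands of dollars)
y = ads["sales"].to_numpy()   # sales (thousands of units)

# Closed-form least-squares estimates for simple linear regression.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values, residuals, and the residual sum of squares (RSS).
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat
rss = np.sum(residuals ** 2)

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}, RSS = {rss:.2f}")

# Prediction for a hypothetical TV budget of $100,000 (i.e., TV = 100).
print("predicted sales:", beta0_hat + beta1_hat * 100)
```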
Summary of Regression Analysis of Sales on TV

            Coefficient   Std. error   t-statistic   p-value
Intercept      7.0325       0.4578        15.36      <0.0001
TV             0.0475       0.0027        17.67      <0.0001
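A summary table like this is normally produced by regression software rather than by hand. As a sketch, assuming the same Advertising.csv file and column names as above, the statsmodels library in Python reports the coefficient estimates, standard errors, t-statistics, and p-values directly; its output should resemble the table above.

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names for the Advertising data.
ads = pd.read_csv("Advertising.csv")

X = sm.add_constant(ads["TV"])         # add an intercept column to the design matrix
fit = sm.OLS(ads["sales"], X).fit()    # ordinary least squares fit of sales on TV

# The summary includes coefficient estimates, standard errors,
# t-statistics, and p-values, analogous to the table above.
print(fit.summary())
```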
Interpretations
• The slope estimate is 0.0475 and sales are recorded in thousands of units, so an increase of $1,000 in the TV advertising budget is associated with an increase in average sales of about 0.0475 × 1,000 ≈ 48 units.
• The accuracy of the estimated coefficients can be measured from their respective standard errors.
• Standard errors can be used to construct confidence intervals.
• A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter.
• For linear regression, the 95% confidence interval for β₁ approximately takes the form β̂₁ ± 2·SE(β̂₁).
• That is, there is approximately a 95% chance that the interval [β̂₁ − 2·SE(β̂₁), β̂₁ + 2·SE(β̂₁)] will contain the true value of β₁.
• Similarly, the 95% confidence interval for β₀ approximately takes the form β̂₀ ± 2·SE(β̂₀).
• In the case of the advertising data, the 95% confidence interval for β₀ is [6.130, 7.935] and the 95% confidence interval for β₁ is [0.042, 0.053].
• Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,940 units.
• Furthermore, for each $1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.
• Standard errors can also be used to perform hypothesis tests on the coefficients.
• The most common hypothesis test involves testing the null hypothesis
  H₀: There is no relationship between X and Y
  against the alternative
  H₁: There is some relationship between X and Y.
• Mathematically, this corresponds to testing H₀: β₁ = 0 vs. H₁: β₁ ≠ 0.
• The test statistic for this hypothesis is
  t = (β̂₁ − 0) / SE(β̂₁).
• Under H₀, this test statistic follows a t-distribution with n − 2 degrees of freedom.
• We reject the null hypothesis, i.e., we declare that some relationship exists between X and Y, if the p-value is small enough (a computational sketch follows the RSE definition below).
• Notice that the estimated coefficients are very large relative to their standard errors, so the t-statistics are also large.
• The probabilities of seeing such values if H₀ were true are virtually zero. Hence we can conclude that β₀ ≠ 0 and β₁ ≠ 0.
• This clearly suggests that there is a significant relationship between TV and sales.

Assessing the Accuracy of the Model
• Once the null hypothesis is rejected in favour of the alternative, it is natural to quantify the extent to which the model fits the data.
• The quality of a linear regression fit is assessed using two related quantities: the residual standard error (RSE) and the R² statistic.

Residual Standard Error (RSE)
• The RSE is an estimate of the standard deviation of ε.
• Roughly speaking, it is the average amount by which the response will deviate from the true regression line.
• It is computed using the formula
  RSE = √(RSS / (n − 2)) = √( (1/(n − 2)) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² ).
• The RSE is considered a measure of the lack of fit.
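As referenced above, the confidence interval, t-test, and RSE calculations can be reproduced directly from the residuals. Below is a minimal Python sketch, again assuming the data sit in Advertising.csv with columns TV and sales; the factor 2 is the usual rough stand-in for the exact t critical value.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Assumed file and column names for the Advertising data.
ads = pd.read_csv("Advertising.csv")
x = ads["TV"].to_numpy()
y = ads["sales"].to_numpy()
n = len(y)

# Least-squares fit (same closed form as in the earlier sketch).
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
residuals = y - (beta0_hat + beta1_hat * x)

# Residual standard error: RSE = sqrt(RSS / (n - 2)).
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the slope: SE(beta1_hat) = RSE / sqrt(sum((x - x_bar)^2)).
se_beta1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

# Approximate 95% confidence interval: beta1_hat +/- 2 * SE(beta1_hat).
ci = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)

# t-statistic and two-sided p-value for H0: beta1 = 0 (t-distribution, n - 2 df).
t_stat = beta1_hat / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"RSE = {rse:.2f}")
print(f"approx. 95% CI for beta1 = [{ci[0]:.3f}, {ci[1]:.3f}]")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```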
Residual Standard Error (RSE)
• In the case of the advertising data, the RSE is 3.26.
• This means that the actual sales in each market deviate from the true regression line by approximately 3,260 units, on average.
• Even if the model were correct and the true values of the unknown coefficients β₀ and β₁ were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average.
• The next question is whether or not 3,260 units is an acceptable prediction error.
• In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, so the percentage error is 3,260/14,000 ≈ 23%.

R² Statistic
• The RSE provides an absolute measure of the lack of fit of the model to the data.
• Since it is measured in the units of Y, it is not always clear what constitutes a good RSE.
• The R² statistic provides an alternative measure of fit.
• It takes the form of a proportion (the proportion of variance explained), so it always takes on a value between 0 and 1 and is independent of the scale of Y.
• To calculate R², we use the formula
  R² = (TSS − RSS) / TSS = 1 − RSS/TSS,
  where TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)² is the total sum of squares.
• TSS measures the amount of variability inherent in the response before the regression is performed.
• In contrast, RSS measures the amount of variability that is left unexplained after performing the regression.
• Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and R² measures the proportion of variability in Y that can be explained using X.
• An R² statistic close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.
• A value near 0 indicates that the regression did not explain much of the variability in the response. This might occur because the linear model is wrong, or because the inherent error σ² is high, or both.
• In the advertising data set, the R² was 0.61, so just under two-thirds of the variability in sales is explained by a linear regression on TV alone.
• The R² statistic has an interpretational advantage over the RSE: unlike the RSE, it always lies between 0 and 1.
• However, it can still be challenging to determine what a good R² value is, and in general this will depend on the application.
• In the simple linear regression setting, R² = r², where r is the sample correlation between X and Y (a sketch verifying this appears after the radio summary below).
• Thus R² can also serve as a measure of the linear relationship between X and Y.

Summary of Regression Analysis of Sales on Radio

            Coefficient   Std. error   t-statistic   p-value
Intercept      9.312        0.563         16.54      <0.0001
Radio          0.203        0.020          9.92      <0.0001
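As a quick check of the R² = r² identity noted above, here is a minimal sketch that computes R² for the regression of sales on the radio budget in two ways: as 1 − RSS/TSS and as the squared sample correlation. The file name Advertising.csv and the column names radio and sales are assumptions.

```python
import numpy as np
import pandas as pd

# Assumed file and column names for the Advertising data.
ads = pd.read_csv("Advertising.csv")
x = ads["radio"].to_numpy()   # radio budget used as the example predictor
y = ads["sales"].to_numpy()

# Simple linear regression fit (closed form, as before).
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

# R^2 = 1 - RSS / TSS.
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

# In simple linear regression, R^2 equals the squared sample correlation r^2.
r = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")  # the two values should agree
```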
Summary of Regression Analysis of Sales on Newspaper

            Coefficient   Std. error   t-statistic   p-value
Intercept     12.351        0.621         19.88      <0.0001
Newspaper      0.055        0.017          3.30       0.0012
Summary of Three Simple Linear Regression Models

Model   Predictor    R²     RSE
1       TV           0.62   3.26
2       Radio        0.33   4.28
3       Newspaper    0.05   5.09
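Finally, a table like the one above can be reproduced, up to rounding, by fitting each simple regression in turn and computing its R² and RSE. A minimal sketch under the same assumed file and column names:

```python
import numpy as np
import pandas as pd

# Assumed file and column names for the Advertising data.
ads = pd.read_csv("Advertising.csv")
y = ads["sales"].to_numpy()
n = len(y)

print(f"{'Predictor':<10} {'R^2':>6} {'RSE':>6}")
for predictor in ["TV", "radio", "newspaper"]:
    x = ads[predictor].to_numpy()

    # Simple linear regression of sales on this predictor (closed form).
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()
    residuals = y - (beta0_hat + beta1_hat * x)

    rss = np.sum(residuals ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r_squared = 1 - rss / tss          # proportion of variance explained
    rse = np.sqrt(rss / (n - 2))       # residual standard error

    print(f"{predictor:<10} {r_squared:6.2f} {rse:6.2f}")
```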