Lecture 6 - Multiple Regression Analysis

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 32

DATA ANALYSIS, TOOLS AND

APPLICATION (QHO430)

Lecture 6 – Multiple Regression Analysis


Multiple Regression Analysis
• So far we have considered Simple Linear Regression which defines the relationship
between the dependent (y) & independent variables (x) by means of a linear
equation.
Simple linear regression implies there is only one independent variable

• We will now consider multiple regression.


Multiple regression implies there is more than one independent variable.
Multiple Regression Analysis
 So far, we have considered questions such as: Is there a relationship between the
amount spent on advertising and the sales achieved at a particular retail outlet?
 Here have used only one variable (amount spent on advertising) to try to predict the
number of sales.

In reality, it is possible that there are number of contributing factors, for example:
 does the size of the store have an influence?
 is the area of the region served significant?
 are the sales related to the local population size?
 does the number of years of experience of the sales staff
have an effect?
 does the gender of the sales rep make a difference?
etc . . .
Multiple Regression Analysis
 We could look at each of these relationships separately to see if they are significant.

 But . . .
Say we found a strong relationship between sales and store
size and also between sales and local population.

We know that store size and local population are likely to be


associated, so is the strong relationship with sales just due to
store size, or just due to population, or really due to a
combination of both?

 We need to extend our model. . .


Extending the Regression Equation

Previously, we related the dependent variable (eg sales, “y”) to one independent variable (eg
advertising spend, “x”) :

y = a + bx
Eg Sales = a + b * (advertising spend)

 We still have one intercept (a)


 And one individual slope or gradient (b)
Extending the Regression Equation
Now we extend this to include more than one independent variable:

y = a + b1x1 + b2x2 + . . . .
Eg Sales = a + b1 * (advertising spend) + b2 * (Population)+ . . . .

Now we have two (or more) independent x variables:


x1 (advertising spend) and
x2 (population)

 Each has its own individual slope (b1 and b2 . . .)


 We still have one intercept (a)
Multiple Regression Analysis
Excel’s Data Analysis add-in will allow us to fit this type of model.
 Select ‘Regression’
 Click in ‘y input’ box → select y values (including titles)
 Click in ‘x input’ box → select ALL x columns (including titles)
 Tick ‘Labels’ check box
 Tick the ‘output range check box’ → click in the box next to it
 → select a cell in an ‘empty region’ of your spread sheet as a
destination for the answers!
 Click OK
Multiple Regression Analysis
 We then first need to consider whether all of the independent
variables under consideration have a real impact.
– We can look at the “P-value” column of the Data Analysis add-in output
– If the p-value is less than 0.05 (5%) then we say the variable does have an impact (the
chances of being wrong in saying this are less than 5% which is a small enough risk)

 Once we know which of the variables have a significant impact, we can construct a
regression equation.
What is the P-value?
 A p-value is a measure of the probability that an observed difference could
have occurred just by random chance. The lower the p-value, the greater the
statistical significance of the observed difference.

 When you perform a statistical test a p-value helps you determine the significance
of your results in relation to the null hypothesis.

 The null hypothesis states that there is no relationship between the two variables
being studied (one variable does not affect the other). It states the results are due
to chance and are not significant in terms of supporting the idea being
investigated. Thus, the null hypothesis assumes that whatever you are trying to
prove did not happen.
What is the P-value?

 The smaller the p-value, the stronger the evidence that you should reject the null
hypothesis. A p-value less than 0.05 (typically ≤ 0.05) is statistically significant. It
indicates strong evidence against the null hypothesis, as there is less than a
5% probability the null is correct (and the results are random).
Multiple Regression Analysis
Example:
A toy manufacturer currently sells toys to retail outlets in six different areas. They are looking
to expand and their business and sales departments want to predict the probable number of
sales in a new area. They consider a variety of possible contributing factors.
Current Advertising Population M/F Sales
Areas Spend (£000) (Thousands) Sales Rep (£000)
A 1 220 M 110
B 5 690 F 295
C 8 810 M 405
D 6 430 M 220
E 3 105 F 105
F 10 595 F 390
Multiple Regression Analysis
 Before we can use Excel’s Data Analysis add-in, we need all
data to be quantitative (numerical).
Obviously, we also need quantitative data for our regression
equation.

 How do we cope with gender (Male/Female)?


If we call Male “0” and Female “1”, we can use these zeros
and ones in a regression equation as before.

 We can now process our data table and interpret the results.
Multiple Regression Analysis
Re-coding:
How do we cope with gender (Male/Female)?
If we call Male “0” and Female “1”, we can use these zeros
and ones in a regression equation as before.

Current Advertising Population M/F Sales


Areas Spend (£000) (Thousands) Sales Rep (£000)
A 1 220 0 110
B 5 690 1 295
C 8 810 0 405
D 6 430 0 220
E 3 105 1 105
F 10 595 1 390
Multiple Regression Analysis
We enter our
data into Excel’s
Data Analysis
add-in as shown
below
Multiple Regression Analysis
SUMMARY OUTPUT
This will give the
Regression Statistics following summary
Multiple R 0.991114
table. We look at the
R Square 0.982306
Adjusted R Square 0.955766 p-value column to
Standard Error 27.75423 see which of the
Observations 6
variables have a
ANOVA significant impact.
  df SS MS F Significance F
Regression 3 85530.24 28510.08 37.011796 0.026423
Residual 2 1540.594 770.2971
Total 5 87070.83      Which variable has the
Standard highest p value? Is it
  Coefficients Error t Stat P-value Lower 95% Upper 95%
greater than 0.05?
Intercept 12.46704 27.51523 0.453096 0.6948897 -105.921 130.8555
Advertising 19.50599 5.908842 3.301152 0.0807979 -5.91771 44.92969
Population 0.277403 0.06949 3.992014 0.0574007 -0.02159 0.576393
M/F 5.300092 23.78264 0.222855 0.8443382 -97.0283 107.6285
Multiple Regression Analysis
This will give the following
summary table. We look at
the p-value column to see
which of the variables have
a significant impact.

Which variable has the


highest p value? Is it
greater than 0.05?
We notice that the gender of
the sales rep has the highest p-
value (0.844) which is much
greater than 5% (0.05) indicating
that it has no effect on sales
figures.
Multiple Regression Analysis
SUMMARY OUTPUT We remove the gender variable (by
Regression Statistics amending the Input X Range) and re-
Multiple R 0.990892
R Square 0.981867
run Excel’s Data Analysis add-in. We
Adjusted R Square 0.969778 now see slightly different results
Standard Error 22.94087
Observations 6
ANOVA
Significance
  df SS MS F F
Regression 2 85491.98 42745.99 81.22236 0.002442
Residual 3 1578.851 526.2836 The p-values for both ‘Advertising’
Total 5 87070.83     
and ‘Population’ are now well within
  Coefficients
Standard
Error t Stat P-value Lower 95% Upper 95%
5% and so have an impact on sales.
Intercept 14.83123 20.98493 0.706757 0.530667 -51.9522 81.61463 We now use the coefficient values to
Advertising 19.90127 4.65884 4.271723 0.02355 5.074766 34.72778
Population 0.273428 0.055514 4.92543 0.016039 0.096759 0.450097 produce the regression equation
Multiple Regression Analysis
Recall that the regression equation has the form:
y = a + b1x1 + b2x2 + . . . .
Sales = a + b1 * (advertising spend) + b2 * (Population)

We use the coefficient values from Excel’s Summary Output

• The intercept (a) is 14.83


• The ‘slope’ value ( b1) for Advertising is 19.9
• The ‘slope’ value ( b2 ) for Population is 0.27

So the regression equation is:


y = 14.83 + 19.9x1 + 0.27x2 + . . . .
Prediction Using Multiple Regression Analysis
Example:
Predict the sales return for an area with a population of 730 000 given an advertising budget
of £7000.

Answer
Sales (£000) = 14.83 + 19.9 × advertising (£000) + 0.27 × Population(000)

Sales = 14.83 + (19.9 × 7) + (0.27 × 730) = £351.23 thousand

= £351 230
Prediction Using Multiple Regression Analysis
Exercise:
Predict the sales return for an area with a population of 480 000 given an advertising budget
of £4000.

Answer
Sales (£000) = 14.83 + 19.9 × advertising (£000) + 0.27 × Population(000)

Sales =
Prediction Using Multiple Regression Analysis
Exercise:
Predict the sales return for an area with a population of 480 000 given an advertising budget
of £4000.

Answer
Sales (£000) = 14.83 + 19.9 × advertising (£000) + 0.27 × Population(000)

Sales = 14.83 + (19.9 × 4) + (0.27 × 480) = £224.03 thousand

£224,030
=
Summary
– Excel’s Data Analysis add-in tool performs regression
– Quality of fit of line to the data points can be judged by correlation (r) and coefficient of
determination (R2)
– Excel has CORREL and RSQ functions
– Accuracy of prediction is reflected by standard error (s.e.) in output
– Approximate 95% confidence intervals for predictions can be constructed if assumptions
can be made
– We can extend the regression model to include the effect of more than one significant
independent variables.
Relationship between Qualifications and Income

[taken from 
https://www.qs.com/what-effect-does-education-level-have-on-wealth/#:~:text=There's%20a%20clear
%20correlation%20between,the%20greater%20your%20salary%20becomes
.]
Response and Explanatory Variables
Response variable (Dependent variable)
• The outcome variable on which comparisons are made.

Explanatory variable (Independent variable)


• When the explanatory variable is categorical, it defines the groups to be
compared with respect to the response variable.
• When the explanatory variable is quantitative, it defines the change in different
numerical values to be compared with respect to the values for the response
variable.

Example: Response/Explanatory
• Level of Carbon dioxide and Amount of gasoline use for cars

• College G P A/Number of hours a week spent studying


For your assignment, you need to select your
own dataset and apply what you have learned

About correlation and regression


Where do we get Social and Economic Data from?
Login in to the Central Data Service

“The UK Data Service is funded by the Economic and Social Research Council (ESRC) to
meet the data needs of researchers, students and teachers from all sectors, including
academia, central and local government, charities and foundations, independent research
centres, think tanks, and business consultants and the commercial sector.”
•https://census.ukdataservice.ac.uk/
This website hosts a lot of the data for England on UK data

Register and Create an account using your Solent University credentials


•https://www.ukdataservice.ac.uk/get-data/how-to-access/registration
Boundary Data
https://census.ukdataservice.ac.uk/get-data/boundary-data.aspx
Boundary data – we will discuss what Boundary data is and the importance of it in data
analysis of the UK
How to … https://www.youtube.com/watch?v=AY9hcpMzaQ4

UK Boundary Data

What is a Boundary?
Recommended Reading
You have a rich list of resources to explore:
• https://learn.solent.ac.uk/course/view.php?id=42897&section=2

Relationship between Qualifications and Income


• https://www.qs.com/what-effect-does-education-level-have-on-wealth/#:~:text=There's%20a%20clear%20correlation
%20between,the%20greater%20your%20salary%20becomes
•Correlation Definitions, Examples & Interpretation
Census Dataset
• https://census.ukdataservice.ac.uk/
• https://census.ukdataservice.ac.uk/get-data/boundary-data.aspx
• https://www.youtube.com/watch?v=AY9hcpMzaQ4
Pearson Product-Moment Correlation:
• Statistics article
• Linkedin Statistics Regression
• Excel HELP in Pearson Correlation
Recommended Reading (Cont.)
Take time to go through the short videos on this site.

UK Data Services – Videos https://www.ukdataservice.ac.uk/use-data/tutorials.aspx

Look at the Key data area, take time to understand the different types of data available.

Do you understand what Longitudinal data is?


What is administrative data?
Suggested other data providers
Questions?

You might also like