Regression Analysis in Excel

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 20

Regression analysis in Excel -

the basics
In statistical modeling, regression analysis is used to estimate
the relationships between two or more variables:

Dependent variable (aka criterion variable) is the main factor


you are trying to understand and predict.

Independent variables (aka explanatory variables,
or predictors) are the factors that might influence the dependent
variable.

Regression analysis helps you understand how the dependent


variable changes when one of the independent variables varies
and allows to mathematically determine which of those variables
really has an impact.

Technically, a regression analysis model is based on the sum of


squares, which is a mathematical way to find the dispersion of
data points.  The goal of a model is to get the smallest possible
sum of squares and draw a line that comes closest to the data.

In statistics, they differentiate between a simple and multiple


linear regression. Simple linear regression models the
relationship between a dependent variable and one independent
variables using a linear function. If you use two or more
explanatory variables to predict the independent variable, you
deal with multiple linear regression. If the dependent variables
are modeled as a non-linear function because the data
relationships do not follow a straight line, use nonlinear
regression instead. The focus of this tutorial will be on a simple
linear regression.

As an example, let's take sales numbers for umbrellas for the


last 24 months and find out the average monthly rainfall for the
same period. Plot this information on a chart, and the regression
line will demonstrate the relationship between the independent
variable (rainfall) and dependent variable (umbrella sales):

Linear regression equation


Mathematically, a linear regression is defined by this equation:

y = bx + a + ε

Where:

 x is an independent variable.


 y is a dependent variable.
 a is the Y-intercept, which is the expected mean value of y when
all x variables are equal to 0. On a regression graph, it's the
point where the line crosses the Y axis.
 b is the slope of a regression line, which is the rate of change
for y as x changes.
 ε is the random error term, which is the difference between the
actual value of a dependent variable and its predicted value.

The linear regression equation always has an error term


because, in real life, predictors are never perfectly precise.
However, some programs, including Excel, do the error term
calculation behind the scenes. So, in Excel, you do linear
regression using the least squares method and seek
coefficients a and b such that:
y = bx + a

For our example, the linear regression equation takes the


following shape:

Umbrellas sold = b* rainfall + a

There exist a handful of different ways to find a and b. The three


main methods to perform linear regression analysis in Excel are:

 Regression tool included with Analysis ToolPak


 Scatter chart with a trendline
 Linear regression formula

Below you will find the detailed instructions on using each


method.

How to do linear regression in


Excel with Analysis ToolPak
This example shows how to run regression in Excel by using a
special tool included with the Analysis ToolPak add-in.

Enable the Analysis ToolPak add-in


Analysis ToolPak is available in all versions of Excel 2019 to
2003 but is not enabled by default. So, you need to turn it on
manually. Here's how:

. In your Excel, click File > Options.


. In the Excel Options dialog box, select Add-ins on the left
sidebar, make sure Excel Add-ins is selected in
the Manage box, and click Go.

. In the Add-ins dialog box, tick off Analysis Toolpak, and


click OK:
This will add the Data Analysis tools to the Data tab of your
Excel ribbon.

Run regression analysis


In this example, we are going to do a simple linear regression in
Excel. What we have is a list of average monthly rainfall for the
last 24 months in column B, which is our independent variable
(predictor), and the number of umbrellas sold in column C, which
is the dependent variable. Of course, there are many other
factors that can affect sales, but for now we focus only on these
two variables:

With Analysis Toolpak added enabled, carry out these steps to


perform regression analysis in Excel:

. On the Data tab, in the Analysis group, click the Data


Analysis button.
. Select Regression and click OK.

. In the Regression dialog box, configure the following settings:


 Select the Input Y Range, which is your dependent variable. In
our case, it's umbrella sales (C1:C25).
 Select the Input X Range, i.e. your independent variable. In
this example, it's the average monthly rainfall (B1:B25).

If you are building a multiple regression model, select two or


more adjacent columns with different independent variables.

 Check the Labels box if there are headers at the top of your X


and Y ranges.
 Choose your preferred Output option, a new worksheet in our
case.
 Optionally, select the Residuals checkbox to get the difference
between the predicted and actual values.

. Click OK and observe the regression analysis output created


by Excel.

Interpret regression analysis output


As you have just seen, running regression in Excel is easy
because all calculations are preformed automatically. The
interpretation of the results is a bit trickier because you need to
know what is behind each number. Below you will find a
breakdown of 4 major parts of the regression analysis output.

Regression analysis output: Summary Output


This part tells you how well the calculated linear regression
equation fits your source data.

Here's what each piece of information means:

Multiple R. It is the Correlation Coefficient that measures the


strength of a linear relationship between two variables. The
correlation coefficient can be any value between -1 and 1, and
its absolute value indicates the relationship strength. The larger
the absolute value, the stronger the relationship:

 1 means a strong positive relationship


 -1 means a strong negative relationship
 0 means no relationship at all

R Square. It is the Coefficient of Determination, which is used


as an indicator of the goodness of fit. It shows how many points
fall on the regression line. The R2 value is calculated from the
total sum of squares, more precisely, it is the sum of the squared
deviations of the original data from the mean.

In our example, R2 is 0.91 (rounded to 2 digits), which is fairy


good. It means that 91% of our values fit the regression analysis
model. In other words, 91% of the dependent variables (y-
values) are explained by the independent variables (x-values).
Generally, R Squared of 95% or more is considered a good fit.

Adjusted R Square. It is the R square adjusted for the number


of independent variable in the model. You will want to use this
value instead of R square for multiple regression analysis.

Standard Error. It is another goodness-of-fit measure that


shows the precision of your regression analysis - the smaller the
number, the more certain you can be about your regression
equation. While R2 represents the percentage of the dependent
variables variance that is explained by the model, Standard
Error is an absolute measure that shows the average distance
that the data points fall from the regression line.

Observations. It is simply the number of observations in


your model.

Regression analysis output: ANOVA


The second part of the output is Analysis of Variance (ANOVA):

Basically, it splits the sum of squares into individual components


that give information about the levels of variability within your
regression model:

 df is the number of the degrees of freedom associated with the


sources of variance.
 SS is the sum of squares. The smaller the Residual SS
compared with the Total SS, the better your model fits the data.
 MS is the mean square.
 F is the F statistic, or F-test for the null hypothesis. It is used to
test the overall significance of the model.
 Significance F is the P-value of F.

The ANOVA part is rarely used for a simple linear regression


analysis in Excel, but you should definitely have a close look at
the last component. The Significance F value gives an idea of
how reliable (statistically significant) your results are. If
Significance F is less than 0.05 (5%), your model is OK. If it is
greater than 0.05, you'd probably better choose another
independent variable.

Regression analysis output: coefficients


This section provides specific information about the components
of your analysis:

The most useful component in this section is Coefficients. It


enables you to build a linear regression equation in Excel:

y = bx + a

For our data set, where y is the number of umbrellas sold and x
is an average monthly rainfall, our linear regression formula
goes as follows:

Y = Rainfall Coefficient * x + Intercept

Equipped with a and b values rounded to three decimal places, it


turns into:

Y=0.45*x-19.074

For example, with the average monthly rainfall equal to 82 mm,


the umbrella sales would be approximately 17.8:

0.45*82-19.074=17.8

In a similar manner, you can find out how many umbrellas are
going to be sold with any other monthly rainfall (x variable) you
specify.

Regression analysis output: residuals


If you compare the estimated and actual number of sold
umbrellas corresponding to the monthly rainfall of 82 mm, you
will see that these numbers are slightly different:

 Estimated: 17.8 (calculated above)


 Actual: 15 (row 2 of the source data)

Why's the difference? Because independent variables are never


perfect predictors of the dependent variables. And the residuals
can help you understand how far away the actual values are
from the predicted values:

For the first data point (rainfall of 82 mm), the residual is


approximately -2.8.  So, we add this number to the predicted
value, and get the actual value: 17.8 - 2.8 = 15.

How to make a linear


regression graph in Excel
If you need to quickly visualize the relationship between the two
variables, draw a linear regression chart. That's very easy!
Here's how:

. Select the two columns with your data, including headers.


. On the Inset tab, in the Chats group, click the Scatter
chart icon, and select the Scatter thumbnail (the first one):

This will insert a scatter plot in your worksheet, which will


resemble this one:

.
. Now, we need to draw the least squares regression line. To
have it done, right click on any point and choose Add
Trendline… from the context menu.

. On the right pane, select the Linear trendline shape and,


optionally, check Display Equation on Chart to get your
regression formula:

As you may notice, the regression equation Excel has created


for us is the same as the linear regression formula we built
based on the Coefficients output.
.
. Switch to the Fill & Line tab and customize the line to your
liking. For example, you can choose a different line color and
use a solid line instead of a dashed line (select Solid line in
the Dash type box):
At this point, your chart already looks like a decent regression
graph:

Still, you may want to make a few more improvements:

 Drag the equation wherever you see fit.


 Add axes titles (Chart Elements button > Axis Titles).
 If your data points start in the middle of the horizontal and/or
vertical axis like in this example, you may want to get rid of the
excessive white space. The following tip explains how to do
this: Scale the chart axes to reduce white space.

And this is how our improved regression graph looks like:


Important note! In the regression graph, the independent
variable should always be on the X axis and the dependent
variable on the Y axis. If your graph is plotted in the reverse
order, swap the columns in your worksheet, and then draw the
chart anew. If you are not allowed to rearrange the source data,
then you can switch the X and Y axes directly in a chart.

How to do regression in Excel


using formulas

Microsoft Excel has a few statistical functions that can help you
to do linear regression analysis such as LINEST, SLOPE,
INTERCPET, and CORREL.

The LINEST function uses the least squares regression method


to calculate a straight line that best explains the relationship
between your variables and returns an array describing that line.
You can find the detailed explanation of the function's syntax
in this tutorial. For now, let's just make a formula for our sample
dataset:

=LINEST(C2:C25, B2:B25)

Because the LINEST function returns an array of values, you


must enter it as an array formula. Select two adjacent cells in the
same row, E2:F2 in our case, type the formula, and press Ctrl +
Shift + Enter to complete it.

The formula returns the b coefficient (E1) and the a constant


(F1) for the already familiar linear regression equation:

y = bx + a

If you avoid using array formulas in your worksheets, you can


calculate a and b individually with regular formulas:

Get the Y-intercept (a):


=INTERCEPT(C2:C25, B2:B25)

Get the slope (b):


=SLOPE(C2:C25, B2:B25)

Additionally, you can find the correlation coefficient (Multiple


R in the regression analysis summary output) that indicates how
strongly the two variables are related to each other:

=CORREL(B2:B25,C2:C25)

The following screenshot shows all these Excel regression


formulas in action:

Tip. If you'd like to get additional statistics for your regression


analysis, use the LINEST function with the stats parameter set
to TRUE as shown in this example.

To have a closer look at our linear regression formulas and other


techniques discussed in this tutorial, you are welcome to
download our sample Regression Analysis in Excel workbook.

That's how you do linear regression in Excel. That said, please


keep in mind that Microsoft Excel is not a statistical program. If
you need to perform regression analysis at the professional
level, you may want to use targeted software such
as XLSTAT, RegressIt, etc.

You may also be interested in:

 How to use Solver in Excel with examples

You might also like