Unit 4 Tutorials: Correlation and Causation in Context
INSIDE UNIT 4
Correlation
Scatterplot
Describing Scatterplots
Explanatory and Response Variables
Correlation
Interpreting Correlation and Causation
Positive and Negative Correlations

Coefficient of Determination/r^2
Outliers and Influential Points
Cautions about Correlation
Correlation and Causation
Establishing Causality
Line of Best Fit
Best-Fit Line and Regression Line

Linear Equation Algebra Review
Residuals
Least-Squares Line
Finding the Least-Squares Line
Interpreting Intercept and Slope
Multiple Regression
Predictions from Best-Fit Lines
Scatterplot
by Sophia
 WHAT'S COVERED
© 2023 SOPHIA Learning, LLC. SOPHIA is a registered trademark of SOPHIA Learning, LLC.
This tutorial will discuss the topic of scatterplots. Our discussion breaks down as follows:
1. Scatterplots
2. Multiple Data Sets
1. Scatterplots
A scatterplot shows two quantitative attributes at a time for each individual in a data set. Until now,
you have used displays such as dot plots, which show a single quantitative attribute by stacking dots at
each value.
A scatterplot lets you see not only how the values compare on one attribute but also how they relate to
a second attribute.
 EXAMPLE You might put the two variables cigarette consumption and cancer death in a
scatterplot. Perhaps certain states or countries have low cigarette consumption and maybe,
correspondingly, low cancer deaths. Each dot would correspond to one single state or one single
country.
 EXAMPLE If you were going with a sports team, maybe you'd want to know if spending a lot of
money on your team payroll causes them to win more. Each dot, in that case, would correspond to a
single team.
IN CONTEXT
These are the 1992 payrolls for National Football League teams: the salary of each team's quarterback,
who is usually the most expensive player, and the payroll for the entire team. The values are in
thousands of dollars.
Team          QB Salary   Total Payroll
49ers               900          17,256
Bears             3,000          23,074
Bengals           1,050          20,666
Bills               650          24,249
Broncos             500          21,992
Browns              967          19,413
Buccaneers          675          19,545
Cardinals         1,450          20,397
Chargers          1,200          18,698
Chiefs            1,100          25,859
Colts             2,000          22,022
Cowboys           1,750          28,349
Dolphins          1,400          23,728
Eagles              425          19,325
Falcons           2,250          25,642
Giants            1,600          23,258
Jets                800          19,063
Lions             1,525          24,644
Oilers            1,700          21,399
Packers           1,500          23,245
Patriots          2,250          23,294
Raiders           1,300          20,390
Rams              1,500          24,378
Redskins          1,450          20,780
Saints            1,200          23,695
Seahawks          1,250          25,348
Steelers          3,500          30,131
Vikings           1,250          23,246
Next, let's put this on a scatterplot. The variable that goes on the x-axis, or the horizontal axis,
should be the one that you think helps to explain the other variable. Here, it is most likely the
quarterback salary that contributes to a high or low team payroll.
Start with the first team, the 49ers. Find $900,000 for the quarterback on the x-axis and about $17.3
million for the team payroll on the y-axis, and put a dot there. That's one of the many dots that we're
going to end up with.
The next team, the Bears, had a quarterback salary of $3 million and a total payroll of about $23
million. As you continue with the rest of the teams, you’re going to end up with one dot for each
team. The final version looks like this:
It seems that as the quarterback salary increases, as it moves to the right, the total payroll tends to
increase as well.
 HINT
You can also see this using technology. If you want to use Excel, all you have to do is enter the data,
select the area that you want, and pick the correct graph of scatterplot.
You may need to add labels to the axes and sometimes there's a bit of extraneous stuff that you can get
rid of. Overall, though, you can see that same set of data.
 TERM TO KNOW
Scatterplot
A graphical display that allows us to see the relationship between two quantitative variables.
2. Multiple Data Sets

A great thing about scatterplots is that you can easily show multiple data sets on one plot. This is
done by using different symbols to represent the different data sets.
 EXAMPLE Recall the in-context scenario from the previous section that compared quarterback
salary to total team payroll. Suppose that you wanted to add an additional categorical variable. You
want to know if the payrolls are different depending on conferences. There are two conferences in the
National Football League, the NFC and the AFC.
What you can do is use the same data, split the data between the two conferences, and use different
symbols for AFC (a gray circle) or NFC (a blue square).
You'll notice that it is the same scatterplot as before; however, the data points are separated by
conference, which is visible through the two different symbols.
 TERM TO KNOW
Multiple Data Sets

Plotting more than one data set on a scatterplot requires that we use different colors or symbols for
the different data sets so we can see the relationships separately.
 SUMMARY
Scatterplots are ways that you can show more than one quantitative attribute at a time for a particular
data set. It is a way to show the relationship between two quantitative variables, which are paired data
sets. These are two attributes for the same individuals in the data set. One variable, typically the one
that we think might cause the other to happen, is assigned to the x-axis. The other is assigned to the
y-axis. It's also possible to put in multiple data sets, just using different symbols or different colors, to
denote the different sets.
Good luck!
Source: Adapted from Sophia tutorial by Jonathan Osters.
 TERMS TO KNOW
Multiple Data Sets

Plotting more than one data set on a scatterplot requires that we use different colors or symbols for the
different data sets so we can see the relationships separately.
Scatterplot
A graphical display that allows us to see the relationship between two quantitative variables.
Describing Scatterplots
by Sophia
 WHAT'S COVERED
In this tutorial, you're going to learn about describing scatterplots. Our discussion breaks down as
follows:
1. Describing Scatterplots
a. Form
b. Direction
c. Strength
1. Describing Scatterplots
When talking about univariate data, or one-variable data, you would discuss the shape, center, and spread of
a distribution when making histograms and dot plots.
On a scatterplot, shape, center, and spread are harder to apply because there are two variables at once. For
example, the QB salary might be very spread out while the total payroll is not, so there is no single
spread to report.
Instead, you're actually going to describe:
Form
Direction
Strength
1a. Form
In the form, we look for a pattern. Is the pattern linear, or do the data show a curve? Do they start low, then
peak, then end low? Or do they start low and end high? How do they curve or do they rise quickly and then
tail off? There's a lot to look at.
When discussing form, you will most likely describe a scatterplot as linear or non-linear.
Forms of a Scatterplot
Linear: The scatterplot is approximating a line.
Non-Linear: The data points follow a curve.
No Association: Data points resemble a cloud and there is no clear pattern.
In addition, it is important to consider outliers or clusters when looking at the form. If the scatterplot is
essentially linear but has one outlier, we would want to note that. Also, if the data form clusters throughout
the scatterplot, that is important to note as well.
 TERM TO KNOW
Form
The overall shape of the data points. The form may be linear or nonlinear, or there may not be any
form at all to the points if they form a "cloud."
1b. Direction
The direction refers to how the y-axis variable responds as you move to the right along the x-axis. A
scatterplot can have one of two main directions: positive or negative.
Direction of a Scatterplot
Positive: The variables both increase or both decrease.
Negative: The variables go in opposite directions (one variable increases while the other variable decreases).
 TERM TO KNOW
Direction
The way one variable responds to an increase in the other. With a negative association, an increase
in one variable is associated with a decrease in the other, whereas with a positive association, an
increase in one variable is associated with an increase in the other.
1c. Strength
The strength is how closely the two variables are associated with some line or curve. How well do the points
follow that indicated form? How well do these points stack up on a line? The strength can be described as
strong, moderate, or weak.
Strengths of a Scatterplot
Strong: The scatterplot closely resembles the form. The data points are tightly clustered around either a
line or a curve.
Moderate: The data points are less tightly clustered around a line or curve; however, the direction is
still clear.
Weak: The data points are much more spread out and the direction may be less clear.
⚙ THINK ABOUT IT
Imagine the oval that you could put over the scatterplots. A strong association would have a very long,
thin oval over it. The moderate association would have kind of a wider oval over it, but it would still be
longer than it is wide. Over the weak association, the oval is almost more like a circle.
 HINT
The idea is, if you can encase the points in an oval, the stronger associations will have a longer, thinner
oval.
 TERM TO KNOW
Strength
The closeness of the points to the indicated form. Points that are strongly linear will all fall on or near a
straight line.
 TRY IT
This is the 1970 and 1980 price of different seafood in cents per pound. How would you describe the form,
direction, and strength of this scatterplot?
Form, Direction, and Strength of Scatterplot
(1970 vs 1980 Seafood Prices)
Form: Linear
The form is fairly linear. One point is a little bit low for the line that we would look at for the rest of
the data points. It also appears to have an outlier on the high side.
Direction: Positive
The direction is positive, which means that as the 1970 price increases, so does the 1980 price.
That's not surprising, because you would expect that the ones that are less expensive in 1970
would also be less expensive in 1980, and the ones that were more expensive in '70 would be
more expensive in '80.
Strength: Strong
The strength is very strong because it's fairly predictable what is going to happen with these
prices, based on the fact that they're very close to a line.
 SUMMARY
To describe displays of one-variable data, we look at shape, center, and spread. When we look at
scatterplots of two-variable data, we analyze form, direction, and strength. Regarding form,
are they linear or nonlinear? Are there unusual features, gaps, clusters, or outliers? Regarding
strength, how well do they follow that form? And lastly, we analyze the direction of the association:
what happens as the x-axis variable increases? Does the y-axis variable go up, down, or does it stay
the same? Or, is there really no association at all?
Good luck!
 TERMS TO KNOW
Direction
The way one variable responds to an increase in the other. With a negative association, an increase in
one variable is associated with a decrease in the other, whereas with a positive association, an increase
in one variable is associated with an increase in the other.
Form
The overall shape of the data points. The form may be linear or nonlinear, or there may not be any form
at all to the points, if they form a "cloud."
Strength
The closeness of the points to the indicated form. Points that are strongly linear will all fall on or near a
straight line.
Explanatory and Response Variables
by Sophia
 WHAT'S COVERED
This tutorial will explain explanatory and response variables. Our discussion breaks down as follows:
1. Explanatory Variables and Response Variables
1. Explanatory Variables and Response Variables

When examining the relationship between two variables, you often want to see if one has an effect on the
other. Does one variable being high or low help to explain why another variable would be high or low? The
first variable doesn't necessarily have to cause the increase or decrease in the second; it just has to be
associated with an increase or decrease in the other.
An explanatory variable is a variable that might cause an effect; it is the thing that we are looking to cause
something to happen. The response variable is the variable that will reflect that effect.
On a graph, the explanatory variable will go on the horizontal x-axis, and the response variable will go on the
vertical y-axis.
 HINT
Here is a mnemonic device to help you remember which variable goes on which axis: "explanatory" has
an "x" in it, so it's the x-axis, the horizontal axis.
IN CONTEXT
A fire breaks out, and you want to determine the relationship between the number of firefighters at
the fire and the financial damage caused by the fire. There's a positive association between these
two because as one goes up, the other goes up. Which one helps to explain the other?
The financial damage caused by the fire will help explain the number of firefighters at the fire. It's
important to know that it doesn't work the other way: having more firefighters does not cause more
damage.
They are associated, though, with each other. Because the severity of the fire is going to cause
more damage, it's also going to cause more firefighters to arrive on-scene.
When you put it on the graph, the explanatory variable, financial damage, goes on the x-axis. The
response variable, number of firefighters, goes on the y-axis.
⚙ THINK ABOUT IT
Consider the following examples and identify the explanatory and response variable:
Example: Maximum Daily Temperature and Cooling Costs
Explanatory: Maximum Daily Temperature
Response: Cooling Costs
Explanation: The maximum daily temperature is going to cause a change in cooling costs. The higher the
temperature, the more it will cost to cool your house.

Example: Rent and Square Footage of an Apartment
Explanatory: Square Footage of an Apartment
Response: Rent
Explanation: The square footage is going to cause a change in the cost of the rent. As the square footage
of an apartment increases, the rent of the apartment will also go up.

Example: SAT Verbal Score and SAT Math Score
Explanation: These two variables may be associated, meaning someone who does well on the verbal portion
of the test may also do well in the math portion. Here, however, one variable does not necessarily cause a
change in the other, so we would not assign either as the explanatory variable, and you can choose any
axis for the variables.
Occasionally, there is not a clear explanatory variable. What happens then?
IN CONTEXT
Cancer rates for kidney and lung cancer are known for the 50 states in the U.S. You don't think that
one type of cancer causes the other type. You don't really even think that an increase in one
corresponds to an increase or decrease in the other. The types of cancer don't seem to be related.
So in this case, when you graph them, it really doesn't matter which one is treated as the explanatory
variable and which as the response variable. They can be graphed either way.
OR
⭐ BIG IDEA
It only matters which variable goes on the x-axis if there's some obvious choice for an explanatory or
response variable. In situations where there is no clear explanatory variable, more investigation would be
required.
 TERMS TO KNOW
Explanatory Variable
The variable whose increase or decrease we believe helps explain a tendency to increase or
decrease in some other variable.
Response Variable
The variable that tends to increase or decrease due to an increase or decrease in the explanatory
variable.
 SUMMARY
In a scatter plot, an explanatory variable is one variable that helps to explain an increase or decrease
in another; it is on the x-axis. The variable that appears to increase or decrease due to the increase or
decrease in the explanatory variable is called the response variable and is placed on the y-axis. If it’s
not clear whether one variable is associated with an increase or decrease in the other at all, or we
believe that there's no real association between the two, then it doesn't really matter which one we
call the explanatory or response.
Good luck!
 TERMS TO KNOW
Explanatory Variable
The variable whose increase or decrease we believe helps explain a tendency to increase or decrease
in some other variable.
Response Variable
The variable that tends to increase or decrease due to an increase or decrease in the explanatory
variable.
Correlation
by Sophia
 WHAT'S COVERED
This tutorial will introduce correlation and correlation coefficient. Our discussion breaks down as
follows:
1. Correlation
a. Calculating Using Formula
b. Calculating Using Excel Function
2. Scatter Plots with Different Correlation Coefficients
1. Correlation
When first describing scatter plots, you learned about their form, direction, and strength.
Form assesses the linearity.
Direction says whether the data points tend to move in a positive or negative direction.
Strength shows how well they follow that form.
When the form is linear, we can use a number called correlation. It measures the strength and direction of a
linear relationship. The direction is easy to spot from the sign: the number is positive if there's a positive
association and negative if there's a negative association. The size of the number (its distance from zero)
measures the strength.
⭐ BIG IDEA
In terms of direction: if the explanatory and response variables rise together, that is called a positive
correlation. If one falls as the other rises, that is called a negative correlation.
The correlation is measured using a numerical value known as the correlation coefficient. The correlation
coefficient is denoted "r" and is unit-less. It is a number between negative 1 and positive 1 and indicates
the strength of the linear association.
Numbers that are close to negative 1 or positive 1 indicate a strong association between the two
variables: a 1 indicates a perfect positive linear association, and a negative 1 indicates a perfect negative
linear association. Numbers near zero represent almost no linear relationship.
 HINT
You can use the chart above to help you to understand the value of a correlation. Numbers between 0.8
and 1 are considered to show a strong correlation; between 0.5 and 0.8 a moderate correlation; and
between 0 and 0.5 a weak correlation. The same ranges apply, with the sign reversed, between negative 1
and 0.
 TERMS TO KNOW
Correlation
The strength and direction of a linear association between two quantitative variables.
Correlation Coefficient
The numerical value between -1 and +1 that measures the correlation between two quantitative
variables.
1a. Calculating Using Formula
Correlation is essentially the average of the products of the z-scores for the x's and the y's, except that
the sum is divided by n - 1 rather than n. Each z-score for x is the value of x minus the mean of x, divided
by the standard deviation of x. The same is done for y.
 FORMULA
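Correlation
$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$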
 EXAMPLE These are destinations that you could go to from the city of Minneapolis-Saint Paul, with
the distances away from Minneapolis and the airfare to fly to any of these places.
Destination Miles Airfare
Kansas City 460 379
Los Angeles 1,870 377
Milwaukee 338 158
New York City 1,167 283
Philadelphia 1,141 323
 STEP BY STEP
Step 1: Calculate z-scores of the x variable. In this situation, miles is x, or the explanatory variable, as
miles are believed to cause airfare to rise. This makes airfare the response variable, y. Take the given
miles and airfare and convert both of them into z-scores.
To do this, you need the mean and the standard deviation. Recall from Unit 3 that you can use Microsoft
Excel to easily find these values. For the mean, use the function "=AVERAGE", and for the standard
deviation, use the function "=STDEV.S". When using these functions in Excel, you just need to highlight
each column that you are finding the mean and standard deviation for. We will do this for both Miles and
Airfare.
Next, to calculate the z-score, subtract the mean from each value and divide by the standard deviation.
For example, using the first value in Miles, take 460 minus the mean, 995.2, and divide by the standard
deviation, 619.35. This gives a z-score of -0.864.
Do the same thing for the 1,870 miles to Los Angeles, and all of the other cities.
Step 2: Repeat this process and calculate the z-scores for the y values. In this scenario, the response
variables are the airfare values. Starting with Kansas City, 379 minus 304 divided by 90.93 gives us
0.825.
Do the same thing with all the rest of the airfare.
Step 3: Multiply the corresponding z-scores and add. Starting with -0.864 and 0.825, multiply the
corresponding z-scores for the x and y variables, all down the rows, then add them up.
The sum here ends up being positive 2.11. We can substitute this value into the correlation formula.
Step 4: Finally, divide by the number of observations minus 1. There are five observations, so the
denominator will be 5-1.
Dividing by four yields a correlation of 0.527. This value tells us that the correlation between airfare and
miles is a positive relationship but fairly weak association. We can also see this from the scatter plot:
1b. Calculating Using Excel Function
The hand calculation is cumbersome, so the correlation coefficient is almost always found using
technology. In Excel, once the miles and airfare values are listed, type "=CORREL(", which is short for
correlation. Select the cells holding the x values, type a comma, and select the cells holding the y
values. Close the parentheses and hit "Enter."
Sure enough, it gives you the 0.527 that you got before.
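The same steps can also be checked outside of Excel. The following is a minimal sketch in Python, not part of the original tutorial; the helper names z_scores and correlation are made up for illustration. It follows the four steps above: convert each column to z-scores, multiply the paired z-scores, add the products, and divide by n - 1. The result should land close to the 0.527 found above; the last digit can differ slightly depending on how intermediate values are rounded.

```python
from statistics import mean, stdev

# Miles (x) and airfare (y) for the five destinations in the table above.
miles = [460, 1870, 338, 1167, 1141]
airfare = [379, 377, 158, 283, 323]


def z_scores(values):
    """Return (value - mean) / standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, like Excel's =STDEV.S
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Add up the products of paired z-scores, then divide by n - 1."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)


r = correlation(miles, airfare)

print(round(z_scores(miles)[0], 3))    # z-score of the Kansas City distance
print(round(z_scores(airfare)[0], 3))  # z-score of the Kansas City airfare
print(round(r, 3))                     # correlation between miles and airfare
```

Swapping the arguments, as in correlation(airfare, miles), gives the same value, because multiplying the z-scores does not depend on which variable is treated as x.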
2. Scatter Plots with Different Correlation Coefficients
Let's explore some scatter plots with different correlation coefficients.
r = -0.99: The data points are in a negative direction, so the correlation is a negative number. It is also
nearly linear, so its correlation is negative 0.99, which is very close to negative 1. This graph shows a very
strong, negative association.
r = -0.5: The data points are in a negative direction, so the correlation is a negative number. However, the
data is fairly spread out, so the strength is not terribly strong. This graph shows a weak to moderate,
negative association.
r = 0: The data points have a cloudy association, so it is neither positive nor negative. There is also no
linear association between the two variables, so the correlation is zero.
r = 0.7: The data points show an upward association, so the correlation is a positive number. Although it is
linear, the data points are not very clustered, but still close enough to show a fairly moderate to strong
association.
r = 0.9: The data points are in a positive direction, so the correlation is a positive number. It is also very
linear, so its correlation will be closer to 1. This graph shows a strong, positive association.
r = 0.3: Even though the points are spread out, a positive association is visible. However, since they are
not closely clustered, the data points will show a weaker strength.
The correlation describes only the linear relationship between two quantitative variables. As a caution,
you're going to hear the word correlation thrown around a lot in everyday speech, and errors are common
when it is applied to two different variables. It is always important to make sure the two variables being
measured are quantitative.
⚙ THINK ABOUT IT
Can you spot the errors in the following statements?
"There is a strong correlation between Type 2 diabetes, physical inactivity, and obesity."
Although it's possible that they are related, you can't use the word correlation. The first error here is that
three variables are being compared, and correlation only compares two variables. Also, Type 2 diabetes
is categorical--either you have it, or you don't. Physical inactivity could be quantitative, but it's not
obviously quantitative. Obesity is certainly categorical.
"There is a strong correlation between IQ and religious affiliation."
Although IQ is quantitative, religious affiliation is categorical. You can't calculate the correlation between
the two.
 SUMMARY
Correlation measures the strength and direction of a linear relationship between two quantitative
variables on a scatter plot. Strong associations have correlation coefficients near positive 1 or negative 1.
Weak associations have correlation coefficients near zero. Almost always, you can find the correlation
coefficient using technology, such as a calculator, an Internet applet, or a spreadsheet. Because of the way
that r is calculated, where you're multiplying z-scores, it doesn't matter which variable is called the
explanatory and which is the response.
Good luck!
 TERMS TO KNOW
Correlation
The strength and direction of a linear association between two quantitative variables.
Correlation Coefficient (r)
The numerical value between -1 and +1 that measures the correlation between two quantitative
variables.
 FORMULAS TO KNOW
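Correlation
$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$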
Positive and Negative Correlations
by Sophia
 WHAT'S COVERED
This tutorial will explore positive correlation and negative correlation. Our discussion breaks down as
follows:
1. Correlation
a. Positive and Negative Correlation
b. Relative Zero Correlation
c. Non-Linear Relationships
1. Correlations
Correlation is going to allow you to observe the strength and direction of a linear association between two
quantitative variables. Recall that it is a number between negative 1 and positive 1.
Any correlation coefficient between negative 0.5 and positive 0.5 is considered a weak association
between the two quantitative variables.
Any correlation coefficient between positive 0.5 and positive 0.8, or negative 0.5 to negative 0.8, is
considered a moderately strong correlation.
Any correlation coefficient between positive 0.8 to positive 1, or negative 0.8 to negative 1 is considered a
very strong correlation.
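These cutoffs can be written as a small lookup. The sketch below is only an illustration in Python; the function name describe_strength is hypothetical, and the handling of the exact boundary values (treating 0.5 as moderate and 0.8 as strong) is an assumption, since the ranges above share their endpoints.

```python
def describe_strength(r):
    """Label a correlation coefficient using the 0.5 and 0.8 cutoffs."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must be between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size >= 0.8:  # assumption: exactly 0.8 counts as strong
        return "strong"
    if size >= 0.5:  # assumption: exactly 0.5 counts as moderate
        return "moderate"
    return "weak"


for value in (-0.99, 0.7, 0.3):
    print(value, describe_strength(value))
```

The labels are rules of thumb for describing a scatter plot, not sharp statistical boundaries.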
1a. Positive and Negative Correlation

A positive correlation is going to be a tendency of the response variable to increase in response to an
increase in the explanatory variable.
 EXAMPLE Below is a visual representation with a correlation coefficient, r, of positive 0.7. Even
though the direction is positive, the association is not terribly strong.
r = 0.7
A negative correlation is going to be the tendency of the response variable to decrease in response to an
increase in the explanatory variable.
 EXAMPLE Below is a visual representation with a correlation coefficient, r, of negative 0.99. This
means it's almost a perfectly straight linear relationship. It is a negative correlation because as the
explanatory variable on the x-axis increases, the response variable on the y-axis has a tendency to
decrease.
r = -0.99
 TERMS TO KNOW
Positive Correlation
The type of correlation present when two variables have a correlation coefficient generally greater
than or equal to 0.5.
Negative Correlation
The type of correlation present when two variables have a correlation coefficient generally less than
or equal to -0.5.
1b. Relative Zero Correlation

Some graphs will appear to be a cloud. In this case, the relationship will have a relative zero correlation.
There's no discernible association between the explanatory and the response.
 EXAMPLE Below is a visual representation with a correlation coefficient, r, of zero.
r=0
 HINT
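If all the points lined up in a straight horizontal line, there would also be no linear association, because
y does not change as x changes. (Strictly speaking, r cannot be calculated in that case, because the
standard deviation of y is zero.)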
 TERM TO KNOW
Relative Zero Correlation
The type of correlation present when two variables have a correlation coefficient generally between
-0.5 and 0.5.
1c. Non-Linear Relationship

One thing that's worth noting is that the numbers, like correlation, very rarely tell the entire story.
 EXAMPLE Consider the two tables below.

     Table 1              Table 2
   x        y           x        y
  10      804          10      914
   8      695           8      814
  13      758          13      874
   9      881           9      877
  11      833          11      926
  14      996          14      810
   6      724           6      613
   4      426           4      310
  12    1,084          12      913
   7      482           7      726
   5      568           5      474
   r = 0.82             r = 0.82
The correlation coefficient is 0.82 for both tables. Based on that, you might think that they look similar
when they are graphed. However, this is not the case.
With the first graph, you can see it's a fairly strong positive linear association, just as you would expect.
With the second graph, it's a strong association, but it's not linear. This follows the form for a non-linear
relationship. If x and y have a nonlinear relationship, a line isn't going to model it accurately. Even
though the two data sets have the same correlation coefficient, a line is a correct model for one data set
but not for the other.
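To see that the two tables really do share a correlation coefficient, the values can be run through the z-score method. This is a rough sketch in Python rather than the tutorial's own method; the helper name correlation is an illustrative choice. It also lists Table 2 in order of x, which shows the y values rising and then falling.

```python
from statistics import mean, stdev

# Both tables use the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_table1 = [804, 695, 758, 881, 833, 996, 724, 426, 1084, 482, 568]
y_table2 = [914, 814, 874, 877, 926, 810, 613, 310, 913, 726, 474]


def correlation(xs, ys):
    """Sum of the products of paired z-scores, divided by n - 1."""
    mx, sx = mean(xs), stdev(xs)
    my, sy = mean(ys), stdev(ys)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


print(round(correlation(x, y_table1), 2))
print(round(correlation(x, y_table2), 2))

# Listing Table 2 from smallest x to largest x exposes the curve.
for x_value, y_value in sorted(zip(x, y_table2)):
    print(x_value, y_value)
```

The correlation coefficient summarizes both data sets with the same number, so only the sorted listing (or a graph) reveals that the second one bends.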
If you see that the correlation is very low, near zero, you might assume there's no relationship between x
and y. However, you could be wrong.
 EXAMPLE Consider this data set.

x y
1.2 23.3
2.5 21.5
6.5 12.2
13.1 3.9
24.2 4
34.1 18
20.8 1.7
37.5 26.1
r = 0.00099
The correlation coefficient for this data is very low. You may assume that there is no relationship. Let's see
what the graph of this data looks like.
You can see there's a clear trend in the data set; however, it is non-linear.
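The near-zero correlation and the hidden pattern can both be checked numerically. The sketch below is an illustration in Python under the same z-score definition of r; the helper name correlation is hypothetical. It prints r and then the points in order of x, where the y values fall and then rise again.

```python
from statistics import mean, stdev

x = [1.2, 2.5, 6.5, 13.1, 24.2, 34.1, 20.8, 37.5]
y = [23.3, 21.5, 12.2, 3.9, 4, 18, 1.7, 26.1]


def correlation(xs, ys):
    """Sum of the products of paired z-scores, divided by n - 1."""
    mx, sx = mean(xs), stdev(xs)
    my, sy = mean(ys), stdev(ys)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


r = correlation(x, y)
print(round(r, 5))  # very close to zero

# Sorted by x, the y values drop toward the middle and climb at the end.
for x_value, y_value in sorted(zip(x, y)):
    print(x_value, y_value)
```

A value of r this close to zero says only that a straight line is a poor description of the data, not that x and y are unrelated.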
⭐ BIG IDEA
It is important to know that the correlation coefficient, r, only measures the strength of a linear relationship
between x and y. To really understand a relationship between two variables, it is crucial to always graph
your data.
 TERM TO KNOW
Non-Linear Relationships
Associations between two variables that can be modeled better with a curve than a line.
 SUMMARY
Correlation is a way to quantify the strength and the direction of a linear association, or a linear
relationship between two quantitative variables that lie on a scatter plot. A strong linear association
will be a number near positive 1 or negative 1. There are also moderate correlation coefficients and
weak correlation coefficients. Weak linear associations will have a correlation coefficient near zero. A
set of data might have low correlation, but a strong non-linear association. Always plot your data, and
you'll see the association first hand.
Good luck!
 TERMS TO KNOW
Negative Correlation
The type of correlation present when two variables have a correlation coefficient generally less than or
equal to -0.5.
Non-Linear Relationships
Associations between two variables that can be modeled better with a curve than a line.
Positive Correlation
The type of correlation present when two variables have a correlation coefficient generally greater than
or equal to 0.5.
Relative Zero Correlation
The type of correlation present when two variables have a correlation coefficient generally between -0.5
and 0.5.
Coefficient of Determination/r^2
by Sophia
 WHAT'S COVERED
This tutorial will explain the coefficient of determination. This is also called r squared, the square of
the correlation coefficient. Our discussion breaks down as follows:
1. The Coefficient of Determination

2. Finding r From r-squared
1. The Coefficient of Determination

The correlation coefficient, r, gives a general measure of strength and direction of a linear relationship. There
is also the coefficient of determination, or r squared, which provides a very specific measurement. It provides
the percent of the variation in the y-direction that can be explained by the linear relationship with the x
variable. This can be a little confusing to understand.
 EXAMPLE Even though it is on the y-axis, the graph here is a dot plot of the seafood prices in
1980. This is going to be your y variable, but it's not very well contextualized. You would still wonder
why the point all the way up at 400, which represents sea scallops, is so expensive. What would
cause that price to be so high, while other prices are so low?
What you can do is add a variable to understand why the 1980 price of sea scallops was so high, while
some of the other prices were so low. Look at it with the new variable of 1970 prices to explain why some
of these are high or low or in the middle.
The low prices were low in 1970, and the high prices were high in 1970. Looking at the 1980 prices separated from
their previous context doesn't really help to explain why certain prices are high or low. Looking at them with the full
context of previous knowledge and its associations helps to explain why that specific point is high up, and
why some of the other points are low. Sea scallops are high up because their 1970 price was also high, and the two sets of prices are strongly linearly associated.
The value of the coefficient of determination, or r^2, in this particular example is 0.935. This means that
93.5% of the variation in 1980 prices can be explained by a linear association with 1970 prices.
⚙ THINK ABOUT IT
You might be wondering what happened to the other 6.5% of the variation. How is that explained? The
reason that it's not 100% of the variation is that these points don't all lie perfectly on a line. If they did, all
the reasoning behind the 1980 price would be explained by the 1970 price. But, they don't lie exactly on a
line.
Some points fall conspicuously a little bit below what you would imagine the line to look like. The
remaining 6.5% of variation has to be explained by something else. Perhaps some species of fish were
over-fished, and that raised prices. Or perhaps people's tastes changed, and the demand for a particular
fish fell, and that lowered the price.
 TERM TO KNOW
Coefficient of Determination (r^2)

A value that explains the percent of variation in the response variable that can be explained by a
linear association with the explanatory variable. It is the square of the correlation coefficient.
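The definition above can be checked numerically. The following is a minimal Python sketch, assuming made-up 1970 and 1980 prices rather than the tutorial's seafood data, with hypothetical helper names. It fits a least-squares line, measures the fraction of the variation in y that the line accounts for, and compares that fraction with the square of the correlation coefficient.

```python
from math import sqrt

# Hypothetical prices, invented for illustration (not the tutorial's seafood data).
prices_1970 = [10, 20, 30, 40, 50]
prices_1980 = [25, 38, 70, 82, 110]


def mean(values):
    return sum(values) / len(values)


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def explained_fraction(xs, ys):
    """Fraction of the variation in ys accounted for by a least-squares line on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Variation left over after the line has made its predictions.
    unexplained = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - unexplained / total


r = correlation(prices_1970, prices_1980)
r_squared = r ** 2
fraction = explained_fraction(prices_1970, prices_1980)
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, explained fraction = {fraction:.3f}")
```

The two printed values for r^2 and the explained fraction agree, which is why r^2 can be read as a percent of variation explained.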
2. Finding r From r-squared

Ultimately, r squared is never a negative number, and it does help to measure the strength of the linear
association. It measures something very specific. It doesn't indicate the direction; it can only indicate the
strength.
We can also use the coefficient of determination, r^2, to find the correlation coefficient, r.
 STEP BY STEP
Step 1: Take the square root of r^2. If only r-squared is given, take the square root
to obtain the size of the correlation coefficient, r.
Step 2: Look at the graph to determine sign. You also have to look at the graph to find the association--
either positive or negative--to determine the sign of the correlation coefficient.
 EXAMPLE Look at each of the following examples and find the correlation coefficient, r, from r-
squared.
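Columns: Graph, Coefficient of Determination (r^2), Correlation Coefficient (r)
First graph: shows a positive association, so the correlation coefficient stays positive (r is the positive square root of r^2).
Second graph: shows a negative association, so the correlation coefficient is actually negative (r is the negative square root of r^2).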
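The two steps above can be sketched in Python. This is a minimal sketch: the function name `r_from_r_squared`, the "positive"/"negative" labels, and the value 0.81 are hypothetical choices for illustration, not values taken from the graphs in this tutorial.

```python
from math import sqrt


def r_from_r_squared(r_squared, direction):
    """Recover r from r^2, taking the sign from the direction seen on the scatterplot."""
    if not 0 <= r_squared <= 1:
        raise ValueError("r^2 must be between 0 and 1")
    if direction not in ("positive", "negative"):
        raise ValueError("direction must be 'positive' or 'negative'")
    magnitude = sqrt(r_squared)  # Step 1: take the square root
    # Step 2: the graph, not r^2, decides the sign
    return magnitude if direction == "positive" else -magnitude


# Hypothetical r^2 of 0.81 read with each possible direction.
r_up = r_from_r_squared(0.81, "positive")
r_down = r_from_r_squared(0.81, "negative")
print(r_up, r_down)
```

The same r^2 gives two different answers, which is why the scatterplot has to be consulted before reporting r.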
 SUMMARY
The coefficient of determination allows us to understand the percent of variation in the vertical
direction that can be explained by the linear association that the two variables have. If you solve for r,
from r squared, you need to not only take the square root but also look at the scatterplot to determine
the sign, because r squared can't be negative but r can.
Good luck!
 TERMS TO KNOW
Coefficient of Determination (r^2)

A value that explains the percent of variation in the response variable that can be explained by a linear
association with the explanatory variable. It is the square of the correlation coefficient.
Outliers and Influential Points
by Sophia
 WHAT'S COVERED
This tutorial is going to teach you about outliers and influential points. Our discussion breaks down as
follows:
1. Outliers
2. Influential Points
1. Outliers
You may recall the term "outliers" when talking about univariate data. However, in bivariate data, outliers are a
little bit different.
An outlier is any point that deviates substantially from the overall form of the remainder of the data points.
 EXAMPLE Let's take a look at these two data sets. One thing that you might realize is that the
ones on the left seem quite random, whereas in the ones on the right, all the x's except one are 8,
which might be a clue to something.
Table 1          Table 2
x     y          x     y
10    746        8     658
8     677        8     576
13    1,274      8     771
9     711        8     884
11    781        8     847
14    884        8     704
6     608        8     525
4     539        19    1,250
12    815        8     556
7     642        8     791
5     573        8     689
However, if you calculate the mean and standard deviation, you will find that they have the same mean
for the x's, the same mean for the y's, the same standard deviation for the x's and the same standard
deviation for the y's. Also, their correlations are the same at 0.816 in a positive direction.
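The claim that both tables share the same summary statistics can be verified directly. The sketch below is one way to do it in Python, using the values from Table 1 and Table 2; the helper names `pearson_r` and `summarize` are hypothetical, and sample standard deviations are assumed.

```python
from math import sqrt
from statistics import mean, stdev

# (x, y) pairs copied from Table 1 and Table 2.
table_1 = [(10, 746), (8, 677), (13, 1274), (9, 711), (11, 781), (14, 884),
           (6, 608), (4, 539), (12, 815), (7, 642), (5, 573)]
table_2 = [(8, 658), (8, 576), (8, 771), (8, 884), (8, 847), (8, 704),
           (8, 525), (19, 1250), (8, 556), (8, 791), (8, 689)]


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def summarize(points):
    """Means, sample standard deviations, and correlation for one table."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {"mean_x": mean(xs), "mean_y": mean(ys),
            "sd_x": stdev(xs), "sd_y": stdev(ys), "r": pearson_r(points)}


summary_1 = summarize(table_1)
summary_2 = summarize(table_2)
for name, summary in (("Table 1", summary_1), ("Table 2", summary_2)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

The printed summaries match to within rounding even though the x-values in the two tables look nothing alike.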
Based on that information, one would think that these two graphs will look fairly similar. Let's take a look:
Graph 1 Graph 2
Both graphs have an outlier that does not follow the overall trend of the graph. Depending on the pattern, the
outlier could be an extreme x-value, an extreme y-value, extreme for both the x- and y-values, or neither.
Types of Outliers and Examples
Extreme x-values: This is an outlier in the x-direction because it's so much further to the right than the rest of the pack of
points, but not in the y-direction. If you look horizontally, it's in the middle-to-lower part of the
y-values. It's an outlier in the x-direction but not the y-direction.
Extreme y-values: This is an outlier in the y-direction because it's so much higher than the other y-values, but not
in the x-direction.
Extreme x- and y-values: This is an outlier in both the x- and y-directions because it's so much further to the right and also
higher than the rest of the points.
Neither extreme x- nor y-values: Even though it is not extreme in either the x- or y-direction, it doesn't fit the overall trend
established by the rest of the data.
 TERM TO KNOW
Outlier
Points that deviate substantially from the overall form of the remainder of the data points.
2. Influential Points
Influential points are points that, if removed, significantly change a statistical measure. Usually, the measure
that we're talking about changing is correlation, but it could also affect other measurements such as the mean
of x or y and the standard deviation of x or y.
Some outliers are influential, and some are not.
 EXAMPLE When the scatterplot on the left includes the outlier, the correlation coefficient is 0.816.
However, when we remove the outlier, the correlation coefficient changes to 1. Since this dramatically
changes the correlation, this outlier would be considered an influential point.
With outlier: r = 0.816
Without outlier: r = 1
 EXAMPLE When the scatterplot below includes the outlier, the mean of x is 9, the standard
deviation of x is 3.3, and the correlation is 0.816. However, when we remove the outlier, the mean
becomes 8 because now all the x-values are 8, the standard deviation is 0 because they never
deviate from 8, and the correlation drops to 0 because no linear trend is left to measure (strictly speaking, r is undefined when x does not vary at all). Therefore, the outlier changes all of these measures very substantially
by being there. That outlier is certainly influential.
With outlier: mean = 9, standard deviation = 3.3, r = 0.816
Without outlier: mean = 8, standard deviation = 0, r = 0
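The effect of removing each outlier can be reproduced from Table 1 and Table 2 in the Outliers section. The sketch below is a minimal Python version with hypothetical helper names. One design choice to note: `pearson_r` returns `None` when x or y never varies, because r is undefined in that case, whereas the tutorial reports that situation as r = 0.

```python
from math import sqrt
from statistics import mean, stdev

# (x, y) pairs copied from Table 1 and Table 2 in the Outliers section.
table_1 = [(10, 746), (8, 677), (13, 1274), (9, 711), (11, 781), (14, 884),
           (6, 608), (4, 539), (12, 815), (7, 642), (5, 573)]
table_2 = [(8, 658), (8, 576), (8, 771), (8, 884), (8, 847), (8, 704),
           (8, 525), (19, 1250), (8, 556), (8, 791), (8, 689)]


def pearson_r(points):
    """Pearson's r, or None when x or y never varies (r is undefined then)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def without(points, outlier):
    """Copy of the data set with one observation removed."""
    return [point for point in points if point != outlier]


table_1_trimmed = without(table_1, (13, 1274))  # outlier in the y-direction
table_2_trimmed = without(table_2, (19, 1250))  # outlier in the x- and y-directions

r1_with, r1_without = pearson_r(table_1), pearson_r(table_1_trimmed)
r2_with, r2_without = pearson_r(table_2), pearson_r(table_2_trimmed)
xs_2_trimmed = [x for x, _ in table_2_trimmed]

print("Table 1 r:", round(r1_with, 3), "->", round(r1_without, 3))
print("Table 2 r:", round(r2_with, 3), "->", r2_without)
print("Table 2 x mean and sd without outlier:", mean(xs_2_trimmed), stdev(xs_2_trimmed))
```

Both outliers are influential by the definition above, since dropping a single point changes the correlation in each table substantially.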
 EXAMPLE The outlier in the scatterplot below is not going to have a great effect on the correlation
or the least squares regression line that these data sets create. In this case, a line is an inappropriate
model, but if you did make a line, having this point versus removing this point wouldn't affect that line
or the correlation very much.
 TERM TO KNOW
Influential Point
An observation that, if removed, significantly changes a statistical measure.
 SUMMARY
Important points on a scatterplot are influential points and outliers. Influential points substantially
change at least one statistical measure. Outliers simply are points that deviate from the overall form of
the rest of the points. They may be outliers in the x- or y-direction, but don't have to be, according to
this definition. Be aware that different people use different definitions of outliers for scatterplots, so
there's not one hard-and-fast definition.
Good luck!
 TERMS TO KNOW
Influential Point
An observation that, if removed, significantly changes a statistical measure.
Outlier
In a scatter plot, an outlier is an observation that has an extreme x value, an extreme y value, both an
extreme x and y, or is well away from the main trend of points.
Cautions about Correlation
by Sophia
 WHAT'S COVERED
This tutorial will explain certain cautions about using correlation. Our discussion breaks down as
follows:
1. Cautions about Correlation

a. Influential Points and Non-Linearity
b. Inappropriate Grouping
1. Cautions about Correlation

Correlation is a statistical measure like mean or standard deviation. However, it doesn't tell the entire story.
You have to actually graph the data in order to really fully understand the relationship.
Sometimes the correlation coefficient is influenced by another factor, such as:
Influential Points
Non-Linearity
Inappropriate Grouping
1a. Influential Points and Non-Linearity

Recall that an influential point is an observation that, if removed, significantly changes a statistical measure.
Influential points are usually easy to spot on a scatterplot because they are usually outliers.
Also, remember that correlation measures the direction and strength of a linear relationship. If a graph is
curved, then the relationship cannot be properly measured by a correlation coefficient.
 EXAMPLE Here are three data sets:

Table 1          Table 2          Table 3
x     y          x     y          x     y
10    804        10    914        10    746
8     695        8     814        8     677
13    758        13    874        13    1,274
9     881        9     877        9     711
11    833        11    926        11    781
14    996        14    810        14    884
6     724        6     613        6     608
4     426        4     310        4     539
12    1,084      12    913        12    815
7     482        7     726        7     642
5     568        5     474        5     573
All three of these data sets have an x mean of 9, a y mean of 750, a standard deviation in x of 3.32,
and a standard deviation in y of 203. Their correlations are also all 0.816, which on its own would suggest that all three are
linear with similar strength.
However, if we look at the three graphs, only Graph 1 is linear in the way that the statistics suggest.
One of the big ideas about correlation is that it can be affected strongly by non-linearity or influential
points.
Graph 1
Graph 2: Affected by Non-Linearity
Graph 3: Affected by Influential Points
⭐ BIG IDEA
Don't simply trust that a strong correlation number means that x and y are strongly linearly related.
You have to actually look at the data points on the scatterplot to see if
they form a line, as in the first graph, or form a curve, or include influential points.
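When a plot is not at hand, a rough numeric stand-in for looking at the shape is to sort each table by x and inspect how much y changes from one point to the next. The Python sketch below does this for the three tables above; the helper names are hypothetical, and this check is only an illustration, not a replacement for the scatterplot.

```python
from math import sqrt

# (x, y) pairs copied from Table 1, Table 2, and Table 3 above.
tables = {
    "Table 1": [(10, 804), (8, 695), (13, 758), (9, 881), (11, 833), (14, 996),
                (6, 724), (4, 426), (12, 1084), (7, 482), (5, 568)],
    "Table 2": [(10, 914), (8, 814), (13, 874), (9, 877), (11, 926), (14, 810),
                (6, 613), (4, 310), (12, 913), (7, 726), (5, 474)],
    "Table 3": [(10, 746), (8, 677), (13, 1274), (9, 711), (11, 781), (14, 884),
                (6, 608), (4, 539), (12, 815), (7, 642), (5, 573)],
}


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)


def steps_in_y(points):
    """Change in y between neighbouring points after sorting by x."""
    ordered = sorted(points)
    return [y2 - y1 for (_, y1), (_, y2) in zip(ordered, ordered[1:])]


correlations = {name: pearson_r(points) for name, points in tables.items()}
steps = {name: steps_in_y(points) for name, points in tables.items()}
for name in tables:
    print(name, "r =", round(correlations[name], 3), "steps in y:", steps[name])
```

All three correlations come out the same, but the steps differ in character: Table 1's bounce around, Table 2's shrink steadily from 164 down to -64 (a curve that rises and then falls), and Table 3's are nearly constant except for one jump of 459 at the outlier.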
1b. Inappropriate Grouping
Correlation can also be misleading because of what we call inappropriate
grouping. This is when subgroups are combined together when they should not be combined. This results
in a weakened, or even reversed, association.
 EXAMPLE Consider the scatter plot showing the age and salary of workers at a particular factory.
You would assume that the younger folks would make less than the older folks. Apparently, on this
scatter plot, that's not really the case.
It appears there's a weak negative association; the longer you work there, the less you make, which
doesn't really make a whole lot of sense. Typically, longevity is rewarded with higher salaries.
There might be a lurking variable behind this, where, if you look at it closely, you can see that there are
two groups.
In the first group with the younger workers, they may all have college degrees. They might have
ascended to higher positions, such as a foreman rather than an assembly line worker.
In the second group with the older workers, perhaps they don't have a college degree and have the
lower paying jobs than the younger folks.
So, you might have something like this.
If you look at the two groups separately, they both have a strong positive association. The longer you
work there or the older you are, at any rate, your salary will go up. However, when viewed as a whole, it
appeared that the association was negative.
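The reversal described above can be reproduced with a small data set. The ages and salaries in the Python sketch below are invented for illustration (salaries in thousands of dollars), and the group and helper names are hypothetical; only the pattern, not the numbers, comes from the factory example.

```python
from math import sqrt

# Hypothetical (age, salary in thousands) pairs, invented for illustration.
younger_with_degree = [(25, 58), (28, 64), (31, 63), (34, 70), (37, 72)]
older_without_degree = [(45, 48), (48, 51), (51, 56), (54, 55), (57, 61)]


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)


r_younger = pearson_r(younger_with_degree)
r_older = pearson_r(older_without_degree)
# Combining the two subgroups is the "inappropriate grouping" step.
r_combined = pearson_r(younger_with_degree + older_without_degree)

print("younger group:", round(r_younger, 2))
print("older group:  ", round(r_older, 2))
print("combined:     ", round(r_combined, 2))
```

Each subgroup has a strong positive correlation on its own, yet the combined correlation is negative, because the older group sits lower on the salary axis as a whole.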
 TERM TO KNOW
Inappropriate Grouping
Combining together subgroups that should not be combined, resulting in a weakened, or even
reversed, association.
 SUMMARY
Correlation is a useful measure. However, like any statistical measurement, it doesn't tell the entire
story. You have to graph your data because correlation can be affected by influential points,
non-linearity, and inappropriate grouping. Inappropriate grouping is when combining subgroups that should be kept
separate weakens or even reverses an association. In the previous example, when all the workers were
combined, it appeared that there was a negative association, whereas when the two subgroups were viewed
separately, each showed a positive association. That was an example of inappropriately
combining the workers with college degrees and those without.
Good luck!
 TERMS TO KNOW
Inappropriate Grouping
Combining together subgroups that should not be combined, resulting in a weakened, or even
reversed, association.
Correlation and Causation
by Sophia
 WHAT'S COVERED
This tutorial will introduce the connection between correlation and causality. Our discussion breaks
down as follows:
1. Correlation and Causation

a. Lurking Variables
b. Reversed Association
1. Correlation and Causation

Correlation and causation are not the same thing. However, it's often tempting to say that two well-correlated
variables have what we call a "causal" link; that is, that one variable is causing the other to happen.
 EXAMPLE Suppose you have two variables and find that the correlation coefficient is 1, meaning
they have a perfect linear correlation and they are strongly associated. However, you cannot say that
one variable causes the other variable to happen without doing some other tests and making other
assertions.
Correlation is just saying that the two variables or events have a linear association. Causation is when one
variable actually causes another variable to occur.
There doesn't always have to be an explanation for the relationship between two events. It's possible that
two variables might be very well-correlated, but the correlation is simply a coincidence. Therefore, the best
way to prove cause-and-effect is with a controlled experiment where the explanatory variable is administered
to one group and withheld from the other.
If the experiment follows the basic experimental design principles of control, randomization, and replication,
the experiment can, in fact, prove a cause-and-effect relationship. It can give the best evidence for causation.
⭐ BIG IDEA
Correlation does NOT imply causation!

If you do find a strong correlation, there are a variety of explanations for why we cannot say there is
causation:
There could be something called a "lurking variable" behind the scene that causes an increase or
decrease in one or both of them.
It could simply be that you got the association reversed.
 TERMS TO KNOW
Correlation
A statistic which measures the strength and direction of the linear association between two
quantitative variables.
Causation/Cause-and-Effect
A phenomenon whereby an increase in one variable directly leads to an increase or decrease in
another variable.
1a. Lurking Variables

One reason that we cannot say there is causation with a strong correlation could be a lurking variable. This
other variable could be confusing the relationship between the explanatory variable and the response
variable.
 EXAMPLE In many families where parents left the light on in their infant's room as they slept, the
infant developed nearsightedness. This is an actual studied scenario, where researchers noticed that
there was a positive relationship between sleeping with the light on and having nearsightedness.
Therefore, researchers concluded that sleeping with the light on might cause nearsightedness.
Is this conclusion correct?
Upon follow-up studies, this conclusion was shown to be incorrect. The nearsightedness of the
children was genetic and was therefore caused by their parents' nearsightedness, not by sleeping in a
room with the light on. In fact, the parents' nearsightedness caused them to leave the light on in the
child's room so that the parents could see.
Therefore, the nearsightedness of the child and the light being left on were both due to the lurking
variable of their parents' nearsightedness. It wasn't the light that caused the child's nearsightedness.
 EXAMPLE As ice cream sales increase, so does the number of drowning deaths. Suppose you come
up with this conclusion: "Eating ice cream causes drowning."
So, should you not go swimming after eating ice cream because it's dangerous for you? Well, not
really.
Both the variables of ice cream sales and drownings just happen to increase with higher temperatures:
As the summer months go on, more people consume ice cream because it's warmer and they want to
cool off.
They also want to cool off by going to the beach and the pools in the summer.
A higher volume of people attending those beaches and pools will sadly cause the number of people that
drown to go up, as well.
Just as in the case of the nearsightedness and sleeping with the light on, there's a lurking variable behind
the scenes causing the increase in both ice cream sales and drowning. It's not the increase in ice cream
sales that causes the drowning, nor does the drowning cause an increase in ice cream sales. They're both
increased by the higher temperatures.
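A small simulation can make the lurking variable concrete. The Python sketch below is a toy model under stated assumptions: the temperature range, the coefficients, the noise levels, and all variable names are invented, and the "drownings" figure is a toy index rather than a realistic count. In the model, temperature drives both variables and neither one appears in the other's formula.

```python
import random
from math import sqrt

rng = random.Random(0)  # fixed seed so the run is repeatable

# Each simulated day: temperature drives both of the other variables.
days = []
for _ in range(2000):
    temperature = rng.uniform(10, 35)                # the lurking variable
    ice_cream = 20 * temperature + rng.gauss(0, 30)  # depends only on temperature
    drownings = 0.3 * temperature + rng.gauss(0, 1)  # depends only on temperature
    days.append((temperature, ice_cream, drownings))


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_all = pearson_r([d[1] for d in days], [d[2] for d in days])

# Hold the lurking variable roughly fixed: only days between 24 and 26 degrees.
similar_days = [d for d in days if 24 <= d[0] < 26]
r_similar = pearson_r([d[1] for d in similar_days], [d[2] for d in similar_days])

print("all days:          r =", round(r_all, 2))
print("similar-temp days: r =", round(r_similar, 2))
```

Across all days the two variables are strongly correlated, but among days with nearly the same temperature most of that correlation disappears, which is what you would expect if temperature, not ice cream, is doing the work.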
1b. Reversed Association
Another reason that we cannot say there is causation with a strong correlation could be that the association is
reversed. If we don't know the direction of the cause-and-effect of two variables, we cannot say that it is a
causal relationship, only that they are strongly correlated.
 EXAMPLE As the number of firefighters at a fire increases, so does the damage the fire causes.
Suppose you come up with this conclusion: "Sending firefighters is counterproductive because they
only increase the size of the fire."
This is obviously a ludicrous conclusion to draw. In fact, the true association is just the other way
around. The association is reversed. It is a cause-and-effect relationship; however, it is a severe fire that
causes the firefighters to arrive, not the other way around.
 SUMMARY
Sometimes two variables will be related because one causes the other, whereas other times they will
be well-correlated, but the association isn't what we call "causal"; this is the difference between
correlation and causation. In many cases, there's a lurking variable--something behind the scenes
that's causing an increase or decrease in both variables, or maybe a decrease in one and an increase
in the other. Finally, sometimes there appears to be a relationship between two variables, but it is only
a coincidence. Thus, the most valid way to prove causation is with a controlled, randomized
experiment. However, strong evidence for causation can sometimes be obtained from an observational study.
Good luck!
 TERMS TO KNOW
Causation/Cause-and-Effect
A phenomenon whereby an increase in one variable directly leads to an increase or decrease in another
variable.
Correlation
A statistic which measures the strength and direction of the linear association between two quantitative
variables.
Establishing Causality
by Sophia
 WHAT'S COVERED
This tutorial will explain guidelines for establishing causality. Our discussion breaks down as follows:
1. Causality
2. Levels of Confidence
1. Causality
You might recall that causality is a cause-and-effect relationship between variables.
Sometimes you may want to determine whether two variables are well-correlated due to cause and effect.
The best way to do it is with a controlled experiment, but sometimes you cannot do a controlled experiment.
Perhaps you have to do an observational study due to ethical or practical concerns.
How can you prove cause and effect under those circumstances? It's still possible, though very difficult, to
prove cause and effect with a study that isn't an experiment.
The study will need to meet these five criteria:
Criteria and Questions to Ask
Consistency: You need to look for cases when correlation remains while other factors vary.
Questions to ask: Does the association remain even when other variables are allowed to vary? Does this work across different races and genders? Do high amounts of the alleged cause lead to high or low amounts of the alleged effect?
Control: You need something similar to a control. It's not exactly using a control group, but it's similar to what you would do if you had done an experiment. This is essentially like splitting a group of volunteers into two groups and having a treatment group and a control group. Although you're not assigning them that way, you're looking for the same thing.
Questions to ask: Is the effect absent when the cause is absent? Is the effect present when the cause is present?
Correlation: You need to look for evidence that larger amounts of suspected cause produce a larger effect.
Questions to ask: Does an increase in the cause correspond to an increase, or hypothetically a decrease, in the effect?
Consideration of Alternatives: You need to check for other possible causes.
Questions to ask: Is there something else, perhaps some lurking variable, that you're missing? Might there be other plausible causes?
Connection: You need to try to determine the physical mechanism for the cause and effect.
Questions to ask: What physically might create this effect? What is the physical mechanism behind the effect, and how could the cause plausibly lead to it?
These are pretty strict requirements. They are necessary in order to determine, without an experiment,
whether or not two correlated variables are going to be cause-and-effect related.
⚙ THINK ABOUT IT
Consider the following claims and determine if you can establish causality:
Claim:
"Eating a lot of carbohydrates makes you gain weight."
Consistency ✔
Is this consistent across different races, genders, etc.?
More or less, this claim is consistent.
Control ✘
Is the effect present when the cause is present? Do people who eat lots of carbohydrates gain weight?
You can see a lot of people that eat lots of carbohydrates and don't gain a lot of weight.
Correlation ✔
Does an increase in the amount of carbohydrates increase the amount of weight gained?
All other things being constant, yes, more or less.
Consideration of Alternatives ✘
Is there anything else besides eating lots of carbohydrates that might make people gain weight?
It's possible that people that eat lots of carbohydrates don't exercise as much as people
that eat fewer carbohydrates. Maybe that's what is making those people gain weight. So since
we've considered alternatives and found them to be plausible, we're going to say that we can't say
that this is the only cause.
Connection ✔
Is eating lots of carbohydrates physically related to weight gain?
They are.
So, this claim almost passed, but it did not meet all of the criteria. So you can't say that this claim is 100%
true.
Claim:
"Smoking causes lung cancer."
Consistency ✔
Do you see higher lung cancer rates among smokers across different genders and races?
Yes, even across different countries. This is true worldwide.
Control ✔
Is the effect present when the cause is present? Do people who smoke tend to get lung cancer?
People can get lung cancer even if they don't smoke. But you see it in much higher rates with
people that do smoke, and much lower rates in people that don't smoke.
Correlation ✔
Do groups of people who smoke more have higher incidences of lung cancer than people that smoke less?
Yes, they do.
Consideration of Alternatives ✔
What else might be causing lung cancer?
It's possible that there's a genetic link that both causes people to smoke and predisposes them to
lung cancer. Although that is somewhat plausible, it isn't highly plausible. Considering the
alternatives, you can say that smoking is a more likely cause than genetics.
Connection ✔
Is there a scientifically understood physical connection between smoking and lung cancer?
Yes. There have been experiments using the tar in cigarettes on animals, and those animals have
developed cancerous tumors. So we understand the physical connection.
This passes all of the criteria so we can reasonably claim that smoking does cause lung cancer. Now,
smoking is not going to cause lung cancer in 100% of people. Not everyone who smokes is going to get
lung cancer. But we can say this is a large contributor.
 TERM TO KNOW
Causality
A cause-and-effect relationship between two variables.
2. Levels of Confidence
You can have different levels of confidence in the causation. You can have:
Possible cause, which means you can imagine a scenario where A causes B. One thing causes the other.
Probable cause, which means you're pretty sure that A causes B.
Cause beyond a reasonable doubt, which means that you cannot think of a scenario where the response,
B, could have been caused by anything other than A.
IN CONTEXT
Consider the criminal justice system.
Possible cause would be the case where someone becomes a suspect. There may be evidence to
suggest that this suspect committed some crime.
Probable cause would be the instance where the person actually gets arrested for the crime.
Cause beyond a reasonable doubt would be the part where the person is convicted in a court of
law.
 SUMMARY
The only way to prove causation definitively is with a controlled, randomized experiment.
However, by using a set of five very stringent criteria, you can reasonably conclude that there's a causal
link between two variables. Sometimes the alleged
causes don't hold up under the scrutiny, but we can be reasonably confident of the ones that do. For this reason, we
can describe levels of confidence in our causation.
Good luck!
 TERMS TO KNOW
Causality
A cause-and-effect relationship between two variables.
Best-Fit Line and Regression Line
by Sophia
 WHAT'S COVERED
This tutorial is going to cover what the best-fit line is used for. Our discussion breaks down as follows:
1. Best-Fit Line/Regression Line

2. Features of a Best-Fit Line
3. Uses for a Best-Fit Line
1. Best-Fit Line/Regression Line

Imagine a line going through a pack of points. That line is going to be called a best-fit line, or a trend line, or a
regression line. The idea of a line of best fit is that it will roughly approximate what's going on with the data in
the form of a single line.
 EXAMPLE Suppose we have the following scatterplot.
One easy way to visually draw a best-fit line is to first place an oval over the top of your points.
The oval can be cut along its shorter, minor axis, which is essentially cutting it the hamburger
way, or along its longer, major axis, which is typically called the hot dog way. You're going
to cut it the longer way, which is a fairly good approximation of a line of best fit.
Roughly half the points fall above the line and half fall below it. In this particular example, about five of them are fairly
near the line, three are substantially below, and three are substantially above.
⭐ BIG IDEA
The term "best-fit line" can be used interchangeably with the terms "trend line" and "regression line".
 TERM TO KNOW
Best-Fit Line/Trend Line/Regression Line

A line that closely approximates the response values for given explanatory values when the form of
the scatterplot is linear.
2. Features of a Best-Fit Line

A good best-fit line will have the following features:
Roughly half the points above the line and half below it
No pattern to how the points are "off" from the line
 EXAMPLE This is a poor choice of a trend line. It does not cut the "oval" the long way, and there is
a pattern to how the points are above or below the line. With this trend line, any point below the line is
off to the right, and any point above the line is off to the left.
 EXAMPLE Below is a better trend line, because the points that are above and below are
peppered throughout. You don't want a pattern to how the points are off from the line.
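Both features of a good best-fit line can be checked with a few lines of Python. The sketch below rests on assumptions that should be read as such: it borrows the points of Table 1 from "Cautions about Correlation" as stand-in data, tries the candidate line y-hat = 300 + 50x, and compares it with a deliberately poor horizontal line at y = 750. The helper names and the idea of comparing the average x of the points above and below the line are illustrative choices, not a standard test.

```python
# Stand-in data: (x, y) pairs borrowed from Table 1 in "Cautions about Correlation".
points = [(10, 804), (8, 695), (13, 758), (9, 881), (11, 833), (14, 996),
          (6, 724), (4, 426), (12, 1084), (7, 482), (5, 568)]


def average(values):
    """Mean of a list, or None if the list is empty."""
    return sum(values) / len(values) if values else None


def check_line(points, intercept, slope):
    """Count points above and below a line and note where along x they sit."""
    xs_above = [x for x, y in points if y > intercept + slope * x]
    xs_below = [x for x, y in points if y < intercept + slope * x]
    return {"above": len(xs_above), "below": len(xs_below),
            "mean_x_above": average(xs_above), "mean_x_below": average(xs_below)}


candidate = check_line(points, 300, 50)  # assumed trend line: y-hat = 300 + 50x
flat = check_line(points, 750, 0)        # assumed poor line: horizontal at y = 750

print("candidate line:", candidate)
print("flat line:     ", flat)
```

Both lines split the points roughly in half, so the first feature alone is not enough. The second feature separates them: for the candidate line the points above and below are spread across similar x-values, while for the flat line every point below sits to the left and every point above sits to the right.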
3. Uses for a Best-Fit Line

What is a trend line used for? A line of best fit is used to give approximate values of y for given values of x,
even in places where there is an existing value of y.
 EXAMPLE In the scatterplot below, when x equals 6, there's a difference between the actual value
of y at 6, and what the line predicts as the value of y at 6.
 TRY IT
You can use this line to predict other values. What does this best-fit line predict for y if x was 14?
Go over to 14 and then up until you get to the best-fit line. Then go over to the y-axis to figure out how high it
is at that point. It's at about 960. You can say that the prediction for x being 14 is y being 960.
 SUMMARY
A line of best fit helps you understand the general trend of what is occurring in a scatterplot: how the y
values relate to the x values. A good trend line will cut down the middle and have a peppering of
points above and below it. The points will be a random scatter around the line, with no systematic pattern to how they miss it.
Good luck!
 TERMS TO KNOW

Best-Fit Line/Trend Line/Regression Line
A line that closely approximates the response values for given explanatory values when the form of the
scatterplot is linear.
Linear Equation Algebra Review
by Sophia
 WHAT'S COVERED
This tutorial will review the basic algebra of linear equations. Our discussion breaks down as follows:
1. Linear Equations
a. Slope and Intercept in Practice
b. Slope and Intercept in Statistics
2. Using a Line to Obtain Points
3. Using Two Points to Obtain a Line
4. Establishing Variables
1. Linear Equations
Lines are fit to data, and figuring out the equations of those lines is the focus of this tutorial. You may recall the
following linear equation:
 FORMULA
Slope-Intercept Form of a Linear Equation
y = mx + b
In the above equation, y and x are variables. Now, x is recognized as the explanatory variable and y as the
response variable. The other two values are numbers, and they represent something.
The value of m is called the slope. The slope is a rate of change. You may have heard several terms for rates of
change, like miles per hour or meters per second or miles per gallon in a car. In general, an increase of 1 in
x corresponds to an increase or decrease of m in y.
 EXAMPLE If the rate of change was 30 miles per gallon, that means that an increase of one gallon
would correspond to an increase of 30 miles that you could travel.
The slope is calculated by taking the difference in y values divided by the difference in x values.
 FORMULA
Slope
m = (y2 - y1) / (x2 - x1)
The other value, b, in the equation is called the y-intercept. It's the value of y when x is 0. So the line will pass
through the point (0, b) on the y-axis.
 TERMS TO KNOW
Slope
The rate of change relating the increase or decrease in y to an increase of 1 in x.
y-intercept
The value of y when x = 0.
1a. Slope and Intercept in Practice

Let's show these terms in practice, in the graph below.
To find the slope, find points that are on the line, and figure out how much
the line went up vertically with an increase of 1 in x. Let's consider how much it went up between 1 and 2 on the
x-axis.
As the graph shows an increase of 1 in x, there was an increase of 2 in y. So the slope, which explains the rate
of change that relates the increase in y to an increase of 1 in x, would be 2.
Also, the y-intercept is the point where the line passes through the y-axis, or at (0,1). So the value of y when x
is 0 is 1, which would be b in the linear equation.
1b. Slope and Intercept in Statistics

Below is the general linear equation compared to the formula used in statistics.
 FORMULA
General linear equation: y = mx + b
Statistics form: ŷ = b0 + b1x
You may notice that the order is flipped, but it is still telling us the same information and can be used to find
the best-fit line.
Instead of m, the slope becomes b1

Instead of b, the y-intercept becomes b0
Instead of y, we now have what is called y hat.
The variable y-hat is the notation for the prediction. There are values of y that are not predictions; they're
actual data points. But because we're doing a best fit, this is our best guess as the value of y--it's a prediction.
Anything with a hat is called a prediction.
2. Using a Line to Obtain Points

Suppose that you have a trend line and the equation is:
We can use this equation to find the predicted y-coordinate, y-hat, when x equals 20. To solve this
algebraically, all you have to do is substitute 20 in for x in the given equation:
3. Using Two Points to Obtain a Line
Suppose we don't know the equation of a trend line, but we know that it passes through (4, 500) and (12, 900).
To find the equation of that line, two pieces of information are needed: the slope and the y-intercept.
First, find the slope. From (4, 500) to (12, 900), the line goes up 8 in the x-direction (12 - 4) and up 400 in
the y-direction (900 - 500).
Recall that slope is the difference in y values divided by the difference in x values. So a change of 400 in the
y-direction divided by a change of 8 in the x-direction means that for every increase of 1 in the x-direction, the
line goes up 400 divided by 8, or 50, in the y-direction. Therefore, the slope, b1, is 50.
To figure out the y-intercept, plug any (x, y) pair that is on the line into the equation, for this example, (12, 900).
Put 12 temporarily in for x and 900 temporarily in for y-hat. We also know the slope, 50, so we can plug this
value in for b1.
Substituting gives 900 = b0 + 50(12), or 900 = b0 + 600. This tells us that the y-intercept, b0, is 300. Plug
this value and the slope into the linear equation formula to get:
ŷ = 300 + 50x
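The two-point steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name line_through is made up for this example, and the code simply repeats the algebra from the text (find the slope first, then solve for b0).

```python
def line_through(p1, p2):
    """Return (b0, b1) for the line y-hat = b0 + b1*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no defined slope")
    b1 = (y2 - y1) / (x2 - x1)  # slope: change in y over change in x
    b0 = y1 - b1 * x1           # intercept: solve y1 = b0 + b1*x1 for b0
    return b0, b1


b0, b1 = line_through((4, 500), (12, 900))
print(f"y-hat = {b0:g} + {b1:g}x")
```

Either point can be used to solve for b0; plugging in (4, 500) instead of (12, 900) gives the same intercept.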
4. Establishing Variables
One thing that's important to note is that the best-fit line will change if you switch the explanatory and
response variables. That's why it's important to choose at the beginning which one is the explanatory versus
which one is the response variable.
 EXAMPLE Slope is a rate of change. So if you take a look below, miles per gallon would be the
rate of change in the example on the right. If you switched to put gallons on the y-axis and miles on the
x-axis, the rate of change here would actually be measuring gallons per mile, which is a different
number.
If a car is getting an average of 20 miles per gallon, that works out to only one-twentieth of a gallon per
mile. It's a different line.
One thing that's important to note, though, is that the value of the correlation coefficient is going to be the
same for each of these two graphs, but the line itself is different, and that's why we need to choose which
variable is the explanatory versus which one is the response.
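To see this with numbers, here is a small Python sketch. It borrows the miles and airfare data from the Finding the Least-Squares Line tutorial later in this unit, along with that tutorial's slope formula (correlation times the ratio of standard deviations). The helper names correlation and slope are made up for this illustration.

```python
from statistics import mean, stdev

miles = [1266, 1294, 407, 834, 611]
airfare = [263, 306, 128, 212, 261]


def correlation(xs, ys):
    """Sample correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))


def slope(explanatory, response):
    """Slope of the best-fit line: r * s_response / s_explanatory."""
    r = correlation(explanatory, response)
    return r * stdev(response) / stdev(explanatory)


# Swapping the variables leaves r unchanged but changes the slope.
print("r, miles explaining airfare:", round(correlation(miles, airfare), 3))
print("r, airfare explaining miles:", round(correlation(airfare, miles), 3))
print("slope in dollars per mile:", round(slope(miles, airfare), 3))
print("slope in miles per dollar:", round(slope(airfare, miles), 3))
```

Notice that the second slope is not simply 1 divided by the first. The two lines are genuinely different, which is why the choice of explanatory variable has to be made up front.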
 SUMMARY
Fitting a line to data points on the scatter plot requires a bit of algebraic savvy. There are two parts to
the equation of a line: the slope and the y-intercept. Working with linear equations means finding those
two values. The slope is a rate of change. It's how quickly the response variable y
changes when the explanatory variable x increases by 1. The y-intercept is the value of y when x
equals 0. In actual practice, you will want to attach variable names and units to these, not just
x's and y's.
Good luck!
 TERMS TO KNOW
Slope
y-intercept
Residuals
by Sophia
 WHAT'S COVERED
This tutorial will cover the topic of residuals, which occur when you fit a line to data points. Our
discussion breaks down as follows:
1. Residuals
2. Residual Plots
1. Residuals
When you create a best-fit line, typically it doesn't pass through all the points. The only way it would pass
through all the points is if the correlation were exactly 1 or -1, which means that all the points lie exactly on a line.
Most of the time, they don't lie exactly on a line. In that case, most of the points are going to have some
difference between what the line predicts and the value that they actually are. Because the line shows
predictions, they'll be off from the actual values, even if only by a little.
A residual is the amount by which the predictions are off from the actual amount.
 EXAMPLE The scatterplot below shows the 1992 payrolls for the National Football League for their
quarterback, who's usually their most expensive player, and for the entire team.
The best-fit line shows the predicted payroll of a team if the quarterback makes a certain amount of money.
The predicted payroll (payroll-hat) is equal to $18.8 million plus 3 times the quarterback salary (QB). The
equation of this line, with both amounts in millions of dollars, is:
payroll-hat = 18.8 + 3(QB)
Let's consider the Dallas Cowboys, circled in the scatterplot.
They pay their quarterback $1.75 million, and they pay the overall team $28.394 million, which is well above
what the line would predict for a team that pays its quarterback that amount of money. To find the
predicted value, we can use the best-fit line equation, plug in the value of the quarterback salary for the
Cowboys, and solve for the team payroll:
payroll-hat = 18.8 + 3(1.75) = 18.8 + 5.25 = 24.05
We would predict that if a team pays its quarterback $1.75 million, its team payroll would be $24.05
million. We can also look at this visually to confirm this predicted value.
However, when you look at the Dallas Cowboys' data, the actual payroll is $28.394 million. That's over $4
million more than the line would have predicted their payroll to be. This vertical distance between the
$28.394 million that is actually being paid versus the $24.05 million that's being predicted is called the
residual between those two values.
The residual is calculated by taking the actual response value, y, minus the predicted response value, y-hat.
 FORMULA
Residual
Residual = y - ŷ
So in this case, the residual for the Dallas Cowboys is calculated by taking the actual team payroll minus the
predicted team payroll:
Residual = 28.394 - 24.05 = 4.344
In this particular problem, the residual for the Dallas Cowboys ends up being $4.344 million. This is a positive
number.
Every point has a residual value:
If the actual response falls above the best-fit line, meaning the actual response is higher than the
predicted response, the residual value is positive.
If the actual response falls below the best-fit line, meaning the actual response is lower than the
predicted response, the residual value is negative.
If by some chance the point falls on the line, the residual value is zero.
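The residual calculation and its sign rule can be written out as a short Python sketch. The function names here are hypothetical and chosen only for illustration; the line and the Cowboys' figures come from the example above, in millions of dollars.

```python
def predicted_payroll(qb_salary):
    """Best-fit line from the example: payroll-hat = 18.8 + 3 * QB."""
    return 18.8 + 3 * qb_salary


def residual(actual, predicted):
    """Residual = actual response minus predicted response."""
    return actual - predicted


def position(res, tolerance=1e-9):
    """Describe where a point sits relative to the best-fit line."""
    if res > tolerance:
        return "above the line"
    if res < -tolerance:
        return "below the line"
    return "on the line"


cowboys_predicted = predicted_payroll(1.75)
cowboys_residual = residual(28.394, cowboys_predicted)
print(f"predicted payroll: {cowboys_predicted:.2f} million")
print(f"residual: {cowboys_residual:.3f} million, {position(cowboys_residual)}")
```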
 TERM TO KNOW
Residual
The difference between the actual value of the response variable for a particular data point and its
predicted value from the regression line.
2. Residual Plots
Since every point has a residual value, you can actually plot the explanatory variable vs. the residual value, as
opposed to the explanatory variable vs. the response variable.
Scatterplot: Explanatory vs. Response
Residual Plot: Explanatory vs. Residual
The second graph, where you see how far off the predictions are, is called a residual plot. A residual plot is
quite useful because it can help you evaluate whether or not a line is actually a useful predictor for the data.
A good linear model will have:
Points scattered randomly above and below the line
No curved pattern in the residuals
Equal variability throughout the entire residual plot
Good Example of Linear Model
This is a good choice for a best-fit line.
The points above and below the line are in random scatters, there is no curved pattern in the residual plot,
and there is equal variability throughout the entire residual plot.
Bad Example: Does Not Have Random Scatter
This is a bad choice for a best-fit line.
Although it has residuals both above and below the line, they are not randomly scattered like those in the
good example. There is a clear pattern in the residual plot. This one has points that are below only on the
left, and points that are above only on the right. That's what makes this line a poor choice for a line of best fit.
Bad Example: Has a Curved Pattern in Residual
In this case, a line doesn't make sense as a model at all. You can verify that from the residual plot. What you
see is a curved pattern in the residual plot, which also means that the scatter is not random. What a curved
pattern in the residual plot implies is that there is a better fit than a line for your data.
Bad Example: Unequal Variability
This residual plot shows a sort of trumpet pattern, where the variability gets wider. The line is a good fit at
the beginning because the residuals are small, but it's a poor fit at the end, where the residuals are getting
larger. You can also see this in the scatter plot. Some points sit close to the line and fit it well, while
others sit far from it.
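To see how a curved pattern shows up in the residuals, here is a minimal Python sketch. The data set is hypothetical (y equals x squared, so a line is clearly the wrong model), and fit_line is a made-up helper. It computes the slope as the sum of cross-deviations divided by the sum of squared x-deviations, which is algebraically the same as the correlation-based slope formula used elsewhere in this unit.

```python
def fit_line(xs, ys):
    """Return (b0, b1) of the least-squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # the line passes through (x-bar, y-bar)
    return b0, b1


# Hypothetical curved data: y = x squared
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = "".join("+" if r > 0 else "-" if r < 0 else "0" for r in residuals)

print("residuals, left to right:", residuals)
print("sign pattern:", signs)
```

The residuals are positive at both ends and negative in the middle. That U shape, rather than a random mix of signs, is the curved pattern that tells you a line is not the right model.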
 TERM TO KNOW
Residual Plot
A scatter plot that plots Residuals vs. explanatory variable, as opposed to response variable vs.
explanatory variable. It can be used to assess the fit of a line.
 SUMMARY
Residuals measure how far the data points fall from the line of best fit. They're positive if a point
lies above the line, negative if it falls below the line, and zero if it falls on the line. You can use the
resulting residual plot to determine if a line is actually an effective model for predicting the data.
Good luck!
 TERMS TO KNOW
Residual
The difference between the actual value of the response variable for a particular data point and its
predicted value from the regression line.
Residual Plot
A scatter plot that plots Residuals vs. explanatory variable, as opposed to response variable vs.
explanatory variable. It can be used to assess the fit of a line.
Least-Squares Line
by Sophia
 WHAT'S COVERED
In this tutorial, you're going to learn about how to find a line of best fit using the method of least
squares. Our discussion breaks down as follows:
1. Least-Squares Line
1. Least-Squares Line
When you look at data on a scatterplot, there are lots of lines that provide good fits for the data. You can
usually eyeball them. In fact, there are many criteria by which you can create what's called a best-fit line.
The least-squares line is one of the most common types of best-fit line and focuses on the residuals. Recall
that residuals are the vertical differences between the actual values and the predicted values on the scatterplot.
You can use Excel or other statistical software to create a least-squares line for a set of data. The least-
squares line is calculated by minimizing the sum of the squares of the vertical differences from the line of best
fit to each point. We will cover how to calculate this by hand in a later tutorial.
 EXAMPLE This data shows the price, in cents per pound, of different types of seafood in 1970 and in 1980.
Unsurprisingly, the most expensive ones in 1970 were still the most expensive in 1980. The trend is linear.
Draw a regression line and test how good a fit it is to the rest of the data.
Good Fit:
Suppose we draw the following regression line.
The equation of this line says that the predicted 1980 price equals three times the 1970 price:
Price 1980-hat = 3 × (Price 1970)
Note that the hat symbol on Price 1980 indicates that it is the value being predicted.
If you take a look at that line, it seems fair. There are a couple of points that are noticeably lower than the
line, but regardless of the line we could draw, we will end up with residuals. So what makes it a good fit?
Every point has a residual. If it lies exactly on the line, its residual is zero. There are three points not on this
line that have fairly large negative residuals. So, if you look at all the residuals put together, the sum of the
residuals is negative 189.
You want the sum of the residuals to be low because that would mean that the points are close to the
line. Let's create a line that we know for sure is a worse fit and compare the values of those residuals to
the value of negative 189 to see if this first example is really a good fit.
Bad Fit:
Suppose we now draw this regression line.
This regression line predicts that every 1980 price will be 109.8. Regardless of whether an item was
really cheap or really expensive in 1970, it is predicted to cost 109.8 cents per pound in 1980.
Now, that's a bad idea. This line is a poor fit. You can see that many points lie far above or far below the
line. It's not a good fit; it doesn't go through the pack.
You can see visually there are some very large residuals here. But the problem is, when you add up all the
residuals, it actually equals zero. What happened here?
Well, there are some large positive residuals that are canceled out by several fairly large negative
residuals. They end up canceling each other out so that even though this flat line is a
poor fit for the data, the sum of the residuals is equal to zero. However, the first model
was clearly a better fit than the second model. So how can that be reconciled?
Instead of minimizing the sum of the residuals, you will use the method of least squares, which involves
minimizing the sum of the squares of the residuals. What that means is the negative residuals, when you
square them, become positive. The positive ones, when you square them, also become positive so that
this negates the effect of having positive and negative residuals that might cancel each other out.
Now check which line is a better fit, using the method of least squares.
Type of Fit    Regression Line                      Sum of Squares of Residuals
Good Fit       Price 1980-hat = 3 × (Price 1970)    13,519
Bad Fit        Price 1980-hat = 109.8               143,838
Sure enough, 13,519 is a lot smaller than 143,838, indicating that the first line is a better fit for the data than
the second line.
Best Fit:
The first line is not even the best fit for the data. When calculated by the method of least squares, the best-fit
line is the predicted 1980 price equal to 2.7 times the 1970 price, minus 1.2 cents per pound:
Price 1980-hat = 2.7 × (Price 1970) - 1.2
In this case, with this line being the model, the sum of the squares of residuals is 9,326, which is even
better than the 13,519. 9,326 is the smallest that the sum of squares can be, which makes this line the
best-fit line.
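The seafood prices are not listed in the text, so the Python sketch below illustrates the same idea with the miles and airfare data from the Finding the Least-Squares Line tutorial later in this unit. It compares a flat line at the mean airfare ($234) with that tutorial's fitted line, airfare-hat = 113.11 + 0.137(miles). The variable names are made up for illustration.

```python
miles = [1266, 1294, 407, 834, 611]
airfare = [263, 306, 128, 212, 261]


def residuals(predict):
    """Actual airfare minus predicted airfare for every destination."""
    return [y - predict(x) for x, y in zip(miles, airfare)]


def sum_of_squares(values):
    return sum(v * v for v in values)


# A flat line that predicts the mean airfare no matter the distance
flat_residuals = residuals(lambda x: 234)
# The fitted line from the later tutorial
line_residuals = residuals(lambda x: 113.11 + 0.137 * x)

print("flat line:   sum =", sum(flat_residuals),
      " sum of squares =", sum_of_squares(flat_residuals))
print("fitted line: sum =", round(sum(line_residuals), 2),
      " sum of squares =", round(sum_of_squares(line_residuals)))
```

Both sums of residuals come out at or very near zero, so the plain sum cannot tell the two lines apart. The sum of squares can: it is far smaller for the fitted line.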
 TERM TO KNOW
Least-Squares Line
A best-fit line that is found through a process of minimizing the sum of the squared residuals.
 SUMMARY
The method of least squares requires that you minimize the sum of the squares of the residuals. The
line that does this is the best-fit line. It is called the least-squares line or the least-squares regression
line.
Good luck!
 TERMS TO KNOW
Least-squares Line
The regression line for which the sum of the squares of the residuals is the smallest.
Finding the Least-Squares Line
by Sophia
 WHAT'S COVERED
This tutorial is going to teach you how to find the least-squares line of a data set. Our discussion
breaks down as follows:
1. Discussing the Least-Squares Line

2. Calculating the Least-Squares Line
1. Discussing the Least-Squares Line

Recall that the least-squares line is a best-fit line that is found through a process of minimizing the sum of the
squared residuals. The general form for a least-squares equation is:
ŷ = b0 + b1x
In this equation, b0 is the y-intercept and b1 is the slope.
For a given data set, the least-squares line will always pass through the point (x̅, ȳ), where x-bar (x̅) is the
mean of the explanatory data and y-bar (ȳ) is the mean of the response data.
The slope can be found using the following formula:
 FORMULA
Slope of Least-Squares Line
b1 = r × (sy / sx)
The slope, b1, is found by multiplying the correlation coefficient, r, by the ratio of the standard deviation of
the y-data, sy, to the standard deviation of the x-data, sx.
We can use these pieces of information to find the y-intercept and then create the least-squares line equation.
 TERMS TO KNOW
Least-Squares Line
A best-fit line that is found through a process of minimizing the sum of the squared residuals
X-bar (x̅)
The average x value for a sample
Y-bar (ȳ)
The average y value for a sample
2. Calculating the Least-Squares Line

Look at airfare prices for certain destinations from the Minneapolis/St. Paul Airport. Boston is 1,266 miles from
St. Paul, and it has an airfare of $263, and so forth.
Destination Miles Airfare ($)
Boston 1,266 263
Charleston 1,294 306
Chicago 407 128
Denver 834 212
Detroit 611 261
The scatter plot for this data looks like this:
We need to find a least-squares line that incorporates this data. The explanatory variable, x, will be miles and
the predicted response variable, y-hat, will be airfare, so we can write the following equation to start:
airfare-hat = b0 + b1(miles)
So we need to find the slope and the y-intercept. To begin, we can use Excel to calculate the mean and
standard deviation of both the x and y data. Type the data into an Excel spreadsheet and use the function
"=AVERAGE" to calculate the mean and "=STDEV.S" to calculate the standard deviation.
Miles Airfare
1,266 263
1,294 306
407 128
834 212
611 261
Mean 882.4 234
Std. Dev. 393 68
Correlation r = 0.794
In this scenario, miles is the explanatory x-variable, and airfare is the response y-variable. So the average
miles, x̅, is 882.4 with a standard deviation, sx, of 393. The mean airfare, ȳ, is $234 per ticket with a standard
deviation, sy, of $68.
We can also find the correlation easily with Excel. Use the function "=CORREL", highlight both the x- and y-
data, and find the correlation coefficient of 0.794.
The slope of the line is equal to the correlation times the standard deviation of the response y-values, over the
standard deviation of the explanatory x-values. Since you have these three values, all you have to do is plug
them into the slope formula:
b1 = r × (sy / sx) = 0.794 × (68 / 393) ≈ 0.137
The slope is 0.794 times 68 over 393, which is about 0.137. So, what is that 0.137? It's the
change in y, airfare in dollars, for an increase of one mile in x. It's about 13.7 cents per mile.
Going back to the equation of the best-fit line, we still need to find the remaining information. We just found
that the slope, b1, is $0.137 per mile. We still need to find the y-intercept, b0. We don't know this value;
however, we do know one point on the line, (x̅, ȳ): the average number of miles, x̅, and the average
airfare, ȳ. Airfare is predicted to be $234 when the distance is 882.4 miles. Substitute this information into
the equation and solve for the y-intercept, b0:
234 = b0 + 0.137(882.4)
b0 = 234 - 120.89 = 113.11
You get 113.11 for b0, so put that together with the slope to create the least-squares line:
airfare-hat = 113.11 + 0.137(miles)
Once it is graphed, it does appear to go right through the pack of points like it's supposed to.
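If you would rather check the hand calculation in code than in Excel, the steps above can be sketched in Python using only the standard library. The helper name correlation is made up for this illustration; statistics.stdev is the sample standard deviation, which matches Excel's STDEV.S.

```python
from statistics import mean, stdev

miles = [1266, 1294, 407, 834, 611]
airfare = [263, 306, 128, 212, 261]


def correlation(xs, ys):
    """Sample correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))


x_bar, y_bar = mean(miles), mean(airfare)
s_x, s_y = stdev(miles), stdev(airfare)
r = correlation(miles, airfare)

b1 = r * s_y / s_x       # slope: r times the ratio of standard deviations
b0 = y_bar - b1 * x_bar  # intercept: the line passes through (x-bar, y-bar)

print(f"x-bar = {x_bar}, y-bar = {y_bar}")
print(f"s_x = {s_x:.0f}, s_y = {s_y:.0f}, r = {r:.3f}")
print(f"airfare-hat = {b0:.2f} + {b1:.3f}(miles)")

# The tutorial rounds the slope to 0.137 before solving for the intercept.
b0_rounded = y_bar - 0.137 * x_bar
print(f"intercept using the rounded slope: {b0_rounded:.2f}")
```

Because the code carries full precision, its intercept comes out near 112.90 rather than 113.11. The difference comes entirely from rounding the slope to 0.137 before solving for b0, as the last two lines show.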
 TRY IT
You can also use a spreadsheet. In Excel, the easiest way to do this is to highlight your data and create a
chart that is a scatter plot. When you do this, you have to actually right-click or control-click onto the data
points themselves so that they are highlighted. Click "Add Trendline." Under Options, click "Display
Equation." Excel fits the same least-squares line, although its coefficients may differ slightly from a hand
calculation that rounds along the way. Don't get too frustrated because technology can rescue you
here. Especially for larger data sets, finding this by hand can be difficult.
 SUMMARY
Calculation of the least-squares line involves two key facts: first, the point (x-bar, y-bar), made up of the
mean of the explanatory variable and the mean of the response variable, is a point on the line; and second,
the slope can be calculated from the correlation and the two standard deviations. You learned what
the least-squares line is, and you used the means, the standard deviations, and the
correlation to calculate it.
Good luck!
 TERMS TO KNOW
Least-Squares Regression Line

The line of best fit, according to the method of Least-Squares.
x-bar
The mean of the explanatory variable.
y-bar
The mean of the response variable.
Interpreting Intercept and Slope
by Sophia
 WHAT'S COVERED
This tutorial will explain interpreting the intercept and slope of a regression line. Our discussion
breaks down as follows:
1. Interpreting Slope
2. Interpreting Y-Intercept
1. Interpreting Slope
A slope is a rate of change. You've talked about rates of change a lot in everyday life.
Rates of Change      Application
Miles Per Hour       If a car is driving at 50 miles per hour, how much farther can you go by driving one
                     additional hour? 50 miles per hour means 50 miles every hour. So, one additional hour
                     means 50 additional miles.
Dollars Per Pound    If a fisherman is paid $4.04 per pound for sea scallops, how much more money does the
                     fisherman stand to make by catching one additional pound of scallops? At $4.04 per
                     pound, every additional pound means an additional $4.04.
In the regression line formula, the slope is represented by b1, the coefficient that multiplies the explanatory
variable, x. It gives the average rate of change:
ŷ = b0 + b1x
On a scatterplot, you know that there are actual data points on there, so you can calculate the rates of change
between two data points. However, you’re only interested in the average over all the points. That is the
average increase or decrease in the response variable that corresponds to an increase of one in the
explanatory variable.
 EXAMPLE Suppose you had the following equation where distance-hat equals 15 times time,
where time is in hours and distance is in miles:
distance-hat = 15(time)
The 15 in the equation would be 15 miles per hour. But because this is a regression equation, the 15 is an
average speed. There's no guarantee that you go 15 miles every single hour. What this indicates is that,
on average, for each additional hour in time, the distance increases by 15 miles.
 EXAMPLE This chart details the miles and airfare from the Minneapolis/Saint Paul airport to various
destinations.
Destination Miles Airfare ($)
Boston 1,266 263
Charleston 1,294 306
Chicago 407 128
Denver 834 212
Detroit 611 261

The equation of the regression line says that airfare-hat, which is the predicted airfare, equals 113.11 plus
0.137 times miles:
airfare-hat = 113.11 + 0.137(miles)
What do those values mean?

The 0.137 that multiplies miles is the slope, which is the rate of change. It's how quickly the airfare changes
if you increase the miles by one. So, for each additional mile, the airfare is predicted to increase by 0.137
dollars, or 13.7 cents.
⭐ BIG IDEA
There are a couple of important ideas to note as you interpret these values:
It's for every additional mile. You can't leave the word "additional" out. You can't just say "for every mile,"
because the line doesn't start at zero miles costing zero dollars.
You have to say it's a predicted increase. Using airfare-hat, we're not figuring out actual airfares.
Remember this is an average. We're using it to predict the additional airfare for each mile. It's not a hard
and fast rule, and that's why it can be used to predict airfare but not to actually assign airfares.
Lastly, we're using units--miles and dollars. You can't say airfare increases by 0.137. You must specify the
unit assigned to 0.137. It's 0.137 dollars. For each additional mile, it is increased by this many dollars.
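A quick way to convince yourself that the slope is the predicted change per additional mile, whatever distance you start from, is to compare two predictions one mile apart. The function names in this Python sketch are made up for illustration; the equation is the regression line from the example above.

```python
def predicted_airfare(miles):
    """Regression line from the example: airfare-hat = 113.11 + 0.137 * miles."""
    return 113.11 + 0.137 * miles


def extra_cost_of_one_more_mile(start_miles):
    """Predicted increase in airfare, in dollars, for one additional mile."""
    return predicted_airfare(start_miles + 1) - predicted_airfare(start_miles)


for start in (500, 1000):
    increase = extra_cost_of_one_more_mile(start)
    print(f"from {start} to {start + 1} miles: +{increase:.3f} dollars (predicted)")
```

The increase is the same 0.137 dollars at 500 miles and at 1,000 miles. That is what makes it a slope rather than a price per mile of the whole trip.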
 TERM TO KNOW
Slope of Regression Line

The amount y changes (on average) for a one unit increase in x.
2. Interpreting Y-Intercept
The y-intercept of a regression line is the expected value of y if x is equal to 0. In the
regression line formula, the y-intercept is represented by b0:
ŷ = b0 + b1x
Sometimes the y-intercept does not have a sensible interpretation. It could be that a value of x equal to 0
is unreasonable, or that the x-values in your data set are far from 0.
 EXAMPLE Let's consider the same equation from above:
airfare-hat = 113.11 + 0.137(miles)
The y-intercept here is 113.11. The y-intercept is the value of y, which is the response variable (airfare),
when the value of x, which is the explanatory variable (miles), is zero. So when a flight is zero miles,
the airfare is predicted to be $113.11.
Remember, you're talking about an ordered pair. So, it's zero miles and $113.11. You need to have both
those numbers in there because this number really corresponds to an ordered pair on the graph.
Secondly, just like how the slope was a prediction, the y-intercept is also a prediction. Now, it's not a
meaningful prediction, but it's a prediction. It's because this line is a prediction line; it's a best-fit line. It's
not actually finding airfares for us.
 HINT
Make sure you include units.

Here, the y-intercept didn't make a lot of contextual sense. You wouldn't buy a ticket for $113.11 just to go
nowhere. The reason has to do with the range of miles values for which this line is an appropriate airfare
guess.
What we have here is miles values from 407 up to almost 1,300 miles. This means that we can use this line
within that range of 407 to 1,294 to make reasonable predictions on airfare. If we wanted to go to San
Antonio, for instance, we could certainly use this line to do that, because the distance from Minneapolis to
San Antonio is within this range. Therefore, it's reasonable to use this line to make predictions within this
400 to 1,300 range.
Destination Miles Airfare ($)
Boston 1,266 263
Charleston 1,294 306
Chicago 407 128
Denver 834 212
Detroit 611 261
Anything outside of that range might not be reasonable. Sometimes we need to acknowledge the fact that
perhaps the y-intercept isn't part of that reasonable prediction range, so it might not have much
contextual sense. It's still a good idea to know how to interpret it.
 TRY IT
Try this one on your own. Interpret the slope and y-intercept of this regression equation relating sodium
content (in milligrams) and calories for certain hot dogs:
sodium-hat = 160 + 2.5(calories)
You should have identified the 160 as the y-intercept, meaning if a hot dog had zero calories, then the
predicted sodium would be 160 milligrams.
The 2.5 is the slope, which means that for each additional calorie a hot dog has, the sodium content is
predicted to increase by 2.5 milligrams. Remember that 2.5 milligram per calorie increase is an average.
This is not a hard and fast rule.
 TERM TO KNOW
Y-Intercept of Regression Line

The expected y value when x = 0
 SUMMARY
The slope is a rate of change, and it explains how an increase in the explanatory variable affects the
response variable. The y-intercept shows you what's predicted for the response when the explanatory
value is zero. Sometimes, it doesn't have a meaningful interpretation because it falls outside the
reasonable predictions window, but it is still important to know how to interpret it.
Good luck!
 TERMS TO KNOW
Slope of a least-squares regression line

The amount by which the response variable increases or decreases, on average, when the explanatory
variable increases by one.
y-intercept of a least-squares regression line

The predicted value of the response variable when the explanatory variable is zero.
Multiple Regression
by Sophia
 WHAT'S COVERED
This tutorial will cover the topic of multiple regression. Our discussion breaks down as follows:
1. Multiple Regression
1. Multiple Regression
Multiple regression allows you to predict a response based on more than one explanatory variable,
provided those explanatory variables are independent of one another.
 EXAMPLE In many school districts, teacher salaries are dependent on two variables: years of
experience and number of postgraduate hours accumulated.
It's possible that a teacher with a lot of years of experience might not have a high number of postgrad
hours. It’s also possible that someone with a lot of postgrad hours doesn't have a whole lot of experience.
Consider the table below with those three variables--salary, years of experience, and postgrad hours--
listed for Mr. Backman, Mr. Jones, Ms. Nordstrom, Mr. Osters, and Ms. Williams.
Teacher Salary Years Hours
Backman 38,000 4 14
Jones 42,000 3 45
Nordstrom 59,000 10 55
Osters 44,000 6 28
Williams 48,000 5 39
We can use this information to come up with three different linear regression models:
Model A: Salary vs. Years
Model B: Salary vs. Hours
Model C: Salary vs. Both Years and Hours
Model A
Variables: Explanatory: Years; Response: Salary
Regression Line: salary-hat = 31,164 + 2,685(Years)
A starting salary for someone with no years of experience is $31,164. For every additional year that a person
works, they are predicted to make an additional $2,685 on average.
Coefficient of Determination (r2): If you look at the r-squared for this, it's fairly high at 0.83. It's clear
there's something of an association here between salary and years.
Model B
Variables: Explanatory: Hours; Response: Salary
Regression Line: salary-hat = 31,384 + 409(Hours)
A starting salary for someone with no postgrad hours is $31,384. For each additional postgrad hour, they are
predicted to make an additional $409 on average.
Coefficient of Determination (r2): The r-squared here isn't as high, so there's a little bit less of an
association between postgraduate hours and salary than the one with years.
Model C
Variables: Explanatory: Years and Hours; Response: Salary
Regression Line: salary-hat = 26,807 + 1,970(Years) + 231(Hours)
A starting salary for someone with no years of experience and no postgrad hours is $26,807. For every
additional year that a person works, they are predicted to make an additional $1,970 on average. For each
additional postgrad hour, they are predicted to make an additional $231 on average.
Coefficient of Determination (r2): The r-squared value is higher than either of the two individual linear
regressions.
For multiple regression, if those variables are independent, then you can do a regression on both
variables, like in Model C.
The predicted salary is made up of a constant, a coefficient times the number of
years that the teacher has worked, and a coefficient times the number of postgrad hours that the teacher has
accumulated.
Look at the r-squared value for Model C. It's higher than either of the two individual linear regressions.
Every time you add an independent variable, the r-squared increases (or, at worst, stays the same). It
never goes down when you add another variable because more of the variability in salary is explained
by the additional variable.
Look how well these models did. These lists below indicate the residuals for each model, which is how far
off each model was in predicting the teacher's salary.
Teacher Salary Years Hrs Model A Model B Model C
Backman 38,000 4 14 -3,904 890 80
Jones 42,000 3 45 2,781 -7,789 -1,112
Nordstrom 59,000 10 55 986 5,121 -210
Osters 44,000 6 28 -3,274 1,164 -1,094
Williams 48,000 5 39 3,411 665 2,335
If you look at Model A, the residuals indicate that the predicted values were somewhat off from the actual
values. Model A over-predicted Mr. Backman's salary by nearly $4,000 (a negative residual) and
under-predicted Mr. Jones' salary by about $2,800 (a positive residual).
If you look at Model B, these residuals are fairly big. Mr. Jones' salary was over-predicted by nearly
$8,000 in Model B, and Ms. Nordstrom's salary was under-predicted by about $5,000.
If you look at Model C, on average, these residuals are much smaller than those of Model A or Model B.
There was only one teacher that had a better prediction from either Model A or Model B than from Model
C. Overall, Model C, the one from multiple regression, is the most accurate model.
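The residual comparison can be reproduced with a short Python sketch. The coefficients for Models A and B are the ones stated in the text. For Model C, the hours coefficient of 231 is an assumption inferred from the residual table (it reproduces the listed Model C residuals to within a few dollars of rounding). The variable names are made up for illustration.

```python
teachers = [
    # (name, salary, years of experience, postgrad hours)
    ("Backman", 38000, 4, 14),
    ("Jones", 42000, 3, 45),
    ("Nordstrom", 59000, 10, 55),
    ("Osters", 44000, 6, 28),
    ("Williams", 48000, 5, 39),
]

models = {
    "A": lambda years, hours: 31164 + 2685 * years,
    "B": lambda years, hours: 31384 + 409 * hours,
    # Hours coefficient of 231 is an assumption inferred from the residual table.
    "C": lambda years, hours: 26807 + 1970 * years + 231 * hours,
}

# Residual = actual salary minus predicted salary
residuals = {
    name: [salary - predict(years, hours) for _, salary, years, hours in teachers]
    for name, predict in models.items()
}
sum_sq = {name: sum(r * r for r in res) for name, res in residuals.items()}

for name in models:
    print(f"Model {name}: residuals {residuals[name]}, sum of squares {sum_sq[name]:,}")
```

A negative residual means the model predicted more than the teacher actually earns, and a positive residual means it predicted less. Model C has by far the smallest sum of squared residuals of the three.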
 TERM TO KNOW
Multiple Regression
Using more than one explanatory variable to predict the value of the response variable.
 SUMMARY
Multiple regression is going to allow us to use more than one explanatory variable to predict the
response. Those explanatory variables must be independent. This allows for certain variables to have
a larger effect on the response than others, but still shows what those effects are and allows us to
explain more of the variation in the response, increasing the r-squared value. By adding a second
explanatory variable independent of the first, or a third independent of the first two, etc., the value of r-
squared will increase.
Good luck!
 TERMS TO KNOW
Multiple Regression
Predictions from Best-Fit Lines
by Sophia
 WHAT'S COVERED
This tutorial will explain prediction from best-fit lines. Our discussion breaks down as follows:
1. Making Predictions From Best-Fit Lines

2. Extrapolation
1. Making Predictions From Best-Fit Lines

A best-fit line can be used to make predictions about the response variable based on some value of the
explanatory variable.
 EXAMPLE The data here is the miles and airfares for different city destinations from the
Minneapolis/Saint Paul Airport.
Destination Miles Airfare ($)
Boston 1,266 263
Charleston 1,294 306
Chicago 407 128
Denver 834 212
Detroit 611 261
In this case, you'd figure that miles is the explanatory variable because, in theory, places that are farther
away should cost more to get to, based on fuel costs, etc.
The regression equation says that predicted airfare is equal to 113.11 plus 0.137 times miles:
airfare-hat = 113.11 + 0.137(miles)
Suppose you are asking what the predicted airfare would be for a flight to Atlanta from Minneapolis,
which would have a total distance of 1,064 miles. To find the airfare, simply put 1,064 in for miles in the
regression equation:
airfare-hat = 113.11 + 0.137(1,064) ≈ 258.88
The predicted airfare would be about $259. The biggest question here, though, is how confident are you in that
prediction? How confident are you that your prediction is close to what it actually costs to get to Atlanta
from Minneapolis?
Look back at the data.
Destination Miles Airfare ($)
Boston 1,266 263
Charleston 1,294 306
Chicago 407 128
Denver 834 212
Detroit 611 261
This linear model is based on data that included distances both shorter than the distance to Atlanta
(Chicago, at 407 miles) and longer than it (Boston, at 1,266 miles). It makes sense to use this model to
predict what the cost would be for a place that is 1,064 miles from Minneapolis.
 EXAMPLE What about the predicted airfare for a flight to Anchorage at a distance of 3,163 miles
from Minneapolis?
The predicted airfare for the flight to Anchorage is $546.44; however, the actual airfare is $727.48. In
this case, the prediction differs greatly from the actual airfare, by about $181. So why are these
values so far apart?
You can't really use the prediction equation to predict what the airfare to Anchorage would be because
this distance is so far outside the bounds of the data that you used to create the model.
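The size of the miss can be checked directly. This is a minimal sketch that reuses the coefficients from the regression equation; the names `predict_airfare`, `predicted`, and `difference` are chosen here for illustration.

```python
# Coefficients from the regression equation in the text.
INTERCEPT = 113.11
SLOPE = 0.137


def predict_airfare(miles):
    """Return the predicted airfare in dollars for a distance in miles."""
    return INTERCEPT + SLOPE * miles


anchorage_miles = 3163   # distance given in the text
actual_airfare = 727.48  # actual airfare given in the text

predicted = predict_airfare(anchorage_miles)
difference = actual_airfare - predicted  # actual minus predicted (the residual)

print(f"predicted airfare: ${predicted:.2f}")
print(f"actual - predicted: ${difference:.2f}")
```

The difference of roughly $181 is far larger than anything you would hope to see for a destination inside the range of the data.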
The range of miles that was used to create your model was from about 400 miles away from Minneapolis
to about 1,300 miles away from Minneapolis. Charleston was the longest distance away, and Chicago was
the shortest distance away. What you're saying is that within the window of 400 to about 1,300, the line
gives reasonable predictions for airfare.
Destination Miles Airfare
Boston 1,266 263
Chicago 407 128
Denver 834 212
Detroit 611 261
Outside of that window, though, it might not. Therefore, you have to be very cautious about using this
prediction line to predict the airfare to Milwaukee, which is closer than 400 miles to Minneapolis, or to a place
like Anchorage. It might not give accurate predictions outside of this particular window.
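One way to build this caution into a calculation is to check a distance against the window before trusting the prediction. The sketch below assumes the approximate window of 400 to 1,300 miles described above; the function name `outside_data_window` is hypothetical, and Milwaukee's 338 miles is the distance listed for it in this document's destination table.

```python
# Approximate window of distances used to build the model, as stated in the text.
MIN_MILES = 400
MAX_MILES = 1300


def outside_data_window(miles, low=MIN_MILES, high=MAX_MILES):
    """Return True when the distance falls outside the range of the data."""
    return miles < low or miles > high


destinations = [("Atlanta", 1064), ("Milwaukee", 338), ("Anchorage", 3163)]

for city, miles in destinations:
    if outside_data_window(miles):
        note = "outside the window, so treat the prediction with caution"
    else:
        note = "inside the window"
    print(f"{city} ({miles} miles): {note}")
```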
2. Extrapolation
Making predictions outside the range of the data is called extrapolation. It means using the linear model to
make predictions outside the range of values for which the model was intended.
It's not always bad to extrapolate, because sometimes linear trends do continue outside of the window from
the data that made them, but not always.
However, proceed with caution if you do end up extrapolating, because it is risky. You're trusting the
linear trend to continue outside the bounds of the data that created the model. Using the linear model to try to
predict outside those bounds might be an unwise decision.
 EXAMPLE Men's Olympic gold medal 100-meter dash times have decreased at a rate of about
one-hundredth of a second per year for the last 60 years. The graph below shows this relationship.
Someone who extrapolates might say that if this trend continues, then in about 1,000 years, there will be a
person whose gold medal sprint time is zero seconds.
Clearly, this is nonsense. You can't use this line to predict what might happen even 100 years down the
road, much less 1,000 years down the road. Extrapolation might not be a good idea, especially with this
particular data set, because it can lead to nonsense results.
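The arithmetic behind that nonsense result is easy to reproduce. The sketch below uses the rate of one-hundredth of a second per year from the example; the starting time of 10 seconds is an assumed round figure, implied by the claim that the line reaches zero in about 1,000 years, not a value given in the text.

```python
RATE_PER_YEAR = 0.01  # seconds taken off the winning time per year (from the text)
START_TIME = 10.0     # assumed current winning time in seconds (round figure)


def extrapolated_time(years_from_now):
    """Winning time the straight line predicts, however far out you go."""
    return START_TIME - RATE_PER_YEAR * years_from_now


for years in (0, 100, 500, 1000, 1100):
    print(years, "years:", round(extrapolated_time(years), 2), "seconds")
```

Followed far enough, the line predicts a time of zero and then negative times, which shows that the linear trend cannot continue indefinitely.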
 TERM TO KNOW
Extrapolation
Using the regression line to make predictions outside the window for which the model was intended.
 SUMMARY
A linear model is a reasonable predictor of response values (airfares, in our example) for values of
the explanatory variable within the range of the data that created the model. Using that model to
predict responses for values outside that range is called extrapolation. This should always be done
with caution because sometimes it gives you values that don't make any practical sense.
Extrapolation is also the reason why the y-intercept of a least-squares line sometimes doesn't have
a meaningful interpretation: it is a prediction at an explanatory value of zero, which often lies
outside the range of the data.
Good luck!
 TERMS TO KNOW
Extrapolation
Using the regression line to make predictions outside the window for which the model was intended.
Terms to Know
Best-Fit Line
A line that closely approximates the response values for given explanatory values when the
form of the scatterplot is linear.
Causality
A phenomenon whereby an increase in one variable directly leads to an increase or
decrease in another variable.
Coefficient of Determination (r^2)
A value that gives the percent of variation in the response variable that can be explained
by a linear association with the explanatory variable. It is the square of the correlation
coefficient.
Correlation
Correlation coefficient (r)

The numerical value between -1 and +1 that measures the correlation between two
quantitative variables.
Direction
The way one variable responds to an increase in the other. With a negative association, an
increase in one variable is associated with a decrease in the other, whereas with a positive
association, an increase in one variable is associated with an increase in the other.
Explanatory Variable
The variable whose increase or decrease we believe helps explain a tendency to increase
or decrease in some other variable.
Extrapolation
Using the regression line to make predictions outside the window for which the model was
intended.
Form
The overall shape of the data points. The form may be linear or nonlinear, or there may not
be any form at all to the points, if they form a "cloud."
Combining together subgroups that should not be combined, resulting in a weakened, or
even reversed, association.
Influential Point
An observation that, if removed, significantly changes a statistical measure.
Least-Squares Regression Line

The line of best fit, according to the method of Least-Squares.
Least-squares Line
The regression line for which the sum of the squares of the residuals is the smallest.
Multiple Data Sets

Plotting more than one data set on a scatterplot requires that we use different colors or
symbols for the different data sets so we can see the relationships separately.
Multiple Regression
Negative Correlation
The type of correlation present when two variables have a correlation coefficient generally
less than or equal to -0.5.
Non-linear Relationships
Outlier
In a scatter plot, an outlier is an observation that has an extreme x value, an extreme y
value, both an extreme x and y, or is well away from the main trend of points.
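Positive Correlation
The type of correlation present when two variables have a correlation coefficient generally
greater than or equal to 0.5.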

between -0.5 and 0.5.
Residual
The difference between the actual value of the response variable for a particular data point
and its predicted value from the regression line.
Residual Plot
A scatterplot of the residuals vs. the explanatory variable, as opposed to the response variable
vs. the explanatory variable. It can be used to assess the fit of a line.
Response Variable
The variable that tends to increase or decrease due to an increase or decrease in the
explanatory variable.
Scatterplot
A graphical display that allows us to see the relationship between two quantitative
variables.
Slope
Slope of a least-squares regression line

The amount by which the response variable increases or decreases, on average, when the
explanatory variable increases by one.
Strength
The closeness of the points to the indicated form. Points that are strongly linear will all fall
on or near a straight line.
x-bar
The mean of the explanatory variable.
y-bar
The mean of the response variable.
y-intercept
y-intercept of a least-squares regression line

The predicted value of the response variable when the explanatory variable is zero.
Formulas to Know
Correlation
Residual
Slope



Team QB Salary Total Payroll Team QB Salary Total Payroll

49ers 900 17,256 Falcons 2,250 25,642

Bears 3,000 23,074 Giants 1,600 23,258

Bengals 1,050 20,666 Jets 800 19,063

Bills 650 24,249 Lions 1,525 24,644

Broncos 500 21,992 Oilers 1,700 21,399

Browns 967 19,413 Packers 1,500 23,245

Buccaneers 675 19,545 Patriots 2,250 23,294

Cardinals 1,450 20,397 Raiders 1,300 20,390

Chargers 1,200 18,698 Rams 1,500 24,378

Chiefs 1,100 25,859 Redskins 1,450 20,780

Colts 2,000 22,022 Saints 1,200 23,695

Cowboys 1,750 28,349 Seahawks 1,250 25,348

Dolphins 1,400 23,728 Steelers 3,500 30,131

Eagles 425 19,325 Vikings 1,250 23,246


Destination Miles Airfare

Kansas City 460 379

Los Angeles 1,870 377

Milwaukee 338 158

New York City 1,167 283

Philadelphia 1,141 323
