Stat 9-10
Lecture 9-10:
Correlation and Regression
Master of Management
Faculty of Economics and Business
Universitas Gadjah Mada
2019
Correlation and Regression
Correlation
Regression
Data Transformation
Learning Objectives
▪ LO13-1: Explain the purpose of correlation analysis
▪ LO13-2: Calculate a correlation coefficient to test and interpret the
relationship between two variables
▪ LO13-3: Apply regression analysis to estimate the linear relationship
between two variables
▪ LO13-4: Evaluate the significance of the slope of the regression equation
▪ LO13-5: Evaluate a regression equation’s ability to predict using the
standard error of estimate and the coefficient of determination
▪ LO13-6: Calculate and interpret confidence and prediction intervals
▪ LO13-7: Use a log function to transform a nonlinear relationship
Background
In this meeting, we shift the emphasis to the study of
relationships between two interval- or ratio-level variables
(such as the profit made on a car sale, the income of bank
presidents, etc.).
In all business fields, identifying and
studying relationships between variables
can provide information on ways to increase
profits, methods to decrease costs, or
variables to predict demand.
Background
Examples of relationships between two variables are:
▪ Does the amount a company spends per month on training its
sales force affect its monthly sales?
▪ In a study of fuel efficiency, is there a relationship between
miles per gallon and the weight of a car?
▪ Does the number of hours that students study for an exam
influence the exam score?
Relationship Between Two Variables
▪ Data: Variable 1 (Sample 1) and Variable 2 (Sample 2)
▪ Graphical representation of the relationship between two variables:
scatter diagram
▪ Statistical measures of the relationship between two variables:
covariance, correlation (correlation coefficient), and regression
(regression equation)
Scatter Diagram
▪ A scatter diagram is a graphic tool used to portray the
relationship between two variables.
▪ Example:
A sales manager wants to know if there is a relationship between the
number of sales calls made in a month and the number of copiers sold
that month and begins the analysis with a random sample of 15 sales
representatives.
With this data, the number of sales calls is the independent variable and
number of copiers sold is the dependent variable.
CORRELATION ANALYSIS
A group of techniques to measure the
relationship between two variables.
Correlation Analysis
How can we measure the way that two variables move together?
1. We start with the idea of variance (a measure of the average
dispersion around the mean):
$$\mathrm{var}(X) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$
This is intuitive: we square the deviations to get rid of negative values and
then sum them to get the total variation.
Finally, we divide by n to get the 'average variation' around the mean.
Correlation Analysis
How can we measure the way that two variables move together?
2. We use the idea of covariance to measure the linear
association between two variables by looking at how they
vary together in terms of deviations from their means:
$$\mathrm{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$
▪ It's hard to interpret this number; what kind of units is the covariance measured in?
▪ Imagine that X is weight in kilograms and Y is number of hamburgers eaten in the past month.
Then the covariance is expressed in a new unit of measurement, kilograms*hamburgers.
Correlation Analysis
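We resolve this by normalizing the covariance: we divide it by the product
of the two standard deviations, so that the units cancel.
$$\mathrm{normalized\_cov}(X, Y) = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 / n}\;\sqrt{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2 / n}} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{s_x s_y}$$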
Correlation Coefficient
We can rewrite the formula to get the CORRELATION COEFFICIENT
$$r = \frac{1}{n}\sum_{i=1}^{n}\frac{\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{s_x s_y}$$
For sample data:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{s_x s_y} = \frac{\sum\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{(n-1)\, s_x s_y}$$
The sum of cross-products in the numerator captures the observed joint
variation of X and Y; dividing it by n − 1 gives its average level.
▪ If there is no linear relationship between the two variables, r is zero.
▪ If |r| is below 0.5, the relationship is considered weak.
▪ If |r| is above 0.5, the relationship is considered strong.
Characteristics of the Correlation Coefficient
Now we find the deviations from the mean number of sales calls
and the mean number of copiers sold, then multiply them.
The sum of their products is 6,672 and will be used in the
correlation coefficient formula to find r.
We also need the standard deviations ($s_x = 42.76$ for sales calls and
$s_y = 12.89$ for copiers sold).
$$r = \frac{6672}{(15-1)(42.76)(12.89)} = 0.865$$
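As a rough check of the arithmetic, the sketch below recomputes r from the summary values quoted on the slide (6,672, n = 15, 42.76 and 12.89). The helper `pearson_r` and the small `calls`/`sold` lists are illustrative assumptions, not the 15-representative sample from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: sum of cross-products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Summary values taken from the slide
sum_cross_products = 6672
n = 15
s_x = 42.76
s_y = 12.89
r_slide = sum_cross_products / ((n - 1) * s_x * s_y)
print(round(r_slide, 3))  # 0.865

# Hypothetical raw data, only to show the helper on actual observations
calls = [10, 20, 30, 40, 50]
sold = [3, 5, 4, 8, 9]
r_demo = pearson_r(calls, sold)
print(round(r_demo, 3))
```

Working from raw data and working from the summary values give the same r for the same sample, because the summary values are just the pieces of the formula computed in advance.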
Regression
Regression analysis:
▪ It is concerned with the study of the dependence of one
variable on one (or more) other variable(s)
▪ It estimates one variable based on another variable
▪ The variable being estimated is the dependent variable (y)
▪ The variable used to make the estimate or predict the value is the
independent variable (x)
Least Squares Method
The method lets us minimize the total distance between the actual
values y and the fitted values $\hat{y}$.
Least Squares Method
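▪ We take squares of the distance $(y - \hat{y})$ so that we treat distances above
and below the line equally (i.e. we get rid of negative values).
▪ Hence we form the following objective function for minimizing the
distance, where $\hat{y}_i = \alpha + \beta x_i$:
$$\min_{\alpha,\,\beta}\ \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
▪ Minimizing it gives:
$$\beta = \frac{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2} = \frac{\mathrm{covariance}(X, Y)}{\mathrm{variance}(X)}$$
$$\alpha = \bar{y} - \beta\bar{x}$$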
Least Squares Estimators
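The result:
$$\beta = \frac{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2} = r\,\frac{s_y}{s_x}$$
$$\alpha = \bar{y} - \beta\bar{x}$$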
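The two expressions for the slope can be compared numerically. The sketch below is a minimal illustration that computes the slope once from the sums of deviations and once as r·s_y/s_x. The function names and the data in `xs` and `ys` are hypothetical and do not come from the lecture.

```python
import math

def least_squares(xs, ys):
    """Return (alpha, beta) of the least squares line y_hat = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx                 # sum of cross-products / sum of squares of x
    alpha = y_bar - beta * x_bar     # the line passes through (x_bar, y_bar)
    return alpha, beta

def slope_from_r(xs, ys):
    """Same slope, written as r * s_y / s_x with sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = cross / ((n - 1) * s_x * s_y)
    return r * s_y / s_x

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta = least_squares(xs, ys)
beta_alt = slope_from_r(xs, ys)
print(alpha, beta, beta_alt)
```

The two slopes agree up to floating-point rounding, which is why the worked example can use r, s_y and s_x instead of the raw deviations.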
Least Squares Regression Example
Recall the example of North American Copier Sales. The sales manager
gathered information on the number of sales calls made and the number of
copiers sold.
Use the least squares method to determine a linear equation to express the
relationship between the two variables.
1. The first step is to find the slope of the least squares regression line, $\beta$:
$$\beta = r\,\frac{s_y}{s_x} = 0.865\left(\frac{12.89}{42.76}\right) = 0.2608$$
▪ The 𝛽 value of .2608 indicates that for each additional sales call, the sales representative can expect to
increase the number of copiers sold by about .2608.
▪ So 20 additional sales calls in a month will result in about five more copiers being sold.
2. The second step is to find $\alpha$:
$$\alpha = \bar{y} - \beta\bar{x}$$
$$\alpha = 45 - 0.2608(96) = 19.963$$
3. Then determine the regression line:
$$\hat{y} = 19.963 + 0.2608x$$
So if a salesperson makes 100 calls, he or she can expect to sell
46.0432 copiers, i.e. about 46.
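The three steps can be retraced with the summary values used in the example (r = 0.865, standard deviations 12.89 and 42.76, means 45 and 96). This is only a sketch: the slope is rounded to four decimals before computing the intercept, on the assumption that the slides did the same, and the `predict` function is an illustrative helper.

```python
# Summary values from the worked example
r = 0.865
s_y = 12.89   # standard deviation of copiers sold
s_x = 42.76   # standard deviation of sales calls
y_bar = 45    # mean number of copiers sold
x_bar = 96    # mean number of sales calls

# Step 1: slope, rounded to four decimals (assumed to match the slide's rounding)
beta = round(r * s_y / s_x, 4)

# Step 2: intercept
alpha = y_bar - beta * x_bar

# Step 3: regression line
def predict(calls):
    """Estimated number of copiers sold for a given number of sales calls."""
    return alpha + beta * calls

print(beta)                      # 0.2608
print(round(alpha, 4))           # 19.9632
print(round(predict(100), 4))    # 46.0432
```

Without the intermediate rounding of the slope, the intercept comes out slightly different (about 19.97), so small discrepancies against software output are expected.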
The Standard Error of Estimate
▪ If the standard error of estimate is small, this indicates that the data are
relatively close to the regression line and the regression equation can be
used to predict y with little error.
▪ If it is large, the data are widely scattered around the regression line and
the regression equation will not provide a precise estimate of y.
The Standard Error of Estimate Example
▪ We calculate the standard error of estimate in this example.
▪ We need the sum of the squared differences between each observed
value of y and the predicted value of y, which is $\hat{y}$
The Coefficient of Determination
▪ The coefficient of determination is the proportion of the total variation in
the dependent variable Y that is explained, or accounted for, by the
variation in the independent variable X.
▪ The coefficient of determination provides a more interpretable measure
of a regression equation’s ability to predict.
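▪ It is found from the following formula (explained variation divided by
total variation):
$$r^2 = \frac{\sum\left(\hat{y}_i - \bar{y}\right)^2}{\sum\left(y_i - \bar{y}\right)^2}$$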
The Coefficient of Determination
▪ The characteristics of coefficient of determination:
▪ It ranges from 0 to 1.0
▪ It is the square of the correlation coefficient
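A small numerical check of the second characteristic: the sketch below computes the coefficient of determination as 1 − SSE / SS Total on hypothetical data and compares it with the square of r. The names `sse`, `ss_total` and the data are illustrative assumptions. For the copier example, squaring the reported r = 0.865 gives about 0.748.

```python
import math

# Hypothetical data for illustration only
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
ss_total = sum((y - y_bar) ** 2 for y in ys)   # total variation in y

# Least squares line and its unexplained variation
beta = sxy / sxx
alpha = y_bar - beta * x_bar
sse = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / ss_total         # proportion of variation explained
r = sxy / math.sqrt(sxx * ss_total)    # correlation coefficient

print(round(r_squared, 4), round(r ** 2, 4))   # 0.64 0.64

# Copier example: square of the reported correlation coefficient
r_squared_copiers = 0.865 ** 2
print(round(r_squared_copiers, 3))             # 0.748
```

Read this way, roughly 75 percent of the variation in copiers sold is accounted for by the variation in sales calls.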
Transforming Data
▪ Regression analysis and the correlation coefficient require the
relationship between the two variables to be linear
▪ But what if the relationship is not linear?
1. Rescale: we can rescale one or both of the variables so the
new relationship is linear
2. Transform: the common transformation techniques include:
▪ Computing the log to the base 10 of y, Log(y)
▪ Taking the square root
▪ Taking the reciprocal
▪ Squaring one or both variables
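The effect of a log transformation can be sketched on made-up data. Here y grows exponentially with x, so the raw relationship is curved, while log10(y) is exactly linear in x. The data and the helper `pearson_r` are hypothetical and are not the Grocery Land figures.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical nonlinear data: y doubles with every unit increase in x
x = list(range(1, 9))
y = [10 * 2 ** xi for xi in x]

# Transform: log to the base 10 of y
log_y = [math.log10(v) for v in y]

r_raw = pearson_r(x, y)
r_log = pearson_r(x, log_y)
print(round(r_raw, 3), round(r_log, 3))
```

After the transformation, the correlation is essentially 1 because log10(y) = 1 + x·log10(2) is a straight line in x. A regression fitted to log10(y) predicts logs, so predictions have to be converted back with 10 raised to the fitted value.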
Transforming Data: An Example
▪ The director of marketing of Grocery Land Supermarkets
wishes to study the effect of price on weekly sales of their
two-liter private brand diet cola.
▪ The objectives of the study are:
1. To determine whether there is a relationship between selling price
and weekly sales.
2. To determine the effect of price increases or decreases on sales.
Transforming Data: An Example