
Correlation

Prof. Kunal Ghosh


Correlation

• Correlation is a statistical measure of the association between two random variables.

• Correlation is not necessarily a causal relationship. It is important in analytics because it helps identify variables that may be used in model building, and it is also useful for identifying issues such as multicollinearity that can destabilize regression-based models.

• One of the challenging tasks in analytics, especially in predictive analytics, is identifying the variables or features that may be associated with the response (outcome) variable that is of interest to the data scientist.

• Organizations collect data on many variables; sometimes the number of variables can run into the thousands (including derived variables such as ratios and interactions).

• For example, mobile service providers collect data on variables such as call duration, number of calls, numbers to which the calls are made, number of calls received, the device that was used to make the call, location (and the mobile tower that the phone was attached to), time between calls, last recharge (in the case of pre-paid mobile services), recharge amount, service plan (in the case of a post-paid connection), number of messages sent, number of messages received, apps downloaded, time spent browsing the internet, and so on.

• The number of variables collected and new variables generated may exceed several thousand.

• A few of these variables are regulatory requirements related to internal security, which mobile phone service providers are expected to collect and store. The idea behind collecting all these variables is to find answers to questions such as:

• Which customer is likely to churn?
• How can the revenue generated from a customer be increased?
• What is the customer lifetime value?
• What is the best service plan for a customer?
• What recommendations can be made to a customer?

• Finding answers to the aforementioned questions involves building predictive or prescriptive analytics models.
• Model building involves identifying, among thousands of candidate variables, the ones to use in the model (in analytics terminology this is called variable selection or feature selection).

• Using all the variables simultaneously to create a model can result in problems such as multicollinearity, which can destabilize the model. It is also time consuming, since most predictive analytics model development involves matrix operations such as matrix inversion.
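
One simple way to spot candidate multicollinearity before model building is to compute the pairwise correlation of the candidate variables and flag pairs whose coefficient is close to +1 or -1. The sketch below illustrates that idea only; the function names, the 0.9 threshold, and the customer values are all assumptions invented for this example, not part of the original material.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical customer data, invented for illustration only.
customers = {
    "calls_made": [10, 20, 30, 40, 50, 60],
    "total_minutes": [32, 58, 95, 118, 155, 176],  # roughly 3 minutes per call
    "messages_sent": [5, 40, 12, 33, 8, 21],
}

for a, b, r in highly_correlated_pairs(customers):
    print(f"{a} and {b}: r = {r:.3f}")
```

With these made-up values only the pair (calls_made, total_minutes) is flagged, which suggests that one of the two may be redundant in a regression-based model. The threshold is a judgment call and would need to be chosen for the problem at hand.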

• Knowledge of how different variables are related to one another is therefore important in building analytical models.

• Correlation is a measure of the strength and direction of the relationship that exists between two random variables, and it is measured using the correlation coefficient.

• In other words, correlation is a measure of association between two variables.
• Correlation can help data scientists choose the variables used to build a model for solving an analytics problem.
• We will be discussing different types of correlation coefficients in
this chapter depending on the scale of measurement of the
variables involved.

• Correlation is only an association relationship and not a causal relationship.

• Thus the user should be aware that two variables may have a high correlation coefficient even though there may not be any direct dependence between them.

• Pearson product moment correlation (Pearson correlation for short) is used for measuring the strength and direction of the linear relationship between two continuous random variables X and Y.
• For example, consider two variables: the average call duration (variable Y) and the age of the caller (variable X).
• We may want to know whether the average call duration is related to the age of the caller, that is, whether a change in age is related to a change in average call duration.

• It is also possible that there is no relationship between age and average call duration.
• A simple way to check whether an association exists is to draw a scatter plot.
• A scatter plot may reveal whether any relationship exists between two variables.

• Table 8.1 gives the age of the customer and the average call duration (measured in seconds) from a sample data set; the corresponding scatter plot is shown in Figure 8.1.

• In Figure 8.1, we can see that the average call duration (Y) decreases as the age of the customer (X) increases.
• We can measure the strength of the linear association using a numerical measure called the correlation coefficient.
• In the next section, we will discuss the mathematical equations for calculating the Pearson product moment correlation coefficient.
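
For reference, the standard textbook definition of the Pearson correlation coefficient for n paired observations is sketched below. The notation (r for the sample coefficient, x_i and y_i for the observations, bars for sample means) is assumed here and may differ from the notation on the original slides.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

A value of r close to -1 corresponds to a decreasing pattern such as the one described for age and average call duration, a value close to +1 to an increasing pattern, and a value near 0 to no linear association.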

• Pearson product moment correlation is used when we are interested in finding a linear relationship between two continuous random variables (that is, each variable should be on either a ratio or an interval scale).

• When we try to measure how a change in one variable (say Y) is related to changes in another variable (say X), one of the issues we need to consider is the measurement scale and unit of measurement of the two variables.

• In the example discussed in Table 8.1, the variable age is measured in years and the call duration is measured in seconds.

• The ranges of the two variables can be different, so we need to standardize the variables before using them to measure the correlation between the two.
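
The standardization step can be sketched as follows: each variable is converted to z-scores (subtract the mean, divide by the standard deviation), and the Pearson coefficient is then the sum of the products of paired z-scores divided by n - 1. This is a minimal illustration under stated assumptions: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is used, and the age and call duration values are invented, not the data from Table 8.1.

```python
from math import sqrt


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


def pearson_from_z(xs, ys):
    """Pearson r as the sum of products of paired z-scores divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical values, NOT the data from Table 8.1.
age_years = [20, 25, 30, 35, 40, 45, 50, 55]
call_seconds = [410, 395, 350, 330, 290, 260, 250, 205]

r = pearson_from_z(age_years, call_seconds)
print(f"r = {r:.3f}")  # negative: duration falls as age rises in this made-up sample

# Changing the unit of measurement (seconds to minutes) leaves r unchanged,
# because standardization removes both the unit and the scale.
call_minutes = [s / 60 for s in call_seconds]
print(abs(pearson_from_z(age_years, call_minutes) - r) < 1e-9)
```

Because the z-scores are unit-free, the coefficient is the same whether duration is recorded in seconds or minutes, which is the reason standardization matters when the two variables have different ranges.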
(Example 8.1)