Bike Assignment - Subjective Sol


Question 1


Introduction

Regression is a supervised learning technique that supports finding the correlation among variables. A regression
problem is when the output variable is a real or continuous value.

In this article, we will understand the following concepts:

What is Regression?

Types of Regression.

What is the meaning of Linear Regression and why is it important?

Importance of the cost function and gradient descent in Linear Regression.

Impact of different values of the learning rate.

Implementing a use case of Linear Regression with Python code.

General Subjective Questions


Question 1

Regression

In Regression, we plot a line or curve that best fits the given data points, and the machine learning
model can then deliver predictions regarding the data. In naïve words, “Regression shows a line or curve that passes
through the data points on a target-predictor graph in such a way that the vertical distance between the data
points and the regression line is minimum.” It is used principally for prediction, forecasting, time-series modelling,
and determining cause-and-effect relationships between variables.

Types of Regression models

• Linear Regression
• Polynomial Regression
• Logistic Regression

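As a minimal illustration of a Linear Regression use case in Python, the sketch below fits scikit-learn's LinearRegression on a small made-up data set; the variable names and numbers are purely hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one predictor (e.g. temperature) vs. a continuous target (e.g. bike rentals).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # shape (n_samples, n_features)
y = np.array([52.0, 57.0, 63.0, 70.0, 74.0, 81.0])

model = LinearRegression()
model.fit(X, y)

print("slope:", model.coef_[0])         # estimated coefficient of the fitted line
print("intercept:", model.intercept_)   # estimated intercept of the fitted line
print("prediction for x = 7:", model.predict([[7.0]])[0])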
Question 3

In statistics, Pearson's correlation coefficient is also referred to as Pearson's r, the Pearson product-moment
correlation coefficient (PPMCC), or the bivariate correlation. It is a statistic that measures the linear correlation
between two variables. Like all correlations, it has a numerical value that lies between -1.0 and +1.0.
 
Whenever we discuss correlation in statistics, it is generally Pearson's correlation coefficient. However, it cannot
capture nonlinear relationships between two variables and cannot differentiate between dependent and
independent variables.

Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard
deviations. The form of the definition involves a "product moment", that is, the mean (the first moment about the
origin) of the product of the mean-adjusted random variables; hence the modifier product-moment in the name. 
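To make the definition concrete, the short Python sketch below (assuming NumPy and SciPy are available) computes Pearson's r both directly from the definition, r = cov(X, Y) / (σX · σY), and with scipy.stats.pearsonr; the paired values are made up for illustration.

import numpy as np
from scipy import stats

# Hypothetical paired measurements.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.9, 6.2, 7.8, 10.4])

# Definition: covariance of x and y divided by the product of their standard deviations.
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Library version for comparison; pearsonr also returns a p-value.
r_scipy, p_value = stats.pearsonr(x, y)

print(r_manual, r_scipy)  # both lie between -1.0 and +1.0 and should agree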

Question 6

A Quantile-Quantile (Q-Q) plot is a graphical tool that helps us assess whether a set of data plausibly came from some
theoretical distribution such as a Normal, Exponential or Uniform distribution. It also helps determine whether two
data sets come from populations with a common distribution.
This is useful in linear regression when the training and test data sets are received separately: a Q-Q plot lets us
confirm that both data sets come from populations with the same distribution.
A Q-Q plot showing the 45-degree reference line:
If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the
line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line,
but not necessarily on the line y = x. Q–Q plots can also be used as a graphical means of estimating
parameters in a location-scale family of distributions.

A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such
as location, scale, and skewness are similar or different in the two distributions.  
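As a sketch of how a Q-Q plot can be produced in Python (assuming SciPy and Matplotlib are installed), the snippet below first checks a sample against a theoretical Normal distribution and then compares the quantiles of two hypothetical data sets (e.g. train and test) against each other with a y = x reference line; all data here is simulated.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

# Sample vs. a theoretical Normal distribution (ordered sample quantiles
# against theoretical quantiles, with a fitted reference line).
sample = rng.normal(loc=10.0, scale=2.0, size=200)
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Q-Q plot against a Normal distribution")
plt.show()

# Two data sets (e.g. train vs. test): plot their quantiles against each other.
train = rng.normal(loc=10.0, scale=2.0, size=300)
test = rng.normal(loc=10.0, scale=2.0, size=150)
probs = np.linspace(0.01, 0.99, 50)
plt.figure()
plt.scatter(np.quantile(train, probs), np.quantile(test, probs))
plt.plot([5, 15], [5, 15])  # y = x reference line
plt.xlabel("train quantiles")
plt.ylabel("test quantiles")
plt.show()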

Question 5

If there is perfect correlation between two independent variables, then VIF = infinity. In the case of perfect
correlation we get R² = 1, which leads to VIF = 1/(1 − R²) = infinity. To solve this problem we need to drop from
the dataset one of the variables that is causing the perfect multicollinearity.

An infinite VIF value indicates that the corresponding variable can be expressed exactly as a linear
combination of other variables (which show an infinite VIF as well).
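A minimal sketch of how an infinite VIF shows up in practice, assuming statsmodels and pandas are installed: in the made-up data below, x3 is an exact linear combination of x1 and x2, so regressing any of these predictors on the others gives R² = 1 and the VIF blows up.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(42)

# Hypothetical predictors: x3 is an exact linear combination of x1 and x2.
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["x3"] = 2 * df["x1"] + 3 * df["x2"]

X = add_constant(df)  # VIF is normally computed on the design matrix with an intercept
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))

# x1, x2 and x3 come out as inf (or an astronomically large number),
# signalling that one of the perfectly collinear variables should be dropped.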

Question 2

Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet
appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in
1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before
analyzing it and the effect of outliers on statistical properties.

Simple understanding:

Once, Francis John “Frank” Anscombe, a statistician of great repute, found 4 sets of 11 data points in
his dream and requested the council, as his last wish, to plot those points. Those 4 sets of 11 data points are
given below.
After that, the council analysed them using only descriptive statistics and found the mean, standard
deviation, and correlation between x and y to be nearly identical for all four sets.
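To see this numerically, the sketch below loads the quartet (it ships with seaborn as an example dataset with columns 'dataset', 'x' and 'y') and prints each set's summary statistics, which come out nearly identical even though the scatter plots look completely different.

import seaborn as sns

df = sns.load_dataset("anscombe")

for name, group in df.groupby("dataset"):
    print(
        name,
        round(group["x"].mean(), 2), round(group["y"].mean(), 2),
        round(group["y"].std(), 2),
        round(group["x"].corr(group["y"]), 2),
    )

# The four sets share almost the same mean, standard deviation and correlation;
# plotting them, e.g. with sns.lmplot(data=df, x="x", y="y", col="dataset"),
# shows how different they really are.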

Question 4

Normalization typically means rescaling the values into a range of [0, 1]. Standardization typically means
rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).

Normalisation vs. Standardisation:

1. Normalisation: the minimum and maximum values of the feature are used for scaling. Standardisation: the mean and standard deviation are used for scaling.

2. Normalisation: used when features are on different scales. Standardisation: used when we want to ensure zero mean and unit standard deviation.

3. Normalisation: scales values to [0, 1] or [-1, 1]. Standardisation: not bounded to a certain range.

4. Normalisation: strongly affected by outliers. Standardisation: much less affected by outliers.

5. Normalisation: Scikit-Learn provides the MinMaxScaler transformer for it. Standardisation: Scikit-Learn provides the StandardScaler transformer for it.

6. Normalisation: squishes the n-dimensional data into an n-dimensional unit hypercube. Standardisation: translates the data so that the mean vector of the original data sits at the origin, then squishes or expands it.

7. Normalisation: useful when we do not know the distribution of the feature. Standardisation: useful when the feature distribution is Normal (Gaussian).

8. Normalisation: often called scaling normalisation. Standardisation: often called Z-score normalisation.
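A minimal sketch of both transformers on one made-up feature column, assuming scikit-learn is installed; the single large value (100) illustrates how much more Normalisation is distorted by an outlier than Standardisation.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column with one obvious outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Normalisation: squeezes the values into [0, 1]; the outlier pushes the rest close to 0.
print(MinMaxScaler().fit_transform(X).ravel())

# Standardisation: zero mean, unit variance; not bounded to a range and less dominated by the outlier.
print(StandardScaler().fit_transform(X).ravel())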
