Bike Assignment - Subjective Sol
Introduction
Regression is a supervised learning technique for modeling the relationship between variables. A problem is a
regression problem when the output variable is a real or continuous value.
What is regression?
Types of regression.
What does linear regression mean, and why is it important?
Regression
In regression, we fit a line or curve to the given data points and plot it against the variables. The fitted machine
learning model can then make predictions about new data. In simple words, regression shows a line or curve that
passes as close as possible to the data points on a target-predictor graph, in such a way that the total vertical
distance between the data points and the regression line is minimum. The line does not have to pass through every
point. It is used principally for prediction, forecasting, time series modeling, and exploring cause-and-effect
relationships between variables.
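The idea of minimizing vertical distances can be sketched in a few lines. This is a minimal illustration, assuming ordinary least squares (squared vertical distances) as the fitting criterion, which the text does not name explicitly; the function name `fit_line` and the toy data are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical distances."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical toy data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

# Use the fitted line to predict the target for an unseen x
prediction = slope * 6 + intercept
```

With noisy data the fitted line would generally miss most points, while still keeping the squared vertical distances as small as possible.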
Linear Regression
Polynomial Regression
Logistic Regression
Question 3
In statistics, Pearson's correlation coefficient is also referred to as Pearson's r, the Pearson product-moment
correlation coefficient (PPMCC), or bivariate correlation. It is a statistic that measures the linear correlation
between two variables. Like all correlations, it has a numerical value that lies between -1.0 and +1.0.
When correlation is discussed in statistics, it generally means Pearson's correlation coefficient. However, it cannot
capture nonlinear relationships between two variables, and it cannot differentiate between dependent and
independent variables.
Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard
deviations. The form of the definition involves a "product moment", that is, the mean (the first moment about the
origin) of the product of the mean-adjusted random variables; hence the modifier product-moment in the name.
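The definition above (covariance divided by the product of the standard deviations) can be written out directly. The sketch below is illustrative only; the function name `pearson_r` and the small datasets are hypothetical, and population (divide-by-n) formulas are assumed, which gives the same r as the sample versions because the factor cancels.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson's r: covariance of xs and ys over the product of their standard deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Mean of the product of the mean-adjusted variables (the "product moment")
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# Perfect increasing and decreasing linear relationships
r_linear = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

# y = x^2 on a symmetric range: a strong nonlinear relationship that r does not detect
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The parabola case illustrates the limitation noted earlier: y is fully determined by x, yet r is zero because the relationship is not linear.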
Question 6
A quantile-quantile (Q-Q) plot is a graphical tool that helps us assess whether a set of data plausibly came from
some theoretical distribution, such as a Normal, exponential, or Uniform distribution. It also helps to determine
whether two data sets come from populations with a common distribution.
This is useful in linear regression when the training and test data sets are received separately: a Q-Q plot lets us
check whether both data sets come from populations with the same distribution.
[Figure: a Q-Q plot showing the 45-degree reference line]
If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the
line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line,
but not necessarily on the line y = x. Q–Q plots can also be used as a graphical means of estimating
parameters in a location-scale family of distributions.
A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such
as location, scale, and skewness are similar or different in the two distributions.
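The points of a Q-Q plot against a Normal distribution can be computed without any plotting library. This is a rough sketch: the function name `normal_qq_points` and the sample values are hypothetical, and the plotting positions (i - 0.5) / n are one common convention assumed here, not something the text specifies.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs against a standard Normal."""
    ordered = sorted(sample)          # observed quantiles are the sorted data
    n = len(ordered)
    std_normal = NormalDist()         # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical sample to compare against the Normal distribution
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6, 4.9, 5.2, 5.0]
points = normal_qq_points(sample)

for theo, obs in points:
    print(f"{theo:7.3f}  {obs:5.2f}")
```

If the pairs fall close to a straight line, the sample is plausibly Normal; the slope and intercept of that line then give rough estimates of the scale and location parameters.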
Question 5
If there is perfect correlation between two independent variables, then VIF = infinity. In that case, regressing
one variable on the other gives R^2 = 1, so VIF = 1/(1 - R^2) divides by zero and becomes infinite. To solve this
problem, we need to drop from the dataset one of the variables that is causing this perfect
multicollinearity.
An infinite VIF value indicates that the corresponding variable may be expressed exactly by a linear
combination of other variables (which show an infinite VIF as well).
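The formula VIF = 1/(1 - R^2) can be checked on a small two-predictor case, where R^2 from regressing one predictor on the other equals their squared correlation. This is a sketch under that simplifying assumption; the names `r_squared` and `vif`, the tolerance, and the data are hypothetical, and a real analysis with several predictors would regress each predictor on all of the others.

```python
import math
from statistics import mean

def r_squared(xs, ys):
    """R^2 from regressing ys on the single predictor xs (the squared Pearson r)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def vif(r2, tol=1e-12):
    """Variance inflation factor 1 / (1 - R^2); infinite when R^2 is (numerically) 1."""
    if 1.0 - r2 < tol:
        return math.inf
    return 1.0 / (1.0 - r2)

x1 = [1, 2, 3, 4, 5]
x2_collinear = [2 * v + 3 for v in x1]   # exact linear function of x1
x2_unrelated = [2, 1, 3, 1, 2]           # chosen to have zero correlation with x1

vif_collinear = vif(r_squared(x1, x2_collinear))
vif_unrelated = vif(r_squared(x1, x2_unrelated))
```

The collinear pair yields an infinite VIF, which is the signal to drop one of the two variables; the uncorrelated pair yields the minimum possible VIF of 1.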
Question 2
Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet
appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in
1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before
analyzing it and the effect of outliers on statistical properties.
Simple understanding:
Imagine that Francis John “Frank” Anscombe, a statistician of great repute, found 4 sets of 11 data points in
a dream and, as his last wish, asked the council to plot those points. Those 4 sets of 11 data points form the
quartet.
After that, the council analyzed them using only descriptive statistics and found the mean, standard
deviation, and correlation between x and y.
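The descriptive statistics mentioned above can be computed directly. The sketch below uses the quartet values as they are commonly published; they are reproduced from memory here and should be checked against a trusted source before reuse. The helper name `pearson_r` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's quartet as commonly published (datasets I-III share the same x values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation between xs and ys."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# (mean x, mean y, sample sd x, sample sd y, correlation) for each dataset
summary = {
    name: (mean(xs), mean(ys), stdev(xs), stdev(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, [round(v, 2) for v in stats])
```

The four rows of summary statistics come out nearly identical, which is why the differences between the datasets only become visible once the points are plotted.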
Question 4
Normalization typically means rescaling the values into the range [0, 1]. Standardization typically means
rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).
S.NO | Normalisation | Standardisation
3. | Scales values between [0, 1] or [-1, 1]. | It is not bounded to a certain range.
6. | This transformation squishes the n-dimensional data into an n-dimensional unit hypercube. | It translates the data to the mean vector of original data to the origin and squishes or expands.
7. | It is useful when we don’t know about the distribution. | It is useful when the feature distribution is Normal or Gaussian.
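The two rescaling methods compared above can be sketched side by side. This is a minimal illustration with hypothetical function names and data; it assumes the population standard deviation for standardization and does not handle a constant feature, where both formulas would divide by zero.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column
data = [10, 20, 30, 40, 50]

normalized = min_max_normalize(data)     # bounded: smallest value maps to 0, largest to 1
standardized = standardize(data)         # unbounded: centered on 0 with unit variance
```

Normalized values always stay inside [0, 1], while standardized values have no fixed bounds, so a single extreme point compresses the rest of a normalized feature toward one end of the range.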