MACHINE LEARNING
Notes by Aniket Sahoo - Part II
Disclaimer : These are the personal notes prepared by me while undergoing the PGDMLAI course by
Upgrad, for my future reference. They include all the topics (both mandatory and optional sections) as well as
a couple of extra topics not included in the course, drawn from various scholarly articles over the internet. Please
do use them for your personal reading only and refrain from sharing them over public domains such as LinkedIn,
Github, Facebook, GoogleDrive etc. as that might infringe upon various copyrights. I have tried my best to
avoid any errors; if you still find any, please let me know by dropping a mail to [email protected] so
that it can be rectified.
3. REGRESSION
LINEAR REGRESSION
Cost Function
The line that fits the data best will be the one for which the n prediction errors (one for each
observed data point) are as small as possible in some overall sense. One way to achieve this is to
use the least squares criterion, which minimizes the sum of all the squared prediction errors εi
(the difference between the actual and the predicted value of the dependent variable y), i.e.
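In standard form, for a simple linear model with predictions ŷi = β0 + β1 xi, the criterion is,

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$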
2. Iterative Form Solution : It is a mathematical process that uses an initial guess to generate a
sequence of improving approximate solutions for a class of problems, in which the nth
approximation is derived from the previous (n − 1)th one. An iterative method is called
convergent if the corresponding sequence converges for the given initial approximations.
Iterative methods are often the only choice for nonlinear equations. However, iterative
methods are often useful even for linear problems involving a large number of variables
(sometimes of the order of millions), where direct methods would be prohibitively expensive
(and in some cases impossible) even with the best available computing power.
a. First Order (Gradient Descent Method)
b. Second Order (Newton’s Method)
3. Move till the sequence θt converges to the local minimum.
The term η, called the Learning Rate, controls the size of the steps taken in the downward direction in each
iteration. If it is too low, the algorithm may take longer to reach the minimum value. On the other
hand, if it is too high, the algorithm may overstep the minimum value. The following figures demonstrate
the different scenarios one can encounter while configuring the learning rate.
The various variations of Gradient Descent are,
1. Batch or Vanilla gradient descent : It computes the gradient of the cost function w.r.t. to the
parameters θ for the entire training dataset.
2. Stochastic gradient descent : In contrast to batch gradient descent, it performs a parameter
update for each individual training example.
3. Mini-Batch gradient descent : It takes the best of both worlds and performs an update for
every mini-batch of n training examples.
The following figures show the paths taken by various gradient descent variations for reaching the
local minima.
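As a rough sketch (not from the course material), the three variants differ only in how much data is used per update; the snippet below assumes a linear model with a parameter vector theta, a design matrix X, a target vector y (all NumPy arrays) and a squared error cost.

>> import numpy as np
>> def gradient(theta, X, y): # Gradient of the mean squared error cost for a linear model
       return 2 * X.T @ (X @ theta - y) / len(y)
>> eta = 0.01 # Learning rate
>> # Batch (vanilla): one update per pass over the entire training set
>> theta = theta - eta * gradient(theta, X, y)
>> # Stochastic: one update per training example
>> for i in np.random.permutation(len(y)):
       theta = theta - eta * gradient(theta, X[i:i+1], y[i:i+1])
>> # Mini-batch: one update per mini-batch of examples (batch size 32 is only illustrative)
>> for batch in np.array_split(np.random.permutation(len(y)), max(1, len(y) // 32)):
       theta = theta - eta * gradient(theta, X[batch], y[batch])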
1. Residual Sum of Squares (RSS) is defined as the total sum of error across the whole sample.
It is the measure of the difference between the expected and the actual output. A small RSS
indicates a tight fit of the model to the data. Mathematically RSS is,
2. Total Sum of Squares (TSS) is defined as the sum of errors of the data points from the mean
of the response variable. Mathematically TSS is,
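In standard notation (with ŷi the predicted value and ȳ the mean of the response variable), the two quantities and the resulting R2 are,

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad R^2 = 1 - \frac{RSS}{TSS}$$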
The following figures show the significance of R2.
Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)
The Root Mean Squared Error is the square root of the variance of the residuals. It indicates the
absolute fit of the model to the data i.e. how close the observed data points are to the model’s
predicted values. Mathematically it can be represented as,
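In standard form (the RSE shown is for a simple linear regression with n − 2 degrees of freedom),

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad RSE = \sqrt{\frac{RSS}{n-2}}$$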
The R-squared is a relative measure of fit, whereas the RMSE is an absolute measure of fit. Because
the value of RMSE depends on the units of the variables (i.e. it is not a normalized measure), it
changes with a change in the units of the variables. Thus, R-squared is a better measure than RMSE.
2. Independence of residuals : The error terms should not be dependent on one another (as in
time-series data, where the next value depends on the previous one). There should be no
correlation between the residual (error) terms; the presence of such correlation is known as
Autocorrelation.
Autocorrelation causes the confidence intervals and prediction intervals to be narrower. A
narrower confidence interval means that a nominal 95% confidence interval would have a lower
probability than 0.95 of containing the actual value of the coefficients.
4. Equal variance of residuals : The error terms must have constant variance. This phenomenon
is known as homoscedasticity, while the presence of non-constant variance is referred to as
heteroscedasticity.
Generally, non-constant variance arises in the presence of outliers or extreme leverage
values; when these values get too much weight, they disproportionately influence the
model's performance, causing the confidence intervals to be unrealistically wide or
narrow for out-of-sample predictions.
Visualize Data
>> sns.pairplot(df,x_vars=[columns],y_vars=column,size=4,aspect=1,kind='scatter')
Check Collinearity
>> sns.heatmap(df.corr(), cmap="YlGnBu", annot=True)
Scaling Features
>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>> scaler = StandardScaler() # Any one of the scaling methods can be used
>> scaler = MinMaxScaler() # Any one of the scaling methods can be used
>> X_train_scaled = scaler.fit_transform(X_train.values.reshape(-1,1))
>> X_test_scaled = scaler.transform(X_test.values.reshape(-1,1))
# It is a general convention in scikit-learn that observations are rows, while
# features are columns. Reshaping is needed only when using a single feature
>> import statsmodels.api as sm
>> X_train_scaled_sm = sm.add_constant(X_train_scaled)
>> X_test_scaled_sm = sm.add_constant(X_test_scaled)
# In statsmodels intercept variable needs to be added explicitly
Train Model
>> model = sm.OLS(y_train, X_train_scaled_sm).fit()
Analyze Model
>> model.params
>> model.summary()
# Statsmodel provides an extensive summary of various metrics
>> coeff_A, coeff_B = model.params # Intercept and slope from the fitted statsmodels model
>> plt.scatter(X_train_scaled, y_train)
>> plt.plot(X_train_scaled, coeff_A + coeff_B * X_train_scaled, 'r')
Analyze Residuals
>> y_train_pred = model.predict(X_train_scaled_sm)
>> residual = (y_train - y_train_pred)
>> sns.distplot(residual, bins = 15) # Checking if residuals are normally distributed
>> plt.scatter(X_train_scaled, residual) # Checking for independence of residuals
Predict
>> y_pred = model.predict(X_test_scaled_sm)
Evaluate Model
>> from sklearn.metrics import mean_squared_error
>> from sklearn.metrics import r2_score
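# Mirrors the evaluation in the sklearn flow below (assumes numpy is imported as np)
>> rmse = np.sqrt(mean_squared_error(y_test, y_pred))
>> r_squared = r2_score(y_test, y_pred)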
Visualize Model
>> plt.scatter(X_test_scaled, y_test)
>> plt.plot(X_test_scaled, coeff_A + coeff_B * X_test_scaled, 'r')
Visualize Data
>> sns.pairplot(df,x_vars=[columns],y_vars=column,size=4, aspect=1,kind='scatter')
Check Collinearity
>> sns.heatmap(df.corr(), cmap="YlGnBu", annot=True)
Scaling Features
>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>> scaler = StandardScaler() # Any one of the scaling methods can be used
>> scaler = MinMaxScaler() # Any one of the scaling methods can be used
>> X_train_scaled = scaler.fit_transform(X_train.values.reshape(-1,1))
>> X_test_scaled = scaler.transform(X_test.values.reshape(-1,1))
# It is a general convention in scikit-learn that observations are rows, while
# features are columns. Reshaping is needed only when using a single feature
Train Model
>> from sklearn.linear_model import LinearRegression
>> model = LinearRegression()
>> model.fit(X_train_scaled, y_train)
Analyze Model
>> coeff_A = model.intercept_
>> coeff_B = model.coef_[0]
>> plt.scatter(X_train_scaled, y_train)
>> plt.plot(X_train_scaled, coeff_A + coeff_B * X_train_scaled, 'r')
Analyze Residuals
>> y_train_pred = model.predict(X_train_scaled)
>> residual = (y_train - y_train_pred)
>> sns.distplot(residual, bins = 15) # Checking if residuals are normally distributed
>> plt.scatter(X_train_scaled, residual) # Checking for independence of residuals
Predict
>> y_pred = model.predict(X_test_scaled)
Evaluate Model
>> from sklearn.metrics import mean_squared_error
>> from sklearn.metrics import r2_score
>> rmse = np.sqrt(mean_squared_error(y_test, y_pred))
>> r_squared = r2_score(y_test, y_pred)
Visualize Model
>> plt.scatter(X_test_scaled, y_test)
>> plt.plot(X_test_scaled, coeff_A + coeff_B * X_test_scaled, 'r')
Cost Function
2. Normal Equation : It is an analytical approach to solving the cost function. One can directly
find the value of β without using Gradient Descent. This approach is an effective and
time-saving option when working with a dataset having a small number of features. β is given by,
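In matrix form,

$$\beta = (X^{T}X)^{-1}X^{T}y$$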
Multicollinearity
Multicollinearity refers to the phenomenon of having related predictor variables in the input dataset.
As multicollinearity makes it difficult to determine which variable is actually contributing towards the
prediction of the response variable, it leads one to conclude incorrectly about the (strong/weak) effects of
a variable on the target variable. Thus, it is a big issue when one is trying to interpret the model.
Though it does not affect the precision of the predictions, it is essential to properly detect and deal
with the multicollinearity present in the model, as random removal of any of these correlated
variables from the model causes the coefficient values to swing wildly and even change signs.
Multicollinearity can be detected using the following methods.
1. Pairwise Correlations : Checking the pairwise correlations between different pairs of
independent variables can throw useful insights in detecting multicollinearity.
2. Variance Inflation Factor (VIF) : Pairwise correlations may not always be useful, as it is
possible that while no single variable can completely explain another variable on its own,
some of the variables combined might be able to do so. Thus, to check these sorts of
relationships, the VIF of each variable is computed,
where,
i refers to the ith variable which is being represented as a linear combination
of the rest of the independent variables.
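With Ri2 the R-squared obtained by regressing the ith variable on the remaining independent variables, the VIF takes the standard form,

$$VIF_i = \frac{1}{1 - R_i^2}$$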
The common heuristic followed for the VIF values is,
a. VIF > 10 indicates definite removal of the variable.
b. VIF > 5 indicates the variable needs inspection.
c. VIF < 5 indicates the variable is good to go.
Once multicollinearity has been detected in the dataset, it can be dealt with using the following
methods.
1. Highly correlated variables can be dropped.
2. Business Interpretable variables can be picked up.
3. New interpretable features can be derived using the correlated variables.
4. Variable transformations can be done using PCA (Principal Component Analysis) or PLS
(Partial Least Squares).
Dummy Variables
While dealing with multiple variables, there may be some categorical variables that might turn out to
be useful for the model. So, it is essential to handle these variables appropriately in order to get a
good model. These can be dealt with using the following methods.
1. Label Encoding : It is used to transform the non-numerical categorical variables to numerical
labels. Numerical labels are always between 0 and n − 1 . This is usually performed on
ordinal categorical variables.
2. One Hot Encoding : One way to deal with the nominal categorical variables is to create n
new columns (dummy variables) each indicating whether that level exists or not using a zero
or one. For example for a variable say, Relationship with three levels namely, Single, In a
relationship and Married, one can create a dummy table like the following.
Relationship Status      Single    In a relationship    Married
Single                     1              0                0
In a relationship          0              1                0
Married                    0              0                1
Dummy Variable Trap is a scenario where there are attributes which are highly correlated
(multicollinear). After using one hot encoding, any one of the dummy variables can be
predicted with the help of the other remaining dummy variables. Hence, one dummy variable
is always highly correlated with the other remaining dummy variables. Thus, one needs to
exclude one of the dummy variables before proceeding with the model.
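As a small illustration (assuming pandas is available and the dataframe df has a column named Relationship), the dummies can be created and the trap avoided by dropping one level:

>> import pandas as pd
>> dummies = pd.get_dummies(df['Relationship'], drop_first=True) # Drops one level to avoid the dummy variable trap
>> df = pd.concat([df, dummies], axis=1)
>> df = df.drop('Relationship', axis=1)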
Feature Scaling
Another important aspect to consider is feature scaling. When there are many independent
variables in a model, a lot of them might be on very different scales. This leads to a model with very
weird coefficients, which are difficult to interpret. Thus, one needs to scale the features for ease of
interpretation of the coefficients as well as for the faster convergence of gradient descent methods.
One can scale the features using the following methods.
1. Standardization : The variables are scaled in such a way that their mean is zero and standard
deviation is one.
2. Min Max Scaling : The variables are scaled in such a way that all the values lie between zero
and one.
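In standard notation, the two scalings are,

$$x_{standardized} = \frac{x - \mu}{\sigma}, \qquad x_{minmax} = \frac{x - min(x)}{max(x) - min(x)}$$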
It is important to note that scaling just affects the coefficients and none of the other parameters like
t-statistic, F statistic, p-values, R-square, etc. Also the dummy variables are usually never scaled for
the ease of their interpretation.
Feature Selection
There may be quite a few potential predictors for the model, but choosing the correct features (not
the redundant ones but the ones that add some value to the model) is quite essential. To get the
optimal model, one can always try all the possible combinations of independent variables and check
which model fits the best. But this method is obviously time-consuming and infeasible. Hence, one
needs some other method to get a decent model. Following is a list of different approaches for
choosing the best set of predictors that will give the least test error.
Model Assessment
While creating the best model for any problem statement, one ends up choosing from a set of
models the one which gives the least test error. Hence, the test error, and not just the training error, needs
to be estimated in order to select the best model. Besides selecting the best model to obtain
decent predictions, one also needs to maintain a balance between keeping the model simple and
explaining the highest variance (i.e. keeping as many required variables as possible).
R-squared is never used for comparing models, as the value of R2 increases with the number of
predictors (even if these predictors do not add any value to the model). This led to the idea of
penalizing the model for every predictor variable added (if the variable does not improve the
model more than would be expected by chance). This can be done in the
following two ways.
1. Use metrics which take into account both model fit and simplicity, such metrics are Mallow's
Cp, Adjusted R2, AIC and BIC.
2. Estimate the test error via a validation set or a cross-validation approach.
For a model where,
n is the number of training examples,
d is the number of predictors,
σ 2 is the estimate of the variance of training error,
D is the entire dataset,
M is the model,
ln P (D∣M ) is the model likelihood,
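The commonly used definitions of these criteria (the exact convention may differ slightly from the one used in the course) are,

$$C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right), \qquad AIC = \frac{1}{n\hat{\sigma}^2}\left(RSS + 2d\hat{\sigma}^2\right), \qquad BIC = \frac{1}{n}\left(RSS + \ln(n)\, d\hat{\sigma}^2\right)$$

$$Adjusted\ R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}, \qquad BIC \text{ (Bayesian form)} = -2 \ln P(D \mid M) + d \ln n$$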
$$TSS = ESS + RSS, \qquad \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$
Mallow’s CP
Visualize Data
>> sns.pairplot(df)
>> sns.boxplot(x='column', y='column1', hue='column2', data=df)
Prepare Data
>> def binary_map(x):
return x.map({'yes': 1, 'no': 0})
Scaling Features
>> from sklearn.preprocessing import StandardScaler
>> scaler = StandardScaler()
>> X_train_scaled, X_test_scaled = X_train.copy(), X_test.copy()
>> X_train_scaled[column_list] = scaler.fit_transform(X_train[column_list])
>> X_test_scaled[column_list] = scaler.transform(X_test[column_list])
Train Model
>> from sklearn.linear_model import LinearRegression
>> model = LinearRegression()
>> model.fit(X_train, y_train)
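Select Features (RFE)
# The X_train_rfe used in the VIF check below presumably comes from a recursive
# feature elimination step; a minimal sketch (the choice of 15 features is only illustrative)
>> from sklearn.feature_selection import RFE
>> rfe = RFE(model, n_features_to_select=15)
>> rfe = rfe.fit(X_train, y_train)
>> selected_features = X_train.columns[rfe.support_]
>> X_train_rfe = X_train[selected_features]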
Checking VIF
>> from statsmodels.stats.outliers_influence import variance_inflation_factor
>> vif = pd.DataFrame()
>> vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i)
for i in range(X_train_rfe.shape[1])]
>> vif['VIF'] = round(vif['VIF'], 2)
>> vif['Features'] = X_train_rfe.columns
>> vif = vif.sort_values(by='VIF', ascending=False)
Analyze Residuals
>> y_train_pred = model_final.predict(X_train_manual)
>> residual = (y_train - y_train_pred)
>> sns.distplot(residual, bins = 15)
# Checking if residuals are normally distributed
>> plt.scatter(X_train_manual, residual)
# Checking for independence of residuals
Predict
>> X_test = X_test[selected_features]
>> X_test = sm.add_constant(X_test)
>> y_pred = model_final.predict(X_test)
Evaluate Model
>> from sklearn.metrics import mean_squared_error
>> from sklearn.metrics import r2_score
>> rmse = np.sqrt(mean_squared_error(y_test, y_pred))
>> r_squared = r2_score(y_test, y_pred)
Prediction vs Projection
Although prediction and projection sound synonymous but they are different applications in
analytics.
1. Prediction : Identification of the predictor variables and the measurement of their impact is the main aim. Projection/Forecast : Finding the final projected result or forecasted value is the main aim.
2. Prediction : No new assumptions are made other than those of linear regression. Projection/Forecast : Forecasts for the next day are made with the assumption that everything remains the same today; the inclusion of any new incident changes the forecast for the next day.
3. Prediction : Simplicity of the model is important. Projection/Forecast : Accuracy of the model is important, not its simplicity.
Interpolation vs Extrapolation
The basic difference between interpolation and extrapolation is given in the following table,
Interpolation : The model is used to predict the value of the dependent variable on independent values that lie within the range of the data one already has.
Extrapolation : The model is used to predict the value of the dependent variable on independent values that lie outside the range of the data the model was built on.
In the preceding figure, when one predicts Y1 for X1, which lies between a and b, it is called
interpolation. On the other hand, extrapolation would be extending the line to predict Y2 for X2,
which lies outside the range [a, b].
Parametric vs Non-Parametric
The basic difference between parametric and non-parametric is given in the following table,
1. Parametric : The number of parameters is fixed with respect to the sample size, simplifying the function to a known form. Non-Parametric : The number of parameters can grow with the sample size, freeing it to learn any functional form from the training data.
2. Parametric : Simpler and faster to train, requiring less training data. Non-Parametric : Flexible and highly accurate, but slower to train and requiring a lot of training data.
3. Parametric : Eg. Simple Neural Networks, Naive Bayes, Logistic Regression, etc. Non-Parametric : Eg. k-Nearest Neighbors, Decision Trees like CART and C4.5, SVM, etc.
ADVANCED REGRESSION
General Equation
$$F(x) = a_0 + a_1 f_1(x) + a_2 f_2(x) + \ldots + a_k f_k(x)$$
Feature Engineering
Raw attributes, as these appear in the training data, may not be the ones best suited to predict the
response variable values. Thus, in comes the derived features, which is a combination of two or
more raw attributes and/or transformations of individual raw attributes. These combinations and
transformations could be linear (only multiplication by a constant or addition) or nonlinear.
Solution : On minimizing the cost function, the final solution is given by

$$a = (X^{T}X)^{-1}X^{T}y$$
Contour
Regularization Regression
In regularization regression, an additional regularization term is added to the cost function along
with the error term, while minimizing the cost function as follows.
Where Regularization terms can be as follows,
1. Ridge Regression (sum of squares of the coefficients) : λ Σ α²
2. Lasso Regression (sum of absolute values of the coefficients) : λ Σ |α|
The final solution changes to,
At every crossing one could move along the arrow (black arrow in the zoomed image) shown to
keep the error term same and reduce the regularisation term, giving a better solution. Thus, for the
optimum solution for 𝛼 (sum of the error and regularisation terms is minimum), the corresponding
regularisation contour and the error contour must touch each other tangentially and not cross.
The blue stars highlight the touch points between the error contours and the lasso regularisation
contours. The green stars highlight the touch points between the error contours and the ridge
regularisation terms. The picture illustrates the fact that because of the corners in the lasso contours
(unlike ridge regression), the touch points are more likely to lie on one or more of the axes. This
implies that the corresponding coefficients become exactly zero. Hence, lasso regression also serves as a feature
selection method, whereas ridge regression does not.
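A minimal scikit-learn sketch of both regularised regressions (here alpha plays the role of λ, and the value 0.1 is only illustrative):

>> from sklearn.linear_model import Ridge, Lasso
>> ridge = Ridge(alpha=0.1) # Penalises the sum of squared coefficients
>> ridge.fit(X_train, y_train)
>> lasso = Lasso(alpha=0.1) # Penalises the sum of absolute coefficients and can shrink some to exactly zero
>> lasso.fit(X_train, y_train)
>> ridge.coef_, lasso.coef_ # Compare how the two methods shrink the coefficients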
PRINCIPLES OF MODEL
SELECTION
Occam’s Razor
It states that when in a dilemma, one should choose the simpler model. The simplicity of a model can be judged by any of the following.
1. The number of parameters required to specify the model completely. The model y = ax1 + bx2 is
simpler than the model y = ax1 + bx2 + cx3 .
2. The degree of the function, if it is a polynomial.
3. Size of the best-possible representation of the model. The expression
0.552984567x + 932.4710001276 could be considered to be more complex than 2x + 10 .
4. The depth or size of a decision tree.
Advantages of simpler models.
1. Simpler models are usually more generic and are more widely applicable. One who
understands a few basic principles of a subject (simple model) well, is better equipped to
solve any new unfamiliar problem than someone who has memorized an entire set of
specific problems (complex model).
2. Simpler models require fewer training samples for effective training than the more complex
ones and are consequently easier to train.
3. Simpler models are more robust as these are not as sensitive to the specifics of the training
data set in comparison to their more complex counterparts.
4. Simpler models make more errors in the training set but give better results on test samples.
Whereas complex models lead to overfitting and work very well for the training samples, but
fail miserably when applied to test samples.
Bias-Variance Tradeoff
The following figure represents the Bias-Variance tradeoff.
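The tradeoff is usually summarised by the standard decomposition of the expected test error,

$$E\big[(y - \hat{f}(x))^2\big] = \big(Bias[\hat{f}(x)]\big)^2 + Var[\hat{f}(x)] + \sigma^2_{irreducible}$$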
Overfitting
A model memorizing the data rather than intelligently learning the underlying trends in it is called
overfitting. It happens because it is possible for the model to memorize data, which is a problem as
the real test happens on unseen real world data. In the following figure the higher degree
polynomial fit (red line) shows an overfit model.
Regularization
It is the process of deliberately simplifying models to achieve the correct balance between keeping
the model simple and yet not too naive.
Hyperparameters
These are the parameters that the algorithm designer passes to the learning algorithm in order to
control the complexity of the final model. This in term fine tunes the behaviour of the learning
algorithm and has a lot of bearing on the final model produced. In short it governs the behaviour of
the algorithm.
Data Classification
The data is divided into three parts as follows,
1. Training data : data on which model trains.
2. Validation data : data on which model is evaluated for various hyperparameters (prevents
peeking of test data).
3. Test data : data on which model predicts.
The following steps are taken for each of the k folds.
1. A model is trained using k − 1 of the folds as training data.
2. The resulting model is validated on the remaining part of the data.
The performance measure reported by k-fold cross-validation is then the average of the values
computed in the loop. The model with the best performance is then selected.
K-Fold CV
>> from sklearn.model_selection import KFold
>> from sklearn.model_selection import cross_val_score
>> lm = LinearRegression()
>> folds = KFold(n_splits=5, shuffle=True, random_state=100)
>> scores = cross_val_score(lm,X_train,y_train,cv=folds,scoring='r2')
>> scores = cross_val_score(lm, X_train, y_train, cv=folds,
                             scoring='neg_mean_squared_error')
# Recent versions of scikit-learn use 'neg_mean_squared_error' as the scoring string
Grid Search CV
>> from sklearn.feature_selection import RFE
>> hyper_params = [{'n_features_to_select': [parameters]}]
>> lm.fit(X_train, y_train)
>> rfe = RFE(lm)
>> from sklearn.model_selection import GridSearchCV
>> model_cv = GridSearchCV(estimator=rfe, param_grid=hyper_params, cv=folds,
scoring='r2', verbose=1, return_train_score=True)
>> model_cv.fit(X_train, y_train)
>> scores = pd.DataFrame(model_cv.cv_results_)
Bootstrapping Method
It is very important to both present the expected skill of a machine learning model as well as the
confidence intervals for that model skill. Confidence intervals provide a range of model skills and a
likelihood that the model skill will fall between the ranges when making predictions on new data. A
robust way to calculate confidence intervals for machine learning algorithms is to use the bootstrap
method. Bootstrap refers to random sampling with replacement. This is a general technique for
estimating statistics (such as the mean and standard deviation, or model skill scores such as R-squared,
Adjusted R2, AIC, BIC, etc.) from a dataset, and it can be used to calculate empirical confidence intervals, regardless of the
distribution of skill scores. Calculation of confidence intervals with the bootstrap involves the
following steps.
1. Calculation of Population of Statistics :
The first step is to use the bootstrap procedure to resample the original data a number of
times and calculate the statistic of interest. The dataset is sampled with replacement (i.e.
each time an item is selected from the original dataset, it is not removed, allowing that item
to possibly be selected again for the sample). The statistic is calculated on the sample and is
stored so as to build up a population of the statistic of interest. The number of bootstrap
repeats defines the variance of the estimate, the more the better.
2. Calculation of Confidence Interval :
The second step is to calculate the confidence intervals. This is done by first ordering the
population of statistics and then selecting the values at the chosen percentile for the
confidence interval. For example, if a confidence interval of 95% is chosen, then for the
calculated 1,000 statistics from an ordered 1,000 bootstrap samples, the lower bound
would be the 25th value and the upper bound would be the 975th value. Here a
non-parametric confidence interval is calculated which does not make any assumption about
the functional form of the distribution of the statistic. This confidence interval is often called
the empirical confidence interval.
Configure Bootstrap
>> n_iterations = no_of_iterations
>> n_size = int(len(df) * percentage_of_sample)
Run Bootstrap
>> import numpy
>> from sklearn.utils import resample
>> from sklearn.metrics import accuracy_score
>> stats = list()
>> values = df.values
>> for i in range(n_iterations):
>>     train = resample(values, n_samples=n_size) # Bootstrap sample with replacement
>>     X_train = train[:,:-1]
>>     y_train = train[:,-1]
>>     test = numpy.array([x for x in values if x.tolist() not in train.tolist()]) # Out-of-bag rows
>>     X_test = test[:,:-1]
>>     y_test = test[:,-1]
>>     model = any_machine_learning_model()
>>     model.fit(X_train, y_train)
>>     predictions = model.predict(X_test)
>>     score = accuracy_score(y_test, predictions)
>>     stats.append(score)
Plot Stats
>> from matplotlib import pyplot
>> pyplot.hist(stats)
>> pyplot.show()
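Confidence Interval
# A sketch of the empirical confidence interval described above (a 95% interval is assumed)
>> import numpy
>> alpha = 0.95
>> lower = numpy.percentile(stats, ((1.0 - alpha) / 2.0) * 100) # 2.5th percentile
>> upper = numpy.percentile(stats, (alpha + (1.0 - alpha) / 2.0) * 100) # 97.5th percentile
>> print('%.0f%% confidence interval: %.3f to %.3f' % (alpha * 100, lower, upper))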
4. CLASSIFICATION
LOGISTIC REGRESSION
Sigmoid Curve
As a simple decision boundary method is not going to work for the diabetes example, one needs to
come up with some other method. A curve having extremely low values in the start, extremely high
values in the end and intermediate values in the middle would be a good choice. The sigmoid curve
has got all these required properties, but so does a straight line. The main reason for not using
a straight line is that it is not steep enough. As can be easily seen from the figure above, the
sigmoid curve has low values for a lot of points, then the values rise all of a sudden and at last there
are a lot of high values. Whereas, for a straight line, the values rise from low to high very uniformly,
and hence, the boundary region (where the probabilities transition from high to low) is not present.
Also, there is the cloglog form of logistic regression, given by,
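In standard notation, the sigmoid (logit) form and the cloglog form can be written as,

$$P(y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \quad \text{(logit)}, \qquad \ln\big(-\ln(1 - P(y=1))\big) = \beta_0 + \beta_1 x \quad \text{(cloglog)}$$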
For the diabetes example, the best fitting combination of β 0 and β 1 will be the one which will have
lower probabilities for non-diabetic persons and higher probabilities for diabetic persons. This can
be done by maximizing the product of all the probabilities, i.e.
This product of probabilities is called the likelihood function. The cost function can be derived using
the likelihood function.
Likelihood Function
Consider a dataset (x1, x2, ...., xn) with probability density (or mass) function f(xi, θ). Then the likelihood function
is given by the joint density of the dataset, which by independence is equal to the product of the
marginal densities (for a continuous distribution) or of the marginal probabilities (for a discrete
distribution). The cost function for logistic regression is the negative logarithm of this likelihood.
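In standard form (the cost shown is the negative log-likelihood used in logistic regression, with pi the predicted probability for the ith observation),

$$L(\theta) = \prod_{i=1}^{n} f(x_i, \theta) \ \text{(continuous)}, \qquad L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \ \text{(discrete)}$$

$$Cost = -\sum_{i=1}^{n} \big[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \big]$$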
Odds are defined as the ratio of the probability of success to the probability of failure. So, if the odds of
success are 4 (0.8/0.2), it shows that the probability of success (80%) has an accompanying probability of failure
of 20%. Log odds is simply the natural logarithm of the odds, i.e. ln(4) = 1.386.
Confusion Matrix
A confusion matrix is an n × n matrix, where n is the number of classes being predicted. The
following figure represents a confusion matrix for two classes.
Confusion Matrix            Target : Positive        Target : Negative
Model : Positive            True Positive (TP)       False Positive (FP)      Precision / Positive Predictive Value = TP/(TP+FP)
Model : Negative            False Negative (FN)      True Negative (TN)       Negative Predictive Value = TN/(FN+TN)
                            Recall / Sensitivity     Specificity              Accuracy
                            = TP/(TP+FN)             = TN/(FP+TN)             = (TP+TN)/(TP+FP+FN+TN)
Accuracy of the model is the proportion of the total number of predictions that were correct.
Accuracy = (T P + T N ) / (T P + F P + F N + T N )
For a model with an accuracy of about 90% (which looks good), on re-evaluation of the confusion
matrix one could still see a lot of misclassifications. Thus, other, more
discriminative metrics were brought in.
1. Positive Predictive Value / Precision : Proportion of predicted positive cases that are correct.
Positive Predictive Value or Precision = TP/(TP+FP)
2. Negative Predictive Value : Proportion of predicted negative cases that are correct.
Negative Predictive Value = TN/(TN+FN)
3. Sensitivity / Recall : Proportion of actual positive cases correctly identified.
Sensitivity or Recall = TP/(TP+FN)
4. Specificity : Proportion of actual negative cases correctly identified.
Specificity = TN/(TN+FP)
5. True Positive Rate : Proportion of actual positive cases correctly identified.
True Positive Rate = TP/(TP+FN)
6. False Positive Rate : Proportion of actual negative cases incorrectly identified as positive.
False Positive Rate = FP/(FP+TN)
7. F1 Score : It is the harmonic mean of Precision and Recall.
F1 Score = 2 × (Precision × Recall)/(Precision + Recall)
ROC Curve
The ROC or Receiver Operating Characteristics curve is the plot between True Positive Rate and the
False Positive Rate, or simply, a tradeoff between sensitivity and specificity. The biggest advantage
of using the ROC curve is that it is independent of the change in proportion of responders. This is
because both of its axes are derived from the columnar calculations of the confusion matrix, where
the numerator and denominator of both the x and y axes change on a similar scale for any change in
the response rate. The following figure shows that the more the ROC curve is towards the upper-left
corner (i.e. the more the area under the curve), the better the model.
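A minimal sketch of plotting the ROC curve with scikit-learn (assuming y_pred_proba holds the predicted probabilities of the positive class):

>> from sklearn.metrics import roc_curve, roc_auc_score
>> import matplotlib.pyplot as plt
>> fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
>> auc = roc_auc_score(y_test, y_pred_proba)
>> plt.plot(fpr, tpr, label='AUC = %.2f' % auc)
>> plt.plot([0, 1], [0, 1], 'k--') # Diagonal corresponds to a random model
>> plt.xlabel('False Positive Rate')
>> plt.ylabel('True Positive Rate')
>> plt.legend()
>> plt.show()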
K-S Statistic
K-S or Kolmogorov-Smirnov statistic is an indicator of how well the model discriminates between the
two classes i.e a measure of the degree of separation between the positive and negative
distributions. It is equal to 0% for the random model, and 100% for the perfect model. Thus, the KS
statistic gives an indicator of where the model lies between the two.
A good model is one for which the K-S statistic treads the following conditions,
1) It is equal to 40% or more.
2) It lies in the top deciles, i.e. 1st, 2nd, 3rd or 4th.
Gini Coefficient
The Gini coefficient is closely related to the area under the ROC curve and is given by,
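In terms of the area under the ROC curve (AUC), one common way to express it is,

$$Gini = 2 \times AUC - 1$$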
Model
>> import statsmodels.api as sm
>> links = sm.families.links
>> model = sm.GLM(y_train,(sm.add_constant(X_train)),
                  family=sm.families.Binomial(link=links.logit))
# link can take any of the values logit, probit or cloglog
>> result = model.fit()
>> result.summary()
Predict
>> y_pred = result.predict(sm.add_constant(X_test))
Analyze Model
>> from sklearn import metrics
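# A possible continuation, assuming y_pred holds the predicted probabilities from the model
# above, y_test holds the true labels, and using an illustrative 0.5 cutoff
>> y_pred_class = (y_pred > 0.5).astype(int)
>> metrics.confusion_matrix(y_test, y_pred_class)
>> metrics.accuracy_score(y_test, y_pred_class)
>> metrics.recall_score(y_test, y_pred_class)
>> metrics.precision_score(y_test, y_pred_class)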
Sample Selection
Selecting the right sample is very essential for solving any business problem. One should look out
for the following mentioned errors while selecting a sample.
1. Seasonal or cyclic fluctuations in the population : The seasonal fluctuations in the business (such
as new year sales, economic ups and downs etc.) need to be taken care of while building
the samples. A wide range of data should be selected, representing each type of seasonal
fluctuation.
2. Representative population : The sample should be representative of the population on which
the model is going to be applied in the future rather than all the types of population.
3. Rare incidence population : For rare events samples (such as fraud transactions, anomaly
detection etc.), the samples should be stratified and balanced before being used for
modelling.
Segmentation
Usually, even after having selected the samples with all due considerations, there are chances that
the model may not perform well on the chosen sample data set. Assuming that no new sample is
available, the population can be divided into smaller, homogeneous segments and a separate model
built for each segment, which often improves the overall performance.
Variable Transformation
It is already known that categorical variables need to be transformed into dummies and numeric
variables standardised before proceeding with any modelling. However, one could also convert
the numeric variables into dummy variables. The major advantage offered by dummies, especially for
continuous variables, is that they stabilize the model: small variations in the variables will not have
a very big impact on a model built using dummies, whereas they will still have a sizable impact on a
model built using the continuous variables as is. But this also has a certain disadvantage: the data
gets compressed into very few categories, resulting in information loss and data clumping.
Apart from creating dummies, there are also other techniques for transforming variables. The
commonly used techniques used for transforming variables are,
1. WOE (Weight of Evidence) Transformation
2. Interaction Variables
3. Splines
4. Mathematical Transformation
5. PCA (Principal Component Analysis)
Interaction Variables
Combining various variables to create one variable which gives a good meaning is called an
interaction variable. The interaction variables can be built on a judgement call based upon a deep
understanding of the business, or by building a small decision tree. Indeed, it is largely because
they implicitly incorporate such interaction effects that Random Forests and Neural Networks are often able to
outperform Logistic Regression.
Splines
A spline is basically obtained by fitting a polynomial over the WOE values. Using a spline offers high
predictive power, but it may result in unstable models.
Mathematical Transformation
Mathematical transformations such as x2 , x3 and log x are commonly used under this type of
transformation. But these transformations are not very easy to explain or very intuitive.
Model Evaluation
The model performance measures are grouped under three broad classes.
1. Discriminatory Power : It measures how good the model is at separating out the classes.
Various metrics being used for measuring are,
a. KS - Statistic
b. Gini Coefficient
c. Rank Ordering
d. Sensitivity
e. Specificity
2. Accuracy : It measures how accurately the model is able to predict the classes. Various
metrics being used for measuring are,
a. Sensitivity
b. Specificity
c. Comparing Actual v/s Predicted Log Odds
3. Stability : It measures how stable the model is in predicting unseen data. Various metrics
being used for measuring are,
a. Performance Stability : Results of in-sample validation approximately match those of
out-of-time validation.
b. Variable Stability : The sample used for model building has not changed too much
and has the same general characteristics. PSI (Population Stability Index) is used for
the same.
Model Validation
The model can be validated using the following methods,
1. In Sample validation
2. Out of Time validation
3. K-Cross validation
Model Governance
The process of tracking the model over time as it is being used is referred to as Model Governance.
The tracking is done by evaluating the model on the basis of the ongoing samples and comparing it
with the previous such evaluation result. The following figure represents the metrics of a model
performance over time.
NAIVE BAYES
Bayes Theorem
Bayes Theorem gives the probability of an event, based on prior knowledge of conditions that might
be related to the event. For example, if diabetes is related to age, then, using Bayes' theorem, a
person's age can be used to more accurately assess the probability that they have diabetes,
compared to the assessment of the probability of diabetes made without the knowledge of the
person's age. If the probability of an event B occurring given that event A has already occurred and
individual probabilities of A and B are known, then the probability of event A given B has already
occurred can be found out using Bayes Theorem given by,
P (A ∣ B) = P (B ∣ A) × P (A) / P (B)
or
P (B ∣ A) = P (A ∣ B) × P (B) / P (A)
The value of the denominator P (x1 )P (x2 )...P (xn ) remains the same for all the classes for any given
data point. Thus, the denominator behaving as a scaling factor out here can be ignored without
affecting the final outcome giving,
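This gives the usual Naive Bayes decision rule (for a class Ck and features x1, ..., xn),

$$P(C_k \mid x_1, \ldots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$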
Using the above BOW table and the Multinomial Naive Bayes classifier, let's classify some documents.

Document : “Great Story”

Class : education                              Class : cinema
Prior : P(education) = 3/5                     Prior : P(cinema) = 2/5
Likelihood :                                   Likelihood :
P(Great | education) = 2/13                    P(Great | cinema) = 1/8
P(Story | education) = 1/13                    P(Story | cinema) = 1/8
Posterior :                                    Posterior :
Likelihood x Prior = 0.007                     Likelihood x Prior = 0.006

Result : As P(education | document) > P(cinema | document), it can be concluded that the document belongs to education.
Document : “Good Story”

Class : education                              Class : cinema
Prior : P(education) = 3/5                     Prior : P(cinema) = 2/5
Likelihood :                                   Likelihood :
P(Good | education) = 0/13                     P(Good | cinema) = 2/8
P(Story | education) = 1/13                    P(Story | cinema) = 1/8
Posterior :                                    Posterior :
Likelihood x Prior = 0                         Likelihood x Prior = 0.012

Result : As can be seen, one has encountered the zero probability problem. It needs to be solved.
The most fundamental difference in Bernoulli Naive Bayes Classifier is that unlike the Multinomial
way which is concerned about the number of occurrences of the word in the class, in Bernoulli one
is just concerned about whether the word is present or not. The bag of words representation in this
case is just a 0 or 1 rather than the frequency of occurrence of the word. The BOW representation
for Bernoulli Naive Bayes is as follows,
     Vocabulary      D0   D1   D2   Total   P(W|education)      D3   D4   Total   P(W|cinema)
0    cinema           0    0    0     0          0/3             1    0     1         1/2
1    depends          0    1    0     1          1/3             0    1     1         1/2
2    educational      1    1    1     3          3/3             0    0     0         0/2
3    ethics           0    1    1     2          2/3             0    0     0         0/2
4    good             0    0    0     0          0/3             0    1     1         1/2
5    great            1    0    1     2          2/3             1    0     1         1/2
6    greatness        0    1    1     2          2/3             0    0     0         0/2
7    institution      1    0    0     1          1/3             0    0     0         0/2
8    movie            0    0    0     0          0/3             0    1     1         1/2
9    titanic          0    0    0     0          0/3             1    0     1         1/2
10   story            0    0    1     1          1/3             0    1     1         1/2
11   harvard          1    0    0     1          1/3             0    0     0         0/2
Using the above BOW table and the Bernoulli Naive Bayes classifier, let's classify some documents.

Document : “Very good educational institution”

     Vocabulary      Present   P(W|education)   Likelihood (education)    P(W|cinema)   Likelihood (cinema)
0    cinema             0          0/3          (0x0/3)+(1-0)(1-0/3)          1/2       (0x1/2)+(1-0)(1-1/2)
1    depends            0          1/3          (0x1/3)+(1-0)(1-1/3)          1/2       (0x1/2)+(1-0)(1-1/2)
2    educational        1          3/3          (1x3/3)+(1-1)(1-3/3)          0/2       (1x0/2)+(1-1)(1-0/2)
3    ethics             0          2/3          (0x2/3)+(1-0)(1-2/3)          0/2       (0x0/2)+(1-0)(1-0/2)
4    good               1          0/3          (1x0/3)+(1-1)(1-0/3)          1/2       (1x1/2)+(1-1)(1-1/2)
5    great              0          2/3          (0x2/3)+(1-0)(1-2/3)          1/2       (0x1/2)+(1-0)(1-1/2)
6    greatness          0          2/3          (0x2/3)+(1-0)(1-2/3)          0/2       (0x0/2)+(1-0)(1-0/2)
7    institution        1          1/3          (1x1/3)+(1-1)(1-1/3)          0/2       (1x0/2)+(1-1)(1-0/2)
8    movie              0          0/3          (0x0/3)+(1-0)(1-0/3)          1/2       (0x1/2)+(1-0)(1-1/2)
9    titanic            0          0/3          (0x0/3)+(1-0)(1-0/3)          1/2       (0x1/2)+(1-0)(1-1/2)
10   story              0          1/3          (0x1/3)+(1-0)(1-1/3)          1/2       (0x1/2)+(1-0)(1-1/2)
11   harvard            0          1/3          (0x1/3)+(1-0)(1-1/3)          0/2       (0x0/2)+(1-0)(1-0/2)

Prior : P(education) = 3/5                     Prior : P(cinema) = 2/5
Posterior : Likelihood x Prior = 0.001         Posterior : Likelihood x Prior = 0.006

Result : The word Very is not there in the vocabulary, so it has been ignored. As the likelihood of the word good (for the education class) becomes 0, it gives rise to the zero probability problem.
Document Processing
>> train_docs['Class'] = train_docs.Class.map({'class_1':0, 'class_2':1})
>> train_array = train_docs.values
>> X_train = train_array[:,0]
>> y_train = train_array[:,1].astype('int')
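Vectorize Documents
# The vectorizer vec and X_transformed used below are assumed to come from a bag-of-words
# step such as this sketch
>> from sklearn.feature_extraction.text import CountVectorizer
>> vec = CountVectorizer(stop_words='english')
>> X_transformed = vec.fit_transform(X_train) # Sparse document-term matrix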
Train Model
>> from sklearn.naive_bayes import MultinomialNB
# Multinomial Naive Bayes
>> model = MultinomialNB()
>> model.fit(X_transformed, y_train)
Predict
>> test_docs['Class'] = test_docs.Class.map({'class_1':0, 'class_2':1})
>> test_array = test_docs.values
>> X_test = test_array[:,0]
>> y_test = test_array[:,1]
>> X_test_transformed = vec.transform(X_test)
>> X_test_dense = X_test_transformed.toarray() # Dense copy, not required for prediction
>> y_pred_class = model.predict(X_test_transformed)
# Predict Class
>> y_pred_proba = model.predict_proba(X_test_transformed)
# Predict probability
Analyze Model
>> from sklearn import metrics
>> print("PRECISION SCORE :",metrics.precision_score(y_test, y_pred_class))
>> print("RECALL SCORE :", metrics.recall_score(y_test, y_pred_class))
>> print("F1 SCORE :",metrics.f1_score(y_test, y_pred_class))
Hyperplanes
Essentially, it is a boundary which separates the data set into its classes. It could be lines, 2D planes,
or even n-dimensional planes. In general, the hyperplane in 2D can be represented by a line,
whereas a hyperplane in 3D can be represented by a plane as shown in the following figure.
The dimension of a hyperplane is always equal to the number of features − 1 . If the value of any
point is positive it would mean that the set of values of the features is in one class; however, if it is
negative then it belongs to the other class. A value of zero would imply that the point lies on the
hyperplane.
Linear Discriminator
The generalized equation of the hyperplane is given as,
where, yi is the label of the ith datapoint (such as Class_1 (+1) and Class_2 (−1))
Thus, the final optimisation problem becomes,
The following figure explains it geometrically.
Maximal Margin Classifier is possible only on datasets which are perfectly linearly separable, so it
has a rather limited applicability. Thus, in order to classify data points which are partially
intermingled one cannot use this classifier.
Support Vectors
One can observe in the above geometric diagram, that the Maximum Margin Hyperplane is
completely determined by those xi which lie nearest to it. These xi's are called support vectors. Thus,
the support vectors are the points that lie closest to the hyperplane and are the only points that are
used in constructing the hyperplane.
Slack Variable
In order to allow certain points to be deliberately misclassified, one has to include a new variable
called as slack variable ε . A slack variable ε is used to control the misclassifications. It tells where
an observation is located relative to the margin and hyperplane. The following figure explains it
geometrically.
1. For data points which are at a distance of more than the margin M , i.e. at a safe distance
from the hyperplane, the value of the slack variable is 0 .
2. For data points correctly classified but falling inside the margin M (or violating the margin),
the value of the slack variable lies between 0 and 1 .
3. For data points incorrectly classified (i.e. violating the hyperplane), the value of the slack
variable is greater than 1 .
Thus, the slack variable ε takes a value between 0 to infinity. Lower values of slack are better than
higher values ( ε = 0 correct classification, ε > 1 incorrect classification, 0 < ε < 1 classifies correctly
but violates the margin).
SVM
>> from sklearn.svm import SVC
>> model = SVC(C = 1)
# The hyperparameter C used in the SVM formulation (in theory) and the C in the SVC()
# function are the inverse of each other. C is analogous to the penalty imposed for
# misclassification, i.e. a higher C will force the model to classify most (training)
# data points correctly (and thus, overfit).
>> model.fit(X_train, y_train)
Feature Transformation
A nonlinear boundary can be transformed into a linear boundary by applying certain functions to the
original attributes to create new features. The original space (X, Y ) is called the original attribute
space, and the transformed space (X ′, Y ′) is called the feature space.
The preceding diagram represents one such transformation. Now in the above example, the
transformation was very neat as the shape of the separator was known to be a circle. But in the real
scenario it is going to be very tough to guess the shape (functional form) of the separator, other than
that it is linear or nonlinear. Thus, one eventually has to go for an n-degree polynomial transformation
for m number of attributes.
For example let’s consider a 2 degree polynomial transformation of 3 attributes. Solving this, one
ends up with 10 features in the new feature space as shown in the following figure.
Thus, it can be concluded that as the number of attributes increases, the number of dimensions in
the transformed feature space grows very rapidly (combinatorially).
This function K is known as the Kernel function. Thus, one just needs to know only the Kernel
function and not the mapping function itself, for transforming the non-linear attributes into linear
features. The benefits of this implicit transformation are as follows.
1. Manually finding the mathematical transformation is now not required to convert a nonlinear
to a linear feature space.
2. Computationally heavy transformations are no longer required.
Some of the most popular types of kernel functions are as follows.
1. Linear Kernel : This is the same as the support vector classifier, or the hyperplane, without
any transformation at all.
2. Polynomial Kernel : It is capable of creating nonlinear, polynomial decision boundaries.
3. Radial Basis Function Kernel (RBF) : This is the most complex one, which is capable of
transforming highly nonlinear feature spaces to linear ones. It is even capable of creating
elliptical (i.e. enclosed) decision boundaries.
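Their standard forms are,

$$K(x, x') = x^{T}x' \ \text{(linear)}, \qquad K(x, x') = (x^{T}x' + c)^{d} \ \text{(polynomial)}, \qquad K(x, x') = e^{-\gamma \|x - x'\|^{2}} \ \text{(RBF)}$$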
Kernel SVM
>> from sklearn.svm import SVC
>> model = SVC(C=1, kernel='linear')
# Kernels can be any one of 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'.
>> model.fit(X_train, y_train)
Similarly, the nonlinear SVR makes use of the kernel functions to transform the data into a higher
dimensional feature space, which then makes it possible to perform the linear separation of the
data. The following figures explain the linear SVR geometrically and give an example of SVR using
linear, polynomial and RBF kernels.
[Figures : Geometric interpretation of SVR | Various SVR kernels]
SVR
>> from sklearn.svm import SVR
>> model = SVR(C=1, epsilon=0.2, kernel='linear')
# Kernels can be any one of 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'.
>> model.fit(X_train, y_train)
4.4. TREE MODELS
Decision Trees
A decision tree uses a tree-like model (resembling an upside-down tree) to make predictions. It is
also very similar to decision making in real life (asking a series of questions in a nested if-then-else
structure to arrive at a decision). A decision tree splits the data into multiple sets. Then, each of
these sets is further split into subsets to arrive at a decision.
If a test splits the data into two partitions it is called a binary decision tree and if it splits the data into
more than two partitions, it is called a multiway decision tree. The decision trees are easy to
interpret (i.e. almost always, one can identify the various factors that lead to the decision).
The algorithm goes on step by step, picking an attribute and splitting the data such that the
homogeneity increases after every split. Usually, the attribute that results in the maximum increase
in homogeneity is chosen for splitting. A split that gives a homogenous subset is much more
desirable than the one that results in a 50 − 50 distribution (in the case of two labels). The splitting is
stopped when the resulting leaves are sufficiently homogenous. Here, sufficiently homogeneous
means that most of the data points in the set should belong to the same class label. Thus, the
decision tree construction problem gets narrowed down to, splitting of the dataset such that the
homogeneity of the resulting partitions is maximum.
Gini Index
Gini index is a measure of homogeneity that measures the degree or probability of a particular
variable being correctly classified when it is randomly chosen, i.e. if all the elements belong to a
single class, then it can be called pure. The degree of Gini index varies between 0 and 1 , where 0
denotes that the elements are randomly distributed across various classes and 1 denotes that all
elements belong to a certain class. A Gini index of 0.5 denotes elements equally distributed across
some of the classes. Thus, the higher the homogeneity, the higher the Gini index.
Gini index of a partition D with n labels is given by,
While calculating the Gini index of an attribute after splitting, one has to multiply the Gini index of the
partition with the fraction of data points of the respective partition. Gini index for an attribute A
having n partitions is given by,
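Following the convention used in these notes (where a higher value means greater homogeneity), with pi the proportion of points in D carrying the ith label and Dj the jth partition created by A, the two expressions referred to above are,

$$Gini(D) = \sum_{i=1}^{n} p_i^2, \qquad Gini_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \, Gini(D_j)$$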
Following is an example where one has to make a choice between two attributes for splitting, Age
and Gender.
On computing, it was found that Gender is a better attribute to split on as it yields a higher value of
Gini index as compared to Age. This means that gender gives a better split that helps in
distinguishing between football players and non-football players. This is intuitive as well, splitting on
gender is expected to be more informative than age because football is usually more popular
among males. Thus, one can split the dataset on the attribute Gender.
Entropy
The idea here is to use the notion of entropy (lack of order or predictability) for measuring the
homogeneity. Entropy quantifies the degree of disorder in the data. Similar to the Gini index, its
value also varies from 0 to 1 . If a data set is completely homogenous, then the entropy of such a
data set is 0 , i.e. there is no disorder. The lower the entropy (or the higher the Gini index), the lesser the
disorder and the greater the homogeneity. One needs to note that entropy is a measure of
disorder, while the Gini index is a measure of homogeneity in the data set.
Entropy of partition D with n labels is given by,
While calculating the Entropy of an attribute after splitting, one has to multiply the Entropy of the
partition with the fraction of data points of the respective partition. Entropy for an attribute A having
n partitions is given by,
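In standard form (using base-2 logarithms), the two expressions referred to above are,

$$Entropy(D) = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad Entropy_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \, Entropy(D_j)$$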
Following is the same example used for the Gini index, where one has to make a choice between
two attributes for splitting, Age and Gender.
Information Gain
Information gain is another measure of homogeneity which is based on the decrease in entropy
after a data set is split on an attribute. It measures by how much entropy has decreased between
the parent set and the partitions obtained after splitting. Information gain of an attribute A with a
partition D is given by,
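In other words,

$$Gain(D, A) = Entropy(D) - Entropy_A(D)$$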
As the information gain is equal to the entropy change from the parent set to the partitions, it is
maximum when the entropy of the parent set minus the entropy of the partitions is maximum. Thus,
while splitting one should choose an attribute such that the information gained is maximum.
Chi-Square
It measures the significance of the relationship between sub-nodes and parent node. Chi-Square of
an attribute A with a partition D is given by,
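The usual form, summed over the sub-nodes and classes (with Actual and Expected the observed and expected counts), is,

$$\chi^2 = \sum \frac{(Actual - Expected)^2}{Expected}$$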
Following is the same example used for the Gini index, where one has to make a choice between
two attributes for splitting, Age and Gender. On computing, it was found that Gender is a better
attribute to split on as it yields a higher Chi-squared value as compared to Age. Thus, one can split
the dataset on the attribute Gender.
R-squared Splitting
So far the splitting has been done only on discrete target variables. But the splitting can also be
done for continuous output variables. One can calculate the R-squared values of the data sets
(before and after splitting) in a similar manner to what is done for linear regression models. The data
is split such that the R-squared of the partitions obtained after splitting is greater than that of the
original or parent data set. In other words, the fit of the model should be as good as possible after
splitting.
Train Model
>> from sklearn.tree import DecisionTreeClassifier
>> model = DecisionTreeClassifier(max_depth=level_of_depth)
>> model.fit(X, y)
Tree Truncation
There are several ways to truncate the decision trees before they start to overfit. Some of these are
listed as follows.
1. Minimum Size of the Partition for a Split : Stop partitioning further when the current partition
is small enough.
2. Minimum Change in Homogeneity Measure : Do not partition further when even the best
split causes an insignificant change in the purity measure (difference between the current
purity and the purity of the partitions created by the split).
3. Limit on Tree Depth : If the current node is farther away from the root than a threshold, then
stop partitioning further.
4. Minimum Size of the Partition at a Leaf : If any of the partitions after a split has fewer than this
threshold minimum, then do not consider the split.
5. Maximum Number of Leaves in the Tree : If the current number of the bottom-most nodes in
the tree exceeds this limit then stop partitioning.
Grid Search
>> from sklearn.tree import DecisionTreeClassifier
>> hyper_params = [{'max_depth' : parameters,
'min_samples_leaf' : parameters,
'min_samples_split' : parameters,
'max_features' : parameters,
'max_leaf_nodes' : parameters,
'criterion' : ['entropy', 'gini']}]
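# The remaining wiring, mirroring the random forest grid search further below
# (folds is assumed to be defined as earlier)
>> from sklearn.model_selection import GridSearchCV
>> model = DecisionTreeClassifier()
>> model_cv = GridSearchCV(estimator=model, param_grid=hyper_params,
                           cv=folds, n_jobs=-1, verbose=1)
>> model_cv.fit(X_train, y_train)
>> model_final = model_cv.best_estimator_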
Test Model
>> from sklearn.metrics import classification_report, confusion_matrix
>> y_pred = model_final.predict(X_test)
>> classification_report(y_test, y_pred)
>> confusion_matrix(y_test,y_pred)
4.4.4. ENSEMBLES
Ensembles are a group of things viewed as a whole rather than individually. In ensembles, a
collection of models is used to make predictions, rather than individual models. The random forest is
an ensemble made by the combination of a large number of decision trees. In principle, ensembles
can be made by combining all types of models such as logistic regression, neural network or a few
decision trees all working in unison.
Bagging Algorithm
Bagging or Bootstrapped Aggregation is an ensemble method. It is used when the goal is to reduce
the variance of a decision tree classifier. Bootstrapping means to create several random subsets of
data (about 30 − 70% ) from training samples chosen randomly with replacement. Each collection of
subset data is then used to build its tree, using a random sample of features while splitting a node
(the random choice of attributes ensures that the prominent features do not appear in every tree, thus
ensuring diversity). Aggregation implies combining the results of the different models present in the
ensemble. The average (or majority vote) of all the predictions from the different
trees is then used, which is more robust than a single decision tree classifier. It needs to be kept in
mind that bagging is just a sampling technique and is not specific to random forests. Following is the
algorithm for the same.
1. Consider there are n observations and k features in a training data set. A sample from the
training data set is taken randomly with replacement.
2. A subset of k features are selected randomly and whichever feature gives the best split is
used to split the node iteratively.
3. The tree is grown to the largest and not pruned (as may be done in constructing a normal
tree classifier).
4. The above steps are repeated N times, to construct N trees in the forest. As each tree is
constructed independently, it is possible to construct each tree in parallel.
5. The final prediction is given based on the aggregation (majority score) of predictions from
the ensemble of the N number of trees.
For measuring the accuracy of the model, the OOB (Out-Of-Bag) error can be used. For each
bootstrap sample, about one-third of the data is not used in the creation of the corresponding tree
(i.e. it was out of the sample). This data is referred to as the out-of-bag data. In order to get an
unbiased estimate of the accuracy of the model on unseen data, the out-of-bag error is used.
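As a side illustration (not from the original notes), sklearn's BaggingClassifier exposes exactly this
sampling scheme together with the OOB score; the sketch below assumes X_train and y_train exist as in
the other snippets, the fractions are only indicative, and the base_estimator argument is named
estimator in newer sklearn versions.
>> from sklearn.tree import DecisionTreeClassifier
>> from sklearn.ensemble import BaggingClassifier
# Bag 100 unpruned decision trees, each grown on a bootstrapped ~70% sample of the rows
# and a random 70% subset of the features
>> model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                             n_estimators=100,
                             max_samples=0.7,
                             max_features=0.7,
                             bootstrap=True,
                             oob_score=True,    # estimate accuracy on the out-of-bag data
                             n_jobs=-1,
                             random_state=42)
>> model.fit(X_train, y_train)
>> model.oob_score_    # estimate of performance on unseen data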
Advantages of Bagging
Some of the advantages of bagging, as used in random forests, are given as follows.
1. Higher resolutions (a smoother decision boundary) in the feature space due to trees being
independent of each other. Trees are independent of each other due to the diversity created
in each tree by the use of random subsets of features (i.e. not all attributes are considered
while making each tree).
2. Reduction in overfitting as the chances are extremely low that more than half of the models
have overfitted in the ensemble.
3. Increased stability as the final prediction is given by the aggregation of a large number of
trees. The decision made by a single tree (on unseen data) depends highly on the training
data since trees are unstable. In a forest, even if a few trees are unstable, averaging out their
decisions decreases the mistakes made because of these few unstable trees. A random
forest almost always has a lower model variance than an ordinary individual tree.
4. Immunity from the curse of dimensionality. Since each tree does not consider all the
features, the feature space reduces. This makes the algorithm immune to a large feature
space causing computational and complexity issues. This in turn also allows it to handle a
large number of input variables effectively without the necessity of variable deletion.
5. Maintains accuracy for missing data by having an effective method for estimating missing
data even when a large proportion of the data are missing.
6. Allows parallelizability, as each tree can be built separately owing to the fact that each of the
trees are independently built on different data and attributes. This allows one to make full
use of the multi-core CPU to build random forests.
7. Uses all the data to train the model. There is no need for the data to be split into training and
validation samples as one can calculate the OOB (Out-of-Bag) error using the training set
which gives a really good estimate of the performance of the forest on unseen data.
Even though bagging has many advantages, at times it may not give very precise predictions for
classification and regression problems, because the final prediction is based on the mean (or majority
vote) of the predictions from the subset trees.
Random Forest
The following figure gives a representation of random forests using bagging.
Grid Search
>> from sklearn.ensemble import RandomForestClassifier
>> hyper_params = [{'n_estimators' : parameters,
'max_depth' : parameters,
'min_samples_leaf' : parameters,
'min_samples_split' : parameters,
'max_features' : parameters,
'max_leaf_nodes' : parameters,
'criterion' : ['entropy', 'gini']}]
# n_estimators: number of trees in the forest
# max_depth: maximum depth of the tree
# min_samples_leaf: minimum number of samples required to be at a leaf node
# min_samples_split: minimum no. of samples reqd
# max_features: no. of features to consider when looking for the best split
# max_leaf_nodes: maximum number of possible leaf nodes
# criterion: function to measure the quality of a split
>> from sklearn.model_selection import GridSearchCV
>> model = RandomForestClassifier()
>> model_cv = GridSearchCV(estimator=model, param_grid=hyper_params,
                           cv=folds, n_jobs=-1, verbose=1)
>> model_cv.fit(X_train, y_train)
Train Model
>> model_final = RandomForestClassifier(bootstrap=True,
n_estimators=best_parameter,
max_depth=best_parameter,
min_samples_leaf=best_parameter,
min_samples_split=best_parameter,
max_features=best_parameter,
max_leaf_nodes=best_parameter,
criterion=best_parameter)
>> model_final.fit(X_train, y_train)
>> model_final.score(X_test,y_test)
Test Model
>> from sklearn.metrics import classification_report, confusion_matrix
>> y_pred = model_final.predict(X_test)
>> classification_report(y_test, y_pred)
>> confusion_matrix(y_test,y_pred)
4.4.6. BOOSTING
Boosting is a very powerful idea in the field of machine learning. Some of the most popular boosting
algorithms being used are AdaBoost, Gradient Boosting and XGBoost.
Boosting
The key idea of boosting is to create an ensemble which makes high errors only on the less
frequent data points. Consider the following example showing the expected errors of two models.
Model 1 : This model works extremely well for the majority of the data points but goes horribly wrong
for a few data points.
Model 2 : This model works consistently for all the data points but is not preferable as its total
expected error is worse than that of the first model.
As can be seen the aim is to choose a model which minimises the total expected error. Boosting
leverages the fact that one can build a series of models specifically targeted at the data points
which have been incorrectly predicted by the other models in the ensemble. If a series of models
keep reducing the average error, one will have an ensemble having extremely high accuracy. To
summarize boosting is a way of generating a strong model from a weak learning algorithm (SVM,
regression, and other techniques are algorithms which are used to create models). The following
figure gives a representation of the same.
Usually a weak learning algorithm produces a model that does marginally better than a random
guess (a random guess has a 50 % chance of being right) having around 60 to 70 % chance of
being correct. The final objective of boosting algorithms is to create a strong model by making an
ensemble of such weak models. Weak learners can be created by applying regularisation rules.
AdaBoost Algorithm
AdaBoost stands for Adaptive Boosting. The AdaBoost algorithm improves the performance by
boosting the probability of some points while suppressing the probability of others. Consider a
model M = A(T , D) , which uses an algorithm A over a training data set T having uniform
distribution D (the weight assigned to each data point). Now, with each new model, the distribution
of the data is changed such that it boosts the probability of points that are incorrectly classified and
suppresses the probability of points that are correctly classified. In other words, the distribution
changes after every iteration. This distribution is used for the calculation of the objective function
that needs to be minimized while optimising the model. The objective function is the expected value of
the loss, which, when the distribution is not uniform, transforms to
E[L] = Σi PD(xi) · L(yi, h(xi))
So, there are essentially two major steps involved in the AdaBoost algorithm,
1. Modification of the current distribution to create a new distribution to generate a new model.
2. Calculation of the weights given to each of the models to get the final ensemble.
Following is the algorithm for the same.
1. The probability distribution over the n data points is initialised uniformly as,
PD1(xi) = 1/n for each of the n data points
2. An algorithm ht is trained on the training data using the respective probabilities.
3. The algorithm is trained such that it performs well on all the data points. Performing well
means that the tth algorithm ht should have a low expected 0−1 loss on the data points under
the current distribution PDt(xi), i.e. a low weighted error
εt = Σi PDt(xi) · 1( ht(xi) ≠ yi )
4. Based on the performance of the algorithm ht, a weight is assigned to it, given by
αt = ½ ln( (1 − εt) / εt )
The higher the value of εt, the larger the denominator and the smaller the numerator, and thus
the smaller the value of αt. Thus, if the algorithm ht has a high error εt, it is assigned a
smaller weight (i.e. it will contribute less to the output of the ensemble). A small numeric
illustration of this weighting and of the re-weighting of the data points is given after the
algorithm below.
5. Once the tth algorithm is added to the ensemble, the training set distribution is recomputed
to assign each data point a probability proportional to how well the current ensemble Ht
performs on the training set, i.e.
PDt+1(xi) ∝ PDt(xi) · e^(−αt yi ht(xi))
It can be seen that if ht(xi) = yi, then yi ht(xi) = 1, which means that e^(−αt yi ht(xi)) = e^(−αt).
Similarly, if ht(xi) ≠ yi, then yi ht(xi) = −1, which means that e^(−αt yi ht(xi)) = e^(αt). Thus, the
exponential term is smaller if the algorithm's prediction agrees with the true value, and a higher
probability is assigned to the ith data point if the algorithm ht is wrong on xi.
The idea behind this is that the (t + 1)th algorithm will correct the errors that the first t
algorithms made on the training set. More specifically, after selecting the first t algorithms, it
is ensured that the (t + 1)th algorithm performs well on those data points for which the first t
algorithms had performed poorly.
6. The above steps, from Step 2 to Step 5, are repeated N times to construct an ensemble of N
models. Each model depends on the previous model, so these are constructed sequentially.
It should be noted that before applying the AdaBoost algorithm, one should remove the Outliers.
Since, AdaBoost tends to boost up the probabilities of misclassified points and there is a high
chance that outliers will be misclassified, it will keep increasing the probability associated with the
outliers and make the progress difficult. The following figure gives a representation of AdaBoost
algorithm.
Train Model
>> from sklearn.tree import DecisionTreeClassifier
# Base estimator shallow decision tree : a weak learner with max_depth as 2
>> model_0 = DecisionTreeClassifier(max_depth=2, random_state=100)
# Training the shallow decision tree
>> model_0.fit(X_train, y_train)
>> from sklearn.ensemble import AdaBoostClassifier
# Training the adaboost model
>> model = AdaBoostClassifier(base_estimator=model_0, n_estimators = no_trees)
>> model.fit(X_train, y_train)
Predict
>> y_pred = model.predict(X_test)
Gradient Boosting Algorithm
In gradient boosting, each new model is fitted to the errors (gradients of the loss) of the current
ensemble. As with the XGBoost algorithm described later, a crude model F0(xi) having a constant output
is initialised first (Step 1) and the loss L(yi, Ft(xi)) of the current model Ft is computed using a
suitable loss function (Step 2). The remaining steps are as follows.
3. In order to reduce this loss an incremental model ht+1 is generated, which is fitted on the
target values given by the negative gradient of the current loss function or pseudo-residuals, i.e.
yi′ = − ∂L(yi, Ft(xi)) / ∂Ft(xi)
4. Once the model ht+1 is determined, the next model Ft+1 is computed as,
Ft+1(xi) = Ft(xi) + λt ht+1(xi)
5. The above steps, from Step 2 to Step 4, are repeated N times to construct an ensemble of N
models. Each model depends on the previous model, so these are constructed sequentially.
6. The final model is given by,
FN(xi) = F0(xi) + Σt λt ht+1(xi)
λt is the learning rate having values typically between 0 and 1 . Smaller values of λt lead to a
larger number of trees because with a slower learning rate, larger number of trees are required to
reach the minima. This, in turn, leads to longer training time. On the other hand, if λt is large, a
lesser number of trees will be required, but there's the risk that the minima might actually be missed
altogether because of the long strides.
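As a hedged sketch of Steps 1 to 6 for the squared-error loss (where the pseudo-residuals reduce to
yi − Ft(xi)), the loop below uses sklearn's regression tree as the incremental model; the hyperparameter
values and variable names are illustrative only.
>> import numpy as np
>> from sklearn.tree import DecisionTreeRegressor
>> learning_rate, n_trees, trees = 0.1, 100, []
# Step 1 : the crude model F_0 is a constant (the mean of y for the squared-error loss)
>> F = np.full(len(y_train), y_train.mean())
>> for t in range(n_trees):
...     residuals = y_train - F                          # Step 3 : negative gradient of the squared loss
...     tree = DecisionTreeRegressor(max_depth=2)        # shallow tree h_{t+1}
...     tree.fit(X_train, residuals)                     # fit on the pseudo-residuals
...     F = F + learning_rate * tree.predict(X_train)    # Step 4 : F_{t+1} = F_t + lambda * h_{t+1}
...     trees.append(tree)
# Final model : the constant initial prediction plus the weighted sum of all the trees
>> F_test = y_train.mean() + learning_rate * sum(t.predict(X_test) for t in trees)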
XGBoost Algorithm
Extreme Gradient Boosting (XGBoost) is similar to the Gradient Boosting framework but a more
efficient and advanced implementation of Gradient Boosting algorithm. It has become one of the
most popular algorithms due to its robust accuracy. Both XGBoost and GBM follow the principle of
gradient boosted trees, but XGBoost uses a more regularised model formulation to control
overfitting, which gives it better performance, which is why it’s also known as regularised boosting
technique.
Any ideal machine learning model has an objective function which is a sum of the Loss function L
and regularization Ω . The Loss function controls the predictive power of the algorithm and the
regularization Ω controls its simplicity. The Gradient Boosting algorithm has its objective function as
only the Training Loss while the XGBoost algorithm has an objective function that constitutes a loss
function evaluated over all predictions and sum of regularization terms for all predictors. XGBoost
algorithm is the same as the Gradient Boosting algorithm, just that the model being fit at each
iteration is a decision tree. Hence, there is a change in Step 3 and Step 4. Following is the
algorithm for the same.
1. A crude model F 0 (xi ) is initialized having a constant value as the output.
2. The loss L(y i , F t (xi )) for the current model can be computed using various loss functions
depending upon the kind of model being used. AdaBoost uses an exponential loss function.
As the squared error loss functions are very sensitive to outliers, alternate loss functions,
such as the Huber loss criterion can be used. The Huber loss criterion is defined as follows,
3. In order to reduce this loss, a shallow decision tree Dt with Jt terminal nodes is generated,
which is fitted on the target values given by the negative gradient of the current loss function
or pseudo-residuals, i.e.
yi′ = − ∂L(yi, Ft(xi)) / ∂Ft(xi)
In simple words, an algorithm ht+1 is trained using the training set {xi, yi′}. The terminal node
regions Rjt (j = 1, ..., Jt) are given by the leaves of the decision tree Dt.
4. The incremental model ht+1 consists of the leaf outputs Ωj, where each Ωj is chosen to
minimise the loss over the points falling in the region Rjt, i.e.
Ωj = argminΩ Σ(xi ∈ Rjt) L(yi, Ft(xi) + Ω)
Thus,
ht+1(xi) = Σj Ωj · 1(xi ∈ Rjt)
5. Once the model ht+1 is determined, the next model Ft+1 is computed as,
Ft+1(xi) = Ft(xi) + λt ht+1(xi)
6. The above steps, from Step 2 to Step 5, are repeated N times to construct an ensemble of N
models. Each model depends on the previous model, so these are constructed sequentially.
7. The final model is given by,
FN(xi) = F0(xi) + Σt λt ht+1(xi)
Advantages of XGBoost
1. Parallel Computing : XGBoost, by default uses all the cores of the machine enabling its
capacity to do parallel computation.
2. Regularization : The biggest advantage of XGBoost is that it uses regularization and controls
the overfitting and simplicity of the model which gives it better performance.
3. Enabled Cross Validation : XGBoost is enabled with internal Cross Validation function.
4. Missing Values : XGBoost is designed to handle missing values internally. The missing values
are treated in such a manner that if there exists any trend in missing values, it is captured by
the model.
Train Model
>> from sklearn.ensemble import GradientBoostingClassifier
>> model = GradientBoostingClassifier(max_depth=2, n_estimators=200)
>> model.fit(X_train, y_train)
Predict
>> y_pred = model.predict(X_test)
Evaluate Model
>> from sklearn.metrics import accuracy_score
>> score = accuracy_score(y_test, y_pred)
Train Model
>> import xgboost as xgb
>> from xgboost import XGBClassifier
>> from xgboost import plot_importance
>> params = {'learning_rate': 0.2,'max_depth': 2,'n_estimators':200,'subsample':0.6,
'objective':'binary:logistic'}
>> model = XGBClassifier(**params)
>> model.fit(X_train, y_train)
Predict
>> y_pred = model.predict_proba(X_test)
Evaluate Model
>> from sklearn.metrics import roc_auc_score
>> score = roc_auc_score(y_test, y_pred[:, 1])
>> importance = dict(zip(X_train.columns, model.feature_importances_))
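Since plot_importance is imported above but not used, the feature importances of the trained model can
also be visualised directly (a small usage sketch, assuming matplotlib is installed).
>> import matplotlib.pyplot as plt
>> plot_importance(model)      # bar chart of the feature importance scores
>> plt.show()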
Decision Tree Regression
The following figure shows an example of Decision Tree Regression with a random sample set
having five data points.
The following figure shows examples of predictions of some Decision Tree Regression models.
Train Model
>> from sklearn.tree import DecisionTreeRegressor
>> model = DecisionTreeRegressor(max_depth=2)
>> model.fit(X_train, y_train)
Predict
>> y_pred = model.predict(X_test)
Evaluate Model
>> import numpy as np
>> from sklearn.metrics import mean_squared_error
>> rmse = np.sqrt(mean_squared_error(y_test, y_pred))
Train Model
>> from sklearn.ensemble import RandomForestRegressor
>> model = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=0)
>> model.fit(X_train, y_train)
Predict
>> y_pred = model.predict(X_test)
Evaluate Model
>> import numpy as np
>> from sklearn.metrics import mean_squared_error
>> rmse = np.sqrt(mean_squared_error(y_test, y_pred))
KNN Algorithm
The KNN algorithm works on the principle that similar things exist in close proximity. In other words,
“Birds of a feather flock together”. Suppose, there is an image of a creature that looks similar to a
cat and a dog, but the aim is to find whether it is a cat or dog. In order to identify the creature, one
has to compare the similarities of the given creature with the existing details about cats and dogs.
Finally, based on the most similar features one can tag the creature either as a cat or a dog. Similarly
the KNN algorithm hinges on this idea of similarity (also known as distance, proximity, or closeness).
The K-NN algorithm assumes the similarity between the new case/data and available cases and puts
the new case into the category that is most similar to the available categories. It is also called a lazy
learner algorithm because it does not learn from the training set immediately instead it stores the
dataset and at the time of classification, it performs an action on the dataset. Consider the following
figure showing two categories A and B , with a new data point x needing to be categorized
correctly to one of A or B .
In order to classify the data point x, its five nearest neighbours, i.e. the five points having the
minimum distance from x, are considered. As can be seen, three of these nearest neighbours are from
category A, hence the data point x must belong to category A.
Following is the algorithm for the same.
1. The value of K is initialized.
2. For each training data point xi, the distance di from the test data point is computed. The
Euclidean distance is the most popular choice of distance metric; other metrics such as
Chebyshev, cosine, etc. can also be used. The Euclidean distance between two data points x and
y is given by,
d(x, y) = √( Σj (xj − yj)² )
3. The computed distances are sorted and the K training points nearest to the test data point
are selected.
4. The test data point is assigned to the class that is most frequent among these K neighbours
(majority vote); for regression, the average of the K neighbours' values is used instead.
Disadvantages of KNN
1. It is always required to determine the value of K, which can be tricky at times.
2. It gets significantly slower as the number of examples and/or predictors/independent
variables increase. The computation cost is high because of calculating the distance
between the data points for all the training samples.
Train Model
>> from sklearn.neighbors import KNeighborsClassifier
>> model = KNeighborsClassifier(n_neighbors=5)
>> model.fit(X_train, y_train)
Predict
>> y_pred = model.predict(X_test)
Evaluate Model
>> from sklearn.metrics import confusion_matrix
>> cm = confusion_matrix(y_test, y_pred)
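To address the first disadvantage (choosing the value of K), one option is to tune n_neighbors with the
same grid-search pattern used earlier in these notes; the parameter range below is purely illustrative
(a hedged sketch).
>> from sklearn.model_selection import GridSearchCV
>> from sklearn.neighbors import KNeighborsClassifier
>> hyper_params = [{'n_neighbors': list(range(1, 31, 2))}]   # odd values help avoid ties in the vote
>> model_cv = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=hyper_params,
                           cv=5, scoring='accuracy', n_jobs=-1)
>> model_cv.fit(X_train, y_train)
>> model_cv.best_params_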
The following figures show the performance of the various models used over the data.
Logistic Regression : The diagonal line is clearly not able to differentiate the two classes, with a
lot of misclassifications.
Decision Trees/Random Forests : These do a better job of differentiation, but can potentially overfit
with an increase in the number of trees.
SVM : It does a really good job of differentiating the two classes almost perfectly.
5. CLUSTERING
CLUSTERING
Clustering
Clustering refers to a very broad set of techniques for finding subgroups or segments in a data set.
Clustering techniques use the raw data to form clusters based on common factors among various
data points. The following figure shows the difference between clustering and classification.
For successful segmentation,
1) The segments formed need to be stable (i.e. the same data point should not fall under
different segments upon segmenting the data on the same criteria).
2) The segments should have intra-segment homogeneity (i.e. the behaviour in a segment
needs to be similar).
3) The segments should have inter-segment heterogeneity (i.e. the behaviour between
segments needs to be different).
There are mainly three types of segmentation used for customer segmentation.
1) Behavioural segmentation : Segmentation is based on the actual behavioural patterns
displayed by the person.
2) Attitudinal segmentation : Segmentation is based on the beliefs or intentions of the person,
which may not translate into similar action.
3) Demographic segmentation : Segmentation is based on the person's profile (such as age,
gender, income or location).
Behavioural Segmentation
The commonly used customer behavioural segmentation techniques are as follows.
RFM : Recency (time since last purchase/visit), Frequency (total number of purchases/visits) and
Monetary (total revenue generated).
RPI : Relationship (past interaction), Persona (type of customer) and Intent (intention at the time of
purchase).
CDJ : Consumer Decision Journey, based on the customer's life journey with the product.
K-Means Algorithm
For a dataset where,
N is the total no of observations
K is the total number of desired clusters
C k denotes the set containing the observations in k th cluster
The following algorithm is applied to create K clusters.
1. Initialisation : The cluster centers for each of the K clusters are randomly picked. These can
either be from any of the N observations or totally a different point.
2. Assignment : Each observation n is assigned to the cluster whose cluster center is the
closest to it. The closest cluster center is found using the squared Euclidean distance.
The equation for the assignment step is as follows,
assign xi to cluster k* = argmink ‖xi − μk‖²
The observations are to be assigned in such a way that they satisfy the following conditions.
a. C1 ∪ C2 ∪ ... ∪ CK = {1, 2, ..., N}, i.e. each observation belongs to at least one of the K
clusters.
b. Ck ∩ Ck′ = ∅ where k ≠ k′, i.e. no observation belongs to more than one cluster.
3. Optimisation : For each of the K clusters, the cluster center is recomputed as the vector of
the p feature means (the centroid) of the observations in the kth cluster.
4. Iteration : The process of assignment and optimisation is repeated until there is no change in
the clusters or possibly until the algorithm converges.
The following figures are a graphical representation of the K-Means algorithm.
Initialization : the K-Means algorithm starts with K = 3 randomly selected cluster centers.
Iteration 1, Step 1 : each observation is assigned to the nearest cluster center.
Iteration 1, Step 2 : the cluster centers are recomputed using the squared Euclidean distance.
Iteration 2, Step 1 : each observation is again assigned to the nearest cluster center.
Iteration 2, Step 2 : the cluster centers are recomputed, leading to new centers.
Final Result : after N iterations the result is obtained when the clusters stop changing.
Cost Function
In the K-Means algorithm it is known that the centroids being computed are the cluster means for
each feature and are the constants that minimize the sum of squared deviations. Added to this, the
reallocation of the observations help in improving the above result until the result no longer changes
(i.e. the optimum point is reached), thus giving the cost function,
J(C1, C2, ..., CK) = Σk Σ(xi ∈ Ck) ‖xi − μk‖²
where μk is the center (mean) of the cluster Ck.
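A minimal numpy sketch (illustrative, not from the original notes) of the assignment and optimisation
steps together with the cost being minimised, assuming data is a numeric n × p numpy array and that no
cluster becomes empty.
>> import numpy as np
>> K = 3
>> rng = np.random.default_rng(0)
>> centers = data[rng.choice(len(data), K, replace=False)]              # random initialisation
>> for _ in range(50):
...     # Assignment : squared Euclidean distance of every point to every center
...     dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
...     labels = dist.argmin(axis=1)
...     # Optimisation : recompute each center as the mean of its assigned points
...     centers = np.array([data[labels == k].mean(axis=0) for k in range(K)])
# Cost function : sum of squared distances of the points to their cluster centers
>> cost = ((data - centers[labels]) ** 2).sum()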
K-Means++ Algorithm
In the K-Means algorithm during the initialisation step the cluster centers are selected randomly. This
may lead to the algorithm converging at a local minima. So, one can choose the cluster centers
smartly during the initialisation to achieve the global minima using the K-Means++ algorithm.
K-Means++ is just an initialisation procedure for K-Means for picking the initial cluster centers using
an algorithm that tries to initialise the cluster centers that are far apart from each other.
The following algorithm is applied to select the cluster centers.
1. One of the data points is randomly chosen as a cluster center.
2. For each data point xi, the distance di is computed, being the distance between xi and the
nearest center that has already been chosen.
3. The next cluster center is chosen using a weighted probability distribution, where a point x
is chosen with probability proportional to di².
4. Steps 2 and 3 are repeated until K cluster centers have been chosen (sklearn exposes this
initialisation directly, as shown in the snippet below).
The following figure is a graphical representation of the K-Means++ algorithm.
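In practice this initialisation does not need to be coded by hand; sklearn's KMeans exposes it through
the init parameter (a small usage note with illustrative arguments).
>> from sklearn.cluster import KMeans
# init='k-means++' picks far-apart initial centers; init='random' gives the plain random initialisation
>> model = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
>> model.fit(data)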
Practical Considerations for K-Means
Some additional points to be considered while using the K-Means algorithm are as follows.
5. Since the K-Means algorithm tries to allocate each of the data points to one of the clusters,
outliers have a serious impact on the performance of the algorithm and prevent optimal
clustering.
6. There is a chance that the algorithm may not converge in the given number of iterations, so
one needs to always check for convergence.
Silhouette Analysis
Silhouette analysis is an approach to choosing the value of K for the K-Means algorithm. The
Silhouette Analysis or Silhouette Coefficient is a measure of how similar a data point is to its own
cluster (cohesion) compared to other clusters (separation).
For a given dataset where,
a(i) is the average distance of the point xi from the other points of its own cluster (cohesion)
b(i) is the average distance of the point xi from the points of the nearest neighbouring cluster
(separation)
Silhouette Value is given by,
S(i) = ( b(i) − a(i) ) / max( a(i), b(i) )
The value of S(i) varies between −1 and 1. For S(i) to be close to 1, it is required that a(i) ≪ b(i).
As a(i) is a measure of dissimilarity of xi with its own cluster, a small value of a(i) means it is well
matched. Furthermore, a large value of b(i) implies that xi is badly matched to its neighbouring
cluster. Thus, S (i) values close to 1 means that the data is appropriately clustered. If S (i) is close to
− 1 , it implies that xi would be more appropriate if it were clustered in its neighbouring cluster. An
S (i) value close to 0 means that xi is on the border of two natural clusters.
The average S (i) over all the points of a cluster measures how tightly grouped all the points in the
cluster are. Thus the average S (i) over all data of the entire dataset is a measure of how
appropriately the data has been clustered. If there are too many or too few clusters, some of the
clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and
averages are used to determine the natural number of clusters within a dataset as shown in the
following figure.
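A hedged sketch of computing the average silhouette value for a range of K using sklearn (data is
assumed to be the scaled feature matrix used in the other snippets).
>> from sklearn.cluster import KMeans
>> from sklearn.metrics import silhouette_score
>> for k in range(2, 11):
...     labels = KMeans(n_clusters=k, random_state=42).fit_predict(data)
...     print(k, silhouette_score(data, labels))   # pick the K with the highest average value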
Elbow Method
Another approach to choosing K is the elbow method, which plots the within-cluster Sum of Squared
Errors (SSE) against K. The intention is to always have a small SSE, but the SSE tends to decrease
towards 0 with the increase in K (the SSE is 0 when K is equal to the number of data points in the
dataset, because then each data point is its own cluster, and there is no error between it and the
center of its cluster).
Thus, the goal is to choose a small value of K which still has a low SSE, and the elbow usually
represents the point from where the returns start diminishing with increasing values of K . However,
the elbow method doesn't always work well, especially if the data is not very clustered. In such
cases one can try the Silhouette analysis, or even reevaluate whether clustering is the right thing to
do.
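The elbow curve itself can be drawn by plotting the SSE (exposed by sklearn's KMeans as inertia_)
against K; an illustrative snippet, assuming matplotlib is available.
>> import matplotlib.pyplot as plt
>> from sklearn.cluster import KMeans
>> sse = [KMeans(n_clusters=k, random_state=42).fit(data).inertia_ for k in range(1, 11)]
>> plt.plot(range(1, 11), sse, marker='o')
>> plt.xlabel('number of clusters K')
>> plt.ylabel('SSE (within-cluster sum of squares)')
>> plt.show()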
Cluster Tendency
Before applying the clustering algorithm to any given dataset, it is important to check whether the
given data has some meaningful clusters or not (i.e the data is not random). This process of
evaluating the data for its feasibility for clustering is known as the Clustering Tendency. The
clustering algorithms will return K clusters even when the dataset does not have any meaningful
clusters.
Hopkins Test
The Hopkins test is used to assess the clustering tendency of a data set by measuring the
probability that a given data set is generated by a uniform data distribution. In other words, it tests
the spatial randomness of the data.
For a dataset where,
X is a set of n data points
pi is a member of a random sample (without replacement) of m ≪ n data points.
xi is the distance of pi ∈ X , from its nearest neighbour in X
Y is a set of m uniformly randomly distributed data points (similar to X )
q i is a member of a data set Y
y i is the distance of q i ∈ Y , from its nearest neighbour in X
Hopkins Test is given by,
H = Σ yi / ( Σ xi + Σ yi )
The value of H varies between 0 and 1 . If X were uniformly distributed, then Σ xi and Σ y i would
be close to each other, and thus H would be about 0.5 . However, if X were to have clusters, then
the distances for artificial points Σ y i would be substantially larger than for the real ones Σ xi , thus
increasing the value of H. A value of H between 0.01 and 0.3 indicates that the data is regularly
spaced, a value around 0.5 indicates that the data is random, and a value of more than 0.7 indicates a
clustering tendency at the 90% confidence level. The following figure shows the clustering tendency
for various values of the Hopkins statistic.
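A hedged from-scratch sketch of the Hopkins statistic as defined above, using nearest-neighbour
distances (X is assumed to be a numeric numpy array; the 10% sampling fraction is only a common
choice).
>> import numpy as np
>> from sklearn.neighbors import NearestNeighbors
>> def hopkins(X, m=None, seed=42):
...     rng = np.random.default_rng(seed)
...     n, d = X.shape
...     m = m or max(1, int(0.1 * n))                                # m << n sampled points
...     nbrs = NearestNeighbors(n_neighbors=2).fit(X)
...     # x_i : distance of sampled real points from their nearest neighbour in X
...     sample = X[rng.choice(n, m, replace=False)]
...     x_dist = nbrs.kneighbors(sample, n_neighbors=2)[0][:, 1]     # skip the point itself
...     # y_i : distance of uniformly random points from their nearest neighbour in X
...     uniform = rng.uniform(X.min(axis=0), X.max(axis=0), (m, d))
...     y_dist = nbrs.kneighbors(uniform, n_neighbors=1)[0][:, 0]
...     return y_dist.sum() / (x_dist.sum() + y_dist.sum())
>> hopkins(X)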
Build Model
>> from sklearn.cluster import KMeans
>> model = KMeans(n_clusters = best_k, max_iter=50)
>> model.fit(data)
Analyse Model
>> data.index = pd.RangeIndex(len(data.index))
>> data = pd.concat([data, pd.Series(model.labels_)], axis=1)
>> data.columns = ['id','recency_col','frequency_col','monetary_col','cluster_id']
>> data['recency_col'] = data['recency_col'].dt.days
>> cluster_R = pd.DataFrame(data.groupby(['cluster_id']).recency_col.mean())
Hierarchical Clustering Algorithm
The y-axis of the dendrogram is some measure of dissimilarity or distance at which clusters join. In
the dendrogram given, samples 5 and 7 are the most similar and so join to form the first cluster,
followed by samples 1 and 10 . The last two clusters to fuse together to form the final single cluster
are 10 − 9 − 1 − 5 − 7 − 8 and 2 − 3 − 6 − 4 . There are two types of hierarchical clustering algorithms.
1. Agglomerative : It is a bottom-up algorithm which starts with n distinct clusters and
iteratively merge (or agglomerate) pairs of clusters (which have the smallest dissimilarity)
until all clusters have been merged into a single cluster.
2. Divisive : It is a top-down algorithm which proceeds by splitting (or dividing) a single cluster
(resulting in the biggest dissimilarity) recursively until n distinct clusters are reached. Even
though divisive algorithms are more complex than agglomerative algorithms, they produce
more accurate hierarchies in certain scenarios. The bottom-up algorithms make clustering
decisions based on local patterns, without initially taking the global distribution into account,
and these early decisions cannot be undone. The top-down algorithms, on the other hand, benefit from
having the complete information about the global distribution while partitioning.
Linkages
In a clustering algorithm, before any clustering is performed, it is required to determine the proximity
matrix (displaying the distance between each cluster) using a distance function. This proximity or
linkage between any two clusters can be measured by the following three distance functions.
1. Single Linkage : The distance between two clusters is defined as the shortest distance between
two points in each cluster. It suffers from chaining: in order to merge two groups, only one pair
of points is required to be near to each other, irrespective of the location of all the other
points. Therefore the clusters can be too spread out and not compact enough.
2. Complete Linkage : The distance between two clusters is defined as the longest distance between
two points in each cluster. It suffers from crowding: its score is based on the worst-case
dissimilarity between pairs, making a point closer to points in other clusters than to points in
its own cluster. Thus, the clusters are compact, but not far enough apart.
3. Average Linkage : The distance between two clusters is defined as the average distance between
each point in one cluster and every point in the other cluster. It tries to strike a balance
between the two by using the mean pairwise dissimilarity, so the clusters tend to be relatively
compact as well as relatively far apart.
The type of linkage to be used is usually decided after having a look at the data. One convenient
way to decide is by looking at how the dendrogram looks. Usually, single linkage types produce
dendrograms which are not structured properly, whereas complete or average linkages produce
dendrograms which have a proper tree-like structure.
Dendrogram
Determining the number of groups in a cluster analysis is usually the primary goal. To determine the
cutting section, various methods can be used.
1. By means of observation or experience based on the knowledge of the researcher. For
example, based on appearance differences, clusters can be divided into groups.
2. Using statistical conventions, the dendrogram can be cut where the difference is most
significant. Another technique is to use the square root of the number of observations. Yet
another technique is to use at least 70% of the distance between the two groups.
3. Using the function discrimination and classification based on discrimination function.
Thus, the use of the method for cutting a dendrogram depends on the purpose. Typically, one can
look for natural groupings defined by long stems. Once the appropriate level is decided upon, one
can obtain the clusters by cutting the dendrogram at that appropriate level. The number of vertical
lines intersecting the cutting line represents the number of clusters. The following figures show the
various clusters obtained at various levels of dissimilarity.
Build Model
>> from scipy.cluster.hierarchy import linkage
>> from scipy.cluster.hierarchy import dendrogram
>> model = linkage(data, method = 'single', metric='euclidean')
# method can be any of the values single, complete or average
>> dendrogram(model)
>> plt.show()
Cut Dendrogram
>> from scipy.cluster.hierarchy import cut_tree
>> clusterCut = pd.Series(cut_tree(model, n_clusters = 5).reshape(-1,))
Analyse Model
>> data = pd.concat([RFM, clusterCut], axis=1)
>> data.columns = ['id','recency_col','frequency_col','monetary_col','cluster_id']
>> data['recency_col'] = data['recency_col'].dt.days
>> cluster_R = pd.DataFrame(data.groupby(['cluster_id']).recency_col.mean())
>> cluster_F = pd.DataFrame(data.groupby(['cluster_id']).frequency_col.mean())
>> cluster_M = pd.DataFrame(data.groupby(['cluster_id']).monetary_col.mean())
>> data_plot = pd.concat([pd.Series(range(k)),cluster_R,cluster_F, cluster_M],axis=1)
K-Modes Algorithm
K-Modes extends K-Means to categorical data by replacing the cluster means with modes and the
Euclidean distance with a dissimilarity (matching) score. The following algorithm is applied to select
the clusters.
1. Initialisation : The cluster centers for each of the K clusters are randomly picked.
2. Assignment : Each observation n is assigned to the cluster whose cluster center is the
closest to it. The closest cluster center is found using the dissimilarity score (i.e.
quantification of the total mismatches between two objects, the smaller this number, the
more similar the two objects) rather than the squared Euclidean distance. The equation for
the assignment step is as follows.
3. Optimisation : For each of the K clusters, the cluster center is computed such that the k th
cluster center is the vector of the p feature modes (rather than the means) for the
observations in the k th cluster. A mode is a vector of elements that minimizes the
dissimilarities between the vector itself and each object of the data.
4. Iteration : The process of assignment and optimisation is repeated until there is no change in
the clusters or possibly until the algorithm converges.
Build Model
>> from kmodes.kmodes import KModes
>> model = KModes(n_clusters=k, init = 'Cao', n_init = 1, verbose=1)
# init can be any of the values Cao, Huang or random
>> clusters = model.fit_predict(data)
>> cluster_centroids_df = pd.DataFrame(model.cluster_centroids_)
>> cluster_centroids_df.columns = data.columns
Analyse Model
>> data = data.reset_index()
>> clusters_df = pd.DataFrame(clusters)
>> clusters_df.columns = ['cluster_predicted']
>> data = pd.concat([data, clusters_df], axis = 1).reset_index()
>> data = data.drop(['index', 'level_0'], axis = 1)
K-Prototype Algorithm
K-Prototype combines K-Means and K-Modes so that datasets containing a mix of numerical and
categorical features can be clustered. The following algorithm is applied to select the clusters.
1. Initialisation : The cluster centers for each of the K clusters are randomly picked.
2. Assignment : Each observation n is assigned to the cluster whose cluster center is the
closest to it. The closest cluster center is found using both the squared Euclidean distance
and the dissimilarity score. The equation for the assignment step is as follows.
3. Optimisation : For each of the K clusters, the cluster center is computed such that the k th
cluster center is the vector of the p feature modes/means for the observations in the k th
cluster.
4. Iteration : The process of assignment and optimisation is repeated until there is no change in
the clusters or possibly until the algorithm converges.
Build Model
>> from kmodes.kprototypes import KPrototypes
>> model = KPrototypes(n_clusters=k, init='Cao')
# init can be any of the values Cao, Huang or random
>> clusters = model.fit_predict(data, categorical=['categorical_col'])
>> cluster_centroids_df = pd.DataFrame(model.cluster_centroids_)
>> cluster_centroids_df.columns = data.columns
Analyse Model
>> data['cluster_id'] = clusters
>> data_plot = pd.DataFrame(data['cluster_id'].value_counts())
>> sns.barplot(x=data_plot.index, y=data_plot['cluster_id'])
DBSCAN Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are
packed closely, marking the points lying in low-density regions as outliers. It uses two parameters:
the neighbourhood radius ε and the minimum number of points required to form a dense region. The
following algorithm is applied to select the clusters.
1. Initialisation : A random observation n is selected.
2. Assignment : If the observation n has the minimum number of close neighbours, it is
considered as a part of the cluster.
3. Grouping : The process of assignment is repeated recursively for all of the n 's neighbours,
then for the neighbours' neighbours, and so on. These are the cluster's core members.
A point p is a core point if at least the minimum number of points are within the distance ε
of it (including p ). A point q is directly reachable from p if point q is within distance ε from
core point p . A point q is also reachable from p if there is a path p1 , p2 , ... pn with p1 = p
and pn = q , where each pi+1 is directly reachable from pi (i.e. all points on the path must be
core points, with the possible exception of q ). All points not reachable from any other points
are outliers. If p is a core point, then it forms a cluster together with all points (core or
non-core) that are reachable from it. Each cluster contains at least one core point where the
non-core points can be a part of the cluster forming its edge, since they cannot be used to
reach more points. The following figure represents the same.
The minimum number of points is taken as 4 .
Point A and the other red points are core points,
as the area surrounding these points in a radius
of ε contain at least 4 points (including the
point itself). As all these points are reachable
from one another, they form a single cluster.
Points B and C are not core points, but are still
reachable from A (via other core points), thus
they belong to the cluster as well. Point N is an
outlier as it is neither a core point nor is
reachable directly via other points.
Build Model
>> from sklearn.cluster import DBSCAN
>> model = DBSCAN(eps=radius, min_samples=n)
>> clusters = model.fit(data)
>> clusters.labels_
Hard Clustering
In hard clustering, each data point is assigned to any one cluster completely (i.e. there is no
consideration of uncertainty in the assignment of a data point). The limitation with hard clustering is
that every data point gets assigned to one of the clusters even if it does not follow the clustering
trend completely. For example, while clustering a set of customers into two groups, high value and low
value customers, one may end up forcing the average value customers into either one of the clusters.
Soft Clustering
In soft clustering, the assignment of the data point to a cluster is based on the probability or
likelihood of that data point existing in that cluster. For soft clustering, a class of algorithms
called GMMs or Gaussian Mixture Models is used.
Advantages of GMM
The GMM has two major advantages over K-Means as follows,
1. GMM is a lot more flexible in terms of cluster covariance : K-Means is actually a special case
of GMM in which each cluster’s covariance along all dimensions approaches 0 . This implies
that a point will get assigned only to the cluster closest to it. With GMM, each cluster can
have unconstrained covariance structure (think of rotated or elongated distribution of points
in a cluster, instead of spherical as in K-Means). As a result, cluster assignment is much more
flexible in GMM than in K-Means.
2. GMM model accommodates mixed membership : Another implication of the covariance
structure is that GMM allows for mixed membership of points to clusters. In K-Means, a point
belongs to one and only one cluster, whereas in GMM a point belongs to each cluster to a
different degree. The degree is based on the probability of the point being generated from
each cluster’s (multivariate) normal distribution, with cluster center as the distribution’s mean
and cluster covariance as its covariance. Depending on the task, mixed membership may be
more appropriate (e.g. news articles can belong to multiple topic clusters) or not (e.g.
organisms can belong to only one species).
GMM Algorithm (Expectation-Maximisation)
The parameters of the GMM (the means μk, the covariances Σk and the mixture weights ϕk of the K
components) are estimated iteratively using the Expectation-Maximisation (EM) algorithm.
1. Initialisation : The values of μk, ϕk and Σk are initialised (for example randomly, or using
the output of a K-Means run).
2. Expectation : For each data point xi and each component Ck, the probability p(Ck ∣ xi) of the
point belonging to that component is computed using the current values of μk, ϕk and Σk.
3. Maximisation : For each of the components Ck, the values of μk, ϕk and Σk are updated so as to
maximize the expectations calculated above with respect to the model parameters. The equations
for the maximisation step are as follows,
ϕk = (1/N) Σi p(Ck ∣ xi)
μk = Σi p(Ck ∣ xi) xi / Σi p(Ck ∣ xi)
Σk = Σi p(Ck ∣ xi) (xi − μk)(xi − μk)ᵀ / Σi p(Ck ∣ xi)
4. Iteration : The entire iterative process is repeated until the algorithm converges, giving a
maximum likelihood estimate. Intuitively, the algorithm works because knowing the
component assignment C k for each xi makes solving for μk , ϕk and Σk easy, while
knowing μk , ϕk and Σk makes inferring p( C k ∣xi ) easy. The expectation step corresponds to
the latter while the maximization step corresponds to the former. Thus, by alternating
between the values which are assumed fixed or known, the maximum likelihood estimates of
the non-fixed values are calculated in an efficient manner.
The following figures represent the 2-dimensional and 3-dimensional GMM algorithm.
Build Model
>> from sklearn.mixture import GaussianMixture   # current name of the earlier sklearn GMM class
# Constraining the covariance (e.g. 'spherical') behaves similar to the K-Means model
>> model = GaussianMixture(n_components=k, covariance_type='spherical', random_state=n)
# 'full' covariance allows rotated/elongated clusters
>> model = GaussianMixture(n_components=k, covariance_type='full', random_state=n)
>> model.fit(data)
>> clusters = model.predict(data)
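The mixed-membership behaviour described in the advantages above can be inspected through the
per-cluster probabilities of the fitted model (a small usage sketch).
>> probabilities = model.predict_proba(data)    # one probability per data point per cluster
>> probabilities[:5].round(3)                   # soft assignments of the first five points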
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is the process of representing the data of a dataset (or dataframe,
table or matrix) in new columns, different from the original ones (usually for convenience, efficiency
or just from common sense). The following steps detect the principal components.
1. Assignment : The first principal component is approximately the line on which the projection
of the data points is the most spread out. Mathematically, it is the line that maximizes the
variance (the average of the squared distances from the projected data points to the origin).
2. Assignment : The second principal component is calculated in the same way, with the
condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and
that it accounts for the next highest variance.
3. Iteration : The process of assignment is repeated until a total of p principal components
have been calculated, equal to the original number of variables. Each additional component
captures incremental variance.
The above steps detect p Principal Components which is equal to the number of variables in the
dataset. If one chooses to represent the data with k components, such that k < p , then it is called
dimensional reduction.
The following figures are a graphical representation of detecting the principal components. The
objective is to find vectors on which the projected data has the maximum variance. On projecting upon
the Weight axis, the vertically aligned data points get overlapped (i.e. the vertical variation is not
captured), while on projecting upon the Height axis, the horizontally aligned points get overlapped
(i.e. the horizontal variation is not captured).
PCA Algorithm (Eigen Decomposition Method)
The following algorithm is applied to select the principal components using the Eigen
Decomposition method.
1. A covariance matrix is a matrix that explains both the spread (variance, in the diagonal
elements) and the orientation (covariance, in the off-diagonal elements) of the dataset. Thus,
the original matrix X is first centered and scaled and then the covariance matrix Σ is computed
using the following formula,
Σ = (1 / (n − 1)) Xᵀ X
2. In order to represent the covariance matrix with a vector and its magnitude, one can usually
find a vector v that points into the direction of the largest spread of the data, with magnitude
equal to the spread (variance) in that direction. As any linearly transformed square matrix
(N × N) can be completely defined by its eigenvectors and eigenvalues, Eigen Decomposition is
performed on the covariance matrix to get the eigenvalues and eigenvectors by using the
following concept.
An eigenvector v of a linear operator A (described by a square matrix) is a vector that is
mapped to a scaled version of itself, i.e. Av = λv, where λ is the corresponding eigenvalue.
Using this property, any symmetric matrix A of rank r (such as the covariance matrix Σ) can be
represented as,
A = λ1 v1 v1ᵀ + λ2 v2 v2ᵀ + ... + λr vr vrᵀ = V Λ Vᵀ
where V holds the eigenvectors as columns and Λ is the diagonal matrix of eigenvalues. The
eigenvectors corresponding to the largest eigenvalues are the principal components (a numpy
sketch of this procedure follows).
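A hedged numpy sketch of the eigen decomposition route described above, including the sorting and
projection steps that follow it (X is assumed to be a numeric n × p array and k the number of
components to keep).
>> import numpy as np
# Step 1 : center and scale the data, then compute the covariance matrix
>> X_std = (X - X.mean(axis=0)) / X.std(axis=0)
>> cov = np.cov(X_std, rowvar=False)
# Step 2 : eigen decomposition of the (symmetric) covariance matrix
>> eig_vals, eig_vecs = np.linalg.eigh(cov)
>> order = np.argsort(eig_vals)[::-1]                    # sort the eigenvalues largest first
>> eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
# The top-k eigenvectors are the principal components; project the data onto them
>> X_reduced = X_std @ eig_vecs[:, :k]
>> explained_variance_ratio = eig_vals[:k] / eig_vals.sum()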
The following figure shows the algorithm applied to select the principal components using the Singular
Value Decomposition (SVD) method.
Scree Plots
Scree plots are line plots of the eigenvalues of factors or principal components in an analysis. These
are used to determine the most important themes or the directions with the maximum amount of
variance, by plotting the total cumulative variance against each principal component. The plot
displays the eigenvalues in a downward curve, with the eigenvalues ordered from the largest to the
smallest. The elbow of the graph where the eigenvalues seem to level off is deemed as the optimal
point and the components to the left of this point are considered to be significant. The following
figure shows a scree plot.
Practical Considerations for PCA
Some of the points to be considered while implementing PCA are as follows.
1. PCA being very scale sensitive, the data needs to be centered and scaled before feeding to
a classifier model.
2. PCA being inherently a linear transformation method, it significantly boosts the performance
when followed by linear models (like logistic regression). But, it can still give significant
speed boosts to non-linear methods, by reducing the number of variables.
3. If the data already has largely uncorrelated features, a forced reduction in dimensionality
using PCA can lead to severe loss of information. Thus, one needs to be careful while
capturing the significant variance.
As can be seen, even after reducing the number of dimensions from a much higher k = 512 to a very low
k = 16, the image is still pretty much distinct and interpretable, which is a huge gain. Thus, one can
conclude that despite there being some drawbacks in PCA, it is still a very efficient and powerful
technique to reduce the complexity and discover the hidden patterns present in the data.
PCA
>> from sklearn.decomposition import PCA
>> from sklearn.decomposition import IncrementalPCA
>> model = PCA(n_components=number_of_components, svd_solver='randomized')
# IncrementalPCA fits the data in mini-batches and does not take an svd_solver argument
>> model = IncrementalPCA(n_components=number_of_components)
>> model.fit(X_train)
>> model.components_
>> model.explained_variance_ratio_
Scree Plot
>> import numpy as np
>> import matplotlib.pyplot as plt
>> plt.plot(np.cumsum(model.explained_variance_ratio_))
>> plt.xlabel('number of components')
>> plt.ylabel('cumulative explained variance')
LINEAR DISCRIMINANT ANALYSIS