Lecture-5-HCL-DSE - Sumita Narang-2
Lecture – 4, 5 (Part-2)
Sumita Narang
Objectives
Modelling & Evaluation
Choose a Measure of Success
This measure should be directly aligned with the higher-level goals of the business at hand, and it is also directly related to the kind of problem we are facing:
Setting an Evaluation Protocol (1)
Maintaining a Hold Out Validation Set
This method consists of setting apart some portion of the data as the test set.
The process is to train the model on the remaining fraction of the data, tune its parameters with the validation set, and finally evaluate its performance on the test set.
The reason to split the data into three parts is to avoid information leaks. The main drawback of this method is that if there is little data available, the validation and test sets will contain so few samples that the tuning and evaluation of the model will not be effective.
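A minimal sketch of this protocol with scikit-learn; the synthetic dataset and the logistic-regression model are placeholders chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real dataset (illustration only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 60% train, 20% validation, 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))  # used while tuning hyperparameters
print("test accuracy:", model.score(X_test, y_test))      # evaluated once, at the very end
```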
Cross-Validation (1/3)
• Cross-Validation is a very useful technique for assessing the performance
of machine learning models.
• We are given two types of data sets: a known data set (the training data set) and
an unknown data set (the test data set).
Sources: https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
https://magoosh.com/data-science/k-fold-cross-validation/
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
Cross-Validation (2/3)
K-Fold Cross-Validation: If k = 5, the dataset is divided into 5 equal parts and the process below runs 5 times, each time with a different holdout set.
1. Take one group as the test data set
2. Take the remaining groups as the training data set
3. Fit a model on the training set and evaluate it on the test data set
4. Retain the evaluation score and discard the model
At the end of this process, summarize the skill of the model using the average of the model evaluation scores, as sketched below.
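A minimal sketch of the loop above with scikit-learn (the dataset and model are placeholders chosen for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 folds

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # fit on the training folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # retain the score, discard the model

print("mean CV accuracy:", np.mean(scores))                # summarize with the average score
```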
Cross-Validation (3/3)
Leave-One-Out Cross-Validation: It is K-fold cross-validation taken to its logical extreme, with K equal to N, the number of data points in the dataset.
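A sketch of the same idea with scikit-learn's LeaveOneOut splitter, which behaves like KFold with k = n (the iris dataset is just a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
# One fold per data point: n models are trained, each tested on a single held-out sample.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
```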
Setting an Evaluation Protocol (3)
It consists of applying K-fold validation several times, shuffling the data each time before splitting it into K partitions. The final score is the average of the scores obtained at the end of each run of K-fold validation.
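A sketch of this protocol using scikit-learn's RepeatedKFold, which reshuffles the data before each repetition of K-fold validation (the dataset and model are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
# 3 repetitions of 5-fold CV; the data are shuffled differently for each repetition.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=rkf, scoring="r2")
print("mean R^2 over all runs:", scores.mean())
```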
Non-Exhaustive & Exhaustive
Cross-Validation Techniques
References:
1. https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
2. https://blog.contactsunny.com/data-science/different-types-of-validations-in-machine-learning-cross-validation
Predictive modeling refers to the task of building a model for the target variable
as a function of the explanatory variables. There are two types of predictive
modeling tasks:
• Classification is used for discrete target variables. For example, predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued.
• Regression, on the other hand, is used for continuous target variables. Forecasting the future price of a stock is a regression task because price is a continuous-valued attribute.
The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable, as illustrated in the sketch below.
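A minimal sketch contrasting the two task types with scikit-learn; the toy datasets merely stand in for the bookstore and stock-price examples:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: discrete (here binary) target, e.g. "will the user make a purchase?"
Xc, yc = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)
print(clf.predict(Xc[:3]))   # predicted class labels (0 or 1)

# Regression: continuous target, e.g. a future stock price
Xr, yr = make_regression(n_samples=200, noise=10.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:3]))   # predicted real-valued outputs
```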
In a nested-model significance test, x1 contains the predictors you know are affecting the outcome and therefore do not want to test, while x2 contains the predictors you are testing. The null hypothesis is then β2 = 0, and the null model is the one that keeps only the x1 predictors.
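Written out (a sketch of the nested-model setup implied by the text; the vector notation is an assumption):

```latex
\text{Full model: } y = \beta_0 + \beta_1^{\top}x_1 + \beta_2^{\top}x_2 + \varepsilon,
\qquad
\text{Null model } (H_0\colon \beta_2 = 0):\; y = \beta_0 + \beta_1^{\top}x_1 + \varepsilon
```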
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
Validating models
Ensuring model quality
k-fold cross-validation
The idea behind k-fold cross-validation is to repeat the construction
of the model on different subsets of the available training data and
then evaluate the model only on data not seen during construction.
This is an attempt to simulate the performance of the model on
unseen future data.
Significance Testing
“What is your p-value?”
Balancing Bias & Variance to
Control Errors in Machine Learning
https://towardsdatascience.com/balancing-bias-and-variance-to-control-errors-in-machine-learning-16ced95724db
Y = f(X) + e
Estimation of this relation, or f(X), is known as statistical learning. In general, we won't be able to make a perfect estimate of f(X), and this gives rise to an error term known as the reducible error. The accuracy of the model can be improved by making a more accurate estimate of f(X) and thereby reducing the reducible error. But even if we made a 100% accurate estimate of f(X), our model would not be error free; the remaining error is known as the irreducible error (e in the above equation). The quantity e may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction. The quantity e may also contain unmeasurable variation.
Bias
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely
complicated, by a much simpler model. So, if the true relation is complex and you try to use linear regression,
then it will undoubtedly result in some bias in the estimation of f(X). No matter how many observations you have, it is
impossible to produce an accurate prediction if you are using a restrictive/ simple algorithm, when the true relation is
highly complex.
Variance
Variance refers to the amount by which your estimate of f(X) would change if we estimated it using a different
training data set. Since the training data is used to fit the statistical learning method, different training data sets will
result in a different estimation. But ideally the estimate for f(X) should not vary too much between training sets.
However, if a method has high variance then small changes in the training data can result in large changes in f(X).
A general rule is that, as a statistical method tries to match data points more closely or when a more flexible
method is used, the bias reduces, but variance increases.
In order to minimize the expected test error, we need to select a statistical learning method that simultaneously
achieves low variance and low bias.
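This trade-off can be stated precisely. For a test point x0 with true response y0 = f(x0) + e, the expected squared prediction error decomposes as follows (a standard textbook identity, included here for reference):

```latex
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \operatorname{Var}(e)
```

The last term is the irreducible error; more flexible methods shrink the bias term but inflate the variance term, and vice versa.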
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
This is a form of regression that constrains/regularizes or shrinks the coefficient
estimates towards zero. In other words, this technique discourages learning a
more complex or flexible model, so as to avoid the risk of overfitting.
A simple relation for linear regression looks like this. Here Y represents the learned
relation and β represents the coefficient estimates for different variables or
predictors(X).
Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
The fitting procedure involves a loss function known as the residual sum of squares, or RSS. The coefficients are chosen such that they minimize this loss function. This adjusts the coefficients based on your training data; if there is noise in the training data, the estimated coefficients won't generalize well to future data.
This is where regularization comes in and shrinks or regularizes these learned
estimates towards zero.
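In standard form, the RSS and the ridge objective described next look like this (a sketch using the notation above):

```latex
\mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2,
\qquad
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \mathrm{RSS} + \lambda \sum_{j=1}^{p}\beta_j^{2}
```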
This is the ridge regression objective shown above: the RSS is modified by adding the shrinkage quantity, and the coefficients are estimated by minimizing this function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. The increase in flexibility of a model is reflected in larger coefficients, and if we want to minimize the above function, these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high.
Lasso Regression
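Lasso regression follows the same idea but penalizes the absolute values of the coefficients (an L1 penalty), which can shrink some coefficients exactly to zero and thus perform variable selection. Its standard objective, shown here for reference under the same notation as above:

```latex
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \mathrm{RSS} + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```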
Data visualization serves three broad purposes:
• Record information – blueprints, photographs, seismographs, …
• Analyze data to support reasoning (visual explanation) – understand your data better and act upon that understanding; develop and assess hypotheses (visual exploration); find patterns and discover errors in data
• Communicate information more effectively – share and persuade
Temporal
• Two conditions: the data are linear and one-dimensional.
• Temporal visualizations normally feature lines that either stand alone or overlap with each other, with a start and finish time.
• Easy-to-read graphs.
Hierarchical
• Order groups within larger groups. Hierarchical visualizations are best suited if you're looking to display clusters of information, especially if they flow from a single origin point.
• More complex and difficult to read.
Network
• Datasets connect deeply with other datasets. Network data visualizations show how they relate to one another within a network; in other words, they demonstrate relationships between datasets without wordy explanations.
Multidimensional
• There are always two or more variables in the mix, creating a 3D data visualization.
• Because of the many concurrent layers and datasets, these visualizations tend to be the most vibrant or eye-catching. Another plus: they can break a ton of data down to key takeaways.
Geospatial
• Relate to real-life physical locations, overlaying familiar maps with different data points.
• These visualizations are commonly used to display sales or acquisitions over time, and are most recognizable for their use in political campaigns or to display market penetration in multinational corporations.
Example charts for each type:
• Temporal – timelines, line graphs
• Network – alluvial diagrams
• Multidimensional – stacked bar graphs, histograms
• Geospatial – heat maps
https://www.klipfolio.com/resources/articles/what-is-data-visualization
Stage 7 - Deployment &
Iterative Lifecycle
Standard Methodology for Analytical Models (SMAM)
Operationalization –
Implementing the model as a deployable software solution
2. Solution Deployment
• Hosting the solution in the company’s data centers or on the cloud, based on the company’s policies, infrastructure, and cost
KPI Check
• Validating that target KPIs are met
Solution Approach
• Used Large Neighborhood Search (LNS) – AI Search Metaheuristic
• Constraints built into LNS as Rules
Solution Benefits
• Overall Cost Optimization (Fuel, Security) and Travel Time Optimization
• No tedious manual planning required; more time window for user requests
Example result (7:00 PM, 1/19/2018):
• Savings: 56%
• Total no. of customers: 28
• No. of manual routes: 12
• No. of proposed routes: 7
• Vehicles used in manual routes: 1 Amaze D, 2 Dzire D, 3 Etios D, 1 Indica D, 1 Innova D, 4 Tempo Traveller
• Vehicles used in optimized routes: 6 Amaze D (4-seater), 1 Tavera D (9-seater)
Project Example of Data Science
- Diagnostic Analytics
FSO analytics example –
Business Objective for the FSO Department: Target for 5 top clients in the India, Europe, South Africa and Costa Rica markets
• FTR (First Time Right) improvement by 2%
• Improve incoming WO quality by 5%