Capstone Project


Unit 1: Capstone Project

A capstone project is one in which students research a topic independently to gain a
deep understanding of the subject matter, integrate all of their knowledge, and
demonstrate it through a comprehensive project.

1. Understanding The Problem


An AI project follows six steps:
1) Problem definition i.e. Understanding the problem
2) Data gathering
3) Feature definition
4) AI model construction
5) Evaluation & refinements
6) Deployment

First, find out whether there is a pattern, as the existence of a pattern is the premise
that underlies all ML disciplines. If there is no pattern, then the problem cannot be
solved with AI technology.

These techniques are used to answer five types of questions, all falling under the
umbrella of predictive analysis:
1) Which category? (Classification)
2) How much or how many? (Regression)
3) Which group? (Clustering)
4) Is this unusual? (Anomaly Detection)
5) Which option should be taken? (Recommendation)
Determine which of these questions you’re asking, and how answering it helps you
solve your problem.

2. Decomposing The Problem Through DT Framework


Design Thinking is a design methodology that provides a solution-based approach to
solving problems. It helps to tackle complex problems that are ill-defined or unknown.

To accomplish real computational tasks, you need to break down the problem into
smaller units before coding.

Problem decomposition steps:


1. Understand and restate the problem.
● Know the desired inputs and outputs
● Ask questions for clarification
2. Break the problem down into a few large pieces.
3. Break complicated pieces down into smaller pieces. Do that until all of the pieces are
small.
4. Code one small piece at a time.
● Think about how to implement it
● Write the code/query
● Test it
● Fix problems, if any

Imagine that you want to create your first app. This is a complex problem. How would
you decompose the task of creating an app?
To decompose this task, you would need to know the answers to a series of smaller
problems:
● What kind of app do you want to create?
● What will your app look like?
● Who is the target audience for your app?
● What will the graphics look like?
● What audio will you include?
● What software will you use to build your app?
● How will the user navigate your app?
● How will you test your app?
This list breaks the complex problem of creating an app down into much simpler
problems that can now be worked out.
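
To make step 4 ("code one small piece at a time") concrete, here is a minimal Python
sketch of a hypothetical task, reporting the average valid sensor reading; the task and
all names are illustrative, not from the text. Each small piece can be written and tested
on its own before the pieces are combined.

    # Hypothetical task: report the average valid reading.
    # Each small piece below can be written and tested separately.

    def parse_readings(lines):
        # Piece 1: convert raw text lines to floats, skipping bad lines.
        readings = []
        for line in lines:
            try:
                readings.append(float(line))
            except ValueError:
                continue  # ignore lines that are not numbers
        return readings

    def filter_valid(readings, low=0.0, high=100.0):
        # Piece 2: keep only readings inside a valid range.
        return [r for r in readings if low <= r <= high]

    def average(values):
        # Piece 3: compute the mean of a non-empty list.
        return sum(values) / len(values)

    # Test each piece, then combine them.
    raw = ["12.5", "oops", "300", "47.5"]
    valid = filter_valid(parse_readings(raw))
    print(average(valid))  # (12.5 + 47.5) / 2 = 30.0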

3. Analytic Approach
Models are built to predict outcomes or discover underlying patterns, all to
gain insights leading to actions that will improve future outcomes. This is the
'Foundational Methodology of Data Science'. It has 10 stages.

Every project starts with business understanding, which lays the foundation for
successful resolution of the business problem. In this stage, the problem, the project
objectives and the solution requirements are defined from a business perspective.
Then an analytic approach is defined to solve the problem. This involves expressing the
problem in the context of statistical and machine learning techniques, so that suitable
techniques can be identified for achieving the desired outcome.
Selecting the right analytic approach depends on the question being asked.
● If the question is to determine probabilities of an action, then a predictive model
might be used.
● If the question is to show relationships, a descriptive approach may be required.
● Statistical analysis applies to problems that require counts.
● If the question requires a yes/no answer, then a classification approach to predicting
a response would be suitable.
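
As a rough illustration, the pairings below show how each type of question might map
to a common scikit-learn technique; the specific model choices are illustrative
assumptions, not prescribed by the methodology.

    # Illustrative pairings of question type to modelling technique.
    from sklearn.cluster import KMeans
    from sklearn.ensemble import IsolationForest
    from sklearn.linear_model import LinearRegression, LogisticRegression

    approach_for_question = {
        "How much / how many?": LinearRegression(),  # regression
        "Yes or no?": LogisticRegression(),          # classification
        "Which group?": KMeans(n_clusters=3),        # clustering
        "Is this unusual?": IsolationForest(),       # anomaly detection
    }
    for question, model in approach_for_question.items():
        print(question, "->", type(model).__name__)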

4. Data Requirements
Define the data requirements for decision-tree classification.
Identify the necessary data content, formats and sources for initial data collection.
The data requirements are then revised, and it is decided whether more or less data is
needed.
Data scientists will then have a good understanding of what they will be working with.
Techniques such as descriptive statistics and visualization can be applied to the data
set to assess its content and quality and to gain initial insights about the data.
Gaps in the data will be identified, and missing values will either be filled or replaced.
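
For example, descriptive statistics and gap identification might look like this in pandas
(a minimal sketch; the file name data.csv is a placeholder):

    import pandas as pd

    # Load the initial data collection (placeholder file name).
    df = pd.read_csv("data.csv")

    # Descriptive statistics: assess content and quality.
    print(df.describe())    # count, mean, std, min, quartiles, max
    print(df.dtypes)        # check the format of each column

    # Identify gaps in the data: missing values per column.
    print(df.isna().sum())

    # Fill numeric gaps with the column median (one common choice).
    df = df.fillna(df.median(numeric_only=True))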

5. Modeling Approach
Data Modeling focuses on developing models that are either descriptive or predictive.
● A descriptive model might examine relationships in the data, for example: if a
person did this, then they are likely to prefer that.
● A predictive model tries to yield yes/no or stop/go type outcomes.
Here, a training set, a set of historical data in which the outcomes are already known,
is used for predictive modelling; it acts as a gauge to determine whether the model
needs to be calibrated.
The success of data compilation, preparation and modelling depends on understanding
the problem and choosing an appropriate analytic approach.
Constant refinement, adjustments and tweaking are necessary to ensure a solid
outcome.
The framework does 3 things:
● Understand the question.
● Select an analytic approach or method to solve the problem.
● Obtain, understand, prepare, and model the data.
The end goal is to build a model to answer the question.

6. How to validate model quality


6.1 Train-Test Split Evaluation
The train-test split is a technique for evaluating the performance of a machine learning
algorithm.
It can be used for classification or regression problems and with any supervised
learning algorithm. The dataset is divided into two subsets:
● Train Dataset: used to fit the machine learning model.
● Test Dataset: used to evaluate the fitted machine learning model.
The inputs from the test dataset are provided to the model, predictions are made, and
the predictions are compared to the expected values.
The objective is to estimate the performance of the machine learning model on new
data.
The train-test procedure is appropriate when there is a sufficiently large dataset
available.
How to Configure the Train-Test Split
The procedure has one main configuration parameter: the size of the train and test
sets. This is expressed as a proportion between 0 and 1 for either the train or the test
dataset.
There is no single optimal split percentage.

The split percentage is chosen with respect to your project's objectives, with
considerations that include:
● Computational cost in training the model.
● Computational cost in evaluating the model.
● Training set representativeness.
● Test set representativeness.

Common split percentages include:


● Train: 80%, Test: 20%
● Train: 67%, Test: 33%
● Train: 50%, Test: 50%
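
A minimal sketch of the procedure using scikit-learn's train_test_split with an 80/20
split; the synthetic dataset here stands in for real project data.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a real dataset.
    X, y = make_classification(n_samples=1000, random_state=0)

    # test_size is the proportion (between 0 and 1) held out for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)            # fit on the train dataset

    y_pred = model.predict(X_test)         # predict on the test dataset
    print(accuracy_score(y_test, y_pred))  # compare to expected values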

The Shortcoming of Train-Test Split


The larger the test set, the less randomness (aka "noise") there is in the measure of
model quality, but a larger test set leaves less data for training. With a small dataset,
this trade-off makes a single train-test split unreliable.

The Cross-Validation Procedure


In cross-validation, the modeling process is run on different subsets of the data to
obtain multiple measures of model quality.
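
A minimal sketch with scikit-learn's cross_val_score, which fits the model once per
fold and returns one quality measure per fold (the synthetic dataset is again a
stand-in for real data):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    # 5-fold cross-validation: the model is estimated once per fold,
    # giving five measures of model quality instead of one.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores)         # one accuracy score per fold
    print(scores.mean())  # overall estimate of model quality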

Trade-offs Between Cross-Validation and Train-Test Split


Cross-validation gives a more accurate measure of model quality, but it can take more
time to run, because it estimates a model once for each fold and so does more total
work.
On small datasets, the extra computational burden of running cross-validation isn't a
big deal, so if your dataset is smaller, you should run cross-validation.
For larger datasets, a simple train-test split is sufficient, and it will run faster.
Cross-validation can also be run to see whether the scores for the folds are close; if
each fold gives similar results, a train-test split is probably sufficient.

Conclusion
Using cross-validation gives much better measures of model quality, with the added
benefit of cleaner code.

7. Metrics of model quality by simple Math and examples


Performance metrics like classification accuracy and root mean squared error can give
you a clear objective idea of how good a set of predictions is, and how good the model is
that generated them.

Such metrics allow you to tell the difference between, and select among:


 Different transforms of the data used to train the same machine learning model.
 Different machine learning models trained on the same data.
 Different configurations for a machine learning model trained on the same data.
All algorithms in ML rely on minimizing or maximizing a function, which we call the
"objective function".
The most commonly used method of finding the minimum point of a function is
"gradient descent".
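
As a toy illustration, here is gradient descent finding the minimum of the objective
function f(x) = (x - 3)^2, whose gradient is 2(x - 3); the function and step size are
chosen only for this example.

    # Gradient descent on f(x) = (x - 3)^2; the minimum is at x = 3.
    def gradient(x):
        return 2 * (x - 3)

    x = 0.0              # initial guess
    learning_rate = 0.1  # step size
    for step in range(100):
        x -= learning_rate * gradient(x)  # move against the gradient

    print(x)  # converges to approximately 3.0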

Loss functions:
Loss functions are the group of functions that are minimized.
A loss function is a measure of how well a prediction model is able to predict the
expected outcome.
Loss functions can be categorized into two types: classification loss and regression
loss.
Regression functions predict a quantity, and classification functions predict a label.

7.1 RMSE (Root Mean Squared Error)


In ML, when we want to look at the accuracy of our model, we take the root mean
square of the error between the test values and the predicted values.

Mathematically:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is an actual value, $\hat{y}_i$ the corresponding predicted value, and $n$
the number of test samples.

Graphically:
The red dots are the actual values, and the blue line is the set of predicted values
drawn by our model.
x represents the distance between an actual value and the predicted line; it represents
the error.
Squaring all of those distances, taking their mean, and finally taking the square root
gives us the RMSE of our model.
A lower RMSE indicates a better model; since RMSE is on the same scale as the target
variable, what counts as a "good" value depends on that scale.
If you have a high RMSE value, you either need to change your features or tweak your
hyperparameters.

Hyperparameters: Parameters whose values govern the learning process.
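
A minimal NumPy sketch of the RMSE computation (the values below are made up for
illustration):

    import numpy as np

    y_true = np.array([200.0, 150.0, 300.0, 250.0])  # actual test values
    y_pred = np.array([210.0, 140.0, 310.0, 230.0])  # predicted values

    errors = y_true - y_pred              # the distances x in the figure
    rmse = np.sqrt(np.mean(errors ** 2))  # square, average, then root
    print(rmse)                           # ~13.23 for these values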

7.2 MSE (Mean Squared Error)


It is the most commonly used regression loss function.
MSE is the mean of the squared distances between our target variable and the
predicted values.
In a plot of MSE loss (Y-axis) against predictions (X-axis) for a true target of 100, the
loss reaches its minimum value at prediction = 100.
Its range is 0 to ∞.

Why use mean squared error?


MSE is sensitive to outliers, and its optimal prediction is the mean target value,
while the optimal prediction under Mean Absolute Error is the median.
MSE is thus good to use if you believe that your target data, conditioned on the input,
is normally distributed around a mean value, and when it is important to penalize
outliers.
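
A small sketch illustrating the mean-versus-median point: for a fixed set of targets
containing an outlier, the constant prediction that minimizes MSE is their mean, while
the one that minimizes MAE is their median (the target values are made up).

    import numpy as np

    targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # note the outlier

    def mse(pred):
        return np.mean((targets - pred) ** 2)

    def mae(pred):
        return np.mean(np.abs(targets - pred))

    # Search a grid of constant predictions for the best one.
    candidates = np.linspace(0, 110, 11001)
    best_mse = candidates[np.argmin([mse(p) for p in candidates])]
    best_mae = candidates[np.argmin([mae(p) for p in candidates])]

    print(best_mse, targets.mean())      # 22.0, pulled toward the outlier
    print(best_mae, np.median(targets))  # 3.0, robust to the outlier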
