Project Report
Principles, Computational tools and Case studies with Data Science and Machine
Learning
By
Samar Pratap ( 202100485 )
1. Introduction
1.1 Introduction to Data science, Descriptive Statistics
1.2 Introduction to Machine Learning
1.3 Supervised learning with hands on
1.4 Unsupervised learning
2. Regression and Classification
2.1 Regression analysis, Type of Regression
2.2 OLS, Linear, Logistic, Multiple
2.3 Multi-class classification, Neural Network, K-Nearest Neighbours
2.4 Decision Tree, Random Forest, and Naïve Bayes.
3. Clustering
3.1 Introduction of Clustering
3.2 Types of clustering algorithms
3.3 Density-based algorithm, Distribution-based algorithm, Centroid-based
algorithm
3.4 Hierarchical-based algorithm, K-means clustering algorithm, DBSCAN clustering
4. Tableau: Hands on
4.1 Basics of Data Visualization
4.2 Tableau: Basic and advanced charts
4.3 Maps and interactive dashboards
4.4 Case studies summary
5. Summary of Achievement
6. Main Difficulties Encountered
7. Conclusion
8. References
LIST OF TABLES
Content Page No.
LIST OF FIGURES
Content Page No.
ABSTRACT
CHAPTER 1
INTRODUCTION
Data science also comes with challenges:
1. It requires preparation.
2. Data science is a fast-paced field.
3. Data science involves potential privacy risks.
What Can You Do With a Data Science Degree?
If you’re wondering what you can do with a data science degree, the data science
career outlook is strong. In the past few years alone, the demand for data science
skills and data-driven decision making has steeply risen as they become essential
tools for organizations that want to do everything they can to ensure success.
Big data and the field of data science present big opportunities for career
advancement. Depending on the industry and role you choose, you’ll find a wide
variety of data science job titles with nuanced job descriptions to match specific skill
sets.
Common data science career paths or job titles include:
1. Business analyst
2. Data analyst
3. Data architect
4. Data engineer
5. Data scientist
6. Machine learning engineer
7. Research scientist
What is Big Data?
The definition of big data is data that contains greater variety, arriving in increasing
volumes and with more velocity. This is also known as the three Vs.
Put simply, big data is larger, more complex data sets, especially from new data
sources. These data sets are so voluminous that traditional data processing software
just can’t manage them. But these massive volumes of data can be used to address business problems that could not be tackled before.
Variety - Variety refers to the many types of data that are available. Traditional data
types were structured and fit neatly in a relational database. With the rise of big data,
data comes in new unstructured data types. Unstructured and semistructured data
types, such as text, audio, and video, require additional preprocessing to derive
meaning and support metadata.
6
Fig. 1.1.1
Data Collection Methods
For Quantitative:
1. Closed-ended Surveys and Online Quizzes
Closed-ended surveys and online quizzes are based on questions that give respondents predefined answer options to choose from. Categorical survey questions can be further classified into dichotomous (“yes/no”), multiple-choice, and checkbox questions.
Descriptive Statistics
Descriptive statistics are brief informational coefficients that summarize a given data
set, which can be either a representation of the entire population or a sample of a
population. Descriptive statistics are broken down into measures of central tendency
and measures of variability (spread). Measures of central tendency include the mean,
median, and mode, while measures of variability include standard deviation, variance,
minimum and maximum variables, kurtosis, and skewness.
Examples of Descriptive Statistics - Descriptive statistics are informational and
meant to describe the actual characteristics of a data set. When analyzing numbers
regarding the prior Major League Baseball season, descriptive statistics include the highest batting average for a single player, the number of runs allowed per team, and the average wins per division.
Common tools of descriptive statistics
Descriptive statistics provide the tools to describe our data in an understandable and appropriate way:
1. Central tendency: Use the mean or the median to locate the center of the dataset.
This measure tells you where most values fall.
2. Dispersion: How far out from the center do the data extend? You can use the range
or standard deviation to measure the dispersion. A low dispersion indicates that the
values cluster more tightly around the center. Higher dispersion signifies that data
points fall further away from the center. We can also graph the frequency
distribution.
3. Skewness: This measure tells you whether the distribution of values is symmetric or skewed.
Fig. 1.1.2
TABLE 1.1
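Since Python is named among the report's computational tools, a minimal sketch of computing the measures listed above with pandas and SciPy might look like this (the toy run totals are made up purely for illustration):

import pandas as pd
from scipy import stats

# Hypothetical sample: runs scored by a team over ten games
runs = pd.Series([3, 7, 2, 5, 5, 8, 1, 4, 6, 5])

# Central tendency
print("mean:", runs.mean())
print("median:", runs.median())
print("mode:", runs.mode().tolist())

# Dispersion
print("range:", runs.max() - runs.min())
print("std dev:", runs.std())
print("variance:", runs.var())

# Shape of the distribution
print("skewness:", stats.skew(runs))
print("kurtosis:", stats.kurtosis(runs))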
1.2 Introduction to Machine Learning
Machine learning (ML) continues to grow in importance for many organizations
across nearly all domains. Some example applications of machine learning in practice
include:
1.Predicting the likelihood of a patient returning to the hospital (readmission) within
30 days of discharge.
2. Segmenting customers based on common attributes or purchasing behavior for
targeted marketing.
3. Predicting coupon redemption rates for a given marketing campaign.
4. Predicting customer churn so an organization can perform preventative
intervention.
5. And many more!
The Connection between Machine Learning, Deep Learning, and AI
Fig. 1.2.1
The Categories of Machine Learning
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Semi-supervised Learning
5. Self-supervised Learning
Fig. 1.2.2
1.3 Supervised learning with hands on
Supervised learning is concerned with predicting a target value given input
observations. In machine learning, we call the model inputs “features.” The target values that supervised models are trained to predict are also often called “labels.” Supervised learning can be categorized into two major subcategories: regression analysis and classification. In regression analysis, the target values or labels are continuous variables. In classification, the labels are so-called class labels, which can be understood as discrete class- or group-membership indicators.
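Since the report's conclusion names Python and scikit-learn among the computational tools, a minimal hands-on sketch of the two subcategories might look like this (the built-in diabetes and iris datasets are assumed examples, not data from the report):

from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Regression: continuous target (a disease progression score)
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("R^2 on test set:", reg.score(X_te, y_te))

# Classification: discrete class labels (iris species)
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy on test set:", clf.score(X_te, y_te))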
1.4 Unsupervised learning
Figure 1.4.1. Illustrations of PCA. (A) A dataset (blue circles) with two feature axes X1 and X2 that has been projected onto the principal components PC1 and PC2. (B) A view of the data projected onto the first principal component (PC1).
In a nutshell, PCA identifies the directions of maximum variance in a dataset. These
directions (eigenvectors of the covariance matrix) form the principal component axes
of a new coordinate system. In practice, PCA is typically used to reduce the
dimensionality of a dataset while retaining most of its information (variance). PCA is
a linear transformation technique. Thus, it cannot capture complex non-linear patterns
in data; however, kernel PCA and other non-linear dimensionality reduction
techniques exist. Autoencoders are deep neural network architectures that can be used for non-linear dimensionality reduction. An autoencoder consists of two subnetworks, an encoder and a decoder, as illustrated in Figure 1.4.2. An input (for example, an image) is passed to the encoder, which compresses the input into a lower-dimensional representation, also known as an “embedding vector” or “latent representation.” After the encoder has embedded the input in this lower-dimensional space, the decoder reconstructs the original image from the latent representation. The two subnetworks, encoder and decoder, are connected end to end and are often depicted as an hourglass, where the width of the hourglass represents the size of the feature map produced by the multilayer neural network.
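Before turning to the autoencoder illustration below, a minimal scikit-learn sketch of the PCA projection described above might look like this (the iris dataset and the choice of two components are assumptions made for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize the features, then project onto the directions of maximum variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("original shape:", X.shape)      # (150, 4)
print("reduced shape:", X_pca.shape)   # (150, 2)
print("variance retained:", pca.explained_variance_ratio_.sum())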
Figure 1.4.2. Illustration of an autoencoder.
In a typical autoencoder architecture, the encoder’s output is the smallest feature map
in the network, which is then fed into the decoder – this connection between encoder
and decoder is also often referred to as the “bottleneck.” The rationale behind the
autoencoder idea is that for the decoder to produce a faithful reconstruction of the
original input image, the encoder must learn to create a latent representation that
preserves most of the information. Typically, the autoencoder architecture is designed
such that the latent representation is lower-dimensional than the input (and its
reconstruction), so that the autoencoder cannot just memorize and pass the original
input through the architecture – this is sometimes also referred to as an information bottleneck.
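TensorFlow is named among the tools in the report's conclusion, so a minimal Keras sketch of the encoder–decoder idea is given here; the layer sizes and the random toy data are illustrative assumptions, not taken from the report:

import numpy as np
from tensorflow.keras import layers, Model

# Toy data: 1000 random "images" flattened to 784 features
x = np.random.rand(1000, 784).astype("float32")

# Encoder: compress 784 -> 32 (the latent representation / bottleneck)
inputs = layers.Input(shape=(784,))
encoded = layers.Dense(128, activation="relu")(inputs)
encoded = layers.Dense(32, activation="relu")(encoded)

# Decoder: reconstruct 32 -> 784
decoded = layers.Dense(128, activation="relu")(encoded)
decoded = layers.Dense(784, activation="sigmoid")(decoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# The training target is the input itself: the network learns to reconstruct it
autoencoder.fit(x, x, epochs=5, batch_size=64, verbose=0)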
Another major subcategory of unsupervised learning is clustering, which assigns
group membership information to data points. It can be thought of as a task similar to
classification but without labeling information given in the training dataset. Hence, in
the absence of class label information, the clustering approach is to group data records
by similarity and define distinct groups based on similarity thresholds.
Clustering can be divided into three major groups: prototype-based, density-based,
and hierarchical clustering. In prototype-based clustering algorithms, such as K-Means (MacQueen, 1967; Lloyd, 1982), a fixed number of cluster centers is defined (the cluster centers are repositioned iteratively), and data points are assigned to the closest prototype based on a pair-wise distance measure (for example, Euclidean distance). In density-based clustering, unlike in prototype-based clustering, the number of cluster centers is not fixed; instead, clusters are formed by identifying regions of high density (locations where many data records lie close to each other, as measured by a user-defined distance metric). In hierarchical clustering, a distance metric is used to group examples in a tree-like fashion, where examples that sit closer together in the tree are more closely related; the depth at which the tree is cut determines the number of clusters.
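The practical difference between prototype-based and density-based clustering can be seen in a few lines of scikit-learn; the two-moons toy dataset and the eps/min_samples values below are assumptions chosen for illustration:

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Two interleaving half-circles: clusters that are not spherical
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# Prototype-based: a fixed number of centroids, assignment by distance
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: clusters grow from dense regions; no cluster count given up front
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("k-means cluster sizes:", [int((kmeans_labels == k).sum()) for k in set(kmeans_labels)])
print("DBSCAN cluster labels found:", set(dbscan_labels))

On data like this, k-means typically splits the two moons incorrectly because it assumes roughly spherical clusters, while DBSCAN usually recovers the two arcs.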
CHAPTER 2
REGRESSION AND CLASSIFICATION
OLS stands for Ordinary Least Squares. It is the most common type of linear
regression. It minimizes the sum of the squared residuals, which are the differences
between the predicted values and the actual values.
Linear regression is a statistical method that is used to model the relationship between
a dependent variable and one or more independent variables. The dependent variable
is the variable that you are trying to predict, and the independent variables are the
variables that you are using to make the prediction.
Logistic regression is a type of regression analysis that is used for binary
classification problems, where the dependent variable can have two possible values,
such as "yes" or "no." It is a non-linear model that predicts the probability of a binary
outcome.
Multiple regression is a type of regression analysis that is used when there are more
than one independent variable. It is used to model the relationship between a
dependent variable and multiple independent variables.
The table below summarizes the key differences between OLS, linear, logistic, and multiple regression:

Regression type      Target variable         Key idea
OLS / linear         Continuous              Fits a line by minimizing the sum of squared residuals
Multiple             Continuous              Linear model with two or more independent variables
Logistic             Binary (two classes)    Models the probability of a binary outcome
The best type of regression analysis to use depends on the research question and the
data that is available. For example, if you are trying to predict the price of a house,
you might use linear regression. If you are trying to predict whether or not someone
will vote for a particular candidate, you might use logistic regression. If you are trying
to predict the risk of heart disease, you might use multiple regression.
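A compact scikit-learn sketch of these ideas follows; the synthetic house-price data, the feature names, and the coefficients are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Multiple (OLS) regression: two independent variables, one continuous target
area = rng.uniform(50, 200, size=200)        # illustrative feature: house area
rooms = rng.integers(1, 6, size=200)         # illustrative feature: room count
price = 3.0 * area + 10.0 * rooms + rng.normal(0, 10, size=200)

X = np.column_stack([area, rooms])
ols = LinearRegression().fit(X, price)       # minimizes the sum of squared residuals
print("coefficients:", ols.coef_, "intercept:", ols.intercept_)

# Logistic regression: binary target (for example, above/below the median price)
y_binary = (price > np.median(price)).astype(int)
logit = LogisticRegression().fit(X, y_binary)
print("predicted probabilities (first 3):", logit.predict_proba(X[:3])[:, 1])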
2.3 Multi-class classification, Neural Network, K-Nearest Neighbours
Multi-class classification is a type of classification problem where the dependent
variable can have more than two possible values. For example, if you are trying to
classify images of animals, the dependent variable could be the animal's species.
There are many different algorithms that can be used for multi-class classification,
including:
• Neural networks: Neural networks are a type of machine learning algorithm that
is inspired by the human brain. They are very powerful and can be used to solve
a variety of problems, including multi-class classification.
• K-nearest neighbors: K-nearest neighbors is a simple algorithm that classifies a
new data point based on the k most similar data points in the training set. It is
easy to understand and implement, but it can be less accurate than other
algorithms.
• Support vector machines: Support vector machines are a type of machine
learning algorithm that creates a hyperplane that separates the different classes
of data. They are very accurate, but they can be more difficult to understand and
implement than other algorithms.
The best algorithm to use for multi-class classification depends on the specific
problem and the data that is available. For example, if you have a lot of data and you
are willing to spend time training the model, then you might use a neural network. If
you have less data or you are not willing to spend as much time training the model,
then you might use a simpler algorithm like K-nearest neighbors.
Neural networks are a type of machine learning algorithm that is inspired by the
human brain. They are made up of interconnected nodes, and each node performs a
simple computation. The nodes are arranged in layers, and the information flows from
the input layer to the output layer.
Neural networks can be used to solve a variety of problems, including classification,
regression, and clustering. They are very powerful and can be used to achieve state-
of-the-art results on many problems. However, they can be difficult to train and
require a lot of data.
K-nearest neighbors is a simple algorithm that classifies a new data point based on the
k most similar data points in the training set. The similarity is typically measured
using a distance metric, such as the Euclidean distance.
K-nearest neighbors is easy to understand and implement, but it can be less accurate
than other algorithms. It is also sensitive to the choice of k, which is the number of
neighbors to consider.
Support vector machines are a type of machine learning algorithm that creates a
hyperplane that separates the different classes of data. The hyperplane is chosen so
that it maximizes the margin between the classes.
Support vector machines are often very accurate, but they can be more difficult to understand and implement than other algorithms, and training can be slow on large datasets.
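The three classifiers discussed above can be compared on a standard multi-class dataset in a few lines of scikit-learn; the digits dataset and the untuned hyperparameters below are assumptions made for illustration:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Ten-class problem: handwritten digits 0-9
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "k-nearest neighbors (k=5)": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(kernel="rbf"),
    "neural network (MLP)": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")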
2.4 Decision Tree, Random Forest, and Naïve Bayes.
Decision trees are a type of supervised learning algorithm that can be used for both
classification and regression problems. They work by breaking down the data into
smaller and smaller subsets until each subset can be classified or predicted with a high
degree of certainty.
Random forests are an ensemble learning algorithm that combines multiple decision
trees to improve the accuracy of predictions. They work by training each decision tree
on a different subset of the data and then combining the predictions of the individual
trees.
Naive Bayes is a simple probabilistic classifier that assumes that the features are independent of each other. This assumption makes it very fast to train and predict, but it can reduce accuracy when the features are in fact strongly correlated.
Table 2.1 summarizes the key differences between decision trees, random forests, and naive Bayes.
The best algorithm to use depends on the specific problem and the data that is
available. For example, if you have a small dataset and you need to make predictions
quickly, then you might use a decision tree. If you have a large dataset and you need
to make accurate predictions, then you might use a random forest. If you have a
dataset with categorical features, then you might use naive Bayes.
Here are some additional things to keep in mind when choosing between decision
trees, random forests, and naive Bayes:
• Decision trees are easy to understand and interpret, but they are prone to overfitting.
• Random forests are more accurate than decision trees, but they can be more
difficult to interpret.
• Naive Bayes is fast and easy to train, but it can be less accurate than decision
trees or random forests.
TABLE 2.1
Algorithm        Strengths                             Weaknesses
Decision tree    Easy to understand and interpret      Prone to overfitting
Random forest    More accurate than a single tree      Harder to interpret
Naive Bayes      Very fast to train and predict        Assumes independent features; can be less accurate
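A short scikit-learn sketch contrasting the three algorithms on the same data follows; the wine dataset and the mostly default hyperparameters are assumptions for illustration:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Single tree: interpretable, but prone to overfitting
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Random forest: an ensemble of trees trained on different bootstrapped subsets
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Naive Bayes: assumes conditionally independent features; very fast to train
nb = GaussianNB().fit(X_tr, y_tr)

for name, model in [("decision tree", tree), ("random forest", forest), ("naive Bayes", nb)]:
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")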
CHAPTER 3
CLUSTERING
Clustering is a type of unsupervised learning algorithm that is used to group similar
data points together. The goal of clustering is to find groups of data points that are as
similar as possible within each group and as different as possible between groups.
There are many different clustering algorithms, each with its own strengths and
weaknesses. Some of the most common clustering algorithms include:
• K-means clustering: This is a simple algorithm that divides the data into k
clusters, where k is a user-defined number. The algorithm starts by randomly
assigning each data point to a cluster (or choosing k initial centroids). It then alternates between re-computing each cluster's centroid and re-assigning each data point to its nearest centroid, stopping when the assignments no longer change.
• Hierarchical clustering: This algorithm builds a hierarchy of clusters, where
each cluster is a subcluster of another cluster. The algorithm starts by creating a
cluster for each data point. Then, it merges the two most similar clusters until
there is only one cluster left.
• Density-based clustering: This algorithm clusters data points that are densely
packed together. The algorithm starts by identifying dense regions of data
points. Then, it merges these regions together until there are no more dense
regions left.
• Distribution-based clustering: This algorithm assumes the data were generated by a mixture of probability distributions (for example, Gaussians), fits that mixture to the data, and assigns each point to the distribution most likely to have generated it.
The best clustering algorithm to use depends on the specific problem and the data that is available. For example, k-means is a common first choice when the clusters are roughly spherical and the number of clusters is known or easy to guess. Hierarchical clustering is useful when you want to explore groupings at several levels of granularity, although it scales poorly to very large datasets. Density-based clustering such as DBSCAN is a better fit when clusters have irregular shapes or the data contains noise and outliers.
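As a hands-on counterpart to the descriptions above, here is a brief sketch of hierarchical (agglomerative) clustering with scikit-learn; the synthetic blobs and the choice of three clusters are assumptions made for illustration:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Agglomerative clustering: start with one cluster per point and repeatedly
# merge the two most similar clusters until n_clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])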
Here are some of the key concepts in clustering:
• Cluster: A group of data points that are similar to each other.
• Centroid: The center of a cluster.
• Density: The number of data points in a particular region of space.
• Distribution: The shape of the data points in a particular region of space.
Here are some of the benefits of clustering:
• It can be used to find hidden patterns in data.
• It can be used to reduce the dimensionality of data.
• It can be used to identify outliers.
• It can be used to segment customers.
• It can be used to group documents.
Here are some of the challenges of clustering:
• The choice of clustering algorithm can be difficult.
• The number of clusters is a hyperparameter that needs to be tuned (see the sketch after this list).
• The clusters can be noisy or overlapping.
• The clusters can be unstable to changes in the data.
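One practical way to tune the number of clusters mentioned in the list above is to compare silhouette scores across candidate values of k; a minimal sketch follows (the synthetic data and the candidate range are assumptions):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# A higher silhouette score indicates tighter, better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")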
Fig. 3.1.1
CHAPTER 4
TABLEAU: HANDS ON
Tableau is a data visualization software that allows you to create interactive
dashboards and charts. It is a powerful tool that can be used to communicate insights
from data in a clear and concise way.
4.1 Basics of Data Visualization
The basics of data visualization include:
• Choosing the right chart type for the data.
• Using colors and fonts that are easy to read.
• Adding labels and annotations to explain the data.
• Formatting the data in a way that is visually appealing.
4.2 Tableau: Basic and advanced charts
Tableau has a wide variety of charts that can be used to visualize data. Some of the
most common charts include:
• Bar charts
• Line charts
• Pie charts
• Scatterplots
• Maps
Tableau also has a number of advanced charts that can be used to visualize more
complex data. These charts include:
• Treemaps
• Heatmaps
• Waterfall charts
• Sankey diagrams
4.3 Maps and interactive dashboards
Tableau can be used to create interactive maps that allow you to explore data by
location. It can also be used to create interactive dashboards that allow you to present
data in a way that is easy to understand and navigate.
Here are some of the benefits of using Tableau for data visualization:
• It is easy to use.
• It is powerful.
• It is versatile.
• It is interactive.
• It is shareable.
Here are some of the challenges of using Tableau:
• It can be expensive.
• It can be time-consuming to learn.
• It can be difficult to create complex visualizations.
Overall, Tableau is a powerful data visualization tool that can be used to communicate
insights from data in a clear and concise way. It is a versatile tool that can be used for
a variety of purposes.
SUMMARY OF ACHIEVEMENT
In our journey through Principles, Computational Tools, and Case Studies in Data
Science and Machine Learning, we've attained a comprehensive grasp of the core
tenets that guide data-driven decision-making. Equipped with this understanding,
we've harnessed a diverse range of computational tools and libraries, transforming
theoretical concepts into practical applications. Through real-world case studies,
we've witnessed the transformative potential of predictive modeling, pattern
recognition, and automation across industries. This exploration has emphasized not
only the power of data but also the ethical imperative of responsible and unbiased
practices. As we stand at the intersection of human expertise and machine learning
capabilities, our achievements signify a readiness to navigate complexities and
innovate within the dynamic landscape of data science and machine learning.
CONCLUSION
The exploration of Principles, Computational Tools, and Case Studies in Data Science
and Machine Learning underscores the transformative potential of these fields in
reshaping our understanding of data, patterns, and decision-making.
Principles serve as the foundation, emphasizing the significance of problem
understanding, data preprocessing, feature selection, algorithm choice, and model
evaluation. The iterative nature of the process highlights the need for continuous
refinement and adaptation.
Computational Tools are the engines that power innovation. From libraries like
Scikit-Learn and TensorFlow to languages like Python and R, these tools democratize
complex processes, enabling experts and novices alike to engage in impactful data
analysis and modeling.
Case Studies illuminate the real-world impact of Data Science and Machine
Learning. Whether it's predicting customer behavior in e-commerce, optimizing
energy consumption, or revolutionizing healthcare with predictive diagnostics, these
studies highlight the tangible benefits of harnessing data's potential.
In a world inundated with data, these disciplines offer a lens to decipher complexities.
Yet, challenges abound – ethical considerations, bias mitigation, and privacy
preservation demand vigilant attention. As we navigate forward, a symbiotic
relationship between human expertise and machine learning capabilities emerges,
driving innovation and pushing boundaries.
In essence, the amalgamation of Principles, Computational Tools, and Case Studies in
Data Science and Machine Learning encapsulates a paradigm shift in problem-
solving. It's a journey of continuous learning, where data becomes a canvas and
algorithms paint the future. With each breakthrough, we inch closer to a reality where
data-driven insights illuminate the path to progress.
REFERENCES
1. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., New York: Springer, 2009.
2. J. K.-U. Brock, Data Design: The Visual Display of Qualitative and Quantitative Information, 1st ed., Consulting Press, 2017.
3. https://media.licdn.com/dms/image/C5612AQHpg16qX4Bvhw/article-inline_image-shrink_1000_1488/0/1605942452368?e=1698278400&v=beta&t=Ca6dD2nlHN6dQZOcXYhA-bhnrAT4GbvzRSjrmBGbaAE (image credit for Fig. 1.1.1)
4. https://www.assignmenthelppro.com/blog/inferential-vs-descriptive-statistics/ (table credit for Table 1.1)