Project Report


Report on

Principles, Computational tools and Case studies with Data Science and Machine
Learning

By
Samar Pratap ( 202100485 )

In partial fulfillment of requirements for the award of degree in


Bachelor of Technology in Computer Science and Engineering
(2023)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SIKKIM MANIPAL INSTITUTE OF TECHNOLOGY


(A constituent college of Sikkim Manipal University)
MAJITAR, RANGPO, EAST SIKKIM – 737136
LIST OF CONTENTS

Content Page No.

LIST OF TABLES ……………………….………………….………………. 4


LIST OF FIGURES …………………………………………………….…… 5
ABSTRACT ………………….………………….……………………………6

1. Introduction
1.1 Introduction to Data science, Descriptive Statistics
1.2 Introduction to Machine Learning
1.3 Supervised learning with hands on
1.4 Unsupervised learning
2. Regression and Classification
2.1 Regression analysis, Type of Regression
2.2 OLS, Linear, Logistic, Multiple
2.3 Multi-class classification, Neural Network, K-Nearest Neighbours
2.4 Decision Tree, Random Forest, and Naïve Bayes.
3. Clustering
3.1 Introduction of Clustering
3.2 Types of clustering algorithms
3.3 Density-based algorithm, Distribution-based algorithm, Centroid-based
algorithm
3.4 Hierarchical-based algorithm, K-means clustering algorithm, DBSCAN clustering
4. Tableau: Hands on
4.1 Basics of Data Visualization
4.2 Tableau: Basic and advanced charts
4.3 Maps and interactive dashboards
4.4 Case studies

5. Summary of Achievement
6. Main Difficulties Encountered
7. Conclusion
8. References

LIST OF TABLES
Content Page No.

TABLE 1.1 Difference between Descriptive & Inferential …….…………10

TABLE 2.1 Algorithm Type ………………………………………………..21

LIST OF FIGURES
Content Page No.

Fig .1.1.1 The three Vs of big data …………………………………………….7


Fig .1.1.2 Tools of Descriptive Analysis……………………………………….9
Fig .1.2.1 Connection between ML, Deep Learning , AI……………………11
Fig .1.2.2 Reinforcement Learning……………………………………………11
Fig .1.3.1 Supervised learning, regression and classification………………..12
Fig .1.4.1 Illustrations of PCA…………………………………………………14
Fig .1.4.2 Illustration of an auto encoder……………………………………..15
Fig .1.4.3 Illustration of clustering…………………………………………….16
Fig .3.1.1 Types of clustering algorithm……………………………………….23
Fig.4.4.1 Amazon Recommendation Engine case study…………………………..25

ABSTRACT

This training program provides a comprehensive introduction to the principles,


computational tools, and case studies of data science and machine learning. The
program begins with an overview of data science, including its significance and the
role of descriptive statistics in analyzing and interpreting data patterns. Participants
then learn about machine learning, including its principles and methodologies that
fuel intelligent decision-making. They gain hands-on experience with supervised
learning, constructing predictive models from labeled data, and unsupervised
learning, extracting latent insights from unlabeled data.

The core of the program focuses on regression and classification techniques.


Participants learn about a variety of regression types, from simple linear to complex
logistic and multiple regressions. They also learn about classification methods like
neural networks, k-nearest neighbors, decision trees, random forests, and naive Bayes.
The program also covers clustering techniques, exploring various algorithmic
approaches, including density-based, distribution-based, and centroid-based
methodologies. The program concludes with hands-on instruction in data visualization
using Tableau. Participants learn the basics of data visualization, create both basic and
advanced charts, and build interactive dashboards. They also apply these skills to real-
world case studies.

Overall, this training program provides a comprehensive and practical introduction to


data science and machine learning. It is designed to equip participants with the
knowledge and skills they need to succeed in data-driven industries.

CHAPTER 1
INTRODUCTION

1.1 Introduction to Data science, Descriptive Statistics

Data science combines math and statistics, specialized programming, advanced


analytics, artificial intelligence (AI), and machine learning with specific subject
matter expertise to uncover actionable insights hidden in an organization’s data. These
insights can be used to guide decision making and strategic planning.
Why Data Science? Pros of Working in the Data Science Field
1. Data science careers are in high demand.
2. There’s a low supply of workers in the data science field.
3. The data science field is versatile and broadly applicable.
4. A data science career has the potential to make a lasting impact.
Working in the Data Science Field (Cons):

1. It requires preparation.
2. Data science is a fast-paced field.
3. Data science involves potential privacy risks.
What Can You Do With a Data Science Degree?

If you’re wondering what you can do with a data science degree, the data science
career outlook is strong. In the past few years alone, the demand for data science
skills and data-driven decision making has steeply risen as they become essential
tools for organizations that want to do everything they can to ensure success.
Big data and the field of data science present big opportunities for career
advancement. Depending on the industry and role you choose, you’ll find a wide
variety of data science job titles with nuanced job descriptions to match specific skill
sets.
Common data science career paths or job titles include:

1. Business analyst
2. Data analyst
3. Data architect
4. Data engineer
5. Data scientist
6. Machine learning engineer
7. Research scientist
What is Big Data?
The definition of big data is data that contains greater variety, arriving in increasing
volumes and with more velocity. This is also known as the three Vs.
Put simply, big data is larger, more complex data sets, especially from new data
sources. These data sets are so voluminous that traditional data processing software
just can’t manage them. But these massive volumes of data can be used to address

business problems you wouldn’t have been able to tackle before.

The three Vs of big data


Volume - The amount of data matters. With big data, you’ll have to process high
volumes of low-density, unstructured data. This can be data of unknown value, such
as Twitter data feeds, clickstreams on a web page or a mobile app, or sensor-enabled
equipment. For some organizations, this might be tens of terabytes of data. For others,
it may be hundreds of petabytes.
Velocity - Velocity is the fast rate at which data is received and (perhaps) acted on.
Normally, the highest-velocity data streams directly into memory rather than being
written to disk. Some internet-enabled smart products operate in real time or near real
time and will require real-time evaluation and action.

Variety - Variety refers to the many types of data that are available. Traditional data
types were structured and fit neatly in a relational database. With the rise of big data,
data comes in new unstructured data types. Unstructured and semistructured data
types, such as text, audio, and video, require additional preprocessing to derive
meaning and support metadata.

Fig. 1.1.1 The three Vs of big data
Data Collection Methods
For Quantitative:
1. Closed-ended Surveys and Online Quizzes
Closed-ended surveys and online quizzes are based on questions that give respondents
predefined answer options to choose from.
Categorical survey questions can be further classified into dichotomous (‘yes/no’),
multiple-choice, or checkbox questions, and can be answered with a simple “yes” or
“no” or a specific piece of predefined information.


Interval/ratio questions, on the other hand, can consist of rating-scale, Likert-
scale, or matrix questions and involve a set of predefined values to choose from on a
fixed scale.
For Qualitative:
1. Open-Ended Surveys and Questionnaires
The main difference between the two is the fact that closed-ended surveys offer
predefined answer options the respondent must choose from, whereas open-ended
surveys allow the respondents much more freedom and flexibility when providing
their answers.
When creating an open-ended survey, keep in mind the length of your survey
and the number and complexity of questions.
Compared to closed-ended surveys, one of the quantitative data collection methods,
the findings of open-ended surveys are more difficult to compile and analyze because
there are no uniform answer options to choose from.

Primary and Secondary Data


Primary Data: Primary data is original data that is collected firsthand for a specific
research purpose. It is data that has not been previously published or analyzed.
Researchers collect primary data directly from individuals, surveys, experiments,
observations, interviews, focus groups, and other methods. Primary data is tailored to
the research objectives and can provide unique insights into a particular research
question.
Advantages of primary data include:
1. Accuracy and Relevance: Since primary data is collected for a specific purpose, it
is likely to be directly relevant to the research question.
2. Control: Researchers have control over the data collection process, ensuring that
the data collected is accurate and consistent.
3. Freshness: Primary data is the most current and up-to-date information available.
Secondary Data:
Secondary data refers to data that has been collected and published by someone else
for a different purpose. This data is not gathered directly by the researcher but is
obtained from sources such as books, articles, reports, databases, and existing
datasets. Secondary data can be used to supplement primary data or to address
research questions that do not require new data collection.
Advantages of secondary data include:
1. Time and Cost Savings: Using existing data saves time and resources compared to
collecting new data.
2. Comparability: Secondary data can be used to compare different studies or time
periods, making it useful for trend analysis.
3. Broad Scope: Secondary data sources often cover a wide range of topics and areas.


Descriptive Statistics
Descriptive statistics are brief informational coefficients that summarize a given data
set, which can be either a representation of the entire population or a sample of a
population. Descriptive statistics are broken down into measures of central tendency
and measures of variability (spread). Measures of central tendency include the mean,
median, and mode, while measures of variability include standard deviation, variance,
minimum and maximum variables, kurtosis, and skewness.
Examples of Descriptive Statistics - Descriptive statistics are informational and
meant to describe the actual characteristics of a data set. When analyzing numbers
regarding the prior Major League Baseball season, descriptive statistics include the
highest batting average for a single player, the number of runs allowed per team, and
the average wins per division.
Common tools of descriptive statistics
Descriptive statistics provides the tools to describe our data in an understandable and
appropriate way:
1. Central tendency: Use the mean or the median to locate the center of the dataset.
This measure tells you where most values fall.
2. Dispersion: How far out from the center do the data extend? You can use the range
or standard deviation to measure the dispersion. A low dispersion indicates that the
values cluster more tightly around the center. Higher dispersion signifies that data
points fall further away from the center. We can also graph the frequency
distribution.
3. Skewness: This measure tells you whether the distribution of values is symmetric or
skewed.
Fig. 1.1.2 Tools of Descriptive Analysis
TABLE 1.1 Difference between Descriptive & Inferential Statistics
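To make these measures concrete, here is a minimal sketch, assuming a Python environment with pandas and SciPy, that computes central tendency, dispersion, and shape statistics for a small, purely illustrative set of numbers:

import pandas as pd
from scipy.stats import skew, kurtosis

# Illustrative sample: ten made-up daily sales figures
sales = pd.Series([12, 15, 14, 10, 18, 22, 19, 15, 13, 30])

# Central tendency: where the center of the data lies
print("Mean:  ", sales.mean())
print("Median:", sales.median())
print("Mode:  ", sales.mode().tolist())

# Dispersion: how far the values spread from the center
print("Range: ", sales.max() - sales.min())
print("Std:   ", sales.std())   # sample standard deviation
print("Var:   ", sales.var())   # sample variance

# Shape: symmetry and tail weight of the distribution
print("Skewness:", skew(sales))
print("Kurtosis:", kurtosis(sales))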
1.2 Introduction to Machine Learning
Machine learning (ML) continues to grow in importance for many organizations
across nearly all domains. Some example applications of machine learning in practice
include:
1. Predicting the likelihood of a patient returning to the hospital (readmission) within
30 days of discharge.
2. Segmenting customers based on common attributes or purchasing behavior for
targeted marketing.
3. Predicting coupon redemption rates for a given marketing campaign.
4. Predicting customer churn so an organization can perform preventative
intervention.
5. And many more!

The Connection between Machine Learning, Deep Learning, and AI

Fig. 1.2.1 Connection between ML, Deep Learning, and AI
The Categories of Machine Learning
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Fig. 1.2.2 Reinforcement Learning
4. Semi-supervised Learning
5. Self-supervised Learning
1.3 Supervised learning with hands on
Supervised learning is concerned with predicting a target value given input
observations. In machine learning, we call the model inputs “features.” The target
values that supervised models are trained to predict are also often called labels.
Supervised learning can be categorized into two major subcategories: regression
analysis and classification. In regression analysis, the target values or labels are
continuous variables (Figure 1.3.1 A). In classification, the labels are so-called class
labels, which can be understood as discrete class- or group-membership indicators
(Figure 1.3.1 B).

Figure 1.3.1. Illustrations of the two main categories of supervised learning:
regression (A) and classification (B).
In machine learning, we often work with high-dimensional datasets; that is, datasets
consisting of many input features. However, due to the limitations of the human
imagination and the written medium, conventional illustrations can only depict two
(or at most three) spatial dimensions. The 2D scatterplots in Figure 1.3.1 show two simple
datasets. Here, sub-panel A depicts a simple regression example for a dataset with
only a single feature. The target variable, the values we want to predict, is depicted as
the y-axis. Sub-panel B depicts a 2-dimensional classification dataset, where the
target variable, the discrete class label information, is encoded as a symbol (triangle
vs. circle). In both cases, there is a target variable that the model learns to predict. In
the case of the linear regression example, the target variable is a continuous variable
depicted on the y-axis in Figure 1.3.1 A. For the classification example in Figure 1.3.1 B, the
target variable is comprised of class labels depicted as symbols (triangles and circles).
The third category of supervised learning is ordinal regression, which is sometimes
also referred to as ordinal classification. Ordinal regression can be understood as a
hybrid between the two categories mentioned above, (metric) regression analysis and
classification. In ordinal regression, we have categories just like in classification.
However, there is ordering information between the classes. In contrast to metric
regression, the labels are discrete, and the distance between the labels is arbitrary. For
example, predicting a person’s height is an example of metric regression. The target
variable (height) can be measured on a continuous scale, and the distance between
150 cm and 160 cm is the same as the distance between 180 and 190 cm. Predicting a
movie rating on a 1-5 scale would be a better fit for an ordinal regression model.
Assuming the movie rating scale is composed as follows: 1=bad, 2=ok, 3=good,
4=great, and 5=awesome. Here, the distance between 1 (bad) and 2 (ok) is arbitrary.
Or in other words, we cannot directly compare the difference between 1 (bad) and 2
(ok) to the distance between 3 (good) and 4 (great). A task that is related to ordinal
regression is ranking. However, note that in ranking, we are only interested in
the relative order of the items. For example, we may order a collection of movies
from worst to best. In ordinal regression, we care about the absolute values on a given
scale. A detailed treatment of ordinal regression is beyond the scope of this report.
Readers interested in this topic may refer to the manuscript “Rank-consistent Ordinal
Regression for Neural Networks” (Cao et al., 2019), which contains references to
ordinal regression literature relevant to deep learning.
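As a small hands-on illustration of the two main subcategories described above, the following sketch fits a regression model to a continuous target and a classification model to discrete class labels. It assumes scikit-learn is available; the synthetic datasets and parameter values are illustrative rather than taken from the training sessions:

from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, accuracy_score

# Regression: a continuous target variable (compare Figure 1.3.1 A)
X, y = make_regression(n_samples=200, n_features=1, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("Regression R^2 on held-out data:", r2_score(y_te, reg.predict(X_te)))

# Classification: discrete class labels (compare Figure 1.3.1 B)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("Classification accuracy on held-out data:",
      accuracy_score(y_te, clf.predict(X_te)))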

1.4 Unsupervised learning


The previous section introduced supervised learning, which is the most prominent
subcategory of machine learning. This section discusses the second major category of
machine learning: unsupervised learning. In unsupervised learning, in contrast to
supervised learning, no labeling information is given. The goal is to discover or model
hidden structures in data rather than predicting continuous or discrete target labels.
One major subcategory of unsupervised learning is representation learning and
dimensionality reduction. A popular and classic technique for this is principal
component analysis (PCA; Figure 1.4.1).

Figure 1.4.1. Illustration of PCA. (A) A dataset (blue circles) with two feature
axes X1 and X2 that has been projected onto the principal components PC1
and PC2. (B) A view of the data projected onto the first principal component (PC1).
In a nutshell, PCA identifies the directions of maximum variance in a dataset. These
directions (eigenvectors of the covariance matrix) form the principal component axes
of a new coordinate system. In practice, PCA is typically used to reduce the
dimensionality of a dataset while retaining most of its information (variance). PCA is
a linear transformation technique. Thus, it cannot capture complex non-linear patterns
in data; however, kernel PCA and other non-linear dimensionality reduction
techniques exist. Autoencoders are deep neural network architectures that can be used
for non-linear dimensionality reduction. An autoencoder consists of two subnetworks,
an encoder and a decoder, as illustrated in Figure 1.4.2. An input (for example, an
image) is passed to the encoder, which compresses the input into a lower-dimensional
representation, also known as an “embedding vector” or “latent representation.” After
the encoder has embedded the input into this lower-dimensional space, the decoder
reconstructs the original image from this latent representation. The two subnetworks,
encoder and decoder, are connected end to end and are often depicted as an hourglass,
where the width of the hourglass represents the size of the feature map produced by
the multilayer neural network.
Figure 1.4.2. Illustration of an autoencoder.
In a typical autoencoder architecture, the encoder’s output is the smallest feature map
in the network, which is then fed into the decoder – this connection between encoder
and decoder is also often referred to as the “bottleneck.” The rationale behind the
autoencoder idea is that for the decoder to produce a faithful reconstruction of the
original input image, the encoder must learn to create a latent representation that
preserves most of the information. Typically, the autoencoder architecture is designed
such that the latent representation is lower-dimensional than the input (and its
reconstruction), so that the autoencoder cannot just memorize and pass the original
input through the architecture – this is sometimes also referred to as information
bottleneck.
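To make the dimensionality-reduction idea concrete, here is a minimal PCA sketch using scikit-learn. The Iris dataset and the choice of two retained components are illustrative assumptions, not part of the training material:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load a small four-dimensional dataset and standardize the features
X = load_iris().data
X_std = StandardScaler().fit_transform(X)

# Project the data onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("Original shape:", X.shape)      # (150, 4)
print("Reduced shape: ", X_2d.shape)   # (150, 2)
print("Variance retained:", pca.explained_variance_ratio_.sum())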
Another major subcategory of unsupervised learning is clustering, which assigns
group membership information to data points. It can be thought of as a task similar to
classification but without labeling information given in the training dataset. Hence, in
the absence of class label information, the clustering approach is to group data records
by similarity and define distinct groups based on similarity thresholds.
Clustering can be divided into three major groups: prototype-based, density-based,
and hierarchical clustering. In prototype-based clustering algorithms, such as K-means
(MacQueen, 1967; Lloyd, 1982), a fixed number of cluster centers is defined (the
cluster centers are repositioned iteratively), and data points are assigned to the closest
prototype based on a pair-wise distance measure (for example, Euclidean distance). In
density-based clustering, unlike in prototype-based clustering, the number of cluster
centers is not fixed; instead, clusters are found by identifying regions of high density
(locations where many data records lie close to each other, as measured by a
user-defined distance metric). In hierarchical clustering, a distance metric is used to
group examples in a tree-like fashion, where examples merged near the leaves of the
tree are more closely related to each other. The depth at which the tree is cut defines
the number of clusters.

Figure 1.4.3. Illustration of clustering. (A) A two-dimensional, unlabeled dataset. (B)
Clusters inferred by a clustering algorithm that groups similar points into the same
cluster.

CHAPTER 2
REGRESSION AND CLASSIFICATION

2.1 Regression analysis, Type of Regression


Regression analysis is a statistical method that is used to model the relationship
between a dependent variable and one or more independent variables. The dependent
variable is the variable that you are trying to predict, and the independent variables
are the variables that you are using to make the prediction.
There are many different types of regression analysis, each with its own strengths and
weaknesses. Some of the most common types of regression analysis include:
• Linear regression: This is the simplest type of regression analysis. It assumes
that the relationship between the dependent variable and the independent
variables is linear.
• Logistic regression: This is used for binary classification problems, where
the dependent variable can have two possible values, such as "yes" or "no."
• Polynomial regression: This is used when the relationship between the
dependent variable and the independent variables is not linear.
• Multiple regression: This is used when there is more than one independent
variable.
• Multivariate regression: This is used when there is more than one
dependent variable.
The choice of which type of regression analysis to use depends on the research
question and the data that is available. For example, if you are trying to predict the
price of a house, you might use linear regression. If you are trying to predict whether
or not someone will vote for a particular candidate, you might use logistic regression.
Regression analysis is a powerful tool that can be used to answer a variety of research
questions. However, it is important to remember that regression analysis is not a
magic bullet. It is important to understand the limitations of regression analysis and to
use it appropriately.
Here are some of the limitations of regression analysis:
• It assumes that the model residuals are approximately normally distributed.
• It assumes that the independent variables are not strongly correlated with one
another (no severe multicollinearity).
• It assumes that the error terms are independent and identically distributed.
• It can be sensitive to outliers.
It is important to consider these limitations when interpreting the results of a
regression analysis.

2.2 OLS, Linear, Logistic, Multiple

OLS stands for Ordinary Least Squares. It is the most common type of linear
regression. It minimizes the sum of the squared residuals, which are the differences
between the predicted values and the actual values.
Linear regression is a statistical method that is used to model the relationship between
a dependent variable and one or more independent variables. The dependent variable
is the variable that you are trying to predict, and the independent variables are the
variables that you are using to make the prediction.
Logistic regression is a type of regression analysis that is used for binary
classification problems, where the dependent variable can have two possible values,
such as "yes" or "no." It is a non-linear model that predicts the probability of a binary
outcome.
Multiple regression is a type of regression analysis that is used when there is more
than one independent variable. It is used to model the relationship between a
dependent variable and multiple independent variables.
Here is a table that summarizes the key differences between OLS, linear, logistic, and
multiple regression:
The best type of regression analysis to use depends on the research question and the
data that is available. For example, if you are trying to predict the price of a house,
you might use linear regression. If you are trying to predict whether or not someone
will vote for a particular candidate, you might use logistic regression. If you are trying
to predict the risk of heart disease, you might use multiple regression.
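To show what minimizing the sum of squared residuals means in practice, here is a minimal OLS sketch using NumPy's least-squares solver; the synthetic data and the true coefficients (intercept 2.0, slope 3.0) are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2.0 + 3.0*x + noise (illustrative values only)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=100)

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(x), x])

# OLS chooses the coefficients that minimize the sum of squared residuals
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
print("Estimated intercept:", intercept)   # close to 2.0
print("Estimated slope:    ", slope)       # close to 3.0

residuals = y - X @ coef
print("Sum of squared residuals:", (residuals ** 2).sum())

Scikit-learn's LinearRegression fits the same OLS solution; the NumPy version simply makes the least-squares objective explicit.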

2.3 Multi-class classification, Neural Network, K-Nearest Neighbours
Multi-class classification is a type of classification problem where the dependent
variable can have more than two possible values. For example, if you are trying to
classify images of animals, the dependent variable could be the animal's species.
There are many different algorithms that can be used for multi-class classification,
including:
• Neural networks: Neural networks are a type of machine learning algorithm that
is inspired by the human brain. They are very powerful and can be used to solve
a variety of problems, including multi-class classification.
• K-nearest neighbors: K-nearest neighbors is a simple algorithm that classifies a
new data point based on the k most similar data points in the training set. It is
easy to understand and implement, but it can be less accurate than other
algorithms.
• Support vector machines: Support vector machines are a type of machine
learning algorithm that creates a hyperplane that separates the different classes
of data. They are very accurate, but they can be more difficult to understand and
implement than other algorithms.
The best algorithm to use for multi-class classification depends on the specific
problem and the data that is available. For example, if you have a lot of data and you
are willing to spend time training the model, then you might use a neural network. If
you have less data or you are not willing to spend as much time training the model,
then you might use a simpler algorithm like K-nearest neighbors.
Neural networks are a type of machine learning algorithm that is inspired by the
human brain. They are made up of interconnected nodes, and each node performs a
simple computation. The nodes are arranged in layers, and the information flows from
the input layer to the output layer.
Neural networks can be used to solve a variety of problems, including classification,
regression, and clustering. They are very powerful and can be used to achieve state-
of-the-art results on many problems. However, they can be difficult to train and
require a lot of data.
K-nearest neighbors is a simple algorithm that classifies a new data point based on the
k most similar data points in the training set. The similarity is typically measured
using a distance metric, such as the Euclidean distance.
K-nearest neighbors is easy to understand and implement, but it can be less accurate
than other algorithms. It is also sensitive to the choice of k, which is the number of
neighbors to consider.
Support vector machines are a type of machine learning algorithm that creates a
hyperplane that separates the different classes of data. The hyperplane is chosen so
that it maximizes the margin between the classes.
Support vector machines are very accurate, but they can be more difficult to
understand and implement than other algorithms. They can also be slow to train on
very large datasets.
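As a hands-on sketch of multi-class classification, the following example compares k-nearest neighbors with a small neural network on the Iris dataset, which has three classes. The dataset, the choice of k = 5, and the network size are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Iris: a classic three-class classification dataset
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

# k-nearest neighbors: majority vote of the 5 closest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("KNN accuracy:", accuracy_score(y_te, knn.predict(X_te)))

# A small feed-forward neural network (multi-layer perceptron)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)
print("MLP accuracy:", accuracy_score(y_te, mlp.predict(X_te)))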
2.4 Decision Tree, Random Forest, and Naïve Bayes.
Decision trees are a type of supervised learning algorithm that can be used for both
classification and regression problems. They work by breaking down the data into
smaller and smaller subsets until each subset can be classified or predicted with a high
degree of certainty.
Random forests are an ensemble learning algorithm that combines multiple decision
trees to improve the accuracy of predictions. They work by training each decision tree
on a different subset of the data and then combining the predictions of the individual
trees.
Naive Bayes is a simple probabilistic classifier that assumes that the features are
independent of each other given the class. This assumption makes it very fast to train
and predict, but it can reduce accuracy when the features are in fact correlated.
Here is a table that summarizes the key differences between decision trees, random
forests, and naive Bayes:
The best algorithm to use depends on the specific problem and the data that is
available. For example, if you have a small dataset and you need to make predictions
quickly, then you might use a decision tree. If you have a large dataset and you need
to make accurate predictions, then you might use a random forest. If you have a
dataset with categorical features, then you might use naive Bayes.
Here are some additional things to keep in mind when choosing between decision
trees, random forests, and naive Bayes:
• Decision trees are easy to understand and interpret, but they are prone to overfitting.
• Random forests are more accurate than decision trees, but they can be more
difficult to interpret.
• Naive Bayes is fast and easy to train, but it can be less accurate than decision
trees or random forests.
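The following sketch compares the three classifiers on the same data; the wine dataset and the default hyperparameters are illustrative assumptions rather than a definitive benchmark:

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes":   GaussianNB(),
}

# 5-fold cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")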

TABLE 2.1 Algorithm Type

CHAPTER 3
CLUSTERING
Clustering is a type of unsupervised learning algorithm that is used to group similar
data points together. The goal of clustering is to find groups of data points that are as
similar as possible within each group and as different as possible between groups.
There are many different clustering algorithms, each with its own strengths and
weaknesses. Some of the most common clustering algorithms include:
• K-means clustering: This is a simple algorithm that divides the data into k
clusters, where k is a user-defined number. The algorithm starts from k initial
cluster centers (often chosen at random), assigns each data point to its nearest
center, recomputes each center as the mean of its assigned points, and repeats
these two steps until the assignments no longer change.
• Hierarchical clustering: This algorithm builds a hierarchy of clusters, where
each cluster is a subcluster of another cluster. The algorithm starts by creating a
cluster for each data point. Then, it merges the two most similar clusters until
there is only one cluster left.
• Density-based clustering: This algorithm clusters data points that are densely
packed together. The algorithm starts by identifying dense regions of data
points. Then, it merges these regions together until there are no more dense
regions left.
• Distribution-based clustering: This algorithm assumes the data were generated
from a mixture of probability distributions (for example, Gaussians). It fits the
mixture to the data and assigns each point to the distribution most likely to have
generated it.
The best clustering algorithm to use depends on the specific problem and the data that
is available. For example, k-means is fast and works well when the clusters are roughly
compact and the number of clusters is known in advance. Hierarchical clustering is
useful for smaller datasets where the cluster hierarchy itself is of interest. Density-based
clustering, such as DBSCAN, is useful when clusters have irregular shapes or the data
contains noise and outliers (see the sketch below).
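The sketch below applies the two clustering algorithms named in the contents, k-means and DBSCAN, to a small synthetic dataset using scikit-learn. The blob data and the parameter values (k = 3, eps = 0.5) are illustrative assumptions:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic, unlabeled 2-D data with three dense groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Centroid-based: k-means with a user-chosen number of clusters, k = 3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-means cluster sizes:",
      [int((kmeans.labels_ == i).sum()) for i in range(3)])
print("Cluster centers:\n", kmeans.cluster_centers_)

# Density-based: DBSCAN infers the number of clusters from dense regions;
# points labeled -1 are treated as noise
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print("DBSCAN found", n_clusters, "clusters and",
      int((dbscan.labels_ == -1).sum()), "noise points")

Note how k-means needs the number of clusters up front, whereas DBSCAN infers it (together with a set of noise points) from the density of the data.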
Here are some of the key concepts in clustering:
• Cluster: A group of data points that are similar to each other.
• Centroid: The center of a cluster.
• Density: The number of data points in a particular region of space.
• Distribution: The shape of the data points in a particular region of space.
Here are some of the benefits of clustering:
• It can be used to find hidden patterns in data.
• It can be used to reduce the dimensionality of data.
• It can be used to identify outliers.
• It can be used to segment customers.
• It can be used to group documents.
Here are some of the challenges of clustering:
• The choice of clustering algorithm can be difficult.
• The number of clusters is a hyperparameter that needs to be tuned.
• The clusters can be noisy or overlapping.
• The clusters can be unstable to changes in the data.
Fig. 3.1.1 Types of clustering algorithms
CHAPTER 4
TABLEAU: HANDS ON
Tableau is a data visualization software that allows you to create interactive
dashboards and charts. It is a powerful tool that can be used to communicate insights
from data in a clear and concise way.
4.1 Basics of Data Visualization
The basics of data visualization include:
• Choosing the right chart type for the data.
• Using colors and fonts that are easy to read.
• Adding labels and annotations to explain the data.
• Formatting the data in a way that is visually appealing.
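Tableau itself is a point-and-click tool, but the same principles carry over to code. Purely as an illustration, the following sketch assumes Python with matplotlib and invented monthly sales figures, and applies the points above: an appropriate chart type, readable labels, and a simple annotation.

import matplotlib.pyplot as plt

# Invented monthly sales figures, for illustration only
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 170, 190]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, sales, marker="o", color="tab:blue")  # a line chart suits a trend over time

# Labels and an annotation explain the data to the reader
ax.set_title("Monthly Sales (illustrative data)")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.annotate("Best month", xy=(5, 190), xytext=(3, 185),
            arrowprops=dict(arrowstyle="->"))

plt.tight_layout()
plt.show()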
4.2 Tableau: Basic and advanced charts
Tableau has a wide variety of charts that can be used to visualize data. Some of the
most common charts include:
• Bar charts
• Line charts
• Pie charts
• Scatterplots
• Maps
Tableau also has a number of advanced charts that can be used to visualize more
complex data. These charts include:
• Treemaps
• Heatmaps
• Waterfall charts
• Sankey diagrams
4.3 Maps and interactive dashboards
Tableau can be used to create interactive maps that allow you to explore data by
location. It can also be used to create interactive dashboards that allow you to present
data in a way that is easy to understand and navigate.
Here are some of the benefits of using Tableau for data visualization:
• It is easy to use.
• It is powerful.
• It is versatile.
• It is interactive.
• It is shareable.
Here are some of the challenges of using Tableau:
• It can be expensive.
• It can be time-consuming to learn.
• It can be difficult to create complex visualizations.
Overall, Tableau is a powerful data visualization tool that can be used to communicate
insights from data in a clear and concise way. It is a versatile tool that can be used for
a variety of purposes.

4.4 Case studies


Case studies are a valuable way to learn about the principles and applications of data
science and machine learning. They can provide real-world examples of how these
techniques are being used to solve problems and make predictions.
The Netflix Prize, Google Flu Trends, Amazon's Recommendation Engine, and
Spotify's Music Recommendation Engine are all great examples of how data science
and machine learning are being used in the real world.
Amazon's Recommendation Engine
Fig. 4.4.1 Amazon Recommendation Engine case study
Amazon's Recommendation Engine is a system that recommends products to users
based on their past purchases, browsing history, and ratings. The engine is based on a
variety of factors, including:
• The user's past purchases: The engine takes into account the products that
the user has previously purchased.
• The user's browsing history: The engine also takes into account the
products that the user has browsed but not purchased.
• The user's ratings: The engine also takes into account the ratings that the
user has given to products.
• The user's demographics: The engine may also take into account the user's
demographics, such as their age, gender, and location.
• The popularity of the product: The engine may also take into account the
popularity of the product, such as how many other users have purchased it.
The Amazon Recommendation Engine is a powerful tool that can help users find
products that they are interested in. It is also a valuable tool for Amazon, as it can help
them increase sales and improve the customer experience.
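Amazon's actual system is proprietary, so the following is only a toy sketch of the general idea: item-based collaborative filtering that recommends products whose rating patterns resemble those of products a user has already rated highly. All product names and ratings below are invented:

import numpy as np

# Invented user-item rating matrix (rows = users, columns = products); 0 = not rated
products = ["book", "headphones", "kettle", "laptop", "mouse"]
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 0, 0, 1],
    [0, 0, 5, 4, 0],
    [1, 0, 4, 5, 5],
])

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Item-item similarity based on how users co-rate the products
n_items = ratings.shape[1]
sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                 for j in range(n_items)] for i in range(n_items)])

# Recommend for user 0: score each unrated item by its similarity
# to the items that user 0 has already rated, weighted by the ratings
user = ratings[0]
scores = {products[j]: sum(sim[i, j] * user[i]
                           for i in range(n_items) if user[i] > 0)
          for j in range(n_items) if user[j] == 0}
print("Recommendations for user 0:", sorted(scores, key=scores.get, reverse=True))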
Here are some of the benefits of Amazon's Recommendation Engine:
• It can help users find products that they are interested in.
• It can help Amazon increase sales.
• It can help Amazon improve the customer experience.
• It can help Amazon personalize the user experience.
• It can help Amazon prevent customers from abandoning their carts.
Here are some of the challenges of Amazon's Recommendation Engine:
• It can be difficult to keep up with the ever-changing data.
• It can be difficult to personalize the recommendations for each user.
• It can be difficult to prevent the recommendations from becoming stale.
Overall, Amazon's Recommendation Engine is a powerful tool that can help Amazon
improve the customer experience and increase sales. However, there are some
challenges that need to be addressed in order to make the engine even more effective.

SUMMARY OF ACHIEVEMENT
In our journey through Principles, Computational Tools, and Case Studies in Data
Science and Machine Learning, we've attained a comprehensive grasp of the core
tenets that guide data-driven decision-making. Equipped with this understanding,
we've harnessed a diverse range of computational tools and libraries, transforming
theoretical concepts into practical applications. Through real-world case studies,
we've witnessed the transformative potential of predictive modeling, pattern
recognition, and automation across industries. This exploration has emphasized not
only the power of data but also the ethical imperative of responsible and unbiased
practices. As we stand at the intersection of human expertise and machine learning
capabilities, our achievements signify a readiness to navigate complexities and
innovate within the dynamic landscape of data science and machine learning.

MAIN DIFFICULTIES ENCOUNTERED


The primary challenge encountered throughout our exploration of Principles,
Computational Tools, and Case Studies in Data Science and Machine Learning was
the intricate balance between theory and practice. While comprehending the
foundational principles was crucial for effective decision-making, translating these
principles into actionable insights using computational tools often presented hurdles.
The transition from abstract concepts to tangible models demanded a deep
understanding of algorithms, data preprocessing techniques, and parameter tuning.
Furthermore, the application of these tools in real-world case studies revealed the
complexity of adapting solutions to unique contexts, requiring creative problem-
solving and continuous adaptation. This interplay between theory, tools, and real-
world scenarios highlighted the need for a multidisciplinary skill set, where expertise
in both technical aspects and domain knowledge proved essential in overcoming the
challenges inherent to data science and machine learning endeavors.

CONCLUSION
The exploration of Principles, Computational Tools, and Case Studies in Data Science
and Machine Learning underscores the transformative potential of these fields in
reshaping our understanding of data, patterns, and decision-making.
Principles serve as the foundation, emphasizing the significance of problem
understanding, data preprocessing, feature selection, algorithm choice, and model
evaluation. The iterative nature of the process highlights the need for continuous
refinement and adaptation.
Computational Tools are the engines that power innovation. From libraries like
Scikit-Learn and TensorFlow to languages like Python and R, these tools democratize
complex processes, enabling experts and novices alike to engage in impactful data
analysis and modeling.
Case Studies illuminate the real-world impact of Data Science and Machine
Learning. Whether it's predicting customer behavior in e-commerce, optimizing
energy consumption, or revolutionizing healthcare with predictive diagnostics, these
studies highlight the tangible benefits of harnessing data's potential.
In a world inundated with data, these disciplines offer a lens to decipher complexities.
Yet, challenges abound – ethical considerations, bias mitigation, and privacy
preservation demand vigilant attention. As we navigate forward, a symbiotic
relationship between human expertise and machine learning capabilities emerges,
driving innovation and pushing boundaries.
In essence, the amalgamation of Principles, Computational Tools, and Case Studies in
Data Science and Machine Learning encapsulates a paradigm shift in problem-
solving. It's a journey of continuous learning, where data becomes a canvas and
algorithms paint the future. With each breakthrough, we inch closer to a reality where
data-driven insights illuminate the path to progress.

REFERENCES
1. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning,
2nd ed., New York: Springer, 2009.
2. J. K.-U. Brock, Data Design: The Visual Display of Qualitative and
Quantitative Information, 1st ed., Consulting Press, 2017.
3. https://media.licdn.com/dms/image/C5612AQHpg16qX4Bvhw/article-inline_image-
shrink_1000_1488/0/1605942452368?
e=1698278400&v=beta&t=Ca6dD2nlHN6dQZOcXYhA-bhnrAT4GbvzRSjrmBGbaAE
- Image credit: Fig. 1.1.1

4. https://www.assignmenthelppro.com/blog/inferential-vs-descriptive-statistics/
- Table credit: Table 1.1

