Machine Learning


SWDML501: MACHINE LEARNING APPLICATION

Competence: APPLY MACHINE LEARNING FUNDAMENTALS

School: ESSJT

ICT AND MULTIMEDIA

Class: level 5 SWD

Module name: SWDML501 MACHINE LEARNING APPLICATION

Competency: APPLY MACHINE LEARNING FUNDAMENTALS

Learning Hours: 80

TRAINER:

Elements of Competency and Performance Criteria

1. Apply Data Pre-processing
   1.1. Environment is properly prepared based on system requirements.
   1.2. Data is properly manipulated based on the Python libraries' functionalities.
   1.3. Visualization results are properly interpreted based on their statistical analysis.
   1.4. Data cleaning is appropriately performed based on the provided dataset.

2. Develop Machine Learning Model
   2.1. Machine Learning algorithm is properly selected based on the characteristics of the dataset.
   2.2. Machine Learning models are properly trained based on a training set of data.
   2.3. Machine Learning model performance is properly evaluated based on appropriate evaluation metrics.
   2.4. Hyperparameters are properly fine-tuned based on evaluation results.

3. Perform Model deployment
   3.1. Deployment methods are clearly selected based on the requirements.
   3.2. Model file is properly integrated in the system based on the deployment method (RESTful API guidelines).
   3.3. Prediction responses are accurately delivered to the clients based on the model insights.
1. Apply Data Pre-processing
1.1. Environment is properly prepared based on system requirements.

Description of Machine learning concepts

 Machine learning overview


Definition:

Machine Learning (ML) is the field of computer science that enables computer systems to make sense of data in much the same way as human beings do.

In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed and without human intervention.

Key Components of Machine Learning

1. Data: It is the cornerstone of all machine learning algorithms. Without data, ML algorithms cannot learn. The data can be in various formats, such as text, images, videos, or even sensor data.

2. Models: A model in machine learning is a mathematical representation of a real-world process. The model is what learns from the data by adjusting its parameters to fit the observed data as closely as possible.

3. Algorithms: These are the methods used to train models. They adjust the model's parameters to minimize the difference between the model's predictions and the actual observed outcomes.

4. Evaluation: This involves assessing how well your model is performing. Common metrics for this include accuracy, precision, recall, and F1 score, depending on the problem you're solving (e.g., classification, regression). A short code sketch tying these four components together is shown below.

Machine learning life cycle

Machine learning gives computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle. The machine learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

In the complete life cycle process, to solve a problem, we create a machine learning system called a "model", and this model is created by providing it with "training". But to train a model, we need data; hence, the life cycle starts with collecting data.

1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the data sources and obtain all the data related to the problem.

In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output: the more data there is, the more accurate the prediction will be.

This step includes the below tasks:

o Identify various data sources


o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.

2. Data preparation

After collecting the data, we need to prepare it for the further steps. Data preparation is a step where we put our data into a suitable place and prepare it for use in machine learning training.

In this step, we first put all the data together and then randomize the ordering of the data.

This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of the data that we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. Here we find correlations, general trends, and outliers, as in the short pandas sketch below.

o Data pre-processing:
The next step is pre-processing the data for its analysis.
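As a small illustration of data exploration, the pandas sketch below prints the shape, column types, basic statistics, and correlations of a dataset; pandas is assumed to be installed and the file name data.csv is purely illustrative:

# Illustrative data-exploration sketch with pandas (assumed installed).
import pandas as pd

df = pd.read_csv("data.csv")         # "data.csv" is a placeholder for the collected dataset

print(df.shape)                      # number of rows and columns
df.info()                            # column types and non-null counts
print(df.describe())                 # basic statistics for numeric columns
print(df.corr(numeric_only=True))    # correlations between numeric columns (recent pandas)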

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address quality issues.

The data we have collected is not always useful to us, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:

o Missing Values
o Duplicate data
o Invalid data

o Noise

So, we use various filtering techniques to clean the data.

It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
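The sketch below shows how such cleaning could look with pandas (assumed installed); the file name and the column names "target" and "age" are illustrative assumptions, not part of the module's dataset:

# Illustrative data-wrangling steps with pandas; column names are assumptions.
import pandas as pd

df = pd.read_csv("data.csv")                       # placeholder file name

df = df.drop_duplicates()                          # remove duplicate rows
df = df.dropna(subset=["target"])                  # drop rows whose label is missing
df["age"] = df["age"].fillna(df["age"].median())   # fill missing numeric values
df = df[df["age"].between(0, 120)]                 # filter out invalid/noisy values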

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques

o Building models
o Review the result

The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as Classification, Regression, Cluster analysis, or Association; we then build the model using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to build the model.

5. Train Model

The next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.

6. Test Model

Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
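A minimal sketch of this train/test idea with scikit-learn (assumed installed) is shown below; the breast-cancer dataset and the decision tree are illustrative choices:

# Hold out part of the data to measure accuracy on examples the model has never seen.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)    # train on the training set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # evaluate on the test set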

7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.

If the above-prepared model produces accurate results as per our requirements with acceptable speed, we deploy the model in the real system. But before deploying the project, we check whether it keeps improving its performance using the available data. The deployment phase is similar to making the final report for a project.

https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications

Machine Learning applications


Popular Machine Learning Applications and Examples

1. Social Media Features

Social media platforms use machine learning algorithms and approaches to create attractive and useful features. For instance, Facebook notices and records your activities, chats, likes, comments, and the time you spend on specific kinds of posts. Machine learning learns from this activity and makes friend and page suggestions for your profile.

2. Product Recommendations

Product recommendation is one of the most popular and well-known applications of machine learning, and it is a standout feature of almost every e-commerce website today. Using machine learning and AI, websites track your behavior based on your previous purchases, search patterns, and cart history, and then make product recommendations.

3. Image Recognition

Image recognition, which is an approach for cataloging and detecting a feature or an object in
the digital image, is one of the most significant and notable machine learning and AI
techniques. This technique is being adopted for further analysis, such as pattern recognition,
face detection, and face recognition.

4. Sentiment Analysis

Sentiment analysis is one of the most necessary applications of machine learning. Sentiment
analysis is a real-time machine learning application that determines the emotion or opinion of
the speaker or the writer. For instance, if someone has written a review or email (or any form
of a document), a sentiment analyzer will instantly find out the actual thought and tone of the
text. This sentiment analysis application can be used to analyze a review based website,
decision-making applications, etc.

5. Automating Employee Access Control

Organizations are actively implementing machine learning algorithms to determine the level
of access employees would need in various areas, depending on their job profiles. This is one
of the coolest applications of machine learning.

6. Marine Wildlife Preservation

Machine learning algorithms are used to develop behavior models for endangered cetaceans
and other marine species, helping scientists regulate and monitor their populations.

7. Regulating Healthcare Efficiency and Medical Services

Significant healthcare sectors are actively looking at using machine learning algorithms to manage operations better. They predict the waiting times of patients in the emergency waiting rooms across various hospital departments. The models use vital factors that help define the algorithm: details of staff at various times of day, patient records, and complete logs of department chats and the layout of the emergency rooms. Machine learning algorithms also come into play in detecting a disease, planning therapy, and predicting how the disease will develop. This is one of the most necessary machine learning applications.

8. Predict Potential Heart Failure

An algorithm designed to scan a doctor’s free-form e-notes and identify patterns in a patient’s
cardiovascular history is making waves in medicine. Instead of a physician digging through
multiple health records to arrive at a sound diagnosis, redundancy is now reduced with
computers making an analysis based on available information.

9. Banking Domain

Banks are now using the latest advances machine learning has to offer to help prevent fraud and protect accounts from hackers. The algorithms determine which factors to consider when creating a filter to keep harm at bay. Unauthentic sites will be automatically filtered out and restricted from initiating transactions.

10. Language Translation

One of the most common machine learning applications is language translation. Machine
learning plays a significant role in the translation of one language to another. We are amazed
at how websites can translate from one language to another effortlessly and give contextual
meaning as well. The technology behind the translation tool is called ‘machine translation.’ It
has enabled people to interact with others from all around the world; without it, life would
not be as easy as it is now. It has provided confidence to travelers and business associates to
safely venture into foreign lands with the conviction that language will no longer be a barrier.

https://www.javatpoint.com/machine-learning-life-cycle

Machine learning Advantages and disadvantages

Advantages of Machine Learning

1. Automation

Machine Learning is one of the driving forces behind automation, and it is cutting down time and human workload. Automation can now be seen everywhere, and complex algorithms do the hard work for the user. Automation is more reliable, efficient, and quick. With the help of machine learning, advanced computers are now being designed that can handle several machine learning models and complex algorithms. However, although automation is spreading fast in industry, a lot of research and innovation is still required in this field.

2. Scope of Improvement

Machine Learning is a field where things keep evolving. It gives many opportunities for
improvement and can become the leading technology in the future. A lot of research and
innovation is happening in this technology, which helps improve software and hardware.

3. Enhanced Experience in Online Shopping and Quality Education

Machine Learning is going to be used extensively in the education sector, and it will enhance the quality of education and the student experience. This has already emerged in China, where machine learning has been used to improve student focus. In the e-commerce field, machine learning studies your search feed and gives suggestions based on it. Depending on your search and browsing history, it pushes targeted advertisements and notifications to users.

4. Wide Range of Applicability

This technology has a very wide range of applications. Machine learning plays a role in almost
every field, like hospitality, ed-tech, medicine, science, banking, and business. It creates
more opportunities.

Disadvantages of the Machine Learning

Nothing in the world is perfect. Machine Learning has some serious limitations, some of which are bigger than human errors.

1. Data Acquisition

The whole concept of machine learning is about identifying useful data. The outcome will be incorrect if a credible data source is not provided. The quality of the data is also significant: if the user or institution needs higher-quality data, they have to wait for it, which causes delays in providing the output. So, machine learning depends significantly on the data and its quality.

2. Time and Resources

The data that machines process is huge in quantity and differs greatly. Machines require time so that their algorithms can adjust to the environment and learn from it. Trial runs are held to check the accuracy and reliability of the machine. Setting up that quality of infrastructure requires massive and expensive resources and high-quality expertise. Trial runs are also costly, both in terms of time and expense.

3. Results Interpretations

One of the biggest disadvantages of machine learning is that the results we get from it cannot be one hundred percent accurate; they will always have some degree of inaccuracy. For a high degree of accuracy, algorithms should be developed so that they give reliable results.

4. High Error Chances

The errors committed during the initial stages are huge, and if they are not corrected at that time, they create havoc. Bias and incorrectness have to be dealt with separately; they are not interconnected. Machine learning depends on two factors, i.e., data and algorithm. All the errors depend on these two variables, and any incorrectness in either of them has huge repercussions on the output.

5. Social Changes

Machine learning is bringing numerous social changes to society. The role of machine-learning-based technology in society has increased manifold. It is influencing the thought process of society and creating unwanted problems. Character assassination and the misuse of sensitive details are disturbing the social fabric of society.

6. Elimination of Human Interface

Automation, Artificial Intelligence, and Machine Learning have eliminated the human interface from some work, and with it some employment opportunities. Now, those tasks are carried out with the help of artificial intelligence and machine learning.

7. Changing Nature of Jobs

With the advancement of machine learning, the nature of jobs is changing. Much of the work is now done by machines, which are taking over jobs that were earlier done by humans. It is difficult for those without technical education to adjust to these changes.

8. Highly Expensive

This software is highly expensive, and not everybody can own it. Government agencies, big
private firms, and enterprises mostly own it. It needs to be made accessible to everybody for
wide use.
9. Privacy Concern

As we know, one of the pillars of machine learning is data. The collection of data has raised fundamental questions of privacy. The way data is collected and used for commercial purposes has always been a contentious issue. In India, the Supreme Court has declared privacy a fundamental right of Indians: without the user's permission, data cannot be collected, used, or stored. However, many cases have come up where big firms collect data without the user's knowledge and use it for their commercial gain.

10. Research and Innovations

Machine learning is an evolving concept. This area has not yet seen any major development that fully revolutionizes an economic sector. The area requires continuous research and innovation.

https://www.javatpoint.com/advantages-and-disadvantages-of-machine-learning

Difference between machine learning, artificial intelligence, and deep learning

 Artificial Intelligence
Artificial Intelligence is basically the mechanism for incorporating human intelligence into machines through a set of rules (algorithms). AI is a combination of two words: "Artificial", meaning something made by humans or non-natural, and "Intelligence", meaning the ability to understand or think accordingly. Another definition could be that "AI is basically the study of training your machines (computers) to mimic a human brain and its thinking capabilities".
AI focuses on three major aspects (skills): learning, reasoning, and self-correction, to obtain the maximum efficiency possible.
 Machine Learning:
Machine Learning is basically the study/process which enables the system (computer) to learn automatically on its own through experience and to improve accordingly, without being explicitly programmed. ML is an application or subset of AI. ML focuses on the development of programs so that they can access data and use it for themselves. The entire process makes observations on data to identify possible patterns and make better future decisions, as per the examples provided. The major aim of ML is to allow systems to learn by themselves through experience, without any kind of human intervention or assistance.
 Deep Learning:
Deep Learning is basically a sub-part of the broader family of Machine Learning which makes use of Neural Networks (similar to the neurons working in our brain) to mimic human brain-like behavior. DL algorithms focus on information-processing patterns to identify patterns, just like our human brain does, and classify the information accordingly. DL works on larger sets of data compared to ML, and the prediction mechanism is self-administered by the machines.
Below are the differences between Artificial Intelligence, Machine Learning and Deep Learning, compared point by point:

Definition
 Artificial Intelligence (AI): AI stands for Artificial Intelligence, and is basically the study/process which enables machines to mimic human behaviour through a particular algorithm.
 Machine Learning (ML): ML stands for Machine Learning, and is the study that uses statistical methods enabling machines to improve with experience.
 Deep Learning (DL): DL stands for Deep Learning, and is the study that makes use of Neural Networks (similar to neurons present in the human brain) to imitate functionality just like a human brain.

Relationship
 AI is the broader family consisting of ML and DL as its components.
 ML is the subset of AI.
 DL is the subset of ML.

What it is
 AI is a computer algorithm which exhibits intelligence through decision making.
 ML is an AI algorithm which allows systems to learn from data.
 DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data and provide output accordingly.

Mathematics involved
 AI: Search trees and much more complex math are involved in AI.
 ML: If you have a clear idea about the logic (math) involved behind it and you can visualize the complex functionalities like K-Means, Support Vector Machines, etc., then it defines the ML aspect.
 DL: If you are clear about the math involved but do not have an idea about the features, so you break the complex functionalities into linear/lower-dimension features by adding more layers, then it defines the DL aspect.

Aim
 AI: The aim is basically to increase the chances of success and not accuracy.
 ML: The aim is to increase accuracy, not caring much about the success ratio.
 DL: It attains the highest rank in terms of accuracy when it is trained with a large amount of data.

Broad categories
 AI: Three broad categories/types of AI are Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI).
 ML: Three broad categories/types of ML are Supervised Learning, Unsupervised Learning and Reinforcement Learning.
 DL: DL can be considered as neural networks with a large number of parameters and layers, lying in one of the four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.

Efficiency
 AI: The efficiency of AI is basically the efficiency provided by ML and DL respectively.
 ML: Less efficient than DL as it cannot work for longer dimensions or higher amounts of data.
 DL: More powerful than ML as it can easily work for larger sets of data.

Examples
 AI: Examples of AI applications include Google's AI-powered predictions, ridesharing apps like Uber and Lyft, and the AI autopilot used by commercial flights.
 ML: Examples of ML applications include virtual personal assistants (Siri, Alexa, Google Assistant, etc.) and email spam and malware filtering.
 DL: Examples of DL applications include sentiment-based news aggregation, image analysis and caption generation, etc.

Scope
 AI refers to the broad field of computer science that focuses on creating intelligent machines that can perform tasks that would normally require human intelligence, such as reasoning, perception, and decision-making.
 ML is a subset of AI that focuses on developing algorithms that can learn from data and improve their performance over time without being explicitly programmed.
 DL is a subset of ML that focuses on developing deep neural networks that can automatically learn and extract features from data.

Subfields and categories
 AI can be further broken down into various subfields such as robotics, natural language processing, computer vision, expert systems, and more.
 ML algorithms can be categorized as supervised, unsupervised, or reinforcement learning. In supervised learning, the algorithm is trained on labeled data, where the desired output is known. In unsupervised learning, the algorithm is trained on unlabeled data, where the desired output is unknown. In reinforcement learning, the algorithm learns by trial and error, receiving feedback in the form of rewards or punishments.
 DL algorithms are inspired by the structure and function of the human brain, and they are particularly well-suited to tasks such as image and speech recognition. DL networks consist of multiple layers of interconnected neurons that process data in a hierarchical manner, allowing them to learn increasingly complex representations of the data.

Approach
 AI systems can be rule-based, knowledge-based, or data-driven.

AI vs. Machine Learning vs. Deep Learning Examples:


Artificial Intelligence (AI) refers to the development of computer systems that can perform
tasks that would normally require human intelligence.

There are numerous examples of AI applications across various industries. Here are some common examples:

 Speech recognition: speech recognition systems use deep learning algorithms
to recognize and classify spoken language. These systems are used in a variety
of applications, such as virtual assistants, call centers, and voice-controlled devices.
 Personalized recommendations: E-commerce sites and streaming services like
Amazon and Netflix use AI algorithms to analyze users’ browsing and viewing
history to recommend products and content that they are likely to be interested
in.
 Predictive maintenance: AI-powered predictive maintenance systems analyze
data from sensors and other sources to predict when equipment is likely to fail,
helping to reduce downtime and maintenance costs.
 Medical diagnosis: AI-powered medical diagnosis systems analyze medical
images and other patient data to help doctors make more accurate diagnoses and
treatment plans.
 Autonomous vehicles: Self-driving cars and other autonomous vehicles use AI
algorithms and sensors to analyze their environment and make decisions about
speed, direction, and other factors.

 Virtual Personal Assistants (VPA) like Siri or Alexa – these use natural
language processing to understand and respond to user requests, such as playing
music, setting reminders, and answering questions.
 Autonomous vehicles – self-driving cars use AI to analyze sensor data, such as
cameras and lidar, to make decisions about navigation, obstacle avoidance, and
route planning.
 Fraud detection – financial institutions use AI to analyze transactions and
detect patterns that are indicative of fraud, such as unusual spending patterns or
transactions from unfamiliar locations.
 Image recognition – AI is used in applications such as photo organization,
security systems, and autonomous robots to identify objects, people, and scenes
in images.
 Natural language processing – AI is used in chatbots and language translation
systems to understand and generate human-like text.
 Predictive analytics – AI is used in industries such as healthcare and marketing
to analyze large amounts of data and make predictions about future events, such
as disease outbreaks or consumer behavior.
 Game-playing AI – AI algorithms have been developed to play games such as
chess, Go, and poker at a superhuman level, by analyzing game data and making
predictions about the outcomes of moves.
Examples of Machine Learning:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves the use of
algorithms and statistical models to allow a computer system to “learn” from data and
improve its performance over time, without being explicitly programmed to do so.

Here are some examples of Machine Learning:


 Image recognition: Machine learning algorithms are used in image recognition
systems to classify images based on their contents. These systems are used in a
variety of applications, such as self-driving cars, security systems, and medical
imaging.
 Speech recognition: Machine learning algorithms are used in speech
recognition systems to transcribe speech and identify the words spoken. These
systems are used in virtual assistants like Siri and Alexa, as well as in call
centers and other applications.

 Natural language processing (NLP): Machine learning algorithms are used in
NLP systems to understand and generate human language. These systems are
used in chatbots, virtual assistants, and other applications that involve natural
language interactions.
 Recommendation systems: Machine learning algorithms are used in
recommendation systems to analyze user data and recommend products or
services that are likely to be of interest. These systems are used in e-commerce
sites, streaming services, and other applications.
 Sentiment analysis: Machine learning algorithms are used in sentiment analysis
systems to classify the sentiment of text or speech as positive, negative, or
neutral. These systems are used in social media monitoring and other
applications.
 Predictive maintenance: Machine learning algorithms are used in predictive
maintenance systems to analyze data from sensors and other sources to predict
when equipment is likely to fail, helping to reduce downtime and maintenance
costs.
 Spam filters in email – ML algorithms analyze email content and metadata to
identify and flag messages that are likely to be spam.
 Recommendation systems – ML algorithms are used in e-commerce websites
and streaming services to make personalized recommendations to users based on
their browsing and purchase history.
 Predictive maintenance – ML algorithms are used in manufacturing to predict
when machinery is likely to fail, allowing for proactive maintenance and
reducing downtime.
 Credit risk assessment – ML algorithms are used by financial institutions to
assess the credit risk of loan applicants, by analyzing data such as their income,
employment history, and credit score.
 Customer segmentation – ML algorithms are used in marketing to segment
customers into different groups based on their characteristics and behavior,
allowing for targeted advertising and promotions.
 Fraud detection – ML algorithms are used in financial transactions to detect
patterns of behavior that are indicative of fraud, such as unusual spending
patterns or transactions from unfamiliar locations.

 Speech recognition – ML algorithms are used to transcribe spoken words into
text, allowing for voice-controlled interfaces and dictation software.
Examples of Deep Learning:
Deep Learning is a type of Machine Learning that uses artificial neural networks with
multiple layers to learn and make decisions.

Here are some examples of Deep Learning:


 Image and video recognition: Deep learning algorithms are used in image and
video recognition systems to classify and analyze visual data. These systems are
used in self-driving cars, security systems, and medical imaging.
 Generative models: Deep learning algorithms are used in generative models to
create new content based on existing data. These systems are used in image and
video generation, text generation, and other applications.
 Autonomous vehicles: Deep learning algorithms are used in self-driving cars
and other autonomous vehicles to analyze sensor data and make decisions about
speed, direction, and other factors.
 Image classification – Deep Learning algorithms are used to recognize objects
and scenes in images, such as recognizing faces in photos or identifying items in
an image for an e-commerce website.
 Speech recognition – Deep Learning algorithms are used to transcribe spoken
words into text, allowing for voice-controlled interfaces and dictation software.
 Natural language processing – Deep Learning algorithms are used for tasks
such as sentiment analysis, language translation, and text generation.
 Recommender systems – Deep Learning algorithms are used in
recommendation systems to make personalized recommendations based on
users’ behavior and preferences.
 Fraud detection – Deep Learning algorithms are used in financial transactions
to detect patterns of behavior that are indicative of fraud, such as unusual
spending patterns or transactions from unfamiliar locations.
 Game-playing AI – Deep Learning algorithms have been used to develop game-
playing AI that can compete at a superhuman level, such as the AlphaGo AI that
defeated the world champion in the game of Go.
 Time series forecasting – Deep Learning algorithms are used to forecast future
values in time series data, such as stock prices, energy consumption, and
weather patterns.
https://www.geeksforgeeks.org/difference-between-artificial-intelligence-vs-machine-learning-vs-deep-learning/
 Types of Machine Learning
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

Types of Machine Learning

1. Supervised Machine Learning


Supervised learning is when a model is trained on a "labelled dataset". Labelled datasets have both input and output parameters. In supervised learning, algorithms learn to map inputs to the correct outputs. Both the training and validation datasets are labelled.

Supervised Learning

Let’s understand it with the help of an example.


Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed labelled images of dogs and cats to the algorithm, the machine will learn to classify a dog or a cat from these labelled images. When we input new dog or cat images that it has never seen before, it will use the learned model to predict whether it is a dog or a cat. This is how supervised learning works, and this particular case is image classification.
There are two main categories of supervised learning that are mentioned below:
 Classification
 Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete
classes or labels. For instance, classifying emails as spam or not spam, or predicting
whether a patient has a high risk of heart disease. Classification algorithms learn to map the
input features to one of the predefined classes.
Here are some classification algorithms:
 Logistic Regression
 Support Vector Machine
 Random Forest
 Decision Tree
 K-Nearest Neighbors (KNN)
 Naive Bayes
Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.
Here are some regression algorithms:
 Linear Regression
 Polynomial Regression
 Ridge Regression
 Lasso Regression
 Decision tree
 Random Forest
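To make the two categories concrete, here is a minimal, illustrative sketch with scikit-learn (assumed installed): logistic regression predicts a discrete class, while linear regression predicts a continuous value on synthetic data.

# Classification vs. regression in scikit-learn (illustrative datasets).
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a discrete class (iris species)
X_c, y_c = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X_c, y_c)
print(clf.predict(X_c[:2]))                     # predicted class labels

# Regression: predict a continuous numerical value (synthetic data)
X_r, y_r = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:2]))                     # predicted numeric values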
Advantages of Supervised Machine Learning
 Supervised Learning models can have high accuracy as they are trained
on labelled data.
 The process of decision-making in supervised learning models is often interpretable.
 It can often be used in pre-trained models which saves time and resources when
developing new models from scratch.
Disadvantages of Supervised Machine Learning
 It has limitations in knowing patterns and may struggle with unseen or
unexpected patterns that are not present in the training data.
 It can be time-consuming and costly as it relies on labeled data only.
 It may lead to poor generalizations based on new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
 Image classification: Identify objects, faces, and other features in images.
 Natural language processing: Extract information from text, such as sentiment, entities, and relationships.
 Speech recognition: Convert spoken language into text.
 Recommendation systems: Make personalized recommendations to users.

 Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
 Medical diagnosis: Detect diseases and other medical conditions.
 Fraud detection: Identify fraudulent transactions.
 Autonomous vehicles: Recognize and respond to objects in the environment.
 Email spam detection: Classify emails as spam or not spam.
 Quality control in manufacturing: Inspect products for defects.
 Credit scoring: Assess the risk of a borrower defaulting on a loan.
 Gaming: Recognize characters, analyze player behavior, and create NPCs.
 Customer support: Automate customer support tasks.
 Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
 Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs.
The primary goal of Unsupervised learning is often to discover hidden patterns, similarities,
or clusters within the data, which can then be used for various purposes, such as data
exploration, visualization, dimensionality reduction, and more.

Unsupervised Learning

Let’s understand it with the help of an example.


Example: Consider that you have a dataset that contains information about the purchases you made from a shop. Through clustering, the algorithm can group customers with the same purchasing behavior as you, which reveals potential customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
 Clustering
 Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labeled examples.
Here are some clustering algorithms:
 K-Means Clustering algorithm
 Mean-shift algorithm
 DBSCAN Algorithm
 Principal Component Analysis
 Independent Component Analysis
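As a small, hedged illustration of clustering, the sketch below groups synthetic unlabeled points with K-Means (scikit-learn assumed installed); the number of clusters is an assumption of the example.

# K-Means clustering on synthetic, unlabeled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)      # unlabeled points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.labels_[:10])       # cluster assigned to each of the first ten points
print(kmeans.cluster_centers_)   # coordinates of the discovered cluster centres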
Association
Association rule learning is a technique for discovering relationships between items in a
dataset. It identifies rules that indicate the presence of one item implies the presence of
another item with a specific probability.
Here are some association rule learning algorithms:
 Apriori Algorithm
 Eclat
 FP-growth Algorithm
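A hedged sketch of association rule learning with the Apriori algorithm is shown below; it assumes the third-party mlxtend library is installed (pip install mlxtend), and the tiny one-hot transaction table is invented purely for illustration.

# Association rules with Apriori (mlxtend assumed installed; transactions are illustrative).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

transactions = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 1]],
    columns=["bread", "milk", "butter"],
).astype(bool)

frequent_itemsets = apriori(transactions, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])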
Advantages of Unsupervised Machine Learning
 It helps to discover hidden patterns and various relationships between the data.
 Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
 It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
 Without using labels, it may be difficult to predict the quality of the model’s
output.
 Cluster interpretability may not be clear, and the clusters may not have meaningful interpretations.

 It has techniques such as autoencoders and dimensionality reduction that can
be used to extract meaningful features from raw data.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
 Clustering: Group similar data points into clusters.
 Anomaly detection: Identify outliers or anomalies in data.
 Dimensionality reduction: Reduce the dimensionality of data while preserving
its essential information.
 Recommendation systems: Suggest products, movies, or content to users based
on their historical behavior or preferences.
 Topic modeling: Discover latent topics within a collection of documents.
 Density estimation: Estimate the probability density function of data.
 Image and video compression: Reduce the amount of storage required for
multimedia content.
 Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
 Market basket analysis: Discover associations between products.
 Genomic data analysis: Identify patterns or group genes with similar
expression profiles.
 Image segmentation: Segment images into meaningful regions.
 Community detection in social networks: Identify communities or groups of individuals with similar interests or connections.
 Customer behavior analysis: Uncover patterns and insights for better marketing and product recommendations.
 Content recommendation: Classify and tag content to make it easier to
recommend similar items to users.
 Exploratory data analysis (EDA): Explore data and gain insights before
defining specific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised learning, so it uses both labelled and unlabelled data. It is particularly useful when obtaining labelled data is costly, time-consuming, or resource-intensive. Semi-supervised learning is chosen when labelling the data requires skills and relevant resources in order to train or learn from it.

We use these techniques when we are dealing with data in which a small portion is labelled and the rest, the larger portion, is unlabelled. We can use unsupervised techniques to predict labels and then feed these labels to supervised techniques. This technique is mostly applicable to image datasets, where usually not all images are labelled.

Semi-Supervised Learning

Let’s understand it with the help of an example.


Example: Consider that we are building a language translation model; having labelled translations for every sentence pair can be resource-intensive. Semi-supervised learning allows the model to learn from both labelled and unlabelled sentence pairs, making it more accurate. This technique has led to significant improvements in the quality of machine translation services.
Types of Semi-Supervised Learning Methods
There are a number of different semi-supervised learning methods each with its own
characteristics. Some of the most common ones include:
 Graph-based semi-supervised learning: This approach uses a graph to
represent the relationships between the data points. The graph is then used to
propagate labels from the labeled data points to the unlabeled data points.
 Label propagation: This approach iteratively propagates labels from the labeled
data points to the unlabeled data points, based on the similarities between the
data points.

 Co-training: This approach trains two different machine learning models on
different subsets of the unlabeled data. The two models are then used to label
each other’s predictions.
 Self-training: This approach trains a machine learning model on the labeled
data and then uses the model to predict labels for the unlabeled data. The model
is then retrained on the labeled data and the predicted labels for the unlabeled
data.
 Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to
generate unlabeled data for semi-supervised learning by training two neural
networks, a generator and a discriminator.
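As a hedged illustration of the self-training idea described above, the sketch below uses scikit-learn's SelfTrainingClassifier (scikit-learn assumed installed); hiding about 70% of the Iris labels and the choice of logistic regression are assumptions of the example.

# Self-training: unlabeled samples are marked with -1 in the label vector.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.7] = -1            # hide about 70% of the labels

model = SelfTrainingClassifier(LogisticRegression(max_iter=200))
model.fit(X, y_partial)                           # learns from labeled + unlabeled samples
print("Accuracy on all true labels:", model.score(X, y))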
Advantages of Semi-Supervised Machine Learning
 It leads to better generalization as compared to supervised learning, as it takes
both labeled and unlabeled data.
 Can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
 Semi-supervised methods can be more complex to implement compared to
other approaches.
 It still requires some labeled data that might not always be available or easy to
obtain.
 The unlabeled data can impact the model performance accordingly.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
 Image Classification and Object Recognition: Improve the accuracy of models
by combining a small set of labeled images with a larger set of unlabeled
images.
 Natural Language Processing (NLP): Enhance the performance of language
models and classifiers by combining a small set of labeled text data with a vast
amount of unlabeled text.
 Speech Recognition: Improve the accuracy of speech recognition by leveraging
a limited amount of transcribed speech data and a more extensive set of
unlabeled audio.

 Recommendation Systems: Improve the accuracy of personalized
recommendations by supplementing a sparse set of user-item interactions
(labeled data) with a wealth of unlabeled user behavior data.
 Healthcare and Medical Imaging: Enhance medical image analysis by utilizing
a small set of labeled medical images alongside a larger set of unlabeled images.
4. Reinforcement Machine Learning
A reinforcement machine learning algorithm is a learning method that interacts with its environment by producing actions and discovering errors. Trial and error and delayed reward are the most relevant characteristics of reinforcement learning. In this technique, the model keeps improving its performance using reward feedback to learn the behavior or pattern. These algorithms are specific to a particular problem, e.g. Google's self-driving car, or AlphaGo, where a bot competes with humans and even with itself to become a better and better Go player. Each time we feed in data, the agent learns and adds the data to its knowledge, which becomes its training data. So, the more it learns, the better trained, and hence more experienced, it becomes.
Here are some of most common reinforcement learning algorithms:
 Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function,
which maps states to actions. The Q-function estimates the expected reward of
taking a particular action in a given state (a tiny worked example follows this list).
 SARSA (State-Action-Reward-State-Action): SARSA is another model-free
RL algorithm that learns a Q-function. However, unlike Q-learning, SARSA
updates the Q-function for the action that was actually taken, rather than the
optimal action.
 Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep
learning. Deep Q-learning uses a neural network to represent the Q-function,
which allows it to learn complex relationships between states and actions.
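The sketch below is a tiny, illustrative tabular Q-learning example on an invented five-state corridor in which the agent must walk right to reach the goal; the environment and hyperparameters are assumptions of the example, not part of the module.

# Tabular Q-learning on a 5-state corridor; reaching state 4 gives reward 1.
import numpy as np

n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))     # Q-table: expected reward per state/action pair
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:                          # episode ends at the goal state
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))         # explore a random action
        else:
            action = int(np.argmax(Q[state]))             # exploit current knowledge
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move Q towards reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))   # greedy action per state; states 0-3 should prefer 1 (right)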

Reinforcement Machine Learning

Let’s understand it with the help of examples.


Example: Consider that you are training an AI agent to play a game like chess. The agent explores different moves and receives positive or negative feedback based on the outcome. Reinforcement learning also finds applications in agents that learn to perform tasks by interacting with their surroundings.
Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
Positive reinforcement
 Rewards the agent for taking a desired action.
 Encourages the agent to repeat the behavior.
 Examples: Giving a treat to a dog for sitting, providing a point in a game for a
correct answer.
Negative reinforcement
 Removes an undesirable stimulus to encourage a desired behavior.
 Encourages the agent to repeat the desired behavior in order to avoid the undesirable stimulus.
 Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty
by completing a task.
Advantages of Reinforcement Machine Learning
 It provides autonomous decision-making, which is well-suited for tasks where the agent must learn to make a sequence of decisions, like robotics and game-playing.
 This technique is preferred for achieving long-term results that are otherwise very difficult to achieve.
 It is used to solve complex problems that cannot be solved by conventional techniques.
Disadvantages of Reinforcement Machine Learning
 Training Reinforcement Learning agents can be computationally expensive and
time-consuming.
 Reinforcement learning is not preferable for solving simple problems.
 It needs a lot of data and a lot of computation, which makes it impractical and
costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
 Game Playing: RL can teach agents to play games, even complex ones.
 Robotics: RL can teach robots to perform tasks autonomously.
 Autonomous Vehicles: RL can help self-driving cars navigate and make
decisions.
 Recommendation Systems: RL can enhance recommendation algorithms by
learning user preferences.
 Healthcare: RL can be used to optimize treatment plans and drug discovery.
 Natural Language Processing (NLP): RL can be used in dialogue systems and
chatbots.
 Finance and Trading: RL can be used for algorithmic trading.
 Supply Chain and Inventory Management: RL can be used to optimize supply
chain operations.
 Energy Management: RL can be used to optimize energy consumption.
 Game AI: RL can be used to create more intelligent and adaptive NPCs in video
games.
 Adaptive Personal Assistants: RL can be used to improve personal assistants.
 Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive and interactive experiences.
 Industrial Control: RL can be used to optimize industrial processes.
 Education: RL can be used to create adaptive learning systems.
 Agriculture: RL can be used to optimize agricultural operations.
https://www.geeksforgeeks.org/types-of-machine-learning/

 Machine Learning tools

Top 10 Machine Learning Tools

Machine learning has witnessed exponential growth in tools and frameworks designed to help
data scientists and engineers efficiently build and deploy ML models. Below is a detailed
overview of some of the top machine learning tools, highlighting their key features.

1. Microsoft Azure Machine Learning

Microsoft Azure is a cloud-based environment you can use to train, deploy, automate, manage,
and track ML models. It is designed to help data scientists and ML engineers leverage their
existing data processing and model development skills & frameworks.

Key Features

 Drag-and-drop visual interface (Azure ML Studio).

 Support for popular ML frameworks and languages.

 Scalable cloud resources for training and deployment.

2. IBM Watson

IBM Watson is an enterprise-ready suite of AI services, applications, and tooling. It provides various tools for data analysis, natural language processing, and machine learning model development and deployment.

Key Features

 Pre-built applications for various industries.

 Powerful natural language processing capabilities.

 Robust toolset for building, training, and deploying models.


3. TensorFlow

TensorFlow, an open-source software library, facilitates numerical computation through data flow graphs. Developed by the Google Brain team's researchers and engineers, it is utilized both in research and production activities within Google.

Key Features

 Extensive library for deep learning and machine learning.

 Strong support for research and production projects.

 Runs on CPUs, GPUs, and TPUs.

4. Amazon Machine Learning

Amazon Machine Learning is a cloud service that makes it easy for professionals of all skill
levels to use machine learning technology. It provides visualization tools and wizards to create
machine learning models without learning complex ML algorithms and technology.

Key Features

 Easy to use for creating ML models.

 Automatic data transformation and model evaluation.

 Integration with Amazon S3, Redshift, and RDS for data storage.

5. OpenNN

OpenNN is an open-source neural network library written in C++. It is designed to implement neural networks flexibly and robustly, focusing on advanced analytics.

Key Features

 High performance and parallelization.

 Comprehensive documentation and examples.

 Designed for research and development in deep learning.

6. PyTorch

PyTorch, a machine learning framework that's open-source and built upon the Torch library,
supports a wide range of applications, including computer vision and natural language
processing. It's celebrated for its adaptability and its capacity to dynamically manage
computational graphs.

Key Features

 Dynamic computation graph that allows for flexibility in model architecture.

 Strong support for deep learning and neural networks.

 Large ecosystem of tools and libraries.

7. Vertex AI

Vertex AI is Google Cloud's AI platform. It consolidates its ML offerings into a unified API,
client library, and user interface. This enables ML engineers and data scientists to accelerate
the development and maintenance of artificial intelligence models.

Key Features

 Unified tooling and workflow for model training, hosting, and deployment.

 AutoML features for training high-quality models with minimal effort.


 Integration with Google Cloud services for storage, data analysis, and more.

8. BigML

BigML is a machine learning platform that helps users create, deploy, and maintain machine
learning models. It offers a comprehensive environment for preprocessing, machine learning,
and model evaluation tasks.

Key Features

 Interactive visualizations for data analysis.

 Automated model tuning and selection.

 REST API for integration and model deployment.

9. Apache Mahout

Apache Mahout serves as a scalable linear algebra framework and offers a mathematically
expressive Scala-based domain-specific language (DSL). This design aims to facilitate the
rapid development of custom algorithms by mathematicians, statisticians, and data scientists.
Its primary areas of application include filtering, clustering, and classification, streamlining
these processes for professionals in the field.

Key Features

 Scalable machine learning library.

 Support for multiple distributed backends (including Apache Spark).

 Extensible and customizable for developing new ML algorithms.

10. Weka

Weka is an open-source software suite written in Java, designed for data mining tasks. It
includes a variety of machine learning algorithms geared towards tasks such as data pre-
processing, classification, regression, clustering, discovering association rules, and data
visualization.

Key Features

 User-friendly interface for exploring data and models.

 Wide range of algorithms for data analysis tasks.

 Suitable for developing new machine learning schemes.

11. Scikit-learn

Scikit-learn is a free, open-source library dedicated to machine learning within the Python ecosystem. It is celebrated for its user-friendly nature and straightforwardness, offering an extensive array of supervised and unsupervised learning algorithms. Built on foundational libraries such as NumPy, SciPy, and matplotlib, it is a primary choice for data mining and analysis tasks.

Key Features

 Comprehensive collection of algorithms for classification, regression, clustering, and dimensionality reduction.

 Tools for model selection, evaluation, and preprocessing.

 Extensive documentation and community support.
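A brief, hedged usage sketch is given below (scikit-learn assumed installed): a pipeline scales the features, trains a support vector classifier, and is scored with 5-fold cross-validation; the wine dataset is an illustrative choice.

# Preprocessing + model selection with scikit-learn.
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), SVC())    # scale features, then classify
scores = cross_val_score(pipeline, X, y, cv=5)       # 5-fold cross-validation
print("Mean accuracy:", scores.mean())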

12. Google Cloud AutoML

Google Cloud AutoML offers a collection of machine learning tools designed to help
developers with minimal ML knowledge create tailored, high-quality models for their unique
business requirements. It leverages Google's advanced transfer learning and neural architecture
search technologies.

Key Features

 User-friendly interface for training custom models.

 Supports various ML tasks such as vision, language, and structured data.

 Integration with Google Cloud services for seamless deployment and scalability.

13. Colab

Colab, or Google Colaboratory, is a free cloud service based on Jupyter Notebooks that
supports Python. It is designed to facilitate ML education and research with no setup required.
Colab provides an easy way to write and execute arbitrary Python code through the browser.

Key Features

 Free access to GPUs and TPUs for training.

 Easy sharing of notebooks within the community.

 Integration with Google Drive for easy storage and access to notebooks.

14. KNIME

KNIME is an open-source data analytics, reporting, and integration platform allowing users to
create data flows visually, selectively execute some or all analysis steps, and inspect the results,
models, and interactive views.

Key Features

 A graphical user interface for easy workflow assembly.

 Wide range of nodes for data integration, transformation, analysis, and visualization.

 Extensible through plugins and integration with other languages.

15. Keras

Keras, a Python-based open-source library for neural networks, facilitates swift experimentation in the realm of deep learning. Serving as an interface for TensorFlow, it simplifies the construction and training of models.

Key Features

 User-friendly, modular, and extensible.

 Supports convolutional and recurrent networks, as well as combinations of the two.

 Runs seamlessly on CPU and GPU.
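
As a brief illustration of how Keras simplifies model construction, the following minimal sketch defines and compiles a small feed-forward network; it assumes TensorFlow 2.x is installed, and the layer sizes and input shape are arbitrary examples.

Python
# Define and compile a small neural network with Keras (illustrative example).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),               # 20 input features (assumed)
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()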

16. RapidMiner

RapidMiner serves as a comprehensive data science tool, offering a cohesive platform for tasks
like data prep, machine learning, deep learning, text mining, and predictive analytics. It caters
to users of varying expertise, accommodating both novices and seasoned professionals.

Key Features

 Visual workflow designer for easy creation of analysis processes.

 Extensive collection of algorithms for data analysis.

 Supports deployment of models in enterprise applications.

17. Shogun

Shogun is a freely available machine learning library that encompasses a wide range of efficient
and cohesive techniques. Developed in C++, it features interfaces for several programming
languages, including C++, Python, R, Java, Ruby, Lua, and Octave.

Key Features

 Supports many ML algorithms and frameworks for regression, classification, and clustering.

 Integration with other scientific computing libraries.

 Focus on kernel methods and support vector machines.

18. Project Jupyter

Project Jupyter is a free, open-source initiative designed to enhance interactive data science
and scientific computing across various programming languages. Originating from the IPython
project, it offers a comprehensive framework for interactive computing, including notebooks,
code, and data management.

Key Features

 Supports interactive data visualization and sharing of live code.

 Extensible with a large number of extensions and widgets.

 Cross-language support, including Python, Julia, R, and many more.

19. Amazon SageMaker

Amazon SageMaker empowers all developers and data scientists to create, train, and deploy
ML models with ease. It simplifies and streamlines every stage of the machine learning
workflow. Discover how to efficiently use Amazon SageMaker to develop, train, optimize, and
deploy machine learning models.

Key Features

 Built-in algorithms and support for custom algorithms.

 One-click deployment and automatic model tuning.

 Integration with AWS services for data processing and storage.

20. Apache Spark

Apache Spark serves as an integrated analytics engine designed to process data on a large scale.
It offers advanced APIs for Java, Scala, Python, and R, alongside an efficient engine that backs
versatile computation graphs for data analysis. Engineered for rapid processing, Spark enables
in-memory computation and supports a range of machine learning algorithms through its MLlib
library.

Key Features

 Fast processing of large datasets.

 Spark supports SQL queries and streaming data.

 MLlib for machine learning (common libraries).

 Runs in standalone mode or scales up to thousands of nodes.

 A very active community that contributes to its extensive ecosystem.
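
The short sketch below shows the general shape of an MLlib workflow in PySpark; it assumes PySpark is installed, and the tiny in-memory DataFrame and column names are invented for illustration only.

Python
# Fit a logistic regression model with Spark MLlib (illustrative example).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1), (0.1, 0.9, 0)],
    ["feature1", "feature2", "label"],
)
features = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features.transform(df))
print(model.coefficients)
spark.stop()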

Importance of Machine Learning Tools in the Modern Era

Machine learning tools and techniques are indispensable in the modern era for several
compelling reasons:

1. Data Analysis and Interpretation: With the explosion of data in recent years, ML
tools are critical for analyzing and interpreting vast amounts of data quickly and
efficiently, uncovering patterns and insights that would be impossible for humans to
find.

2. Automation: ML enables the automation of decision-making processes and can perform
tasks without human intervention, increasing efficiency and productivity in various
industries.

3. Personalization: ML tools are at the heart of personalization technologies used in e-
commerce, content platforms, and marketing. They provide tailored experiences to
users based on their behaviors and preferences.

4. Innovation and Competitive Advantage: Businesses that leverage ML tools can innovate
faster, creating new products and services that more effectively meet customer needs.

5. Solving Complex Problems: ML tools have the potential to solve complex problems
in diverse domains, including healthcare, finance, environmental protection, and
more, by finding solutions that are not apparent through traditional methods.

Preparing Machine Learning environment

 Installation of Python
Installing Python on the Windows operating system is relatively easy and involves a few
uncomplicated steps. This section takes you through downloading and installing Python on
your Windows computer.
How to Install Python in Windows?
We have provided step-by-step instructions to guide you and ensure a successful installation.
Whether you are new to programming or have some experience, mastering how to install
Python on Windows will enable you to utilize this powerful language and uncover its full range
of potential applications.
To download Python on your system, you can use the following steps
Step 1: Select Version to Install Python
Visit the official page for Python https://www.python.org/downloads/ on the Windows
operating system. Locate a reliable version of Python 3, preferably version 3.10.11, which
was used in testing this tutorial. Choose the correct link for your device from the options
provided: either Windows installer (64-bit) or Windows installer (32-bit) and proceed to
download the executable file.

[Image: Python Homepage]

Step 2: Downloading the Python Installer


Once you have downloaded the installer, open the .exe file, such as python-3.10.11-
amd64.exe, by double-clicking it to launch the Python installer. Choose the option to
Install the launcher for all users by checking the corresponding checkbox, so that all users
of the computer can access the Python launcher application. Enable users to run Python
from the command line by checking the Add python.exe to PATH checkbox.

[Image: Python Installer]

After clicking the Install Now button, the setup will start installing Python on your
Windows system. You will see a window like this.

[Image: Python Setup]

Step 3: Running the Executable Installer
After completing the setup, Python will be installed on your Windows system. You will see
a success message.

[Image: Python successfully installed]

Step 4: Verify the Python Installation in Windows


Close the window after successful installation of Python. You can check if the installation
of Python was successful by using either the command line or the Integrated Development
Environment (IDLE), which you may have installed. To access the command line, click
on the Start menu and type “cmd” in the search bar. Then click on Command Prompt.
python --version

[Image: Python version output]

You can also check the version of Python by opening the IDLE application. Go to Start, enter
IDLE in the search bar, and then click the IDLE app, for example, IDLE (Python 3.10.11
64-bit). If you can see the Python IDLE window, then you have successfully downloaded and
installed Python on Windows.

[Image: Python IDLE]

Getting Started with Python


Python is comparatively easy to code and learn. Python programs can be written in any plain
text editor such as Notepad or Notepad++. One can also use an online IDE to run Python code,
or install an IDE on their system to make writing code more convenient, because IDEs provide
many features like an intuitive code editor, a debugger, and build tools. To begin writing Python
code and performing various intriguing and useful operations, one must have Python installed
on their system.

 Installation of Tools
In a terminal, run:

$ python3 -m pip install python-dev-tools --user --upgrade

Installation with Visual Studio Code


 Follow the installation procedure for python-dev-tools
 Be sure to have the official Python extension installed in VS Code
 Open VS Code from within your activated virtual environment (in fact, make
sure that flake8 from python-dev-tools is in your PYTHON_PATH)
 In VS Code, open settings (F1 key, then type “Open Settings (JSON)”, then
enter)
 Add in the opened JSON file (before the closing }):

"python.linting.enabled": true,

"python.linting.flake8Enabled": true,

"python.linting.flake8Path": "flake8",

"python.formatting.provider": "black",

"python.formatting.blackPath": "whataformatter",

"python.formatting.blackArgs": [],

Environment Testing
Think of how you might test the lights on a car. You would turn on the lights (known as the test
step) and go outside the car or ask a friend to check that the lights are on (known as the test
assertion). Testing multiple components is known as integration testing.
You have just seen two types of tests:

1. An integration test checks that components in your application operate with each
other.
2. A unit test checks a small component in your application.

You can write both integration tests and unit tests in Python. To write a unit test for the built-
in function sum(), you would check the output of sum() against a known output.

For example, here’s how you check that the sum() of the numbers (1, 2, 3) equals 6:

Python
>>> assert sum([1, 2, 3]) == 6, "Should be 6"
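
The same check can be written with Python's built-in unittest module, which reports failures instead of raising a bare AssertionError; this small extension is illustrative and not part of the original example.

Python
# A unit test for the built-in sum() function using unittest.
import unittest

class TestSum(unittest.TestCase):
    def test_sum_of_list(self):
        self.assertEqual(sum([1, 2, 3]), 6, "Should be 6")

if __name__ == "__main__":
    unittest.main()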

Data Collection and Acquisition

 Description of key terms


Data: data is a distinct piece of information that is gathered and translated
for some purpose. Data is information that has been translated into a
form that is efficient for movement or processing.
Information: Information is a result of processing or transforming data
into a useful form. We understand information because it's more
organized and has context. Information can be in the form of graphs,
tables, or videos.
Dataset: A dataset is an organized collection of data. The most basic
representation of a dataset is data elements presented in tabular form.
Each column represents a particular variable. Each row corresponds to a
given value of that column's variable.
Data warehouse: A data warehouse (DW) is a digital storage system that
connects and harmonizes large amounts of data from many different
sources.
Big data: Big Data is a collection of data that is huge in volume and
growing exponentially with time. It is data of such large size and
complexity that no traditional data management tool can store or
process it efficiently. In short, big data is data of very large size.
Types Of Big Data

Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.

However, nowadays, issues arise when the size of such data grows to a huge extent; typical
sizes are now in the range of multiple zettabytes.

Do you know? 10²¹ bytes equal 1 zettabyte, so one billion terabytes form a zettabyte.

Looking at these figures one can easily understand why the name Big Data is given and
imagine the challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of
'structured' data.

Examples Of Structured Data

An ‘Employee’ table in a database is an example of Structured Data

Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000

Unstructured

Any data with unknown form or structure is classified as unstructured data. In addition to its
huge size, unstructured data poses multiple challenges in terms of processing it to derive value
from it. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc.

Examples Of Unstructured Data

The output returned by ‘Google Search’

[Image: Example of unstructured data]

Semi-structured

Semi-structured data can contain both forms of data. It appears structured in form, but it is
actually not defined with, for example, a table definition as in a relational DBMS. An example
of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Characteristics Of Big Data

Big data can be described by the following characteristics:

 Volume: The name Big Data itself is related to a size which is enormous. ‘Volume’ is
one characteristic which needs to be considered while dealing with Big Data
solutions.
 Variety: Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. is also being considered in analysis
applications. This variety of unstructured data poses certain issues for storing, mining
and analyzing data.

 Velocity: The term ‘velocity’ refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in the
data. Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites,
sensors, Mobile devices, etc. The flow of data is massive and continuous.

 Variability: This refers to the inconsistency that the data can show at times, which
hampers the process of handling and managing the data effectively.

 Identification of Source of data


IoT Sensors: IoT sensors are one of the key components in IoT devices
that collect data from surroundings and transmit them over networks.
Camera: A camera is an instrument used to capture and store images
and videos, either digitally via an electronic image sensor, or chemically
via a light-sensitive material such as photographic film.
Computer: The data might be located on the same computer as the
program, or on another computer somewhere on a network.
Smartphone: A smartphone generates data through its sensors (GPS,
accelerometer, camera, microphone) as well as its user's apps, calls,
and messages.
Social data: information that social media users publicly share and
includes metadata such as the user's location, language spoken,
biographical data, and shared links.
Transactional data: Transactional data relates to the transactions of the
organization and includes data that is captured, for example, when a
product is sold or purchased.
Gathering Machine Learning dataset

Before we dive into such topics as ML and Data Science, and try to explain how it works, we
should answer several questions:

 What can we achieve in business or on the project with the help of ML? What goals
do I want to accomplish using ML? 

 Do I only want to hop on the trend, or will the use of ML really improve user
experience, increase profitability, or protect my product and its users? 

 Do I need the system to predict anything, or does it need to be able to detect anomalies?

Understanding the Data Collection Process

1. Defining the Problem Statement

Clearly outline the objectives of the data collection process and the specific research
questions you want to answer. This step will guide the entire process and ensure you collect
the right data to meet your goals.

Also, it is recommended to identify data sources. Determine the sources from which you
will collect data. These sources may include primary data (collected directly for your study)
or secondary data (previously collected by others). Common data sources include surveys,
interviews, existing databases, observation, experiments, and online platforms.

2. Planning Data Collection

In this stage, it is better to start with the selection of data collection methods. Choose the
appropriate methods to collect data from the identified sources. The methods may vary
depending on the nature of the data and research objectives.

Common methods include:

 Surveys: Structured questionnaires administered to a target group to gather specific information.

 Interviews: Conducting one-on-one or group conversations to gain in-depth insights.

 Observation: Systematically observing and recording behaviors or events.

 Experiments: Controlling variables to study cause-and-effect relationships.

 Web scraping: Extracting data from websites and online sources.

 Sensor data collection: Gathering data from sensors or IoT devices.


3. Ensuring Data Quality

The next step is very crucial. Ensuring data quality means reviewing the collected data to
check for errors, inconsistencies, or missing values. Apply quality assurance techniques to
ensure the data is reliable and suitable for analysis.

The following step would be data storage and management. It will require organizing and
storing the collected data in a secure and accessible manner. Consider using databases or
other data management systems for efficient storage and retrieval.

How to Start Collecting Data for ML: Data Collection Strategy

1. Synthetic Data Generation

Synthetic data is any information manufactured artificially which does not represent events or
objects in the real world. Algorithms create synthetic data used in model datasets for testing
or training purposes. This data can mimic operational or production data and help train ML
models or test out mathematical models.

2. Active Learning

Active learning is a machine learning technique that focuses on selecting the most
informative data points to label or annotate from an unlabeled dataset. The aim of active
learning is to reduce the amount of labeled data required to build an accurate model by
strategically choosing which instances to query for labels. This is especially useful when
labeling data can be time-consuming or expensive.

Active Learning in Data Collection: Steps

 Initial Data Collection: Initially, a small labeled dataset is collected through random sampling or any other standard method.

 Model Training: The initial labeled data is used to train a machine learning model.

 Uncertainty Estimation: The model is then used to predict the labels of the remaining unlabeled data points. During this process, the model’s uncertainty about its predictions is often estimated. There are various ways to measure uncertainty, such as entropy, margin sampling, and least confidence.

 Query Strategy Selection: A query strategy is chosen to decide which data points to request labels for. The query strategy selects instances with high uncertainty, as these instances are likely to have the most impact on improving the model’s performance. There are a few methods to apply query strategy selection: uncertainty sampling, diversity sampling, and representative sampling.

 Labeling New Data Points: The selected instances are then sent for labeling or annotation by domain experts.

 Model Update: The newly labeled data is added to the labeled dataset, and the model is retrained using the expanded labeled set.
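
A minimal sketch of the loop described above, using least-confidence uncertainty sampling, is shown below; the synthetic dataset, logistic regression model, and pool sizes are assumptions made purely for illustration.

Python
# One round of uncertainty sampling for active learning (illustrative example).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = np.arange(20)                 # small initial labeled pool
unlabeled = np.arange(20, len(X))       # remaining points, treated as unlabeled

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
proba = model.predict_proba(X[unlabeled])
uncertainty = 1 - proba.max(axis=1)     # least-confidence score
query = unlabeled[np.argsort(uncertainty)[-10:]]   # 10 most uncertain points
print("Indices to send for labeling:", query)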

3. Transfer Learning

It is a popular approach for training models when there is not enough training data or time to
train from scratch. A common technique is to start from an existing, well-trained model (also
called the source task) and incrementally train a new model (the target task) so that it performs
well with far less data and training time.
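
A hedged Keras sketch of this idea follows: a pretrained image model is used as the frozen source task and only a small new classification head is trained for the target task. The choice of MobileNetV2, the input size, and the five output classes are assumptions for illustration, and downloading the pretrained weights requires internet access.

Python
# Reuse a pretrained base model and train only a new head (illustrative example).
from tensorflow import keras

base = keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                      input_shape=(160, 160, 3), pooling="avg")
base.trainable = False                               # freeze the source-task weights

model = keras.Sequential([
    base,
    keras.layers.Dense(5, activation="softmax"),     # new head for the target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])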

4. Open Source Datasets

Open source datasets are a valuable resource in data collection for various research and
analysis purposes. These datasets are typically publicly available and can be freely accessed,
used, and redistributed. Leveraging open source datasets can save time, resources, and effort,
as they often come pre-cleaned and curated, ready for analysis.

Common Methods For Utilizing Open Source Datasets in Data Collection:

 Identifying Suitable Datasets: Start by identifying the relevant open source datasets for your research or analysis needs. There are various platforms and repositories where you can find open source datasets, such as Kaggle, data.gov, UCI Machine Learning Repository, GitHub, Google Dataset Search, and many others.

 Data Exploration: Before using a dataset, it’s essential to explore its contents to understand its structure, the variables available, and the quality of the data. This preliminary analysis will help you determine if the dataset meets your research requirements.

 Data Licensing: Pay close attention to the licensing terms associated with the open source dataset. Some datasets might have specific conditions for usage and redistribution, while others may be entirely open for any purpose. Make sure to adhere to the terms of use.

 Data Preprocessing: Although open source datasets are usually pre-cleaned, they may still require some preprocessing to fit your specific needs. This step could involve handling missing data, normalizing values, encoding categorical variables, and other data transformations.

 Ethical Considerations: Ensure that the data you are using does not contain sensitive or private information that could potentially harm individuals or organizations. Respect data privacy and consider anonymizing or de-identifying data if necessary.

 Data Integration: In some cases, your research might require data from multiple sources. Open source datasets can be combined with proprietary data or other open source datasets to enhance the scope and depth of your analysis.

 Validation and Quality Control: Just like with any data, it’s crucial to validate the open source dataset for accuracy and quality. Cross-referencing the data with other sources or performing sanity checks can help ensure the dataset’s reliability.

 Citations and Attribution: When using open source datasets in your research or analysis, it’s essential to give proper credit to the original creators or contributors. Follow the provided citation guidelines and acknowledge the source of the data appropriately.

 Reproducibility: If your research involves publishing results or sharing analyses, make sure to share the exact details of the open source datasets you used. This ensures that others can replicate your work and verify your findings.

5. Manual Data Generation

Manual data generation refers to the process of collecting data by hand, without the use of
automated tools or systems. Manual data generation can be time-consuming and resource-
intensive, but it can yield valuable and reliable data when performed carefully.

Building synthetic datasets is one of the most common methods in data collection when real
data is limited or unavailable, or when privacy concerns prevent the use of actual data.
Synthetic datasets are artificially generated datasets that mimic the statistical properties and
patterns of real data without containing any sensitive or identifiable information.

Here’s a step-by-step guide on how to build synthetic datasets:

 Define the Problem and Objectives: Clearly identify the purpose of the synthetic
dataset. Determine what specific features, relationships, and patterns you want the
synthetic data to capture. Understand the target domain and data characteristics to
ensure the synthetic dataset is meaningful and useful.

 Understand the Real Data: If possible, analyze and understand the real data you
want to emulate. Identify the key statistical properties, distributions, and relationships
within the data. This will help inform the design of the synthetic dataset.

 Choose a Data Generation Method: Several methods can be used to create synthetic
datasets (Statistical Methods, Generative Models, Data Augmentation, Simulations ).

 Choose the Right Features: Identify the essential features from the real data that
need to be included in the synthetic dataset. Avoid including personally identifiable
information (PII) or any sensitive data that might compromise privacy.

 Generate the Synthetic Data: Implement the chosen data generation method to
create the synthetic dataset. Ensure that the dataset follows the same format and data
types as the real data to be used seamlessly in analyses and modeling.

 Validate and Evaluate: Assess the quality and accuracy of the synthetic dataset by
comparing it to the real data. Use metrics and visualizations to validate that the
synthetic data adequately captures the patterns and distributions present in the real
data.

 Modify and Iterate: If the initial synthetic dataset does not meet your expectations,
refine the data generation method or adjust parameters until it better aligns with the
desired objectives.

 Use Case Considerations: Understand the limitations of synthetic datasets. They
might not fully capture rare events or extreme cases present in real data.
Consequently, synthetic datasets are best suited for certain use cases, such as initial
model development, testing, and sharing with third parties.

 Ensure Privacy and Ethics: Always prioritize privacy and ethical considerations
when generating synthetic datasets. Ensure that no individual or sensitive information
can be inferred from the synthetic data.

By following these steps, you can create synthetic datasets that can serve as valuable
substitutes for real data in various scenarios, contributing to better model development and
analysis in data-scarce or privacy-sensitive environments.
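
As a small illustration of the "Generate the Synthetic Data" step, the sketch below uses scikit-learn's make_classification, one of the statistical generation methods mentioned above; all parameter values are illustrative assumptions.

Python
# Generate a simple synthetic tabular dataset (illustrative example).
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_classes=2, random_state=42)
synthetic = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
synthetic["label"] = y
print(synthetic.describe())   # compare these statistics against the real data to validate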

7. Federated Learning

Federated learning is a privacy-preserving machine learning approach that enables multiple
parties to collaboratively build a global machine learning model without sharing their raw
data with a central server. This method is particularly useful in scenarios where data privacy
and security are major concerns, such as in healthcare, financial services, and other sensitive
industries.

 Machine Learning tools

TensorFlow

TensorFlow, an open-source machine learning tool, is renowned for its flexibility, ideal
for crafting diverse models, and simple to use. With abundant resources and user-friendly
interfaces, it simplifies data comprehension.

PyTorch

PyTorch is a user-friendly machine learning tool, facilitating seamless model
construction. Loved by researchers for its simplicity, it fosters easy idea testing and error
identification. Its intuitive design makes it a preferred choice, offering a smooth and
precise experience in model development.

Scikit-learn

Scikit-learn is a valuable tool for everyday machine-learning tasks, offering a plethora of
tools for tasks like pattern recognition and prediction. Its user-friendly interface and
extensive functionality make it accessible for various applications, whether you’re
identifying patterns in data or making accurate predictions.

Keras

Keras helps easily create models, great for quick experiments, especially with images or
words. It’s user-friendly, making it simple to try out ideas, whether you’re working on
recognizing images or understanding language.

XGBoost

XGBoost excels in analyzing tabular data, showcasing exceptional prowess in pattern
identification and prediction, making it a top choice for competitive scenarios. This
machine learning tool is particularly adept at discerning trends and delivering accurate
predictions, making it a standout performer, especially in competitive environments.

Apache Spark MLlib

Apache Spark MLlib is a powerful tool designed for handling massive datasets, making it
ideal for large-scale projects with extensive data. It simplifies complex data analysis tasks
by providing a robust machine-learning framework. When you’re dealing with
substantial amounts of information, Spark MLlib offers scalability and efficiency, making
it a valuable resource for projects requiring the processing of extensive data sets.

Microsoft Azure Machine Learning

Microsoft Azure Machine Learning makes it easy to do machine learning in the cloud.
It’s simple, user-friendly, and works well for many different projects, making machine
learning accessible and efficient in the cloud.

Google Cloud AI Platform

Google Cloud AI Platform is a strong tool for using machine learning on Google Cloud.
Great for big projects, it easily works with other Google tools. It provides detailed stats
and simple functions, making it a powerful option for large machine-learning tasks.

H2O.ai

H2O.ai is a tool that makes machine learning easy to use. It is suitable for many kinds of
tasks and has an active, helpful community. With its straightforward interface and strong
community support, H2O.ai lets you apply machine learning effectively.

RapidMiner

RapidMiner is an all-round tool for the entire machine learning workflow, ideal for
exploring concepts and collaborating on large projects. It makes it easy to try out ideas
and enables seamless teamwork, making it a versatile tool for diverse stages of machine
learning development.

 Preparing Machine Learning environment

 Installation of Python

To download Python on your system, you can use the following steps
Step 1: Select Version to Install Python
Visit the official page for Python https://www.python.org/downloads/ on the
Windows operating system. Locate a reliable version of Python 3, preferably
version 3.10.11, which was used in testing this tutorial. Choose the correct
link for your device from the options provided: either Windows installer (64-
bit) or Windows installer (32-bit) and proceed to download the executable
file.

[Image: Python Homepage]

Step 2: Downloading the Python Installer


Once you have downloaded the installer, open the .exe file, such as python-
3.10.11-amd64.exe, by double-clicking it to launch the Python installer.
Choose the option to Install the launcher for all users by checking the
corresponding checkbox, so that all users of the computer can access the
Python launcher application. Enable users to run Python from the command
line by checking the Add python.exe to PATH checkbox.

[Image: Python Installer]

After Clicking the Install Now Button the setup will start installing Python
on your Windows system. You will see a window like this.

[Image: Python Setup]

Step 3: Running the Executable Installer
After completing the setup, Python will be installed on your Windows system.
You will see a success message.

[Image: Python successfully installed]

Step 4: Verify the Python Installation in Windows


Close the window after successful installation of Python. You can check if the
installation of Python was successful by using either the command line or the
Integrated Development Environment (IDLE), which you may have
installed. To access the command line, click on the Start menu and type “cmd”
in the search bar. Then click on Command Prompt.
python --version


[Image: Python version output]

You can also check the version of Python by opening the IDLE application. Go
to Start and enter IDLE in the search bar and then click the IDLE app, for

example, IDLE (Python 3.10.11 64-bit). If you can see the Python IDLE
window, then you have successfully downloaded and installed Python on
Windows.

[Image: Python IDLE]

Getting Started with Python


Python is a lot easier to code and learn. Python programs can be written on any
plain text editor like Notepad, notepad++, or anything of that sort. One can also
use an Online IDE to run Python code or can even install one on their system to
make it more feasible to write these codes because IDEs provide a lot of features
like an intuitive code editor, debugger, compiler, etc. To begin with, writing Python
Codes and performing various intriguing and useful operations, one must have
Python installed on their System.

Environment Variables for Python on Windows:

Environment variables are used to configure the environment in which processes
run. For Python, you often need to set the PATH environment variable so that you
can run Python and pip from the command line. This ensures that Python
executables and scripts can be accessed from any command line prompt without
specifying their full path.

How to Configure Python Path on Windows?

Steps to Configure Python Path on Windows:

1. Add Python to PATH During Installation:

 When running the Python installer, ensure you check the box that says
“Add Python to PATH.”

2. Manually Add Python to PATH:

 Open the Start menu, search for “Environment Variables,” and select
“Edit the system environment variables.”

 In the System Properties window, click on the “Environment Variables” button.

 Under “System variables,” find the Path variable and click “Edit.”

 Click “New” and add the path to the Python installation directory
(e.g., C:\Python39) and the Scripts directory
(e.g., C:\Python39\Scripts).

 Click “OK” to close all windows.

3. Verify PATH Configuration:

 Open Command Prompt.

 Type python and press Enter to start the Python interpreter. If Python
starts, the PATH is configured correctly.

 Type exit() to exit the Python interpreter.

 Type pip and press Enter to verify that pip can be called from the
command line.

 Installation of Tools
Anaconda distribution provides an installation of Python with various IDEs such
as Jupyter Notebook, Spyder, the Anaconda prompt, etc. Hence it is a very
convenient packaged solution which you can easily download and install on
your computer. It will automatically install Python and some basic IDEs and
libraries with it.

Below, some steps are given to show the downloading and installing process of
Anaconda and an IDE:

Step-1: Download Anaconda Python:

o To download Anaconda in your system, firstly, open your favorite browser
and type Download Anaconda Python, and then click on the first link as
given in the below image. Alternatively, you can directly download it by
clicking on this link, https://www.anaconda.com/distribution/#download-
section.

o After clicking on the first link, you will reach to download page of
Anaconda, as shown in the below image:

o Since Anaconda is available for Windows, Linux, and Mac OS, you can
download it as per your OS type by clicking on the available options
shown in the below image. It provides Python 2.7 and Python 3.7
versions; the latest of the two is 3.7, hence we will download the Python
3.7 version. After clicking on the download option, it will start downloading
on your computer.

Note: In this topic, we are downloading Anaconda for Windows you can
choose it as per your OS.

Step- 2: Install Anaconda Python (Python 3.7 version):

Once the downloading process gets completed, go to downloads → double
click on the ".exe" file (Anaconda3-2019.03-Windows-x86_64.exe) of
Anaconda. It will open a setup window for the Anaconda installation as given in
the below image, then click on Next.

o It will open a License agreement window click on "I Agree" option and
move further.

o In the next window, you will get two options for installations as given in
the below image. Select the first option (Just me) and click on Next.

o Now you will get a window for installing location, here, you can leave it as
default or change it by browsing a location, and then click on Next.
Consider the below image:

o Now select the second option, and click on install.

o Once the installation gets complete, click on Next.

o Now installation is completed, tick the checkbox if you want to learn more
about Anaconda and Anaconda cloud. Click on Finish to end the process.

Note: Here, we will use the Spyder IDE to run Python programs.

Step- 3: Open Anaconda Navigator

o After successful installation of Anaconda, use Anaconda Navigator to
launch a Python IDE such as Spyder or Jupyter Notebook.
o To open Anaconda Navigator, press the Windows key, search for
Anaconda Navigator, and click on it. Consider the below image:

o After opening the Navigator, launch the Spyder IDE by clicking on
the Launch button given below Spyder. It will open the Spyder IDE
on your system.

Run your Python program in Spyder IDE.

o Open Spyder IDE, it will look like the below image:

o Write your first program, and save it using the .py extension.
o Run the program using the triangle Run button.
o You can check the program's output on the console pane at the bottom right
side.

Step- 4: Close the Spyder IDE.

 Data Collection and Acquisition

 Description of key terms

Machine learning data refers to the collection of information (or dataset) that is
used by machine learning models to learn, train, and make predictions. It can be
structured or unstructured and usually consists of features (input variables) and
labels (output variables or target values). The quality and nature of this data
directly impact the performance and accuracy of machine learning models.

Machine Learning information refers to the data, knowledge, and insights
utilized or generated in the process of training machine learning models. It
includes everything from raw datasets to model predictions, as well as the
intermediate knowledge gained during data analysis, feature extraction, model
evaluation, and decision-making.

A machine learning dataset is a structured collection of data used to train,
validate, and test machine learning models. It consists of multiple examples or
data points, where each data point typically contains features (input variables)
and may include a corresponding label (output or target variable) in the case of
supervised learning. The dataset is essential for enabling machine learning
algorithms to learn patterns, make predictions, and generalize to new data.

A machine learning data warehouse is a centralized repository designed to
store large volumes of structured, semi-structured, and unstructured data that can
be used for machine learning (ML) and analytics tasks. It provides the
infrastructure to collect, manage, and retrieve data efficiently for training and
deploying machine learning models. Data warehouses support complex queries
and enable users to perform large-scale data processing, making them essential
for preparing high-quality datasets for machine learning applications.

Big data for machine learning refers to extremely large, complex, and diverse
datasets that are generated at high velocity and volume. These datasets require
advanced processing techniques and technologies to extract useful insights and
are used in machine learning (ML) to improve model performance, accuracy, and
scalability. Machine learning on big data enables models to learn from vast and
varied data sources, resulting in more accurate predictions and better decision-
making capabilities.

 Identification of Source of data

IoT Sensors: IoT sensors enable data collection and transmission from physical
objects to digital systems.

Camera: Refers to using imaging devices to capture visual information (images
or video) that can be processed and analyzed in various machine learning or
artificial intelligence applications. Cameras provide real-time or recorded data in
the form of pixels arranged in grids, which can be used for tasks like image
recognition, object detection, video analytics, and more.

Computer

Smartphone

Social data

Transactional data

6V’s of Big Data

In recent years, Big Data was defined by the “3Vs” but now there is “6Vs” of Big
Data which are also termed as the characteristics of Big Data as follows:

1. Volume:

The name ‘Big Data’ itself is related to a size which is enormous.


Volume is a huge amount of data.

To determine the value of data, size of data plays a very crucial role. If the
volume of data is very large, then it is actually considered as a ‘Big Data’. This
means whether a particular data can actually be considered as a Big Data or not,
is dependent upon the volume of data.

Hence, while dealing with Big Data it is necessary to consider the characteristic
‘Volume’.

Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes
(6.2 billion GB) per month. It was also estimated that by the year 2020 there
would be almost 40,000 exabytes of data.

2. Velocity:

Velocity refers to the high speed of accumulation of data.

In Big Data velocity data flows in from sources like machines, networks, social
media, mobile phones etc.

There is a massive and continuous flow of data. This determines the potential of
data that how fast the data is generated and processed to meet the demands.
Sampling data can help in dealing with the issue like ‘velocity’.

Example: There are more than 3.5 billion searches per day are made on Google.
Also, Facebook users are increasing by 22%(Approx.) year by year.

3. Variety:

It refers to nature of data that is structured, semi-structured and unstructured data.

It also refers to heterogeneous sources.


Variety is basically the arrival of data from new sources that are both inside and
outside of an enterprise. It can be structured, semi-structured and unstructured.

Structured data: This is basically organized data. It generally refers to data
whose length and format are defined.

Semi-structured data: This is basically semi-organized data. It is generally a
form of data that does not conform to the formal structure of data. Log files are
examples of this type of data.

Unstructured data: This basically refers to unorganized data. It generally refers
to data that doesn’t fit neatly into the traditional row and column structure of a
relational database. Texts, pictures, videos, etc. are examples of unstructured
data, which can’t be stored in the form of rows and columns.

4. Veracity:

It refers to inconsistencies and uncertainty in data: the data that is available can
sometimes get messy, and its quality and accuracy are difficult to control. Big
Data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources.

Example: Data in bulk can create confusion, whereas a smaller amount of data
may convey only half or incomplete information.

5. Value:

After taking the other V’s into account, there comes one more V, which stands for
Value! Bulk data that has no value is of no good to the company unless it is
turned into something useful.

Data in itself is of no use or importance; it needs to be converted into
something valuable in order to extract information. Hence, you can state that
Value is the most important of the 6 V’s.
6. Variability:

How fast is the structure of your data changing? How often does the meaning or
shape of your data change?

Example: it is as if you ate the same ice cream daily, but the taste kept changing.

1. Structured Data: Structured data is highly organized and easily searchable. It
is stored in tabular form, where each column has a specific meaning (features),
and each row is an observation or data point.

Examples:

Spreadsheets (Excel, CSV files) with clearly defined columns like "Age,"
"Salary," "Gender."

Databases with fields like product ID, price, and customer name.

Characteristics:

Data is usually numerical or categorical.

Often stored in relational databases or data warehouses.

Applications:

Supervised learning algorithms like linear regression, decision trees, or logistic
regression often use structured data.

2. Unstructured Data: Unstructured data lacks a predefined structure or schema
and does not fit neatly into a table or database. It’s usually raw and needs
preprocessing before use in machine learning.

Examples:

Text (e.g., social media posts, reviews, emails).


Images, audio, and video files.

PDFs or documents without strict formatting.

Characteristics:

Requires more preprocessing (e.g., NLP techniques for text data, image
processing for images).

Often large in volume and complexity.

Applications:

Natural Language Processing (NLP), computer vision, speech recognition, and
other AI applications that deal with text, images, or audio.

3. Semi-Structured Data: Semi-structured data is a hybrid of structured and
unstructured data. It doesn't conform to a strict tabular format but still has some
organizational elements (e.g., tags or markers) that make it easier to analyze than
completely unstructured data.

Examples:

JSON, XML, and HTML files.

Emails (structured in terms of headers but unstructured in the body).

Characteristics:

Contains both data and metadata (data about the data).

More flexible than structured data, with some level of organization (hierarchical
or nested).

Applications:

Commonly used in web data scraping and APIs, which return data in formats like
JSON or XML.
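
As a short illustration, the sketch below flattens a few semi-structured records, of the kind an API might return as JSON, into a structured pandas table; the records reuse the sample names shown earlier and are purely illustrative.

Python
# Convert semi-structured JSON-like records into a structured table (illustrative example).
import pandas as pd

records = [
    {"name": "Prashant Rao", "sex": "Male", "age": 35},
    {"name": "Seema R.", "sex": "Female", "age": 41},
]
df = pd.json_normalize(records)   # produces columns: name, sex, age
print(df)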

Gathering Machine Learning dataset

Where can you “borrow” a dataset? Here are a couple of data sources you
could try:

 Dataset Search by Google – allows searching not only by the keywords,
but also filtering the results based on the types of the dataset that you want
(e.g., tables, images, text), or based on whether the dataset is available for
free from the provider.
 Visual Data Discovery – specializes in Computer Vision datasets, all
datasets are explicitly categorized and are easily filtered.
 OpenML – as stated in the documentation section it’s ‘an open,
collaborative, frictionless, automated machine learning
environment’. This is a whole resource that allows not only sharing data,
but also working on it collaboratively and solving problems in cooperation
with other data scientists.
 UCI: Machine Learning Repository – a collection of datasets and data
generators, that is listed in the top 100 most quoted resources in Computer
Science.
 Awesome Public Datasets on Github- it would be weird if Github didn’t
have its own list of datasets, divided into categories.
 Kaggle – one of the best, if not the best, resource for trying ML for
yourself. Here you can also find data sets divided into categories with
usability scores (an indicator that the dataset is well-documented).
 Amazon Datasets – lots of datasets stored in S3, available for quick
deployment if you’re using AWS.

 and many other excellent resources where you can find data sets from
versatile areas: starting from the apartment prices in Manhattan for the last
10 years and ending with the description of space objects.

Considerations when Gathering a Dataset:

1. Relevance: Ensure the dataset is appropriate for your machine learning
task (classification, regression, clustering, etc.).
2. Size: Larger datasets generally yield better models, but ensure you have
the computational resources to handle them.
3. Quality: Remove noisy, irrelevant, or incomplete data before training the
model.
4. Balance: For classification tasks, check for class imbalance (e.g., one class
significantly outweighs others) and use techniques like SMOTE to balance
the data if needed (see the sketch after this list).
5. Legal and Ethical Compliance: Always ensure that the data complies
with privacy laws (e.g., GDPR, HIPAA) and ethical standards.
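
For the class-imbalance point above, a hedged sketch using SMOTE is shown below; it assumes the third-party imbalanced-learn package (installable with pip install imbalanced-learn), and the imbalanced dataset is generated artificially for illustration.

Python
# Rebalance an imbalanced dataset with SMOTE (illustrative example).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before:", Counter(y))
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After:", Counter(y_res))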

Interpret Data Visualization

Description of data Visualization tools

Data visualization is the process of using visual elements like charts, graphs, or maps to
represent data. It translates complex, high-volume, or numerical data into a visual
representation that is easier to process.

Data visualization tools improve and automate the visual communication process for
accuracy and detail. You can use the visual representations to extract actionable insights from
raw data.

Matplotlib:

Description: Matplotlib is a popular Python library used for creating static, interactive, and
animated visualizations.

Key Features:

 Supports 2D plotting, including line charts, scatter plots, histograms, bar charts, and
more.

 Highly customizable with fine control over every aspect of the figure (axes, labels,
colors).

 Integrates well with other Python libraries like Pandas and NumPy.

 Ideal for academic and scientific research.

Use Cases: Scientific visualizations, exploratory data analysis, and research reporting.
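
A minimal Matplotlib sketch is shown below; the monthly sales figures are invented purely to illustrate a basic line chart.

Python
# Draw a simple line chart with Matplotlib (illustrative example).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 150, 160, 180]

plt.plot(months, sales, marker="o")
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()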

Seaborn

Description: Seaborn is a Python library built on top of Matplotlib, providing a high-level
interface for creating attractive and informative statistical visualizations.

Key Features:

 Simplifies the creation of complex visualizations such as heatmaps, pair plots, and
violin plots.

 Provides built-in themes and color palettes to improve aesthetics. 

 Integrates well with Pandas and Matplotlib.

Use Cases: Statistical data analysis, exploratory data visualization, and academic research.
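
The short sketch below draws one of the statistical plots mentioned above (a violin plot) using Seaborn's built-in "tips" example dataset; note that load_dataset fetches the data the first time, so internet access is assumed.

Python
# Draw a violin plot with Seaborn (illustrative example).
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.violinplot(data=tips, x="day", y="total_bill")
plt.title("Total bill by day")
plt.show()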

Plotly

Description: Plotly is a web-based and Python-based visualization library that allows users to
create interactive and shareable visualizations.

Key Features:

 Supports a wide range of chart types including 3D charts, geographic maps, and time
series plots.

 Allows for interactive elements like hover, zoom, and click for better data exploration.

 Integrates well with Python, R, MATLAB, and JavaScript. 

 Provides both open-source (Plotly.py) and enterprise versions.

Use Cases: Business intelligence, interactive dashboards, and web-based data visualizations.
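
As a brief illustration of Plotly's interactivity, the sketch below builds a hoverable scatter plot with Plotly Express from its bundled Gapminder sample data; the plotly package and its sample dataset are assumptions about the reader's environment.

Python
# Create an interactive scatter plot with Plotly Express (illustrative example).
import plotly.express as px

df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop", color="continent",
                 hover_name="country", log_x=True)
fig.show()   # opens an interactive figure with hover and zoom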

Tableau

Description: Tableau is one of the most popular and powerful data visualization tools. It
allows users to create complex, interactive, and shareable dashboards.

Key Features:

 Drag-and-drop interface for creating visualizations. 

 Supports a wide range of data sources including Excel, SQL databases, cloud services,
and APIs.

 Extensive library of visualization types, including maps, scatter plots, heatmaps, and
more.

 Offers Tableau Public for free data visualization. 

Use Cases: Business intelligence, data analytics, real-time reporting, and exploratory
analysis.

Power BI:

Description: Developed by Microsoft, Power BI is a business analytics tool focused on


creating interactive reports and dashboards.

Key Features:

 Easy integration with Microsoft products (e.g., Excel, Azure).

 Drag-and-drop report building and data modeling.

 Real-time data visualization with interactive reports and custom visuals.

 Supports sharing and collaboration through Power BI service.

Use Cases: Business intelligence, financial reporting, data exploration, and collaborative data
analysis.

Use Types of Data Visualization

1. Line Charts: In a line chart, each data point is represented by a point on the graph,
and these points are connected by a line. We may find patterns and trends in the data
across time by using line charts. Time-series data is frequently displayed using line

Page 86 of 116
charts.

2. Scatter Plots: A quick and efficient method of displaying the relationship between
two variables is to use scatter plots. With one variable plotted on the x-axis and the
other variable drawn on the y-axis, each data point in a scatter plot is represented by a
point on the graph. We may use scatter plots to visualize data to find patterns, clusters,
and outliers.

3. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar
chart, each category is represented by a bar, with the height of the bar indicating the
frequency or proportion of that category in the data. Bar graphs are useful for
comparing several categories and seeing patterns over time.

4. Heat Maps: Heat maps are a type of graphical representation that displays data in a
matrix format. The value of the data point that each matrix cell represents determines
its hue. Heatmaps are often used to visualize the correlation between variables or to
identify patterns in time-series data.

5. Tree Maps: Tree maps are used to display hierarchical data in a compact format and
are useful in showing the relationship between different levels of a hierarchy.

6. Box Plots: Box plots are a graphical representation of the distribution of a set of data.
In a box plot, the median is shown by a line inside the box, while the center box
depicts the range of the data. The whiskers extend from the box to the highest and
lowest values in the data, excluding outliers. Box plots can help us to identify the

Page 89 of 116
spread and skewness of the data.

Others:

Various types of visualizations cater to diverse data sets and analytical goals.

1. Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way to
understand proportions and percentages.
2. Heatmaps: Visualize complex data sets through color-coding, emphasizing variations
and correlations in a matrix.
3. Area Charts: Similar to line charts but with the area under the line filled, these charts
accentuate cumulative data patterns.
4. Bubble Charts: Enhance scatter plots by introducing a third dimension through varying
bubble sizes, revealing additional insights.
5. Treemaps: Efficiently represent hierarchical data structures, breaking down categories
into nested rectangles.
6. Violin Plots: Violin plots combine aspects of box plots and kernel density plots,
providing a detailed representation of the distribution of data.
7. Word Clouds: Word clouds are visual representations of text data where words are sized
based on their frequency.

8. 3D Surface Plots: 3D surface plots visualize three-dimensional data, illustrating how a
response variable changes in relation to two predictor variables.
9. Network Graphs: Network graphs represent relationships between entities using nodes
and edges. They are useful for visualizing connections in complex systems, such as
social networks, transportation networks, or organizational structures.
10. Sankey Diagrams: Sankey diagrams visualize flow and quantity relationships between
multiple entities. Often used in process engineering or energy flow analysis.

Applying data Visualization Best Practices

Effective data visualization is crucial for conveying insights accurately. Follow these
best practices to create compelling and understandable visualizations:

1. Audience-Centric Approach: Tailor visualizations to your audience’s knowledge
level, ensuring clarity and relevance. Consider their familiarity with data
interpretation and adjust the complexity of visual elements accordingly.

2. Design Clarity and Consistency: Choose appropriate chart types, simplify visual
elements, and maintain a consistent color scheme and legible fonts. This ensures a
clear, cohesive, and easily interpretable visualization.

3. Contextual Communication: Provide context through clear labels, titles,
annotations, and acknowledgments of data sources. This helps viewers understand the
significance of the information presented and builds transparency and credibility.

4. Engaging and Accessible Design: Design interactive features thoughtfully, ensuring
they enhance comprehension. Additionally, prioritize accessibility by testing
visualizations for responsiveness and accommodating various audience needs,
fostering an inclusive and engaging experience.

Interpreting visualizations results

Patterns and Trends:

Trends indicate the general direction in which data is moving over time or across a range of
values. They often suggest a consistent increase or decrease.

 Examples: 

o Time Series Data: A line graph showing a steady increase in sales over several
months.
o Trend Lines: Fitted lines in scatter plots that show the overall movement of
data points.

 Interpretation: Trends help in forecasting future values. For instance, if sales have been increasing steadily, one might predict continued growth.

Patterns

 Patterns refer to recognizable sequences or arrangements in data that indicate regularities or repetitions. They can occur over time or across different variables.

 Examples: 

o Seasonal Patterns: A graph showing sales peaks during holiday seasons each
year.

o Clustering Patterns: Groups of data points that cluster together in a scatter
plot, indicating similar characteristics (e.g., customer segmentation).

 Interpretation: Identifying patterns can help in recognizing recurring phenomena in data. For example, discovering that a specific demographic consistently buys a certain product can inform targeted marketing strategies.

Context and Background:

Context refers to the environment or situation surrounding the data. It includes details about
the dataset, the problem being solved, and any external factors influencing the analysis.

Components:

 Problem Statement: What specific problem are you trying to solve? (e.g., predicting
customer churn, classifying images) 

 Data Source: Where did the data come from? (e.g., customer databases, sensors,
public datasets) 

 Data Characteristics: Size, type, and structure of the data (e.g., time series,
categorical, continuous). 

 Business Goals: What are the objectives of the analysis? (e.g., increase sales, improve
customer satisfaction) 

Importance: Understanding the context helps ensure that the analysis is relevant and that the
interpretations of the visualizations align with business needs. For example, visualizing sales
data without knowing the marketing strategies employed can lead to misleading conclusions.

Background

 Background encompasses the historical information and previous analyses related to the dataset. It includes any relevant theories, prior findings, or established knowledge in the field.

 Components: 

o Historical Data: Previous trends or patterns observed in the data (e.g., past
sales performance).

o Research Literature: Existing studies or models related to the problem (e.g., previous machine learning models used for similar tasks).

o Domain Knowledge: Understanding of the industry or field (e.g., seasonality in retail, economic factors affecting sales).

 Importance: Background knowledge allows for a more nuanced interpretation of the visualizations. It helps in identifying anomalies and understanding whether current results are consistent with historical trends. For instance, if a sudden drop in sales is observed, knowing that a major competitor launched a new product might provide a clearer explanation.

Example Scenario

 Context: A retail company wants to analyze sales data to optimize inventory management.

 Background: Previous analyses showed a seasonal increase in sales during the holiday season, and the company previously faced stockouts.

 Visualization Result: A line graph shows an upward trend in sales leading up to the
holiday season, with a noticeable spike in late November. 

 Interpretation: Given the context of inventory challenges and the background of seasonal patterns, the company should prepare to increase inventory ahead of the peak sales period to avoid stockouts.

Correlations and Relationships:

 Correlation refers to a statistical measure that expresses the extent to which two
variables are linearly related. It indicates how changes in one variable are associated
with changes in another. 

 Types of Correlation: 

o Positive Correlation: As one variable increases, the other also tends to increase (e.g., height and weight).

o Negative Correlation: As one variable increases, the other tends to decrease (e.g., price and demand).

o No Correlation: There is no discernible relationship between the variables (e.g., shoe size and intelligence).

 Measurement: 

o Pearson Correlation Coefficient (r): Measures linear correlation. Ranges from -1 (perfect negative) to 1 (perfect positive), with 0 indicating no correlation.

o Spearman Rank Correlation: Measures the strength and direction of association between two ranked variables. Useful for non-linear relationships.

 Visualization: 

o Scatter Plots: Commonly used to visualize correlations. A positive correlation results in an upward slope, while a negative correlation results in a downward slope.

o Heatmaps: Often used to show correlation matrices for multiple variables, where colors indicate the strength and direction of correlations (a short sketch follows below).
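
As a short illustration of the two plots just mentioned, the following sketch computes a Pearson correlation matrix and draws it as a heatmap; the small dataset is hypothetical and seaborn is assumed to be installed.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical dataset with a few numeric columns
df = pd.DataFrame({
    'temperature': [20, 25, 30, 35, 40],
    'ice_cream_sales': [120, 150, 210, 260, 300],
    'umbrella_sales': [90, 70, 55, 40, 30],
})

# Pearson correlation matrix (values between -1 and 1)
corr = df.corr(method='pearson')
print(corr)

# Heatmap of the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()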

Relationships

 Relationships encompass a broader range of associations between two or more variables. They can be linear, non-linear, causal, or merely associative.

 Types of Relationships: 

o Linear Relationships: Straight-line relationships where one variable changes
at a constant rate relative to another.

o Non-linear Relationships: Relationships that follow a curved path (e.g., quadratic, exponential). These require more complex modeling to understand.

o Causal Relationships: Indicate that one variable directly affects another (e.g.,
increased advertising leads to higher sales). Correlation does not imply
causation.

 Measurement: 

o Regression Analysis: Used to model the relationship between variables. Linear regression models linear relationships, while non-linear regression models handle more complex relationships.

o Causal Inference Methods: Techniques like randomized controlled trials (RCTs) and observational studies to establish causal relationships.

 Visualization: 

o Line Charts: Good for showing trends and relationships over time.

o Multidimensional Plots: Visualize relationships among multiple variables (e.g., 3D scatter plots).

Key Differences

Aspect | Correlations | Relationships
Scope | Specifically measures linear association | Broader concept including various types
Nature | Quantitative, linear association | Can be qualitative or quantitative
Causality | Does not imply causation | Can imply causation
Measurement Tools | Correlation coefficients | Regression models, causal inference
Visualization | Scatter plots, heatmaps | Line charts, multidimensional plots

Practical Applications

 Correlations: Useful in exploratory data analysis to identify potential relationships
worth investigating further. For example, a strong positive correlation between study
time and exam scores might lead educators to implement more study sessions.

 Relationships: Crucial for predictive modeling in machine learning. Understanding the relationships among features allows for building accurate models. For instance, if you find that an increase in marketing spend has a causal relationship with sales growth, you might allocate more budget to marketing.

Example Scenario

1. Correlations: A dataset shows a Pearson correlation of 0.8 between temperature and ice cream sales. This suggests a strong positive correlation: when temperature rises, ice cream sales tend to increase.

2. Relationships: An analysis reveals that higher marketing spending causes increased ice cream sales (causal relationship). This might be examined through regression analysis, establishing that for every $1000 spent on marketing, sales increase by a certain percentage (a minimal sketch of such a regression follows below).
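
A minimal sketch of such a regression analysis is shown below, using hypothetical marketing-spend and sales figures and scikit-learn's LinearRegression; the numbers are invented for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: marketing spend (in thousands of dollars) and ice cream sales (units)
spend = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([1100, 1250, 1370, 1520, 1640, 1780])

model = LinearRegression().fit(spend, sales)
print("Extra sales per $1000 of spend:", model.coef_[0])
print("Baseline sales (intercept):", model.intercept_)
print("R^2:", model.score(spend, sales))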

Perform Data cleaning

Data cleaning overview

Definition: Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.

Purpose:

Improves Model Performance: Dirty data can lead to inaccurate models, high bias, or
overfitting.

Reduces Bias and Variance: Proper cleaning reduces noise and variability in the data,
improving generalization.

Ensures Consistency: Cleaning standardizes formats, handling inconsistencies in data types, values, and missing information.

Steps

1. Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations. Duplicate observations will happen most often during data collection.

2. Fix structural errors


Structural errors are when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find “N/A” and “Not Applicable” both appear,
but they should be analyzed as the same category.

3. Filter unwanted outliers


Often, there will be one-off observations where, at a glance, they do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper
data-entry, doing so will help the performance of the data you are working with.

4. Handle missing data


You can’t ignore missing data because many algorithms will not accept missing values. There are a few ways to deal with missing data. None of them is optimal, but each can be considered.

1. As a first option, you can drop observations that have missing values, but doing this will drop
or lose information, so be mindful of this before you remove it.
2. As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data because you may be operating from assumptions and not actual observations.

3. As a third option, you might alter the way the data is used to effectively navigate null values.

5. Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part
of basic validation:

 Does the data make sense? 

 Does the data follow the appropriate rules for its field? 

 Does it prove or disprove your working theory, or bring any insight to light?

 Can you find trends in the data to help you form your next theory? 

 If not, is that because of a data quality issue? 

Description of Characteristics of quality Data

1. Validity. The degree to which your data conforms to defined business rules or
constraints.

2. Accuracy. Ensure your data is close to the true values.

3. Completeness. The degree to which all required data is known.

4. Consistency. Ensure your data is consistent within the same dataset and/or across
multiple data sets.

5. Uniformity. The degree to which the data is specified using the same unit of measure.

Common Data Issues in Machine Learning


 Missing Data: Missing values in features or labels.
 Noisy Data: Data containing errors or outliers.
 Inconsistent Data: Mixed types (e.g., strings and numbers in the same column) or
discrepancies in categorical values.
 Duplicate Data: Redundant records that can bias the results.
 Irrelevant Data: Features that do not contribute meaningfully to the model.
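
To make these issues easier to spot in practice, the following minimal pandas sketch (using a small hypothetical dataset) surfaces several of them at once:

import pandas as pd

# Hypothetical dataset with a missing value, an inconsistent category and a duplicate row
df = pd.DataFrame({
    'age': [25, 30, None, 41, 41],
    'city': ['Kigali', 'kigali', 'Huye', 'Gisenyi', 'Gisenyi'],
})

print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # number of exact duplicate rows
print(df.dtypes)               # unexpected or mixed data types
print(df['city'].unique())     # inconsistent categorical values ('Kigali' vs 'kigali')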

Data cleaning for inconsistencies rectification

Importance of data Cleaning

Having clean data will ultimately increase overall productivity and allow for the highest
quality information in your decision-making. Benefits include:

 Removal of errors when multiple sources of data are at play.

 Fewer errors make for happier clients and less-frustrated employees.

 Ability to map the different functions and what your data is intended to do.

 Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications. 

 Using tools for data cleaning will make for more efficient business practices and
quicker decision-making.

Data cleaning Techniques

1. Handling Missing Values

Missing values are a common problem in datasets and can occur for various reasons, such as
data entry errors, equipment malfunctions, or incomplete surveys. It is crucial to handle
missing values appropriately, as they can lead to biased analysis and inaccurate results. There
are several techniques for handling missing values, two of which we will discuss next.

Removing missing values

In some cases, removing observations that contain missing values may be appropriate. This
technique is called complete case analysis and involves removing any observation that
contains one or more missing values.

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Drop rows with missing values
df_clean = df.dropna()

Imputing missing values

Simply dropping values might lead to several data points being missing in the dataset.
Instead, it may be appropriate to impute missing values, which involves replacing missing
values with estimates based on the available data. There are several techniques for imputing
missing values, including mean imputation, median imputation, and regression imputation.

import pandas as pd
from sklearn.impute import SimpleImputer

# Load data
df = pd.read_csv('data.csv')

# Impute missing values using mean imputation
imputer = SimpleImputer(strategy='mean')
df_clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
In this code snippet, we imputed missing values using mean imputation with the SimpleImputer class from scikit-learn.

2. Removing Duplicates

Duplicate observations can lead to biased analysis and inaccurate results. Identifying and removing duplicates is crucial to ensure the data used for analysis is accurate and reliable.

Dropping exact duplicates

This technique involves removing observations that are identical across all variables.
Duplicate records can skew the results of data analysis. When there are exact duplicates in the
dataset, it can lead to inflated counts or misleading statistics, which can impact the accuracy
of the analysis. Additionally, duplicates can cause memory issues and slow the analysis
process, especially when working with large datasets.

import pandas as pd

# Load data

df = pd.read_csv('data.csv')

# Drop exact duplicates

df_clean = df.drop_duplicates()
Fuzzy matching
Sometimes, duplicates in a dataset may not be exact matches due to variations in data entry or
formatting inconsistencies. These variations can include spelling errors, abbreviations, and
different representations of the same data. Fuzzy matching techniques can be used to identify
and remove these duplicates.

Fuzzy matching algorithms can help to identify and remove these duplicates by comparing
the similarity of the data values rather than exact matches. By using fuzzy matching, analysts
can identify duplicates that may have been missed by traditional exact matching techniques,
leading to more accurate and complete data sets.

import pandas as pd
from fuzzywuzzy import fuzz

# Load data
df = pd.read_csv('data.csv')

# Identify fuzzy duplicates by comparing a text column pairwise
# (the 'name' column and the 90% similarity threshold are illustrative assumptions)
duplicates = set()
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if fuzz.ratio(df.loc[i, 'name'], df.loc[j, 'name']) > 90:
            duplicates.add(j)
df_clean = df.drop(index=list(duplicates))
3. Handling Outliers

Outliers are extreme values that can have a significant impact on statistical analysis.
Identifying and handling outliers appropriately is essential to avoid biased analysis and
inaccurate results. Let us look at three techniques to handle outliers.

Trimming
Trimming is a method for handling outliers that involves removing extreme values from the
dataset. However, this technique is useful when the outliers are few and significantly different
from the rest of the data. Removing the outliers can reduce the impact of their extreme
values, and the remaining data can be analyzed without interference.

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Trimming: keep only rows whose values fall within 1.5 * IQR of the chosen column
column = 'price'  # illustrative column name; replace with the column being cleaned
q1 = df[column].quantile(0.25)
q3 = df[column].quantile(0.75)
iqr = q3 - q1
df_clean = df[(df[column] > q1 - 1.5 * iqr) & (df[column] < q3 + 1.5 * iqr)]



Winsorizing
Winsorization is a technique for handling outliers that involves replacing extreme values with
less extreme values. This technique reduces the impact of outliers on statistical analysis.

import pandas as pd
from scipy.stats.mstats import winsorize

# Load data
df = pd.read_csv('data.csv')

# Winsorize each column: cap the lowest and highest 5% of its values
df_clean = df.apply(lambda col: winsorize(col, limits=[0.05, 0.05]))

4. Standardizing Data

Data standardization is a technique for transforming variables to have a common scale. When
working with data from different sources or formats, there can be variations in how it is
represented, such as differences in units of measurement, data formats, and data structures,
making it difficult to compare variables or perform statistical analysis. Data standardization is
a crucial step in such cases.
Min-max scaling
Min-max scaling involves rescaling the values of a variable to a range between 0 and 1. This
is done by subtracting the minimum value of the variable from each value and dividing it by
the variable’s range.

This technique is useful for normalizing data with different scales or units, making it easier to compare and analyze. Min-max scaling preserves the relative relationships between the values of the variable, but it is sensitive to outliers: a single extreme value stretches the range and compresses the remaining values into a narrow band.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load data
df = pd.read_csv('data.csv')

# Min-max scale data to the [0, 1] range
scaler = MinMaxScaler()
df_clean = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

Z-score normalization
This technique involves transforming the values of a variable to a standardized distribution
with a mean of 0 and a standard deviation of 1. This is done by subtracting the variable’s
mean from each value and dividing it by its standard deviation.

Z-score normalization also preserves the relative relationships between the values of the
variable but is less sensitive to outliers than min-max scaling.

import pandas as pd

from sklearn.preprocessing import StandardScaler

# Load data

df = pd.read_csv('data.csv')

# Z-score normalize data

scaler = StandardScaler()

df_clean = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

5. Data Validation

Data validation is a process of checking and verifying that the data is accurate, complete, and
consistent. Data validation techniques are used to identify and correct errors, inconsistencies,
and outliers in the data that may affect the quality of the analysis.
Range Checks
Range checks are a data validation technique used to ensure that values fall within a specified range or domain. This technique is particularly useful for numerical data, where values outside a specified range may indicate errors or outliers.

def range_check(data, min_val, max_val):
    for value in data:
        if value < min_val or value > max_val:
            print("Value out of range:", value)

Logic Checks
Logic checks are a data validation technique used to ensure that the data conforms to a
specified logical structure or relationship. This technique is particularly useful for checking
the relationships between variables and can help identify errors in data entry or processing.
For example, if a dataset contains variables representing a person’s height and weight, a logic
check may be used to ensure that the Body Mass Index is calculated correctly.

def logic_check(data):
    for i in range(len(data) - 1):
        if not is_valid(data[i], data[i + 1]):
            print("Invalid values:", data[i], data[i + 1])

def is_valid(value1, value2):
    # Placeholder: replace with the logical rule the two values must satisfy
    # (e.g., a stored BMI matching weight / height**2 within a small tolerance)
    return True

Using Coresets
As previously stated, the DataHeroes Coreset library, at its build phase, computes an
“Importance” metric for each data sample which can help fix mislabeling errors, thus
validating the dataset.

For example, let’s say we use the following code snippet to identify samples with very high
“Importance” values, i.e., samples that are out of the ordinary for that class distribution.



from dataheroes import CoresetTreeServiceLG

# Build the coreset tree service object (this snippet assumes it runs inside a class
# that already holds self.X, self.y, self.predicted_classid and self.num_images_to_view)
self.coreset_service = CoresetTreeServiceLG(optimized_for='cleaning')
self.coreset_service.build(self.X, self.y)

# Get the top “num_images_to_view” most important samples for the predicted class
result = self.coreset_service.get_important_samples(
    class_size={self.predicted_classid: self.num_images_to_view}
)
important_sample_indices = result["idx"]

One option is to reassign these samples to their true class, but the correct label is not always obvious to identify or feasible to obtain. In such cases, we may want to delete the samples to make the Coreset more accurate. To achieve this, we use the remove_samples function, as shown in the following code fragment.

# Remove the sample indices stored in `important_sample_indices` from the Coreset
self.coreset_service.remove_samples(important_sample_indices)

# Re-evaluate the AUROC score for the cleaned dataset
# (eval_lg_cv and sample_weights are helpers assumed to be defined in the surrounding class)
self.sample_weights[important_sample_indices] = 0
new_score = self.eval_lg_cv(sample_weights=self.sample_weights)

Final Thoughts

Data cleaning is an essential step in the data preprocessing pipeline of machine learning
projects. It involves identifying and correcting errors and removing inconsistencies to ensure
that the data used for analysis is reliable.



By cleaning data regularly, organizations can improve their decision-making, increase
efficiency, improve the performance of predictive models, and ensure compliance with data
privacy and security regulations.

Data normalization

Importance of data normalization

Normalization is a specific form of feature scaling that transforms the range of features to a
standard scale. Normalization and, for that matter, any data scaling technique is required only
when your dataset has features of varying ranges. Normalization encompasses diverse
techniques tailored to different data distributions and model requirements.

Why Normalize Data?

Normalized data enhances model performance and improves the accuracy of a model. It aids
algorithms that rely on distance metrics, such as k-nearest neighbors or support vector
machines, by preventing features with larger scales from dominating the learning process.

Normalization fosters stability in the optimization process, promoting faster convergence during gradient-based training. It mitigates issues related to vanishing or exploding gradients, allowing models to reach optimal solutions more efficiently.

Let’s take a simple example to highlight the importance of normalizing data. We are trying to
predict housing prices based on various features such as square footage, number of bedrooms,
and distance to the supermarket, etc. The dataset contains diverse features with varying
scales, such as:

 Square footage: Ranges from 500 to 5000 square feet

 Number of bedrooms: Ranges from 1 to 5 

 Distance to supermarket: Ranges from 0.1 to 10 miles
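
The sketch below uses two hypothetical houses from this example to show the problem: without scaling, the Euclidean distance used by an algorithm such as k-nearest neighbors is dominated almost entirely by square footage, while after min-max scaling every feature contributes on a comparable scale.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical houses: [square footage, bedrooms, distance to supermarket (miles)]
houses = np.array([
    [1500.0, 3.0, 2.0],
    [1550.0, 5.0, 9.5],
])

# Unscaled distance: the square-footage difference of 50 dwarfs the other features
print(np.linalg.norm(houses[0] - houses[1]))  # about 50.6

# Scale all features to [0, 1] using the feature ranges from the example above
reference = np.array([
    [500.0, 1.0, 0.1],    # minimum of each feature
    [5000.0, 5.0, 10.0],  # maximum of each feature
    [1500.0, 3.0, 2.0],
    [1550.0, 5.0, 9.5],
])
scaled = MinMaxScaler().fit_transform(reference)

# Scaled distance: bedrooms and distance to the supermarket now matter as much as size
print(np.linalg.norm(scaled[2] - scaled[3]))  # about 0.91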

Data normalization Techniques

Min-Max normalization:

This method of normalising data involves transforming the original data linearly. The data’s
minimum and maximum values are obtained, and each value is then changed using the
formula that follows.

x' = (x - min(X)) / (max(X) - min(X))

The formula works by subtracting the minimum value from the original value to determine how far the value is from the minimum. Then, it divides this difference by the range of the variable (the difference between the maximum and minimum values).

This division scales the variable to a proportion of the entire range. As a result, the normalized value falls between 0 and 1.

 When the feature X is at its minimum, the normalized value (x’) is 0. This is because
the numerator becomes zero. 

 Conversely, when X is at its maximum, x’ is 1, indicating full-scale normalization. 

 For values between the minimum and maximum, x’ ranges between 0 and 1,
preserving the relative position of X within the original range. 
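
A quick worked example, using the square-footage range from the housing illustration above (the specific house size is hypothetical):

# A 2,750 sq ft house, with square footage ranging from 500 to 5000
x, x_min, x_max = 2750, 500, 5000
x_scaled = (x - x_min) / (x_max - x_min)
print(x_scaled)  # (2750 - 500) / (5000 - 500) = 0.5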

Normalisation through decimal scaling:

The data is normalised by shifting the decimal point of its values. By dividing each data value
by the maximum absolute value of the data, we can use this technique to normalise the data.
The following formula is used to normalise the data value, v, of the data to v’:

v' = v / 10^j

where v' is the normalized value, v is the original value, and j is the smallest integer such that max(|v'|) < 1. This formula involves dividing each data value by an appropriate power of 10 to ensure that the resulting normalized values are within a specific range.
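
A short worked example with made-up values: if the largest absolute value in the data is 986, then j = 3, since dividing by 10^3 brings every value below 1.

import numpy as np

# Hypothetical values; the maximum absolute value is 986, so j = 3
v = np.array([986, -354, 72, 5])
v_scaled = v / 10**3
print(v_scaled)  # [ 0.986 -0.354  0.072  0.005]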

Normalisation of Z-score or Zero Mean (Standardisation):


(Source: https://www.geeksforgeeks.org/what-is-data-normalization/)

Using the mean and standard deviation of the data, values are normalised in this technique to create a standard normal distribution (mean: 0, standard deviation: 1). The equation that is applied is:

v' = (v - mean_A) / std_A

where mean_A is the mean of the data A and std_A is its standard deviation.
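
A small worked example with assumed statistics: a value of 75 in data with mean 60 and standard deviation 10 normalizes to 1.5, meaning it lies 1.5 standard deviations above the mean.

# Hypothetical value, mean and standard deviation
v, mean_A, std_A = 75, 60, 10
v_scaled = (v - mean_A) / std_A
print(v_scaled)  # 1.5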

Difference Between Normalization and Standardization

Normalization | Standardization
Normalization scales the values of a feature to a specific range, often between 0 and 1. | Standardization scales the features to have a mean of 0 and a standard deviation of 1.
Applicable when the feature distribution is uncertain. | Effective when the data distribution is Gaussian.
Susceptible to the influence of outliers. | Less affected by the presence of outliers.
Alters the shape of the original distribution. | Maintains the shape of the original distribution.
Scales values to ranges like [0, 1]. | Scale values are not constrained to a specific range.
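
The following sketch contrasts the two on a single hypothetical feature that contains one large outlier, using scikit-learn's MinMaxScaler and StandardScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature with one large outlier
x = np.array([[10.0], [12.0], [11.0], [13.0], [100.0]])

# Normalization: the outlier becomes 1 and squeezes the ordinary values close to 0
print(MinMaxScaler().fit_transform(x).ravel())

# Standardization: values are centred on mean 0 with standard deviation 1,
# and are not forced into a fixed range by the outlier
print(StandardScaler().fit_transform(x).ravel())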

✓ Data transformation

Data transformation: the process of converting and cleaning raw data from one data source
to meet the requirements of its new location. Also called data wrangling, transforming data is
essential to ingestion workflows that feed data warehouses and modern data lakes.

Why do we need to use transformation on our data?

To understand why this transformation is needed, you first need to know about the normal distribution. The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is widely used in Machine Learning and statistical modeling. It is a bell-shaped curve that is symmetrical around its mean and is characterized by its mean and standard deviation.

Data is not always distributed normally; it can be skewed to the left or to the right. In a right-skewed distribution (also called a positive skew), the long tail extends to the right and most data points are concentrated on the left. In a left-skewed distribution (a negative skew), the long tail extends to the left and most data points are concentrated on the right. A symmetric, bell-shaped curve indicates a normal distribution.

Let’s understand Right-skewed data:

 Mean: The mean is usually greater than the median due to the influence of the longer
right tail.

 Median: The median is the middle value when the data is ordered. In a right-skewed
distribution, it is less than the mean.

 Mode: The mode is the most frequent value. It tends to be on the left side, where the
majority of the data is concentrated.



And now Left-skewed data:

 Mean: The mean is usually less than the median due to the influence of the longer left
tail.

 Median: The median is still the middle value when the data is ordered. In a left-
skewed distribution, it is greater than the mean.

 Mode: The mode is still the most frequent value but now tends to be on the right side,
where the majority of the data is concentrated.
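
These relationships can be checked numerically. The minimal sketch below generates a right-skewed sample (an exponential distribution, chosen only for illustration) and compares its mean, median, and skewness:

import numpy as np
from scipy import stats

# Right-skewed sample: exponential data has a long right tail
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)

print("mean:", np.mean(data))         # pulled toward the long right tail
print("median:", np.median(data))     # smaller than the mean for right-skewed data
print("skewness:", stats.skew(data))  # positive value indicates a right (positive) skew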

Importance of data transformation

In simple terms, data transformation means changing and improving how we work with data.
Here are the benefits of data transformation:

1. Improved data quality: Data transformation helps identify and correct inconsistencies, errors, and missing values, leading to cleaner and more accurate data for analysis.

2. Enhanced data integration: By converting data into a standardized format, data transformation enables data integration from multiple sources, fostering collaboration and data sharing among different systems.

3. Better decision making and business intelligence: With clean and integrated data,
organizations can make more informed decisions based on accurate insights, which
improves efficiency and competitiveness.

4. Scalability: Data transformation helps teams manage increasing volumes of data, allowing organizations to scale their data processing and analytics capabilities as needed.

5. Data privacy: Protect data privacy and comply with data protection regulations by
transforming sensitive data through techniques like anonymization,
pseudonymization, or encryption.

6. Improved data visualization: Transforming data into appropriate formats or aggregating it in meaningful ways makes it easier to create engaging and insightful data visualizations.



7. Easier machine learning: Data transformation prepares data for machine learning
algorithms by converting it into a suitable format and addressing issues like missing
values or class imbalance, which can improve model performance.

Types of Data transformation

Data Transformations are of two types:

1. Function Transformation: It is a method that we use to transform the data by using some mathematical functions.

2. Power Transformations: It is a method that involves raising each data point to a certain power.

Data transformation techniques

Different techniques used in Function Transformation:

1. Log Transformation

The log transformation is a mathematical operation applied to each data point in a dataset
using the natural logarithm (base ‘e’). The natural logarithm of a number x is denoted as ln(x)
or log_e(x).

Why Log Transformation?

The purpose of log transformation is to reduce the impact of extreme values and make the
data more interpretable and suitable for certain types of analyses or modeling.

Mathematical Formula:

A mathematical formula for a log transformer is,

y = log_e(x)

In this formula:

 y is the transformed value after applying the logarithm.

 x is the original value.

 log_e denotes the natural logarithm, which has the base e.

 e is the mathematical constant approximately equal to 2.71828



Remember these points:

 This transformation makes the data closer to a normal distribution, but it does not guarantee an exactly normal distribution.

 This transformation is not applied to those features which have negative values.

 This transformation is mostly applied to right-skewed data.

Simple code:

import numpy as np

import matplotlib.pyplot as plt

# Generate example data

original_data = np.linspace(1, 100000, 100)

# Apply log transformation

log_transformed_data = np.log(original_data)

# Plot original and log-transformed data

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)

plt.scatter(range(len(original_data)), original_data)

plt.title('Original Data')

plt.subplot(1, 2, 2)

plt.scatter(range(len(log_transformed_data)), log_transformed_data)

plt.title('Log-Transformed Data')

plt.show()

2. Reciprocal Transformation

It is another transformation technique that involves converting the non-zero data points (x) into their reciprocal (1/x).



Why Reciprocal Transformation:

 If a feature contains large values, taking the reciprocal can help scale them down.

 In a right-skewed distribution where most values are small, taking the reciprocal can
spread out the small values and bring them closer together.

Mathematical Formula:

The reciprocal y of a non-zero number x is given by:

y = 1/x

 x is the original value.

 y is the reciprocal of x.

If x is a small positive number, its reciprocal will be a larger value, and vice versa.

Remember these points:

 This transformation is not defined for zero.

 This transformation reverses the order among values of the same sign: if x is a small positive number, its reciprocal will be a larger value, and vice versa.

Simple Code:

import numpy as np

# Original data (use a float dtype so the reciprocal is not truncated to an integer)
data = np.array([2, 4, 8, 16, 32], dtype=float)

# Reciprocal transformation
reciprocal_data = np.reciprocal(data)

# Print the results
print("Original Data:", data)
print("Reciprocal Transformation:", reciprocal_data)

#Output
Original Data: [ 2.  4.  8. 16. 32.]
Reciprocal Transformation: [0.5     0.25    0.125   0.0625  0.03125]

3. Square Transformation:

This technique involves converting the data points (x) into their square (x²).

Why Square Transformation:

By squaring the values, the transformation can spread out the data, reducing the skewness and
potentially achieving a more symmetrical distribution.

Formula:

y=(x²)

where y is a squared result and x is the original data point.

Remember these points:

 This technique is applied to the left-skewed data.

 When you apply this technique, squaring a negative value will result in a positive
value.

 However, it’s important to note that squaring a more negative value results in a larger positive value, e.g. (-5)**2 -> 25, (-50)**2 -> 2500.

Simple Code:

import numpy as np

# Original data

data = np.array([-3, -2, -1, 0, 1, 2, 3])

# Squaring transformation

squared_data = np.square(data)

# Displaying the results

print("Original Data:", data)

print("Squared Data:", squared_data)

#Output

Original Data: [-3 -2 -1 0 1 2 3]


Squared Data: [9 4 1 0 1 4 9]

4. Square Root Transformation:

This technique involves taking the square root of each value in the data, i.e., applying the
formula (sqrt(x)).

Why Square Root Transformation:

This transformation is beneficial when dealing with right-skewed data or a few extreme
outliers. Taking the square root can help compress the larger values and make the distribution
more symmetric.

Formula:

y = sqrt(x)

Remember these points:

 This transformation is defined only for non-negative numbers.

 This transformation is weaker than Log Transformation.

 This can be used to reduce the skewness of right-skewed data.

Simple Code:

import numpy as np

# Original data points

data_points = np.array([4, 9, 16, 25, 36])

# Square root transformation

transformed_data = np.sqrt(data_points)

# Display original and transformed data

print("Original Data Points:", data_points)

print("Transformed Data Points (Square Root):", transformed_data)

#Output

Original Data Points: [ 4 9 16 25 36]

Transformed Data Points (Square Root): [2. 3. 4. 5. 6.]


Learning outcome 2: Develop Machine Learning Model

2.1. Machine Learning algorithm is properly selected based on the characteristics of the
dataset.

2.2. Machine Learning models are properly trained based on a training set of data.

2.3. Machine Learning model performance is properly evaluated based on appropriate evaluation metrics

2.4. Hyperparameters are properly finetuned based on evaluation results
