Machine Learning
School: ESSJT
Learning Hours: 80
TRAINER:
1.4. Data cleaning is appropriately performed based on the provided dataset.
2. Develop Machine Learning Model
2.1. Machine Learning algorithm is properly selected based on the characteristics of the dataset.
2.2. Machine Learning models are properly trained based on a training set of data.
2.3. Machine Learning model performance is properly evaluated based on appropriate evaluation metrics.
2.4. Hyperparameters are properly fine-tuned based on evaluation results.
3. Perform Model Deployment
3.1. Deployment methods are clearly selected based on the requirements.
3.2. Model file is properly integrated in the system based on the deployment method (RESTful API guidelines).
3.3. Prediction responses are accurately delivered to the clients based on the model insights.
1. Apply Data Pre-processing
1.1. Environment is properly prepared based on system requirements.
Machine Learning (ML) is the field of computer science with the help of which computer
systems can make sense of data in much the same way as human beings do.
In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data
by using an algorithm or method. The main focus of ML is to allow computer systems to learn
from experience without being explicitly programmed.
Key Components of Machine Learning
3. Algorithms: These are the methods used to train models. They adjust the model's
parameters to minimize the difference between the model's predictions and the
actual observed outcomes.
4. Evaluation: This involves assessing how well your model is performing. Common
metrics for this include accuracy, precision, recall, and F1 score, depending on the
problem you're solving (e.g., classification, regression).
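As a concrete illustration of the evaluation component, the short sketch below computes these metrics with scikit-learn (assuming scikit-learn is installed; the label arrays are made-up examples, not from the source):
Python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Each metric compares the predictions against the actual observed outcomes
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))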
Machine learning has given computer systems the ability to learn automatically without
being explicitly programmed. But how does a machine learning system work? It can be
described using the machine learning life cycle. The machine learning life cycle is a cyclic
process for building an efficient machine learning project. The main purpose of the life cycle is to
find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
In the complete life cycle process, to solve a problem, we create a machine learning system
called a "model", and this model is created by providing "training". But to train a model, we need
data; hence, the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to
identify and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, databases, the internet, or mobile devices. It is one of the most important
steps of the life cycle, because the quantity and quality of the collected data determine the
efficiency of the output: the more data we have, the more accurate the prediction will be.
By performing the above tasks, we get a coherent set of data, also called a dataset, which will
be used in the further steps.
2. Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step
where we put our data into a suitable place and prepare it for use in machine learning
training.
In this step, we first put all the data together and then randomize the ordering of the data.
o Data exploration:
It is used to understand the nature of data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of the data leads to an effective outcome. In this step, we find
correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is preprocessing of data for its analysis.
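As an illustration of preparation and exploration, here is a minimal sketch using pandas (assuming pandas is installed; the file name collected_data.csv and its columns are hypothetical placeholders):
Python
import pandas as pd

# Hypothetical dataset file; replace with your own collected data
df = pd.read_csv("collected_data.csv")

df = df.sample(frac=1, random_state=42)            # randomize the ordering of the rows
print(df.head())                                   # inspect format and a few records
print(df.describe())                               # general trends and possible outliers
print(df.select_dtypes(include="number").corr())   # correlations between numeric columns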
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a useable format. It is
the process of cleaning the data, selecting the variable to use, and transforming the data in a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to address the quality
issues.
It is not necessary that data we have collected is always of our use as some of the data may not
be useful. In real-world applications, collected data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove these issues because they can negatively affect the
quality of the outcome.
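For example, a minimal cleaning sketch with pandas (the file raw_data.csv and the column names are hypothetical placeholders):
Python
import pandas as pd

# Hypothetical raw dataset with quality issues
df = pd.read_csv("raw_data.csv")

df = df.drop_duplicates()                          # remove duplicate records
df = df.dropna(subset=["target"])                  # drop rows whose label is missing
df["age"] = df["age"].fillna(df["age"].median())   # impute missing numeric values
df = df[df["age"].between(0, 120)]                 # filter out invalid or noisy values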
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
o Building models
o Review the result
The aim of this step is to build a machine learning model to analyze the data using various
analytical techniques and review the outcome. It starts with determining the type of problem,
where we select machine learning techniques such as Classification, Regression, Cluster
analysis, Association, etc.; we then build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
Now the next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model
is required so that it can learn the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test the model.
In this step, we check for the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirements of
the project or problem.
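A minimal train-and-test sketch with scikit-learn, using the built-in Iris dataset as a stand-in for project data:
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # small built-in dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)               # hold out a test set

model = DecisionTreeClassifier().fit(X_train, y_train)  # train the model
predictions = model.predict(X_test)                     # test the model
print("Test accuracy:", accuracy_score(y_test, predictions))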
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.
If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the
project, we will check whether it is improving its performance using available data or not. The
deployment phase is similar to making the final report for a project.
https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications
1. Social Media Features
Social media platforms use machine learning algorithms and approaches to create attractive
and useful features. For instance, Facebook notices and records your activities, chats, likes,
comments, and the time you spend on specific kinds of posts. Machine learning learns from
your own experience and makes friend and page suggestions for your profile.
2. Product Recommendations
Product recommendation is one of the most popular and known applications of machine
learning. Product recommendation is one of the stark features of almost every e-commerce
website today, which is an advanced application of machine learning techniques.
Using machine learning and AI, websites track your behavior based on your previous
purchases, searching patterns, and cart history, and then make product recommendations.
3. Image Recognition
Image recognition, which is an approach for cataloging and detecting a feature or an object in
the digital image, is one of the most significant and notable machine learning and AI
techniques. This technique is being adopted for further analysis, such as pattern recognition,
face detection, and face recognition.
4. Sentiment Analysis
Sentiment analysis is one of the most necessary applications of machine learning. Sentiment
analysis is a real-time machine learning application that determines the emotion or opinion of
the speaker or the writer. For instance, if someone has written a review or email (or any form
of a document), a sentiment analyzer will instantly find out the actual thought and tone of the
text. This sentiment analysis application can be used to analyze review-based websites,
decision-making applications, etc.
5. Automating Employee Access Control
Organizations are actively implementing machine learning algorithms to determine the level
of access employees would need in various areas, depending on their job profiles. This is one
of the coolest applications of machine learning.
6. Marine Wildlife Preservation
Machine learning algorithms are used to develop behavior models for endangered cetaceans
and other marine species, helping scientists regulate and monitor their populations.
7. Regulating Healthcare Efficiency and Medical Services
Significant healthcare sectors are actively looking at using machine learning algorithms to
manage better. They predict the waiting times of patients in the emergency waiting rooms
across various departments of hospitals. The models use vital factors that help define the
algorithm, details of staff at various times of day, records of patients, and complete logs of
department chats and the layout of emergency rooms. Machine learning algorithms also come
into play in detecting diseases, planning therapy, and predicting how a disease will progress.
This is one of the most important machine learning applications.
8. Predict Potential Heart Failure
An algorithm designed to scan a doctor’s free-form e-notes and identify patterns in a patient’s
cardiovascular history is making waves in medicine. Instead of a physician digging through
multiple health records to arrive at a sound diagnosis, redundancy is now reduced with
computers making an analysis based on available information.
9. Banking Domain
Banks are now using the latest advanced technology machine learning has to offer to help
prevent fraud and protect accounts from hackers. The algorithms determine what factors to
consider to create a filter to keep harm at bay. Various sites that are unauthentic will be
automatically filtered out and restricted from initiating transactions.
10. Language Translation
One of the most common machine learning applications is language translation. Machine
learning plays a significant role in the translation of one language to another. We are amazed
at how websites can translate from one language to another effortlessly and give contextual
meaning as well. The technology behind the translation tool is called ‘machine translation.’ It
has enabled people to interact with others from all around the world; without it, life would
not be as easy as it is now. It has provided confidence to travelers and business associates to
safely venture into foreign lands with the conviction that language will no longer be a barrier.
https://www.javatpoint.com/machine-learning-life-cycle
Advantages of Machine Learning
1. Automation
Machine Learning is one of the driving forces behind automation, and it is cutting down time
and human workload. Automation can now be seen everywhere, and complex algorithms do
the hard work for the user. Automation is more reliable, efficient, and quick. With the help of
machine learning, advanced computers are now being designed that can handle several
machine-learning models and complex algorithms. However, although automation is spreading
fast in industry, a lot of research and innovation are still required in this field.
2. Scope of Improvement
Machine Learning is a field where things keep evolving. It offers many opportunities for
improvement and can become a leading technology in the future. A lot of research and
innovation is happening in this technology, which helps improve both software and hardware.
Machine Learning is going to be used extensively in the education sector, and it will enhance
the quality of education and the student experience; in China, for example, machine learning
has been used to improve student focus. In the e-commerce field, Machine Learning studies your
search feed and gives suggestions based on it. Depending upon your search and browsing history,
it pushes targeted advertisements and notifications to users.
This technology has a very wide range of applications. Machine learning plays a role in almost
every field, like hospitality, ed-tech, medicine, science, banking, and business. It creates
more opportunities.
Disadvantages of Machine Learning
Nothing in the world is perfect, and Machine Learning has some serious limitations of its own.
1. Data Acquisition
The whole concept of machine learning is about identifying useful data. The outcome will be
incorrect if a credible data source is not provided. The quality of the data is also significant: if
the user or institution needs higher-quality data, they have to wait for it, which delays the
output. So, machine learning significantly depends on the data and its quality.
2. Time and Resources
The data that machines process remains huge in quantity and differs greatly. Machines require
time so that their algorithms can adjust to the environment and learn from it. Trial runs are held
to check the accuracy and reliability of the machine. Setting up that quality of infrastructure
requires massive and expensive resources and high-quality expertise, and trial runs are costly
in terms of both time and expense.
3. Results Interpretation
One of the major limitations of Machine Learning is that the interpreted results we get from it
cannot be a hundred percent accurate; there will always be some degree of inaccuracy. For a
high degree of accuracy, algorithms should be developed so that they give reliable results.
4. High Error Susceptibility
An error committed during the initial stages can be huge, and if it is not corrected at that time,
it creates havoc. Bias and incorrectness have to be dealt with separately; they are not
interconnected. Machine learning depends on two factors, i.e., data and algorithm, and all
errors depend on these two variables. Any incorrectness in either variable would have huge
repercussions on the output.
5. Social Changes
Machine learning is bringing numerous social changes. The role of machine learning-based
technology in society has increased manifold. It is influencing the thought process of society
and creating unwanted problems, such as character assassination and the spread of sensitive
details, which disturb the social fabric.
6. Elimination of Human Interface
Automation, Artificial Intelligence, and Machine Learning have eliminated the human interface
from some work. This has eliminated employment opportunities, as those tasks are now
carried out with the help of artificial intelligence and machine learning.
7. Changing Nature of Jobs
With the advancement of machine learning, the nature of jobs is changing. More and more work
is being done by machines, eating up jobs that were earlier done by humans. It is difficult for
those without technical education to adjust to these changes.
8. Highly Expensive
This software is highly expensive, and not everybody can own it. Government agencies, big
private firms, and enterprises mostly own it. It needs to be made accessible to everybody for
wide use.
9. Privacy Concern
As we know, one of the pillars of machine learning is data, and the collection of data has raised
fundamental questions of privacy. The way data is collected and used for commercial
purposes has always been a contentious issue. In India, for example, the Supreme Court has
declared privacy a fundamental right; without the user's permission, data cannot be
collected, used, or stored. However, many cases have come up in which big firms collected
data without the user's knowledge and used it for their commercial gains.
10. Research and Innovation
Machine learning is an evolving concept. This area has not yet seen major developments that
fully revolutionize any economic sector, and it requires continuous research and innovation.
https://www.javatpoint.com/advantages-and-disadvantages-of-machine-learning
Artificial Intelligence
Artificial Intelligence is basically the mechanism to incorporate human intelligence into
machines through a set of rules (algorithms). AI is a combination of two words: "Artificial",
meaning something made by humans or non-natural, and "Intelligence", meaning the
ability to understand or think accordingly. Another definition could be that "AI is basically
the study of training your machines (computers) to mimic a human brain and its
thinking capabilities".
AI focuses on three major aspects (skills): learning, reasoning, and self-correction, to obtain
the maximum efficiency possible.
Machine Learning:
Machine Learning is basically the study/process which enables a system (computer) to
learn automatically on its own through the experiences it has had and improve accordingly,
without being explicitly programmed. ML is an application or subset of AI. ML focuses on
the development of programs so that the system can access data and use it to learn for itself.
The entire process makes observations on data to identify possible patterns and make better
future decisions based on the examples provided. The major aim of ML is to allow
systems to learn by themselves through experience without any kind of human
intervention or assistance.
Deep Learning:
Deep Learning is basically a sub-part of the broader family of Machine Learning which
makes use of neural networks (similar to the neurons working in our brain) to mimic
human brain-like behavior. DL algorithms focus on information-processing patterns to
identify patterns in much the same way as the human brain does and classify the
information accordingly. DL works on larger sets of data than ML, and the prediction
mechanism is self-administered by the machine.
Below is a table of differences between Artificial Intelligence, Machine Learning and Deep
Learning:
Artificial Intelligence: AI is a computer algorithm which exhibits intelligence through decision making.
Machine Learning: ML is an AI algorithm which allows systems to learn from data.
Deep Learning: DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data and provide output accordingly.

Artificial Intelligence: Three broad categories/types of AI are: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI).
Machine Learning: Three broad categories/types of ML are: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
Deep Learning: DL can be considered as neural networks with a large number of parameters and layers, lying in one of the four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.

Artificial Intelligence: Examples of AI applications include: Google’s AI-Powered Predictions, Ridesharing Apps Like Uber and Lyft, Commercial Flights Use an AI Autopilot, etc.
Machine Learning: Examples of ML applications include: Virtual Personal Assistants (Siri, Alexa, Google, etc.), Email Spam and Malware Filtering.
Deep Learning: Examples of DL applications include: Sentiment based news aggregation, Image analysis and caption generation, etc.

Artificial Intelligence: AI can be further broken down into various subfields such as robotics, natural language processing, computer vision, expert systems, and more.
Machine Learning: ML algorithms can be categorized as supervised, unsupervised, or reinforcement learning. In supervised learning, the algorithm is trained on labeled data, where the desired output is known. In unsupervised learning, the algorithm is trained on unlabeled data, where the desired output is unknown.
Deep Learning: DL algorithms are inspired by the structure and function of the human brain, and they are particularly well-suited to tasks such as image and speech recognition.

Artificial Intelligence: AI systems can be rule-based, knowledge-based, or data-driven.
Machine Learning: In reinforcement learning, the algorithm learns by trial and error, receiving feedback in the form of rewards or punishments.
Deep Learning: DL networks consist of multiple layers of interconnected neurons that process data in a hierarchical manner, allowing them to learn increasingly complex representations of the data.
Examples of Artificial Intelligence:
Virtual Personal Assistants (VPA) like Siri or Alexa – these use natural
language processing to understand and respond to user requests, such as playing
music, setting reminders, and answering questions.
Autonomous vehicles – self-driving cars use AI to analyze sensor data, such as
cameras and lidar, to make decisions about navigation, obstacle avoidance, and
route planning.
Fraud detection – financial institutions use AI to analyze transactions and
detect patterns that are indicative of fraud, such as unusual spending patterns or
transactions from unfamiliar locations.
Image recognition – AI is used in applications such as photo organization,
security systems, and autonomous robots to identify objects, people, and scenes
in images.
Natural language processing – AI is used in chatbots and language translation
systems to understand and generate human-like text.
Predictive analytics – AI is used in industries such as healthcare and marketing
to analyze large amounts of data and make predictions about future events, such
as disease outbreaks or consumer behavior.
Game-playing AI – AI algorithms have been developed to play games such as
chess, Go, and poker at a superhuman level, by analyzing game data and making
predictions about the outcomes of moves.
Examples of Machine Learning:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves the use of
algorithms and statistical models to allow a computer system to “learn” from data and
improve its performance over time, without being explicitly programmed to do so.
Natural language processing (NLP): Machine learning algorithms are used in
NLP systems to understand and generate human language. These systems are
used in chatbots, virtual assistants, and other applications that involve natural
language interactions.
Recommendation systems: Machine learning algorithms are used in
recommendation systems to analyze user data and recommend products or
services that are likely to be of interest. These systems are used in e-commerce
sites, streaming services, and other applications.
Sentiment analysis: Machine learning algorithms are used in sentiment analysis
systems to classify the sentiment of text or speech as positive, negative, or
neutral. These systems are used in social media monitoring and other
applications.
Predictive maintenance: Machine learning algorithms are used in predictive
maintenance systems to analyze data from sensors and other sources to predict
when equipment is likely to fail, helping to reduce downtime and maintenance
costs.
Spam filters in email – ML algorithms analyze email content and metadata to
identify and flag messages that are likely to be spam.
Recommendation systems – ML algorithms are used in e-commerce websites
and streaming services to make personalized recommendations to users based on
their browsing and purchase history.
Predictive maintenance – ML algorithms are used in manufacturing to predict
when machinery is likely to fail, allowing for proactive maintenance and
reducing downtime.
Credit risk assessment – ML algorithms are used by financial institutions to
assess the credit risk of loan applicants, by analyzing data such as their income,
employment history, and credit score.
Customer segmentation – ML algorithms are used in marketing to segment
customers into different groups based on their characteristics and behavior,
allowing for targeted advertising and promotions.
Fraud detection – ML algorithms are used in financial transactions to detect
patterns of behavior that are indicative of fraud, such as unusual spending
patterns or transactions from unfamiliar locations.
Speech recognition – ML algorithms are used to transcribe spoken words into
text, allowing for voice-controlled interfaces and dictation software.
Examples of Deep Learning:
Deep Learning is a type of Machine Learning that uses artificial neural networks with
multiple layers to learn and make decisions.
Supervised Learning
Applications of Supervised Learning
Here are some common applications of supervised learning:
Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
Medical diagnosis: Detect diseases and other medical conditions.
Fraud detection: Identify fraudulent transactions.
Autonomous vehicles: Recognize and respond to objects in the environment.
Email spam detection: Classify emails as spam or not spam.
Quality control in manufacturing: Inspect products for defects.
Credit scoring: Assess the risk of a borrower defaulting on a loan.
Gaming: Recognize characters, analyze player behavior, and create NPCs.
Customer support: Automate customer support tasks.
Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
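As one illustration of predictive analytics, the sketch below trains a supervised regression model with scikit-learn on its built-in diabetes dataset, standing in for real business data (the dataset choice is illustrative, not from the source):
Python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Labeled data: features X with a known numeric target y (disease progression)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)   # learn from labeled examples
print("MSE on unseen data:", mean_squared_error(y_test, reg.predict(X_test)))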
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs.
The primary goal of Unsupervised learning is often to discover hidden patterns, similarities,
or clusters within the data, which can then be used for various purposes, such as data
exploration, visualization, dimensionality reduction, and more.
Unsupervised Learning
It has techniques such as autoencoders and dimensionality reduction that can
be used to extract meaningful features from raw data.
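For instance, a minimal unsupervised sketch with scikit-learn, applying clustering and dimensionality reduction to the Iris features while ignoring the labels (the dataset and the choice of three clusters are illustrative assumptions):
Python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # labels are ignored: treated as unlabeled data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # clustering
print("Cluster assignments:", kmeans.labels_[:10])

pca = PCA(n_components=2)                  # dimensionality reduction
X_2d = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)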
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
Clustering: Group similar data points into clusters.
Anomaly detection: Identify outliers or anomalies in data.
Dimensionality reduction: Reduce the dimensionality of data while preserving
its essential information.
Recommendation systems: Suggest products, movies, or content to users based
on their historical behavior or preferences.
Topic modeling: Discover latent topics within a collection of documents.
Density estimation: Estimate the probability density function of data.
Image and video compression: Reduce the amount of storage required for
multimedia content.
Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
Market basket analysis: Discover associations between products.
Genomic data analysis: Identify patterns or group genes with similar
expression profiles.
Image segmentation: Segment images into meaningful regions.
Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
Customer behavior analysis: Uncover patterns and insights for better
marketing and product recommendations.
Content recommendation: Classify and tag content to make it easier to
recommend similar items to users.
Exploratory data analysis (EDA): Explore data and gain insights before
defining specific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and
unsupervised learning, so it uses both labelled and unlabelled data. It's particularly useful when
obtaining labeled data is costly, time-consuming, or resource-intensive. Semi-supervised
learning is chosen when labeling the data requires skills and relevant resources in order to
train or learn from it.
We use these techniques when we are dealing with data in which only a small portion is
labeled and the large remaining portion is unlabeled. We can use unsupervised techniques to
predict labels and then feed these labels to supervised techniques. This technique is mostly
applicable to image datasets, where usually not all images are labeled.
Semi-Supervised Learning
Co-training: This approach trains two different machine learning models on
different subsets of the unlabeled data. The two models are then used to label
each other’s predictions.
Self-training: This approach trains a machine learning model on the labeled
data and then uses the model to predict labels for the unlabeled data. The model
is then retrained on the labeled data and the predicted labels for the unlabeled
data.
Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to
generate unlabeled data for semi-supervised learning by training two neural
networks, a generator and a discriminator.
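Of the approaches above, self-training is the simplest to sketch. A minimal example using scikit-learn's SelfTrainingClassifier, where unlabeled points are marked with -1 (the 70% unlabeled split is an arbitrary illustrative choice):
Python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)

# Pretend most labels are unknown: unlabeled points are marked with -1
y_partial = y.copy()
unlabeled = rng.rand(len(y)) < 0.7
y_partial[unlabeled] = -1

base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base).fit(X, y_partial)   # labels itself iteratively
print("Accuracy on the fully labeled set:", model.score(X, y))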
Advantages of Semi-Supervised Machine Learning
It leads to better generalization compared to supervised learning, as it uses both
labeled and unlabeled data.
It can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
Semi-supervised methods can be more complex to implement compared to
other approaches.
It still requires some labeled data, which might not always be available or easy to
obtain.
Unrepresentative or noisy unlabeled data can hurt the model's performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
Image Classification and Object Recognition: Improve the accuracy of models
by combining a small set of labeled images with a larger set of unlabeled
images.
Natural Language Processing (NLP): Enhance the performance of language
models and classifiers by combining a small set of labeled text data with a vast
amount of unlabeled text.
Speech Recognition: Improve the accuracy of speech recognition by leveraging
a limited amount of transcribed speech data and a more extensive set of
unlabeled audio.
Recommendation Systems: Improve the accuracy of personalized
recommendations by supplementing a sparse set of user-item interactions
(labeled data) with a wealth of unlabeled user behavior data.
Healthcare and Medical Imaging: Enhance medical image analysis by utilizing
a small set of labeled medical images alongside a larger set of unlabeled images.
4. Reinforcement Machine Learning
Reinforcement learning is a learning method in which an agent interacts with the
environment by producing actions and discovering errors. Trial and error and delayed reward
are the most relevant characteristics of reinforcement learning. In this technique, the model
keeps improving its performance using reward feedback to learn the behavior or pattern.
These algorithms are specific to a particular problem, e.g. Google's self-driving car, or
AlphaGo, where a bot competes with humans and even itself to become a better and better
Go player. Each time we feed in data, the agent learns and adds the data to its knowledge,
which becomes its training data. So, the more it learns, the better trained, and hence more
experienced, it becomes.
Here are some of the most common reinforcement learning algorithms:
Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function,
which maps states to actions. The Q-function estimates the expected reward of
taking a particular action in a given state.
SARSA (State-Action-Reward-State-Action): SARSA is another model-free
RL algorithm that learns a Q-function. However, unlike Q-learning, SARSA
updates the Q-function for the action that was actually taken, rather than the
optimal action.
Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep
learning. Deep Q-learning uses a neural network to represent the Q-function,
which allows it to learn complex relationships between states and actions.
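A tiny tabular Q-learning sketch on a made-up five-state corridor environment (the environment, the reward scheme, and the hyperparameters are illustrative assumptions, not from the source):
Python
import numpy as np

# Toy environment: 5 states in a row; action 0 = move left, 1 = move right.
# The agent receives a reward of 1 only when it reaches the last state.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration rate

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy action selection (trial and error)
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move Q towards reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)   # the learned Q-values favor moving right in every state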
Reinforcement Machine Learning
Top 20 Machine Learning Tools
Machine learning has witnessed exponential growth in tools and frameworks designed to help
data scientists and engineers efficiently build and deploy ML models. Below is a detailed
overview of some of the top machine learning tools, highlighting their key features.
1. Microsoft Azure Machine Learning
Microsoft Azure is a cloud-based environment you can use to train, deploy, automate, manage,
and track ML models. It is designed to help data scientists and ML engineers leverage their
existing data processing and model development skills & frameworks.
Key Features
2. IBM Watson
Key Features
4. Amazon Machine Learning
Amazon Machine Learning is a cloud service that makes it easy for professionals of all skill
levels to use machine learning technology. It provides visualization tools and wizards to create
machine learning models without learning complex ML algorithms and technology.
Key Features
Integration with Amazon S3, Redshift, and RDS for data storage.
5. OpenNN
Key Features
6. PyTorch
PyTorch, a machine learning framework that's open-source and built upon the Torch library,
supports a wide range of applications, including computer vision and natural language
processing. It's celebrated for its adaptability and its capacity to dynamically manage
computational graphs.
Key Features
7. Vertex AI
Vertex AI is Google Cloud's AI platform. It consolidates its ML offerings into a unified API,
client library, and user interface. This enables ML engineers and data scientists to accelerate
the development and maintenance of artificial intelligence models.
Key Features
Unified tooling and workflow for model training, hosting, and deployment.
8. BigML
BigML is a machine learning platform that helps users create, deploy, and maintain machine
learning models. It offers a comprehensive environment for preprocessing, machine learning,
and model evaluation tasks.
Key Features
9. Apache Mahout
Apache Mahout serves as a scalable linear algebra framework and offers a mathematically
expressive Scala-based domain-specific language (DSL). This design aims to facilitate the
rapid development of custom algorithms by mathematicians, statisticians, and data scientists.
Its primary areas of application include filtering, clustering, and classification, streamlining
these processes for professionals in the field.
Key Features
10. Weka
Weka is an open-source software suite written in Java, designed for data mining tasks. It
includes a variety of machine learning algorithms geared towards tasks such as data pre-
processing, classification, regression, clustering, discovering association rules, and data
visualization.
Key Features
11. Scikit-learn
Key Features
12. Google Cloud AutoML
Google Cloud AutoML offers a collection of machine learning tools designed to help
developers with minimal ML knowledge create tailored, high-quality models for their unique
business requirements. It leverages Google's advanced transfer learning and neural architecture
search technologies.
Key Features
Integration with Google Cloud services for seamless deployment and scalability.
13. Colab
Colab, or Google Colaboratory, is a free cloud service based on Jupyter Notebooks that
supports Python. It is designed to facilitate ML education and research with no setup required.
Colab provides an easy way to write and execute arbitrary Python code through the browser.
Key Features
Integration with Google Drive for easy storage and access to notebooks.
14. KNIME
KNIME is an open-source data analytics, reporting, and integration platform allowing users to
create data flows visually, selectively execute some or all analysis steps, and inspect the results,
models, and interactive views.
Key Features
Wide range of nodes for data integration, transformation, analysis, and visualization.
15. Keras
Key Features
16. RapidMiner
RapidMiner serves as a comprehensive data science tool, offering a cohesive platform for tasks
like data prep, machine learning, deep learning, text mining, and predictive analytics. It caters
to users of varying expertise, accommodating both novices and seasoned professionals.
Key Features
17. Shogun
Shogun is a freely available machine learning library that encompasses a wide range of efficient
and cohesive techniques. Developed in C++, it features interfaces for several programming
languages, including C++, Python, R, Java, Ruby, Lua, and Octave.
Key Features
18. Project Jupyter
Project Jupyter is a free, open-source initiative designed to enhance interactive data science
and scientific computing across various programming languages. Originating from the IPython
project, it offers a comprehensive framework for interactive computing, including notebooks,
code, and data management.
Key Features
19. Amazon SageMaker
Amazon SageMaker empowers all developers and data scientists to create, train, and deploy
ML models with ease. It simplifies and streamlines every stage of the machine learning
workflow. Discover how to efficiently use Amazon SageMaker to develop, train, optimize, and
deploy machine learning models.
Key Features
20. Apache Spark
Apache Spark serves as an integrated analytics engine designed to process data on a large scale.
It offers advanced APIs for Java, Scala, Python, and R, alongside an efficient engine that backs
versatile computation graphs for data analysis. Engineered for rapid processing, Spark enables
in-memory computation and supports a range of machine learning algorithms through its MLlib
library.
Key Features
Machine learning tools and techniques are indispensable in the modern era for several
compelling reasons:
1. Data Analysis and Interpretation: With the explosion of data in recent years, ML
tools are critical for analyzing and interpreting vast amounts of data quickly and
efficiently, uncovering patterns and insights that would be impossible for humans to
find.
3. Personalization: ML tools are at the heart of personalization technologies used in e-
commerce, content platforms, and marketing. They provide tailored experiences to
users based on their behaviors and preferences.
5. Solving Complex Problems: ML tools have the potential to solve complex problems
in diverse domains, including healthcare, finance, environmental protection, and
more, by finding solutions that are not apparent through traditional methods.
Installation of Python
The process of installing Python on the Windows operating system is relatively easy and
involves a few uncomplicated steps. This section aims to take you through the process of
downloading and installing Python on your Windows computer.
How to Install Python in Windows?
We have provided step-by-step instructions to guide you and ensure a successful installation.
Whether you are new to programming or have some experience, mastering how to install
Python on Windows will enable you to utilize this potent language and uncover its full range
of potential applications.
To download Python on your system, you can use the following steps
Step 1: Select Version to Install Python
Visit the official page for Python https://www.python.org/downloads/ on the Windows
operating system. Locate a reliable version of Python 3, preferably version 3.10.11, which
was used in testing this tutorial. Choose the correct link for your device from the options
provided: either Windows installer (64-bit) or Windows installer (32-bit) and proceed to
download the executable file.
Python Homepage
Python Installer
After clicking the Install Now button, the setup will start installing Python on your
Windows system. You will see a window like this.
Python Setup
Step 3: Running the Executable Installer
After completing the setup, Python will be installed on your Windows system and you will
see a success message.
Python version
You can also check the version of Python by opening the IDLE application. Go to Start,
enter IDLE in the search bar, and then click the IDLE app, for example, IDLE (Python
3.10.11 64-bit). If you can see the Python IDLE window, then you have successfully
downloaded and installed Python on Windows.
Python IDLE
Installation of Tools
In a terminal, run:
"python.linting.enabled": true,
"python.linting.flake8Enabled": true,
"python.linting.flake8Path": "flake8",
"python.formatting.provider": "black",
"python.formatting.blackPath": "whataformatter",
"python.formatting.blackArgs": [],
Environment Testing
Think of how you might test the lights on a car. You would turn on the lights (known as the test
step) and go outside the car or ask a friend to check that the lights are on (known as the test
assertion). Testing multiple components is known as integration testing.
You have just seen two types of tests:
1. An integration test checks that components in your application operate with each
other.
2. A unit test checks a small component in your application.
You can write both integration tests and unit tests in Python. To write a unit test for the built-
in function sum(), you would check the output of sum() against a known output.
For example, here’s how you check that the sum() of the numbers (1, 2, 3) equals 6:
Python
>>> assert sum([1, 2, 3]) == 6, "Should be 6"
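The same check can also be written with the standard-library unittest module; a minimal sketch:
Python
import unittest

class TestSum(unittest.TestCase):
    def test_sum_of_list(self):
        # the test step and the test assertion in one unit test
        self.assertEqual(sum([1, 2, 3]), 6, "Should be 6")

if __name__ == "__main__":
    unittest.main()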
Types of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed
'structured' data.
However, nowadays we are foreseeing issues when the size of such data grows to a huge
extent, with typical sizes in the range of multiple zettabytes.
Do you know? 10²¹ bytes equal 1 zettabyte, i.e. one billion terabytes form a zettabyte.
Looking at these figures, one can easily understand why the name Big Data is given and
imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of
a ‘structured’ data.
Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its
huge size, unstructured data poses multiple challenges in terms of processing it to derive
value from it. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc.
Example of Unstructured Data
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, e.g., a table definition in a
relational DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Characteristics Of Big Data
Volume: The name Big Data itself is related to a size which is enormous. ‘Volume’ is
one characteristic which needs to be considered while dealing with Big Data
solutions.
Variety: Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. are also being considered in the analysis
applications. This variety of unstructured data poses certain issues for storage, mining
and analyzing data.
Velocity: The term ‘velocity’ refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in the
data. Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites,
sensors, Mobile devices, etc. The flow of data is massive and continuous.
Variability: This refers to the inconsistency which can be shown by the data at times,
thus hampering the process of being able to handle and manage the data effectively.
Before we dive into such topics as ML and Data Science, and try to explain how it works, we
should answer several questions:
What can we achieve in business or on the project with the help of ML? What goals
do I want to accomplish using ML?
Do I only want to hop on the trend, or will the use of ML really improve user
experience, increase profitability, or protect my product and its users?
Clearly outline the objectives of the data collection process and the specific research
questions you want to answer. This step will guide the entire process and ensure you collect
the right data to meet your goals.
Also, it is recommended to identify data sources. Determine the sources from which you
will collect data. These sources may include primary data (collected directly for your study)
or secondary data (previously collected by others). Common data sources include surveys,
interviews, existing databases, observation, experiments, and online platforms.
In this stage, it is better to start with the selection of data collection methods. Choose the
appropriate methods to collect data from the identified sources. The methods may vary
depending on the nature of the data and research objectives.
The next step is very crucial. Ensuring data quality means reviewing the collected data to
check for errors, inconsistencies, or missing values. Apply quality assurance techniques to
ensure the data is reliable and suitable for analysis.
The following step would be data storage and management. It will require organizing and
storing the collected data in a secure and accessible manner. Consider using databases or
other data management systems for efficient storage and retrieval.
1. Synthetic Data Generation
Synthetic data is any information manufactured artificially which does not represent events or
objects in the real world. Algorithms create synthetic data used in model datasets for testing
or training purposes. This data can mimic operational or production data and help train ML
models or test out mathematical models.
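For example, scikit-learn can generate a labeled synthetic dataset in one call; a minimal sketch (the sizes chosen are arbitrary):
Python
from sklearn.datasets import make_classification

# Generate an artificial, labeled dataset that represents no real individuals
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=42)
print(X.shape, y.shape)   # ready to be used as training or testing data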
2. Active Learning
Active learning is a machine learning technique that focuses on selecting the most
informative data points to label or annotate from an unlabeled dataset. The aim of active
learning is to reduce the amount of labeled data required to build an accurate model by
strategically choosing which instances to query for labels. This is especially useful when
labeling data can be time-consuming or expensive.
Steps:
Initial Data Collection: Initially, a small labeled dataset is collected through random sampling or any other standard method.
Model Training: The initial labeled data is used to train a machine learning model.
Uncertainty Estimation: The model is then used to predict the labels of the remaining unlabeled data points. During this process, the model’s uncertainty about its predictions is often estimated. There are various ways to measure uncertainty, such as entropy, margin sampling, and least confidence.
Query Strategy Selection: A query strategy is chosen to decide which data points to request labels for. The query strategy selects instances with high uncertainty, as these instances are likely to have the most impact on improving the model’s performance. There are a few methods to apply query strategy selection: uncertainty sampling, diversity sampling, and representative sampling.
Labeling New Data Points: The selected instances are then sent for labeling or annotation by domain experts.
Model Update: The newly labeled data is added to the labeled dataset, and the model is retrained using the expanded labeled set.
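A minimal uncertainty-sampling sketch of these steps, using scikit-learn and the least-confidence measure (the digits dataset, the 50-sample labeled pool, and the batch size of 10 are illustrative choices):
Python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

# Initial data collection: a small labeled pool and a large unlabeled pool
labeled_idx = np.arange(50)
unlabeled_idx = np.arange(50, len(X))

# Model training on the small labeled set
model = LogisticRegression(max_iter=2000).fit(X[labeled_idx], y[labeled_idx])

# Uncertainty estimation (least confidence) on the unlabeled pool
probs = model.predict_proba(X[unlabeled_idx])
uncertainty = 1 - probs.max(axis=1)

# Query strategy selection: send the 10 most uncertain points to an expert for labeling
query = unlabeled_idx[np.argsort(uncertainty)[-10:]]
print("Indices to send for labeling:", query)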
3. Transfer Learning
It is a popular approach for training models when there is not enough training data or time to
train from scratch. A common technique is to start from an existing model that is already well
trained on a related problem (the source task) and incrementally train a new model (the target
task) from it, so that it performs well with little additional data.
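A minimal transfer-learning sketch with TensorFlow/Keras, freezing an ImageNet-pretrained network as the source task and adding a small head for a hypothetical two-class target task (TensorFlow must be installed, and downloading the pretrained weights requires an internet connection):
Python
import tensorflow as tf

# Source task: a network already trained on ImageNet (weights downloaded on first use)
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(160, 160, 3), pooling="avg")
base.trainable = False                       # freeze the well-trained layers

# Target task: add and train only a small new classification head
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax")   # hypothetical 2-class target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)   # train on the (small) target dataset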
4. Open Source Datasets
Open source datasets are a valuable resource in data collection for various research and
analysis purposes. These datasets are typically publicly available and can be freely accessed,
used, and redistributed. Leveraging open source datasets can save time, resources, and effort,
as they often come pre-cleaned and curated, ready for analysis.
…normalizing values, encoding categorical variables, and other data transformations.
Ethical Considerations: Ensure that the data you are using does not contain sensitive or private information that could potentially harm individuals or organizations. Respect data privacy and consider anonymizing or de-identifying data if necessary.
Validation and Quality Control: Just like with any data, it’s crucial to validate the open source dataset for accuracy and quality. Cross-referencing the data with other sources or performing sanity checks can help ensure the dataset’s reliability.
5. Manual Data Generation
Manual data generation refers to the process of collecting data by hand, without the use of
automated tools or systems. Manual data generation can be time-consuming and resource-
intensive, but it can yield valuable and reliable data when performed carefully.
6. Building Synthetic Datasets
Building synthetic datasets is one of the most common methods in data collection when real
data is limited or unavailable, or when privacy concerns prevent the use of actual data.
Synthetic datasets are artificially generated datasets that mimic the statistical properties and
patterns of real data without containing any sensitive or identifiable information.
Define the Problem and Objectives: Clearly identify the purpose of the synthetic
dataset. Determine what specific features, relationships, and patterns you want the
synthetic data to capture. Understand the target domain and data characteristics to
ensure the synthetic dataset is meaningful and useful.
Understand the Real Data: If possible, analyze and understand the real data you
want to emulate. Identify the key statistical properties, distributions, and relationships
within the data. This will help inform the design of the synthetic dataset.
Choose a Data Generation Method: Several methods can be used to create synthetic
datasets (Statistical Methods, Generative Models, Data Augmentation, Simulations ).
Choose the Right Features: Identify the essential features from the real data that
need to be included in the synthetic dataset. Avoid including personally identifiable
information (PII) or any sensitive data that might compromise privacy.
Generate the Synthetic Data: Implement the chosen data generation method to
create the synthetic dataset. Ensure that the dataset follows the same format and data
types as the real data to be used seamlessly in analyses and modeling.
Validate and Evaluate: Assess the quality and accuracy of the synthetic dataset by
comparing it to the real data. Use metrics and visualizations to validate that the
synthetic data adequately captures the patterns and distributions present in the real
data.
Modify and Iterate: If the initial synthetic dataset does not meet your expectations,
refine the data generation method or adjust parameters until it better aligns with the
desired objectives.
Ensure Privacy and Ethics: Always prioritize privacy and ethical considerations
when generating synthetic datasets. Ensure that no individual or sensitive information
can be inferred from the synthetic data.
By following these steps, you can create synthetic datasets that can serve as valuable
substitutes for real data in various scenarios, contributing to better model development and
analysis in data-scarce or privacy-sensitive environments.
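As a simple illustration of the statistical approach, the sketch below fits a multivariate normal distribution to a tiny made-up "real" table and samples synthetic rows from it (the columns and values are invented for the example):
Python
import numpy as np
import pandas as pd

# Hypothetical real data: two numeric columns we want to emulate
real = pd.DataFrame({"age": [23, 35, 45, 52, 31, 40],
                     "income": [30_000, 52_000, 61_000, 75_000, 48_000, 58_000]})

# Statistical method: fit a multivariate normal to the real data's mean and covariance
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()
rng = np.random.default_rng(0)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1000),
                         columns=real.columns)

# Validate and evaluate: compare summary statistics with the real data
print(real.describe())
print(synthetic.describe())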
7. Federated Learning
TensorFlow
TensorFlow, an open-source machine learning tool, is renowned for its flexibility, ideal
for crafting diverse models, and simple to use. With abundant resources and user-friendly
interfaces, it simplifies data comprehension.
PyTorch
Scikit-learn
Keras
Keras helps easily create models, great for quick experiments, especially with images or
words. It’s user-friendly, making it simple to try out ideas, whether you’re working on
recognizing images or understanding language.
XGBoost
Apache Spark MLlib
Apache Spark MLlib is a powerful tool designed for handling massive datasets, making it
ideal for large-scale projects with extensive data. It simplifies complex data analysis tasks
by providing a robust machine-learning framework. When you’re dealing with substantial
amounts of information, Spark MLlib offers scalability and efficiency, making it a valuable
resource for projects requiring the processing of extensive data sets.
Microsoft Azure Machine Learning
Microsoft Azure Machine Learning makes it easy to do machine learning in the cloud.
It’s simple, user-friendly, and works well for many different projects, making machine
learning accessible and efficient in the cloud.
Google Cloud AI Platform
Google Cloud AI Platform is a strong tool for using machine learning on Google Cloud.
Great for big projects, it easily works with other Google tools. It provides detailed stats
and simple functions, making it a powerful option for large machine-learning tasks.
H2O.ai
H2O.ai is a platform that makes machine learning easy to use. It suits many kinds of jobs and
has an active, helpful community. With its straightforward interface and strong community
support, H2O.ai makes it easy to apply machine learning effectively.
RapidMiner
RapidMiner is an all-round tool covering the entire machine learning workflow, ideal for
exploring ideas and collaborating on large projects. It makes it easy to try out ideas and
enables seamless teamwork, making it a versatile tool for the diverse stages of machine
learning development.
Installation of Python
Steps to Configure Python Path on Windows:
When running the Python installer, ensure you check the box that says
“Add Python to PATH.”
Open the Start menu, search for “Environment Variables,” and select
“Edit the system environment variables.”
Under “System variables,” find the Path variable and click “Edit.”
Click “New” and add the path to the Python installation directory
(e.g., C:\Python39) and the Scripts directory
(e.g., C:\Python39\Scripts).
Type python and press Enter to start the Python interpreter. If Python
starts, the PATH is configured correctly.
Type pip and press Enter to verify that pip can be called from the
command line.
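Once the interpreter starts, a quick way to confirm which installation the PATH resolved to (a minimal check):
Python
import sys
print(sys.version)      # the Python version that the PATH resolves to
print(sys.executable)   # full path of the interpreter that is actually running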
Installation of Tools
Anaconda distribution provides installation of Python with various IDEs such
as Jupyter Notebook, Spyder, the Anaconda prompt, etc. Hence it is a very
convenient packaged solution which you can easily download and install on
your computer. It will automatically install Python and some basic IDEs and
libraries with it.
Below, some steps are given to show the downloading and installing process of
Anaconda and the IDE:
o After clicking on the first link, you will reach the download page of
Anaconda, as shown in the image below:
o Since Anaconda is available for Windows, Linux, and Mac OS, you can
download it as per your OS type by clicking on the available options shown
in the image below. It provides Python 2.7 and Python 3.7 versions; since
the latest of these is 3.7, we will download the Python 3.7 version. After
clicking on the download option, it will start downloading on your computer.
Note: In this topic, we are downloading Anaconda for Windows; you can
choose it as per your OS.
Step- 2: Install Anaconda Python (Python 3.7 version):
o It will open a License Agreement window; click on the "I Agree" option and
move on.
o In the next window, you will get two options for installations as given in
the below image. Select the first option (Just me) and click on Next.
o Now you will get a window for installing location, here, you can leave it as
default or change it by browsing a location, and then click on Next.
Consider the below image:
o Now select the second option, and click on install.
o Once the installation is completed, tick the checkboxes if you want to learn more
about Anaconda and Anaconda Cloud, then click on Finish to end the process.
Note: Here, we will use the Spyder IDE to run Python programs.
Step- 3: Open Anaconda Navigator
Run your Python program in Spyder IDE.
o Write your first program, and save it using the .py extension.
o Run the program using the triangle Run button.
o You can check the program's output on the console pane at the bottom-right
side.
Step- 4: Close the Spyder IDE.
Machine learning data refers to the collection of information (or dataset) that is
used by machine learning models to learn, train, and make predictions. It can be
structured or unstructured and usually consists of features (input variables) and
labels (output variables or target values). The quality and nature of this data
directly impact the performance and accuracy of machine learning models.
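As a minimal illustration of features and labels, the sketch below builds a small, made-up housing table with pandas and splits it into inputs X and a target y; the column names and values are invented for this example.
import pandas as pd
# Hypothetical dataset: three features (inputs) and one label (target)
df = pd.DataFrame({
    "square_feet": [850, 1200, 1500],
    "bedrooms": [2, 3, 3],
    "distance_km": [5.0, 2.5, 8.0],
    "price": [120000, 180000, 210000],
})
X = df.drop(columns=["price"])   # features (input variables)
y = df["price"]                  # label (output variable / target value)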
Big data for machine learning refers to extremely large, complex, and diverse
datasets that are generated at high velocity and volume. These datasets require
advanced processing techniques and technologies to extract useful insights and
are used in machine learning (ML) to improve model performance, accuracy, and
scalability. Machine learning on big data enables models to learn from vast and
varied data sources, resulting in more accurate predictions and better decision-
making capabilities.
Common sources of big data include:
IoT sensors: IoT sensors enable data collection and transmission from physical
objects to digital systems.
Computers
Smartphones
Social data
Transactional data
In recent years, Big Data was defined by the "3Vs", but now there are "6Vs" of Big
Data, which are also termed the characteristics of Big Data, as follows:
1. Volume:
The size of data plays a crucial role in determining its value. If the volume of data
is very large, it is considered Big Data; in other words, whether a particular dataset
can actually be treated as Big Data depends largely on its volume.
Example: In 2016, global mobile traffic was estimated at 6.2 exabytes
(6.2 billion GB) per month, and by 2020 the world was expected to hold almost
40,000 exabytes of data.
2. Velocity:
In Big Data, velocity refers to how data flows in from sources such as machines,
networks, social media, and mobile phones.
There is a massive and continuous flow of data, and velocity determines how fast
the data is generated and processed to meet demand.
Sampling the data can help in dealing with velocity issues.
Example: More than 3.5 billion searches are made on Google every day, and the
number of Facebook users grows by roughly 22% year over year.
3. Variety:
Variety refers to the different forms the data can take: structured, semi-structured,
and unstructured data coming from heterogeneous sources such as text, images, audio,
video, and log files.
4. Veracity:
Veracity refers to the quality, accuracy, and trustworthiness of the data. Inconsistent,
duplicated, or incomplete data undermines the analysis.
Example: Data in bulk could create confusion, whereas too little data could
convey only half or incomplete information.
5. Value:
After taking the previous V's into account, there comes one more V, which stands for
Value. Bulk data that has no value is of no good to the company unless you
turn it into something useful.
6. Variability:
How fast and to what extent is the structure of your data changing? How often does
the meaning or shape of your data change?
Example: it is as if you were eating the same ice cream every day but the taste kept changing.
Structured Data
Examples:
Spreadsheets (Excel, CSV files) with clearly defined columns like "Age,"
"Salary," "Gender."
Databases with fields like product ID, price, and customer name.
Characteristics:
Organized in a predefined schema of rows and columns, which makes it easy to store,
search, and analyze.
Applications:
Commonly used for tabular prediction tasks such as classification and regression.
Unstructured Data
Examples:
Free-form text (emails, social media posts), images, audio, and video files.
Characteristics:
Has no predefined schema and requires more preprocessing (e.g., NLP techniques for
text data, image processing for images).
Applications:
Used in tasks such as sentiment analysis, image recognition, and speech processing.
Semi-Structured Data
Examples:
JSON and XML files, log files, and NoSQL document stores.
Characteristics:
More flexible than structured data, with some level of organization (hierarchical
or nested).
Applications:
Commonly used in web data scraping and APIs, which return data in formats like
JSON or XML.
Where can you “borrow” a dataset? There are a couple of data sources you could try,
and many other excellent resources where you can find datasets from versatile areas,
ranging from apartment prices in Manhattan over the last 10 years to descriptions of
space objects.
Data visualization is the process of using visual elements like charts, graphs, or maps to
represent data. It translates complex, high-volume, or numerical data into a visual
representation that is easier to process.
Data visualization tools improve and automate the visual communication process for
accuracy and detail. You can use the visual representations to extract actionable insights from
raw data.
Matplotlib:
Description: Matplotlib is a popular Python library used for creating static, interactive, and
animated visualizations.
Key Features:
Supports 2D plotting, including line charts, scatter plots, histograms, bar charts, and
more.
Highly customizable with fine control over every aspect of the figure (axes, labels,
colors).
Integrates well with other Python libraries like Pandas and NumPy.
Use Cases: Scientific visualizations, exploratory data analysis, and research reporting.
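For illustration, here is a minimal Matplotlib sketch that draws a line chart and a scatter plot on synthetic data; the data itself is made up for this example.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 50)
y = np.sin(x)
plt.plot(x, y, label="line chart")              # 2D line plot
plt.scatter(x, y + 0.2, s=12, label="scatter")  # scatter plot of shifted points
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.title("Minimal Matplotlib example")
plt.show()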
Seaborn
Description: Seaborn is a Python visualization library built on top of Matplotlib that
provides a high-level interface for drawing attractive statistical graphics.
Key Features:
Simplifies the creation of complex visualizations such as heatmaps, pair plots, and
violin plots.
Use Cases: Statistical data analysis, exploratory data visualization, and academic research.
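A small example of how Seaborn simplifies statistical plots: the sketch below draws a violin plot for two made-up groups (the group names and values are invented for illustration).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
# Two hypothetical groups with different centers and spreads
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 100),
    "value": np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1.5, 100)]),
})
sns.violinplot(data=df, x="group", y="value")   # distribution of each group
plt.show()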
Plotly
Description: Plotly is a web-based and Python-based visualization library that allows users to
create interactive and shareable visualizations.
Key Features:
Supports a wide range of chart types including 3D charts, geographic maps, and time
series plots.
Allows for interactive elements like hover, zoom, and click for better data exploration.
Use Cases: Business intelligence, interactive dashboards, and web-based data visualizations.
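As a sketch of Plotly's interactive charts, the example below uses Plotly Express with its bundled iris sample dataset; hovering, zooming, and clicking work in the rendered figure.
import plotly.express as px
df = px.data.iris()   # sample dataset shipped with Plotly Express
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species", hover_data=["petal_length"])
fig.show()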
Tableau
Description: Tableau is one of the most popular and powerful data visualization tools. It
allows users to create complex, interactive, and shareable dashboards.
Key Features:
Supports a wide range of data sources including Excel, SQL databases, cloud services,
and APIs.
Extensive library of visualization types, including maps, scatter plots, heatmaps, and
more.
Use Cases: Business intelligence, data analytics, real-time reporting, and exploratory
analysis.
Power BI:
Description: Power BI is Microsoft's business analytics and visualization tool for
building interactive reports and dashboards.
Key Features:
Integrates tightly with Microsoft products such as Excel and Azure and supports a
wide range of data sources.
Use Cases: Business intelligence, financial reporting, data exploration, and collaborative data
analysis.
1. Line Charts: In a line chart, each data point is represented by a point on the graph,
and these points are connected by a line. We may find patterns and trends in the data
across time by using line charts. Time-series data is frequently displayed using line
charts.
2. Scatter Plots: A quick and efficient method of displaying the relationship between
two variables is to use scatter plots. With one variable plotted on the x-axis and the
other variable drawn on the y-axis, each data point in a scatter plot is represented by a
point on the graph. We may use scatter plots to visualize data to find patterns, clusters,
and outliers.
3. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar
chart, each category is represented by a bar, with the height of the bar indicating the
frequency or proportion of that category in the data. Bar graphs are useful for
comparing several categories and seeing patterns over time.
4. Heat Maps: Heat maps are a type of graphical representation that displays data in a
matrix format. The value of the data point that each matrix cell represents determines
its hue. Heatmaps are often used to visualize the correlation between variables or to
identify patterns in time-series data.
5. Tree Maps: Tree maps are used to display hierarchical data in a compact format and
are useful in showing the relationship between different levels of a hierarchy.
6. Box Plots: Box plots are a graphical representation of the distribution of a set of data.
In a box plot, the median is shown by a line inside the box, while the box itself
depicts the interquartile range (the middle 50% of the data). The whiskers extend
from the box to the highest and lowest values in the data, excluding outliers. Box
plots can help us identify the spread and skewness of the data, as the short sketch
below illustrates.
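Here is a minimal box-plot sketch on three made-up groups, showing the median line, the interquartile box, and the whiskers described above.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
# Three hypothetical groups with different centers
samples = [rng.normal(loc, 1.0, size=200) for loc in (0, 1, 2)]
plt.boxplot(samples)
plt.xticks([1, 2, 3], ["A", "B", "C"])
plt.title("Box plots of three hypothetical groups")
plt.show()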
Others:
Various types of visualizations cater to diverse data sets and analytical goals.
1. Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way to
understand proportions and percentages.
2. Heatmaps: Visualize complex data sets through color-coding, emphasizing variations
and correlations in a matrix.
3. Area Charts: Similar to line charts but with the area under the line filled, these charts
accentuate cumulative data patterns.
4. Bubble Charts: Enhance scatter plots by introducing a third dimension through varying
bubble sizes, revealing additional insights.
5. Treemaps: Efficiently represent hierarchical data structures, breaking down categories
into nested rectangles.
6. Violin Plots: Violin plots combine aspects of box plots and kernel density plots,
providing a detailed representation of the distribution of data.
7. Word Clouds: Word clouds are visual representations of text data where words are sized
based on their frequency.
8. 3D Surface Plots: 3D surface plots visualize three-dimensional data, illustrating how a
response variable changes in relation to two predictor variables.
9. Network Graphs: Network graphs represent relationships between entities using nodes
and edges. They are useful for visualizing connections in complex systems, such as
social networks, transportation networks, or organizational structures.
10. Sankey Diagrams: Sankey diagrams visualize flow and quantity relationships between
multiple entities. Often used in process engineering or energy flow analysis.
Effective data visualization is crucial for conveying insights accurately. Follow these
best practices to create compelling and understandable visualizations:
2. Design Clarity and Consistency: Choose appropriate chart types, simplify visual
elements, and maintain a consistent color scheme and legible fonts. This ensures a
clear, cohesive, and easily interpretable visualization.
Trends indicate the general direction in which data is moving over time or across a range of
values. They often suggest a consistent increase or decrease.
Examples:
o Time Series Data: A line graph showing a steady increase in sales over several
months.
o Trend Lines: Fitted lines in scatter plots that show the overall movement of
data points.
Patterns
Patterns are recurring structures or regularities in the data, such as cycles or
seasonal effects, that repeat across time or across groups.
Examples:
o Seasonal Patterns: A graph showing sales peaks during holiday seasons each
year.
Context refers to the environment or situation surrounding the data. It includes details about
the dataset, the problem being solved, and any external factors influencing the analysis.
Components:
Problem Statement: What specific problem are you trying to solve? (e.g., predicting
customer churn, classifying images)
Data Source: Where did the data come from? (e.g., customer databases, sensors,
public datasets)
Data Characteristics: Size, type, and structure of the data (e.g., time series,
categorical, continuous).
Business Goals: What are the objectives of the analysis? (e.g., increase sales, improve
customer satisfaction)
Importance: Understanding the context helps ensure that the analysis is relevant and that the
interpretations of the visualizations align with business needs. For example, visualizing sales
data without knowing the marketing strategies employed can lead to misleading conclusions.
Background
Background refers to prior knowledge and historical information that inform how the
data and its visualizations should be read.
Components:
o Historical Data: Previous trends or patterns observed in the data (e.g., past
sales performance).
Example Scenario
Visualization Result: A line graph shows an upward trend in sales leading up to the
holiday season, with a noticeable spike in late November.
Correlations and Relationships:
Correlation refers to a statistical measure that expresses the extent to which two
variables are linearly related. It indicates how changes in one variable are associated
with changes in another.
Types of Correlation: positive correlation (both variables move in the same direction),
negative correlation (one increases while the other decreases), and no correlation (no
consistent relationship).
Measurement: commonly measured with the Pearson correlation coefficient, a value
between -1 and +1 whose magnitude indicates the strength of the linear relationship.
Visualization: scatter plots and correlation heatmaps are typically used, as in the
short sketch below.
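To make this concrete, the sketch below computes a Pearson correlation matrix with pandas and draws it as a Seaborn heatmap on made-up study-time and exam-score data; all variable names and values are invented for illustration.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
study_time = rng.uniform(0, 10, 100)
exam_score = 50 + 4 * study_time + rng.normal(0, 5, 100)   # strong positive relationship
shoe_size = rng.normal(42, 2, 100)                          # unrelated variable
df = pd.DataFrame({"study_time": study_time, "exam_score": exam_score, "shoe_size": shoe_size})
corr = df.corr()                                            # Pearson correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()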
Relationships
Types of Relationships:
o Linear Relationships: Straight-line relationships where one variable changes
at a constant rate relative to another.
o Causal Relationships: Indicate that one variable directly affects another (e.g.,
increased advertising leads to higher sales). Correlation does not imply
causation.
Measurement: often examined with regression analysis, which models how one variable
changes as another changes.
Visualization:
o Line Charts: Good for showing trends and relationships over time.
Key Differences
Practical Applications
Correlations: Useful in exploratory data analysis to identify potential relationships
worth investigating further. For example, a strong positive correlation between study
time and exam scores might lead educators to implement more study sessions.
Example Scenario
Data Cleaning
Purpose:
Improves Model Performance: Dirty data can lead to inaccurate models, high bias, or
overfitting.
Reduces Bias and Variance: Proper cleaning reduces noise and variability in the data,
improving generalization.
Steps
1. Remove duplicate or irrelevant observations
Handle missing data
1. As a first option, you can drop observations that have missing values, but doing this will drop
or lose information, so be mindful of this before you remove them.
2. As a second option, you can impute missing values based on other observations; again, there is
an opportunity to lose the integrity of the data because you may be operating from assumptions
and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null values.
5. Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as a part
of basic validation:
Does the data follow the appropriate rules for its field?
Does it prove or disprove your working theory, or bring any insight to light?
Can you find trends in the data to help you form your next theory?
1. Validity. The degree to which your data conforms to defined business rules or
constraints.
2. Accuracy. The degree to which your data is close to the true values.
3. Completeness. The degree to which all required data is known and present.
4. Consistency. Ensure your data is consistent within the same dataset and/or across
multiple data sets.
5. Uniformity. The degree to which the data is specified using the same unit of measure.
Having clean data will ultimately increase overall productivity and allow for the highest
quality information in your decision-making. Benefits include:
Ability to map the different functions and what your data is intended to do.
Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
Using tools for data cleaning will make for more efficient business practices and
quicker decision-making.
Missing values are a common problem in datasets and can occur for various reasons, such as
data entry errors, equipment malfunctions, or incomplete surveys. It is crucial to handle
missing values appropriately, as they can lead to biased analysis and inaccurate results. There
are several techniques for handling missing values, two of which we will discuss next.
1. Handling Missing Values
Dropping missing values
In some cases, removing observations that contain missing values may be appropriate. This
technique is called complete case analysis and involves removing any observation that
contains one or more missing values.
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Drop any row that contains one or more missing values (complete case analysis)
df_clean = df.dropna()
Imputing missing values
Simply dropping observations can leave you with far fewer data points in the dataset.
Instead, it may be appropriate to impute missing values, which involves replacing missing
values with estimates based on the available data. There are several techniques for imputing
missing values, including mean imputation, median imputation, and regression imputation.
import pandas as pd
from sklearn.impute import SimpleImputer
# Load data
df = pd.read_csv('data.csv')
# Replace missing values in each column with that column's mean
imputer = SimpleImputer(strategy='mean')
df_clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
In this code snippet, we imputed missing values using mean imputation with the
SimpleImputer class from scikit-learn.
2. Removing Duplicates
Duplicate observations can lead to biased analysis and inaccurate results. Identifying and
removing duplicates is crucial to ensure the data used for analysis is accurate and reliable.
Dropping exact duplicates
This technique involves removing observations that are identical across all variables.
Duplicate records can skew the results of data analysis. When there are exact duplicates in the
dataset, it can lead to inflated counts or misleading statistics, which can impact the accuracy
of the analysis. Additionally, duplicates can cause memory issues and slow the analysis
process, especially when working with large datasets.
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
df_clean = df.drop_duplicates()
Fuzzy matching
Sometimes, duplicates in a dataset may not be exact matches due to variations in data entry or
formatting inconsistencies. These variations can include spelling errors, abbreviations, and
different representations of the same data. Fuzzy matching techniques can be used to identify
and remove these duplicates.
Fuzzy matching algorithms can help to identify and remove these duplicates by comparing
the similarity of the data values rather than exact matches. By using fuzzy matching, analysts
can identify duplicates that may have been missed by traditional exact matching techniques,
leading to more accurate and complete data sets.
import pandas as pd
from fuzzywuzzy import fuzz
# Load data
df = pd.read_csv('data.csv')
# Identify fuzzy duplicates in an assumed 'name' column: flag pairs whose similarity score exceeds a threshold
names = df['name'].tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if fuzz.ratio(names[i], names[j]) > 90:
            print('Possible duplicate:', names[i], '|', names[j])
3. Handling Outliers
Outliers are extreme values that can have a significant impact on statistical analysis.
Identifying and handling outliers appropriately is essential to avoid biased analysis and
inaccurate results. Let us look at three techniques to handle outliers.
Trimming
Trimming is a method for handling outliers that involves removing extreme values from the
dataset. This technique is useful when the outliers are few and significantly different
from the rest of the data. Removing the outliers reduces the impact of their extreme
values, and the remaining data can be analyzed without interference.
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Trimming: keep only rows within 1.5 * IQR of the quartiles
column = 'value'   # placeholder for the name of the numeric column to trim
q1 = df[column].quantile(0.25)
q3 = df[column].quantile(0.75)
iqr = q3 - q1
df_clean = df[(df[column] > q1 - 1.5 * iqr) & (df[column] < q3 + 1.5 * iqr)]
Winsorizing
Winsorizing limits the influence of outliers by replacing extreme values with the nearest
values inside a chosen percentile range, instead of removing them.
import pandas as pd
from scipy.stats.mstats import winsorize
# Load data
df = pd.read_csv('data.csv')
# Winsorize data: cap the lowest and highest 5% of values
column = 'value'   # placeholder for the numeric column to winsorize
df[column] = winsorize(df[column], limits=[0.05, 0.05])
4. Standardizing Data
Data standardization is a technique for transforming variables to have a common scale. When
working with data from different sources or formats, there can be variations in how it is
represented, such as differences in units of measurement, data formats, and data structures,
making it difficult to compare variables or perform statistical analysis. Data standardization is
a crucial step in such cases.
Min-max scaling
Min-max scaling involves rescaling the values of a variable to a range between 0 and 1. This
is done by subtracting the minimum value of the variable from each value and dividing it by
the variable’s range.
This technique is useful for normalizing data with different scales or units, making it easier to
compare and analyze. Min-max scaling preserves the relative relationships between the
values of the variable, but it can be sensitive to outliers and reduce the impact of extreme
values.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Load data
df = pd.read_csv('data.csv')
# Rescale every column to the range [0, 1]
df_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
Z-score normalization
Z-score normalization (standardization) rescales each value by subtracting the mean and
dividing by the standard deviation. It also preserves the relative relationships between
the values of the variable but is less sensitive to outliers than min-max scaling.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load data
df = pd.read_csv('data.csv')
# Transform every column to have mean 0 and standard deviation 1
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
5. Data Validation
Data validation is a process of checking and verifying that the data is accurate, complete, and
consistent. Data validation techniques are used to identify and correct errors, inconsistencies,
and outliers in the data that may affect the quality of the analysis.
Range Checks
Range checks are a data validation technique used to ensure that values fall within a
specified range of acceptable values, for example that an age column only contains values
between 0 and 120. A minimal sketch follows.
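The snippet below is a minimal range-check sketch in pandas; the 'age' column and the 0-120 bounds are assumptions chosen for illustration.
import pandas as pd
# Hypothetical dataset with an 'age' column
df = pd.DataFrame({"age": [25, 41, -3, 130, 58]})
# Flag rows whose age falls outside the plausible 0-120 range
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]
print(out_of_range)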
Logic Checks
Logic checks are a data validation technique used to ensure that the data conforms to a
specified logical structure or relationship. This technique is particularly useful for checking
the relationships between variables and can help identify errors in data entry or processing.
For example, if a dataset contains variables representing a person’s height and weight, a logic
check may be used to ensure that the Body Mass Index is calculated correctly.
def logic_check(data):
    # 'data' is assumed to be a list of (height_m, weight_kg, bmi) records
    for height_m, weight_kg, bmi in data:
        if not is_valid(height_m, weight_kg, bmi):
            print("Invalid record:", height_m, weight_kg, bmi)

def is_valid(height_m, weight_kg, bmi):
    # check that the recorded BMI matches weight / height^2 (within rounding error)
    return abs(bmi - weight_kg / height_m ** 2) < 0.1
Using Coresets
As previously stated, the DataHeroes Coreset library, at its build phase, computes an
“Importance” metric for each data sample which can help fix mislabeling errors, thus
validating the dataset.
For example, let’s say we use the following code snippet to identify samples with very high
“Importance” values, i.e., samples that are out of the ordinary for that class distribution.
self.coreset_service = CoresetTreeServiceLG(optimized_for='cleaning')
self.coreset_service.build(self.X, self.y)
result = self.coreset_service.get_important_samples(
    class_size={self.predicted_classid: self.num_images_to_view}
)
important_sample_indices = result["idx"]
One can reassign these samples to their true class, which is not always obvious or feasible
to identify. In such cases, we may instead want to delete the samples to make the Coreset
more accurate. To achieve this, we use the remove_samples function, as shown in the
following code fragment.
Final Thoughts
Data cleaning is an essential step in the data preprocessing pipeline of machine learning
projects. It involves identifying and correcting errors and removing inconsistencies to ensure
that the data used for analysis is reliable.
Data normalization
Normalization is a specific form of feature scaling that transforms the range of features to a
standard scale. Normalization and, for that matter, any data scaling technique is required only
when your dataset has features of varying ranges. Normalization encompasses diverse
techniques tailored to different data distributions and model requirements.
Normalized data enhances model performance and improves the accuracy of a model. It aids
algorithms that rely on distance metrics, such as k-nearest neighbors or support vector
machines, by preventing features with larger scales from dominating the learning process.
Let’s take a simple example to highlight the importance of normalizing data. Suppose we are
trying to predict housing prices based on various features such as square footage, number of
bedrooms, and distance to the supermarket. These features have very different scales: square
footage may run into the thousands, the number of bedrooms is a single digit, and distance
may be a few kilometers, so the larger-scale feature would otherwise dominate distance-based
models.
Min-Max normalization:
This method of normalizing data involves transforming the original data linearly. The data's
minimum and maximum values are obtained, and each value x is then changed using the
following formula:
x' = (x - min(X)) / (max(X) - min(X))
Dividing by the range scales the variable to a proportion of the entire range, so the
normalized value falls between 0 and 1.
When the feature X is at its minimum, the normalized value (x') is 0, because the
numerator becomes zero. When X is at its maximum, x' is 1, because the numerator equals
the denominator.
For values between the minimum and maximum, x' ranges between 0 and 1,
preserving the relative position of X within the original range.
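A small worked example of the formula above, using made-up numbers:
import numpy as np
x = np.array([10.0, 20.0, 30.0, 50.0])
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)   # [0.   0.25 0.5  1.  ]  -> minimum maps to 0, maximum maps to 1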
Decimal scaling normalization:
The data is normalized by shifting the decimal point of its values. Each data value v is
divided by 10^j, where j is the smallest integer such that the maximum absolute value of the
normalized data is less than 1:
v' = v / 10^j
where v' is the normalized value and v is the original value.
Z-score normalization (standardization):
Using the mean and standard deviation of the data, values are normalized in this technique
to create a standard normal distribution (mean: 0, standard deviation: 1). The equation that is
applied is:
x' = (x - mean(X)) / std(X)
where mean(X) is the mean of the feature and std(X) is its standard deviation.
Normalization vs. Standardization:
Normalization scales the values of a feature to a specific range, often between 0 and 1, and
is applicable when the feature distribution is uncertain.
Standardization scales the features to have a mean of 0 and a standard deviation of 1, and is
effective when the data distribution is Gaussian.
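The following sketch applies both scalers from scikit-learn to the same made-up feature so the difference in output ranges is visible.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
X = np.array([[1.0], [2.0], [3.0], [10.0]])        # one feature, one unusually large value
print(MinMaxScaler().fit_transform(X).ravel())     # squeezed into [0, 1]
print(StandardScaler().fit_transform(X).ravel())   # mean 0, standard deviation 1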
✓ Data transformation
Data transformation: the process of converting and cleaning raw data from one data source
to meet the requirements of its new location. Also called data wrangling, transforming data is
essential to ingestion workflows that feed data warehouses and modern data lakes.
To understand why we apply these transformations, you first need to know about the normal
distribution. The normal distribution, also known as the Gaussian distribution, is a
continuous probability distribution that is widely used in Machine Learning and statistical
modeling. It is a bell-shaped curve that is symmetrical around its mean and is characterized
by its mean and standard deviation.
In the above image, the first graph is right-skewed data which is a positive skew and the last
one is left-skewed data which is a negative skew. The center one is a normal distribution.
In a right-skewed (positively skewed) distribution:
Mean: The mean is usually greater than the median due to the influence of the longer
right tail.
Median: The median is the middle value when the data is ordered. In a right-skewed
distribution, it is less than the mean.
Mode: The mode is the most frequent value. It tends to be on the left side, where the
majority of the data is concentrated.
In a left-skewed (negatively skewed) distribution:
Mean: The mean is usually less than the median due to the influence of the longer left
tail.
Median: The median is still the middle value when the data is ordered. In a left-
skewed distribution, it is greater than the mean.
Mode: The mode is still the most frequent value but now tends to be on the right side,
where the majority of the data is concentrated.
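To check these mean/median relationships numerically, the sketch below generates a made-up right-skewed sample and measures its skewness with SciPy.
import numpy as np
from scipy.stats import skew
rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=2.0, size=1000)   # long right tail
print("skewness:", skew(right_skewed))                  # positive for right skew
print("mean:", right_skewed.mean(), "median:", np.median(right_skewed))  # mean > median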
In simple terms, data transformation means changing and improving how we work with data.
Here are the benefits of data transformation:
3. Better decision making and business intelligence: With clean and integrated data,
organizations can make more informed decisions based on accurate insights, which
improves efficiency and competitiveness.
5. Data privacy: Protect data privacy and comply with data protection regulations by
transforming sensitive data through techniques like anonymization,
pseudonymization, or encryption.
1. Log Transformation
The log transformation is a mathematical operation applied to each data point in a dataset
using the natural logarithm (base ‘e’). The natural logarithm of a number x is denoted as ln(x)
or log_e(x).
The purpose of log transformation is to reduce the impact of extreme values and make the
data more interpretable and suitable for certain types of analyses or modeling.
Mathematical Formula:
y = log_e(x)
In this formula, y is the transformed value and x is the original (positive) data value.
This transformation brings the data closer to a normal distribution, although it does not
make it exactly normal.
This transformation cannot be applied to features that contain zero or negative values.
Simple code:
import numpy as np
import matplotlib.pyplot as plt
# Hypothetical right-skewed data used for illustration
original_data = np.random.exponential(scale=2.0, size=100)
log_transformed_data = np.log(original_data)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(range(len(original_data)), original_data)
plt.title('Original Data')
plt.subplot(1, 2, 2)
plt.scatter(range(len(log_transformed_data)), log_transformed_data)
plt.title('Log-Transformed Data')
plt.show()
2. Reciprocal Transformation
It is another transformation technique that involves converting the non-zero data points (x)
into their reciprocals (1/x).
If a feature contains large values, taking the reciprocal can help scale them down.
In a right-skewed distribution where most values are small, taking the reciprocal can
spread out the small values and bring them closer together.
Mathematical Formula:
y = 1/x
where y is the reciprocal of x. This transformation reverses the order among values of the
same sign: if x is a small positive number, its reciprocal will be a larger value, and vice versa.
Simple Code:
import numpy as np
# Original data (hypothetical values for illustration)
data = np.array([1.0, 2.0, 4.0, 5.0])
# Reciprocal transformation
reciprocal_data = np.reciprocal(data)
# Output
print(reciprocal_data)   # [1.   0.5  0.25 0.2 ]
3. Square Transformation:
This technique involves converting the data points (x) into their squares (x²).
By squaring the values, the transformation spreads out the data and can reduce skewness,
which is mainly useful for left-skewed (negatively skewed) data, potentially achieving a more
symmetrical distribution.
Formula:
y=(x²)
When you apply this technique, squaring a negative value results in a positive value, so
the sign information is lost.
Note also that larger magnitudes grow much faster when squared. Eg: (-5)**2 -> 25,
(50)**2 -> 2500.
Simple Code:
import numpy as np
# Original data (hypothetical values for illustration)
data = np.array([-5, 2, 7, 50])
# Squaring transformation
squared_data = np.square(data)
# Output
print(squared_data)   # [  25    4   49 2500]
4. Square Root Transformation:
This technique involves taking the square root of each value in the data, i.e., applying the
formula sqrt(x).
This transformation is beneficial when dealing with right-skewed data or a few extreme
outliers. Taking the square root can help compress the larger values and make the distribution
more symmetric.
Formula:
y = sqrt(x)
Simple Code:
import numpy as np
# Original data (hypothetical non-negative values for illustration)
data_points = np.array([0.0, 1.0, 4.0, 9.0, 100.0])
transformed_data = np.sqrt(data_points)
# Output
print(transformed_data)   # [ 0.  1.  2.  3. 10.]
2.1. Machine Learning algorithm is properly selected based on the characteristics of the
dataset.
2.2. Machine Learning models are properly trained based on a training set of data.