Unit-1


A8703-Machine Learning

➢Text Book
❖ Amit Kumar Das, Saikat Dutt, Subramanian Chandramouli, Machine
Learning, Pearson India Education Services, 2019.
➢Reference Books
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, 2013.
2. Rudolph Russell, Machine Learning: Step-by-Step Guide to Implement
Machine Learning Algorithms with Python, CreateSpace Independent
Publishing Platform, 2018.

A8703-Machine Learning : Course Outcomes

➢ Identify the various concepts and challenges in machine learning.

➢ Select modelling and evaluation techniques for handling real-time data.

➢ Use supervised and unsupervised learning algorithms for a given problem.

➢ Examine supervised and unsupervised learning algorithms for analysing data.

➢ Identify the various concepts of neural networks to develop AI-based applications.
A8703-Machine Learning : Syllabus
Introduction to Machine Learning: Types of Machine Learning, Problems not to be solved using
Machine Learning, Applications of Machine Learning, Tools in Machine Learning, Issues in
Machine Learning, Machine learning Activities, Basic Types of Data in Machine Learning,
Exploring Structure of data, Data Quality & Remediation, Data Pre-Processing.

Modelling and Evaluation: Introduction, Selecting a Model, Training a Model, Model
Representation and Interpretability, Evaluating Performance of a Model, Improving
Performance of a Model, Feature Subset Selection, Dimensionality Reduction - PCA, SVD, FA,
LDA.

Bayesian Concept Learning: Introduction, Bayes’ Theorem, Naïve Bayes Classifier,
Applications of Naïve Bayes Classifier. Supervised Learning: Classification, Example of
Supervised Learning, Classification Model Learning Steps, Common Classification Algorithms:
KNN, Decision Tree, Random Forest Model, Support Vector Machines. Introduction to
Regression: Example of Regression, Linear Regression, Multiple Linear Regression.

Unsupervised Learning: Introduction, Unsupervised vs Supervised Learning, Applications of
Unsupervised Learning, Clustering: K-Means, K-Medoid, DBSCAN.

Basics of Neural Network: Introduction, Understanding the Biological Neuron, Exploring the
Artificial Neuron, Types of Activation Functions, Early Implementations of ANN,
Architectures of Neural Network.

Introduction to Machine Learning
• Introduction

• In the real world, we are surrounded by humans who can learn from their
experiences using their learning capability.

• We also have computers or machines that work on our instructions. But can a
machine learn from experience or past data as a human does?
• Machine learning plays an important role in making machines learn.
• What is Machine Learning?
Arthur Samuel, a research pioneer in the field of artificial intelligence and
computer gaming, coined the term “Machine Learning”.

✓Computers learning from data is known as machine learning.

• Definition of ML

• Machine learning (ML) is a sub-domain of artificial intelligence (AI) that
focuses on developing systems that learn from the data they receive in order to
improve their performance, make predictions on new data, or make decisions.
• It involves training models on data to identify patterns, relationships, and insights,
which can then be used to perform various tasks and make predictions on new,
unseen data.

• “Machine learning is a method of data analysis that automates analytical model
building. It is a branch of artificial intelligence based on the idea that systems can
learn from data, identify patterns and make decisions with minimal human
intervention”.

• Definition of Machine Learning (Mitchell 1997): “A computer program is said to
learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with
experience E”.
Why Machine Learning?
Programs have been developed that successfully

✓ learn to recognize spoken words (Waibel 1989; Lee 1989),

✓ predict recovery rates of pneumonia patients (Cooper et al. 1997),

✓ detect fraudulent use of credit cards,

✓ drive autonomous vehicles on public highways (Pomerleau 1989),

✓ play games such as backgammon at levels approaching the performance of
human world champions (Tesauro 1992, 1995).
Programs vs. learning algorithms

Traditional Programming refers to any manually created program that uses input
data and runs on a computer to produce the output.

In Machine Learning, the input data and the expected output are fed to an
algorithm, which creates the program (the model). This model can then be used
to predict future outcomes.
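
The contrast between a hand-written program and a learned one can be sketched in a few lines. This is only an illustration: the temperature-conversion task, the sample values, and the helper names `fit_line` and `learned_program` are assumptions made for the example, not part of the course material.

```python
# Traditional programming: a human writes the rule explicitly.
def celsius_to_fahrenheit(c):
    return 1.8 * c + 32.0


# Machine learning: the rule is estimated from (input, output) examples.
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: inputs paired with their known outputs.
inputs = [-10.0, 0.0, 10.0, 25.0, 40.0]
outputs = [celsius_to_fahrenheit(c) for c in inputs]

slope, intercept = fit_line(inputs, outputs)


def learned_program(c):
    # The "program" produced by the learning algorithm from the examples.
    return slope * c + intercept


print(round(slope, 3), round(intercept, 3))
print(round(learned_program(100.0), 1))
```

The learning algorithm never sees the conversion formula. It only sees the example pairs and recovers a rule that agrees with them.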
Components of Learning System

• Advantages of ML

1. Automation: enables automation of tasks that are repetitive or time-consuming.

2. Prediction and Decision Making: analyzes large datasets to make predictions and
decisions with high accuracy.

3. Scalability: can handle large volumes of data and scale efficiently to accommodate
growing datasets.

4. Adaptability: can adapt and learn from new data, allowing models to improve continuously.

5. Personalization: can personalize user experiences by analyzing user behavior and
preferences.

6. Pattern Recognition: identifies patterns, trends, and anomalies in data that may not be
obvious to humans.
• Disadvantages of ML

1. Data Dependency: models rely heavily on the quality and quantity of the training data.

2. Overfitting: models may become too specialized to the training data and fail to generalize well to
unseen data.

3. Interpretability: many models lack interpretability, making it challenging to understand how they
arrive at their predictions.

4. Computational Resources: training complex ML models requires significant computational
resources.

5. Ethical and Privacy Concerns: models may reproduce biases present in their data or infringe on
privacy rights, raising ethical and social concerns.

6. Lack of Domain Knowledge: models may perform poorly in domains where domain-specific
knowledge is essential.
• Limitations:

1. Limited by Data Quality

2. Complexity

3. Interpretability

4. Generalization

5. Scalability

6. Human Expertise
History of Machine learning
• The history of machine learning traces back to the mid-20th century, with
roots in the fields of mathematics, computer science, and artificial intelligence.
Here's a brief overview:

• 1950s - 1960s: Early Foundations

1. Alan Turing (1950): Turing proposed the Turing Test as a measure of
machine intelligence, laying the groundwork for the concept of artificial
intelligence (AI).

2. Arthur Samuel (1959): Samuel developed one of the first self-learning programs, a
checkers-playing program that improved its performance by learning from the
games it played.
• 1970s - 1980s: Symbolic AI Dominance
1. Expert Systems: Symbolic AI, based on rule-based expert systems,
dominated the field. These systems encoded human expertise in the form
of rules to solve specific problems.
2. Neural Networks Research: Neural networks research continued, but
interest declined due to limited computational power and the
dominance of symbolic AI.
• 1990s: Renaissance of Neural Networks
1. Backpropagation: The rediscovery and popularization of the backpropagation
algorithm for training neural networks in the mid-1980s led to renewed interest
in neural networks and machine learning, which carried into the 1990s.
2. Support Vector Machines (SVMs): Vladimir Vapnik and others developed support
vector machines, a powerful machine learning algorithm for classification and
regression tasks.

• 2000s - Present: The Big Data Era

1. Big Data: The explosion of data availability due to the internet, social media, and
digital technologies fueled the development of new machine learning algorithms
and techniques.

2. Deep Learning: Breakthroughs in deep learning, fueled by advances in
computational power and data availability, led to significant improvements in
areas like computer vision, natural language processing, and speech recognition.

3. Reinforcement Learning: Reinforcement learning gained prominence,
particularly in areas like robotics, gaming (e.g., AlphaGo), and
autonomous vehicles.

4. Machine Learning Applications: Machine learning became ubiquitous in
various applications, including recommendation systems, fraud detection,
healthcare diagnostics, autonomous vehicles, and more.

5. Ethical and Social Implications: Attention to the ethical and social
implications of machine learning increased, including concerns about bias,
fairness, privacy, and job displacement.
Evolution of machine learning
What is human learning?
• Human learning is the process through which individuals acquire new
knowledge, skills, attitudes, or behaviors.

• This process can occur in various ways, including through:

1. Experience

2. Observation

3. Instruction

• Human learning involves cognitive processes such as attention, memory,
and problem-solving.

• Cognitive science is an interdisciplinary field that studies the mind and its
processes, including how people think, learn, remember, and perceive.


Types of human learning
• Human learning happens in one of three ways:

• (1) somebody who is an expert in the subject directly teaches us
(learning under expert guidance),

• (2) we build our own notion indirectly, based on what we have learnt from the
expert in the past (learning guided by knowledge gained from experts), or

• (3) we do it ourselves, maybe after multiple attempts (learning by self, or
self-learning).
How do machines learn?
• The basic machine learning process can be divided into three parts.

• 1. Data Input: Past data or information is utilized as a basis for future
decision-making.

• 2. Abstraction: The input data is represented in a broader way through the
underlying algorithm.

• 3. Generalization: The abstracted representation is generalized to form a
framework for making decisions.

• Figure 1 shows a schematic representation of the machine learning process.


Fig: Process of machine learning

• In human learning, memorization and perfect recall alone, i.e. learning based
only on knowledge input, let students do well in examinations only up to a
certain stage.

• A better learning strategy needs to be adopted:

• 1. to be able to deal with the vastness of the subject matter and the
related issues in memorizing it.

• 2. to be able to answer questions where a direct answer has not
been learnt.

• A good option is to figure out the key points or ideas amongst a
vast pool of knowledge.

• In the machine learning paradigm, the vast pool of knowledge
is available from the data input.

• However, rather than using the data in its entirety, a concept of
mapping is applied.

• Mapping the data to known characteristics is the knowledge
abstraction performed by the machine.

• Finally, this abstracted mapping from the input data can be
applied to draw critical conclusions.

• This is generalization in the context of machine learning.
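
The three parts (data input, abstraction, generalization) can be illustrated with a very small sketch. It assumes a nearest-centroid scheme, where each class is abstracted to its mean point. The sample points, the class names, and the function names `abstract` and `generalize` are hypothetical and chosen only for illustration.

```python
# 1. Data input: past observations (hypothetical values), each with a known class.
data = [
    ((1.0, 1.2), "small"),
    ((0.8, 1.0), "small"),
    ((1.2, 0.9), "small"),
    ((4.0, 4.2), "large"),
    ((4.3, 3.9), "large"),
    ((3.8, 4.1), "large"),
]


# 2. Abstraction: summarize each class by one representative point (its mean),
#    instead of keeping every observation.
def abstract(samples):
    sums, counts = {}, {}
    for (x, y), label in samples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab]) for lab, (sx, sy) in sums.items()}


# 3. Generalization: use the abstraction to decide about points never seen before.
def generalize(model, point):
    px, py = point
    return min(model, key=lambda lab: (model[lab][0] - px) ** 2 + (model[lab][1] - py) ** 2)


model = abstract(data)
print(generalize(model, (1.1, 0.7)))
print(generalize(model, (3.5, 4.5)))
```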


Types of Machine Learning
• Supervised Learning:

• Definition: In supervised learning, the algorithm is trained on a labeled
dataset, where each input is associated with a corresponding output (that is,
the input data is paired with the desired output).

• Usefulness: Supervised learning is highly useful in real-world scenarios
where there is a known outcome or target variable.

• Supervised learning is often used for tasks such as classification, regression,
and object detection.
• Real-world Applications: Supervised learning is applied in various domains
such as:
1. Predictive analytics: Forecasting customer churn, predicting sales trends, etc.

2. Healthcare: Diagnosing diseases based on patient data.

3. Finance: Credit scoring, fraud detection, risk assessment.

• Advantages:

• Well-understood and widely studied.

• Can achieve high accuracy when trained on sufficient and representative data.

• Disadvantages:

• Requires labeled data, which can be expensive and time-consuming to obtain.

• May suffer from overfitting if the model is too complex or the training dataset
is small.
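
As a minimal sketch of supervised learning, the snippet below classifies a new input by majority vote among its nearest labeled examples (k-nearest neighbours, one of the algorithms listed in the syllabus). The dataset, the feature meaning, and the function name `knn_predict` are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest labeled points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled dataset: (hours studied, hours slept) -> exam result.
train = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((1.5, 6.0), "fail"),
    ((6.0, 7.0), "pass"),
    ((7.0, 8.0), "pass"),
    ((8.0, 6.5), "pass"),
]

print(knn_predict(train, (7.5, 7.0)))
print(knn_predict(train, (1.2, 5.0)))
```

Every training input is paired with its desired output, which is what makes the task supervised.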
• Unsupervised Learning:

• Definition: In unsupervised learning, the algorithm is trained on an
unlabeled dataset, where the goal is to discover hidden patterns or structures
within the data.
• Usefulness: Unsupervised learning is valuable in real-world scenarios where
the data is unstructured or lacks labels. It can uncover hidden insights and
group similar data points together.

• Real-world Applications: Unsupervised learning is applied in various domains
such as:

1. Market segmentation: Grouping customers based on similar traits or behaviors.

2. Anomaly detection: Identifying unusual patterns or outliers in data.

3. Recommender systems: Generating personalized recommendations based on
user preferences.
• Advantages:

• Can reveal hidden patterns or structures in the data.

• Does not require labeled data, making it applicable to a wide range of
datasets.

• Disadvantages:

• Evaluation of results can be subjective and challenging.

• Interpretability of the model's output may be limited.
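
A small sketch of unsupervised learning, assuming k-means clustering (Lloyd's algorithm) on one-dimensional data. The spending values, the initial centers, and the function name `kmeans_1d` are hypothetical. No labels are given to the algorithm, which only groups similar values together.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data with caller-supplied initial centers.

    Returns the final centers and the clusters from the last assignment step.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend of six customers, with no labels attached.
spend = [12.0, 15.0, 14.0, 90.0, 95.0, 88.0]
centers, clusters = kmeans_1d(spend, centers=[0.0, 100.0])
print(centers)
print(clusters)
```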


• Semi-Supervised Learning

• Semi-supervised learning is a machine learning paradigm that falls between
supervised and unsupervised learning.

• In semi-supervised learning, the dataset contains both labeled and unlabeled
data.

• The algorithm leverages the small amount of labeled data along with the
larger pool of unlabeled data to make predictions or learn patterns.
• Here are a few semi-supervised learning algorithms:

• Self-Training:

1. Self-training is a simple semi-supervised learning algorithm where the model
starts with a small amount of labeled data.
2. It trains initially on the labeled data and then uses the trained model to
make predictions on the unlabeled data.
3. The predictions with high confidence are added to the labeled dataset, and the
process iterates until convergence.
4. This approach assumes that the model's predictions on the unlabeled data are
reliable.
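
The self-training loop described above can be sketched as follows. The base model here is assumed to be a simple nearest-centroid classifier on one-dimensional data, and the confidence score (the relative margin between the two nearest centroids) is an illustrative choice rather than a standard definition. All names and values are hypothetical.

```python
def centroids(labeled):
    """Mean value per class for 1-D labeled data [(x, label), ...]."""
    groups = {}
    for x, label in labeled:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}


def predict_with_confidence(model, x):
    """Return (label, confidence); needs at least two classes in `model`."""
    ranked = sorted(model, key=lambda label: abs(x - model[label]))
    d_best = abs(x - model[ranked[0]])
    d_next = abs(x - model[ranked[1]])
    total = d_best + d_next
    # Confidence is 1 when x sits on a centroid and 0 when it is equidistant.
    confidence = 0.0 if total == 0 else (d_next - d_best) / total
    return ranked[0], confidence


def self_train(labeled, unlabeled, threshold=0.6, max_rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(labeled)          # train on the current labeled set
        confident, remaining = [], []
        for x in unlabeled:                 # predict on the unlabeled pool
            label, conf = predict_with_confidence(model, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:
            break                           # converged: nothing new to add
        labeled.extend(confident)           # adopt high-confidence predictions
        unlabeled = [x for x, _ in remaining]
    return centroids(labeled), labeled, unlabeled


# Two labeled points and a larger unlabeled pool (hypothetical values).
model, labeled, unlabeled = self_train(
    labeled=[(1.0, "low"), (9.0, "high")],
    unlabeled=[1.5, 2.0, 8.0, 8.5, 5.2],
)
print(model)
print(unlabeled)
```

The ambiguous value 5.2 never reaches the confidence threshold, so it stays unlabeled instead of being given a possibly wrong label.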
• Semi-Supervised Support Vector Machines (S3VM):

1. S3VM is an extension of traditional Support Vector Machines (SVM) to
semi-supervised settings.

2. It incorporates both labeled and unlabeled data into the SVM framework,
aiming to find a decision boundary that separates the data while minimizing
classification errors.

3. S3VM optimizes a combination of the margin and the empirical error on the
labeled data, along with a term penalizing the model's complexity.


• Label Propagation:

1. Label propagation is a graph-based semi-supervised learning algorithm.

2. It constructs a graph representation of the data, where nodes represent
data points, and edges represent similarities between points.
3. Initially, the labeled nodes are assigned their true labels, and then the labels
propagate through the graph based on similarities between nodes.
4. The process iterates until convergence, and the final labels are then
determined from the propagated labels.
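
A minimal sketch of label propagation on a tiny graph. To keep it short, it assumes unweighted edges (every similarity equals 1), keeps the labeled nodes clamped to their true labels, and runs a fixed number of iterations instead of testing for convergence. The graph, the seed labels, and the function name `propagate_labels` are hypothetical.

```python
def propagate_labels(neighbors, seeds, iterations=200):
    """Spread class scores from labeled seed nodes to their neighbors.

    neighbors: {node: [adjacent nodes]}, each node needs at least one neighbor.
    seeds:     {node: label} for the labeled nodes.
    """
    classes = sorted(set(seeds.values()))
    uniform = 1.0 / len(classes)
    scores = {}
    for node in neighbors:
        if node in seeds:
            scores[node] = {c: 1.0 if seeds[node] == c else 0.0 for c in classes}
        else:
            scores[node] = {c: uniform for c in classes}

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = scores[node]  # labeled nodes stay clamped
            else:
                # Unlabeled nodes take the average score of their neighbors.
                updated[node] = {
                    c: sum(scores[m][c] for m in nbrs) / len(nbrs) for c in classes
                }
        scores = updated

    return {node: max(classes, key=lambda c: scores[node][c]) for node in neighbors}


# A chain of six points: only the two end points are labeled.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
seeds = {0: "A", 5: "B"}
labels = propagate_labels(neighbors, seeds)
print(labels)
```

Each unlabeled node ends up with the label of the seed it is closer to along the chain.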
• Generative Adversarial Networks (GANs):
1. GANs consist of two neural networks, a generator and a discriminator,
which are trained simultaneously through a min-max game.
2. In semi-supervised learning, GANs can be used to generate realistic samples
from the unlabeled data distribution.
3. The generated samples are combined with the labeled data to train a
classifier, effectively leveraging the unlabeled data to improve classification
performance.
• Reinforcement Learning:

• Definition: Reinforcement learning involves an agent learning to make
decisions by interacting with an environment and receiving feedback in the
form of rewards or penalties.
• Usefulness: Reinforcement learning is beneficial in real-time environments
where decisions must be made sequentially and actions have consequences. It
is used in areas such as robotics, gaming, and autonomous systems.
• Real-world Applications: Reinforcement learning is applied in various
domains such as:

1. Robotics: Training robots to perform complex tasks in dynamic environments.

2. Autonomous vehicles: Teaching vehicles to navigate safely and efficiently.

3. Resource management: Optimizing energy usage, inventory management, etc.

• Advantages:

1. Can learn complex behaviors and strategies through trial and error.

2. Suitable for environments with sparse or delayed feedback.

• Disadvantages:

1. Requires a well-defined reward structure, which may be challenging to specify.

2. Can be computationally expensive and time-consuming to train.
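
The agent, environment, and reward loop can be sketched with tabular Q-learning on a toy corridor. Q-learning is one common reinforcement learning algorithm, chosen here only as an example. The environment, the reward of 1 at the goal, and the hyperparameter values are assumptions made for illustration.

```python
import random

N_STATES = 5                     # corridor cells 0..4; cell 4 is the goal
ACTIONS = ["left", "right"]


def step(state, action):
    """Environment: move one cell; reward 1 only when the goal is reached."""
    nxt = max(0, state - 1) if action == "left" else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def choose(q, state, epsilon, rng):
    """Epsilon-greedy action choice with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(1000):            # safety cap on episode length
            action = choose(q, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted future value.
            target = reward + (0.0 if done else gamma * max(q[nxt].values()))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q = train()
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The reward arrives only at the goal, so the agent must propagate that delayed feedback back to earlier cells through trial and error.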


Problems Not to Be Solved Using Machine Learning
• Although machine learning (ML) is a powerful tool for solving a wide range of problems,
there are certain types of problems that it may not be well suited to address
effectively.

• Some examples of problems that cannot be easily solved by machine learning
alone:

• Lack of Data, Undefined Objectives, Causal Inference, Ethical and Moral
Judgments, Unstructured Problem Solving, Domain Expertise, Conceptual
Understanding, Small Sample Sizes, Incorporating Context, Real-time Critical
Decision-making, Extreme Context Shifts, New and Novel Situations,
Interpersonal and Emotional Understanding
1. Lack of Data: Machine learning models require sufficient and high-quality data
for training. If the data is scarce, incomplete, or biased, the performance of ML
models can suffer.

2. Undefined Objectives: Machine learning relies on well-defined objectives and
metrics for optimization. If the problem itself is not well-defined, ML might not
be effective in finding solutions.

3. Causal Inference: Causal inference is the process of drawing conclusions about
causal relationships based on data and statistical methods. ML models mostly
capture correlations; determining cause and effect requires more rigorous
experimental design and statistical methods.
4. Ethical and Moral Judgments: Decisions involving ethical considerations, moral
judgments, and values often require human reasoning, empathy, and contextual
understanding that machine learning lacks.

5. Unstructured Problem Solving: Machine learning is often used for structured
tasks with clearly defined inputs and outputs. Problems requiring creative
thinking, intuition, and subjective judgment may not be suitable for ML.

6. Domain Expertise: ML models require domain-specific knowledge for effective
feature engineering, interpretation of results, and ensuring meaningful
outcomes. Lack of domain expertise can lead to suboptimal solutions.
7. Conceptual Understanding: ML models can predict outcomes based on patterns
in data, but they may not provide a deep conceptual understanding of
underlying phenomena or processes.

8. Small Sample Sizes: Some machine learning algorithms, particularly deep
learning models, require large amounts of data to generalize well. Small
sample sizes can lead to overfitting and poor performance.

9. Incorporating Context: Contextual understanding and reasoning based on
broader context, cultural nuances, and real-world experiences are areas where
machines may struggle.
10. Real-time Critical Decision-making: Situations that require real-time
decision-making, especially in high-stakes environments like healthcare or
aviation, may not allow sufficient time for the learning and adaptation
process of ML models.

11. Extreme Context Shifts: Machine learning models might not perform well
when deployed in situations drastically different from their training
environment. They lack adaptability to extreme shifts in context.

12. New and Novel Situations: ML models typically operate based on patterns
learned from past data. When faced with entirely new and novel situations,
they might not have sufficient information to provide accurate predictions.

13. Interpersonal and Emotional Understanding: Recognizing and
responding to human emotions, nuances, and interpersonal
interactions are challenging tasks that require human emotional
intelligence and social understanding.
Applications of Machine Learning
• Machine learning (ML) has become a powerful tool across various industries,
transforming how we live and work.

• Here's an overview of its applications and limitations:


Healthcare:
1. Disease diagnosis and prognosis
2. Personalized treatment recommendation
3. Drug discovery and development
4. Medical imaging analysis (e.g., MRI, CT scans)
5. Electronic health record (EHR) analysis for patient management

Finance:
1. Fraud detection and prevention
2. Credit scoring and risk assessment
3. Algorithmic trading and financial forecasting
4. Customer segmentation and targeted marketing
5. Portfolio optimization and wealth management

E-commerce:
1. Product recommendation and personalized shopping experiences
2. Customer segmentation and churn prediction
3. Price optimization and dynamic pricing strategies
4. Fraud detection and prevention
5. Supply chain optimization and demand forecasting

Marketing:
1. Customer segmentation and targeting
2. Sentiment analysis and brand sentiment monitoring
3. Social media analytics and influencer identification
4. Customer lifetime value prediction
5. Campaign optimization and marketing attribution modeling

Manufacturing:
1. Predictive maintenance for machinery and equipment
2. Quality control and defect detection
3. Supply chain optimization and inventory management
4. Demand forecasting and production planning
5. Process optimization and efficiency improvement

Transportation:
1. Autonomous vehicles and self-driving cars
2. Route optimization and traffic prediction
3. Demand forecasting for ride-sharing and delivery services
4. Fleet management and vehicle routing
5. Predictive maintenance for transportation infrastructure

Natural Language Processing (NLP):
1. Sentiment analysis and opinion mining
2. Text classification and document categorization
3. Language translation and multilingual communication
4. Chatbots and virtual assistants
5. Text summarization and content generation

Computer Vision:
1. Object detection and recognition
2. Image classification and segmentation
3. Facial recognition and biometric authentication
4. Autonomous drones and aerial surveillance
5. Medical image analysis and diagnosis
Limitations of ML
1. Lack of Common Sense Reasoning: ML algorithms struggle with tasks
requiring common sense or understanding the context of a situation beyond
the data they are trained on.
2. Creativity and Innovation: While applications like generating creative text
formats are emerging, current ML techniques lack the ability to truly
innovate or come up with entirely new ideas.
3. Ethical Decision-Making: ML algorithms cannot make ethical judgments or
navigate complex moral dilemmas requiring human values and
understanding.
Tools in Machine Learning
• Machine learning, a branch of artificial intelligence, is rapidly evolving and
requires a robust set of tools to build, train, and deploy models effectively.

• Here are some of the most popular tools, along with their advantages,
disadvantages, and limitations:
1. SCIKIT-LEARN:
• Advantages:
• Simple and easy-to-use API, making it great for beginners.

• Comprehensive documentation and community support.

• Implements a wide range of classical machine learning algorithms.

• Disadvantages:

• Limited support for deep learning models.

• May not be suitable for very large datasets or complex model architectures.

• Limitations:

• Lack of flexibility in customization compared to other frameworks like
TensorFlow or PyTorch.
2. TENSORFLOW:
• Advantages:
• Highly flexible and scalable, suitable for both research and production.

• Supports deep learning models with customizable architectures.

• TensorFlow Serving allows easy deployment of models.

• Disadvantages:

• Steeper learning curve compared to simpler libraries like scikit-learn.

• Requires more lines of code for simple tasks.

• Limitations:

• May require significant computational resources for training complex models.


3. PYTORCH:

• Advantages:
• Dynamic computational graph makes it easier to debug and experiment.

• Pythonic API is intuitive and easy to learn.

• Growing popularity in both research and industry.

• Disadvantages:

• Less mature ecosystem compared to TensorFlow.

• Limited production deployment tools compared to TensorFlow Serving.

• Limitations:

• Training large models may be slower than with TensorFlow in some
configurations.
4. KERAS:

• Advantages:

• High-level API, allowing for rapid prototyping and experimentation.

• Can run on top of TensorFlow, Theano, or Microsoft Cognitive Toolkit.

• Simplified syntax makes it easy to build neural networks.

• Disadvantages:

• Less flexibility compared to TensorFlow or PyTorch.

• May not be suitable for implementing custom architectures.

• Limitations:

• Limited support for complex research experiments compared to TensorFlow or PyTorch.


5. APACHE SPARK MLLIB:

• Advantages:
• Distributed computing capabilities suitable for big data processing.

• Integration with Apache Spark ecosystem for data preprocessing and
analysis.

• Disadvantages:

• Limited algorithms compared to standalone libraries like scikit-learn.

• Slower compared to native implementations for smaller datasets.

• Limitations:

• Not as actively developed or supported as other ML libraries.


Issues in Machine Learning
• Some common issues that professionals face when developing ML skills and
building an application from scratch:

1. Data Quality and Quantity:


Insufficient data: Inadequate amount of data can lead to poor model
performance, especially for complex models like deep learning.
Data imbalance: When the classes in a classification problem are not
represented equally, the model may become biased towards the majority class.
Noisy data: Data may contain errors, outliers, or irrelevant information, which
can negatively impact model performance.
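
The effect of data imbalance can be seen with a tiny calculation. The counts below (95 legitimate and 5 fraudulent transactions) are hypothetical. A model that always predicts the majority class looks accurate while missing every minority case.

```python
# Hypothetical imbalanced dataset: 95 legitimate transactions, 5 fraudulent ones.
labels = ["legit"] * 95 + ["fraud"] * 5

# A "model" biased towards the majority class: it predicts "legit" every time.
predictions = ["legit"] * len(labels)

pairs = list(zip(predictions, labels))
accuracy = sum(p == y for p, y in pairs) / len(labels)
fraud_recall = sum(p == "fraud" and y == "fraud" for p, y in pairs) / labels.count("fraud")

print(accuracy)       # share of all predictions that are correct
print(fraud_recall)   # share of actual fraud cases that were caught
```

Accuracy alone hides the problem here, which is why imbalanced tasks are usually judged with per-class metrics such as recall.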
2. Feature Engineering:
Identifying relevant features: Selecting the right features that contribute to
predictive performance is crucial. Missing important features or including
irrelevant ones can degrade model accuracy.
Handling categorical data: Encoding categorical variables effectively without
introducing bias or increasing dimensionality can be challenging.
Feature scaling: Ensuring that features are on similar scales can improve the
performance of certain algorithms, such as distance-based methods.
3. Model Selection and Evaluation:
Overfitting and underfitting: Overfitting occurs when a model learns to
memorize the training data instead of generalizing to unseen data, while
underfitting happens when the model is too simple to capture the underlying
patterns.
Hyperparameter tuning: Selecting the optimal hyperparameters for a model
can be time-consuming and require extensive experimentation.
Model evaluation metrics: Choosing appropriate metrics to evaluate model
performance based on the problem domain is critical. Using inaccurate or
misleading metrics can lead to erroneous conclusions.
4. Interpretability and Explainability:
Black-box models: Complex models such as deep neural networks may lack
interpretability, making it difficult to understand the reasoning behind their
predictions.
Model transparency: Understanding how a model makes decisions is
important for gaining trust and addressing concerns about fairness, bias, and
ethics.
5. Deployment and Maintenance:
Deployment challenges: Integrating machine learning models into production
systems while ensuring scalability, reliability, and efficiency can be complex.
Monitoring and updating: Models may degrade over time due to changes in data
distribution or drift. Regular monitoring and updating are necessary to maintain
performance.
6. Ethical and Legal Considerations:

Bias and fairness: Machine learning models can inherit biases present in the
training data, leading to unfair or discriminatory outcomes.
Privacy concerns: Handling sensitive data requires careful attention to privacy
regulations and ethical considerations, such as data anonymization and informed
consent.
Machine learning Activities
• The following are the machine learning activities:

1. Data Collection and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Model Selection and Training
5. Model Evaluation
6. Hyperparameter Tuning
7. Cross-Validation
8. Model Interpretation and Explainability
9. Deployment and Monitoring
10. Transfer Learning
1. Data Collection and Preprocessing

• Example: Suppose we are building a spam email classifier. We collect a dataset
containing emails labeled as spam or not spam.

• Preprocessing involves tasks like removing HTML tags, converting text to
lowercase, removing null values and stop words, and tokenization.

2. Exploratory Data Analysis (EDA)

• Example: Before training a model, we analyze the distribution of features,
correlations, and outliers in our dataset. For instance, we might visualize the
frequency of spam words in spam emails compared to non-spam emails.
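
The preprocessing tasks named in step 1 can be sketched with the standard library alone. The stop-word list is a tiny illustrative set, and the function name `preprocess` and the sample email are hypothetical. A real pipeline would normally use a fuller stop-word list and a proper HTML parser.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "to", "is", "and", "you", "your", "now"}


def preprocess(email):
    """Turn a raw email body into a list of cleaned tokens."""
    if email is None:                          # handle null values
        return []
    text = re.sub(r"<[^>]+>", " ", email)      # remove HTML tags
    text = text.lower()                        # convert to lowercase
    tokens = re.findall(r"[a-z0-9']+", text)   # tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words


raw = "<p>Claim your <b>FREE</b> prize now!</p>"
print(preprocess(raw))
```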
3. Feature Engineering:

• Example: In a fraud detection system, we create new features like transaction
frequency, average transaction amount, and account holder age based on
existing data. These features provide more information to the model for better
fraud detection.
4. Model Selection and Training:

• Example: We experiment with different algorithms (e.g., logistic regression,
random forest, neural networks) and hyperparameters to find the
best-performing model for our task. For instance, we train multiple classifiers on
our spam email dataset and compare their accuracy scores.
5. Model Evaluation:

• Example: After training our spam email classifier, we evaluate its performance
using metrics like accuracy, precision, recall, and F1-score. We split our dataset
into training and testing sets to assess how well the model generalizes to
unseen or new data.
6. Hyperparameter Tuning:

• Example: We use techniques like grid search or random search to tune the
hyperparameters of our machine learning model. For instance, we adjust the
learning rate, regularization strength, and batch size of a neural network to
optimize its performance on a validation set.
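
For the model evaluation step, the four metrics can be computed directly from the counts of true and false positives and negatives. The labels and predictions below are made-up values for a spam classifier, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test-set labels and model predictions.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example one spam email is missed and one legitimate email is flagged, so precision and recall are both 2/3 while accuracy is 0.75.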
7. Cross-Validation:

• Example: Instead of relying on a single train-test split, we perform k-fold
cross-validation to evaluate our model's performance more robustly. For
example, we divide our data into 5 folds, train the model on 4 folds, and
validate it on the remaining 1 fold, repeating this process five times.
8. Model Interpretation and Explainability:

• Example: In a medical diagnosis system, we use techniques like SHAP
(SHapley Additive exPlanations) or LIME (Local Interpretable
Model-agnostic Explanations) to interpret the predictions of our model. This helps to
understand which features are most influential in making decisions.
9. Deployment and Monitoring:

• Example: After building and evaluating our model, we deploy it into a
production environment where it can make real-time predictions. We set up
monitoring systems to track the model's performance over time and retrain or
update it as necessary to maintain accuracy.
10. Transfer Learning:

• Example: In image classification, we leverage a pre-trained convolutional
neural network (CNN), such as ResNet or VGG, which was trained on a large
dataset like ImageNet. We fine-tune the CNN on our specific task with a
smaller dataset, which often achieves better performance than training from
scratch.
Basic Types of Data in Machine Learning
1. Numerical Data:

• Examples: Age, temperature, height, salary, stock prices.

• Advantages:

o Easy to work with in many machine learning algorithms.

o Can represent a wide range of values and magnitudes.

• Disadvantages:

o Outliers can significantly affect analysis and model performance.

o May require scaling or normalization to ensure features are on a similar
scale.
2. Categorical Data:

• Examples: Gender (Male, Female), color (Red, Green, Blue), product
categories (Electronics, Clothing, Books).

• Advantages:

o Useful for representing non-numeric attributes and classes.

o Can provide valuable information for classification tasks.

• Disadvantages:

o Needs to be encoded into numerical values for most machine learning
algorithms.

o High cardinality (many unique categories) can lead to issues like the
curse of dimensionality.
3. Ordinal Data:

• Examples: Education level (High School < Bachelor's < Master's < Ph.D.), Likert
scale ratings (Strongly Disagree < Disagree < Neutral < Agree < Strongly Agree).
• Advantages:

o Preserves order or ranking among categories, providing additional information.

o Can be useful for certain types of regression or ranking tasks.

• Disadvantages:

o Not all machine learning algorithms can handle ordinal data directly.

o May require careful encoding to maintain the ordinal relationship.
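One simple way to keep the ordinal relationship is to map each category to its rank. The sketch below uses the education levels from the example above; the names `EDUCATION_ORDER` and `encode_ordinal` are illustrative assumptions.

```python
# Categories listed from lowest to highest, as in the example above.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "Ph.D."]

# Map each level to its position so that order is preserved numerically.
rank = {level: position for position, level in enumerate(EDUCATION_ORDER)}

def encode_ordinal(values, mapping):
    """Replace each category with its rank; unknown values raise KeyError."""
    return [mapping[value] for value in values]

encoded = encode_ordinal(["Master's", "High School", "Ph.D."], rank)
print(encoded)  # [2, 0, 3]
```

Note that integer ranks imply equal spacing between levels, which may not match reality, so this is a modelling choice rather than a neutral transformation.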


4. Text Data:

• Examples: Tweets, emails, articles, customer reviews.

• Advantages:

o Rich source of information for sentiment analysis, text classification, and
natural language processing tasks.
o Can capture nuanced information and context.

• Disadvantages:

o High-dimensional and sparse representation can be computationally expensive.

o Preprocessing steps like tokenization and stemming are usually needed, and
they can introduce noise.
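The "high-dimensional and sparse" point can be seen in a small bag-of-words sketch. The tokenizer below is a deliberately simple assumption (lowercase, keep letters, digits and apostrophes, no stemming), and the two documents are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({token for doc in documents for token in tokenize(doc)})
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["Free prize, claim your free prize now", "Meeting moved to noon"]
vocab, vectors = bag_of_words(docs)
print(vocab)
for vector in vectors:
    print(vector)
```

Every distinct word becomes one dimension, and most entries in each vector are zero, which is why real text corpora produce very wide, sparse matrices.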
5. Image Data:

• Examples: Photographs, medical images, satellite images.

• Advantages:

o Rich visual information suitable for tasks like object detection, image
classification, and image segmentation.
o Deep learning models like CNNs can automatically extract hierarchical
features.

• Disadvantages:

o Large memory and computational requirements for processing high-resolution
images.
o Requires extensive preprocessing and data augmentation to handle
variations in lighting, orientation, and scale.
6. Time Series Data:

• Examples: Stock prices over time, temperature readings, and sensor data.

• Advantages:

o Captures temporal dependencies and trends over time.

o Suitable for forecasting, anomaly detection, and trend analysis.

• Disadvantages:

o Need to handle missing values and irregular sampling intervals.

o Sensitive to seasonality, trends, and noise.
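As a rough illustration of handling missing values and noise in a time series, the sketch below forward-fills gaps and then smooths the result with a moving average. Forward fill is only one possible strategy, chosen here as an assumption, and the sensor readings are invented.

```python
def forward_fill(series):
    """Replace each None with the most recent observed value.

    A leading None stays None because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def moving_average(series, window):
    """Average each run of `window` consecutive values."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical temperature readings with two missing samples.
readings = [20.0, None, 22.0, 23.0, None, 25.0]
filled = forward_fill(readings)
smoothed = moving_average(filled, window=3)
print(filled)
print(smoothed)
```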


Exploring the Structure of Data
• Exploring the structure of data involves examining its organization,
relationships, patterns, and attributes to gain insights and understanding.

• Here are examples, advantages, and disadvantages of exploring the structure of
data:
1. Descriptive Statistics:

• Examples: Mean, median, mode, standard deviation, and variance.

• Advantages:

o Provides summary statistics that describe the central tendency,
dispersion, and shape of the data distribution.
o Helps identify outliers and anomalies.
• Disadvantages:

o May not capture complex relationships between variables.

o Mostly limited to numerical data (of the measures above, only the mode
applies to categorical data).

• Example: Calculating the mean and standard deviation of exam scores to
understand the average performance and variability among students.
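The exam-score example can be reproduced with Python's standard `statistics` module. The scores below are hypothetical.

```python
import statistics

# Hypothetical exam scores for seven students.
scores = [55, 62, 70, 70, 78, 85, 91]

mean = statistics.mean(scores)          # central tendency
median = statistics.median(scores)      # middle value, robust to outliers
mode = statistics.mode(scores)          # most frequent value
std_dev = statistics.stdev(scores)      # sample standard deviation
variance = statistics.variance(scores)  # sample variance

print(mean, median, mode, round(std_dev, 2), round(variance, 2))
```

`stdev` and `variance` use the sample (n - 1) formulas; `pstdev` and `pvariance` give the population versions.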
2. Data Visualization:

• Examples: Histograms, scatter plots, box plots, bar charts.

• Advantages:

o Offers visual representation of data distribution, trends, and relationships.

o Facilitates easy interpretation and communication of findings.


• Disadvantages:
o Interpretation may vary based on visualization techniques.
o Limited to visualizing a few variables at a time.
• Example: Plotting a histogram of customer ages to understand the age
distribution in a market dataset.
3. Correlation Analysis:
• Examples: Pearson correlation coefficient, Spearman rank correlation.
• Advantages:
o Quantifies the strength and direction of relationships between pairs of
variables.
o Helps identify potential predictors in regression analysis.
• Disadvantages:

o Pearson correlation measures only linear relationships and may miss non-linear
associations (Spearman captures monotonic ones).

o Correlation does not imply causation.

• Example: Computing the correlation between advertising spending and
sales revenue to understand their relationship in a marketing dataset.
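A minimal sketch of the Pearson correlation coefficient for the advertising example. The figures are invented and the function is written out by hand so that each term of the formula is visible.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical monthly advertising spend and sales revenue.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 68, 80, 110]

r = pearson(ad_spend, sales)
print(round(r, 3))
```

A value close to +1 indicates a strong positive linear relationship; it does not show that advertising causes the sales.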

4. Dimensionality Reduction:

• Examples: Principal Component Analysis (PCA), t-Distributed Stochastic
Neighbor Embedding (t-SNE).
• Advantages:

o Reduces the dimensionality of data while preserving important information.

o Facilitates visualization and interpretation of high-dimensional data.

• Disadvantages:

o Loss of interpretability in reduced dimensions.

o May discard some information.

• Example: Applying PCA to gene expression data to identify principal
components representing gene expression patterns.
5. Clustering Analysis:

• Examples: K-means clustering, hierarchical clustering.

• Advantages:

o Identifies natural groupings or clusters within the data.

o Useful for segmentation and pattern recognition.

• Disadvantages:

o K-means requires choosing the number of clusters in advance, which can be
subjective.

o Results may vary based on the choice of distance metric and clustering
algorithm.

• Example: Using K-means clustering to segment customers based on their
purchasing behavior.
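The K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in one dimension. The spending values and the starting centroids are assumptions chosen to keep the example deterministic; real implementations pick starting points randomly and work in many dimensions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Simple 1-D K-means with fixed starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88]
centroids, clusters = k_means_1d(spend, centroids=[0, 100])
print(centroids)
print(clusters)
```

Here the customers fall into a low-spend and a high-spend segment.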
Data Quality & Remediation
• Data quality refers to the reliability, accuracy, consistency, completeness, and
relevancy of data.

• Data remediation involves the process of identifying and correcting data quality
issues to ensure that data is accurate, reliable, and suitable for analysis or
decision-making.

• Here are examples, advantages, and disadvantages of data quality and
remediation in real-world scenarios:
• Examples of Data Quality Issues:

1. Inconsistent formats: In a customer database, phone numbers are stored in
various formats (e.g., +1 (555) 123-4567, 555-123-4567, 5551234567).
2. Missing values: In a sales dataset, some records may have missing values for
the "sales amount" field.
3. Duplicate records: A product inventory system contains duplicate entries for the
same item.

4. Incorrect data: In a healthcare database, a patient's birthdate is recorded as
01/13/1900, which is implausible for a current patient (and is not a valid
date at all if the field uses the DD/MM/YYYY format).
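Two of the issues above, inconsistent formats and duplicates, can be remediated together: once the phone numbers share one format, duplicates become easy to spot. The sketch assumes 10-digit North American numbers and a default country code of 1; the function name `normalize_phone` is illustrative.

```python
import re

def normalize_phone(raw, default_country_code="1"):
    """Strip punctuation and return the number as '+<country><number>'.

    Assumes a 10-digit national number when no country code is present.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits

# The three formats from the example above.
phones = ["+1 (555) 123-4567", "555-123-4567", "5551234567"]
normalized = [normalize_phone(p) for p in phones]
unique = sorted(set(normalized))  # duplicates collapse after normalization

print(normalized)
print(unique)
```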
• Advantages of Data Quality & Remediation:

1. Improved decision-making: High-quality data leads to more accurate insights
and better-informed decisions.
2. Enhanced efficiency: Clean and reliable data reduces the time spent on data
cleaning and troubleshooting.
3. Increased trust: Stakeholders have greater confidence in data-driven analyses
and reports when data quality is high.
4. Regulatory compliance: Ensuring data quality helps organizations comply
with data protection and privacy regulations.
• Disadvantages of Data Quality & Remediation:

1. Time-consuming: Identifying and rectifying data quality issues can be a
time-intensive process, especially for large datasets.
2. Costly: Data remediation efforts may require investments in tools, resources,
and personnel.

3. Complexity: Some data quality issues may be challenging to detect and
correct, especially in heterogeneous datasets.
4. Potential for errors: Human error during data cleaning and remediation can
introduce new inaccuracies or biases.
Data Pre-Processing
• Data preprocessing is a critical step in machine learning pipelines, involving
transforming raw data into a clean, structured format suitable for training
machine learning models.

• It includes techniques for handling missing values and outliers, scaling or
normalizing features, encoding categorical variables, and removing irrelevant
features.

• Here's an overview along with suitable examples, advantages, and
disadvantages.
1. Handling Missing Values:

• Example: Suppose we have a dataset of customer information, and some
entries have missing values for the "income" attribute.

• We can handle this by imputing missing values using techniques like mean,
median, or mode imputation, or by using advanced imputation methods like
K-nearest neighbors (KNN) or predictive models.
• Advantages:

o Prevents loss of valuable data.

o Improves the robustness and reliability of the dataset.


• Disadvantages:

o Imputation methods may introduce bias if not handled carefully.

o Imputed values may not accurately represent the true underlying data
distribution.
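Mean and median imputation for the "income" example can be sketched as follows. Missing entries are represented by `None`, and the income values are hypothetical.

```python
import statistics

def impute(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.mean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical incomes with two missing entries and one very high value.
income = [42000, None, 58000, 61000, None, 250000]

print(impute(income, "median"))
print(impute(income, "mean"))
```

The single high income pulls the mean far above the median, which illustrates how the choice of imputation method can bias the filled-in values.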

2. Outlier Detection and Removal:

• Example: In a dataset of housing prices, we may find some entries with
unrealistically high or low prices.

• Outliers can be detected using statistical methods like Z-score or IQR
(Interquartile Range) and removed or adjusted accordingly.
• Advantages:

o Improves model performance by reducing the impact of outliers.

o Prevents the model from being skewed by extreme values.

• Disadvantages:

o Removal of outliers may lead to loss of valuable information.

o Subjective choice of outlier detection method and threshold.
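The IQR rule flags values that lie more than k times the interquartile range outside the first or third quartile. The sketch below uses the common default k = 1.5, which is a convention rather than a fixed rule, and the housing prices (in thousands) are invented.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical housing prices in thousands; the last entry looks suspicious.
prices = [210, 225, 230, 240, 245, 250, 260, 275, 1900]

outliers = iqr_outliers(prices)
cleaned = [p for p in prices if p not in outliers]
print(outliers)
print(cleaned)
```

Changing `k` changes which points are flagged, which is the subjectivity mentioned above.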

3. Feature Scaling and Normalization:

• Example: In a dataset containing features with different scales (e.g., age
and income), scaling techniques like Min-Max scaling or Standardization
(Z-score normalization) can be applied to bring all features to a similar scale.
• Advantages:

o Ensures that features contribute equally to the model.

o Helps algorithms converge faster during training.

• Disadvantages:

o Scaling may amplify noise in low-variance features, and Min-Max scaling is
sensitive to outliers.

o Loss of interpretability for some features after scaling.
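Both scaling techniques reduce to one-line formulas: Min-Max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / standard deviation. The sketch below applies them to a hypothetical "age" feature and assumes the feature is not constant (otherwise the denominators would be zero).

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages = [20, 30, 40, 50, 60]
scaled = min_max_scale(ages)
z_scores = standardize(ages)

print(scaled)
print([round(z, 3) for z in z_scores])
```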

4. Encoding Categorical Variables:

• Example: Suppose we have a categorical feature like "gender" with values
"Male" and "Female." We can encode it into numerical values using
techniques like one-hot encoding or label encoding.
• Advantages:

o Allows algorithms to work with categorical data.

o Label (ordinal) encoding can preserve the order between categories when one
exists.

• Disadvantages:

o Increases dimensionality, especially with one-hot encoding.

o May introduce sparsity and multicollinearity in the dataset.
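One-hot encoding creates one binary column per category. A minimal sketch, assuming a hypothetical "color" feature and alphabetical column order:

```python
def one_hot(values):
    """Return the sorted category list and one binary row per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is set per row
        rows.append(row)
    return categories, rows

colors = ["Red", "Green", "Blue", "Green"]
categories, rows = one_hot(colors)
print(categories)
for row in rows:
    print(row)
```

With three categories the single feature becomes three columns, which shows how dimensionality and sparsity grow with the number of categories.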

5. Dimensionality Reduction:

• Example: Applying techniques like Principal Component Analysis (PCA) or
t-distributed Stochastic Neighbor Embedding (t-SNE) to reduce the
dimensionality of high-dimensional datasets while preserving most of the
relevant information.
• Advantages:

o Can reduce overfitting and computational complexity.

o Visualizes high-dimensional data in lower dimensions.

• Disadvantages:

o Loss of interpretability in reduced dimensions.

o May discard some information, leading to loss of predictive power.
