Unit-1

A8703-Machine Learning
➢Text Book
❖ G Amit Kumar Das, Saikat Dutt, Subramanian Chandramouli, Machine
Learning, Pearson India Education Services, 2019.
➢Reference Books
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, 2013.
2. Rudolph Russell, Machine Learning: Step-by-Step Guide to Implement
Machine Learning Algorithms with Python, CreateSpace Independent
Publishing Platform, 2018.
A8703-Machine Learning : Course Outcomes
➢ Identify the various concepts and challenges in Machine Learning.
➢ Select modelling and evaluation techniques for handling real-time data.
➢ Use supervised and unsupervised learning algorithms for a given problem.
➢ Examine supervised and unsupervised learning algorithms for analysing data.
➢ Identify the various concepts of neural networks to develop AI-based applications.
A8703-Machine Learning : Syllabus
Introduction to Machine Learning: Types of Machine Learning, Problems not to be solved using
Machine Learning, Applications of Machine Learning, Tools in Machine Learning, Issues in
Machine Learning, Machine learning Activities, Basic Types of Data in Machine Learning,
Exploring Structure of data, Data Quality & Remediation, Data Pre-Processing.
Modelling and Evaluation: Introduction, Selecting a Model, Training a Model, Model
Representation and Interpretability, Evaluating Performance of a Model, Improving
Performance of a Model, Feature Subset Selection, Dimensionality Reduction - PCA, SVD, FA,
LDA.
Bayesian Concept Learning: Introduction, Bayes’ Theorem, Naïve Bayes Classifier,
Applications of Naïve Bayes Classifier. Supervised Learning: Classification, Example of
Supervised Learning, Classification Model Learning Steps, Common Classification Algorithms:
KNN, Decision Tree, Random Forest Model, Support Vector Machines. Introduction of
Regression: Example of Regression, Linear Regression, Multiple Linear Regression.
Unsupervised Learning: Introduction, Unsupervised vs Supervised Learning,
Application of Unsupervised Learning, Clustering: K-Means, K-Medoid,
DBSCAN.
Basics of Neural Network: Introduction, Understanding the Biological Neuron,
Exploring the Artificial Neuron, Types of Activation Functions, Early
Implementations of ANN, Architectures of Neural Network.
Introduction to Machine Learning
• Introduction
• In the real world, we are surrounded by humans who can learn from their
experiences.
• We also have computers or machines that work on our instructions. But can a
machine also learn from experience or past data, as a human does?
• Machine learning is what makes it possible for machines to learn.
• What is Machine Learning?
Arthur Samuel, a research pioneer in the field of artificial intelligence and
computer gaming, coined the term “Machine Learning”.
✓Computers learning from data is known as machine learning.
• Definition of ML
• Machine learning (ML) is a sub-domain of artificial intelligence (AI) that
focuses on developing systems that learn, improve their performance, make
predictions on new data, or make decisions based on the data they receive.
• It involves training models on data to identify patterns, relationships, and insights,
which can then be used to perform various tasks and make predictions on new,
unseen data.
• “Machine learning is a method of data analysis that automates analytical model
building. It is a branch of artificial intelligence based on the idea that systems can
learn from data, identify patterns and make decisions with minimal human
intervention”.
• Definition of Machine Learning (Mitchell 1997): “A computer program is said to
learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with
experience E”.
Why Machine Learning?
Programs have been developed that successfully
✓learn to recognize spoken words (Waibel 1989; Lee 1989),
✓predict recovery rates of pneumonia patients (Cooper et al. 1997),
✓detect fraudulent use of credit cards,
✓drive autonomous vehicles on public highways (Pomerleau 1989),
✓play games such as backgammon at levels approaching the performance of
human world champions (Tesauro 1992, 1995).
Programs vs. learning algorithms
Traditional Programming refers to any manually written program that takes input
data and runs on a computer to produce the output.
In Machine Learning, the input data and the expected output are fed to an
algorithm, which creates the program (the model). The resulting model can then
be used to predict outcomes for new inputs.
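The contrast can be sketched in a few lines of Python. This is only an illustration: the temperature data, the 30-degree rule, and the function names are all hypothetical. In the first function a person writes the rule; in the second, the rule (a threshold) is derived from input data paired with the desired output.

```python
# Traditional programming: a human writes the rule by hand.
def is_hot_handwritten(temp_c):
    return temp_c >= 30  # threshold chosen by the programmer (hypothetical)


# Machine learning: the rule (here, a single threshold) is derived from examples.
def learn_threshold(temps, labels):
    """Return the midpoint threshold that misclassifies the fewest examples."""
    values = sorted(set(temps))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(th):
        return sum((t >= th) != y for t, y in zip(temps, labels))

    return min(candidates, key=errors)


# Input data paired with the desired output (made-up values).
temps = [12, 18, 22, 25, 28, 31, 35]
labels = [False, False, False, False, True, True, True]

threshold = learn_threshold(temps, labels)


def is_hot_learned(temp_c):
    return temp_c >= threshold


print(threshold, is_hot_learned(27), is_hot_learned(20))
```

With these examples the learned threshold differs from the hand-written one, because it comes from the data rather than from the programmer.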
Components of Learning System
• Advantages of ML
1. Automation: enables automation of tasks that are repetitive or time-consuming.
2. Prediction and Decision Making: analyzes large datasets to make predictions and
decisions with high accuracy.
3. Scalability: can handle large volumes of data and scale efficiently to accommodate
growing datasets.
4. Adaptability: can adapt and learn from new data, allowing models to improve continuously.
5. Personalization: can personalize user experiences by analyzing user behavior and
preferences.
6. Pattern Recognition: identifies patterns, trends, and anomalies in data that may not be
obvious to humans.
• Disadvantages of ML
1. Data Dependency: models rely heavily on the quality and quantity of data for training.
2. Overfitting: models may become too specialized to the training data and fail to generalize well to
unseen data.
3. Interpretability: many models lack interpretability, making it challenging to understand how they
arrive at their predictions.
4. Computational Resources: training complex ML models requires significant computational
resources.
5. Ethical and Privacy Concerns: models may perpetuate biases present in the data or infringe on
privacy rights, raising ethical and social concerns.
6. Lack of Domain Knowledge: models may perform poorly in domains where domain-specific
knowledge is essential.
• Limitations:
1. Limited by Data Quality
2. Complexity
3. Interpretability
4. Generalization
5. Scalability
6. Human Expertise
History of Machine learning
• The history of machine learning traces back to the mid-20th century, with
roots in the fields of mathematics, computer science, and artificial intelligence.
Here's a brief overview:
• 1950s - 1960s: Early Foundations
1. Alan Turing (1950): Turing proposed the Turing Test as a measure of
machine intelligence, laying the groundwork for the concept of artificial
intelligence (AI).
2. Arthur Samuel (1959): Samuel developed the first self-learning program, a
checkers-playing program that improved its performance through
reinforcement learning.
• 1970s - 1980s: Symbolic AI Dominance
1. Expert Systems: Symbolic AI, based on rule-based expert systems,
dominated the field. These systems encoded human expertise in the form
of rules to solve specific problems.
2. Neural Networks Research: Neural networks research continued, but
interest declined due to limited computational power and the
dominance of symbolic AI.
• 1990s: Renaissance of Neural Networks
1. Backpropagation: The rediscovery and popularization of the backpropagation
algorithm for training neural networks in the mid-1980s led to renewed interest
in neural networks and machine learning, which carried into the 1990s.
2. Support Vector Machines (SVMs): Vladimir Vapnik and others developed support
vector machines, a powerful machine learning algorithm for classification and
regression tasks.
• 2000s - Present: The Big Data Era
1. Big Data: The explosion of data availability due to the internet, social media, and
digital technologies fueled the development of new machine learning algorithms
and techniques.
2. Deep Learning: Breakthroughs in deep learning, fueled by advances in
computational power and data availability, led to significant improvements in
areas like computer vision, natural language processing, and speech recognition.
3. Reinforcement Learning: Reinforcement learning gained prominence,
particularly in areas like robotics, gaming (e.g., AlphaGo), and
autonomous vehicles.
4. Machine Learning Applications: Machine learning became ubiquitous in
various applications, including recommendation systems, fraud detection,
healthcare diagnostics, autonomous vehicles, and more.
5. Ethical and Social Implications: Attention to the ethical and social
implications of machine learning has increased, including concerns about bias,
fairness, privacy, and job displacement.
Evolution of machine learning
What is human learning?
• Human learning is the process through which individuals acquire new
knowledge, skills, attitudes, or behaviors.
• This process can occur in various ways, including through:
1. Experience
2. Observation
3. Instruction
• Human learning involves cognitive processes such as attention, memory,
and problem-solving.
• Cognitive science is an interdisciplinary field that studies
the mind and its processes, including how people think, learn,
remember, and perceive.

Types of human learning
• Human learning happens in one of three ways:
• (1) somebody who is an expert in the subject directly teaches us
(learning under expert guidance),
• (2) we build our own notion indirectly, based on what we have learnt from the
expert in the past (learning guided by knowledge gained from experts), or
• (3) we do it ourselves, maybe after multiple attempts (learning by self, or
self-learning).
How do machines learn?
• The basic machine learning process can be divided into three parts.
• 1. Data Input: Past data or information is utilized as a basis for future
decision-making.
• 2. Abstraction: The input data is represented in a broader way through the
underlying algorithm.
• 3. Generalization: The abstracted representation is generalized to form a
framework for making decisions.
• Figure 1 shows a schematic representation of the machine learning process.

Fig: Process of machine learning

• In human learning, memorization and perfect recall alone, i.e. relying only
on knowledge input, let students do well in examinations only up to a
certain stage.
• A better learning strategy needs to be adopted:
• 1. to be able to deal with the vastness of the subject matter and the
related issues in memorizing it.
• 2. to be able to answer questions where a direct answer has not
been learnt.
• A good option is to figure out the key points or ideas amongst a
vast pool of knowledge.
• In the machine learning paradigm, the vast pool of knowledge
is available from the data input.
• However, rather than using the data in its entirety, a concept of
mapping is applied.
• Mapping the data to known characteristics is called knowledge
abstraction, and it is performed by the machine.
• Finally, this abstracted mapping from the input data can be
applied to draw critical conclusions.
• This is generalization in the context of machine learning.

Types of Machine Learning
• Supervised Learning:
• Definition: In supervised learning, the algorithm is trained on a labeled
dataset, where each input is paired with its desired output.
• Usefulness: Supervised learning is highly useful in real-world scenarios
where there is a known outcome or target variable.
• Supervised learning is often used for tasks such as classification, regression,
and object detection.
• Real-world Applications: Supervised learning is applied in various domains
such as:
1. Predictive analytics: Forecasting customer churn, predicting sales trends, etc.
2. Healthcare: Diagnosing diseases based on patient data.
3. Finance: Credit scoring, fraud detection, risk assessment.
• Advantages:
• Well-understood and widely studied.
• Can achieve high accuracy when trained on sufficient and representative data.
• Disadvantages:
• Requires labeled data, which can be expensive and time-consuming to obtain.
• May suffer from overfitting if the model is too complex or the training dataset
is small.
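As a minimal sketch of learning from labeled data, the example below classifies a new point by copying the label of its closest training example (a 1-nearest-neighbour rule, chosen here only because it is short). The dataset of hours studied and hours slept is hypothetical.

```python
import math


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (1-NN)."""
    distances = [math.dist(p, x) for p in train_X]
    return train_y[distances.index(min(distances))]


# Hypothetical labeled dataset: (hours studied, hours slept) -> exam result
train_X = [(1, 4), (2, 5), (8, 7), (9, 8)]
train_y = ["fail", "fail", "pass", "pass"]

# Predict the label of a new, unseen input.
prediction = nearest_neighbor_predict(train_X, train_y, (7, 6))
print(prediction)
```

Every training input carries its desired output, which is what makes this supervised: the prediction for a new input is derived entirely from those labeled pairs.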
• Unsupervised Learning:
• Definition: In unsupervised learning, the algorithm is trained on an
unlabeled dataset, where the goal is to discover hidden patterns or structures
within the data.
• Usefulness: Unsupervised learning is valuable in real-world scenarios where
the data is unstructured or lacks labels. It can uncover hidden insights and
group similar data points together.
• Real-world Applications: Unsupervised learning is applied in various domains
such as:
1. Market segmentation: Grouping customers based on similar traits or behaviors.
2. Anomaly detection: Identifying unusual patterns or outliers in data.
3. Recommender systems: Generating personalized recommendations based on
user preferences.
• Advantages:
• Can reveal hidden patterns or structures in the data.
• Does not require labeled data, making it applicable to a wide range of
datasets.
• Disadvantages:
• Evaluation of results can be subjective and challenging.
• Interpretability of the model's output may be limited.
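To illustrate grouping without labels, here is a small sketch of k-means on one-dimensional data, using plain Python. The spend values and the initial centers are made up; the algorithm receives no labels and discovers the two groups from the values alone.

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given initial centers (k-means sketch)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


spend = [10, 12, 11, 50, 52, 49]  # hypothetical customer spend values
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(centers, clusters)
```

The result depends on the initial centers and on the chosen number of clusters, which is one reason evaluating unsupervised results can be subjective.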

• Semi-Supervised Learning
• Semi-supervised learning is a machine learning paradigm that falls between
supervised and unsupervised learning.
• In semi-supervised learning, the dataset contains both labeled and unlabeled
data.
• The algorithm leverages the small amount of labeled data along with the
larger pool of unlabeled data to make predictions or learn patterns.
• Here are a few semi-supervised learning algorithms:
• Self-Training:
1. Self-training is a simple semi-supervised learning algorithm where the model
starts with a small amount of labeled data.
2. It trains initially on the labeled data and then uses the trained model to
make predictions on the unlabeled data.
3. The predictions with high confidence are added to the labeled dataset, and the
process iterates until convergence.
4. This approach assumes that the model's predictions on the unlabeled data are
reliable.
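The self-training loop above can be sketched as follows. The base model is an assumption made for brevity: a nearest-centroid classifier on one-dimensional data, with "confidence" defined as the gap between the distances to the two nearest class centroids. The data, the threshold `min_confidence`, and all function names are hypothetical.

```python
def centroids(labeled):
    """Train the base model: the mean feature value of each class."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}


def predict_with_confidence(model, x):
    """Return (label, confidence); confidence is the distance gap between
    the nearest and second-nearest centroid."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - model[second]) - abs(x - model[best])


def self_train(labeled, unlabeled, min_confidence=2.0, max_rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(labeled)            # train on the current labeled set
        confident, remaining = [], []
        for x in unlabeled:                   # predict on the unlabeled data
            label, conf = predict_with_confidence(model, x)
            (confident if conf >= min_confidence else remaining).append((x, label))
        if not confident:                     # nothing new to add: stop
            break
        labeled += confident                  # pseudo-labels join the labeled set
        unlabeled = [x for x, _ in remaining]
    return centroids(labeled), labeled


labeled = [(1.0, "low"), (9.0, "high")]
unlabeled = [2.0, 3.0, 8.0, 7.0, 5.2]
model, final_labeled = self_train(labeled, unlabeled)
print(model, len(final_labeled))
```

The point 5.2 sits almost midway between the classes, so it never passes the confidence threshold and stays unlabeled. This shows how the method guards against the assumption that all predictions are reliable.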
• Semi-Supervised Support Vector Machines (S3VM):
1. S3VM is an extension of traditional Support Vector Machines (SVM) to
semi-supervised settings.
2. It incorporates both labeled and unlabeled data into the SVM framework,
aiming to find a decision boundary that separates the data while minimizing
classification errors.
3. S3VM optimizes a combination of the margin and the empirical error on the
labeled data, along with a term penalizing the model's complexity.

• Label Propagation:
1. Label propagation is a graph-based semi-supervised learning algorithm.
2. It constructs a graph representation of the data, where nodes represent
data points, and edges represent similarities between points.
3. Initially, the labeled nodes are assigned their true labels, and then the labels
propagate through the graph based on similarities between nodes.
4. The propagation iterates until convergence, and the final labels are
determined from the propagated labels.
• Generative Adversarial Networks (GANs):
1. GANs consist of two neural networks, a generator and a discriminator,
which are trained simultaneously through a min-max game.
2. In semi-supervised learning, GANs can be used to generate realistic samples
from the unlabeled data distribution.
3. The generated samples are combined with the labeled data to train a
classifier, effectively leveraging the unlabeled data to improve classification
performance.
• Reinforcement Learning:
• Definition: Reinforcement learning involves an agent learning to make
decisions by interacting with an environment and receiving feedback in the
form of rewards or penalties.
• Usefulness: Reinforcement learning is beneficial in real-time environments
where decisions must be made sequentially and actions have consequences. It
is used in areas such as robotics, gaming, and autonomous systems.
• Real-world Applications: Reinforcement learning is applied in various
domains such as:
1. Robotics: Training robots to perform complex tasks in dynamic environments.
2. Autonomous vehicles: Teaching vehicles to navigate safely and efficiently.
3. Resource management: Optimizing energy usage, inventory management, etc.
• Advantages:
1. Can learn complex behaviors and strategies through trial and error.
2. Suitable for environments with sparse or delayed feedback.
• Disadvantages:
1. Requires a well-defined reward structure, which may be challenging to specify.
2. Can be computationally expensive and time-consuming to train.
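The agent, environment, and reward loop can be sketched with tabular Q-learning. Q-learning is used here only as one common reinforcement learning algorithm; the text above does not name a specific one. The five-state corridor, the reward of 1 at the goal, and the parameter values are all assumptions for illustration.

```python
import random

N_STATES, GOAL = 5, 4                 # corridor of states 0..4, reward at state 4
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Environment: move along the corridor; reward 1 only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(list(ACTIONS))      # explore by trial and error
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward: reward + discounted best future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = q_learning()
# Greedy policy: in each non-goal state, pick the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)
```

The reward arrives only at the goal (delayed feedback), yet the discounted update passes its value back to earlier states, so the learned policy moves right from every state.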

Problems Not to Be Solved Using Machine Learning
• Machine learning (ML) is a powerful tool for solving a wide range of problems,
but there are certain types of problems that it is not well suited to address
effectively.
• Some examples of problems that cannot be easily solved by machine learning
alone:
• Lack of Data, Undefined Objectives, Causal Inference, Ethical and Moral
Judgments, Unstructured Problem Solving, Domain Expertise, Conceptual
Understanding, Small Sample Sizes, Incorporating Context, Real-time Critical
Decision-making, Extreme Context Shifts, New and Novel Situations,
Interpersonal and Emotional Understanding
1. Lack of Data: Machine learning models require sufficient and high-quality data
for training. If the data is scarce, incomplete, or biased, the performance of ML
models can suffer.
2. Undefined Objectives: Machine learning relies on well-defined objectives and
metrics for optimization. If the problem itself is not well-defined, ML might not
be effective in finding solutions.
3. Causal Inference: Causal inference is the process of drawing conclusions about
causal relationships based on data and statistical methods. ML models mainly
capture correlations; determining cause and effect requires more rigorous
experimental design and statistical methods.
4. Ethical and Moral Judgments: Decisions involving ethical considerations, moral
judgments, and values often require human reasoning, empathy, and contextual
understanding that machine learning lacks.
5. Unstructured Problem Solving: Machine learning is often used for structured
tasks with clearly defined inputs and outputs. Problems requiring creative
thinking, intuition, and subjective judgment may not be suitable for ML.
6. Domain Expertise: ML models require domain-specific knowledge for effective
feature engineering, interpretation of results, and ensuring meaningful
outcomes. Lack of domain expertise can lead to suboptimal solutions.
7. Conceptual Understanding: ML models can predict outcomes based on patterns
in data, but they may not provide a deep conceptual understanding of
underlying phenomena or processes.
8. Small Sample Sizes: Some machine learning algorithms, particularly deep
learning models, require large amounts of data to generalize well. Small
sample sizes can lead to overfitting and poor performance.
9. Incorporating Context: Contextual understanding and reasoning based on
broader context, cultural nuances, and real-world experiences are areas where
machines may struggle.
10. Real-time Critical Decision-making: Situations that require real-time
decision-making, especially in high-stakes environments like healthcare or
aviation, may not allow sufficient time for the learning and adaptation
process of ML models.
11. Extreme Context Shifts: Machine learning models might not perform well
when deployed in situations drastically different from their training
environment. They lack adaptability to extreme shifts in context.
12. New and Novel Situations: ML models typically operate based on patterns
learned from past data. When faced with entirely new and novel situations,
they might not have sufficient information to provide accurate predictions.
13. Interpersonal and Emotional Understanding: Recognizing and
responding to human emotions, nuances, and interpersonal
interactions are challenging tasks that require human emotional
intelligence and social understanding.
Applications of Machine Learning
• Machine learning (ML) has become a powerful tool across various industries,
transforming how we live and work.
• Here's an overview of its applications and limitations:

Healthcare:
1. Disease diagnosis and prognosis
2. Personalized treatment recommendation
3. Drug discovery and development
4. Medical imaging analysis (e.g., MRI, CT scans)
5. Electronic health record (EHR) analysis for patient management

Finance:
1. Fraud detection and prevention
2. Credit scoring and risk assessment
3. Algorithmic trading and financial forecasting
4. Customer segmentation and targeted marketing
5. Portfolio optimization and wealth management

E-commerce:
1. Product recommendation and personalized shopping experiences
2. Customer segmentation and churn prediction
3. Price optimization and dynamic pricing strategies
4. Fraud detection and prevention
5. Supply chain optimization and demand forecasting

Marketing:
1. Customer segmentation and targeting
2. Sentiment analysis and brand sentiment monitoring
3. Social media analytics and influencer identification
4. Customer lifetime value prediction
5. Campaign optimization and marketing attribution modeling

Manufacturing:
1. Predictive maintenance for machinery and equipment
2. Quality control and defect detection
3. Supply chain optimization and inventory management
4. Demand forecasting and production planning
5. Process optimization and efficiency improvement

Transportation:
1. Autonomous vehicles and self-driving cars
2. Route optimization and traffic prediction
3. Demand forecasting for ride-sharing and delivery services
4. Fleet management and vehicle routing
5. Predictive maintenance for transportation infrastructure

Natural Language Processing (NLP):
1. Sentiment analysis and opinion mining
2. Text classification and document categorization
3. Language translation and multilingual communication
4. Chatbots and virtual assistants
5. Text summarization and content generation

Computer Vision:
1. Object detection and recognition
2. Image classification and segmentation
3. Facial recognition and biometric authentication
4. Autonomous drones and aerial surveillance
5. Medical image analysis and diagnosis
Limitations of ML
1. Lack of Common Sense Reasoning: ML algorithms struggle with tasks
requiring common sense or understanding the context of a situation beyond
the data they are trained on.
2. Creativity and Innovation: While applications like generating creative text
formats are emerging, current ML techniques lack the ability to truly
innovate or come up with entirely new ideas.
3. Ethical Decision-Making: ML algorithms cannot make ethical judgments or
navigate complex moral dilemmas requiring human values and
understanding.
Tools in Machine Learning
• Machine learning, a branch of artificial intelligence, is rapidly evolving and
requires a robust set of tools to build, train, and deploy models effectively.
• Here are some of the most popular tools, along with their advantages,
disadvantages, and limitations:
1. SCIKIT-LEARN:
• Advantages:
• Simple and easy-to-use API, making it great for beginners.
• Comprehensive documentation and community support.
• Implements a wide range of classical machine learning algorithms.
• Disadvantages:
• Limited support for deep learning models.
• May not be suitable for very large datasets or complex model architectures.
• Limitations:
• Lack of flexibility in customization compared to other frameworks like
TensorFlow or PyTorch.
2. TENSORFLOW:
• Advantages:
• Highly flexible and scalable, suitable for both research and production.
• Supports deep learning models with customizable architectures.
• TensorFlow Serving allows easy deployment of models.
• Disadvantages:
• Steeper learning curve compared to simpler libraries like scikit-learn.
• Requires more lines of code for simple tasks.
• Limitations:
• May require significant computational resources for training complex models.

3. PYTORCH:
• Advantages:
• Dynamic computational graph makes it easier to debug and experiment.
• Pythonic API is intuitive and easy to learn.
• Growing popularity in both research and industry.
• Disadvantages:
• Less mature ecosystem compared to TensorFlow.
• Limited production deployment tools compared to TensorFlow Serving.
• Limitations:
• Training large models can be slower than in TensorFlow in some setups,
depending on the optimizations used.
4. KERAS:
• Advantages:
• High-level API, allowing for rapid prototyping and experimentation.
• Historically ran on top of TensorFlow, Theano, or Microsoft Cognitive Toolkit;
Keras 3 supports TensorFlow, JAX, and PyTorch backends.
• Simplified syntax makes it easy to build neural networks.
• Disadvantages:
• Less flexibility compared to TensorFlow or PyTorch.
• May not be suitable for implementing custom architectures.
• Limitations:
• Limited support for complex research experiments compared to TensorFlow or PyTorch.

5. APACHE SPARK MLLIB:
• Advantages:
• Distributed computing capabilities suitable for big data processing.
• Integration with Apache Spark ecosystem for data preprocessing and
analysis.
• Disadvantages:
• Limited algorithms compared to standalone libraries like scikit-learn.
• Slower compared to native implementations for smaller datasets.
• Limitations:
• Not as actively developed or supported as other ML libraries.

Issues in Machine Learning
• The following are common issues that practitioners face when building a
machine learning application from scratch.
1. Data Quality and Quantity:

Insufficient data: Inadequate amount of data can lead to poor model
performance, especially for complex models like deep learning.
Data imbalance: When the classes in a classification problem are not
represented equally, the model may become biased towards the majority class.
Noisy data: Data may contain errors, outliers, or irrelevant information, which
can negatively impact model performance.
2. Feature Engineering:
Identifying relevant features: Selecting the right features that contribute to
predictive performance is crucial. Missing important features or including
irrelevant ones can degrade model accuracy.
Handling categorical data: Encoding categorical variables effectively without
introducing bias or increasing dimensionality can be challenging.
Feature scaling: Ensuring that features are on similar scales can improve the
performance of certain algorithms, such as distance-based methods.
3. Model Selection and Evaluation:
Overfitting and underfitting: Overfitting occurs when a model memorizes
the training data instead of generalizing to unseen data, while
underfitting happens when the model is too simple to capture the underlying
patterns.
Hyperparameter tuning: Selecting the optimal hyperparameters for a model
can be time-consuming and require extensive experimentation.
Model evaluation metrics: Choosing appropriate metrics to evaluate model
performance based on the problem domain is critical. Using inaccurate or
misleading metrics can lead to erroneous conclusions.
4. Interpretability and Explainability:
Black-box models: Complex models such as deep neural networks may lack
interpretability, making it difficult to understand the reasoning behind their
predictions.
Model transparency: Understanding how a model makes decisions is
important for gaining trust and addressing concerns about fairness, bias, and
ethics.
5. Deployment and Maintenance:
Deployment challenges: Integrating machine learning models into production
systems while ensuring scalability, reliability, and efficiency can be complex.
Monitoring and updating: Models may degrade over time due to changes in the
data distribution (data drift). Regular monitoring and updating are necessary to
maintain performance.
6. Ethical and Legal Considerations:
Bias and fairness: Machine learning models can inherit biases present in the
training data, leading to unfair or discriminatory outcomes.
Privacy concerns: Handling sensitive data requires careful attention to privacy
regulations and ethical considerations, such as data anonymization and informed
consent.
Machine learning Activities
• The following are the machine learning activities:
1. Data Collection and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Model Selection and Training
5. Model Evaluation
6. Hyperparameter Tuning
7. Cross-Validation
8. Model Interpretation and Explainability
9. Deployment and Monitoring
10. Transfer Learning
1. Data Collection and Preprocessing
• Example: Suppose we are building a spam email classifier. We collect a dataset
containing emails labeled as spam or not spam.
• Preprocessing involves tasks like removing HTML tags, converting text to
lowercase, removing null values and stop words, and tokenizing the text.
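The preprocessing tasks listed above might look like this in plain Python. It is a sketch only: the stop-word list is a tiny illustrative set and the regular expressions are deliberately simple, so a real pipeline would likely use a dedicated library.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "to", "is", "are", "you", "and", "of"}


def preprocess(email_html):
    """Strip HTML tags, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", email_html)   # remove HTML tags
    text = text.lower()                            # convert to lowercase
    tokens = re.findall(r"[a-z0-9']+", text)       # simple word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("<p>You are the <b>WINNER</b> of a FREE prize!</p>")
print(tokens)
```

What remains are the content words, which are the ones a spam classifier would typically count or weight.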
2. Exploratory Data Analysis (EDA)
• Example: Before training a model, we analyze the distribution of features,
correlations, and outliers in our dataset. For instance, we might visualize the
frequency of spam words in spam emails compared to non-spam emails.
3. Feature Engineering:
• Example: In a fraud detection system, we create new features like transaction
frequency, average transaction amount, and account holder age based on
existing data. These features provide more information to the model for better
fraud detection.
4. Model Selection and Training:
• Example: We experiment with different algorithms (e.g., logistic regression,
random forest, neural networks) and hyperparameters to find the
best-performing model for our task. For instance, we train multiple classifiers on
our spam email dataset and compare their accuracy scores.
5. Model Evaluation:
• Example: After training our spam email classifier, we evaluate its performance
using metrics like accuracy, precision, recall, and F1-score. We split our dataset
into training and testing sets to assess how well the model generalizes to
unseen or new data.
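The four metrics named above can be computed directly from the counts of true and false positives and negatives. The sketch below uses hypothetical test-set labels, with "spam" treated as the positive class and "ham" standing for non-spam.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions on a held-out test set.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here one spam email is missed and one legitimate email is flagged, so precision and recall are both 2/3 even though accuracy is 0.75. This is why accuracy alone can be misleading.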
6. Hyperparameter Tuning:
• Example: We use techniques like grid search or random search to tune the
hyperparameters of our machine learning model. For instance, we adjust the
learning rate, regularization strength, and batch size of a neural network to
optimize its performance on a validation set.
7. Cross-Validation:
• Example: Instead of relying on a single train-test split, we perform k-fold
cross-validation to evaluate our model's performance more robustly. For
example, we divide our data into 5 folds, train the model on 4 folds, and
validate it on the remaining 1 fold, repeating this process five times.
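The 5-fold procedure can be sketched as an index generator. This version does not shuffle, which is an assumption made to keep the output predictable; in practice the data is usually shuffled first. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each sample appears in exactly one validation fold, so every data point is used for validation once and for training in the other rounds.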
8. Model Interpretation and Explainability:
• Example: In a medical diagnosis system, we use techniques like SHAP
(SHapley Additive exPlanations) or LIME (Local Interpretable
Model-agnostic Explanations) to interpret the predictions of our model. This
helps to understand which features are most influential in making decisions.
9. Deployment and Monitoring:
• Example: After building and evaluating our model, we deploy it into a
production environment where it can make real-time predictions. We set up
monitoring systems to track the model's performance over time and retrain or
update it as necessary to maintain accuracy.
10. Transfer Learning:
• Example: In image classification, we leverage a pre-trained convolutional
neural network (CNN), such as ResNet or VGG, which was trained on a large
dataset like ImageNet. We fine-tune the CNN on our specific task with a
smaller dataset, achieving better performance than training from scratch.
Basic Types of Data in Machine Learning
1. Numerical Data:
• Examples: Age, temperature, height, salary, stock prices.
• Advantages:
o Easy to work with in many machine learning algorithms.
o Can represent a wide range of values and magnitudes.
• Disadvantages:
o Outliers can significantly affect analysis and model performance.
o May require scaling or normalization to ensure features are on a similar
scale.
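Two common ways to put numerical features on a similar scale are min-max scaling and standardization (z-scores). The sketch below applies both to a hypothetical list of ages; it assumes the values are not all identical, since otherwise the denominators would be zero.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical numerical feature
scaled = min_max_scale(ages)
z_scores = standardize(ages)
print(scaled)
print(z_scores)
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range, which connects to the outlier disadvantage listed above.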
2. Categorical Data:
• Examples: Gender (Male, Female), color (Red, Green, Blue), product
categories (Electronics, Clothing, Books).

• Advantages:
o Useful for representing non-numeric attributes and classes.
o Can provide valuable information for classification tasks.
• Disadvantages:
o Need to be encoded into numerical values for most machine learning
algorithms.
o High cardinality (many unique categories) can lead to issues like the
curse of dimensionality.
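One widely used encoding for categorical data is one-hot encoding, sketched below in plain Python with the color example from above. The function name and the alphabetical ordering of the columns are choices made for this illustration.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector with one column per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


colors = ["Red", "Green", "Blue", "Green"]
categories, encoded = one_hot_encode(colors)
print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each distinct category adds one column, so a feature with many unique categories produces very wide, sparse vectors. That is the high-cardinality issue noted above.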
3. Ordinal Data:
• Examples: Education level (High School < Bachelor's < Master's < Ph.D.), Likert
scale ratings (Strongly Disagree < Disagree < Neutral < Agree < Strongly Agree).
• Advantages:
o Preserves order or ranking among categories, providing additional information.
o Can be useful for certain types of regression or ranking tasks.
• Disadvantages:
o Not all machine learning algorithms can handle ordinal data directly.
o May require careful encoding to maintain the ordinal relationship.
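One simple way to keep the ordinal relationship is to map each category to its rank. The sketch below assumes the education levels listed above; the names EDUCATION_ORDER and encode_ordinal are hypothetical and chosen only for this example.

```python
# Categories listed from lowest to highest, as in the example above.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "Ph.D."]

# Map each level to its position so that comparisons on codes match the ranking.
rank = {level: position for position, level in enumerate(EDUCATION_ORDER)}


def encode_ordinal(values, rank_of):
    """Replace each category with its integer rank."""
    return [rank_of[v] for v in values]


encoded = encode_ordinal(["Master's", "High School", "Ph.D."], rank)
print(encoded)
```

Note that integer codes imply equal spacing between levels, which may not hold for the real categories.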

4. Text Data:
• Examples: Tweets, emails, articles, customer reviews.
• Advantages:
o Rich source of information for sentiment analysis, text classification, and
natural language processing tasks.
o Can capture nuanced information and context.
• Disadvantages:
o High-dimensional and sparse representation can be computationally expensive.
o Preprocessing steps like tokenization and stemming are usually necessary, and
aggressive steps such as stemming can introduce noise.
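To illustrate why text representations become high-dimensional and sparse, the sketch below builds a small bag-of-words matrix. The two review strings and the simple regular-expression tokenizer are assumptions made for the example, not a recommended preprocessing pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Hypothetical customer reviews.
reviews = ["Great phone, great battery", "Battery died after a week"]

# The vocabulary has one dimension per distinct token across all documents.
vocabulary = sorted({token for review in reviews for token in tokenize(review)})


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


vectors = [bag_of_words(review, vocabulary) for review in reviews]
print(vocabulary)
print(vectors)
```

Even with two short reviews, each vector already has seven dimensions and several zeros; a real corpus would have thousands of mostly empty columns.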
5. Image Data:
• Examples: Photographs, medical images, satellite images.
• Advantages:
o Rich visual information suitable for tasks like object detection, image
classification, and image segmentation.
o Deep learning models like CNNs can automatically extract hierarchical
features.
• Disadvantages:
o Large memory and computational requirements for processing
high-resolution images.
o Often requires extensive preprocessing and data augmentation to handle
variations in lighting, orientation, and scale.
6. Time Series Data:
• Examples: Stock prices over time, temperature readings, and sensor data.
• Advantages:
o Captures temporal dependencies and trends over time.
o Suitable for forecasting, anomaly detection, and trend analysis.
• Disadvantages:
o Need to handle missing values and irregular sampling intervals.
o Sensitive to seasonality, trends, and noise.

Exploring the Structure of Data
• Exploring the structure of data involves examining its organization,
relationships, patterns, and attributes to gain insights and understanding.
• Here are common techniques for exploring the structure of data, with
examples, advantages, and disadvantages of each:
1. Descriptive Statistics:
• Examples: Mean, median, mode, standard deviation, and variance.
• Advantages:
o Provides summary statistics that describe the central tendency,
dispersion, and shape of the data distribution.
o Helps identify outliers and anomalies.
• Disadvantages:
o May not capture complex relationships between variables.
o Mostly limited to numerical data (only the mode applies to categories).
• Example: Calculating the mean and standard deviation of exam scores to
understand the average performance and variability among students.
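The exam-score example can be reproduced with Python's built-in statistics module. The scores below are made-up values used only for illustration.

```python
import statistics

# Hypothetical exam scores for seven students.
scores = [55, 62, 70, 70, 78, 85, 91]

mean_score = statistics.mean(scores)      # central tendency
median_score = statistics.median(scores)  # middle value, robust to outliers
mode_score = statistics.mode(scores)      # most frequent value
std_dev = statistics.stdev(scores)        # sample standard deviation (spread)

print(f"mean={mean_score}, median={median_score}, mode={mode_score}, stdev={std_dev:.2f}")
```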
2. Data Visualization:
• Examples: Histograms, scatter plots, box plots, bar charts.
• Advantages:
o Offers visual representation of data distribution, trends, and relationships.
o Facilitates easy interpretation and communication of findings.

• Disadvantages:
o Interpretation may vary based on visualization techniques.
o Limited to visualizing a few variables at a time.
• Example: Plotting a histogram of customer ages to understand the age
distribution in a market dataset.
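A histogram is essentially a count of values per bin. The sketch below computes those counts for a hypothetical list of customer ages and prints a text version of the chart; a plotting library would normally draw the bars instead.

```python
from collections import Counter

# Hypothetical customer ages.
ages = [19, 23, 25, 31, 34, 35, 38, 42, 47, 51, 58, 63]
bin_width = 10

# Assign each age to the lower edge of its 10-year bin (e.g. 34 -> 30).
counts = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```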
3. Correlation Analysis:
• Examples: Pearson correlation coefficient, Spearman rank correlation.
• Advantages:
o Quantifies the strength and direction of relationships between pairs of
variables.
o Helps identify potential predictors in regression analysis.
• Disadvantages:
o Pearson correlation measures only linear association (Spearman only
monotonic association), so other non-linear relationships may be missed.
o Correlation does not imply causation.
• Example: Computing the correlation between advertising spending and
sales revenue to understand their relationship in a marketing dataset.
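The Pearson coefficient can be computed directly from its definition. The function below is a minimal sketch, and the advertising and sales figures are invented for the example. The last lines show a case where a clear non-linear relationship still yields a coefficient of zero.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical advertising spend and sales revenue for five periods.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 58, 83, 96]
r = pearson(ad_spend, sales)
print(f"ad spend vs sales: r = {r:.3f}")

# y = x^2 on a symmetric range: strongly related, but not linearly.
r_quadratic = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(f"quadratic relationship: r = {r_quadratic:.3f}")
```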
4. Dimensionality Reduction:
• Examples: Principal Component Analysis (PCA), t-Distributed Stochastic
Neighbor Embedding (t-SNE).
• Advantages:
o Reduces the dimensionality of data while preserving important information.
o Facilitates visualization and interpretation of high-dimensional data.
• Disadvantages:
o Loss of interpretability in reduced dimensions.
o May discard some information.
• Example: Applying PCA to gene expression data to identify principal
components representing gene expression patterns.
5. Clustering Analysis:
• Examples: K-means clustering, hierarchical clustering.
• Advantages:
o Identifies natural groupings or clusters within the data.
o Useful for segmentation and pattern recognition.
• Disadvantages:
o Requires choosing the number of clusters, which can be subjective.
o Results may vary based on the choice of distance metric and clustering
algorithm.
• Example: Using K-means clustering to segment customers based on their
purchasing behavior.
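The core loop of K-means (assign each point to the nearest centroid, then move each centroid to the mean of its points) can be sketched for one-dimensional data as follows. The monthly spend values, the starting centroids, and the fixed iteration count are assumptions made for this example.

```python
def k_means_1d(values, centroids, n_iter=10):
    """Run a fixed number of K-means iterations on 1-D data."""
    for _ in range(n_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer; K = 2 is chosen by hand.
monthly_spend = [12, 15, 18, 20, 95, 100, 110, 105]
centroids, clusters = k_means_1d(monthly_spend, centroids=[0.0, 50.0])
print(centroids)
print(clusters)
```

The result depends on the number of clusters and the starting centroids, which is the subjectivity mentioned in the disadvantages above.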
Data Quality & Remediation
• Data quality refers to the reliability, accuracy, consistency, completeness, and
relevancy of data.
• Data remediation involves the process of identifying and correcting data quality
issues to ensure that data is accurate, reliable, and suitable for analysis or
decision-making.
• Here are examples, advantages, and disadvantages of data quality and
remediation in real-world scenarios:
• Examples of Data Quality Issues:
1. Inconsistent formats: In a customer database, phone numbers are stored in
various formats (e.g., +1 (555) 123-4567, 555-123-4567, 5551234567).
2. Missing values: In a sales dataset, some records may have missing values for
the "sales amount" field.
3. Duplicate records: A product inventory system contains duplicate entries for the
same item.
4. Incorrect data: In a healthcare database, a patient's birthdate is recorded as
01/13/1900, which is implausible for a current patient.
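Two of the issues above, inconsistent formats and duplicate records, can be remediated together by normalizing values before comparing them. The sketch below uses the phone numbers from the first example; the function name normalize_phone and the rule that a 10-digit number belongs to country code 1 are assumptions made for illustration.

```python
import re


def normalize_phone(raw, default_country="1"):
    """Strip formatting and return the number as '+<country><number>'."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 10:            # assumption: no country code was given
        digits = default_country + digits
    return "+" + digits


raw_numbers = ["+1 (555) 123-4567", "555-123-4567", "5551234567"]
normalized = [normalize_phone(n) for n in raw_numbers]

# After normalization the three records turn out to be duplicates.
unique_numbers = sorted(set(normalized))
print(normalized)
print(unique_numbers)
```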
• Advantages of Data Quality & Remediation:
1. Improved decision-making: High-quality data leads to more accurate insights
and better-informed decisions.
2. Enhanced efficiency: Clean and reliable data reduces the time spent on data
cleaning and troubleshooting.
3. Increased trust: Stakeholders have greater confidence in data-driven analyses
and reports when data quality is high.
4. Regulatory compliance: Ensuring data quality helps organizations comply
with data protection and privacy regulations.
• Disadvantages of Data Quality & Remediation:
1. Time-consuming: Identifying and rectifying data quality issues can be a
time-intensive process, especially for large datasets.
2. Costly: Data remediation efforts may require investments in tools, resources,
and personnel.
3. Complexity: Some data quality issues may be challenging to detect and
correct, especially in heterogeneous datasets.
4. Potential for errors: Human error during data cleaning and remediation can
introduce new inaccuracies or biases.
Data Pre-Processing
• Data preprocessing is a critical step in machine learning pipelines, involving
transforming raw data into a clean, structured format suitable for training
machine learning models.
• It includes techniques for handling missing values and outliers, scaling and
normalizing features, encoding categorical variables, and removing irrelevant
features.
• Here's an overview along with suitable examples, advantages, and
disadvantages.
1. Handling Missing Values:
• Example: Suppose we have a dataset of customer information, and some
entries have missing values for the "income" attribute.
• We can handle this by imputing missing values using techniques like mean,
median, or mode imputation, or by using advanced imputation methods like
K-nearest neighbors (KNN) or predictive models.
• Advantages:
o Prevents loss of valuable data.
o Improves the robustness and reliability of the dataset.

• Disadvantages:
o Imputation methods may introduce bias if not handled carefully.
o Imputed values may not accurately represent the true underlying data
distribution.
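Mean and median imputation for the "income" example can be sketched as below. The income values are invented, missing entries are represented by None, and the helper name impute is hypothetical.

```python
import statistics

# Hypothetical incomes; None marks a missing value.
incomes = [42000, None, 55000, 150000, None, 48000]

# Compute the fill value from the observed entries only.
observed = [x for x in incomes if x is not None]
mean_value = statistics.mean(observed)
median_value = statistics.median(observed)


def impute(values, fill):
    """Replace every missing entry with the given fill value."""
    return [fill if v is None else v for v in values]


print("mean imputation:  ", impute(incomes, mean_value))
print("median imputation:", impute(incomes, median_value))
```

Here the single large income pulls the mean well above the median, which illustrates how the choice of imputation method can bias the filled-in values.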
2. Outlier Detection and Removal:
• Example: In a dataset of housing prices, we may find some entries with
unrealistically high or low prices.
• Outliers can be detected using statistical methods like Z-score or IQR
(Interquartile Range) and removed or adjusted accordingly.
• Advantages:
o Improves model performance by reducing the impact of outliers.
o Prevents the model from being skewed by extreme values.
• Disadvantages:
o Removal of outliers may lead to loss of valuable information.
o Subjective choice of outlier detection method and threshold.
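The IQR rule flags values that lie more than k times the interquartile range below the first quartile or above the third quartile. The sketch below applies it to made-up housing prices; the multiplier k = 1.5 is a common convention rather than a fixed rule, and the function name iqr_outliers is hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical housing prices (in thousands); one entry is unrealistically high.
prices = [210, 225, 240, 250, 260, 275, 290, 1500]
outliers = iqr_outliers(prices)
cleaned = [p for p in prices if p not in outliers]
print(outliers)
print(cleaned)
```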
3. Feature Scaling and Normalization:
• Example: In a dataset containing features with different scales (e.g., age
and income), scaling techniques like Min-Max scaling or Standardization
(Z-score normalization) can be applied to bring all features to a similar scale.
• Advantages:
o Prevents features with large numeric ranges from dominating the model.
o Helps algorithms converge faster during training.
• Disadvantages:
o Min-Max scaling is sensitive to outliers, and scaling may amplify noise in
low-variance features.
o Loss of interpretability for some features after scaling.
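Both scaling techniques named above follow short formulas: Min-Max scaling maps values to the range [0, 1], and standardization subtracts the mean and divides by the standard deviation. The sketch below assumes a small list of ages and a non-constant feature (otherwise the denominators would be zero).

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical ages.
ages = [20, 30, 40, 50, 60]
scaled = min_max_scale(ages)
z_scores = standardize(ages)
print(scaled)
print([round(z, 3) for z in z_scores])
```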
4. Encoding Categorical Variables:
• Example: Suppose we have a categorical feature like "gender" with values
"Male" and "Female." We can encode it into numerical values using
techniques like one-hot encoding or label encoding.
• Advantages:
o Allows algorithms to work with categorical data.
o Can preserve the order between categories when an ordinal (label) encoding is used.
• Disadvantages:
o Increases dimensionality, especially with one-hot encoding.
o May introduce sparsity and multicollinearity in the dataset.
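One-hot encoding creates one 0/1 column per category. The minimal sketch below uses a hypothetical color feature and orders the columns alphabetically; the function name one_hot is an assumption for this example.

```python
def one_hot(values):
    """Return the category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical categorical feature.
categories, encoded = one_hot(["Red", "Green", "Blue", "Green"])
print(categories)
for row in encoded:
    print(row)
```

Each row contains a single 1, so the number of columns grows with the number of distinct categories, which is the dimensionality and sparsity cost listed above.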
5. Dimensionality Reduction:
• Example: Applying techniques like Principal Component Analysis (PCA) or
t-distributed Stochastic Neighbor Embedding (t-SNE) to reduce the
dimensionality of high-dimensional datasets while preserving most of the
relevant information.
• Advantages:
o Can reduce overfitting and computational complexity.
o Visualizes high-dimensional data in lower dimensions.
• Disadvantages:
o Loss of interpretability in reduced dimensions.
o May discard some information, leading to loss of predictive power.
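As a rough illustration of what PCA does, the sketch below finds only the first principal component of a tiny 2-D dataset by centering the data, building the covariance matrix, and applying power iteration. The data points, the function name, and the use of power iteration (instead of a full eigendecomposition from a numerical library) are assumptions made to keep the example dependency-free.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return the leading principal direction and the 1-D projection of each row."""
    n, d = len(rows), len(rows[0])
    # Center each feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered row onto the principal direction (2-D -> 1-D).
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


# Hypothetical points lying close to the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
direction, scores = first_principal_component(points)
print([round(x, 3) for x in direction])
print([round(s, 3) for s in scores])
```

The two original features are replaced by a single score per point, which keeps most of the variance but is harder to interpret than the original columns.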
