Machine Learning with Python
Theory and Implementation
Machine Learning with Python
Amin Zollanvari
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for
any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Lovingly dedicated to my family:
Ghazal, Kian, and Nika
Preface
The primary goal of this book is to provide, as far as possible, a concise systematic
introduction to practical and mathematical aspects of machine learning algorithms.
The book has arisen from about eight years of my teaching experience in both
machine learning and programming, as well as over 15 years of research in funda-
mental and practical aspects of machine learning. There were three sides to my main
motivation for writing this book:
• Software skills: During the past several years of teaching machine learning
as both undergraduate and graduate level courses, many students with differ-
ent mathematical, statistical, and programming backgrounds have attended my
lectures. This experience made me realize that many senior year undergradu-
ate or postgraduate students in STEM (Science, Technology, Engineering, and
Mathematics) often have the necessary mathematical-statistical background to
grasp the basic principles of many popular techniques in machine learning. That
being said, a major factor preventing students from deploying state-of-the-art
existing software for machine learning is often the lack of programming skills in
a suitable programming language. In the light of this experience, along with the
current dominant role of Python in machine learning, I often ended up teaching
for a few weeks the basic principles of Python programming to the extent that
is required for using existing software. Despite many excellent practical books
on machine learning with Python applications, the knowledge of Python pro-
gramming is often stated as the prerequisite. This led me to make this book
as self-contained as possible by devoting two chapters to fundamental Python
concepts and libraries that are often required for machine learning. At the same
time, throughout the book and in order to keep a balance between theory and
practice, the theoretical introduction to each method is complemented by its
Python implementation either in Scikit-Learn or Keras.
• Balanced treatment of various supervised learning tasks: In my experience,
and insofar as supervised learning is concerned, treatment of various tasks
such as binary classification, multiclass classification, and regression is rather
imbalanced in the literature; that is to say, methods are often presented for
Primary Audience
The primary audience of the book are undergraduate and graduate students. The
book aims at being a textbook for both undergraduate and graduate students who are willing to understand machine learning from theory to implementation in practice.
The mathematical language used in the book will generally be presented in a way that
both undergraduate (senior year) and graduate students will benefit if they have taken
a basic course in probability theory and have basic knowledge of linear algebra. Such
requirements, however, do not impose many restrictions as many STEM (Science,
Technology, Engineering, and Mathematics) students have taken such courses by
the time they enter the senior year. Nevertheless, there are some more advanced sections that start with a special symbol and, when needed, end with a separating line “———”
to be distinguished from other sections. These sections are generally geared towards
graduate students or researchers who are either mathematically more inclined or
would like to know more details about a topic. Instructors can skip these sections
without hampering the flow of materials. The book will also serve as a reference book
for machine learning practitioners. This audience will benefit from many techniques
and practices presented in the book that they need for their day-to-day jobs and
research.
Organization
Scikit-learn, however, is not currently the best software to use for implementing an important class of predictive models used in machine learning known as artificial neural networks
(ANNs). This is because training “deep” neural networks requires estimating and
adjusting many parameters and hyperparameters, which is computationally an ex-
pensive process. As a result, successful training of various forms of neural networks
highly relies on parallel computations, which are realized using Graphical Process-
ing Units (GPUs) or Tensor Processing Units (TPUs). However, scikit-learn does not
support using GPU or TPU. At the same time, scikit-learn does not currently support
implementation of some popular forms of neural networks known as convolutional
neural networks or recurrent neural networks. As a result, we have postponed ANNs
and “deep learning” to Chapters 13-15. There, we switch to Keras with a TensorFlow backend because they are well optimized for training and tuning various forms of ANNs and support various forms of hardware including CPUs, GPUs, and TPUs.
In Chapter 13, we discuss many concepts and techniques used for training deep
neural networks (deep learning). As for the choice of architecture in this chapter,
we focus on multilayer perceptron (MLP). Convolutional Neural Networks (CNNs)
and Recurrent Neural Networks (RNNs) will be introduced in Chapters 14 and 15,
respectively.
Throughout the book, a special symbol marks some extra information on a topic. Places marked like this can be skipped without affecting the flow of information.
1. If students are already familiar with Python programming, Chapters 2 and 3 can be skipped. In that case, each of the remaining chapters can generally be covered within a week.
2. If students have no knowledge of Python programming, Chapters 2 and 3 can
be covered in two weeks. In that case, I would suggest covering either Chapters
1-11, 13-14 or Chapters 1-9, 11, 13-15.
3. Depending on the class pace or even the teaching strategy (for example, flipped
vs. traditional), instructors may decide to cover each of Chapters 2, 9, and 13 in
two weeks. In that case, instructors can adjust the teaching and pick topics based
on their experience.
A set of educational materials is also available for instructors to use. This includes solutions to exercises as well as a set of Jupyter notebooks that can be used to dynamically present and explain weekly materials in the classroom. In particular, the use of Jupyter notebooks allows instructors to explain the materials and, at the same time, run code during the presentation. There are also different ways to present notebooks in a format similar to “slides”.
Contents
Preface
About This Book
1 Introduction
   1.1 General Concepts
      1.1.1 Machine Learning
      1.1.2 Supervised Learning
      1.1.3 Unsupervised Learning
      1.1.4 Semi-supervised Learning
      1.1.5 Reinforcement Learning
      1.1.6 Design Process
      1.1.7 Artificial Intelligence
   1.2 What Is “Learning” in Machine Learning?
   1.3 An Illustrative Example
      1.3.1 Data
      1.3.2 Feature Selection
      1.3.3 Feature Extraction
      1.3.4 Segmentation
      1.3.5 Training
      1.3.6 Evaluation
   1.4 Python in Machine Learning and Throughout This Book
2 Getting Started with Python
   2.1 First Things First: Installing What Is Needed
   2.2 Jupyter Notebook
   2.3 Variables
   2.4 Strings
   2.5 Some Important Operators
      2.5.1 Arithmetic Operators
      2.5.2 Relational and Logical Operators
      2.5.3 Membership Operators
   2.6 Built-in Data Structures
      2.6.1 Lists
      2.6.2 Tuples
      2.6.3 Dictionaries
      2.6.4 Sets
      2.6.5 Some Remarks on Sequence Unpacking
   2.7 Flow of Control and Some Python Idioms
      2.7.1 for Loops
      2.7.2 List Comprehension
      2.7.3 if-elif-else
   2.8 Function, Module, Package, and Alias
      2.8.1 Functions
      2.8.2 Modules and Packages
      2.8.3 Aliases
   2.9 Iterator, Generator Function, and Generator Expression
      2.9.1 Iterator
      2.9.2 Generator Function
      2.9.3 Generator Expression
12 Clustering
   12.1 Partitional Clustering
      12.1.1 K-Means
      12.1.2 Estimating the Number of Clusters
   12.2 Hierarchical Clustering
      12.2.1 Definition of Pairwise Cluster Dissimilarity
      12.2.2 Efficiently Updating Dissimilarities
      12.2.3 Representing the Results of Hierarchical Clustering
References
Index
Chapter 1
Introduction
In supervised learning, which is a major task in machine learning, the basic assump-
tion is that the input and the output spaces include all possible realizations of some
random variables and there is an unknown association between these spaces. The
goal is then to use the given data to learn a mathematical mapping that can estimate
the corresponding element of the output space for any given element from the input
space. In machine learning, we generally refer to the act of estimating an element
of the output space as prediction. To this end, in supervised learning the given data
includes some elements (instances) of the input space and their corresponding ele-
ments from the output space (see Fig. 1.1). There are two main types of supervised
learning: classification and regression.
If the output space contains realizations of categorical random variables, the task is called classification, and the model is called a classifier. However, if the output space contains realizations of numeric random variables (continuous or discrete), the task is known as regression, and the model is called a regressor. An example of classification is to classify an image as cat or dog (image classification). Here, in
order to learn the mapping (i.e., the classifier) from the input space (images) to the
output space (cat or dog), many images with their corresponding labels are used.
An example of regression is to estimate the horizontal and vertical coordinates of a
rectangle (known as bounding box) around an object in a given image. Here, in order
to learn the mapping (i.e., the regressor) from the input space (images) to the output
space (coordinates of the bounding box), many images with their four target values
(two for the center of the object and two for the width and height of the object) are
used.
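To make this distinction concrete in code, the following minimal sketch trains a classifier and a regressor with scikit-learn; the synthetic datasets and the particular models are arbitrary choices for illustration, and the methods themselves are discussed in later chapters:
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# classification: the output space is categorical (e.g., 0 = "cat", 1 = "dog")
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(Xc, yc)   # the classifier
print(clf.predict(Xc[:3]))               # predicted class labels

# regression: the output space is numeric (e.g., bounding-box coordinates)
Xr, yr = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(Xr, yr)     # the regressor
print(reg.predict(Xr[:3]))               # predicted numeric values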
In unsupervised learning, the data includes only a sample of the input space (real-
izations of some random variables). The goal is then to use the given data to learn
a mathematical mapping that can estimate the corresponding element of the output
space for the data (see Fig. 1.2). However, what defines the output space depends
on the specific task. For example, in clustering, which is an important unsupervised
learning task, the goal is to discover groups of similar observations within a given
data. As a result, the output space includes all possible partitions of the given sample.
Here, one may think of partitioning as a function that assigns a unique group label to
an observation such that each observation in the given sample is assigned to only one
label. For example, image segmentation can be formulated as a clustering problem
where the goal is to partition an image into several segments where each segment
includes pixels belonging to an object. Other types of unsupervised learning include
density estimation and certain dimensionality reduction techniques where the output
space includes possible probability density functions and possible projections of
data into a lower-dimensional space, respectively.
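As a quick illustration of clustering in code (a minimal sketch using synthetic data; K-means itself is covered in Chapter 12), the following cell groups unlabeled observations into three clusters with scikit-learn:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels are ignored (unsupervised)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # the group (cluster) label assigned to each observation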
Fig. 1.1: Supervised learning: the given data is a set S including elements of the
input space (i.e., x1 , x2 , . . . , xn ) and their corresponding elements from the output
space (i.e., y1 , y2 , . . . , yn ). The goal is to learn a mathematical mapping using S that
can estimate (predict) the corresponding element of the output space for any given
element from the input space.
Fig. 1.2: Unsupervised learning: the given data is a set S including elements of the
input space (i.e., x1 , x2 , . . . , xn ). The goal is to learn a mathematical mapping using
S that can project S to an element of the output space.
Regardless of the specific task, to learn the mapping, a practitioner commonly follows
a three-stage process, to which we refer as the design process:
Stage 1: selecting a set of “learning” rules (algorithms) that given data produce
mappings;
Stage 2: estimating mappings from data at hand;
Stage 3: pruning the set of candidate mappings to find a final mapping.
Generally, different learning rules produce different mappings, and each mapping
is indeed a mathematical model for the real-world phenomenon. The process of
estimating a specific mapping from data at hand (Stage 2) is referred to as learning
(the final mapping) or training (the rule). As such, the data used in this estimation
problem is referred to as training data, which depending on the application could be
different; for example, in supervised learning, data is essentially instances sampled
from the joint input-output space. The training stage is generally distinguished from
the model selection stage, which includes the initial candidate set selection (Stage 1)
and selecting the final mapping (Stage 3). Nonetheless, sometimes, albeit with some
ambiguity, this three-stage process altogether is referred to as training.
Let us elaborate a bit more on these stages. In the context of ANN, for example,
there exists a variety of predefined mathematical rules between an input space and an
output space. These mathematical rules are also used to categorize different types of
ANN; for example, multilayer perceptrons, recurrent neural networks, convolutional
neural networks, and transformers, to just name a few. In Stage 1, a practitioner should
generally use her prior knowledge about the data in order to choose some learning rules that she believes could work well in the given application. However, occasionally
the candidate set selection in Stage 1 is not only about choosing between various
rules. For example, in the context of ANNs, even a specific type of neural network such as the multilayer perceptron could be very versatile depending on the specific nonlinear functions that could be used in its structure, the number of parameters, and how
they operate in the hierarchical structure of the network. These are themselves some
parameters that could considerably influence the quality of mapping. As a result,
they are known as hyperparameters to distinguish them from those parameters that
are native to the mapping.
As a result, in Stage 1, the practitioner should also decide to what extent the
space of hyperparameters could be explored. This is perhaps the most “artistic”
part of the learning process; that is, the practitioner should narrow down an of-
ten infinite-dimensional hyperparameter space based on prior experience, available
search strategies, and available computational resources. At the end of Stage 1, what
a practitioner has essentially done is to select a set of candidate mathematical ap-
proximations of a real-world problem and a set of learning rules that given data
produce specific mappings.
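As a concrete, if simplified, illustration of these stages in code, the sketch below picks a learning rule (a small MLP) together with a restricted hyperparameter grid (the particular grid values are arbitrary assumptions), estimates a mapping for each candidate, and selects the final mapping by cross-validated grid search:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_grid = {
    "hidden_layer_sizes": [(10,), (50,), (50, 50)],  # hyperparameter: network shape
    "activation": ["relu", "tanh"],                  # hyperparameter: nonlinearity
}
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0), param_grid, cv=5)
search.fit(X, y)             # a mapping is estimated for every candidate (Stage 2)
print(search.best_params_)   # the selected candidate (Stage 3)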
Going back to LaMDA, when Lemoine posted his “interview” with LaMDA online,
it quickly went viral and attracted media attention due to his speculation that LaMDA is sentient, a speculation that led Google to put him on administrative leave.
Similar to many other machine learning models, a neural network (e.g., LaMDA)
is essentially the outcome of the aforementioned three-stage design process, after
all. Because it is trained and used for dialog applications, similar to many other
neural language models, both the input and output spaces are sequences of words.
In particular, given a sequence of input words, language models generally generate
a single word or a sequence of words as the output. The generated output can then
be treated itself as part of the input to generate more output words. Despite many
impressive algorithmic advances amplified by parallel computation hardware used
for model selection and training to come up with LaMDA (see (Thoppilan et al.,
2022)), it is a mapping from an input space of words to an output space of words; and the parameters of this mapping are stored as 0s and 1s, in terms of the amount of electrons trapped in some silicon memory cells on Google servers. To prevent mixing science with myths and fallacies, it would be helpful to take the following lines from the creators of LaMDA more seriously (Thoppilan et al., 2022):
Finally, it is important to acknowledge that LaMDA’s learning is based on imitating human
performance in conversation, similar to many other dialog systems. A path towards high
quality, engaging conversation with artificial systems that may eventually be indistinguish-
able in some aspects from conversation with a human is now quite likely. Humans may
interact with systems without knowing that they are artificial, or anthropomorphizing the
system by ascribing some form of personality to it.
The question is whether we can refer to the capacity of LaMDA to generate text, or for that matter to the capacity of any other machine learning model that performs well, as a form of intelligence. Although the answer to this question should begin with a definition of intelligence, which could lead to the definition of several other terms, it is generally answered in the affirmative insofar as the normal use of the term intelligence is concerned and we distinguish between natural intelligence (e.g., in humans) and artificial intelligence (e.g., in LaMDA). In this sense, machine learning is a
promising discipline that can be used to create artificial intelligence.
1.2 What Is “Learning” in Machine Learning?
The term learning is used frequently in machine learning literature; after all, it is
even part of the “machine learning” term. Duda et al. (Duda et al., 2000) write,
In the broadest sense, any method that incorporates information from training samples in
the design of a classifier employs learning.
The result of running the machine learning algorithm can be expressed as a function y(x)
which takes a new digit image x as input and that generates an output vector y, encoded the
same way as the target vectors.
Hastie et al. put the matter this way (Hastie et al., 2001):
Vast amounts of data are being generated in many fields, and the statistician’s job is to make
sense of it all: to extract important patterns and trends, and understand “what the data says”.
We call this learning from data.
The definition is neither too specific to certain tasks nor so broad that it could
possibly introduce ambiguity. By “data” we refer to any collection of information
obtained by measurement, observation, query, or prior knowledge. This way we can
refer to the process of estimating any mapping used in, for example, classification,
regression, dimensionality reduction, clustering, reinforcement learning, etc., as
learning. As a result, the learned mapping is simply an estimator. In addition, the
utility of an algorithm is implicit in this definition; after all, the data should be used
in an algorithm to estimate a mapping.
One may argue that we can define learning as any estimation that could occur
in machine learning. While there is no epistemological difficulty in doing so, such
a general definition of learning would become too loose to be used in referring to
useful tasks in machine learning. To put the matter in perspective, suppose we would
like to develop a classifier that can classify a given image to dog or cat. Using such a
broad definition for learning, even when a label is assigned to an image by flipping
a coin, learning occurs because we are estimating the label of a given data point
(the image); however, the question is whether such a “learning” is anyhow valuable
in terms of a new understanding of the relationship between the input and output
spaces. In other words, although the goal is estimating the label, learning is not the
goal per se, but it is the process of achieving the goal.
At this stage, it seems worthwhile to briefly discuss the impact and utility of
prior knowledge in the learning process. Prior knowledge, or a priori knowledge,
refers to any information that could be used in the learning process and would be
generally available before observing realizations of random variables used to estimate
the mapping from the input space to the output space. Suppose in the above image
classification example, we are informed that the population through which the images
will be collected (the input space) contains 90% dogs and only 10% cats. Suppose
the data is now collected but the proportion of classes in the given training data is not the same. Such a discrepancy of class proportions between the population and the given
sample is common if classes are sampled separately. We train four classifiers from
the given training data and we decide to assign the label to a given image based on
the majority vote among these four classifiers—this is known as ensemble learning
(discussed in Chapter 8) in which we combine the outcome of several other models
known as base models. However, here taking majority vote between an even number
of classifiers means there are situations with a tie between the dogs and cats assigned
by individual classifiers (e.g., two classifiers assign cat and two assign dog to a given
image). One way to break the ties when they happen is to randomly assign a class
label. However, based on the given prior knowledge, here a better strategy would be
to assign any image with a tie to the dog class because that class is 9 times more likely than the cat class. Here, in learning the ensemble classifier, we use instances
of the joint input-output space as well as the prior knowledge.
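A minimal sketch of this tie-breaking strategy is shown below; the base-classifier votes are made up, and the only point is that a tie among the four votes is resolved in favor of the a priori more likely class (dog):
import numpy as np

def ensemble_predict(votes, prior_winner=1):
    # votes: 0/1 labels (0 = cat, 1 = dog) produced by the four base classifiers
    counts = np.bincount(votes, minlength=2)
    if counts[0] == counts[1]:   # a tie among the base classifiers
        return prior_winner      # prior knowledge: dog is 9 times more likely
    return int(np.argmax(counts))

print(ensemble_predict(np.array([0, 1, 0, 1])))  # tie -> 1 (dog)
print(ensemble_predict(np.array([0, 0, 0, 1])))  # majority -> 0 (cat)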
1.3 An Illustrative Example
In order to further clarify and demonstrate some of the steps involved in the aforementioned three-stage design process, here we consider an example. In particular, using an EEG classification task as an archetype, we attempt to present a number of
general machine learning concepts including feature selection, feature extraction,
segmentation, training, and evaluating a predictive model.
1.3.1 Data
Here, we rely on our prior knowledge about the problem and consider measurements
taken only from one of the sensors (C1 channel), which we believe contains a
good amount of discriminatory information between classes. Fig. 1.3 shows the 10
EEG recordings over C1 channel for the two aforementioned subjects. Let the set
of pairs S = {(x1, y1 ), . . . , (xn, yn )} denote all available EEG recordings for the two
subjects used here where yi is a realization of the class variable that takes 0 (for
control) or 1 (for alcoholic). Because we have two subjects and for each subject we
have 10 trials, n = 2 × 10 = 20. At the same time, because each xi contains 256
data points collected in one second (recall that the sampling rate was 256 Hz and
the duration of each trial was 1 second), it can be considered as a column vector
xi ∈ R p, p = 256, which is generally referred to as a feature vector. As the name
suggests, each element of a feature vector is a feature (also known as attribute)
because it is measuring a property of the phenomenon under investigation (here,
brain responses). For example, the value of the feature at the l th dimension of xi is
the realization of a random variable that measures the value of C1 channel at a time
point with an offset of l − 1, l = 1, . . . , 256 from the first time point recorded in the
feature vector. We refer to n and p as the sample size and dimensionality of feature
vectors, respectively. The goal is to use S, which contains n elements of the joint input-output space, to train the classifier.
1.3.2 Feature Selection
Should we pick m channels rather than merely the C1 channel, then each xi will be a vector of 256 × m features. However, as we will see in Chapter 10, it is often useful to identify “important” features or, equivalently, remove “redundant” features—
a process known as feature selection. At the same time, feature selection can be done
using data itself, or based on prior knowledge. As a result, our choice of C1 channel
can be viewed as a feature selection based on prior knowledge because we have
reduced the potential dimensionality of feature vectors from 256 × m to 256.
1.3.3 Feature Extraction
Feature extraction maps the original input features to some output features that may or may not have a physical interpretation as input features. For example, feature selection can be seen as a special type of feature extraction in which the physical nature of the features is kept in the selection process, but in this section we will use a feature extraction that converts original features (set
of signal magnitude across time) to frequencies and amplitudes of some sinusoids.
As the signal model for this application, we choose the sum of sinusoids with
unknown frequencies, phases, and amplitudes, which have been used before in similar
applications (e.g., see (Abibullaev and Zollanvari, 2019)). That is to say,
si[l] = Σ_{r=1}^{M} Ar cos(2π fr l + φr)                                      (1.1)
      = Σ_{r=1}^{M} [ ξ1r cos(2π fr l) + ξ2r sin(2π fr l) ],   l = 1, . . . , p,
where ξ1r = Ar cos(φr) and ξ2r = −Ar sin(φr).
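To give a flavor of how such features could be extracted in practice, the sketch below fits this model to a toy signal. Since the frequencies are unknown, some estimation scheme is needed; purely for illustration, we take the M dominant FFT frequencies and obtain the ξ coefficients by least squares (a sketch, not necessarily the exact procedure used for the EEG results reported later):
import numpy as np

def sinusoidal_features(x, M=3):
    p = len(x)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(p)                   # normalized frequencies (cycles/sample)
    top = np.argsort(spectrum[1:])[-M:] + 1      # indices of the M dominant peaks (skip DC)
    f = freqs[top]
    l = np.arange(1, p + 1)
    # design matrix with a cos and a sin column for each selected frequency (cf. Eq. 1.1)
    A = np.column_stack([np.c_[np.cos(2*np.pi*fr*l), np.sin(2*np.pi*fr*l)] for fr in f])
    xi, *_ = np.linalg.lstsq(A, x, rcond=None)   # least-squares estimates of the xi coefficients
    return np.concatenate([f, xi])               # feature vector of size 3M

rng = np.random.default_rng(0)
x = np.sin(2*np.pi*0.05*np.arange(256)) + 0.3*rng.standard_normal(256)  # a toy segment
print(sinusoidal_features(x, M=3).shape)         # (9,), i.e., 3M features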
1.3.4 Segmentation
In many real-world applications, the given data is collected sequentially over time
and we are asked to train a model that can perform prediction. An important factor
that can affect the performance of this entire process is the length of signals that are
used to train the model. As a result, it is common to apply signal segmentation; that
is to say, to segment long given signals into smaller segments with a fixed length and
treat each segment as an observation. At this point, one may decide to first apply a
feature extraction on each signal segment and then train the model, or directly use
these signal segments in training. Either way, it would be helpful to treat the segment
length as a hyperparameter, which needs to be predetermined by the practitioner or
tuned from data.
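A minimal sketch of this operation is given below: a trial of length p is split into c non-overlapping segments of length r = p/c, each of which is then treated as a separate observation (the toy signal is arbitrary):
import numpy as np

def segment(x, c):
    p = len(x)
    r = p // c                      # segment length (we assume c divides p)
    return x[:c * r].reshape(c, r)  # one row per segment

x = np.arange(256.0)                # a toy trial with p = 256 samples (1 second at 256 Hz)
print(segment(x, 16).shape)         # (16, 16): 16 segments of 62.5 msec each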
Fig. 1.3: Training Data: the EEG recordings of C1 channel for all 10 trials for (a)
control subject; and (b) alcoholic subject.
Fig. 1.4: Modeling the first EEG trial for the control subject; that is, modeling the
signal at the bottom of Fig. 1.3a. The solid line in each figure is the original signal
denoted x1 . The (red) dashed line shows the estimate denoted s1 obtained using the
sinusoidal model. Depending on the model order M, two cases are considered (a)
M = 1; and (b) M = 3.
There are various reasons for signal segmentation. For example, one can perceive
signal segmentation as a way to increase the sample size; that is to say, rather than
using very few signals of large lengths, we use many signals of smaller lengths to train
a model. Naturally, a model (e.g., a classifier) that is trained based on observations
of a specific length expects to receive observations of the same length to perform
prediction. This means that the prediction performed by the trained model is based
on a given signal of the same length as the one that is used to prepare the training
data through segmentation; however, sometimes, depending on the application, we
may decide to make a single decision based on decisions made by the model on
several consecutive segments—this is also a form of ensemble learning discussed
in Chapter 8. Another reason for signal segmentation could be due to the quality of
feature extraction. Using a specific feature extraction, which has a limited flexibility,
may not always be the most prudent way to represent a long signal that may have a
large variability. In this regard, signal segmentation could help bring the flexibility
of feature extraction closer to the signal variability.
To better clarify these points, we perform signal segmentation in our EEG exam-
ple. In this regard, we divide each trial, which has a duration of 1 second, into smaller
segments of size 1/3 and 1/16 second; that is, segments of size 1/3 × 1000 = 333.3 and 1/16 × 1000 = 62.5 msec, respectively. For example, when the segmentation size is 62.5 msec, each 1-second trial yields 16 segments.
1.3.5 Training
In Section 1.3.4, we observed that for a smaller segmentation size, extracted features
can more closely represent the observed signals. This observation may lead to the
conclusion that a smaller signal segment would be better than a larger one; however, as
we will see, this is not necessarily the case. This state of affairs can be attributed in part to two factors: 1) the segmentation process breaks the dependence structure among sequentially collected data points, which could potentially hinder its beneficial impact; and 2) small segments may contain too little information about the outcome (output
space). To examine the effect of segmentation size on the classifier performance, in
this section we train a classifier for each segmentation size, and in the next section
we evaluate their predictive performance. Let us first summarize and formulate what
we have done so far.
Recall that each trial xi was initially a vector of size of p = 256 and we
denoted all collected training data as S = {(x1, y1 ), . . . , (xn, yn )} where n =
20 (2 subjects × 10 trials). Using signal segmentation, we segment each xi into c
non-overlapping smaller segments denoted xi j where each xi j is a vector of size
r = p/c, i = 1, . . . , n, j = 1, . . . , c, xi j [l] = xi [( j − 1)r + l], l = 1, . . . , r, and
for simplicity we assume c divides p. This means that our training data is now
Sc = {(x11, y11 ), (x12, y12 ), . . . , (xnc, ync )} where yi j = yi, ∀ j, and where index
c = 1, 3, 16 is used in the notation Sc to reflect the dependency of the training data on the
segmentation size. In our experiments, we assume c = 1 (no segmentation), c = 3
(segments of size 333.3 msec), and c = 16 (segments of size 62.5 msec). Sup-
pose that now we perform signal modeling (feature extraction) over each signal
segment xi j and assume M = 3 for all values of c. This means that each xi j is
replaced with a feature vector θ i j of size 3M. In other words, the training data,
which is used to train the classifier, after segmentation and feature extraction is
Sθc = {(θ 11, y11 ), (θ 12, y12 ), . . . , (θ nc, ync )}.
We use Sθ1, Sθ3, Sθ16 to train three classifiers of non-alcoholic vs. alcoholic. Regardless of the specific choice of the classifier, when classification of a given signal segment x of length r is desired, we first need to extract a similar set of 3M features
from x, and then use the classifier to classify the extracted feature vector θ ∈ R3M . As
for the classifier choice, we select the standard 3NN (short for 3 nearest neighbors)
classifier. In particular, given an observation θ to classify, 3NN classifier finds the
three θ i j from Sθc that have the lowest Euclidean distance to θ; that is to say, it finds the 3
nearest neighbors of θ. Then it assigns θ to the majority class among these 3 nearest
neighbors (more details in Chapter 5). As a result, we train three 3NN classifiers,
one for each training data.
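In code, training such a classifier is straightforward with scikit-learn; the sketch below uses random stand-ins for the extracted feature vectors and labels (the real ones would come from the segmentation and feature extraction described above):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n, c, M = 20, 3, 3
Theta = rng.standard_normal((n * c, 3 * M))    # n*c feature vectors of size 3M
y = np.repeat(rng.integers(0, 2, size=n), c)   # each segment inherits its trial's label

knn = KNeighborsClassifier(n_neighbors=3)      # 3NN with Euclidean distance (the default)
knn.fit(Theta, y)
print(knn.predict(Theta[:2]))                  # classify two feature vectors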
1.3.6 Evaluation
The worth of a predictive model rests with its ability to predict. As a result, a
key issue in the design process is measuring the predictive capacity of trained
predictive models. This process, which is generally referred to as model evaluation
or error estimation, can indeed permeate the entire design process. This is because
depending on whether we achieve an acceptable level of performance or not, we may
need to make adjustments in any stages of the design process or even in the data
collection stage. There are different ways to measure the predictive performance of
a trained predictive model. In the context of classification, the accuracy estimate
(simply referred to as the accuracy) is perhaps the most intuitive way to quantify the
predictive performance. To estimate the accuracy, one way is to use test data; that is, data for which, similar to the training data, the class of each observation is known but which, in contrast with the training data, has not been used in any way in training the classifier. The classifier is then used to classify all observations in the test data.
The accuracy estimate is then the proportion of correctly classified observations in
the test data.
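The sketch below shows this computation with scikit-learn; the feature vectors and labels are again random placeholders, so the reported accuracy is meaningless and only the mechanics matter:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
Theta_train, y_train = rng.standard_normal((60, 9)), rng.integers(0, 2, size=60)
Theta_test, y_test = rng.standard_normal((40, 9)), rng.integers(0, 2, size=40)

knn = KNeighborsClassifier(n_neighbors=3).fit(Theta_train, y_train)
y_pred = knn.predict(Theta_test)               # the test data was never used in training
print("accuracy:", accuracy_score(y_test, y_pred))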
In our application, the EEG “Small Dataset” already contains a test data for each
subject. Each subject-specific test data includes 10 trials with similar specifications
stated in Section 1.3.1 for the training data. Fig. 1.6 shows the test data that includes
10 EEG recordings over C1 channel for the two subjects. We use similar segmentation
and feature extraction that were used to create Sθ1 , Sθ3 , Sθ16 to transform the test data
into a similar form denoted Sθ,te θ,te θ,te
1 , S3 , S16 , respectively. Let ψc (θ), c = 1, 3, 16
denote the (3NN) classifier trained in the previous section using Sθc to classify a θ,
which is obtained from a given signal segment x. We should naturally evaluate ψc (θ)
on its compatible test data, which is Sθ,tec . The result of this evaluation shows that
ψ1 (θ), ψ3 (θ), and ψ16 (θ) have an accuracy of 60.0%, 66.7%, and 57.5%, respectively.
This observation shows that for this specific dataset, a 3NN classifier trained with a
segmentation size of 333.3 msec achieved a higher accuracy than the same classifier
trained when segmentation size is 1000 or 62.5 msec.
So far the duration of decision-making is the same as the segmentation size. For
example, a classifier trained using Sθ3 labels a given signal of 333.3 msec duration
as control or alcoholic. However, if desired we may extend the duration of our
decision-making to the trial duration (1000 msec) by combining decisions made by
the classifier over three consecutive signal segments of 333.3 msec. In this regard, a
straightforward approach is to take the majority vote among decisions made by the
classifier over the three signal segments that make up a single trial. In general, this
ensemble classifier, denoted ψE,c (θ), is given by
ψE,c(θ) = arg max_{y ∈ {0,1}} Σ_{j=1}^{c} I{ψc(θj) = y},                      (1.3)
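where I{·} denotes the indicator function (1 when its condition holds and 0 otherwise). A minimal sketch of this majority vote over the c segment-level decisions, with made-up decisions, is:
import numpy as np

def psi_ensemble(segment_decisions):
    # segment_decisions: 0/1 labels produced by psi_c on the c segments of one trial
    counts = np.bincount(np.asarray(segment_decisions), minlength=2)
    return int(np.argmax(counts))   # arg max over y in {0, 1} of the vote counts
                                    # (for even c, np.argmax breaks ties toward label 0)

print(psi_ensemble([1, 0, 1]))      # c = 3 segments -> the trial is labeled 1 (alcoholic)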
1.4 Python in Machine Learning and Throughout This Book
Table 1.1 shows the result of this survey on the estimated usage of the top four machine learning frameworks. All of these production-ready software packages are either written primarily in Python (e.g., Scikit-Learn and Keras) or have a Python API (TensorFlow and xgboost). The wide usage and popularity of Python has made it the lingua franca
among many data scientists (Müller and Guido, 2017).
Table 1.1: Estimated usage of the top four machine learning frameworks.
Software       Usage
Scikit-Learn   82.3%
TensorFlow     52.6%
xgboost        47.9%
Keras          47.3%
Fig. 1.5: Modeling the first EEG trial for the control subject (the signal at the bottom
of Fig. 1.3a) for different segmentation size. The solid line in each figure is the
original signal collected in the trial. Vertical dashed lines show segmented signals.
The (red) dashed lines show the estimate of each segmented signal obtained using
the sinusoidal model with M = 3. Depending on the segmentation size, three cases
are considered: (a) one trial, one segment; (b) one trial, three segments (segment
size = 333.3 msec); and (c) one trial, 16 segments (segment size = 62.5 msec).
Fig. 1.6: Test data: the EEG recordings of C1 channel for all 10 trials for (a) control
subject; and (b) alcoholic subject.
method, and attributes. Then, in Chapter 2, readers will start from the basics of Python
programming and work their way up to more advanced topics in Python. Chapter 3
will cover three important Python packages that are very useful for data wrangling
and analysis: numpy, pandas, and matplotlib. Nevertheless, Chapters 2 and 3 can
be entirely skipped if readers are already familiar with Python programming and
these three packages.
Chapter 2
Getting Started with Python
Simplicity and readability along with a large number of third-party packages have
made Python the “go-to” programming language for machine learning. Given such
an important role of Python for machine learning, this chapter provides a quick
introduction to the language. The introduction is mainly geared toward those who
have no knowledge of Python programming but are familiar with programming principles in another (object-oriented) language; for example, we assume readers are familiar with the purpose of binary and unary operators, repetitive and conditional statements, functions, classes (a recipe for creating an object), objects (instances of a class), methods (the actions that an object can perform), and attributes (properties of a class or object).
2.1 First Things First: Installing What Is Needed
Install Anaconda! Anaconda is a data science platform that comes with many things
that are required throughout the book. It comes with a Python distribution, a package
manager known as conda, and many popular data science packages and libraries such
as NumPy, pandas, SciPy, scikit-learn, and Jupyter Notebook. This way we don’t
need to install them separately and manually (for example, using “conda” or “pip”
commands). To install Anaconda, refer to (Anaconda, 2023). There are also many tutorials and videos online that can help with the installation.
For example, we can create a new environment, named myenv, using the following command:
conda create --name myenv python=3.10.9
Executing this command will download and install the desired package (here
Python version 3.10.9). To work with the environment, we should first acti-
vate it as follows:
conda activate myenv
Then we can install other packages that we need throughout the book:
• install “tensorflow” (required for Chapters 13-15):
conda install tensorflow
This will also install a number of other packages (e.g., numpy and
scipy) that are required for tensorflow to work (i.e., its dependencies).
2.2 Jupyter Notebook
Python is a programming language that is interpreted rather than compiled; that is, we can run code line by line. Although there are several IDEs (short for integrated development environment) that can be used to write and run Python code, the IDE of choice in this book is Jupyter Notebook. Using Jupyter Notebook allows us to write and run code and, at the same time, combine it with text and graphics. In fact, all materials for this book are developed in Jupyter notebooks. Once Anaconda is installed, we can launch Jupyter Notebook either from the Anaconda Navigator panel or from a terminal.
In a Jupyter notebook, everything is part of cells. Code and markdown (text) cells
are the most common cells. For example, the text we are now reading is written in a
text cell in a Jupyter notebook. The following cell is a code cell and will be presented
in gray boxes throughout the book. To run the contents of a code cell, we can click
on it and press “shift + enter”. Try it on the following cell and observe a 7 as the
output:
2 + 5
• First we notice that when the above cell is selected in a Jupyter notebook, a
rectangle with a green bar on the left appears. This means that the cell is in
“edit” mode so we can write codes in that cell (or simply edit if it is a Markdown
cell).
• Now if “esc” key (or “control + M”) is pressed, this bar turns blue. This means
that the cell is now in “command” mode so we can edit the notebook as a whole
without typing in any individual cell.
• Also depending on the mode, different shortcuts are available. To see the list of
shortcuts for different modes, we can press “H” key when the cell is in command
mode; for example, some useful shortcuts in this list for command mode are:
“A” key for adding a cell above the current cell; “B” key for adding a cell below
the current cell; and pressing “D” two times to delete the current cell.
2.3 Variables
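For example, the following cell creates two scalar variables and multiplies them (the particular values are an illustrative assumption consistent with the output shown below):
x = 2.2
y = 2
x * y  # the value of the last expression in a cell is displayed as its output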
4.4
In the above example, we created two scalar variables x and y, assigned them some values, and multiplied them. There are different types of scalar variables in Python
that we can use:
x = 1 # an integer
x = 0.3 # a floating-point
x = 'what a nice day!' # a string
x = True # a boolean variable (True or False)
x = None # None type (the absence of any value)
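The built-in type function returns the type of a variable; for example, a cell along the following lines (shown here as a sketch printing only the type names) produces the output below:
print(type(0.3).__name__)   # float
print(type(True).__name__)  # bool
print(type(None).__name__)  # NoneType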
float
bool
NoneType
Here we look at an attribute of an integer to show that in Python even numbers are
objects:
(4).imag
The imag attribute extracts the imaginary part of a number in its complex-domain representation. The parentheses, though, are needed here because otherwise the dot would be confused with a decimal point in a floating-point literal.
2.4 Strings
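Strings can be defined with either single or double quotes, which makes it easy to embed one kind of quote inside the other. For example, a cell along these lines produces the output below:
print('This is a string')
print("Well, this is 'string' too!")
print('Johnny said: "How are you?"')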
This is a string
Well, this is 'string' too!
Johnny said: "How are you?"
We can use \t and \n to add tab and newline characters to a string, respectively:
print("Here we use a newline\nto go to the next\t line")
7
3
10
2.5
2
1
25
False
True
True
False
True
False
False
print('Hello' in 'HellOWorlds!')
print(320 in ['Hi', 320, False, 'Hello'])
True
False
True
2.6 Built-in Data Structures
Python has a number of built-in data structures that are used to store multiple data
items as separate entries.
2.6.1 Lists
Perhaps the most basic collection is a list. A list is used to store a sequence of objects
(so it is ordered). It is created by a sequence of comma-separated objects within [ ]:
x = [5, 3.0, 10, 200.2]
x[0] # the index starts from 0
[34, False]
In addition, lists are mutable, which means they can be modified after they are
created. For example,
x[4] = 52 # here we change one element of list x
x
y = [9, 0, 4, 2]
print(x + y) # to concatenate two lists, + operator is used
print(y * 3) # to concatenate multiple copies of the same list, the * operator is used
Accessing the elements of a list by indexing and slicing: We can use indexing to
access an element within a list (we already used it before):
x[3]
True
To access the elements of nested lists (list of lists), we need to separate indices
with square brackets:
z[1][0] # this way we access the second element within z, and within that we access the first element
'JupytherNB'
x[-1] # index -1 returns the last item in the list; -2 returns the second item from the end, and so forth
75
x[-2]
Slicing is used to access multiple elements in the form of a sub-list. For this
purpose, we use a colon to specify the start point (inclusive) and end point (non-
inclusive) of the sub-list. For example:
x[0:4] # the last element seen in the output is at index 3
In this format, if we don’t specify a starting index, Python starts from the beginning
of the list:
x[:4] # equivalent to x[0:4]
Similarly, if we don’t specify an ending index, the slicing includes the end of the list:
x[4:]
[52, 2, 75]
[2, 75]
Another useful type of slicing is using [start:stop:stride] syntax where the stop
denotes the index at which the slicing stops (so it is not included), and the stride is
just the step size:
x[0:4:2] # steps of 2
['JupytherNB', None]
In the above example, we start from 4 but because the stride is negative, we go
backward to the beginning of the list with steps of 2. In this format, when the stride
is negative and the “stop” is not specified, the “stop” becomes the beginning of the
list. This behavior would make sense if we observe that to return elements of a list
backward, the stopping point should always be less than the starting point; otherwise,
an empty list would have been returned.
Modifying elements in a list: As we saw earlier, one way to modify existing
elements of a list is to use indexing and assign that particular element to a new value.
However, there are other ways we may want to use to modify a list; for example, to
append a value to a list or to insert an element at a specific index. Python has some
methods for these purposes. Here we present some examples to show these methods:
[0, 2, 4, 9]
True
del x[1] # the del statement can also be used to delete an element from a list by its index
Copying a List: It is often desired to make a copy of a list and work with it without
affecting the original list. In these cases if we simply use the assignment operator,
we end up changing the original list! Suppose we have the following list:
list1 = ['A+', 'A', 'B', 'C+']
list2 = list1
list2
list2.append('D')
print(list2)
print(list1)
As seen in the above example, when ‘D’ is appended to list2, it is also appended
at the end of list1. One way to understand this behavior is that when we write list2
= list1, in fact what happens internally is that variable list2 will point to the
same container as list1. So if we modify the container using list2, that change
will appear if we access the elements of the container using list1.
There are three simple ways to properly copy the elements of a list: 1) slicing; 2)
copy() method; and 3) the list() constructor. They all create shallow copies of
a list (in contrast with deep copies). Based on Python documentation Python-copy
(2023):
A shallow copy of a compound object such as list creates a new compound object and
then adds references (to the objects found in the original object) into it. A deep copy of a
compound object creates a new compound object and then adds copies of the objects found
in the original object.
Further implications of these statements are beyond our scope and readers are en-
couraged to see other references (e.g., (Ramalho, 2015, pp. 225-227)). Here we
examine these approaches to create list copies.
list3 = list1[:] # the use of slicing; that is, using [:] we make a shallow copy of the entire list1
list3.append('E')
print(list3)
print(list1)
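To see the documentation of the list type itself (its constructor and methods), we can use the built-in help function, the beginning of whose output is shown below:
help(list)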
class list(object)
| list(iterable=(), /)
|
2.6.2 Tuples
A tuple is another data structure in Python that, similar to a list, can hold arbitrary data types. However, the main difference between tuples and lists is that a tuple is immutable; that is, once it is created, its size and contents cannot be changed.
A tuple looks like a list except that to create them, we use parentheses ( ) instead
of square brackets [ ]:
tuple1 = ('Machine', 'Learning', 'with', 'Python', '1.0.0')
tuple1
Once a tuple is created, we can use indexing and slicing just as we did for a list
(using square brackets):
tuple1[0]
'Machine'
tuple1[::2]
Because we cannot change the contents of tuples, there is no append or remove method for tuples. Although we cannot change the contents of a tuple, we could redefine our entire tuple (assign a new value to the variable that holds the tuple):
tuple1 = ('Jupyter', 'NoteBook') # redefine tuple1
tuple1
('Jupyter', 'NoteBook')
A common use of tuples is in functions that return multiple values. For example,
the modf() function from the math module (more on functions and modules later) returns
a two-item tuple including the fractional part and the integer part of its input:
from math import modf # more on "import" later; for now, just read this as "from the math module, import the modf function" so that modf is available in our program
a = 56.5
modf(a) # the function is returning a two-element tuple
(0.5, 56.0)
We can assign these two return values to two variables as follows (this is called
sequence unpacking and is not limited to tuples):
x, y = modf(a)
print("x = " + str(x) + "\n" + "y = " + str(y))
x = 0.5
y = 56.0
Now that we discussed sequence unpacking, let us examine what sequence packing
is. In Python, a sequence of comma separated objects without parentheses is packed
into a tuple. This means that another way to create the above tuple is:
tuple1 = 'Machine', 'Learning', 'with', 'Python', '1.0.0' # sequence packing
tuple1
Note that in the above example, we first packed 'Machine', 'Learning', 'with',
'Python', '1.0.0' into tuple1 and then unpacked them into x, y, z, v, w. Python allows us to do this in one step as follows (also known as multiple assignment, which is really
a combination of sequence packing and unpacking):
x, y, z, v, w = 'Machine', 'Learning', 'with', 'Python', '1.0.0'
print(x, y, z, v, w)
One last note about sequence packing for tuples: if we want to create a one-element
tuple, the comma is required (why?):
tuple3 = 'Machine', # remove the comma and see what would be the type here
type(tuple3)
tuple
2.6.3 Dictionaries
A dictionary is a useful data structure that contains a set of values where each value
is labeled by a unique key (if we duplicate keys, the second value wins). We can
think of a dictionary data type as a real dictionary where the words are keys and
the definitions of words are the values, but there is no order among keys or values. As for the keys, we can use any immutable Python built-in type such as string, integer, float, boolean, or even tuple, as long as the tuple does not include a mutable object. In technical terms, the keys should be hashable, but the details of how a dictionary is implemented under the hood are out of scope here.
Dictionaries are created using a collection of key:value pairs wrapped within
curly braces { } and are non-ordered:
dict1 = {1:'value for key 1', 'key for value 2':2, (1,0):True, False:[100,50], 2.5:'Hello'}
dict1
The above example is only for demonstration to show the possibility of using im-
mutable data types for keys; however, keys in a dictionary are generally short and more
uniform. The items in a dictionary are accessed using the keys:
dict1['key for value 2']
The non-ordered nature of dictionaries allows fast access to their elements regardless of their size. However, this comes at the expense of significant memory overhead (because, internally, an additional sparse hash table is used). Therefore, we should think of dictionaries as a trade-off between memory and time: as long as they fit in memory, they provide fast access to their elements.
In order to check the membership among keys, we can use the keys() method to
return a dict_keys object (it provides a view of all keys) and check the membership:
(1,0) in dict1.keys()
True
This could also be done with the name of the dictionary (the default behaviour is to
check the keys, not the values):
(1,0) in dict1 # equivalent to: in dict1.keys()
True
In order to check the membership among values, we can use the values()
method to return a dict_values object (it provides a view of all values) and check
the membership:
"Hello" in dict1.values()
True
{'Country': 'USA',
'phone_numbers': {'Police': 102, 'Fire': 101, 'Gas': 104},
'population_million': 18.7}
2.6.4 Sets
Sets are collections of non-ordered, unique, and immutable objects. They can be defined
similarly to lists and tuples but using curly braces. Similar to mathematical set
operations, they support union, intersection, difference, and symmetric difference.
Here we define two sets, namely, set1 and set2 using which we examine set
operations:
set1 = {'a', 'b', 'c', 'd', 'e'}
set1
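The contents of set2 shown next are an assumption chosen for illustration; with it, the usual set operations can be written with operators (or the equivalent methods such as intersection and union):
set2 = {'b', 'c', 'f', 'g'}  # a hypothetical second set
print(set1 & set2)           # intersection; similarly, set1 | set2 (union),
                             # set1 - set2 (difference), set1 ^ set2 (symmetric difference)
'a' in set1                  # membership test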
{'b', 'c'}
True
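One more remark concerns sequence unpacking with the * operator. As a sketch (the particular values are an assumption), consider:
x, *y, z, w = ['Machine', 'Learning', 'with', 'Python', '1.0.0']
print(x, y, z, w)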
As seen in this example, variable y becomes a list of ‘Learning’ and ‘with’. Note
that the values of the other variables are set by their positions. Here, * is working as an
operator to implement extended iterable unpacking. We will discuss what an iterable
or iterator is more formally in Section 2.7.1 and Section 2.9.1 but, for the time being,
it suffices to know that any list or tuple is an iterable object. To better understand
the extended iterable unpacking, we first need to know how * works as the iterable
unpacking.
We may use * right before an iterable in which case the iterable is expanded into
a sequence of items, which are then included in a second iterable (for example, list
or tuple) when unpacking is performed. Here is an example in which * operates on
an iterable, which is a list, but at the site of unpacking, we create a tuple.
*[1,2,3], 5
(1, 2, 3, 5)
Here is a similar example but this time the “second” iterable is a list:
[*[1,2,3], 5]
[1, 2, 3, 5]
And the same unpacking can be used inside a set:
{*[1,2,3], 5}
{1, 2, 3, 5}
File "/var/folders/vy/894wbsn11db_lqf17ys9fvdm0000gn/T/ipykernel_71178/
,→386627056.py", line 1
*[1,2,3]
ˆ
SyntaxError: can't use starred expression here
The above example raises an error. This is because iterable unpacking can be only
used in certain places. For example, it can be used inside a list, tuple, or set. It can
be also used in list comprehension (discussed in Section 2.7.2) and inside function
definitions and calls (more on functions in Section 2.8.1).
A mechanism known as extended iterable unpacking that was added in Python 3
allows using * operator in an assignment expression; for example, we can have state-
ments such as a, b, *c = some-sequence or *a, b, c = some_sequence.
This mechanism makes * a “catch-all” operator; that is to say, any sequence item
that is not assigned to a variable is assigned to the variable following *. That is
the reason that in the example presented earlier, y becomes a list of ‘Learning’ and
‘with’ and the values of the other variables are assigned by their positions. Furthermore,
even if we change the list on the right to a tuple, y is still a list:
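# a sketch of the statement being referred to (the book's own cell is not shown here):
x, *y, z, v = ('Machine', 'Learning', 'with', 'Python', '1.0.0')
print(y, type(y))
['Learning', 'with'] <class 'list'>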
The for loop statement in Python allows us to loop over any iterable object. What is a
Python iterable object? An iterable is any object capable of returning its members
one at a time, permitting it to be iterated over in a for loop. For example, any
sequence such as list, string, and tuple, or even non-sequential collections such as
sets and dictionaries are iterable objects. To be precise, to loop over some iterable
objects, Python creates a special object known as iterator (more on this in Section
2.9.1); that is to say, when an object is iterable, under the hood Python creates the
iterator object and traverses through the elements. The structure of a for loop is
quite straightforward in Python:
for variable in X:
    the body of the loop
In the above representation of the for loop structure, X should be either an iterator
or an iterable (because it can be converted to an iterator object). It is important to
notice the indentation. Yes, in Python indentation is meaningful! Basically, code
blocks in Python are identified by indentation. At the same time, any statement that
should be followed by an indented code block is followed by a colon : (notice the :
before the body of the loop). For example, to iterate over a list:
for x in list1:
    print(x)
A+
A
B
C+
D
key = 1
value = machine
key = 2
value = learning
key = 3
value = with python
key = 1
value = machine
key = 2
value = learning
key = 3
value = with python
When looping through a dictionary, it is also possible to fetch the keys and values
at the same time. For this purpose, we can use the items() method. This method
returns a dict_items object, which provides a view of all (key, value) tuples
in a dictionary. Next we use this method along with sequence unpacking for a more
Pythonic implementation of the above example. A code pattern is generally referred
to as Pythonic if it uses patterns known as idioms, which are in fact code
conventions accepted by the Python community (e.g., the sequence packing and
unpacking we saw earlier are examples of idioms):
for key, val in dict2.items():
    print('key =', key)
    print('value =', val)
    print()
key = 1
value = machine
key = 2
value = learning
key = 3
value = with python
i = 0
i = 1
i = 2
i = 3
i = 4
i = 3
i = 4
i = 5
i = 6
i = 7
i = 3
i = 5
i = 7
When looping through a sequence, it is also possible to fetch the indices and
their corresponding values at the same time. The Pythonic way to do this is to use
enumerate(iterable, start=0), which returns an iterator object that provides
access to indices and their corresponding values in the form of two-element
(index, value) tuples:
for i, v in enumerate(list6):
    print(i, v)
0 Machine
1 Learning
2 with
3 Python
4 1.0.0
0 Machine
1 Learning
2 with
3 Python
4 1.0.0
1 Machine
2 Learning
3 with
4 Python
5 1.0.0
for i, v in enumerate(tuple1):
    print(i, v)
0 Machine
1 Learning
2 with
3 Python
4 1.0.0
enumerate() can also be used with sets and dictionaries but we have to remember
that these are non-ordered collections so in general it does not make much sense to
fetch an index unless we have a very specific application in mind (e.g., sort some
elements first and then fetch the index).
Another example of a Python idiom is the use of zip() function, which creates
an iterator that aggregates two or more iterables, and then loops over this iterator.
list_a = [1,2,3,4]
list_b = ['a','b','c','d']
for item in zip(list_a,list_b):
    print(item)
(1, 'a')
(2, 'b')
(3, 'c')
(4, 'd')
We mentioned previously that one way to create dictionaries is to use the dict()
constructor, which works with any iterable object as long as each of its elements is itself
an iterable containing two objects. Assume we have a name_list of three persons, John, James,
Jane. Another list called phone_list contains their numbers that are 979, 797, 897
for John, James, and Jane, respectively. We can now use dict() and zip to create
a dictionary where keys are names and values are numbers:
name_list = ['John', 'James', 'Jane']
phone_list = [979, 797, 897]
dict3 = dict(zip(name_list, phone_list)) # it works because here we use zip on two lists; therefore, each element of the iterable has two objects
dict3
list_odd = []
for i in range(1, 21):
    if i % 2 != 0:
        list_odd.append(i**2)
list_odd
List comprehension allows us to collapse all of this code into one line, combining
the list creation, the for loop, the condition, and the appending:
list_odd_lc = [i**2 for i in range(1, 21) if i%2 !=0]
list_odd_lc
In a list comprehension, we can also use nested for loops and conditions. For example, to create a list of all (i, j) pairs with i and j between 0 and 2 such that i and j are not equal:
list_non_equal_tuples = [(i, j) for i in range(3) for j in range(3) if i != j]
list_non_equal_tuples
[(0, 1),
(0, 2),
(1, 0),
(1, 2),
(2, 0),
(2, 1)]
2.7.3 if-elif-else
2.8.1 Functions
Functions are simply blocks of code that are named and do a specific job. We can
define a function in Python using def keyword as follows:
def subtract_three_numbers(num1, num2, num3):
    result = num1 - num2 - num3
    return result
x = subtract_three_numbers(10, 3.0, 1)
print(x)
6.0
In the above example, we called the function using positional arguments (also
sometimes simply referred to as arguments); that is to say, Python matches the
arguments in the function call with the parameters in the function definition by the
order of arguments provided (the first argument with the first parameter, second
with second, . . . ). However, Python also supports keyword arguments in which the
arguments are passed by the parameter names. In this case, the order of keyword
arguments does not matter as long as they come after any positional arguments (and
note that the definition of function remains the same). For example, this is a valid
code:
x = subtract_three_numbers(num3 = 1, num1 = 10, num2 = 3.0)
print(x)
6.0
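The definition of string_func is not shown in this excerpt; a sketch consistent with the call and output below (returning the length, the upper-cased string, and the capitalized string) would be:
def string_func(s):
    return len(s), s.upper(), s.capitalize()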
x, y, z = string_func('coolFunctions') # unpacking
print(x, y, z)
13 COOLFUNCTIONS Coolfunctions
If we pass an object to a function and within the function the object is modified
(e.g., by calling a method of that object that changes it), the changes
will be permanent (and nothing needs to be returned):
def list_mod(inp):
    inp.insert(1, 'AB')
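A short usage sketch (list0 is a name introduced here only for illustration) shows that the modification persists outside the function:
list0 = ['A', 'B', 'C']
list_mod(list0)
list0   # ['A', 'AB', 'B', 'C']; the inserted element is visible after the call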
If we use this way to load an entire module, then any function within the module
will be available throughout the program as:
module_name.function_name()
However, to avoid writing the name of the module each time we want to use the
function, we can ask Python to import a function (or multiple functions) directly.
In this approach, the syntax is:
from module_name import function_name1, function_name2, ...
Let us consider our grocery_to_do function that was defined earlier. Suppose
we create a module called grocery.py and add this function into that file. What
if our program grows to the extent that we want to also separate and keep multiple
modules? In that case, we can create a package. As modules are files containing
functions, packages are folders containing modules. For example, we can create
a folder (package) called mlwp and place our module grocery.py in that folder.
Similar to what was discussed earlier, we have various options to import our pack-
age/module/function:
import mlwp.grocery
This import makes the mlwp.grocery module available. However, to access the
function, we need to use the full name of this module before the function. Another
way:
from mlwp import grocery
This import makes the module grocery available with no need for the package
name. Here we still need to use the name of grocery to access grocery_to_do.
Another way:
from mlwp.grocery import grocery_to_do
This import makes grocery_to_do function directly available (so no need to use
the package or module prefix).
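As a concrete sketch of what this looks like on disk and in use (the contents of grocery.py are not shown in this excerpt, and grocery_to_do is assumed here to take no arguments):
# assumed folder layout:
#   mlwp/
#       __init__.py    (can be empty)
#       grocery.py     (defines grocery_to_do)

import mlwp.grocery
mlwp.grocery.grocery_to_do()            # the full module name is required

from mlwp.grocery import grocery_to_do
grocery_to_do()                         # the function is now directly available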
2.8.3 Aliases
If we assign an alias to a function, we cannot use the original name anymore. It
is also quite common to assign an alias to a module; for example, here we give an
alias to our grocery module:
from mlwp import grocery as gr
As a result, rather than typing the entire name of the module before the function
(grocery.grocery_to_do), we simply use gr.grocery_to_do. If desired, we
can assign alias to a package. For example,
import mlwp as mp
2.9 Iterator, Generator Function, and Generator Expression
2.9.1 Iterator
As mentioned before, within a for loop Python creates an iterator object from
an iterable such as list or set. In simple terms, an iterator provides the required
functionality needed by the loop (in general by the iteration protocol). But a few
questions that we need to answer are the following:
• How can we produce an iterator from an iterable?
• What is the required functionality in an iteration that the iterator provides?
Creating an iterator from an iterable object is quite straightforward. It can be
done by passing the iterable as the argument of the built-in function iter():
iter(iterable). An iterator object itself represents a stream of data and provides
access to the next object in this stream. This is doable because this special
object has a specific method __next__() that retrieves the next item in this stream
(this is also doable by passing an iterator object to the built-in function next(),
which actually calls the __next__() method). Once there is no more data in the
stream, __next__() raises a StopIteration exception, which means the iterator
is exhausted and no further item is produced by the iterator (and any further call to
__next__() will raise a StopIteration exception). Let us examine these concepts
starting with an iterable (here a list):
list_a = ['a', 'b', 'c', 'd']
iter_a = iter(list_a)
iter_a
<list_iterator at 0x7f9c8d42a6a0>
'a'
'b'
next(iter_a)
'c'
next(iter_a)
'd'
Now that the iterator is exhausted, one more call raises StopIteration exception:
next(iter_a)
----> 1 next(iter_a)
StopIteration:
Let us examine what was discussed before in Section 2.7.1; that is, the general
structure of a for loop, which is:
for variable in X:
    the body of the loop
where X is either an iterator or an iterable. What happens as part of the for loop
is that iter() is applied to X so that if X is an iterable, an iterator is created.
Then next() is applied repeatedly to the iterator until it is exhausted, at which
point the loop ends.
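To make this mechanism concrete, the following sketch (an illustration, not code from the book) shows roughly what Python does under the hood for a for loop over a list:
X = ['a', 'b', 'c', 'd']
it = iter(X)                  # create an iterator from the iterable
while True:
    try:
        variable = next(it)   # fetch the next item from the stream
    except StopIteration:     # the iterator is exhausted
        break
    print(variable)           # the body of the loop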
Generator functions are special functions in Python that simplify writing custom
iterators. A generator function returns an iterator, which can be used to generate a
stream of data on the fly. Let us clarify this statement by a few examples.
Example 2.1 In this example, we write a function that generates the sums of the
first n integers and then use the list returned from this function in another for loop to
do something with each of the elements of this list (here we simply print the element,
but of course it could be a more complex task):
def first_n_sum_func(n):
    sum_list_num, num = [], 1  # create an empty list to hold the values of the sums
    sum_num = sum(sum_list_num)
    while num <= n:
        sum_num += num
        sum_list_num.append(sum_num)
        num += 1
    return sum_list_num
for i in first_n_sum_func(10):
    print(i, end = " ")  # the end parameter is used to replace the default newline character at the end
1 3 6 10 15 21 28 36 45 55
In Example 2.1 we created a list within the function to hold the sums, returned
this iterable, and used it in a for loop to iterate over its elements for printing. Note
that if n becomes very large, we need to store all the sums in the memory (i.e., all
the elements of the list). But what if we just need these elements to be used once?
For this purpose, it is much more elegant to use generators. To create a generator,
we can replace return with yield keyword:
def first_n_sum_gen(n):
    sum_num, num = 0, 1
    while num <= n:
        sum_num += num
        num += 1
        yield sum_num
for i in first_n_sum_gen(10):
    print(i, end = " ")
1 3 6 10 15 21 28 36 45 55
• item 3: The first time the __next__() method (equivalently, the next() func-
tion) is called on the generator object, the function starts execution until it
reaches the first yield statement. At this point the yielded value is returned, all
local variables in the function are stored, and the control is passed to the calling
environment;
• item 4: Once the __next__() method is called again on the generator, the
function execution resumes where it left off until it reaches the yield again,
and this process continues;
• item 5: Once the function terminates, StopIteration is raised.
To understand the above examples, we look at another example to examine the effect
of these points, and then go back to Example 2.1.
If we have an iterable and we would like to print out its elements, one
way is of course what was done before in Example 2.1 to use a for loop.
However, as discussed in Section 2.6.5, an easier approach is to use * as
the iterable unpacking operator in the print function call (and print is a
function that is defined in a way to handle that):
print(*first_n_sum_gen(10)) # compare with print(first_n_sum_gen(10))
1 3 6 10 15 21 28 36 45 55
1 3 6 10 15 21 28 36 45 55
Example 2.2 Here we create the following generator function and call next() func-
tion several times:
def test_gen():
    i = 0
    print("Part A")
    yield i
    i += 2
    print("Part B")
    yield i
    i += 2
    print("Part C")
    yield i
Part A
Part B
Part C
StopIteration:
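The calls driving the generator above are not shown in this excerpt; a sketch that reproduces the printed parts and the final StopIteration would be (gen is a name used here only for illustration):
gen = test_gen()
next(gen)   # prints "Part A" and yields 0
next(gen)   # prints "Part B" and yields 2
next(gen)   # prints "Part C" and yields 4
next(gen)   # the function body has ended, so StopIteration is raised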
Example 2.1 (continued): Now we are in a position to understand everything
in the first_n_sum_gen generator function.
Based on what we mentioned before in Section 2.9.1, in a for loop the
__next__() is applied indefinitely to the iterator until it is exhausted (i.e., a
StopIteration exception is raised). Now based on the aforementioned item 2,
The same way listcomps are useful for creating simple lists, generator expressions
(known as genexps) are a handy way of creating simple generators. The general
syntax for genexps is similar to listcomps except that the square brackets [ ] are
replaced with parentheses ( ):
(expression for exp_1 in seq_1
if condition_1
for exp_2 in seq_2
if condition_2
...
for exp_n in seq_n
if condition_n)
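As a sketch (not the book's own cell), a generator expression that produces the same stream of sums as first_n_sum_gen(10), unpacked directly into print, could be written as:
print(*(sum(range(1, k + 1)) for k in range(1, 11)))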
1 3 6 10 15 21 28 36 45 55
Exercises:
a) what is x[:2:-1]?
b) explain what would be the “start” when it is not specified in slicing and the
stride is negative. Why does it make sense?
Which of the following creates a list with elements being the corresponding elements
of “a” added to 1?
A) a + 1
B) a + [1, 1, 1, 1]
C) [i+1 for i in a]
D) [a + 1]
a) Write a piece of code that loops through the values of this dictionary and makes
the first letter in each word upper case (hint: check out string.title() method).
The output should look like this:
value = Good
value = Morning
value = John
Create a third list, namely, list_c, with elements being ‘a1’, ‘b1’, ‘c1’. Then print
elements of zip(list_a, list_b, list_c). Guess what would be the general
Exercise 5: Suppose we have a list of students’ phone numbers and a tuple of students’
names:
phone = [979921, 989043, 933043, 932902]
student = ('jane', 'nika', 'kian', 'mira')
In a single line of code, create a dictionary with phones as keys and names as values
(hint: use zip()). Explain how the code works.
Exercise 6: Use a listcomp to create a list of three-element tuples where each
tuple contains a number between 0 and 9, its square, and its cube; that is,
[(0, 0, 0),
(1, 1, 1),
(2, 4, 8),
(3, 9, 27),
(4, 16, 64),
(5, 25, 125),
(6, 36, 216),
(7, 49, 343),
(8, 64, 512),
(9, 81, 729)]
Use the listcomp to find john’s belongings in one line of code. The output should
look like
['book', 'pen', 'backpack', 'car']
Hint: as part of the list comprehension, you can use the membership operator
discussed in Section 2.5.3 to check whether ‘john’ is part of each string in the list.
Exercise 9: Write a function named belongings that receives the first name and
the last name of a student as two arguments along with an arbitrary number of items
that she has in her backpack. The function should print the name and belongings of
the student. For example, here are two calls of this function and their outputs:
belongings('Emma', 'Smith', 'pen', 'laptop')
belongings('Ava', 'Azizi', 'pen', 'charger', 'ipad', 'book')
Exercise 11: Suppose there is a package called “package” that contains a module
called “module”. The “module” contains a function called “function(string)” that
expects a string. Which of the following is a legitimate way to access and use
function(string)? Choose all that apply.
A)
from package import module
function("hello")
B)
import package.module as pm
pm.function("hello")
C)
D)
from package import module as m
m.function("hello")
3.1 NumPy
NumPy is a fundamental Python package that provides efficient storage and opera-
tions for multidimensional arrays. In this regard, the core functionality of NumPy
is based on its ndarray data structure (short for n-dimensional array), which is
somewhat similar to the built-in list type but with homogeneous elements; in fact, the
restriction to homogeneous elements is the main reason ndarray can be
stored and manipulated efficiently. Hereafter, we refer to the ndarray type as NumPy
arrays or simply as arrays.
Working with NumPy arrays is quite straightforward. To do so, we need to first
import the numpy package. By convention, we use alias np:
import numpy as np
np
Importing numpy triggers importing many modules as part of that. This is done
by many imports inside __init__.py, which is used to make many useful
functions accessible at the package level. For example, to create an identity ma-
trix, which is a common operation, NumPy has a convenient function named
eye(). Once NumPy is imported, we automatically have access to this function
through np.eye(). How is this possible? By installing Anaconda, the NumPy pack-
age is available, for example, at a location such as this (readers can check the output
of the above cell to see the path on their systems):
/Users/amin/opt/anaconda3/lib/python3.8
Inside the numpy folder, we have __init__.py file. Basically once NumPy is
imported, as in any other regular package, __init__.py is automatically executed.
Inside __init__.py, we can see a line from . import lib—here “dot” means
from the current package, and lib is a subpackage (another folder). This way we
import subpackage lib and within that we have another __init__.py. Within this
file, we have from .twodim_base import *, which means importing everything
from twodim_base module (file twodim_base.py) in the current package (i.e., lib
subpackage). And finally within twodim_base.py, we have the function eye()!
To create a numpy array from a Python list or tuple, we can use np.array()
function:
list1 = [[1,2,3], [4,5,6]]
list1
a1 = np.array(list1)
a1
array([[1, 2, 3],
[4, 5, 6]])
numpy.ndarray
Here is a similar example but one of the numbers in the nested list is a floating-point:
list2 = [[1, 2, 3.6], [4, 5, 6]]
list2
a2 = np.array(list2)
a2
array([[1. , 2. , 3.6],
[4. , 5. , 6. ]])
In the above example, all integers became floating point numbers (double precision
[float64]). This is due to the homogeneity restriction on the elements of numpy arrays. As
a result of this restriction, numpy upcasts all elements when possible.
There are a few attributes of numpy arrays that are quite useful in practice.
• dtype: we can check the type of the elements of an array with dtype (short for
data type) attribute of the array. For example,
a2.dtype
dtype('float64')
array([[1. , 2. , 3.6],
[4. , 5. , 6. ]], dtype=float32)
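The float32 array shown above was presumably produced by specifying the data type when constructing the array; a sketch consistent with the output (a3 is the name used in the next cell) would be:
a3 = np.array(list2, dtype='float32')
a3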
Furthermore, once an array is constructed, we can recast its data type using the
astype() method:
a4 = a3.astype(int) # this produces a copy of the array while casting to the specified type
a4
array([[1, 2, 3],
[4, 5, 6]])
a3.dtype
dtype('float32')
a4.dtype
dtype('int64')
• ndim: returns the number of dimensions (also referred to as the number of axes)
of an array:
a4.ndim # a4 is a 2D array
2
• shape: returns a tuple containing the number of elements along each dimension (axis) of the array:
a4.shape
(2, 3)
There is a difference between a 1D array and a 2D array with one row or column.
Let us explore this with one example:
a5 = np.arange(5)
a5
array([0, 1, 2, 3, 4])
a5.shape
(5,)
However, the following array is a 2D array with 1 row and 5 columns (or, perhaps
more clearly, an array of 1 list with 5 elements):
a6 = np.array([[0, 1, 2, 3, 4]])
print(a6.shape)
a6
(1, 5)
array([[0, 1, 2, 3, 4]])
This time writing, for example, a6[3] does not make sense (we can have a6[0,3]
though). Its transpose is like a matrix of 5 × 1:
print(a6.T.shape) # taking the transpose of the array using .T and
,→looking into its shape
a6.T # this array has 5 rows and 1 column (or 5 lists each with 1
,→element)
(5, 1)
array([[0],
[1],
[2],
[3],
[4]])
• zeros() creates an array of zeros with a given shape (here with integer data type):
a = np.zeros((3,2), dtype=int)
a
array([[0, 0],
       [0, 0],
       [0, 0]])
• ones() creates an array of ones with a given shape:
a = np.ones((3,5))
a
• full() creates an array of a given shape filled with a specified value:
a = np.full((2,4), 5.5)
a
• eye() creates a 2D array with ones on the diagonal and zeros elsewhere (an identity matrix):
a = np.eye(5)
a
• arange() creates an array with evenly spaced values within a given interval (similar to the built-in range()):
a = np.arange(5)
a
array([0, 1, 2, 3, 4])
a = np.arange(3,8)
a
array([3, 4, 5, 6, 7])
• random.normal() creates an array of a given shape with elements drawn from a normal distribution (here with mean 0 and standard deviation 1):
a = np.random.normal(0, 1, size=(3,4))
a
A list of these and other functions can be found in the NumPy documentation (NumPy-
array, 2023).
Indexing: Suppose a is the following 4 × 3 array:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
Then,
a[1]
array([3, 4, 5])
Therefore, a[1] returns (provides a view of) a 1D array at the second position (index
1). To access the first element of this subarray, for example, we can treat the subarray
as an array that is subsequently indexed using [0]:
a[1][0]
We have already seen this type of indexing notation (separate square brackets)
for lists of lists in Section 2.6.1. However, NumPy also supports multidimensional
indexing, in which we do not need to separate each dimension index into its own square
brackets. This way we can access elements by comma-separated indices within a
single pair of square brackets [ , ]. This way of indexing numpy arrays is more efficient
than using separate square brackets because it does not return an entire subarray
to be subsequently indexed and, therefore, it is the preferred way of indexing. For
example:
a[1, 0]
Slicing: We can access multiple elements using slicing and striding in the same
way we have previously seen for lists and tuples. For example,
a[:3:2,-2:]
array([[1, 2],
[7, 8]])
In the code snippet below we choose columns with a stride of 2 and the last two rows
in reverse:
a[:-3:-1,::2]
array([[ 9, 11],
[ 6, 8]])
One major difference between numpy array slicing and list slicing is that array
slicing provides a view of the original array whereas list slicing copies the list. This
means that if a subarray, which comes from slicing an array, is modified, the changes
appear in the original array too (recall that in list slicing that was not the case). This
is quite useful when working with large datasets because it means that we can work
and modify pieces of such datasets without loading and working with the entire
dataset at once. To better see this behaviour, consider the following example:
b = a[:-3:-1,::2]
b[0,1] = 21 # change 11 in b to 21
a # the change appears in a too
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 21]])
To change this default behaviour, we can use: 1) the copy() method; and 2) the
array() constructor:
b = a[:-3:-1,::2].copy() # using copy()
b
array([[ 9, 21],
[ 6, 8]])
b[1,1] = 18
a # the change does not appear in a
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 21]])
b = np.array(a[:-3:-1,::2]) # using the array() constructor
b
array([[ 9, 21],
       [ 6,  8]])
b[1,1] = 18
a # the change does not appear in a
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 21]])
Fancy indexing: NumPy also supports indexing an array with an array (or list) of integer indices, which is known as fancy indexing. In the following example, c is an array of the integers from 10 to 29, idx is an array of indices, and c[idx] picks the corresponding elements of c:
c = [10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]
idx = [19 17 15 13 11]
c[idx] = [29 27 25 23 21]
1 idx = idx.astype('float32')
----> 2 c[idx] # observe that we changed the dtype of idx to float32 and we get an error
And here we examine the effect of fancy indexing with a negative index:
idx = idx.astype('int_')
idx[0] = -1 # observe how the negative index is interpreted
print('\n' + 'idx = ' + str(idx)\
+ '\n' + 'c[idx] = ' + str(c[idx]))
One major difference between numpy arrays and lists is that the size (number of
elements) of an array is fixed once it is created; that is to say, to shrink or expand an
existing array we need to create a new array and copy the elements from the old one
to the new one, which is computationally inefficient. However, as we discussed in
Section 2.6.1, expanding or shrinking a list was straightforward through the methods
defined for lists. Although the size of an array is fixed, its shape is not and can be
changed if desired. The most common way to reshape an array is to use the reshape()
method:
a = np.arange(20).reshape(4,5) # 4 elements in the first dimension (axis 0) and 5 in the second dimension (axis 1)
a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
The reshape() method provides a view of the array in a new shape. Let us examine
this by an example:
b = a.reshape(2,10)
b
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
b[0,9] = 29
a # observe that the change in b appears in a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 29],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
To use the reshape() method, we can specify the number of elements in each
dimension and we need to ensure that the total number of these elements is the same
as the size of the original array. However, because the size is fixed, the number of
elements in one axis could be determined from the size and the number of elements
in other axes (as long as they are compatible). NumPy uses this fact and allows us
not to specify the number of elements along one axis. To do so, we can use -1
along an axis and NumPy will determine the number of elements along that axis.
For example,
c = b.reshape(5,-1)
c
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 29, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]])
array([0, 1, 2, 3, 4])
array([[0],
[1],
[2],
[3],
[4]])
array([[0],
[1],
[2],
[3],
[4]])
array([[0, 1, 2, 3, 4]])
Computations using NumPy arrays can be either slow and verbose or fast and
convenient. The main factor that leads to both fast computation and convenience
of implementation is vectorized operations. These are vectorized wrappers for per-
forming element-wise operations and are generally implemented through NumPy
Universal Functions, also known as ufuncs, which are built-in functions writ-
ten in C for efficiency. In the following example, we will see how the use of ufuncs
improves computational efficiency.
Here we calculate the time to first create a numpy array called “a” of size 10000
and then subtract 1 from every element of a. We first use a conventional way of
doing this, which is a for loop:
%%timeit
a = np.arange(10000)
for i, v in enumerate(a):
    a[i] -= 1
3.55 ms ± 104 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now we use the subtraction operator “-”, which implements the vectorized operation
(applies the operation to every element of the array):
%%timeit
a = np.arange(10000)
a = a - 1 # vectorized operation using the subtraction operator
8.8 μs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note the striking difference in time! The vectorized operation is about 400 times
faster here. We can also implement this vectorized subtraction using the np.subtract
ufunc:
%%timeit
a = np.arange(10000)
a = np.subtract(a,1)
8.88 μs ± 192 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
All the previous ufuncs require two operands. As a result, they are referred to
as binary ufuncs; after all, they are implementations of binary operations. However,
there are other useful operations (and ufuncs) that are unary; that is, they operate
on a single array. Examples include calculating the sum or product of the elements
of an entire array or along a specific axis, or taking the element-wise natural
log, to name just a few. Table 3.2 provides a list of some of these unary ufuncs.
Below, we present some examples of these unary ufuncs:
a = np.array([[3,5,8,2,4],[8, 10, 2, 3, 5]])
a
array([[ 3, 5, 8, 2, 4],
[ 8, 10, 2, 3, 5]])
Then,
np.min(a, axis=0)
array([3, 5, 2, 2, 4])
np.sum(a, axis=1)
Table 3.2: A list of some unary ufuncs.

UFunc       Operation
np.sum      Sum of array elements over a given axis
np.prod     Product of array elements over a given axis
np.mean     Arithmetic mean over a given axis
np.std      Standard deviation over a given axis
np.var      Variance over a given axis
np.log      Element-wise natural log
np.log10    Element-wise base 10 logarithm
np.sqrt     Element-wise square root
np.sort     Sorts an array
np.argsort  Returns the indices that would sort an array
np.min      Returns the minimum value of an array over a given axis
np.argmin   Returns the index of the minimum value of an array over a given axis
np.max      Returns the maximum value of an array over a given axis
np.argmax   Returns the index of the maximum value of an array over a given axis
array([22, 28])
For some of these operations such as sum, min, and max, a simpler syntax is to use
the method of the array object. For example:
a.sum(axis=1) # here sum is the method of the array object
array([22, 28])
For each of these methods, we can check its documentation at numpy.org to see its
full functionality. For example, for np.sum and ndarray.sum, the documentation is
available at (NumPy-sum, 2023) and (NumPy-arrsum, 2023), respectively.
3.1.7 Broadcasting
Broadcasting is a set of rules that instructs NumPy how to treat arrays of different
shapes and dimensions during arithmetic operations. In this regard, under some
conditions, it broadcasts the smaller array over the larger array so that they have
compatible shapes for the operation. There are two main reasons why broadcasting
in general leads to efficient computations: 1) the looping required for vectorized
operations occurs in C instead of Python; and 2) it implements the operation without
copying the elements of the array, so it is memory-efficient.
As a matter of fact, we have already seen broadcasting in the previous section
where we used arithmetic operations on arrays. Let us now see some examples
and examine them from the standpoint of broadcasting. In the following example, a
constant 2 is added to every element of array a:
a = np.array([0, 1, 2, 3, 4])
b = a + 2
b
array([2, 3, 4, 5, 6])
One can think of this example as if the scalar 2 is stretched (broadcast) to create an
array of the same shape as a so that we can apply the element-by-element addition
(see Fig. 3.1). However, it is important to remember that NumPy neither makes a
copy of 2 nor creates an array of the same shape as a. It just uses the original copy
of 2 (so it is memory efficient) and, based on a set of rules, efficiently performs the
addition operation as if we had an array of 2s with the same shape as a.
Now consider an array a of ones with shape (2, 4) and a one-dimensional array b with
four elements:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
(2, 4)
[2 1 9 6]
(4,)
If we write a + b:
• By Rule 1, a single 1 is added to the left of the shape of b to make b a two-dimensional
array like a; that is, we can think of b now as having shape (1, 4).
• The shape of a is (2, 4) and the shape of b is (1, 4). Therefore, Rule 2 does not
raise an error because the sizes in each dimension are either equal (here 4) or at
least one of them is 1 (for example, 2 in a and 1 in b).
• Rule 3 indicates that the size of the output is (max{2,1}, max{4,4}), so it is (2,
4). This means that array b is stretched along its axis 0 (the left element in a
shape tuple) to match the size of a in that dimension.
Let’s see the outcome:
c = a + b
print(c)
c.shape
[[ 3. 2. 10. 7.]
[ 3. 2. 10. 7.]]
(2, 4)
In the next example, a is a three-dimensional array of shape (2, 4, 1), shown below, and
b is a one-dimensional array containing 10, 20, 30, and 40:
[[[0]
  [1]
  [2]
  [3]]

 [[4]
  [5]
  [6]
  [7]]]
(2, 4, 1)
If we write a + b:
• By Rule 1, a 1 is added to the left of the shape of b to make it a 3-dimensional array
like a; that is, we can think of b now as having shape (1, 1, 4).
• The shape of a is (2, 4, 1) and the shape of b is (1, 1, 4). Therefore, Rule 2 does
not raise an error because the sizes in each dimension are either equal or at least
one of them is 1.
• Rule 3 indicates that the size of the output is (max{2,1}, max{4,1}, max{1,4}),
so it is (2, 4, 4). This means that array a is stretched along its axis 2 (the third
element in its shape tuple) to make its size 4 (similar to b in that dimension), array
b is stretched along its axis 1 to make its size 4 (similar to a in that dimension),
and finally array b is stretched along its axis 0 to make its size 2.
Let’s see the outcome:
c = a + b
print(c)
c.shape
[[[10 20 30 40]
[11 21 31 41]
[12 22 32 42]
[13 23 33 43]]
[[14 24 34 44]
[15 25 35 45]
[16 26 36 46]
[17 27 37 47]]]
(2, 4, 4)
Finally, consider the case where a again has shape (2, 4, 1) and b is the following 2 × 2 array:
(2, 4, 1)
array([[10, 20],
       [30, 40]])
If we write a + b:
• By Rule 1, a 1 is added to the left of the shape of b to make it a 3-dimensional array
like a; that is, we can think of b now as having shape (1, 2, 2).
• The shape of a is (2, 4, 1) and the shape of b is (1, 2, 2). Therefore, Rule 2 raises
an error because the sizes in the second dimension (axis 1) are neither equal nor
at least one of them is 1.
Let’s see the error:
a + b
----> 1 a + b
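As a side note not taken from the text, NumPy's np.broadcast_shapes function reports the broadcast shape of its arguments (or raises an error when they are incompatible), which is a convenient way to double-check the rules applied above:
import numpy as np
np.broadcast_shapes((2, 4), (4,))        # (2, 4): the first example above
np.broadcast_shapes((2, 4, 1), (4,))     # (2, 4, 4): the second example
# np.broadcast_shapes((2, 4, 1), (2, 2)) # raises ValueError, as in the failing example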
3.2 Pandas
Pandas especially facilitates working with tabular data that can come in a variety
of formats. For this purpose, it is built around a type of data structure known as
DataFrame. The DataFrame, named after a similar data structure in the R programming
language, is indeed the most commonly used object in pandas and is similar to a
spreadsheet in which columns can have different types. However, understanding it
requires knowledge of a more fundamental data structure in pandas known as
Series.
3.2.1 Series
s = pd.Series(data, index=index)
where the argument index is optional and data can be of many different types, but
it is common to see an ndarray, a list, a dictionary, or even a constant.
If the data is a list or ndarray and index is not passed, the default values from 0
to len(data)-1 will be used. Here we create a Series from a list without providing
an index:
import pandas as pd
s1 = pd.Series(['a', [[0, 1, 2], [3, 4, 5]], 'b', [3.4, 4]])
s1
0 a
1 [[0, 1, 2], [3, 4, 5]]
2 b
3 [3.4, 4]
dtype: object
1 a
2 [[0, 1, 2], [3, 4, 5]]
3 b
4 [3.4, 4]
dtype: object
s2 = pd.Series(np.arange(2, 10))
s2
0 2
1 3
2 4
3 5
4 6
5 7
6 8
7 9
dtype: int64
We can access the values and indices of a Series object using its values and index
attributes:
s2.values
array([2, 3, 4, 5, 6, 7, 8, 9])
s2.index
RangeIndex(start=0, stop=8, step=1)
The square brackets and slicing that we saw for NumPy arrays can be used along
with index to access elements of a Series:
s2[2] # using index 2 to access its associated value
4
1 3
3 5
5 7
dtype: int64
Similar to NumPy arrays, the slicing provides a view of the original Series:
s3 = s2[2::2]
s3[2] += 10
s2
0 2
1 3
2 14
3 5
4 6
5 7
6 8
7 9
dtype: int64
2 14
4 6
6 8
dtype: int64
14
The above example would make more sense if we consider the index as a label
for the corresponding value. Thus, when we write s3[2], we try to access the value
that corresponds to label 2. This mechanism gives more flexibility to Series objects
because for the indices we can use other types rather than integers. For example,
s4 = pd.Series(np.arange(10, 15), index = ['spring', 'river', 'lake',
,→'sea', 'ocean'])
s4
spring 10
river 11
lake 12
sea 13
ocean 14
dtype: int64
river 11
sea 13
dtype: int64
However, when it comes to slicing, the selection process using integer indices
works similarly to NumPy. For example, in s3 where index had integer type,
s3[0:2] chooses the first two elements (similar to NumPy arrays):
s3[0:2]
2 14
4 6
dtype: int64
But at the same time, we can also do slicing using explicit index:
s4['spring':'ocean':2]
spring 10
lake 12
ocean 14
dtype: int64
Notice that when using the explicit index in slicing, both the start and the stop are
included (as opposed to the previous example where implicit indexing was used
in slicing and the stop was not included).
When a Series has integer indexing, the fact that slicing uses implicit
indexing could cause confusion. For example,
s = pd.Series(np.random.randint(1, 10, 7), index = np.arange(2, 9))
s
2 2
3 5
4 1
5 2
6 2
7 6
8 9
dtype: int64
s[0:3] # slicing uses implicit (positional) indexing
2 2
3 5
4 1
dtype: int64
To resolve such confusion, pandas provides two attributes, loc and iloc, for
Series and DataFrame objects with which we can specify the type of indexing:
loc uses explicit indexing and iloc uses implicit indexing:
s.iloc[0:3] # note that the value corresponding to "3" is not included
2 2
3 5
4 1
dtype: int64
s.loc[2:4] # explicit indexing; note that both the start and the stop are included
2 2
3 5
4 1
dtype: int64
As we said before, we can create Series from dictionaries as well; for example,
c 20
a 40
b 10
dtype: int64
b 10.0
c 20.0
d NaN
c 20.0
dtype: float64
In the above example, we tried to pull out a value for a non-existing key in the
dictionary (“d”). That led to NaN (short for Not a Number), which is the standard
marker used in pandas for missing data.
Finally, if we use a scalar in place of data, the scalar is repeated to match
the size of index:
s = pd.Series('hi', index = np.arange(5))
s
0 hi
1 hi
2 hi
3 hi
4 hi
dtype: object
3.2.2 DataFrame
{'one': a hi
b hi
c hi
d hi
dtype: object,
'two': a 1.0
b [yes, no]
c 3.0
d 5
dtype: object,
'three': a 40.0
b 10.0
c 20.0
d NaN
dtype: float64}
df = pd.DataFrame(d)
There are various ways to create a DataFrame but perhaps the most common
ways are:
• from a dictionary of Series
• from ndarrays or lists
• from a dictionary of ndarrays or lists
• from a list of dictionaries
Kazakhstan Japan
lake 10 80.0
pond 30 NaN
river 200 300.0
spring 1000 900.0
import numpy as np
import pandas as pd
a = np.array([[10,30,200,1000],[80, np.nan, 300, 900]]).T # to transpose
df = pd.DataFrame(a, index = ['lake', 'pond', 'river', 'spring'], columns = ['Kazakhstan','Japan'])
df
df
Kazakhstan Japan
lake 10.0 80.0
pond 30.0 NaN
river 200.0 300.0
spring 1000.0 900.0
• DataFrame from a dictionary of arrays or lists: To use this, the length of all
arrays/lists should be the same (and the same as index length if provided).
dict = {"Kazakhstan": [100, 200, 30, 10], "Japan": [900, 300, None, 80]}
df = pd.DataFrame(dict, index=['spring', 'river', 'pond', 'lake'])
df
Kazakhstan Japan
spring 100 900.0
river 200 300.0
pond 30 NaN
lake 10 80.0
Kazakhstan Japan
spring 100 900.0
river 200 300.0
pond 30 NaN
lake 10 80.0
In the above examples, we relied on pandas to arrange the columns based on the
information in the data parameter provided to the DataFrame constructor (see (Pandas-
dataframe, 2023)). In the last example, for instance, the columns were determined by
the keys of the dictionaries in the list. We can use the columns parameter of DataFrame
to re-arrange them or even add extra columns. For example:
list1 = [{"Kazakhstan": 100, "Japan": 900}, {"Kazakhstan": 200, "Japan":
,→ 300}, {"Kazakhstan": 30}, {"Kazakhstan": 10, "Japan": 80}]
df
Once a DataFrame is at hand, we can access its elements in various forms. For
example, we can access a column by its name:
df['Japan']
spring 900.0
river 300.0
pond NaN
lake 80.0
Name: Japan, dtype: float64
If the column names are strings, this could be also done by an “attribute-like”
indexing as follows:
df.Japan
spring 900.0
river 300.0
pond NaN
lake 80.0
Name: Japan, dtype: float64
USA Japan
spring NaN 900.0
river NaN 300.0
pond NaN NaN
lake NaN 80.0
At the same time, we can use loc and iloc attributes for explicit and implicit
indexing, respectively:
df.iloc[1:4,0:2]
Japan USA
river 300.0 NaN
pond NaN NaN
lake 80.0 NaN
df.loc[['river','lake'], 'Japan':'USA']
Japan USA
river 300.0 NaN
lake 80.0 NaN
Another way is to retrieve each column, which is a Series, and then index on that:
df.Japan.iloc[:2]
spring 900.0
river 300.0
Name: Japan, dtype: float64
spring 900.0
river 300.0
Name: Japan, dtype: float64
That being said, array-style indexing with a comma-separated pair of row and column
indices is only acceptable through loc or iloc. For example,
df.iloc[1,0]
300.0
and
df.loc['river', 'Japan']
300.0
We can also use masking (i.e., select some elements based on some criteria).
For example, assume we have a DataFrame of students with their information as
follows:
import pandas as pd
dict1 = {
    "GPA": pd.Series([3.00, 3.92, 2.89, 3.43, 3.55, 2.75]),
    "Name": pd.Series(['Askar', 'Aygul', 'Ainur', 'John', 'Smith', 'Xian']),
}
df = pd.DataFrame(dict1)
We would like to retrieve the information for any student with GPA > 3.5. To do
so, we define a boolean mask (returns True or False) that can easily be used for
indexing:
df.GPA > 3.5
0 False
1 True
2 False
3 False
4 True
5 False
Name: GPA, dtype: bool
df.Name[df.GPA > 3.5]
1 Aygul
4 Smith
Name: Name, dtype: object
In the above example we only returned the values of the column Name at which the mask
array (df.GPA > 3.5) is True. To retrieve all columns for which the mask array is
True, we can use the mask with the DataFrame:
df[df.GPA > 3.5]
There are two handy methods for DataFrame objects, namely, head(n=5) and
tail(n=5) that return the first and last n rows of a DataFrame (the default value of
n is 5), respectively:
df.head()
Last but not least, we can access the column and index labels through the columns
and index attributes of a DataFrame:
df.columns
A powerful feature of Pandas is its ability to easily read data from and write to a vari-
ety of common formats (sometimes, depending on specific applications, some other
libraries would be used as well). For example, Pandas can be used to read from/write
to csv format, pickle format (this is a popular Python binary data format using which
we can save any Python object (pickling) to be accessed later (unpickling)), SQL,
html, etc. A list of pandas input/output (I/O) functions is found at (Pandas-io, 2023).
As an example, suppose we would like to read the table of pandas I/O tools,
which is available at the URL (Pandas-io, 2023). For this purpose, we can use
pandas read_html that reads tables from an HTML:
url = 'https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html'
table = pd.read_html(url)
print(len(table))
The read_html() function returns a list of the DataFrames found in the HTML file (see (Pandas-
readhtml, 2023)). This is why the length of the above list is 8 (as of writing this
manuscript). Our desired table is the first one though:
table[0]
3.3 Matplotlib
%matplotlib notebook
%matplotlib inline
which is the default backend in notebooks, we can create static images in the
notebook. A list of available backends is shown by
%matplotlib --list
To plot figures, matplotlib generally relies on two concepts: figure and Axes.
Basically, the figure is what we think of as the “whole figure”, while an Axes is what
we generally refer to as “a plot” (e.g., the part of the image with curves and coordinates).
Therefore, a figure can contain many Axes, but each Axes can be part of only one figure.
Using an Axes object and its defined methods, we can control various properties.
For example, Axes.set_title(), Axes.set_xlabel(), and Axes.set_ylim()
can be used to set the title of a plot, set the label of the x axis, or set the limits of the
y-axis, respectively. There are two ways to use matplotlib:
1) the pyplot style, which resembles plotting Matlab figures (therefore, it is also
referred to as the Matlab style). In this style, we rely on pyplot module and its
functions to automatically create and manage figures and axes; and
2) the object-oriented style (OO-style) using which we create figures and axes, and
control them using methods defined for them.
In general the pyplot style is less flexible than the OO-style and many of the
functions used in pyplot style can be implemented by methods of Axes object.
pyplot-style—case of a simple plot: We first create a simple plot using the plot
function of the pyplot module (see (pyplot, 2023) for its full documentation). To do so,
we first import pyplot and use the standard alias plt to refer to it. As plotting
functions in matplotlib expect NumPy arrays as input (even array-like inputs
such as lists are converted internally to NumPy arrays), it is common to see
NumPy imported along with matplotlib.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
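# (the data and the plotting calls of this cell are not shown in this excerpt;
#  a sketch consistent with the OO-style version below and with Fig. 3.2 would be:)
x = np.linspace(0, 1, 20)
plt.plot(x, x, linewidth=2)
plt.plot(x, x**3, 'r+', label='cubic function')
plt.plot(x, np.cos(np.pi * x), 'b-.', label='cosine function', marker='d', markersize=5)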
plt.xlabel('time', fontsize='small')
plt.ylabel('amplitude', fontsize='small')
plt.title('time-amplitude', fontsize='small')
plt.legend(fontsize='small') # to add the legend
plt.tick_params(axis='both', labelsize=7) # to adjust the size of tick
,→labels on both axes
In the above code, and for illustration purposes, we used various markers. The list of
possible markers is available at (matplotlib-markers, 2023).
OO-style—case of a simple plot: Here we use the OO-style to create a plot similar
to the one shown in Fig. 3.2. To do so, we use subplots() from pyplot (not to
be confused with subplot()) that returns a figure and a single or array of Axes
objects. Then we can use methods of Axes objects such as Axes.plot to plot data
on the axes (or use aforementioned Axes methods to control the behaviour of the
Axes object):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 1, 20)
Fig. 3.2: The output of the above plotting example. Here we used pyplot-style.
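fig, ax = plt.subplots()  # (the call creating the figure and Axes is not shown in this excerpt; a plain subplots() call is assumed here so the lines below are self-contained)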
ax.plot(x, x, linewidth=2)
ax.plot(x, x**3, 'r+', label = 'cubic function')
ax.plot(x, np.cos(np.pi * x), 'b-.', label = 'cosine function', marker='d', markersize = 5)
ax.set_xlabel('time', fontsize='small')
ax.set_ylabel('amplitude', fontsize='small')
ax.set_title('time-amplitude', fontsize='small')
ax.legend(fontsize='small')
ax.tick_params(axis='both', labelsize=7) # to adjust the size of tick labels on both axes
Fig. 3.3: The output of the above plotting example. Here we used OO-style.
plt.subplot(212)
plt.plot(x, np.cos(np.pi * x), 'b-.', label = 'cosine function', marker='d', markersize = 5)
plt.plot(x, x, linewidth=2, label = 'linear')
plt.legend(fontsize='small')
plt.ylabel('amplitude', fontsize='small')
plt.xlabel('time', fontsize='small')
plt.tick_params(axis='both', labelsize=7)
Fig. 3.4: The output of the above plotting example. Here we used pyplot-style.
x = np.linspace(0, 1, 20)
fig1.suptitle('time-amplitude', fontsize='small')
Fig. 3.5: The output of the above plotting example. Here we used OO-style.
fact devising algorithms that can estimate the parameters of a line (or in general,
parameters of nonlinear mathematical functions) that can separate the data from
different groups in some mathematical sense.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal as mvn
np.random.seed(100)
m0 = np.array([-1,-1])
m1= np.array([1,1])
cov0 = np.eye(2)
cov1 = np.eye(2)
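# (the sample-generation lines of this cell are not shown in this excerpt; a sketch
#  drawing points from the two Gaussians defined above would be as follows;
#  the sample size of 100 points per class is an assumption)
sample0 = mvn.rvs(mean=m0, cov=cov0, size=100)
sample1 = mvn.rvs(mean=m1, cov=cov1, size=100)
x0, y0 = sample0[:, 0], sample0[:, 1]
x1, y1 = sample1[:, 0], sample1[:, 1]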
plt.style.use('seaborn')
plt.figure(figsize=(4,3), dpi=150)
plt.plot(x0,y0,'g.',label='class 0', markersize=6)
plt.plot(x1,y1,'r.', label='class 1', markersize=6)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
lim_left, lim_right = plt.xlim()
plt.plot([lim_left, lim_right],[-lim_left*2/3-1/3, -lim_right*2/3-1/3],'k', linewidth=1.5) # to plot the line
Example 3.5 In this example, we would like to first create a 10×10 grid with x and
y coordinates ranging from 0 to 9. Then we write a function that assigns a label to
each point in this grid such that if the sum of the x and y coordinates for a point
in the grid is less than 10, we assign a label 0, and if it is greater than or equal to 10, we
Fig. 3.6: The output of the code written for Example 3.4.
assign a label 1. Then we create a color plot such that labels of 0 and 1 are colored
differently—here we use ‘aquamarine’ and ‘bisque’ colors (see a list of matplotlib
colors at (matplotlib-colors, 2023)). A few things to know:
1) This example becomes helpful later when we would like to plot the decision
regions of a classifier. We will see that similar to the labeling function here, a
binary classifier also divides the feature space (for example, our 2D space) into
two regions over which labels are different;
2) To create the grid, we use numpy.meshgrid(). Observe what X and Y are and
see how they are mixed to create the coordinates; and
3) To color our 2D space defined by our grid, we use pyplot.pcolormesh() with
x and y coordinates in X and Y, respectively, and an array Z of the same size as
X and Y. The values of Z will be mapped to colors by a Colormap instance given
by the cmap parameter. For this purpose, the minimum and maximum values of Z
will be mapped to the first and last elements of the Colormap. However, here our
color map has only two elements and Z also includes only 0 and 1, so the mapping is
one-to-one.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 10, 1)
y = np.arange(0, 10, 1)
X, Y = np.meshgrid(x, y)
X
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
[6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
[9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])
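The construction of the coordinates array used below is not shown in this excerpt; a sketch consistent with the partial output that follows (one (x, y) pair per grid point) would be:
coordinates = np.array([X.ravel(), Y.ravel()]).T
coordinates[:15]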
array([[0, 0],
[1, 0],
[2, 0],
[3, 0],
[4, 0],
[5, 0],
[6, 0],
[7, 0],
[8, 0],
[9, 0],
[0, 1],
[1, 1],
[2, 1],
[3, 1],
[4, 1]])
def f(W):
    return (W.sum(axis=1) >= 10).astype(int)
Z = f(coordinates)
Z
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Fig. 3.7: The output of the plotting code written for Example 3.5.
Exercises:
Exercise 1: The following wikipedia page includes a list of many machine learning
datasets:
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
Extract the table of datasets for “News articles” only and then find the minimum
sample size among the datasets in this table (i.e., the minimum of the column “Instances”).
In doing so,
1) Do not extract all tables to choose this one from the list of tables. Make sure you
only extract this table. To do so, use something unique in this table and use it as
the match argument of pandas.read_html();
2) The column “Instances” should be converted to numeric. For doing so, use
pandas.to_numeric() function. Study and properly use the errors parameter
of this function to be able to complete this task.
3) See an appropriate pandas.DataFrame method to find the minimum.
Exercise 2: The “data” folder, contains GenomicData_orig.csv file. This file
contains a real dataset taken from Gene Expression Omnibus (GEO), https://
www.ncbi.nlm.nih.gov/geo/ with accession ID GSE37745. The original data
contains gene expression taken from 196 individuals with either squamous cell
carcinoma (66 individuals), adenocarcinoma (106), or large cell carcinoma (24);
however, we have chosen a subset of these individuals and also artificially removed
some of the measurements to create the effect of missing data.
1) Use an appropriate reader function from pandas to read this file and show the
first 10 rows.
2) What are the labels of the first column (index 0) and 20th column (index 19)?
Hint: use the columns attribute of the DataFrame
3) drop the first two columns (index 0 and 1) because they contain sample ID info
and we do not need them. Hint: use the pandas.DataFrame.drop() method. Read the
documentation to use it properly. You can use Part 2 to access the labels of these
columns to use as input to pandas.DataFrame.drop()
4) How many missing values are in this dataset?
Hint: first use pandas.DataFrame.isnull() to return a DataFrame with
True for missing values and False otherwise, and then either use numpy.sum()
or pandas.DataFrame.sum(). Use sum() two times.
5) Use DataFrame.fillna() method to fill the missing values with 0 (filling
missing values with some values is known as imputing missing values). Use the
method in Part 4 to show that there is no missing value now.
Exercise 3: Modify Example 3.5 presented before in this chapter to create the
following figure:
To do so,
1) use W.sum(axis=1) >= 10 condition (similar to Example 3.5) along with
another suitable condition that should be determined to plot the stripe shape;
2) do not explicitly plot lines as we did in the Example 3.4. What appears in
this figure as lines is the “stairs” shape as in Example 3.5, but with a much
larger grid in the same range of x and y (i.e., a finer grid). Therefore, it is
required to first redefine the grid (for example, something like arange(0, 10,
0.01)), and then to color these boundaries between regions as black, use the
pyplot.contour() function with appropriate inputs (X, Y, and Z, and black
colors).
Exercise 4: Write a piece of code that creates a numpy array of shape (2,4,3,5,2)
with all elements being 4.
Exercise 5: Suppose a and b are two numpy arrays. Looking at shape attribute of
array a leads to
(2, 5, 2, 1)
and looking at the shape attribute of array b leads to
(5, 1, 4)
Which of the following options are correct about c = a + b?
A) a + b raises an error because a and b are not broadcastable
B) a and b are broadcastable and the shape of c is (5, 2, 4)
C) a and b are broadcastable and the shape of c is (2, 5, 2, 4)
D) a and b are broadcastable and the shape of c is (5, 5, 4, 1)
E) a and b are broadcastable and the shape of c is (2, 5, 2, 1)
Exercise 6: Suppose a and b are two numpy arrays. Looking at the shape attribute of
array a leads to
(3, 5, 2, 4, 8)
and looking at the shape attribute of array b leads to
(2, 4, 1)
a) 25
b) 20
c) 15
d) the code raises an error
Exercise 9: What is the output of each of the following code snippets:
a)
a = np.arange(1, 5)
b = a[:2]
b[0] = 10
print(a)
b)
a = np.arange(1, 5)
b = np.arange(2)
c = a[b]
c[0] = 10
print(a)
Exercise 10: Suppose we have a numpy array “a” that contains about 10 million
integers. We would like to multiply each element of a by 2. Which of the following
two codes is better to use? Why?
Code A)
for i, v in enumerate(a):
a[i] *= 2
Code B)
a *= 2
Exercise 11: Which of the following options cannot create a DataFrame? Explain
the reason (assume pandas is imported as pd).
a)
dic = {"A": [10, 20, 30, 40], "B": [80, 90, 100, None]}
pd.DataFrame(dic, index=['z', 'd', 'a', 'b', 'f'])
b)
dic = {"A": [10, 20, 30, 40], "B": [80, 90, 100, None]}
pd.DataFrame(dic)
c)
dic = {"A": [10, 20, 30, 40], "B": [80, 90, None]}
pd.DataFrame(dic)
d)
import numpy as np
dic = {"A": np.array([10, 20, 30, 40]), "B": [80, 90, 100, None]}
pd.DataFrame(dic, index=['z', 'd', 'a', 'b'])
Exercise 12: Suppose we have the following DataFrame df:

     1    2    3
1    a    b  NaN
2  NaN  NaN    c
3    d    e  NaN
In each of the following parts, determine whether the indexing is legitimate, and
if it is, determine the output:
a)
df.loc[1,1]
b)
df[1]
c)
df[1][1]
d)
df[1].iloc[1]
e)
df[1].loc[1]
f)
df[1,1]
Chapter 4
Supervised Learning in Practice: the First
Application Using Scikit-Learn
In this chapter, we first formalize the idea of supervised learning and its two main
tasks, namely, classification and regression, and then provide a brief introduction to
one of the most popular Python-based machine learning software packages, namely,
Scikit-Learn. After this introduction, we start with a classification application through
which we introduce and practice some important concepts in machine learning such
as data splitting, normalization, training a classifier, prediction, and evaluation. The
concepts introduced in this chapter are important from the standpoint of practical
machine learning because they are used frequently in the design process.
Supervised learning is perhaps the most common type of machine learning in which
the goal is to learn a mapping between a vector of input variables (also known as
predictors or feature vector, which is a vector of measurable variables in a problem)
and output variables. To guide the learning process, the assumption in supervised
learning is that a set of input-output instances, also referred to as training data, is
available. An output variable is generally known as target, response, or outcome, and
while there is no restriction on the number of output variables, for the ease of notation
and discussion, we assume a single output variable, denoted y, is available. While
it is generally easy and cheap to measure feature vectors, it is generally difficult or
costly to measure y. This is the main reason that in supervised learning practitioners
go through the pain of assigning the outcomes to input vectors once (i.e., collecting
the training data) and hope to learn a function that can perform this mapping in the
future.
Suppose we have a set of training data Str = {(x1, y1 ), (x2, y2 ), . . . , (xn, yn )} where
xi ∈ R p, i = 1, . . . , n, represents a vector including the values of p feature variables
(feature vector), yi denotes the target value (outcome) associated with xi , and n is
the number of observations (i.e., sample size) in the training data. In classification,
possible values for yi belong to a set of predefined finite categories called labels and
the goal is to assign a given (realization of random) feature vector (also known as
observation or instance) to one of the class labels (in some applications known as
multilabel classification, multiple labels are assigned to one instance). In regression,
on the other hand, yi represents realizations of a numeric random variable and the
goal is to estimate the target value for a given feature vector. Whether the problem
is classification or regression, the goal is to estimate the value of the target y for
a given feature vector x. In machine learning, we refer to this estimation problem
as prediction; that is, predicting y for a given x. At the same time, in classification
and regression, we refer to the mathematical function that performs this mapping as
classifier and regressor, respectively.
4.2 Scikit-Learn
• Estimators: Any object that can estimate some parameters based on data is known
as an estimator. The estimation itself is performed by calling the fit() method as:

estimator.fit(data, targets)

or

estimator.fit(data)

The fit() method performs the estimation from the given data.
• Transformers: Some estimators can also transform data. These estimators are
known as transformers and implement transform() method to perform the
transformation of data as:
new_data = transformer.transform(data)

Oftentimes, it is more convenient to apply fit() and transform() back-to-back,
which is done by the fit_transform() method:

new_data = transformer.fit_transform(data)
• Predictors: Some estimators can make predictions given a data. These esti-
mators are known as predictors and implement predict() method to perform
prediction:
prediction = predictor.predict(data)
probability = predictor.predict_proba(data)
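To make this interface concrete, the following is a minimal sketch (with a toy dataset created only for illustration) that chains the three roles together using two estimators that also appear later in this chapter:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

data = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])  # toy feature matrix
targets = np.array([0, 0, 1, 1])                                   # toy labels

scaler = StandardScaler()                        # a transformer (and an estimator)
new_data = scaler.fit_transform(data)            # fit() and transform() back-to-back

predictor = KNeighborsClassifier(n_neighbors=1)  # a predictor (and an estimator)
predictor.fit(new_data, targets)                 # fit() performs the estimation
prediction = predictor.predict(new_data)         # predicted class labels
probability = predictor.predict_proba(new_data)  # class membership probabilities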
sklearn.utils.Bunch
Datasets that are part of scikit-learn are generally stored as “Bunch” objects; that
is, objects of class sklearn.utils.Bunch. This is an object that contains the
actual data as well as some information about it. All this information is stored
in a Bunch object similar to dictionaries (i.e., using keys and values). Similar to
dictionaries, we can use the keys() method to see all keys in a Bunch object:

iris.keys()
The value of the key DESCR gives some brief information about the dataset. Never-
theless, compared with dictionaries, in a Bunch object we can also access values as
bunch.key (or, equivalently, as in dictionaries, by bunch['key']). For example,
we can access class names as follows:
print(iris['target_names']) # or, equivalently, print(iris.target_names)
Next, we print the first 500 characters in the DESCR field where we see the name of
features and classes:
print(iris.DESCR[:500])
.. _iris_dataset:
All measurements (i.e., feature vectors) are stored as values of data key. Here,
the first 10 feature vectors are shown:
iris.data[:10]
We refer to the matrix containing all feature vectors as data matrix (also known
as feature matrix). By convention, scikit-learn assumes this matrix has the shape
of sample size × feature size; that is, the number of observations (also sometimes
referred to as the number of samples) × the number of features. For example, in this
data there are 150 Iris flowers and for each there are 4 features; therefore, the shape
is 150 × 4:
iris.data.shape
(150, 4)
The corresponding targets (in the same order as the feature vectors stored in data)
can be accessed through the target field:
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
The three classes in the dataset, namely, setosa, versicolor, and virginica are
encoded as integers 0, 1, and 2, respectively. Although there are various encoding
schemes to transform categorical variables into their numerical counterparts, this is
known as integer (ordinal) encoding (also see Exercise 3).
Here we use bincount function from NumPy to count the number of samples in
each class:
import numpy as np
np.bincount(iris.target)
As seen here, there are 50 observations in each class. Here we check whether
the type of data matrix and target is “array-like” (e.g., numpy array or pandas
DataFrame), which is the expected type of input data for scikit-learn estimators:
print('type of data: ' + str(type(iris.data))+ '\ntype of target: ' +
,→str(type(iris.target)))
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal
,→width (cm)']
X_train_shape: (120, 4)
X_test_shape: (30, 4)
y_train_shape: (120,)
y_test_shape: (30,)
X_train and y_train are the feature matrix and the target values used for
training, respectively. The feature matrix and the corresponding targets for evaluation
are stored in X_test and y_test, respectively. Let us count the number of class-
specific observations in the training data:
np.bincount(y_train)
This shows that the equal proportion of classes in the given data is kept in both the
training and the test sets.
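For reference, a stratified split of this kind can be produced with the train_test_split function; the following is a minimal sketch in which the random_state value is only an assumption:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42)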
For datasets in which the number of variables is not really large, visualization could
be a good exploratory analysis—it could reveal possible abnormalities or could
provide us with an insight into the hypothesis behind the entire experiment. In this
regard, scatter plots can be helpful. We can still visualize scatter plots for three
variables but for more than that, we need to display scatter plots between all pairs
of variables in the data. These exploratory plots are known as pair plots. There are
various ways to plot pair plots in Python but an easy and appealing way is to use
pairplot() function from seaborn library. As this function expects a DataFrame
as the input, we first convert the X_train and y_train arrays to dataframes and
concatenate them.
import pandas as pd
X_train_df = pd.DataFrame(X_train, columns=iris.feature_names)
y_train_df = pd.DataFrame(y_train, columns=['class'])
X_y_train_df = pd.concat([X_train_df, y_train_df], axis=1)
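A call along the following lines then produces a figure like Fig. 4.1 (a sketch; styling options are omitted, and the hue argument colors points by class):

import seaborn as sns

sns.pairplot(X_y_train_df, hue='class')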
Fig. 4.1 presents the scatter plot of all pairs of features. The plots on diagonal
show the histogram for each feature across all classes. The figure, at the very least,
shows that classification of Iris flowers in the collected sample is plausible based on
some of these features. For example, we can observe that class-specific histograms
generated based on petal width feature are fairly distinct. Further inspection of these
plots suggests that, for example, petal width could potentially be a better feature than
sepal width in discriminating classes. This is because the class-specific histograms
generated by considering sepal width are more mixed when compared with petal
width histograms. We may also be able to make similar suggestions about some
bivariate combinations of features. However, it is not easy to infer much about
higher order feature dependency (i.e., multivariate relationship) from these plots. As
a result, although visualization could also be used for selecting discriminating feature
subsets, it is generally avoided because we are restricted to low-order dependency
among features, while in many problems higher-order feature dependencies lead to
an acceptable level of prediction. On the other hand, in machine learning there exist
feature subset selection methods that are entirely data-driven and can detect low- or
high-order dependencies among features. These methods are discussed in Chapter
10. In what follows, we consider all four features to train a classifier.
Fig. 4.1: Pair plots generated using the training set in the Iris classification application

4.6 Feature Scaling (Normalization)

It is common for the values of features in a dataset to come in different scales. For
example, the range of some features could be from 0 to 1, while others may be on
the order of thousands or millions, depending on what they represent and how they
are measured. In these cases, it is common to apply some type of feature scaling
(also known as normalization) to the training data to make the scale of the features
“comparable”. This is because some important classes of machine learning models,
such as neural networks, kNNs, and ridge regression, to name just a few, benefit from
feature scaling.
Two common ways of feature scaling are standardization and min-max scaling.
In standardization, first the mean and the standard deviation of each feature are
found. Then, for each feature, the mean of that feature is subtracted from all feature
values and the result of this subtraction is divided by the standard deviation of the
feature. This way each feature is centered around zero and will have a standard
deviation of one.
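A minimal NumPy sketch of this computation, assuming X_train holds the training feature matrix introduced earlier, is:

import numpy as np

mean = X_train.mean(axis=0)              # per-feature means estimated from the training set
std = X_train.std(axis=0)                # per-feature standard deviations from the training set
X_train_scaled = (X_train - mean) / std  # standardized training set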
Next, we observe that the mean of each feature is 0 (the reason for having such small
numbers instead of absolute 0 is the finite machine precision):
X_train_scaled.mean(axis=0) # observe the mean is 0 now
It is important to note that the “test set” should not be used in any stage involved
in training of our classifier, not even in a preprocessing stage such as normalization.
The main reason for this can be explained through the following four points:
• Point 1: when a classifier is trained, the most salient issue is the performance of
the classifier on unseen (future) observations collected from the same applica-
tion. In other words, the entire worth of the classifier depends on its performance
on unseen observations;
• Point 2: because “unseen” observations, as the name suggests, are not available
to us during the training, we use (the available) test set to simulate the effect of
unseen observations for evaluating the trained classifier;
• Point 3: as a result of Point 2, in order to have an unbiased evaluation of the
classifier using test set, the classifier should classify observations in the test set
in precisely the same way it is used to classify unseen future observations; and
• Point 4: unseen observations are not available to us and naturally they cannot be
used in any training stage such as normalization, feature selection, etc.; therefore,
observations in the test set should not be used in any training stage either.
set in training and the other for evaluation” as an illegitimate practice, what
is really meant is that following this procedure and then framing that as
test-set estimator of performance is illegitimate. If we accept it, however,
as another performance metric, one that is less known and is expected to
be optimistically biased to some extent, then it is a legitimate performance
estimator.
Once the relevant statistics (here, the mean and the standard deviation) are esti-
mated from the training set, they can be used to normalize the test set:
X_test_scaled = X_test - mean
X_test_scaled /= std
Observe that the test set does not necessarily have a mean of 0 or standard deviation
of 1:
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))
To estimate the mean and the standard deviation of each feature from the training set
(i.e., from X_train), we call the fit() method of the scaler object (after all, any
transformer is an estimator and implements the fit() method):

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # instantiate the transformer
scaler.fit(X_train)

StandardScaler()
The scaler object holds any information that the standardization algorithm im-
plemented in StandardScaler class extracts from X_train. The fit() method
returns the scaler object itself and modifies it in place (i.e., stores the parameters
estimated from data). Next, we call the transform() method of the scaler object
to transform the training and test sets based on statistics extracted from the train-
ing set, and use X_train_scaled and X_test_scaled to refer to the transformed
training and test sets, respectively:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.std(axis=0))
For later use, and in order to avoid repeating the above preprocessing steps, the training
and testing arrays could be saved to binary files using numpy.save(). For NumPy
arrays this is generally more efficient than the usual “pickling” supported by the pickle
module. For saving multiple arrays, we can use numpy.savez(). We can provide
our arrays as keyword arguments to this function; in that case, they are saved in a
binary file under the names that we provide as keywords. If we give them as
positional arguments, then they will be stored with the names arr_0, arr_1, etc.
Here we specify our arrays as keywords X and y :
np.savez('data/iris_train_scaled', X = X_train_scaled, y = y_train)
np.savez('data/iris_test_scaled', X = X_test_scaled, y = y_test)
numpy.save() and numpy.savez() add the .npy and .npz extensions, respectively, to the
name of the created file. Later, the arrays can be loaded by numpy.load(). Using
numpy.load() for .npz files returns a dictionary-like object through which each array
can be accessed either by the keywords that we provided when saving the file, or by
arr_0, arr_1, . . . , if positional arguments were used during the saving process.
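For instance, the arrays saved above could later be restored as follows (a brief sketch):

import numpy as np

arrays = np.load('data/iris_train_scaled.npz')
X_train_scaled, y_train = arrays['X'], arrays['y']  # keys match the keywords used in savez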
4.7 Model Training

We are now in a position to train the actual machine learning model. For this
purpose, we use the k-nearest neighbors (kNN) classification rule in its standard
form. To classify a test point, one can think that the kNN classifier grows a spherical
region centered at the test point until it encloses k training samples, and classifies
the test point to the majority class among these k training samples. For example, in
Fig. 4.2, 5NN assigns “green” to the test observation because within the 5 nearest
observations to this test point, three are from the green class.
The kNN classifier is implemented in the KNeighborsClassifier class in the
sklearn.neighbors module. Similar to the way we used the StandardScaler
estimator earlier, we first instantiate the KNeighborsClassifier class into an
object:
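A minimal sketch of the instantiation and fitting steps, assuming the scaled arrays from the previous sections, is:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)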
Fig. 4.2: The working principle of kNN (here k = 5). The test point is identified by
×. The circle encloses 5 nearest neighbors of the test point.
KNeighborsClassifier(n_neighbors=3)
Similar to the use of fit() method for StandardScaler, the fit() method here
returns the modified knn object.
4.8 Prediction Using the Trained Model
As said before, some estimators in scikit-learn are predictors; that is, they can make
prediction by implementing the predict() method. KNeighborsClassifier is
also a predictor and, therefore, implements predict(). Here we use this method
to make a prediction on a new data point. Suppose we have the following data point
measured in the original scales as in the original training set X_train:
x_test = np.array([[5.5, 2, 1.1, 0.6]]) # same as: np.array([5.5, 2, 1.
,→1, 0.6]).reshape(1,4)
x_test.shape
(1, 4)
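Because the classifier was trained on standardized features, this new point should be transformed with the same scaler object before calling predict(); a brief sketch:

x_test_scaled = scaler.transform(x_test)   # assumes the scaler and knn objects fitted earlier
print('knn predicts: ' + str(iris.target_names[knn.predict(x_test_scaled)]))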
We can also give several sample points as the argument to the predict() method.
In that case, we receive the assigned label for each of them:
y_test_predictions = knn.predict(X_test_scaled)
print('knn predicts: ' + str(iris.target_names[y_test_predictions])) #
,→fancy indexing in Section 3.1.4
The above sequence of operations, namely, instantiating the KNeighborsClassifier
class (imported as KNN), fitting, and predicting, can be combined into the following
one-liner pattern known as method chaining:
y_test_predictions = KNN(n_neighbors=3).fit(X_train_scaled, y_train).
,→predict(X_test_scaled)
4.9 Model Evaluation (Error Estimation)

There are various rules and metrics to assess the performance of a classifier. Here
we use the simplest and perhaps the most intuitive one; that is, the proportion of
misclassified points in a test set. This is known as the test-set estimator of error rate.
In other words, the proportion of misclassified observations in the test set is indeed
an estimate of classification error rate denoted ε, which is defined as the probability
of misclassification by the trained classifier.
Let us first formalize the definition of error rate for binary classification. Let X
and Y represent a random feature vector and a binary random variable representing
the class variable, respectively. Because Y is a discrete random variable and X is a
continuous feature vector, we can characterize the joint distribution of X and Y (this
is known as joint feature-label distribution) as:
P(X \in E, Y = i) = \int_{E} p(x \mid Y = i) P(Y = i) \, dx, \quad i = 0, 1 ,   (4.1)
probabilistic question to ask is what would be the joint probability of all those events
and Y = 1? Formally, to answer this question, we need to find,
P(X \in E_0, Y = 1) = \int_{E_0} p(x \mid Y = 1) P(Y = 1) \, dx \overset{1}{\equiv} \int_{\psi(x)=0} p(x \mid Y = 1) P(Y = 1) \, dx ,   (4.2)

where \overset{1}{\equiv} is a direct consequence of (4.1). Similarly, let E_1 denote all events for which
ψ(X) gives label 1. We can ask a similar probabilistic question; that is, what would
be the joint probability of Y = 0 and E1 ? In this case, we need to find
P(X \in E_1, Y = 0) = \int_{E_1} p(x \mid Y = 0) P(Y = 0) \, dx \equiv \int_{\psi(x)=1} p(x \mid Y = 0) P(Y = 0) \, dx .   (4.3)
\hat{\varepsilon}_{te} = \frac{k}{m} ,   (4.5)

where k denotes the number of misclassified observations among the m observations in the test set.
Rather than reporting the error estimate, it is also common to report the accuracy
estimate of a classifier. The accuracy, denoted acc, and its test-set estimate, denoted
\widehat{\mathrm{acc}}_{te}, are given by:

\mathrm{acc} = 1 - \varepsilon , \qquad \widehat{\mathrm{acc}}_{te} = 1 - \hat{\varepsilon}_{te} .   (4.6)
For example, a classifier with an error rate of 15% has an accuracy of 85%.
Let us calculate the test-set error estimate of our trained kNN classifier. In this
regard, we can compare the actual labels within the test set with the predicted labels
and then find the proportion of misclassification:
errors = (y_test_predictions != y_test)
errors
The classifier misclassified three data points out of 30 that were part of the test set;
therefore, \hat{\varepsilon}_{te} = 3/30 = 0.1:
error_est = sum(errors)/errors.size
print('The error rate estimate is: {:.2f}'.format(error_est) + '\n'\
'The accuracy is: {:.2f}'.format(1-error_est))
In the above code, we use the placeholder { } and the format specifier .2f to specify the
number of digits after the decimal point. The “:” before .2f separates the format specifier
from the rest of the replacement field (if any option is set) within the { }.
Using scikit-learn's built-in functions from the metrics module, many performance
metrics can be easily calculated. A complete list of the metrics supported by scikit-
learn can be found at (Scikit-eval, 2023). Here we only show how the accuracy estimate
can be obtained in scikit-learn. For this purpose, there are two options: 1) using the
accuracy_score function; and 2) using the score method of the classifier.
The accuracy_score function expects the actual labels and predicted labels as
arguments:
from sklearn.metrics import accuracy_score
print('The accuracy is {:.2f}'.format(accuracy_score(y_test,
,→y_test_predictions)))
All classifiers in scikit-learn also have a score method that given a test data and its
labels, returns the classifier accuracy; for example,
print('The accuracy is {:.2f}'.format(knn.score(X_test_scaled, y_test)))
Exercises:
Are the number of samples in each class still the same as the original dataset?
was already encoded. In this exercise, we will practice encoding in conjunction with
a dataset collected for a business application.
Working with clients and understanding the factors that can improve the business
is part of marketing data science. One goal in marketing data science is to keep
customers (customer retention), which could be easier in some aspects than attracting
new customers. In this exercise, we see a case study on customer retention.
A highly competitive market is that of communication services. Following the break-up
of AT&T in the 1980s, customers had a choice to keep their carrier (AT&T) or move
to another carrier. AT&T tried to identify factors relating to customer choice of
carriers. For this purpose, they collected customer data from a number of sources
including phone interviews, household billing, and service information recorded in
their databases. The dataset that we use here is based on this effort and is taken from
(Miller, 2015). It includes nine feature variables and a class variable (pick). The
goal is to construct a classifier that predicts customers who switch to another service
provider or stay with AT&T.
The dataset is stored in att.csv, which is part of the “data” folder. Below is a
description of the variables in the dataset:
pick: customer choice (AT&T or OCC [short for Other Common Carrier])
income: household income in thousands of dollars
moves: number of times the household moved in the last five years
age: age of respondent in years (18-24, 25-34, 35-44, 45-54, 55-64, 65+)
education: <HS (less than high school); HS (high school graduate); Voc (vocational
school); Coll (Some College); BA (college graduate); >BA (graduate school)
employment: D (disabled); F (full-time); P (part-time); H (homemaker); R (re-
tired); S (student); U (unemployed)
usage: average phone usage per month
nonpub: does the household have an unlisted phone number
reachout: does the household participate in the AT&T “Reach Out America”
phone service plan?
card: does the household have an “AT&T Calling Card”?
1) Load the data and pick only the variables named employment, usage, reachout, and
card as features to use in training the classifier (keep “pick” because it is the
class variable).
2) Find the rate of missing values for each variable (features and class variable).
3) Remove any feature vector and its corresponding class label with at least one missing
entry. What is the sample size now?
4) Use a stratified random split to divide the data into training and test sets where the
test set size is 20% of the dataset. Set random_state=100.
5) There are various strategies for encoding with different pros and cons. An
efficient strategy in this regard is ordinal encoding. In this strategy, a variable
with N categories, will be converted to a numeric variable with integers from 0 to
N − 1. Use OrdinalEncoder() and LabelEncoder() transformers to encode
the categorical features and the class variable, respectively—LabelEncoder
works similarly to OrdinalEncoder except that it accepts one-dimensional
arrays and Series (so use that to encode the class variable). Hint: you need to fit
encoder on training set and transform both training and test sets.
6) Train a 7NN classifier using the encoded training set and evaluate it on the encoded
test set.
Exercise 4: We have a dataset D using which we would like to train and evaluate a
classifier. Suppose

• we divide D into a training set denoted D_train and a test set denoted D_test
• apply normalization on D_train using parameters estimated from D_train to create
D_train^norm, which denotes the normalized training set.

C)
• apply normalization on D_test using parameters estimated from D_train to create
the normalized test set denoted D_test^norm
• train the classifier on D_train^norm and assess its performance on D_test^norm

D)
• apply normalization on D_test using parameters estimated from D_train to create
the normalized test set denoted D_test^norm
• train the classifier on D_train^norm and assess its performance on D_test
Chapter 5
k-Nearest Neighbors
There are many predictive models in machine learning that can be used in the design
process. We start the introduction to these predictive models with kNN (short for k-
Nearest Neighbors), not only because of its simplicity, but also due to its long history
in machine learning. As a result, in this chapter we formalize the kNN mechanism
for both classification and regression and we will see various forms of kNN.
5.1 Classification
kNN is one of the oldest rules with a simple working mechanism and has found
many applications since its conception in 1951 (Fix and Hodges, 1951). To for-
malize the kNN mathematical mechanism, it is common to consider the case of
binary classification where the values of target variable y are encoded as 0 and
1. Furthermore, to mathematically characterize the kNN mechanism, we introduce
some additional notation. Suppose we use a distance metric (e.g., Euclidean dis-
tance) to quantify distance of all feature vectors (observations) in a training set
Str = {(x1, y1 ), (x2, y2 ), . . . , (xn, yn )} to a given feature vector x (test observation)
for which we obtain an estimate of y denoted ŷ. We denote the i th nearest observation
to x as x(i) (x) with its associated label as y(i) (x). Note that both x(i) (x) and y(i) (x) are
functions of x because the distance is measured with respect to x. With this notation,
x(1) (x), for example, means the nearest observation within Str to x.
With this notation, the standard kNN classifier is given by

\hat{y} = \psi(x) = \begin{cases} 1 & \text{if } \frac{1}{k} \sum_{i=1}^{k} I_{\{y_{(i)}(x)=1\}} > \frac{1}{k} \sum_{i=1}^{k} I_{\{y_{(i)}(x)=0\}} , \\ 0 & \text{otherwise} , \end{cases}   (5.1)
where k denotes the number of nearest neighbors with respect to the given feature
vector x and I_A is the indicator of event A, which is 1 when A happens, and 0
otherwise. Here we write ŷ = ψ(x) to remind us of the fact that the function ψ(x) (i.e.,
the classifier) is essentially an estimator of the class label y. Nevertheless, hereafter,
we denote all classifiers simply as ψ(x). In addition, in what follows, the term standard
kNN classifier is used interchangeably with kNN classifier.
Let us elaborate more on (5.1). The term \sum_{i=1}^{k} I_{\{y_{(i)}(x)=1\}} in (5.1) counts the number
of training observations with label 1 among the k nearest neighbors of x. This is then
compared with \sum_{i=1}^{k} I_{\{y_{(i)}(x)=0\}}, which is the number of training observations with
label 0 among the k nearest neighbors of x, and if \sum_{i=1}^{k} I_{\{y_{(i)}(x)=1\}} > \sum_{i=1}^{k} I_{\{y_{(i)}(x)=0\}},
then ψ(x) = 1 (i.e., kNN assigns label 1 to x); otherwise, ψ(x) = 0. As we can see
in (5.1), the factor 1/k does not have any effect on the process of decision making
in the classifier and it can be canceled out safely from both sides of the inequality.
However, having that in (5.1) better reflects the relationship between the standard
kNN classifier and other forms of kNN that we will see later in this chapter. For
multiclass classification problems with c classes j = 0, 1, . . . , c−1, (e.g., in Iris flower
classification problem we had c = 3), we again identify the k nearest neighbors of
x, count the number of labels among these k training samples, and classify x to the
majority class among these k observations; that is to say,
\hat{y} = \psi(x) = \operatorname{argmax}_{j} \, \frac{1}{k} \sum_{i=1}^{k} I_{\{y_{(i)}(x)=j\}} .   (5.2)
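To make the voting rule concrete, the following is a short NumPy sketch of the standard kNN classifier in (5.2) using Euclidean distance; it is an illustration only, not the scikit-learn implementation used elsewhere in this book:

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify a single point x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances from x to all training points
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # the majority class among the neighbors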
y_test = arrays['y']
print('X shape = {}'.format(X_train.shape) + '\ny shape = {}'.
,→format(y_train.shape))
X shape = (120, 4)
y shape = (120,)
For illustration purposes, we only consider the first two features in data; that is, the
first and second features (columns) in X_train and X_test:
X_train = X_train[:,[0,1]]
X_test = X_test[:,[0,1]]
X_train.shape
(120, 2)
To plot the decision regions, we use the same technique discussed in Example
3.5; that is, we first create a grid, train a classifier using training data, classify each
point in the grid using the trained classifier, and then plot the decision regions based
on the assigned labels to each point in the grid:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier as KNN
color = ('aquamarine', 'bisque', 'lightgrey')
cmap = ListedColormap(color)
for ax in axs.ravel():
ax.label_outer() # to show the x-label and the y-label for the last
,→row and the left column, respectively
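The following condensed sketch shows this grid-based procedure for a single value of k on one set of axes (the book's Fig. 5.1 repeats it for k = 1, 3, 9, 36 in a grid of subplots); variable names follow the snippet above:

import numpy as np

knn = KNN(n_neighbors=3).fit(X_train, y_train)
x1 = np.arange(X_train[:, 0].min() - 1, X_train[:, 0].max() + 1, 0.01)
x2 = np.arange(X_train[:, 1].min() - 1, X_train[:, 1].max() + 1, 0.01)
X1, X2 = np.meshgrid(x1, x2)
Z = knn.predict(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape)
plt.contourf(X1, X2, Z, cmap=cmap)                 # color the decision regions
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k', s=15)
plt.xlabel('sepal length (normalized)')
plt.ylabel('sepal width (normalized)')
plt.show()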
As we can see in Fig. 5.1, a larger value of the hyperparameter k generally leads to
smoother decision boundaries. That being said, a smoother decision region does not
necessarily mean a better classifier performance.
The classification formulated in (5.1) is the mechanism of the standard kNN classifier.
There is another form of kNN known as distance-weighted kNN (DW-kNN) (Dudani,
1976). In this form, the k nearest neighbors of a test observation x are weighted
according to their distance from x. The idea is that observations that are closer to x
should impose a higher influence on decision making. A simple weighting scheme
for this purpose is to weight observations based on the inverse of their distance to x.
With this weighting scheme, DW-kNN classifier becomes:
\psi(x) = \begin{cases} 1 & \text{if } \frac{1}{D} \sum_{i=1}^{k} \frac{1}{d[x_{(i)}(x),\, x]} I_{\{y_{(i)}(x)=1\}} > \frac{1}{D} \sum_{i=1}^{k} \frac{1}{d[x_{(i)}(x),\, x]} I_{\{y_{(i)}(x)=0\}} , \\ 0 & \text{otherwise} , \end{cases}   (5.3)
where

D = \sum_{i=1}^{k} \frac{1}{d[x_{(i)}(x),\, x]} ,   (5.4)

and where d[x_{(i)}(x), x] denotes the distance of x_{(i)}(x) from x.

Fig. 5.1: kNN decision regions and boundaries for k = 1, 3, 9, 36 in the Iris classifi-
cation problem.

If we think of each
weight 1/d[x(i) (x), x] as a “score” given to the training observation x(i) (x), then DW-
kNN classifies the test observation x to the class with the largest “cumulative score”
given to its observations that are among the k nearest neighbors of x. At this stage,
it is straightforward to see that the standard kNN given by (5.1) is a special form of
DW-kNN where the k nearest neighbors have an equal contribution to voting; that is,
when d[x_{(i)}(x), x] = 1 for all k nearest neighbors, which makes D = k.
In some settings, DW-kNN may end up behaving similarly to a simple nearest-
neighbor rule (1NN) because training observations that are very close to a test
observation will take very large weights. In practice though, the choice of using
DW-kNN or standard kNN, or even the choice of k are matters that can be decided in
the model selection phase (Chapter 9). In scikit-learn (as of version 1.2.2), DW-kNN
can be implemented by setting the weights parameter of KNeighborsClassifier
to 'distance' (the default value of weights is 'uniform', which corresponds to
the standard kNN).
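For instance (with the variable names of the earlier Iris example assumed):

from sklearn.neighbors import KNeighborsClassifier

dw_knn = KNeighborsClassifier(n_neighbors=5, weights='distance')  # DW-kNN
dw_knn.fit(X_train, y_train)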
The Euclidean distance d_E, the Manhattan distance d_Ma, and the Minkowski distance
d_Mi between two feature vectors x_i and x_j, each containing q elements, are given by

d_E[x_i, x_j] = \sqrt{ \sum_{l=1}^{q} (x_{li} - x_{lj})^2 } ,   (5.5)

d_{Ma}[x_i, x_j] = \sum_{l=1}^{q} |x_{li} - x_{lj}| ,   (5.6)

d_{Mi}[x_i, x_j] = \Big( \sum_{l=1}^{q} |x_{li} - x_{lj}|^p \Big)^{\frac{1}{p}} .   (5.7)
In what follows, to specify the choice of distance in the kNN classifier, the
name of the distance is prefixed to the classifier; for example, Manhattan-kNN clas-
sifier and Manhattan-weighted kNN classifier refer to the standard kNN and the
DW-kNN classifiers with the choice of Manhattan distance. When this choice is
not indicated specifically, Euclidean distance is meant to be used. The Minkowski
distance has an order p, which is an arbitrary integer. When p = 1 and p = 2,
Minkowski distance reduces to the Manhattan and the Euclidean distances, re-
spectively. In KNeighborsClassifier, the choice of distance is determined by
the metric parameter. Although the default metric in KNeighborsClassifier
is Minkowski, there is another parameter p, which determines the order of the
Minkowski distance, and has a default value of 2. This means that the default metric
of KNeighborsClassifier is indeed the Euclidean distance.
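For example, a Manhattan-kNN classifier can be obtained in either of the following equivalent ways:

from sklearn.neighbors import KNeighborsClassifier

manhattan_knn = KNeighborsClassifier(n_neighbors=5, p=1)                 # Minkowski with p = 1
manhattan_knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan')  # named metric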
5.2 Regression
The standard kNN regressor is defined as

\hat{y} = f(x) = \frac{1}{k} \sum_{i=1}^{k} y_{(i)}(x) .   (5.8)
Similar to (5.1), here ŷ = f (x) is written to emphasize that function f (x) is essentially
an estimator of y. Nevertheless, hereafter, we denote our regressors simply as f (x).
From (5.8), the standard kNN regressor estimates the target of a given x as the average
of targets of the k nearest neighbors of x. We have already seen the functionality
of kNN classifier through the Iris classification application. To examine the kNN
functionality in regression, in the next section, we use another well-known dataset
that suits regression applications.
The Dataset: Here we use the California Housing dataset, which includes the
median house price (in $100,000) of 20640 California districts with 8 features that
can be used to predict the house price. The data was originally collected in (Kelley
Pace and Barry, 1997) and suits regression applications because the target, which is
the median house price of a district (identified by “MedHouseVal” in the data), is a
numeric variable. Here we first load the dataset:
from sklearn import datasets
california = datasets.fetch_california_housing()
print('california housing data shape: '+ str(california.data.shape) + \
'\nfeature names: ' + str(california.feature_names) + \
'\ntarget name: ' + str(california.target_names))
As mentioned in Section 4.3, using the DESCR key we can check some details
about the dataset. Here we use this key to list the name and meaning of variables in
the data:
print(california.DESCR[:975])
.. _california_housing_dataset:
:Attribute Information:
- MedInc median income in block group
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
“Latitude” and “Longitude” identify the centroid of each district in the data, and
“MedInc” values are in $10,000.
Preprocessing and an Exploratory Analysis: We split the data into training and
test sets. Compared with the Iris classification problem, we use the default 0.25
test_size and do not set stratify to any variable—after all, random sampling
with stratification requires classes whereas in regression, there are no “classes”.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
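The split itself can be obtained along the following lines (a sketch; the random_state value is an assumption):

X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, random_state=42)  # default test_size=0.25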
X_train_shape: (15480, 8)
X_test_shape: (5160, 8)
y_train_shape: (15480,)
y_test_shape: (5160,)
Similar to Section 4.6, we use standardization to scale the training and the test sets.
In this regard, the scaler object is trained using the training set, and is then used to
transform both the training and the test sets:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
\rho = \frac{\mathrm{Cov}[X, Y]}{\sigma_X \sigma_Y} ,   (5.9)

where

\mathrm{Cov}[X, Y] = E\big[(X - E[X])(Y - E[Y])\big] ,   (5.10)

and where E[Z] and σ_Z are the expected value and the standard deviation
of Z, respectively. As the covariance Cov[X, Y] measures the joint variation
between two variables X and Y, ρ in (5.9) represents a normalized form of
their joint variability. However, in practice, we cannot generally compute
(5.9) because it depends on the unknown actual expectations and standard
deviations. Therefore, we need to estimate it from a sample.
Suppose we collect a sample from the joint distribution of X and Y . This
sample consists of n pairs of observations {(x1, y1 ), (x2, y2 ), . . . , (xn, yn )};
that is, when the value of X is xi , the value of Y is yi . The sample Pearson’s
correlation coefficient r is an estimate of ρ and is given by
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} ,   (5.11)

where \bar{x} and \bar{y} denote the sample means of the observations of X and Y, respectively.
california_pd.head().round(3)
Longitude MEDV
0 0.883 2.903
1 -0.999 0.687
2 -0.741 1.097
3 -1.423 4.600
4 -1.283 2.134
Now we calculate and, at the same time, we draw a color-coded plot (known as
heatmap) of these correlation coefficients using seaborn.heatmap() (the result is
depicted in Fig. 5.2):
import matplotlib.pyplot as plt  # imports needed for this snippet
import seaborn as sns

fig, ax = plt.subplots(figsize=(9,5))
sns.heatmap(california_pd.corr().round(2), annot=True, square=True, ax=ax)
One way that we may use this exploratory analysis is in feature selection. In many
applications with a limited sample size and moderate to large number of features,
using a subset of features could lead to a better performance in predicting the target
than using all features. This is due to what is known as the “curse of dimensionality”,
which is discussed in detail in Chapter 10. In short, this phenomenon implies that
using more features in training not only increases the computational burden, but
also, and perhaps more importantly, could lead to a lower performance of the trained
models after adding more than a certain number of features. The process of feature
subset selection (or simply feature selection) per se is an important area of research
in machine learning and will be discussed in Chapter 10. Here, we may use the
correlation matrix presented in Fig. 5.2 to identify a subset of features such that each
feature within this subset is strongly or even moderately correlated with the response
variable (here, MEDV).
Fig. 5.2: The heatmap of pairwise correlation coefficients for all variables in the
California Housing dataset.
np.savez('data/california_train_fs_scaled', X = X_train_fs_scaled, y =
,→y_train) # to save for possible uses later
np.savez('data/california_test_fs_scaled', X = X_test_fs_scaled, y =
,→y_test)
Model Training and Evaluation: Now we are in a position to train our model.
The standard kNN regressor is implemented in the KNeighborsRegressor class
in the sklearn.neighbors module. We train a standard 50NN (Euclidean distance
and uniform weights) using the scaled training set (X_train_fs_scaled), and then
use the model to predict the target in the test set (X_test_fs_scaled). At the same
time, we create the scatter plot between the predicted targets ŷ and the actual targets.
This entire process is implemented by the following short code snippet:
from sklearn.neighbors import KNeighborsRegressor as KNN
plt.style.use('seaborn')
knn = KNN(n_neighbors=50)
knn.fit(X_train_fs_scaled, y_train)
y_test_predictions = knn.predict(X_test_fs_scaled)
plt.figure(figsize=(4.5, 3), dpi = 200)
plt.plot(y_test, y_test_predictions, 'g.', markersize=4)
lim_left, lim_right = plt.xlim()
plt.plot([lim_left, lim_right], [lim_left, lim_right], '--k',
,→linewidth=1)
Fig. 5.3: The scatter plot of predicted targets by a 50NN and the actual targets in the
California Housing dataset.

Let S_te denote the test set containing m test observations. In situations where
ŷ_i = y_i for any x_i ∈ S_te, we commit no error in our predictions. This implies that
all points in the scatter plot between predicted and actual targets should lie on the
diagonal line ŷ = y. In Fig. 5.3, we can see that, for example, when MEDV is
lower than $100k, our model is generally overestimating the target; however, for
large values of MEDV (e.g., around $500k), our model is underestimating the target.
Although such scatter plots help shed light on the behavior of the trained regressor
across the entire range of the target variable, we generally would like to summarize
the performance of the regressor using some performance metrics.
As we discussed in Chapter 4, all classifiers in scikit-learn have a score method
that given a test data and their labels, returns a measure of classifier performance.
This also holds for regressors. However, for regressors the score method estimates
the coefficient of determination, also known as the R-squared statistic, denoted R̂2, given by

\hat{R}^2 = 1 - \frac{RSS}{TSS} ,   (5.12)
where RSS and TSS are short for Residual Sum of Squares and Total Sum of Squares,
respectively, defined as
RSS = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 ,   (5.13)

TSS = \sum_{i=1}^{m} (y_i - \bar{y})^2 ,   (5.14)
and where ȳ is the target mean within the test set; that is, \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i. This means
that R̂2 measures how well our trained regressor is performing when compared with
the trivial estimator of the target, which is ȳ. In particular, for perfect predictions,
RSS = 0 and R̂2 = 1; however, when our trained regressor is performing similarly
to the trivial average estimator, we have RSS = TSS and R̂2 = 0.
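As a brief sketch (assuming the knn regressor, the test arrays, and the predictions from the code above), R̂2 can be obtained from the score method or computed directly from (5.12)-(5.14):

import numpy as np

print(knn.score(X_test_fs_scaled, y_test))         # R-squared computed by scikit-learn

rss = np.sum((y_test - y_test_predictions) ** 2)   # (5.13)
tss = np.sum((y_test - np.mean(y_test)) ** 2)      # (5.14)
print(1 - rss / tss)                               # (5.12)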
In contrast with the standard kNN regressor given in (5.8), DW-kNN regressor
computes a weighted average of targets for the k nearest neighbors where the weights
are the inverse of the distance of each nearest neighbor from the test observation
x. This causes observations that are closer to a test observation to impose a higher
influence in the weighted average. In particular, DW-kNN is given by
f(x) = \frac{1}{D} \sum_{i=1}^{k} \frac{1}{d[x_{(i)}(x),\, x]} \, y_{(i)}(x) ,   (5.15)
where D is defined in (5.4). When d[x(i) (x), x] = 1 for all k nearest neighbors,
(5.15) reduces to the standard kNN regressor given by (5.8). In scikit-learn, the DW-kNN
regressor is obtained by setting the weights parameter of KNeighborsRegressor to
'distance' (as with the classifier, the default value 'uniform' corresponds to the standard kNN).
Exercises:
class → 1 2 1 3 2 3 2 1 3
observation → 0.45 0.2 0.5 0.05 0.3 0.15 0.35 0.45 0.4
A) class 1
B) class 2
C) class 3
D) there is a tie between class 1 and 3
Exercise 3: Suppose we have the following training data where x and y denote the
value of the predictor and the target value, respectively. We train a standard 3NN
regressor using this training data. What is the prediction of y for x = 2 using the
trained 3NN?
x −3 1 −1 4 2 5
y 0 5 15 −3 3 6
A) 2 B) 1.66 C) 3 D) 4 E) 5 F) 1
Exercise 4: Suppose we trained a regressor and applied it to a test set with seven
observations to determine R̂2. The result of applying this regressor to the test set is
tabulated in the following table where y and ŷ denote the actual and the estimated
values of the target, respectively. What is R̂2 ?
A) 0.25 B) 0.35 C) 0.55 D) 0.65 E) 0.75
y 0 −1 2 3 4 −3 2
ŷ −1 −2 3 2 5 −3 4
Exercise 5: Suppose we trained a regressor and applied it to a test set with six observa-
tions to determine R̂2. The result of applying this regressor to the test set is tabulated
in the following table where y and ŷ denote the actual and the estimated values of
the target, respectively. Determine R̂2.
y 1 2 1 2 9 −3
ŷ −1 0 3 2 9 3
Fig. 5.4: The annual revenue of a company (in million US $) for 11 years.
arrow shows the observation that should be used as the target for the first window
(i.e., for the first feature vector). If possible, associate a target value to feature
vectors created in Part 1. Remove the feature vector that has no corresponding
target from the training data.
3) Training and prediction. Use the prepared training data to forecast the revenue
for 2023 using Euclidean-2NN regressor.
Fig. 5.5: Creating feature vectors by a sliding window with an overlap of one obser-
vation.
Fig. 5.6: An arrow pointing to the target value for the feature vector that is created
using observations within the window (the dashed rectangle).
Exercise 7: In Exercise 6, the sliding window size and the amount of overlap should
generally be treated as hyperparameters.
1. Redo Exercise 6 to forecast the revenue for 2023 based on the revenue in the
last four years. Determine the size of the sliding window and use an overlap of 2
observations. What is the estimated 2023 revenue based on the Euclidean-2NN
regressor?
2. Redo Part 1, but this time use the Manhattan-2NN regressor.
Chapter 6
Linear Models
Linear models are an important class of machine learning models that have been
applied in various areas including genomic classification, electroencephalogram
(EEG) classification, speech recognition, and face recognition, to name just a few. Their
popularity in various applications is mainly attributed to one or a combination of the
following factors: 1) their simple structure and training efficiency; 2) interpretability;
3) matching their complexity with that of problems that have been historically
available; and 4) a “satisficing” approach to decision-making. In this chapter, we
cover some of the most widely used linear models including linear discriminant
analysis, logistic regression, and multiple linear regression. We will discuss various
forms of shrinkage including ridge, lasso, and elastic-net in combination with logistic
regression and multiple linear regression.
There exist a number of linear models for classification and regression. Here we aim
to focus on the working principles of a few important linear models that are frequently
used in real-world applications and leave others to excellent textbooks such as (Duda
et al., 2000; Bishop, 2006; Hastie et al., 2001). We start our discussion from linear
classifiers. However, before exploring these classifiers or even defining what a linear
classifier is, it would be helpful to study the optimal classification rule. This is
because the structure of two linear classifiers that we will cover are derived from the
estimate of the optimal classifier, also known as the Bayes classifier.
In other words, the classifier ψ(x) assigns label i to x ∈ R^p if g_i(x) > g_j(x) for
all j ≠ i. The functions g_i(x) are known as discriminant functions. The classifier
then partitions the feature space R^p into c decision regions, R_1, . . . , R_c, such that
R_1 ∪ R_2 ∪ . . . ∪ R_c = R^p. That is to say, if g_i(x) > g_j(x) for all j ≠ i, then x ∈ R_i.
Using this definition, a decision boundary is the boundary between two decision
regions R_i and R_j, if any, which is given by {x | g_i(x) = g_j(x)}. In particular, for
binary classification, we can use a single discriminant function g(x) = g_1(x) − g_0(x)
and write the classifier as

\psi(x) = \begin{cases} 1 & \text{if } g(x) > 0 , \\ 0 & \text{otherwise} . \end{cases}   (6.3)

In this case, the decision boundary is

\{ x \mid g(x) = 0 \} .   (6.4)
The misclassification error rate of a classifier was defined in Section 4.9. In particular,
the error rate was defined as
\varepsilon = P(\psi(X) \neq Y) .   (6.5)
Ideally, we desire to have a classifier that has the lowest error among any possible
classifier. The question is whether we can achieve this classifier. As we will see in this
section, this classifier, which is known as the Bayes classifier, exists in theory but not
really in practice because it depends on the actual probabilistic distribution of data,
which is almost always unknown in practice. Nonetheless, derivation of the Bayes
classifier will guide us towards practical classifiers, as we will discuss. In order to
characterize the Bayes classifier, here we first start from some intuitive arguments
with which we obtain a theoretical classifier, which is later proved (in Exercise 4) to
be the Bayes classifier.
Suppose we are interested in developing a binary classifier for a given observation
x. A reasonable assumption is to set g_i(x) to P(Y = i|x), i = 0, 1, which is known as
the posterior probability and represents the probability that the class (random) variable
takes the value i for a given observation x. If this probability is greater for class
1 than for class 0, then it makes sense to assign label 1 to x. Replacing this specific
g_i(x), i = 0, 1, in (6.3) means that our classifier is:

\psi(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid x) > P(Y = 0 \mid x) , \\ 0 & \text{otherwise} . \end{cases}   (6.6)
Using Bayes' rule, the posterior probability can be written as

P(Y = i \mid x) = \frac{p(x \mid Y = i) P(Y = i)}{p(x)} .   (6.8)
In Section 4.9, we discussed the meaning of various terms that appear in (6.8).
Replacing (6.8) in (6.6) means that our classifier can be equivalently written as
\psi(x) = \begin{cases} 1 & \text{if } p(x \mid Y = 1) P(Y = 1) > p(x \mid Y = 0) P(Y = 0) , \\ 0 & \text{otherwise} , \end{cases}   (6.9)
\psi(x) = \begin{cases} 1 & \text{if } \log \frac{p(x \mid Y = 1)}{p(x \mid Y = 0)} > \log \frac{P(Y = 0)}{P(Y = 1)} , \\ 0 & \text{otherwise} . \end{cases}   (6.10)
In writing (6.10) from (6.9), we rearranged the terms and took the natural log from
both sides—taking the log(.) from both sides does not change our classifier because
it is a monotonic function. However, it is common to do so in the above step because
some important families of class-conditional densities have exponential form and
doing so makes the mathematical expressions of the corresponding classifiers easier.
As we can see here, there are different ways to write the same classifier ψ(x)—
the forms presented in (6.6), (6.7), (6.9), and (6.10) all present the same classifier.
Although this classifier was developed intuitively in (6.6), it turns out that for binary
classification, it is the best classifier in the sense that it achieves the lowest misclas-
sification error given in (6.5). The proof is slightly long and is left as an exercise
(Exercise 4). Therefore, we refer to the classifier defined in (6.6) (or its equiva-
lent forms) as the optimal classifier (also known as the Bayes classifier). Extension
of this to multiclass classification with c classes is obtained by simply replacing
g_i(x), i = 0, . . . , c − 1, by P(Y = i|x) in (6.1); that is to say, the Bayes classifier for
multiclass classification with c classes is

\psi(x) = \operatorname{argmax}_{i} \, P(Y = i \mid x) .   (6.11)
6.2 Linear Models for Classification

Linear models for classification are models for which the decision boundaries are
linear functions of the feature vector x (Bishop, 2006, p.179), (Hastie et al., 2001,
p.101). There are several important linear models for classification, including linear
discriminant analysis, logistic regression, perceptron, and linear support vector ma-
chines; however, here we discuss the first two models, not only to be able to present
them in detail but also because:
• these two models are quite efficient to train (i.e., have fast training); and
• oftentimes, applying a proper model selection on the first two would lead to a
result that is “comparable” to the situation where a model selection is performed
over all four. That being said, in the light of the no free lunch theorem, there is
no reason to favor one classifier over another regardless of the context (specific
problem) (Duda et al., 2000).
The theoretical aspects discussed below for each model have two purposes:
• to better understand the working principle of each model; and
• to better understand some important parameters used in scikit-learn implemen-
tation of each model.
The first assumption in the development of LDA is that the class-conditional densities
are Gaussian; that is,

p(x \mid Y = i) = \frac{1}{(2\pi)^{p/2} |\Sigma_i|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}, \quad i = 0, 1, \ldots, c - 1,   (6.12)
where c is the number of classes, and µ_i and Σ_i are the class-specific mean vector (a
vector consisting of the mean of each variable in x) and the covariance matrix between
all p variables in x, respectively. The second assumption in the development of LDA is
that Σ ≜ Σ_i, ∀i, which means that there is a common covariance matrix Σ across all
classes. Of course, this is only an assumption and it is not necessarily true in practice;
however, this assumption makes the form of the classifier simpler (in fact, linear).
Consider the binary classification with i = 0, 1. Using the assumption Σ ≜ Σ_0 =
Σ_1 in (6.10), some standard algebraic simplifications yield (see Exercise 6):

\psi(x) = \begin{cases} 1 & \text{if } \left( x - \frac{\mu_0 + \mu_1}{2} \right)^T \Sigma^{-1} (\mu_1 - \mu_0) + \log \frac{P(Y=1)}{1 - P(Y=1)} > 0 , \\ 0 & \text{otherwise} . \end{cases}   (6.13)
This means that the LDA classifier is the Bayes classifier when class-conditional
densities are Gaussian with a common covariance matrix. The expression of LDA
in (6.13) is in the form of (6.3) where
g(x) = \left( x - \frac{\mu_0 + \mu_1}{2} \right)^T \Sigma^{-1} (\mu_1 - \mu_0) + \log \frac{P(Y=1)}{1 - P(Y=1)} .   (6.14)
From (6.4) and (6.13) the decision boundary of the LDA classifier is given by
\{ x \mid a^T x + b = 0 \} ,   (6.15)

where

a = \Sigma^{-1} (\mu_1 - \mu_0) ,   (6.16)

b = -\left( \frac{\mu_0 + \mu_1}{2} \right)^T \Sigma^{-1} (\mu_1 - \mu_0) + \log \frac{P(Y=1)}{1 - P(Y=1)} .   (6.17)
From (6.15)-(6.17), it is clear that the LDA classifier is a linear model for classifica-
tion because its decision boundary is a linear function of x.
Some Properties of the Linear Decision Boundary: The set of x that satisfies
(6.15) forms a hyperplane with its orientation and location uniquely determined by
a and b. Specifically, the vector a is normal to the hyperplane. To see this, consider
a vector u from one arbitrary point x1 on the hyperplane to another point x2 ; that
is, u = x2 − x1 . Because both x1 and x2 are on the hyperplane, then aT x1 + b = 0
and aT x2 + b = 0. Therefore, aT (x2 − x1 ) = aT u = 0, which implies a is normal to
any vector u on the hyperplane. This is illustrated by Fig. 6.1a. This figure shows the
decision boundary of the LDA classifier for a two-dimensional binary classification
problem where classes are normally distributed with means µ_0 = [4, 4]^T and
µ_1 = [−1, 0]^T, and a common covariance matrix given by

\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} ,
and P(Y = 1) = 0.5. From (6.14)-(6.17), the decision boundary in this example is
determined by x = [x1, x2 ]T that satisfies −5x1 − x2 + 9.5 = 0. This is the solid
line that passes through the filled circle, which is the midpoint between the population
means; that is, (µ_0 + µ_1)/2. This is the case because P(Y = 1) = 0.5, which implies
\log \frac{P(Y=1)}{1-P(Y=1)} = 0, and, therefore,

\left( \frac{\mu_0 + \mu_1}{2} - \frac{\mu_0 + \mu_1}{2} \right)^T \Sigma^{-1} (\mu_1 - \mu_0) = 0 .
Also note that the vector a = \Sigma^{-1}(\mu_1 - \mu_0) = [−5, −1]^T is perpendicular to the linear
decision boundary. This vector is not, in general, in the direction of the line connecting the
population means (the dashed line in Fig. 6.1a) unless \Sigma^{-1}(\mu_1 - \mu_0) = k(\mu_1 - \mu_0)
for some scalar k, which is the case if \mu_1 - \mu_0 is an eigenvector of \Sigma^{-1} (and, therefore,
an eigenvector of \Sigma).
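These quantities can also be verified numerically; the following short NumPy check reproduces a = [−5, −1]^T and the intercept b = 9.5 for this example:

import numpy as np

mu0, mu1 = np.array([4.0, 4.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.0], [0.0, 4.0]])
Sigma_inv = np.linalg.inv(Sigma)

a = Sigma_inv @ (mu1 - mu0)                        # gives [-5., -1.]
b = -0.5 * (mu0 + mu1) @ Sigma_inv @ (mu1 - mu0)   # gives 9.5 (the log-prior term is 0 here)
print(a, b)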
Suppose in the same binary classification example, we intend to classify three
observations: x_A = [4, −2]^T, x_B = [0.2, 5]^T, and x_C = [1.1, 4]^T. Replacing x_A, x_B, and x_C in the discriminant function −5x_1 − x_2 + 9.5 gives −8.5, 3.5, and 0, respectively.
Therefore, from (6.13), x A and xB are classified as 0 (blue in the figure) and 1 (red in
the figure), respectively, and xC lies on the decision boundary and could be randomly
assigned to one of the classes. Fig. 6.1b-6.1d illustrate the working mechanism of
the LDA classifier in classifying these three sample points. Due to the first part of the discriminant function in (6.13) (i.e., x − (µ_0 + µ_1)/2), x_A, x_B, and x_C are moved by −(µ_0 + µ_1)/2 to points x'_A, x'_B, and x'_C, respectively. Then, in order to classify, the inner product of each of these vectors with a is computed; these inner products are negative, positive, and zero, which lead to the same class labels as discussed before.
Sample LDA Classifier: The main problem with using classifier (6.13) in prac-
tice is that the discriminant function in (6.13) depends on the actual distributional
parameters µ 0 , µ 1 , and Σ that are almost always unknown unless we work with
synthetically generated data for which we have the underlying parameters that were
used to generate the data in the first place.
Given training data S_tr = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} and with no prior information on the distributional parameters used in (6.13), one can replace the unknown distributional parameters by their sample estimates to obtain the sample LDA classifier given by
ψ(x) = 1  if (x − (µ̂_0 + µ̂_1)/2)^T Σ̂^{−1}(µ̂_1 − µ̂_0) + log( P(Y=1)/(1−P(Y=1)) ) > 0,
       0  otherwise,   (6.18)
where µ̂ i is the sample mean for class i and Σ̂ is the pooled sample covariance
matrix, which are given by (ni is the number of observations in class i)
µ̂_i = (1/n_i) Σ_{j=1}^{n_i} x_j ,   (6.19)

Σ̂ = 1/(n_0 + n_1 − 2) Σ_{i=0}^{1} Σ_{j=1}^{n} (x_j − µ̂_i)(x_j − µ̂_i)^T I_{y_j = i} ,   (6.20)
Fig. 6.1: The working mechanism of the LDA classifier. The blue and red ellipses show the areas of equal probability density of the Gaussian population for class 0 and 1, respectively. The filled circle identifies the midpoint (µ_0 + µ_1)/2. (a): vector a = Σ^{−1}(µ_1 − µ_0) = [−5, −1]^T is perpendicular to the linear decision boundary, which is expressed as −5x_1 − x_2 + 9.5 = 0. At the same time, the decision boundary passes through the midpoint (µ_0 + µ_1)/2 because P(Y = 1) = 0.5; (b)-(d): x_A, x_B, and x_C are moved in the opposite direction of vector (µ_0 + µ_1)/2 to points x'_A, x'_B, and x'_C, respectively. The positivity or negativity of the inner products of these vectors with vector a is determined by the angles between them. Because θ_A > 90° (obtuse angle), θ_B < 90° (acute angle), and θ_C = 90° (right angle), x_A, x_B, and x_C are classified to class 0 (blue), 1 (red), and either one (could be randomly chosen), respectively.
As long as the meaning is clear from the context, we may refer to both (6.13) and (6.18) as the LDA classifier.
Similar to (6.14)-(6.17), we conclude that the decision boundary of the sample
LDA classifier forms a hyperplane aT x + b = 0 where
a = Σ̂^{−1}(µ̂_1 − µ̂_0),   (6.21)

and

b = −((µ̂_0 + µ̂_1)/2)^T Σ̂^{−1}(µ̂_1 − µ̂_0) + log( P(Y = 1)/(1 − P(Y = 1)) ).   (6.22)
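As a sketch of how (6.19)-(6.22) translate into code (this helper is not from the book and assumes equal class priors, so the log-odds term vanishes):

import numpy as np

def sample_lda(X, y):
    # X is an n x p data matrix and y holds labels in {0, 1}
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    mu0_hat, mu1_hat = X0.mean(axis=0), X1.mean(axis=0)      # (6.19)
    S0 = (X0 - mu0_hat).T @ (X0 - mu0_hat)
    S1 = (X1 - mu1_hat).T @ (X1 - mu1_hat)
    Sigma_hat = (S0 + S1) / (n0 + n1 - 2)                    # (6.20): pooled covariance
    a = np.linalg.solve(Sigma_hat, mu1_hat - mu0_hat)        # (6.21)
    b = -((mu0_hat + mu1_hat) / 2) @ a                       # (6.22) with P(Y=1) = 0.5
    return a, b

# a test point x is then classified to class 1 when a @ x + b > 0, cf. (6.18)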
Fig. 6.2a shows the sample LDA classifier for two sets of samples drawn from the
same Gaussian density functions used in Fig. 6.1. Here 20 sample points are drawn
from each Gaussian population. This data is then used in (6.19) and (6.20) to estimate
the decision boundary used in the LDA classifier (6.18). The linear decision boundary
is characterized as −3.28 x1 −0.92 x2 +7.49 = 0 (dotted line) and is different from the
optimal decision boundary, which is −5x1 − x2 + 9.5 = 0 (solid line). Fig. 6.2b shows
the sample LDA classifier for the same example except that we increase the number
of sample points drawn from each population to 1500. The decision boundary in this
case is −5.20 x1 −0.90 x2 +9.67 = 0 and it appears to be closer to the optimal decision
boundary. However, rather than discussing closeness of the decision boundaries, it
is easier to discuss the behavior of the rule in terms of its performance. As the number of samples drawn from the two Gaussian distributions increases unboundedly (i.e., n → ∞), the error rate of the sample LDA classifier converges (in probability) to the error rate of the Bayes rule (6.13), a property known as the statistical consistency of the classifier. Nevertheless, the precise characterization of this property for the LDA classifier or other rules is beyond the scope of our discussion and interested readers are encouraged to consult other sources such as (Devroye et al., 1996) and (Braga-Neto, 2020).
In the development of the LDA classifier, we assumed Σ ≜ Σ_0 = Σ_1. This assumption led us to use the pooled sample covariance matrix to estimate Σ. However, if we assume Σ_0 ≠ Σ_1, it does not make sense anymore to use a single estimate for both Σ_0 and Σ_1. In these situations, we use separate estimates of the class covariance matrices in the expression of the Bayes classifier, which is itself developed by assuming Σ_0 ≠ Σ_1. The resultant classifier is known as quadratic discriminant analysis (QDA) and has nonlinear decision boundaries (see Exercise 7).
Remarks on Class Prior Probabilities: The discriminant function of the LDA
classifier (6.18) explicitly depends on the prior probabilities P(Y = i), i = 0, 1.
However, in order to properly use such prior probabilities in training LDA (and other
classifiers that explicitly depend on them), we need to distinguish between two types
of sampling procedure: random sampling and separate sampling.
Fig. 6.2: The decision boundary of the sample LDA classifier for various training
samples drawn from the two Gaussian populations used for plots in Fig. 6.1. The
sample points are shown by orange or blue dots depending on their class. (a): 20
sample points are drawn from each Gaussian population. Here the decision boundary
is characterized as −3.28 x1 − 0.92 x2 + 7.49 = 0; and (b) 1500 sample points are
drawn from each Gaussian population. Here the decision boundary is characterized
as −5.20 x1 − 0.90 x2 + 9.67 = 0. The decision boundary in this case appears to be
“closer” to the optimal decision boundary identified by the solid line.
In the multiclass case, the discriminant function used by the sample LDA classifier for class i can be written as

g_i(x) = µ̂_i^T Σ̂^{−1} x − (1/2) µ̂_i^T Σ̂^{−1} µ̂_i + log P(Y = i),   (6.23)
where µ̂_i is defined in (6.19) and Σ̂ is given by

Σ̂ = 1/(n − c) Σ_{i=0}^{c−1} Σ_{j=1}^{n} (x_j − µ̂_i)(x_j − µ̂_i)^T I_{y_j = i},   (6.24)

where n = Σ_{i=0}^{c−1} n_i is the total sample size across all classes.
Scikit-learn implementation: In scikit-learn, the LDA classifier is implemented by the LinearDiscriminantAnalysis class from the sklearn.discriminant_analysis module. However, to achieve a representation of LDA similar to (6.23), one needs to change the default value of the solver parameter from 'svd' (as of scikit-learn version 1.2.2) to 'lsqr', which solves the following system of linear equations to find a in a least squares sense:

Σ̂ a = µ̂_i.   (6.25)
At the same time, note that the LDA discriminant (6.23) can be written as

g_i(x) = a^T x − (1/2) a^T µ̂_i + log P(Y = i),   (6.26)

where

a = Σ̂^{−1} µ̂_i.   (6.27)
In other words, using 'lsqr', (6.25) is solved in a way that avoids inverting the matrix Σ̂ (rather than directly solving (6.27)). Note that by doing so, the LDA classifier can even be applied to cases where the sample size is less than the feature size. In these situations, (6.27), which is indeed the direct formulation of the LDA classifier, is not feasible because Σ̂ is singular, but we can still find a “special” solution from (6.25).
Many models in scikit-learn have two attributes, coef_ (or sometimes coefs_,
for example, for multi-layer perceptron that we discuss later) and intercept_ (or
intercepts_) that store their coefficients and intercept (also known as bias term).
The best way to identify such attributes, if they exist at all for a specific model, is to
look into the scikit-learn documentation. For example, in case of LDA, for a binary
classification problem, the coefficients vector and intercept are stored in coef_ of
shape (feature size, ) and intercept_, respectively; however, for multiclass
classification, coef_ has a shape of (class size, feature size).
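As an illustration (not from the book), the following sketch draws samples from the two Gaussian populations used in Fig. 6.1 and fits LinearDiscriminantAnalysis with solver='lsqr'; with equal priors, coef_ and intercept_ should come out close to the optimal a = [−5, −1]^T and b = 9.5:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
n = 1500
cov = np.array([[1.0, 0.0], [0.0, 4.0]])
X = np.vstack((rng.multivariate_normal([4, 4], cov, size=n),
               rng.multivariate_normal([-1, 0], cov, size=n)))
y = np.hstack((np.zeros(n), np.ones(n)))

lda = LinearDiscriminantAnalysis(solver='lsqr').fit(X, y)
print(lda.coef_, lda.intercept_)   # roughly [-5, -1] and 9.5, up to sampling noise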
More on solver: A question that remains unanswered is what it means to find a solution to (6.25) when Σ̂ is not invertible. In general, when a system of linear equations is underdetermined (e.g., when Σ̂ is singular, which means we have infinitely many solutions), we need to rely on techniques that designate a solution as “special”. Using 'lsqr' calls scipy.linalg.lstsq, which by default is based on the DGELSD routine from LAPACK, written in Fortran. DGELSD itself finds the minimum-norm solution to the problem. This is a unique solution that minimizes ||Σ̂a − µ̂_i||_2 using the singular value decomposition (SVD) of matrix Σ̂. This solution exists even for cases where Σ̂ is singular. However, it becomes the same as Σ̂^{−1} µ̂_i if Σ̂ is full rank.
The default value of solver parameter in LinearDiscriminantAnalysis is svd.
The use of SVD here should not be confused with the aforementioned use of SVD
for 'lsqr'. With the 'lsqr' option, the SVD of Σ̂ is used; however, the SVD used
along with 'svd' is primarily applied to the centered data matrix and dimensions
with singular values greater than 'tol' (current default 0.0001)—dimensions with
singular values less than 'tol' are discarded. This way, the scikit-learn implementa-
tion avoids even computing the covariance matrix per se. This could be an attractive
choice in high-dimensional situations where p > n. The svd computation here relies
on scipy.linalg.svd, which is based on the DGESDD routine from LAPACK. Last but not least, the connection between Python and the Fortran subroutines is created by the f2py module, which can be used to create a Python module containing automatically generated wrapper functions for Fortran routines. For more details see (f2py, 2023).
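A small illustration of the minimum-norm behavior discussed above (a toy singular system, not from the book):

import numpy as np
from scipy.linalg import lstsq

# a singular matrix: A x = b has infinitely many solutions (x1 + x2 = 2)
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([2.0, 2.0])

x, residues, rank, sv = lstsq(A, b)   # default lapack_driver='gelsd' (DGELSD)
print(x, rank)                        # minimum-norm solution [1., 1.], rank 1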
——————————————————————————————-
The linearity of decision boundaries in LDA was a direct consequence of the Gaussian assumption about class-conditional densities. Another way to achieve linear decision boundaries is to assume that some monotonic transformation of the posterior P(Y = i|x) is a linear function of x. A common choice of such a transformation is the logit function,
logit(p) = log( p/(1 − p) ).   (6.28)
The ratio inside the log function in (6.28) is known as the odds (and another name for the logit function is log-odds). For example, if in a binary classification problem the probability of having an observation from one class is p = 0.8, then the odds of having an observation from that class is 0.8/0.2 = 4.
Let us consider a function known as logistic sigmoid function σ(x) given by
σ(x) = 1/(1 + e^{−x}).   (6.29)
Using logit(p) as the argument of σ(x) leads to
σ(logit(p)) = 1/(1 + e^{−logit(p)}) = 1/(1 + e^{−log(p/(1−p))}) = p,   (6.30)
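A quick numerical check of (6.28)-(6.30) (a small sketch, not from the book):

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.8, 0.99])
print(sigmoid(logit(p)))    # recovers p: the sigmoid inverts the logit
print(np.exp(logit(0.8)))   # odds of 0.8/0.2 = 4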
which means the logistic sigmoid function is the inverse of the logit function. In logistic regression, we assume the logit of the posterior is a linear function of x; that is,

logit( P(Y = 1|x) ) = log( P(Y = 1|x)/(1 − P(Y = 1|x)) ) = a^T x + b.   (6.31)
Replacing p and logit(p) with P(Y = 1|x) and aT x + b in (6.30), respectively, yields
P(Y = 1|x) = 1/(1 + e^{−(a^T x + b)}) = e^{a^T x + b}/(1 + e^{a^T x + b}),   (6.32)
which is always between 0 and 1 as it should be (because the posterior is a probability). In addition, because P(Y = 0|x) + P(Y = 1|x) = 1, we have
P(Y = 0|x) = e^{−(a^T x + b)}/(1 + e^{−(a^T x + b)}) = 1/(1 + e^{a^T x + b}).   (6.33)
We can write (6.32) and (6.33) in the following compact form:

P(Y = i|x) = 1/(1 + e^{(−1)^i (a^T x + b)}) = P(Y = 1|x)^i (1 − P(Y = 1|x))^{1−i},   i = 0, 1.   (6.34)
Classifying x to the class with the larger posterior probability then yields

ψ(x) = 1  if 1/(1 + e^{−(a^T x + b)}) − e^{−(a^T x + b)}/(1 + e^{−(a^T x + b)}) > 0,
       0  otherwise,   (6.35)

or, equivalently,

ψ(x) = 1  if 1/(1 + e^{−(a^T x + b)}) > 1/2,
       0  otherwise,   (6.36)
which is equivalent to
ψ(x) = 1  if a^T x + b > 0,
       0  otherwise.   (6.37)
As seen in (6.37), the decision boundary of the binary classifier defined by the
logistic regression is linear and, therefore, this is a linear classifier as well. What
remains is to estimate the unknown parameters a and b and replace them in (6.37)
to have a functioning classifier.
A common approach to estimating the unknown parameters in logistic regression is to maximize the log-likelihood of observing the labels given the observations as a function of a and b. Let β ≜ [b, a^T]^T denote the (p+1)-dimensional column vector of unknown parameters. With the assumption of having independent data, we can write

P(Y_1 = y_1, . . . , Y_n = y_n |x_1, . . . , x_n ; β) = Π_{j=1}^{n} P(Y_j = y_j |x_j ; β) = Π_{j=1}^{n} P(Y_j = 1|x_j ; β)^{y_j} (1 − P(Y_j = 1|x_j ; β))^{1−y_j},   (6.38)

where y_j = 0, 1, the last equality follows from the compact form specified in (6.34), and β in P(Y_1 = y_1, . . . , Y_n = y_n |x_1, . . . , x_n ; β) shows that this probability is indeed a function of the unknown parameters. This maximization approach is an example of a general method known as maximum likelihood estimation because we are trying to estimate values of the unknown parameters that maximize the likelihood (6.38).
Maximizing the logarithm of (6.38) is equivalent to minimizing the scaled negative log-likelihood

e(β) = −(1/n) Σ_{j=1}^{n} [ y_j log P(Y_j = 1|x_j ; β) + (1 − y_j) log(1 − P(Y_j = 1|x_j ; β)) ],   (6.39)
Replacing a and b in (6.37) with the estimates â and b̂ obtained by minimizing (6.39) yields the trained classifier

ψ(x) = 1  if â^T x + b̂ > 0,
       0  otherwise.   (6.40)
Although it is common in the literature to see labels 0 and 1 when the loss function (6.39) is used, there is a more compact form of the loss function of logistic regression if we assume labels are y_j ∈ {−1, 1}. First note that, similar to the compact representation of posteriors in (6.34), we can write
P(Y_j = y_j |x_j) = 1/(1 + e^{−y_j (a^T x_j + b)}),   y_j = −1, 1.   (6.41)
Using (6.41), we can write (6.39) in the following compact form where y j ∈ {−1, 1}:
e(β) = (1/n) Σ_{j=1}^{n} log( 1 + e^{−y_j (a^T x_j + b)} ).   (6.42)
We emphasize that using (6.39) with y j ∈ {0, 1} is equivalent to using (6.42) with
y j ∈ {−1, 1}. Therefore, these choices of labeling classes do not affect estimates of
a and b, and the corresponding classifier.
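The following sketch (not from the book; randomly generated data and arbitrary parameters) verifies this equivalence numerically:

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
y01 = rng.integers(0, 2, size=n)         # labels in {0, 1}
ypm = 2 * y01 - 1                        # the same labels in {-1, +1}
a, b = rng.normal(size=p), 0.3           # arbitrary parameter values

z = X @ a + b
P1 = 1 / (1 + np.exp(-z))                # P(Y=1|x) as in (6.32)

loss_01 = -np.mean(y01 * np.log(P1) + (1 - y01) * np.log(1 - P1))   # (6.39)
loss_pm = np.mean(np.log(1 + np.exp(-ypm * z)))                     # (6.42)
print(np.isclose(loss_01, loss_pm))      # True: the two forms are identical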
An important and rather confusing point about logistic regression is that its main utility is in classification, rather than regression. By definition, we are trying to approximate and estimate numeric posterior probabilities, so it is no surprise that it is called logistic regression; however, in terms of utility, we are almost always interested in the classifier obtained from these posteriors. Logistic regression can also be extended to multiclass classification with c classes by modeling the posteriors of the first c − 1 classes as
P(Y = i|x) = e^{a_i^T x + b_i} / ( 1 + Σ_{k=0}^{c−2} e^{a_k^T x + b_k} ),   i = 0, 1, . . . , c − 2,   (6.43)
and because the posterior probabilities should add up to 1, we can obtain the posterior
probability of class c − 1 as
P(Y = c − 1|x) = 1 / ( 1 + Σ_{k=0}^{c−2} e^{a_k^T x + b_k} ).   (6.44)
With these forms of posterior probabilities one can extend the loss function
to multiclass and then use some iterative optimization methods to estimate the
parameters of the model (i.e., ai and bi, ∀i). The classifier is then obtained by using
discriminants (6.43) and (6.44) in (6.1). Equivalently, we can set gi (x) = aTi x+ bi, i =
0, . . . , c − 2, and gc−1 (x) = 0, and use them as discriminant functions in (6.1)—this
is possible by noting that the denominators of all posteriors presented in (6.43) and
(6.44) are the same and do not matter, and then taking the natural logarithm from
all numerators. This latter form better shows the linearity of this classifier. As this classifier is specified by a series of approximations made on the posterior probabilities P(Y = i|x), which add up to 1 and together resemble a multinomial distribution, it is known as multinomial logistic regression.
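A small sketch of these discriminants (hypothetical parameter values; not from the book), using g_i(x) = a_i^T x + b_i for i = 0, . . . , c − 2 and g_{c−1}(x) = 0:

import numpy as np

def multinomial_posteriors(x, A, b):
    # A has shape (c-1, p) and b has shape (c-1,); class c-1 is the reference class
    g = np.append(A @ x + b, 0.0)        # g_i(x), with g_{c-1}(x) = 0
    e = np.exp(g)
    return e / e.sum()                   # exactly (6.43)-(6.44)

# a hypothetical 3-class problem with p = 2 features
A = np.array([[1.0, -0.5],
              [0.2,  0.3]])
b = np.array([0.1, -0.2])
x = np.array([0.5, 1.0])
post = multinomial_posteriors(x, A, b)
print(post, post.sum())                  # posteriors add up to 1; predicted class = argmax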
——————————————————————————————-
Regularization: In practical applications, it is often helpful to consider a penalized
form of the loss function indicated in (6.42); that is, to use a loss function that
constrains the magnitude of coefficients in a (i.e., penalizes possible large coefficients
to end up with smaller coefficients). Because we are trying to regularize or shrink
estimated coefficients, this approach is known as regularization or shrinkage. There
are various forms of shrinkage penalty. One common approach is to minimize
e(β) = C Σ_{j=1}^{n} log( 1 + e^{−y_j (a^T x_j + b)} ) + (1/2)||a||_2^2,   (6.45)

where ||a||_2 is known as the l2 norm of a = [a_1, a_2, ..., a_p]^T, given by ||a||_2 = sqrt( Σ_{k=1}^{p} a_k^2 ) = sqrt(a^T a), and C is a tuning parameter, which shows the relative importance of these
two terms on estimating the parameters. As C → ∞, the l2 penalty term has no
effect on the minimization and the estimates obtained from (6.45) become similar
to those obtained from (6.42); however, as C → 0, the coefficients in a are shrunk
to zero. This means the higher C, the less regularization we expect, which, in turn,
implies a stronger fit on the training data. The optimal choice of C though is generally
estimated (i.e., C is tuned) in the model selection stage. Because here we are using
the l2 norm of a, minimizing (6.45) is known as l2 regularization. The logistic
regression trained using this objective function is known as logistic regression with
l2 penalty (also known as ridge penalty). There are other types of useful shrinkage
such as l1 regularization and elastic-net regularization. The l1 regularization in
logistic regression is achieved by minimizing
e(β) = C Σ_{j=1}^{n} log( 1 + e^{−y_j (a^T x_j + b)} ) + ||a||_1,   (6.46)

where ||a||_1 = Σ_{k=1}^{p} |a_k|. Elastic-net regularization, on the other hand, minimizes

e(β) = C Σ_{j=1}^{n} log( 1 + e^{−y_j (a^T x_j + b)} ) + ν||a||_1 + ((1 − ν)/2)||a||_2^2,   (6.47)

where ν ∈ [0, 1] controls the compromise between the l1 and l2 penalties.
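In scikit-learn, these three penalties correspond to the penalty parameter of LogisticRegression; a brief sketch with assumed values of C and l1_ratio (note that each penalty is supported only by certain solvers):

from sklearn.linear_model import LogisticRegression

lr_l2 = LogisticRegression(penalty='l2', C=1.0)                         # ridge penalty (the default)
lr_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')     # l1 (lasso) penalty
lr_en = LogisticRegression(penalty='elasticnet', C=1.0, l1_ratio=0.5,
                           solver='saga')                               # elastic-net penalty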
Example 6.1 In this example, we would like to train a logistic regression with l2
regularization (LRR) for Iris flower classification using two features: sepal width
and petal length. For this purpose, we first load the preprocessed Iris dataset that was
prepared in Section 4.6:
import numpy as np
arrays = np.load('data/iris_train_scaled.npz')
X_train = arrays['X']
y_train = arrays['y']
arrays = np.load('data/iris_test_scaled.npz')
X_test = arrays['X']
y_test = arrays['y']
print('X shape = {}'.format(X_train.shape) + '\ny shape = {}'.
,→format(y_train.shape))
print('X shape = {}'.format(X_test.shape) + '\ny shape = {}'.
,→format(y_test.shape))
X shape = (120, 4)
y shape = (120,)
X shape = (30, 4)
y shape = (30,)
For illustration purposes, here we only use the second and third features (sepal width
and petal length) of the data to develop and test our classifiers:
X_train = X_train[:,[1,2]]
X_test = X_test[:,[1,2]]
X_train.shape
(120, 2)
We use the technique used in Section 5.1.1 to draw the decision boundaries of logistic regression with l2 regularization:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression as LRR
color = ('aquamarine', 'bisque', 'lightgrey')
cmap = ListedColormap(color)
# construct a grid over the feature space for plotting decision regions
mins = X_train.min(axis=0) - 0.1
maxs = X_train.max(axis=0) + 0.1
x = np.arange(mins[0], maxs[0], 0.01)
y = np.arange(mins[1], maxs[1], 0.01)
X, Y = np.meshgrid(x, y)
coordinates = np.array([X.ravel(), Y.ravel()]).T
fig, axs = plt.subplots(1, 2, figsize=(6, 3), dpi=200)
fig.tight_layout()
C_val = [0.01, 100]  # the two values of C shown in Fig. 6.3
for ax, C in zip(axs.ravel(), C_val):
    lrr = LRR(C=C)  # logistic regression with l2 penalty (the default)
    lrr.fit(X_train, y_train)
    Z = lrr.predict(coordinates)
    Z = Z.reshape(X.shape)
    ax.tick_params(axis='both', labelsize=6)
    ax.set_title('LRR Decision Regions: C=' + str(C), fontsize=8)
    ax.pcolormesh(X, Y, Z, cmap = cmap, shading='nearest')
    ax.contour(X ,Y, Z, colors='black', linewidths=0.5)
    ax.plot(X_train[y_train==0, 0], X_train[y_train==0, 1],'g.', markersize=4)
    ax.plot(X_train[y_train==1, 0], X_train[y_train==1, 1],'r.', markersize=4)
    ax.plot(X_train[y_train==2, 0], X_train[y_train==2, 1],'k.', markersize=4)
    ax.set_xlabel('sepal width (normalized)', fontsize=7)
    ax.set_ylabel('petal length (normalized)', fontsize=7)
Fig. 6.3: The scatter plot of the normalized Iris dataset for two features and decision
regions for logistic regression with l2 regularization (LRR) for C = 0.01 (left) and
C = 100 (right). The green, red, and black points show points corresponding to
setosa, versicolor, and virginica Iris flowers, respectively. The LRR decision regions
for each class are colored similarly.
r_1 ≜ odds = P(Y = 1|x̃_1)/P(Y = 0|x̃_1) = e^{a_1(x_1+1) + a_2 x_2 + ... + a_p x_p + b},   (6.49)

where x̃_1 = [x_1 + 1, x_2, . . . , x_p]^T. From (6.48) and (6.49), the odds ratio becomes r_1/r_0 = e^{a_1}. Similarly, for a one unit increase in the value of an arbitrary feature x_i,

r_i ≜ odds = P(Y = 1|x̃_i)/P(Y = 0|x̃_i) = e^{a_1 x_1 + ... + a_i(x_i+1) + ... + a_p x_p + b},   (6.51)

where x̃_i = [x_1, . . . , x_i + 1, . . . , x_p]^T. We can also define the relative change in odds (RCO) as a function of a one unit increase in x_i:

RCO% = (r_i − r_0)/r_0 × 100 = (r_0 e^{a_i} − r_0)/r_0 × 100 = (e^{a_i} − 1) × 100.   (6.52)
As an example, suppose we have trained a logistic regression classifier where a2 =
−0.2. This means a one unit increase in x2 leads to an RCO of -18.1% (decreases
odds of Y = 1 by 18.1%).
Although (6.51) and (6.52) were obtained for a one unit increase in x_i, we can extend this to any arbitrary change. In general, a k unit increase (a negative k points to a decrease) in the value of x_i leads to:

RCO% = (r_i − r_0)/r_0 × 100 = (e^{k a_i} − 1) × 100.   (6.53)
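A one-line helper for (6.53) (not from the book; the values below are only illustrative):

import numpy as np

def rco_percent(a_i, k=1):
    """Relative change in odds (6.53) for a k unit change in feature x_i."""
    return (np.exp(k * a_i) - 1) * 100

print(rco_percent(-0.2))      # about -18.1%, as in the example above
print(rco_percent(-0.2, 2))   # about -33.0% for a two unit increase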
Example 6.2 In this example, we work with genomic data (gene expressions) taken from patients who were affected by oral leukoplakia. Data was obtained from Gene Expression Omnibus (GEO) with accession # GSE26549. The data includes 19897 features (19894 genes and three binary clinical variables) and 86 patients with a median follow-up of 7.11 years. Thirty-five individuals (35/86; 40.7%) developed oral cancer (OC) over time and the rest did not. The data matrix is stored in a file named “GenomicData_OralCancer.txt” and the variable names (one name for each column) are stored in “GenomicData_OralCancer_var_names.txt”; both of these files are available in the “data” folder. The goal is to build a classifier to classify those who developed OC from those who did not. The outcomes are stored in the column named “oral_cancer_output” (-1 for OC patients and 1 for non-OC patients). For this purpose, we use a logistic regression with l1 regularization and C = 16.
# load the dataset
import numpy as np
import pandas as pd
data = pd.read_csv('data/GenomicData_OralCancer.txt', sep=" ",
,→header=None)
data.head()
0 1 2 3 4 5 6 7 8 9 ... \
0 4.44 8.98 5.58 6.89 6.40 6.35 7.12 6.87 7.18 7.81 ...
1 4.59 8.57 6.57 7.25 6.44 6.34 7.40 6.91 7.18 8.12 ...
2 4.74 8.80 6.22 7.13 6.79 6.08 7.42 6.93 7.48 8.82 ...
3 4.62 8.77 6.32 7.34 6.29 5.65 7.12 6.89 7.27 7.18 ...
4 4.84 8.81 6.51 7.16 6.12 5.99 7.13 6.85 7.21 7.97 ...
19888 19889 19890 19891 19892 19893 19894 19895 19896 19897
0 6.08 5.49 12.59 11.72 8.99 10.87 1 0 1 1
1 6.17 6.08 13.04 11.36 8.96 11.03 1 0 0 1
2 6.39 5.99 13.29 11.87 8.63 10.87 1 1 0 -1
3 6.32 5.69 13.33 12.02 8.86 11.08 0 1 1 1
4 6.57 5.59 13.22 11.87 8.89 11.15 1 1 0 1
header = pd.read_csv('data/GenomicData_OralCancer_var_names.txt',
,→header=None)
data.columns=header.iloc[:,0]
data.loc[:,"oral_cancer_output"] = data.loc[:,"oral_cancer_output"]*-1
,→# to encode the positive class (oral cancer) as 1
data.head()
y_train = data.oral_cancer_output
X_train = data.drop('oral_cancer_output', axis=1)
from sklearn.preprocessing import StandardScaler  # import added so this snippet is self-contained
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
lrr = LRR(penalty = 'l1', C=16, solver = 'liblinear', random_state=42)
lrr.fit(X_train, y_train)
coeffs = lrr.coef_.ravel()  # assumed here: coefficient vector of the fitted binary classifier, shape (feature size,)
np.sum(coeffs != 0)
292
Therefore, there are 292 features (here all are genes) with non-zero coefficients
in our linear logistic regression classifier. In other words, the use of l1 regularization
with logistic regression leads to 0 coefficients for 19606 features. This is the con-
sequence of the internal feature selection mechanism of logistic regression with l1
regularization that was mentioned in Section 6.2.2. This type of feature selection is
also known as embedded feature selection (more on this in Chapter 10).
Next we sort the identified 292 features based on their odds ratio:
non_zero_coeffs = coeffs[coeffs!= 0]
ORs=np.exp(non_zero_coeffs) #odds ratios
sorted_args = ORs.argsort()
ORs_sorted=ORs[sorted_args]
ORs_sorted[:10]
feature_names = header.iloc[:-1,0].values
selected_features = feature_names[coeffs!= 0]
selected_features_sorted = selected_features[sorted_args]
selected_features_sorted[0:10]
We would like to plot the RCOs for 10 genes that led to the highest decrease
and 10 genes that led to the highest increase in RCO by one unit increase in their
expressions.
RCO=(ORs_sorted-1.0)*100
selected_RCOs=np.concatenate((RCO[:10], RCO[-10:]))
features_selected_RCOs = np.concatenate((selected_features_sorted[:10],
,→selected_features_sorted[-10:]))
features_selected_RCOs
plt.figure(figsize=(12,6))
plt.bar(features_selected_RCOs, selected_RCOs, color='black')
plt.xlabel('genes')
plt.ylabel('RCO %')
plt.xticks(rotation=90, fontsize=12)
plt.yticks(fontsize=12)
Fig. 6.4: RCO vs. genes. The figure shows 10 genes that led to the highest decrease
and 10 genes that led to the highest increase in RCOs by one unit increase in their
expressions.
Here we can see that a one unit increase in the expression of DGCR6 leads to an
RCO of -25.8%. Does it mean a one unit decrease in its expression leads to an RCO
of 25.8%? (see Exercise 8).
Linear models for regression are models in which the estimate of the response variable is a linear function of the parameters (Bishop, 2006, p.138), (Hastie et al., 2001, p.44); that is,

f(x) = a^T x + b.   (6.54)
where

RSS(β) = Σ_{j=1}^{n} (y_j − f(x_j))^2 = Σ_{j=1}^{n} (y_j − β^T x̃_j)^2,   (6.56)
and where x̃ j is an augmented feature vector defined as x̃ j = [1, xTj ]T , ∀ j; that is,
we add a 1 in the first position and other elements are identical to x j . We can write
(6.56) in a matrix form as

RSS(β) = ||y − Xβ||_2^2 = (y − Xβ)^T (y − Xβ),   (6.57)

where X = [x̃_1, x̃_2, . . . , x̃_n]^T and y = [y_1, y_2, . . . , y_n]^T. Setting the derivative of RSS(β) with respect to β to zero yields

∂RSS(β)/∂β = −2X^T y + 2X^T Xβ = 0,   (6.58)
where we used the fact that for two vectors w and β and a symmetric matrix W,

∂(w^T β)/∂β = ∂(β^T w)/∂β = w,   (6.59)

∂(β^T Wβ)/∂β = 2Wβ.   (6.60)
where I_{p+1} is the identity matrix of dimension p + 1. The lasso solution is obtained as

β̂_lasso = argmin_β { RSS(β) + α ||a||_1 },   (6.64)

and the elastic-net solution as

β̂_elastic-net = argmin_β { RSS(β) + α ν||a||_1 + α ((1 − ν)/2) ||a||_2^2 },   (6.65)
where both α and ν are tuning parameters that create a compromise between no
regularization, and l1 and l2 regularizations. In particular when α = 0, both (6.64)
and (6.65) reduce to (6.61), which means both the lasso and the elastic-net reduce to
the ordinary least squares. At the same time, for ν = 1, (6.65) becomes (6.64); that
is, the solution of the elastic-net becomes that of the lasso. In contrast with (6.45)-(6.47), where a larger C was an indicator of less regularization, here α is chosen such that a larger α implies stronger regularization. This choice of notation is made to make the
equations compatible with scikit-learn implementation of these methods. As there is
no closed-form solution for lasso and elastic-net, iterative optimization methods are
used to obtain the solution of (6.64) and (6.65). Last but not least, once we obtain an
estimate of β from either of (6.61), (6.63), (6.64), or (6.65), we replace the estimate
of parameters in (6.54) to estimate the target for a given x.
Example 6.3 Suppose we have the following training data where for each value of
predictor x, we have the corresponding y right below that. We would like to estimate
the coefficient of a simple linear regression using ordinary least squares to have
ŷ = β̂0 + β̂1 x.
x −1 2 −1 0 0
y 0 2 2 0 −1
a) We need to find the solution from (6.61). For that purpose, first we need to
construct matrix X where as explained X = [x̃1, x̃2, . . . , x̃n ]T and where x̃ is the
augmented feature vector. Therefore, we have
X =
[ 1  −1 ]
[ 1   2 ]
[ 1  −1 ] ,    y = [0, 2, 2, 0, −1]^T.   (6.66)
[ 1   0 ]
[ 1   0 ]
β̂_ols = (X^T X)^{−1} X^T y =
[ 5  0 ]^{−1} [ 3 ]     [ 3/5 ]
[ 0  6 ]      [ 2 ]  =  [ 1/3 ] ,   (6.67)

where X^T X and X^T y were computed from (6.66). Therefore, β̂_0 = 3/5 and β̂_1 = 1/3.
b) From Section 5.2.2, in which R̂2 was defined, we first need to find RSS and TSS:
RSS = Σ_{i=1}^{m} (y_i − ŷ_i)^2,   (6.68)

TSS = Σ_{i=1}^{m} (y_i − ȳ)^2,   (6.69)
where ŷi is the estimate of yi for xi obtained from our model by ŷi = β̂0 + β̂1 xi , and
ȳ is the average of responses. From part (a),
ŷ_1 = 3/5 − 1/3 = 4/15,   ŷ_2 = 3/5 + 2/3 = 19/15,   ŷ_3 = ŷ_1,   ŷ_4 = 3/5,   ŷ_5 = ŷ_4.
Therefore,

RSS = (0 − 4/15)^2 + (2 − 19/15)^2 + (2 − 4/15)^2 + (0 − 3/5)^2 + (−1 − 3/5)^2 ≈ 6.53,

TSS = (0 − 3/5)^2 + (2 − 3/5)^2 + (2 − 3/5)^2 + (0 − 3/5)^2 + (−1 − 3/5)^2 = 36/5 = 7.2.

Thus,

R̂^2 = 1 − RSS/TSS = 1 − 6.53/7.2 = 0.092,

which is a relatively low R̂^2.
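These hand computations can be checked with a short numpy sketch (not part of the book's code):

import numpy as np

x = np.array([-1.0, 2.0, -1.0, 0.0, 0.0])
y = np.array([0.0, 2.0, 2.0, 0.0, -1.0])
X = np.column_stack((np.ones_like(x), x))     # augmented design matrix as in (6.66)

beta = np.linalg.solve(X.T @ X, X.T @ y)      # ordinary least squares via the normal equation
y_hat = X @ beta
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(beta, 1 - rss / tss)                    # about [0.6, 0.333] and 0.09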
Scikit-learn implementation: The ordinary least squares, ridge, lasso, and elastic-net solutions of linear regression are implemented in the LinearRegression, Ridge, Lasso, and ElasticNet classes of the sklearn.linear_model module, respectively. However, similar to the lsqr solver that we discussed in Section 6.2.1 for LinearDiscriminantAnalysis, LinearRegression relies on scipy.linalg.lstsq rather than explicitly solving the normal equation shown in (6.61); as mentioned before, lstsq is by default based on the DGELSD routine from LAPACK and finds the minimum-norm solution to the problem Xβ = y, that is to say, it minimizes ||y − Xβ||_2^2 in (6.57).
in (6.57). As for the ridge regression implemented via Ridge class, there are cur-
rently several solvers with the current default being auto, which determines the
solver automatically. Some of these solvers are implemented similar to our discus-
sion in Section 6.2.1. For example, svd solver computes the SVD decomposition
of X using scipy.linalg.svd, which is itself based on DGESDD routine from
LAPACK. Those dimensions with a singular value less than 10^{−15} are discarded.
Once the SVD decomposition is calculated, the ridge solution given in (6.63) is
computed (using _solve_svd() function). Here we explain the rationale. Suppose
X is an arbitrary m × n real matrix. We can decompose X as X = UDVT where
UUT = UT U = Im , VVT = VT V = In , and D is a rectangular m × n diagonal
matrix with min{m, n} non-negative real numbers known as singular values — this
is known as the SVD decomposition of X. Then we can write (6.63) as follows:

β̂_ridge = V (D^2 + αI_{p+1})^{−1} D U^T y,

where, with a slight abuse of notation, D^2 and the trailing D stand for D^T D and D^T, respectively. Note that because all matrices in (D^2 + αI_{p+1})^{−1} D are diagonal, this term is also
a diagonal matrix where each diagonal element in each dimension can be computed
easily by first adding α to the square of the singular value for that dimension,
inverting, and then multiplying by the same singular value. The cholesky choice for the ridge solver relies on scipy.linalg.solve, which is based on DGETRF and DGETRS from LAPACK to compute the LU factorization and then find the solution based on that factorization. As for Lasso and ElasticNet, scikit-learn currently uses a coordinate descent solver.
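The following sketch (not from the book; randomly generated data, no intercept) checks that the SVD-based expression above matches the closed-form ridge solution and scikit-learn's Ridge:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
alpha = 2.0

# closed-form ridge solution (no intercept for simplicity)
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# the same solution computed from the SVD of X
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ (d / (d**2 + alpha) * (U.T @ y))

ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(beta_closed, beta_svd), np.allclose(beta_svd, ridge.coef_))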
——————————————————————————————-
Exercises:
Exercise 1:
and µ̂_1 = [0, −1]^T, and the pooled sample covariance matrix is

Σ̂ =
[ 1  0 ]
[ 0  2 ] .

We
would like to use these estimates to construct an LDA classifier. Determine the range
of P(Y = 1) such that an observation x = [1.5, 0]T is classified to class 1 using the
constructed LDA classifier.
Classify an observation x = [1, 1]T according to the Bayes rule (assuming classes
are equiprobable).
Exercise 4: Prove that the optimal classifier ψ_B(x) (also known as the Bayes classifier) is obtained as
ψ_B(x) = 1  if P(Y = 1|x) > P(Y = 0|x),
         0  otherwise.   (6.71)
[Figure: class-conditional densities p(x|Y = i) for i = 1, 2, 3, plotted over 0 ≤ x ≤ 4 with marked points at x = 1, 1.5, 2, 3, and 3.5.]
A) Class 1
B) Class 2
C) Class 3
Exercise 7: Show that the Bayes plug-in rule for Gaussian class-conditional densi-
ties with different covariance matrices has the following form, which is known as
quadratic discriminant analysis (QDA):
ψ(x) = 1  if (1/2) x^T A_1 x + A_2 x + a > 0,
       0  otherwise,   (6.72)
where

A_1 = Σ̂_0^{−1} − Σ̂_1^{−1},   (6.73)

A_2 = µ̂_1^T Σ̂_1^{−1} − µ̂_0^T Σ̂_0^{−1},   (6.74)

a = (1/2) µ̂_0^T Σ̂_0^{−1} µ̂_0 − (1/2) µ̂_1^T Σ̂_1^{−1} µ̂_1 − (1/2) log( |Σ̂_1| / |Σ̂_0| ) + log( P(Y = 1)/(1 − P(Y = 1)) ),   (6.75)

and where

Σ̂_i = 1/(n_i − 1) Σ_{j: y_j = i} (x_j − µ̂_i)(x_j − µ̂_i)^T.   (6.76)
Exercise 8: For a binary logistic regression classifier, a one unit increase in the value
of a numeric feature leads to an RCO of α.
a) What will be the RCO for one unit decrease in the value of this numeric feature?
b) Suppose α = 0.2 (i.e., 20%). What is the RCO for one unit decrease in the value
of the feature?
Exercise 9: Suppose we have two data matrices Xtr ain and X1test . We train a multiple
linear regression model using Xtr ain and in order to assess that model using RSS
metric, we apply it on X1test . This leads to a value of RSS denoted as RSS1 . Then we
augment X1test by 1 observation and build a larger test dataset of 101 observations
denoted as X2test (that is to say, 100 observations in X2test are the same observations
in X1test ). Applying the previously trained model on X2test leads to a value of RSS
denoted as RSS2 . Which of the following is true? Explain why?
A) Depending on the actual observations, both RSS1 ≥ RSS2 and RSS2 ≥ RSS1
are possible
B) RSS1 ≥ RSS2
C) RSS2 ≥ RSS1
Exercise 10: Suppose in a binary classification problem with equal prior probability
of classes, the class conditional densities are multivariate normal distributions with
a common covariance matrix: N (µ 0, Σ) and N (µ 1, Σ). Prove that the Bayes error εB
is obtained as

ε_B = Φ(−∆/2),   (6.77)

where Φ(·) denotes the cumulative distribution function of a standard normal random variable, and ∆ (known as the Mahalanobis distance between the two distributions) is given by

∆ = sqrt( (µ_0 − µ_1)^T Σ^{−1} (µ_0 − µ_1) ).   (6.78)
Exercise 11: Suppose in a binary classification problem with equiprobable classes
(class 0 vs. class 1), the class-conditional densities are 100-dimensional normal
Exercise 12: Suppose we would like to use scikit-learn to solve a multiple linear
regression problem using l1 regularization. Which of the following is a possible
option to use?
A) “sklearn.linear_model.LogisticRegression” class with default choices of param-
eters
𝑥 0 2 1 -5 3 -1
𝑦 4 4 2 0 0 2
Suppose we have an electrical device and at each time point t the voltage at a specific part of this device can only vary in the interval [1, 5]. If the device is faulty, this voltage, denoted V(t), varies between [2, 5], and if it is non-faulty, the voltage can be any value between [1, 3]. In the absence of any further knowledge about the problem, it is reasonable to assume V(t) is uniformly distributed in these intervals (i.e., it has a uniform distribution when the device is faulty and when it is non-faulty). Based on these assumptions, we would like to design a classifier based on V(t) that can “optimally” (insofar as the assumptions hold true) classify the status of this device at each point in time. What is the optimal classifier at each point in time? What is the error of the optimal classifier at each time point?
(i.e., i = 96, 97, . . . , 100) in these Gaussian distributions and constructs the Bayes classifier. We denote the Bayes error of this 5-dimensional classification problem by ε*_5. Which one shows the relationship between ε*_5 and ε*_30?
Decision trees are nonlinear graphical models that have found important applications
in machine learning mainly due to their interpretability as well as their roles in other
powerful models such as random forests and gradient boosting regression trees
that we will see in the next chapter. Decision trees resemble the principles of “20-
questions” game in which a player chooses an object and the other player asks a series
of yes-no questions to be able to guess the object. In binary decision trees, the set
of arbitrary “yes-no” questions in the game translate to a series of standard yes-no
(thereby binary) questions that we identify from data and the goal is to estimate
the value of the target variable for a given feature vector x. As opposed to arbitrary
yes-no questions in the game, the questions in decision rules are standardized to be
able to devise a learning algorithm. Unlike linear models that result in linear decision
boundaries, decision trees partition the feature space into hyper-rectangular decision
regions and, therefore, they are nonlinear models. In this chapter we describe the
principles behind training a popular type of decision trees known as CART (short
for Classification and Regression Tree). We will discuss development of CART for
both classification and regression.
In a decision tree, leaf nodes contain the decision made for x (i.e., the estimate of the target). Once a decision tree is trained, it is common to visualize its graph to better understand the relationship between the questions.
In this section, we use a hypothetical example with which we introduce a number
of practical concepts used in training decision trees. These concepts are: 1) splitting
node; 2) impurity; 3) splitter; 4) best split; 5) maximum depth; and 6) maximum
leaf nodes. In Section 5.2.2, we saw an application in which we aimed to solve a
regression problem to estimate the median house price of a neighborhood (MEDV)
based on a set of features. Suppose we wish to solve this example but this time we
entirely rely on our past experiences to design a simple binary decision tree that can
help classify MEDV to “High” or “Low” (relative to our purchasing power) based
on three variables: average number of rooms per household (RM), median house age
in block group (AGE), and median income in the district (INCM).
Suppose from our past experience (think of this as past data, or “training” data),
we believe that RM is an important factor in the sense that in many cases (but not
all), if RM is four or more, MEDV is “High”, and otherwise “Low”. Therefore, we
consider RM ≤ 4 as the first splitting node in our decision tree (see Fig. 7.1a). Here,
we refer to this as a splitting node because the answer to the question “is RM ≤ 4?”
splits our training data (past experiences) into two subsets: one subset contains all
neighborhoods with RM ≤ 4 and the other subset contains all neighborhoods with
RM > 4. This splitting combined with our past experiences also means that at
this stage we can even make a decision based on “RM ≤ 4?” because we believe
neighborhoods with RM ≤ 4 generally have low MEDV and neighborhoods with
RM > 4 generally have high MEDV.
There are two important concepts hidden in the aforementioned statements: 1)
impurity; and 2) splitter. Notice that we said RM ≤ 4 is important for “many cases
(but not all)”. In other words, if we decide to make a decision at this stage based on
RM ≤ 4, this decision is not perfect. Even though “RM ≤ 4?” splits the data into two subsets, we cannot claim that all neighborhoods in each subset are entirely from one class. In other words, the split achieved by RM ≤ 4 is impure with respect to
class labels. Nevertheless, the reason we chose this specific split among all possible
splits was that we believe it can minimize the impurity of data subsets; that is to
say, one subset contains a large number of neighborhoods with high MEDV and
the other subset contains a large number of neighborhoods with low MEDV. This
strategy was indeed a splitting strategy, also known as splitter; that is to say, our
“splitter” was to identify the best split in the sense of minimizing the impurity of
data subsets created as the result of yes/no answer to a standardized question of
“a variable ≤ a threshold?” This means to find the best split, we need to identify
the combination of variable-threshold that minimizes the impurity of data subsets
created as the result of the split.
Now we need to decide whether we should stop further splitting or not. Suppose
we believe that we can still do better in terms of cutting up the decision space and,
for that reason, we refer back to our training data. In our experience, new houses
with AGE ≤ 3 have high price even if RM ≤ 4; otherwise, they have low price.
At the same time, for RM > 4, the price would potentially depend on INCM. In
particular, if these houses are in neighborhoods with a relatively low INCM, the price
is still low. This leads to a decision tree grown as much as shown in Fig. 7.1b with
two additional splitting nodes AGE ≤ 3 and INCM ≤ 5 and 3 leaf nodes at which
a decision is made about MEDV. Next based on our past experiences, we believe
that if RM > 4 & INCM > 5, the price depends on AGE (Fig. 7.1c). In particular, for RM > 4 & INCM > 5 & AGE ≤ 20, MEDV is high, and for RM > 4 & INCM > 5 & AGE > 20, MEDV is low. At this point we stop splitting because either we think
terminal nodes are pure enough or further splitting makes the tree too complex. Fig.
7.1d shows the final decision tree.
Fig. 7.1: Development of a mental model for house price classification. Here the
decision tree is grown based on our past experiences.
A question that is raised is, why don’t we want the tree to become too complex?
Suppose we grow a tree to the extent that all leaf nodes become pure; that is, they
contain training data from one class. This is generally doable unless all observations
in a node have the same values but from different classes in which case no further
splitting is possible and the node remains impure. Is it a good practice to continue
growing the tree until all nodes are pure? Not necessarily because it is very likely
that the tree overfits the training data; that is, the tree closely follows the training
data but does not perform well on novel observations that are not part of training. In
other words, the decision tree does not generalize well.
A follow-up question is then what are some properties using which we can control
the structural complexity of a tree? One metric to control the structural complexity
of a binary tree is the maximum number of links between the root node and any of
the leaf nodes. This is known as the maximum depth — the depth of a node is the
number of links between the node and the root node. For example, the maximum
depth of the decision tree in Fig. 7.1d is 3. If we preset the maximum depth of a
binary decision tree, it limits both the number of leaf nodes and the total number
of nodes. Suppose before growing a binary decision tree, we set 3 as the maximum
depth. What is the maximum number of leaf nodes and the maximum number of
nodes (including the root and other splitting nodes)? The maximum happens if all
leaf nodes have depth of 3. Therefore, we have a maximum of 8 leaf nodes and at
most 1 (root) + 2 + 4 + 8 (leaf) = 15 nodes. In general, for a binary decision tree of depth m, we have at most 2^m leaf nodes and 2^{m+1} − 1 nodes. Alternatively, rather than
limiting the maximum depth, one may restrict the maximum number of leaf nodes.
That will also restrict the maximum depth and the maximum number of nodes.
Fig. 7.2: Partitioning the feature space into five boxes using the decision tree shown
in Fig. 7.1d. The decision regions for the “high”- and “low”-MEDV classes denoted
by Rhigh and Rlow are shown in gray and blue, respectively. There are two gray
boxes and three blue boxes corresponding to the labels given by the leaf nodes in
Fig. 7.1d.
As we said before, decision trees partition the feature space into hyper-rectangles
(boxes). Each box corresponds to a decision rule obtained by traversing from the
root node to the leaf that includes the label for any observation falling into that
rectangular region. Fig. 7.2 shows partitioning of the feature space into five boxes
using the decision tree shown in Fig. 7.1d. Although in the figure these boxes appear
to have a finite size, each box has an infinite length over at least one of the RM, AGE,
or INCM axes because, in theory at least, there is no specific upper limit for these
variables. As an example, let us consider the gray box in the bottom left corner. This box corresponds to the gray label assigned by traversing from the root to the leftmost leaf in Fig. 7.1d. This traverse corresponds to the following decision rule: assign a
“high” label when RM ≤ 4 AND AGE ≤ 3.
Although in this example we introduced several important concepts used in the training and functioning of decision trees, we did not really present any specific algorithm using which a tree is trained from the data at hand. There are multiple algorithms
for training decision trees, for instance, CHAID (Kass, 1980), ID3 (Quinlan, 1986),
C4.5 (Quinlan, 1993), CRUISE (Kim and Loh, 2001), and QUEST (Loh and Shih,
1997), to just name a few. Nevertheless, the binary recursive partitioning mechanism
explained here imitates closely the principles behind training a popular type of deci-
sion trees known as CART (short for Classification and Regression Trees) (Breiman
et al., 1984). In the next section, we will discuss this algorithm for training and using
decision trees in both classification and regression.
7.2.1 Splits
For a numeric feature x j , CART splits data according to the outcome of questions
of the form: “x j ≤ θ?”. For a nominal feature (i.e., categorical with no order) they
have the form “x j ∈ A” where A is a subset of all possible values taken by x j .
Regardless of testing a numeric or nominal feature, these questions check whether
the value of x j is in a subset of all possible values that it can take. Hereafter, and for
simplicity we only consider numeric features. As the split of data is determined by
the j th coordinate (feature) in the feature space and a threshold θ, for convenience
we refer to a split as ξ = ( j, θ) ∈ Ξ where Ξ denotes the set of all possible ξ that
create a unique split of data. It is important to notice that the size of Ξ is not infinite.
As an example, suppose in a training data, feature x j takes the following values:
1.1, 3.8, 4.5, 6.9, 9.2. In such a setting, for example, ξ = ( j, 4.0) and ξ = ( j, 4.4),
create the same split of the data over this feature: one subset being {1.1, 3.8} and
the other being {4.5, 6.9, 9.2}. As a matter of fact we obtain the same split for any
ξ = ( j, θ) where θ ∈ (3.8, 4.5). Therefore, it would be sufficient to only examine one
of these splits. In CART, this arbitrary threshold between two consecutive values is
taken as their halfway point. Assuming feature x_j, j = 1, . . . , p, has v_j distinct values in the training data, the cardinality of Ξ is (Σ_{j=1}^{p} v_j) − p. We denote the set of candidate values of θ (halfway points between consecutive values) for feature x_j by Θ_j.
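A small sketch (not from the book) of how the candidate thresholds Θ_j can be computed for a numeric feature:

import numpy as np

x_j = np.array([1.1, 3.8, 4.5, 6.9, 9.2, 3.8])   # feature values in a training set
v = np.unique(x_j)                               # distinct values, sorted
thetas = (v[:-1] + v[1:]) / 2                    # halfway points between consecutive values
print(thetas)                                    # [2.45, 4.15, 5.7, 8.05]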
The splitting strategy in CART is similar to what we saw in Section 7.1; that is, it
attempts to find the best split ξ that minimizes the weighted cumulative impurity
of data subsets created as the result of the split. Here we aim to mathematically
characterize this statement.
Let Str = {(x1, y1 ), (x2, y2 ), . . . , (xn, yn )} denote the training data. Assume the
tree has already m − 1 nodes and we are looking for the best split at the mth node.
Let Sm ⊂ Str denote the training data available for the split at node m. To identify
the best split, we can conduct an exhaustive search over Ξ and identify the split that
minimizes a measure of impurity. Recall that each ξ ∈ Ξ corresponds to a question
“x_j ≤ θ?” where θ ∈ Θ_j and, depending on the yes/no answer to this question for each observation in S_m, S_m is then split into S^Yes_{m,ξ} and S^No_{m,ξ}; that is to say,

S^No_{m,ξ} = S_m − S^Yes_{m,ξ}.   (7.1)
When visualizing the structure of a decision tree, by convention the child node containing S^Yes_{m,ξ} is on the left branch of the parent node. Let i(S) denote a specific measure of impurity of data S presented to a node (also referred to as the impurity of the node itself) and let n_S denote the number of data points in S. A common heuristic used to designate a split ξ as the best split at node m is to maximize the difference between the impurity of S_m and the weighted cumulative impurity of S^Yes_{m,ξ} and S^No_{m,ξ} (the weighted cumulative impurity of the two child nodes). With the assumption that any split produces data subsets that are at most as impure as S_m, we can think of this difference as the drop in impurity. This impurity drop for a split ξ at node m is denoted ∆_{m,ξ} and is defined as

∆_{m,ξ} = i(S_m) − (n_{S^Yes_{m,ξ}}/n_{S_m}) i(S^Yes_{m,ξ}) − (n_{S^No_{m,ξ}}/n_{S_m}) i(S^No_{m,ξ}).   (7.2)
The best split at node m is then

ξ*_m = argmax_{ξ ∈ Ξ} ∆_{m,ξ}.   (7.3)
In (7.2), the multiplication factors n_{S^Yes_{m,ξ}}/n_{S_m} and n_{S^No_{m,ξ}}/n_{S_m} are used to deduct a weighted average of the child node impurities from the parent impurity i(S_m). The main technical reason for having such a weighted average in (7.2) is that with a concave choice of impurity measure, ∆_{m,ξ} becomes nonnegative, and it then naturally makes sense to refer to ∆_{m,ξ} as the impurity drop. The proof of this argument is left as an exercise (Exercise 6). Nonetheless, a non-technical way to describe the effect of the weighted average
used here, as compared with an unweighted average, is as follows: suppose that a split ξ causes n_{S^Yes_{m,ξ}} << n_{S^No_{m,ξ}}, say n_{S^Yes_{m,ξ}} = 10 and n_{S^No_{m,ξ}} = 1000, but the distribution of data in the child nodes is such that i(S^Yes_{m,ξ}) >> i(S^No_{m,ξ}). Using an unweighted average of the child node impurities would let i(S^Yes_{m,ξ}) dictate our decision in choosing the optimal ξ; however, i(S^Yes_{m,ξ}) itself is not reliable as it has been estimated on very few observations (here 10) compared with the many observations used to estimate i(S^No_{m,ξ}) (here 1000). The use of the weighted average, however, can be thought of as a way to introduce the reliability of these estimates into the average.
Once ξ*_m is found, we split the data and continue the process until a split-stopping criterion is met. Common stopping criteria include:
1) a predetermined maximum depth. This has a type of global effect; that is to say,
when we reach this preset maximum depth we stop growing the tree entirely;
2) a predetermined maximum number of leaf nodes. This also has a global effect;
3) a predetermined minimum number of samples per leaf. This has a type of local
effect on growing the tree. That is to say, if a split causes the child nodes having
less data points than the specified minimum number, that node will not split any
further but splitting could still be happening in other nodes; and
4) a predetermined minimum impurity drop. This also has a local effect. In particular, if the impurity drop due to a split at a node is below this threshold, the node will not split.
As we see, all these criteria depend on a predetermined parameter. Let us consider, for example, the minimum number of samples per leaf. Setting that to 1 (and not setting any other stopping criteria) allows the tree to fully grow and probably causes overfitting (i.e., good performance on the training set but bad performance on the test set). On the other hand, setting that to a large number could stop growing the tree prematurely, in which case the tree may not learn the discriminatory patterns in data. Such a situation is referred to as underfitting and leads to poor performance on both training and test sets. As a result, each of the aforementioned parameters that controls the tree complexity should generally be treated as a hyperparameter to be tuned.
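For reference, in scikit-learn's DecisionTreeClassifier the four stopping criteria listed above correspond to the parameters shown in the following sketch (the numeric values are arbitrary placeholders, not recommendations):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                  # 1) maximum depth
    max_leaf_nodes=20,            # 2) maximum number of leaf nodes
    min_samples_leaf=3,           # 3) minimum number of samples per leaf
    min_impurity_decrease=0.01,   # 4) minimum impurity drop required to split
)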
The classification in a decision tree occurs in the leaf nodes. In this regard, once the
splitting process is stopped for all nodes in the tree, each leaf node is labeled as the
majority class among all training observations in that node. Then to classify a test
observation x, we start from the root node and based on answers to the sequence of
binary questions (splits) in the tree, we identify the leaf into which x falls; the label
of that leaf is then assigned to x.
So far we did not discuss possible choices of i(S). In defining an impurity measure, the
first thing to notice is that in a multiclass classification with c classes 0, 1, . . . , c − 1,
the impurity of a sample S that falls into a node is a function of the proportion of
class k observations in S—we denote this proportion by pk , k = 0, . . . , c − 1. Then,
the choice of i(S) in classification could be any nonnegative function such that the
function is:
1. maximum when data from all classes are equally mixed in a node; that is, when p_0 = p_1 = ... = p_{c−1} = 1/c, i(S) is maximum;
2. minimum when the node contains data from one class only; that is, for c-tuples
of the form (p0, p1, . . . , pc−1 ), i(S) is minimum for (1, 0, . . . , 0), (0, 1, 0, . . . , 0),
. . . , (0, . . . , 0, 1); and
3. a symmetric function of p0 , p1 , . . . , pc−1 ; that is, i(S) does not depend on the
order of p0 , p1 , . . . , pc−1 .
There are some common measures of impurity that satisfy these requirements.
Assuming a multiclass classification with classes 0, 1, . . . , c − 1, common choices of
i(S) include Gini, entropy, and misclassification error (if we classify all observations
in S to the majority class) given by:
for Gini: i(S) = Σ_{k=0}^{c−1} p_k (1 − p_k),   (7.4)

for entropy: i(S) = −Σ_{k=0}^{c−1} p_k log(p_k),   (7.5)

for error rate: i(S) = 1 − max_{k=0,...,c−1} p_k,   (7.6)
where p_k = n_k / Σ_{k=0}^{c−1} n_k, in which n_k denotes the number of observations from class k in S. Fig. 7.3 shows the concave nature of these impurity measures for binary classification. Although in (7.5) we use the natural logarithm, sometimes entropy is written in terms of log_2. Nevertheless, changing the base does not change the best split because, based on log_2 b = log b / log 2, using log_2 is equivalent to multiplying the natural log by the constant 1/log 2 for all splits.
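A small sketch (not from the book) that computes the three impurity measures from the class counts of a node:

import numpy as np

def impurities(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                                # class proportions p_k
    gini = np.sum(p * (1 - p))                     # (7.4)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0])) # (7.5), natural log
    error = 1 - p.max()                            # (7.6)
    return gini, entropy, error

print(impurities([4, 4]))   # maximally impure binary node: (0.5, 0.693, 0.5)
print(impurities([8, 0]))   # pure node: (0.0, 0.0, 0.0)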
Example 7.1 In a binary classification problem, we have the training data shown in Table 7.1 for two available features, where x_j denotes a feature variable, j = 1, 2. We wish to train a decision stump (i.e., a decision tree with only one split).
a) identify candidate splits;
b) for each identified candidate split, calculate the impurity drop using entropy
metric (use natural log); and
Fig. 7.3: Three common concave measures of impurity as a function of p0 for binary
classification.
c) based on the results of part (b), determine the best split (or splits if there is a
tie).
Table 7.1: Training data for Example 7.1.

class  0  0  0  0  1  1  1  1
x_1    1  1  3  3  1  1  1  3
x_2    2  4  4  0  2  0  4  2
a) x1 takes two values 1 and 3. Taking their halfway point, the candidate split based
on x1 is x1 ≤ 2.
x2 takes three values 0, 2, and 4. Taking their halfway points, the candidate splits
based on x2 are x2 ≤ 1 and x2 ≤ 3.
b) for each candidate split identified in part (a), we need to quantify ∆_{1,ξ}. From (7.2), we have

∆_{1,ξ} = i(S_1) − [ (n_{S^Yes_{1,ξ}}/n_{S_1}) i(S^Yes_{1,ξ}) + (n_{S^No_{1,ξ}}/n_{S_1}) i(S^No_{1,ξ}) ].
Here S_1 is the entire data in the first node (root node) because no split has occurred yet. We have

i(S_1) = −Σ_{k=0}^{1} p_k log(p_k) = −(4/8) log(4/8) − (4/8) log(4/8) = log(2) = 0.693.
For ξ = (1, 2) (recall that based on the definition of ξ, this means the split x_1 ≤ 2): based on this split, S^Yes_{1,ξ} contains five observations (two from class 0 and three from class 1), and S^No_{1,ξ} contains three observations (two from class 0 and one from class 1). Therefore,

i(S^Yes_{1,ξ}) = −(2/5) log(2/5) − (3/5) log(3/5) = 0.673,

i(S^No_{1,ξ}) = −(2/3) log(2/3) − (1/3) log(1/3) = 0.636.

Therefore,

∆_{1,ξ=(1,2)} = 0.693 − [ (5/8) 0.673 + (3/8) 0.636 ] = 0.033.
For ξ = (2, 1), which means the split x_2 ≤ 1: S^Yes_{1,ξ} contains two observations (one from class 0 and one from class 1), and S^No_{1,ξ} contains six observations (three from class 0 and three from class 1). Therefore,

i(S^Yes_{1,ξ}) = −(1/2) log(1/2) − (1/2) log(1/2) = 0.693,

i(S^No_{1,ξ}) = −(3/6) log(3/6) − (3/6) log(3/6) = 0.693.

Therefore,

∆_{1,ξ=(2,1)} = 0.693 − [ (2/8) 0.693 + (6/8) 0.693 ] = 0.   (7.7)
For ξ = (2, 3): based on this split, S^Yes_{1,ξ} contains five observations (two from class 0 and three from class 1), and S^No_{1,ξ} contains three observations (two from class 0 and one from class 1). This is similar to ξ = (1, 2). Therefore, ∆_{1,ξ=(2,3)} = 0.033.
c) ξ = (1, 2) and ξ = (2, 3) are equally good in terms of the entropy metric because the drop in impurity for both splits is the same and is larger than for the other split (recall that we look for the split that maximizes the impurity drop).
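The computations in this example can be verified with a few lines of code (not part of the book's solution):

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def drop(parent, yes, no):
    n, n_yes, n_no = sum(parent), sum(yes), sum(no)
    return entropy(parent) - (n_yes / n) * entropy(yes) - (n_no / n) * entropy(no)

parent = [4, 4]                          # class counts (class 0, class 1) at the root
print(drop(parent, [2, 3], [2, 1]))      # x1 <= 2: about 0.033
print(drop(parent, [1, 1], [3, 3]))      # x2 <= 1: 0
print(drop(parent, [2, 3], [2, 1]))      # x2 <= 3: about 0.033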
A useful property of decision trees is their ability to easily handle weighted samples. To do so, we can replace the number of samples in a set used in (7.2), and in any of the metrics presented in (7.4)-(7.6), by the sum of the weights of samples belonging to that set. For example, rather than n_{S^Yes_{m,ξ}} and n_{S_m}, we use the sum of the weights of samples in S^Yes_{m,ξ} and S_m, respectively. Therefore, rather than using p_k = n_k / Σ_{k=0}^{c−1} n_k as in (7.4)-(7.6), we use the ratio of the sum of the weights of samples in class k to the sum of the weights of all observations.
Our discussion in Sections 7.2.1 and 7.2.2 did not depend on the class labels of observations and, therefore, it was not specific to classification. In other words, the same discussion is applicable to regression. The only differences in developing CART for regression problems appear in: 1) the choice of impurity measures; and 2) how we estimate the target value at a leaf node.
To better understand the concept of impurity and its measures in regression, let us
re-examine the impurity for classification. One way we can think about the utility
of having a more pure set of child nodes obtained by each split is to have more
certainty about the class labels for the training data falling into a node if that node is
considered as a leaf node (in which case using the trivial majority vote classifier, we
classify all training data in that node as the majority class). In regression, once we
designate a node as a leaf node, a trivial estimator that we can use as the estimate
of the target value for any test observation falling into a leaf node is the mean of
target values of training data falling into that node. Therefore, a more “pure” node
(to be more certain about the target) would be a node where the target values of the training data falling into that node are closer to their mean. Choosing the mean as our trivial estimator of the target at a leaf node, we may then take various measures such as the mean squared error (MSE) or the mean absolute error (MAE) to measure how far, on average, target values in a leaf node are from their mean. Therefore, one can think of a node with a larger magnitude of these measures as a more impure node. However, when the mean is used as our target estimator, there is a subtle difference between using the MSE and the MAE as our impurity measures. This is because the mean minimizes the MSE, not the MAE. In particular, if we have a1, a2, . . . , an, then among all constant estimators a, the mean ā minimizes the MSE; that is,
ā = argmin_a Σ_{i=1}^{n} (a_i − a)² ,    (7.8)

where

ā = (1/n) Σ_{i=1}^{n} a_i .    (7.9)

Analogously, the median minimizes the MAE. With these estimators, the impurity of a set S of training data in a node is measured as

for MSE:  i(S) = (1/n_S) Σ_{y∈S} (y − ȳ)² ,    (7.10)

for MAE:  i(S) = (1/n_S) Σ_{y∈S} |y − ỹ| ,    (7.11)

where

ȳ = mean(y ∈ S) = (1/n_S) Σ_{y∈S} y ,    (7.12)
ỹ = median(y ∈ S) .    (7.13)
It is also common to refer to MSE and MAE as “squared error” and “absolute error”,
respectively.
When (7.10) and (7.11) are used as impurity measures, the mean and median of
target values of training data in a leaf node are used, respectively, as the estimate of
targets for any observation falling into that node.
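A minimal sketch (not from the book) of CART regression with the two impurity measures just described. The criterion names 'squared_error' and 'absolute_error' are an assumption about recent scikit-learn versions (older versions used 'mse' and 'mae'), and the data is synthetic.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)   # assumed synthetic data

reg_mse = DecisionTreeRegressor(criterion='squared_error', min_samples_leaf=10).fit(X, y)
reg_mae = DecisionTreeRegressor(criterion='absolute_error', min_samples_leaf=10).fit(X, y)

x_new = np.array([[0.5]])
# the leaf prediction is the mean of the training targets in the leaf for 'squared_error'
# and the median for 'absolute_error'
print(reg_mse.predict(x_new), reg_mae.predict(x_new))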
# imports required by this example (assumed; CART here refers to scikit-learn's
# DecisionTreeClassifier, imported with that alias)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.tree import DecisionTreeClassifier as CART

arrays = np.load('data/iris_train_scaled.npz')
X_train = arrays['X']
y_train = arrays['y']
arrays = np.load('data/iris_test_scaled.npz')
X_test = arrays['X']
y_test = arrays['y']

X_train = X_train[:,[0,1]]
X_test = X_test[:,[0,1]]
print('X shape = {}'.format(X_train.shape) + '\ny shape = {}'.format(y_train.shape))
print('X shape = {}'.format(X_test.shape) + '\ny shape = {}'.format(y_test.shape))

color = ('aquamarine', 'bisque', 'lightgrey')
cmap = ListedColormap(color)
mins = X_train.min(axis=0) - 0.1
maxs = X_train.max(axis=0) + 0.1
x = np.arange(mins[0], maxs[0], 0.01)
y = np.arange(mins[1], maxs[1], 0.01)
X, Y = np.meshgrid(x, y)
coordinates = np.array([X.ravel(), Y.ravel()]).T

fig, axs = plt.subplots(2, 2, figsize=(6, 4), dpi=200)
fig.tight_layout()
min_samples_leaf_val = [1, 2, 5, 10]
for ax, msl in zip(axs.ravel(), min_samples_leaf_val):
    cart = CART(min_samples_leaf=msl)
    cart.fit(X_train, y_train)
    Z = cart.predict(coordinates)
    Z = Z.reshape(X.shape)
    ax.tick_params(axis='both', labelsize=6)
    ax.set_title('CART Decision Regions: min_samples_leaf=' + str(msl), fontsize=7)
    ax.pcolormesh(X, Y, Z, cmap=cmap, shading='nearest')
    ax.contour(X, Y, Z, colors='black', linewidths=0.5)
    ax.plot(X_train[y_train==0, 0], X_train[y_train==0, 1], 'g.', markersize=4)
    ax.plot(X_train[y_train==1, 0], X_train[y_train==1, 1], 'r.', markersize=4)
    ax.plot(X_train[y_train==2, 0], X_train[y_train==2, 1], 'k.', markersize=4)
    ax.set_xlabel('sepal length (normalized)', fontsize=7)
    ax.set_ylabel('sepal width (normalized)', fontsize=7)
    print('The accuracy for min_samples_leaf={} on the training data is {:.3f}'.format(msl, cart.score(X_train, y_train)))
    print('The accuracy for min_samples_leaf={} on the test data is {:.3f}'.format(msl, cart.score(X_test, y_test)))
for ax in axs.ravel():
    ax.label_outer()
X shape = (120, 2)
y shape = (120,)
X shape = (30, 2)
y shape = (30,)
The accuracy for min_samples_leaf=1 on the training data is 0.933
The accuracy for min_samples_leaf=1 on the test data is 0.633
Fig. 7.4: The scatter plot of normalized Iris dataset for two features and decision
regions for CART with minimum samples per leaf being 1, 2, 5, and 10. The green,
red, and black points show points corresponding to setosa, versicolor, and virginica
Iris flowers, respectively. The CART decision regions for each class are colored
similarly.
Fig. 7.5: The trained decision tree for Iris flower classification. The figure shows the “glass-box” structure of decision trees. Interpretability is a key advantage of decision trees. By convention, the child node satisfying the condition in a node is on the left branch.
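A minimal sketch (assuming a fitted classifier named cart, as produced by the earlier code) of how a tree such as the one in Fig. 7.5 can be visualized with scikit-learn's plot_tree.

import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(12, 6), dpi=150)
tree.plot_tree(cart,                              # a fitted DecisionTreeClassifier
               feature_names=['sepal length (cm)', 'sepal width (cm)'],
               class_names=['setosa', 'versicolor', 'virginica'],
               filled=True, fontsize=6)
plt.show()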
Exercises:
Exercise 1: Load the training and test Iris classification dataset that was prepared in Section 4.6. Select the first three features (i.e., columns with indices 0, 1, and 2). Write a program to train CART (for the hyperparameter min_samples_leaf being 1, 2, 5, and 10) and kNN (for the hyperparameter K being 1, 3, 5, 9) on this trivariate dataset and calculate and show their accuracies on the test dataset. Which combination of classifier and hyperparameter shows the highest accuracy on the test dataset?
Fig. 7.6: The scatter plot and decision regions for a case that five observations were
randomly removed from the data used in Example 7.2.
Exercise 2: Which of the following options is the scikit-learn class that implements
the CART for regression?
A) “sklearn.tree.DecisionTreeClassifier”
B) “sklearn.tree.DecisionTreeRegressor”
C) “sklearn.linear_model.DecisionTreeClassifier”
D) “sklearn.linear_model.DecisionTreeRegressor”
E) “sklearn.ensemble.DecisionTreeClassifier”
F) “sklearn.ensemble.DecisionTreeRegressor”
Exercise 3: In the following training data, how many candidate splits need to be examined to find the best split for training a decision stump using CART?

      class |  0   0   0   0   1   1   1
      x1    | −1   0   3   2   3   4  10
      x2    |  7   3   2   0   1  −5  −3
Exercise 4: We would like to train a binary classifier using the CART algorithm based on some 3-dimensional observations. Let x = [x1, x2, x3]^T denote an observation with xi being feature i, for i = 1, 2, 3. In training the CART classifier, suppose a parent node is split based on the x1 feature and results in two child nodes. Is it possible that the best split in one of these child nodes is again based on the x1 feature?
A) Yes
B) No
Exercise 5: In Example 7.2, we concluded that ξ = (1, 2) and ξ = (2, 3) are equally good for training a decision stump based on the entropy measure of impurity. Suppose we choose ξ = (1, 2) as the split used in the root node and we would like to grow the tree further. Determine the best split on the left branch (satisfying the x1 ≤ 2 condition). Assume 0 log(0/0) = 0.
Exercise 6:
As discussed in Section 7.2.4, the impurity measure i(S) is a function of p_k = n_k / Σ_{k=0}^{c−1} n_k , k = 0, . . . , c − 1, where n_k is the number of observations from class k in S. Let φ(p_0, p_1, . . . , p_{c−1}) denote this function. Suppose that this function is concave in the sense that for 0 ≤ α ≤ 1 and any 0 ≤ x_k ≤ 1 and 0 ≤ y_k ≤ 1, we have

φ(αx_0 + (1 − α)y_0, . . . , αx_{c−1} + (1 − α)y_{c−1}) ≥ α φ(x_0, . . . , x_{c−1}) + (1 − α) φ(y_0, . . . , y_{c−1}) .

Show that using this impurity measure in the impurity drop defined as

∆_ξ = i(S) − (n_{S^Yes_ξ}/n_S) i(S^Yes_ξ) − (n_{S^No_ξ}/n_S) i(S^No_ξ) ,

leads to ∆_ξ ≥ 0 for any split ξ.
Exercise 7: Consider the following two tree structures, identified as Tree A and Tree B, in which x1 and x2 are the names of two features:

Tree A:
              x2 ≤ 1
          yes /     \ no
       x1 ≤ 0        x1 ≤ 1
      yes /  \ no   yes /  \ no

Tree B:
              x2 ≤ 1
          yes /     \ no
       x1 ≤ 0        x2 ≤ 2
      yes /  \ no   yes /  \ no

Can Tree A and Tree B be the outcome of applying the CART algorithm to some training data with the two numeric features x1 and x2 collected for a binary classification problem?
A)
              x1 ≤ 2
          yes /     \ no
       x1 ≤ 0        x2 ≤ 1
      yes /  \ no   yes /  \ no

B)
              x2 ≤ 2
          yes /     \ no
       x1 ≤ 0        x2 ≤ 3
      yes /  \ no   yes /  \ no

C)
              x2 ≤ 2
          yes /     \ no
       x1 ≤ 0        x2 ≤ 3
      yes /  \ no   yes /  \ no
Chapter 8
Ensemble Learning

The fundamental idea behind ensemble learning is to create a robust and accurate
predictive model by combining predictions of multiple simpler models, which are
referred to as base models. In this chapter, we describe various types of ensemble
learning such as stacking, bagging, random forests, pasting, and boosting. Stacking
is generally viewed as a method for combining the outcomes of several base models
using another model that is trained based on the outputs of the base models along
with the desired outcome. Bagging involves training and combining predictions of
multiple base models that are trained individually on copies of the original training
data that are created by a specific random resampling known as bootstrap. Random
forest is a specific modification of bagging when applied to decision trees. Similar
to bagging, a number of base models are trained on bootstrap samples and combined, but to create each decision tree another randomization element is introduced in the splitting strategy. This randomization often leads to improvement over bagged trees. In pasting, we randomly pick modest-size subsets of a large training dataset, train a predictive model on each, and aggregate the predictions. In boosting, a sequence of weak models is trained and combined to make predictions. The fundamental idea
behind boosting is that at each iteration in the sequence, a weak model (also known
as weak learner) is trained based on the errors of the existing models and added to
the sequence.
8.1 A General Perspective on the Efficacy of Ensemble Learning

In this section, we will first try to answer the following question: why would combining an ensemble of base models possibly help improve the prediction performance?
Our attempt to answer this question is not surprising if we acknowledge that integrat-
ing random variables in different ways is indeed a common practice in estimation
theory to improve the performance of estimation. Suppose we have a number of
independent and identically distributed random variables Y1 , Y2 , . . . , YN with mean
µ and variance σ 2 ; that is, E[Yi ] = µ and Var[Yi ] = σ 2 , i = 1, . . . , N. Using Y1
Y = f_o(X) + ε ,    (8.1)

where f_o(X) is the best (optimal) approximation of Y that we can achieve by using a class of models and ε denotes the unknown zero-mean error term (because any other mean can be considered as a bias term added to f_o(X) itself). The term ε is known as the irreducible error because no matter how well we do by using the mathematical model and features considered, we can not reduce this error (e.g., because we do not measure some underlying variables of the phenomenon or because our mathematical model is a too simplistic model of the phenomenon).
Nevertheless, we are always working with finite training data and what we end up with instead is an estimate of f_o(X), denoted f(X), which we use as an estimate of Y. How well do we predict? We can look into the mean square error
(MSE). From the relationship presented in (8.1), we can write the MSE of f (X) with
respect to unknown Y as
E[(Y − f(X))²] = E[(f_o(X) + ε − f(X))²]
             = E[(f_o(X) + ε − f(X) + E[f_o(X)] − E[f(X)] + E[f(X)] − E[f_o(X)])²]
           =⁽¹⁾ E[(f_o(X) − f(X) − (E[f_o(X)] − E[f(X)]))²] + E[(E[f_o(X)] − E[f(X)])²] + Var[ε]
           =⁽²⁾ Var_d[f(X)] + Bias[f(X)]² + Var[ε] ,    (8.2)

where the expectation is over the joint distribution of X, Y, and the training data, =⁽¹⁾ follows from E[ε] = 0, and to write =⁽²⁾ we used the fact that
Bias[f(X)]² ≜ (E[Y − f(X)])² = (E[f_o(X) − f(X)])² = E[(E[f_o(X)] − E[f(X)])²] ,    (8.3)
Fig. 8.1: The larger (smaller) oval shows the set of possible approximations within a
class of complex (simple) classifiers. The best approximation in the more “complex”
class is closer to Y . The red region within each oval shows the set of possible estimates
of fo (X) within that class. The average distance between Y and the points within the
red region of the complex class is less than the average distance between Y and the
set of points within the highlighted region of the simple class. This shows a lower
bias of the more complex classifier.
8.1.2 How Would Ensemble Learning Possibly Help?
The difference between fo (X) and f (X) is known as reducible error because this
could be reduced, for example, by collecting more or better datasets to be used in
estimating fo (X). Therefore, we write
Let us now consider an ensemble of M base models that are trained using a
specific learning algorithm but perhaps with different (structural or algorithmic)
hyperparameters. The necessity of having the same learning algorithm in the follow-
ing arguments is the use of fo (X) as the unique optimal approximation of Y that is
achievable within the class of models considered. Nevertheless, with slightly more work these arguments could be applied to different learning algorithms. We write
Var_d^avg ≜ (1/M) Σ_{i=1}^M Var_d[f_i(X)] = (1/M) Σ_{i=1}^M (E[e_i²] − (E[e_i])²) .    (8.10)
Now let us consider an ensemble regression model that takes as its prediction the average of the f_i(X)'s; after all, each f_i(X) is the prediction made by a base model and it is reasonable (and common) to take their average in an ensemble model. Therefore, from (8.8) we can write

(1/M) Σ_{i=1}^M f_i(X) = f_o(X) + ē ,    (8.11)

where

ē = (1/M) Σ_{i=1}^M e_i .    (8.12)
The MSE of the prediction obtained by this ensemble model has a variance of deviation term similar to (8.7), which is obtained as (in the following we denote it by Var_d^ens):

Var_d^ens = E[ē²] − (E[ē])² = E[((1/M) Σ_{i=1}^M e_i)²] − (E[(1/M) Σ_{i=1}^M e_i])² .    (8.13)
To better compare the average MSE of prediction by the base models and the MSE of prediction achieved by the ensemble model (i.e., the average of their predictions), to the extent that they are affected by Var_d^avg and Var_d^ens, respectively, we make the assumption that E[e_i] = 0, i = 1, . . . , M. In that case, (8.10) and (8.13) reduce to

Var_d^avg = (1/M) Σ_{i=1}^M E[e_i²] ,
Var_d^ens = E[((1/M) Σ_{i=1}^M e_i)²] = (1/M) E[(1/M)(Σ_{i=1}^M e_i)²] .    (8.14)
It can be shown (see Exercise 4) that

(1/M) E[(Σ_{i=1}^M e_i)²] ≤ Σ_{i=1}^M E[e_i²] ,    (8.15)

which means

Var_d^ens ≤ Var_d^avg .    (8.16)
In the special case where the error terms are uncorrelated (i.e., E[e_i e_j] = 0 for i ≠ j), (8.14) yields

Var_d^ens = (1/M) Var_d^avg .    (8.17)
This is an important observation that shows a substantial reduction in the variance of deviation (i.e., by a factor of 1/M) obtained by having uncorrelated error terms. In practice, this is not usually the case, though, because typically we have one dataset and we take some overlapping subsets of it to develop our predictions f_i(X), i = 1, . . . , M.
8.2 Stacking
[Figure: stacking architecture — an input x is fed to N level-0 generalizers; their outputs are combined by a level-1 generalizer to produce the prediction ŷ.]
8.3 Bagging
Recall from Section 7.4 in which we mentioned that the decision regions of decision trees could be quite sensitive to changes in data. One question is whether we can systematically construct classifiers based on decision trees that are more robust with respect to variation in data. Let us break this question into two:
1. What is meant by variation in data and how can we simulate that?
2. How can we possibly improve the robustness?
To answer the first question, suppose we have a training data with a sample size
n. Each observation in this training set is a realization of a probability distribution
governing the problem. The probability distribution itself is the result of cumula-
tive effect of many (possibly infinite) hidden factors that could relate to technical
variabilities in experiments or stochastic nature of various phenomena. Had we had
detailed information about probabilistic characteristics of all these underlying fac-
tors, we could have constructed the probability distribution of the problem and used
it to generate more datasets from the underlying phenomenon. However, in practice,
except for the observations at hand, we have generally no further information about
the probability distribution of the problem. Therefore, it makes sense if we somehow
use the data at hand to generate more data from the underlying phenomenon (i.e., to
simulate variation in data).
But how can we generate several new legitimate datasets from the single original
data at hand? One way is to assume some observations are repeated; after all, if an
observation happens once, it may happen again (and again). Therefore, to create each
new dataset, we randomly draw with replacement n observations from the original
data at hand. However, keeping the same sample size n implies that some observations from the original training dataset do not appear in these new datasets. This way we have simulated variation in data by just assuming repetition and deletion operations on the observations from the original dataset. These newly generated datasets (i.e., from sampling the original dataset with replacement while keeping the same sample size) are known as bootstrap samples.
Now we turn to the second question. As each of these bootstrap training sets is a
legitimate dataset from the same problem, we train a decision tree on each bootstrap
sample. Then for a given feature vector x, each tree produces a prediction. We can then
aggregate these predictions to create a final prediction. The aggregation operation
could be the majority vote for classification and the averaging for regression. We note
that the final prediction is the outcome of applying decision tree algorithm on many
training datasets, and not just one dataset. Therefore, we capture more variation of
data in the aggregated model. Put in other words, the aggregated model is possibly
less affected by variation in data as compared with each of its constituent tree models
(for a more concrete analysis on why this may improve the prediction performance
refer to Section 8.1.2).
The aforementioned systematic way (i.e., creating bootstrap samples and ag-
gregating) is indeed the mechanism of bagging (short for bootstrap aggregating)
Classification:  ψ_bag(x) = argmax_{i∈{0,...,c−1}} Σ_{j=1}^B I_{ψ*_j(x)=i} ,    (8.18)

Regression:  f_bag(x) = (1/B) Σ_{j=1}^B f*_j(x) ,    (8.19)

where ψ*_j and f*_j denote the classifier and regressor trained on bootstrap sample S*_{tr,j}, respectively, the set {0, . . . , c − 1} represents the class labels, and I_A is the indicator of event A. The number of bootstrap samples B is typically set between 100 and 200.
Scikit-learn implementation: Bagging classifier and regressor are implemented by
BaggingClassifier and BaggingRegressor classes from sklearn.ensemble
module, respectively. Using base_estimator parameter, both BaggingClassifier
and BaggingRegressor can take as input a user-specified base model along with
its hyperparameters. Another important hyperparameter in these constructors is
n_estimators (by default it is 10). Using this parameter, we can control the num-
ber of base models, which is the same as the number of bootstrap samples (i.e., B
in the aforementioned notations). The scikit-learn implementation even allows us to
set the size of bootstrap samples to a size different from the original sample size.
This could be done by setting the max_samples parameter in these constructors.
The default value of this parameter (which is 1.0) along with the default value of
bootstrap parameter (which is True) leads to bagging as described before. In
these classes, there is another important parameter, namely, max_features, which
by default is set to 1.0. Setting the value of this parameter to an integer less than
p (the number of features in the data) or a float less than 1.0 trains each estimator
on a randomly picked subset of features. That being said, when we train each base
estimator on a different subset of features and samples, the method is referred to as
random patches ensemble. In this regard, we can even set bootstrap_features to
True so that the random selection is performed by replacement.
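A minimal sketch (not from the book) of bagging decision trees on the Iris data; the parameter names follow the description above (in recent scikit-learn versions base_estimator is named estimator, so the base model is passed positionally here).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

bag = BaggingClassifier(DecisionTreeClassifier(min_samples_leaf=2),
                        n_estimators=100,   # number of bootstrap samples B
                        max_samples=1.0,    # bootstrap sample size = original size
                        bootstrap=True,     # sample with replacement (bagging)
                        random_state=42)
bag.fit(X_train, y_train)
print('test accuracy = {:.3f}'.format(bag.score(X_test, y_test)))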
8.4 Random Forest

In bagging, to find the best split at each node of a tree, we examine all features and pick the split that maximizes the impurity drop. This process is then continued until one of the split-stopping criteria is met. For random forests, however, for each split in training a decision tree on a bootstrap sample, d ≤ p features are randomly picked as a candidate feature set and the best split is the one that maximizes the impurity drop over this candidate feature set. Although d could be any integer less than or equal to p, it is typically set to √p or log₂ p. This simple change in the examined features is the major factor that decorrelates trees in a random forest as compared with bagged trees (recall Section 8.1.2 where we showed that having uncorrelated error terms in the ensemble leads to a substantial reduction in the variance of deviation). We can summarize the random forest algorithm as follows:

1. Create B bootstrap samples S*_{tr,j}, j = 1, . . . , B, from the training data.
2. For each bootstrap sample, train a decision tree classifier or regressor denoted
by ψ ∗j and f j∗ , respectively, where:
2.1. To find the best split at each node, randomly pick d ≤ p features and
maximize the impurity drop over these features
2.2. Grow the tree until one of the split-stopping criteria is met (see Section
7.2.2 where four criteria were listed)
3. Given x, combine the predictions of the B trees:

Classification:  ψ_RF(x) = argmax_{i∈{0,...,c−1}} Σ_{j=1}^B I_{ψ*_j(x)=i} ,    (8.20)

Regression:  f_RF(x) = (1/B) Σ_{j=1}^B f*_j(x) .    (8.21)
Scikit-learn implementation: The random forest classifier and regressor are im-
plemented in RandomForestClassifier and RandomForestRegressor classes
from sklearn.ensemble module, respectively. Many hyperparameters in these
algorithms are similar to scikit-learn implementation of CART with the same func-
tionality but applied to all underlying base trees. However, here perhaps the use of a parameter such as max_features makes more sense because this is a major factor behind the improved performance of random forests with respect to bagged trees on many datasets. Furthermore, we need to point out a difference between the classification mechanism of RandomForestClassifier and the majority vote that we saw above. In scikit-learn, rather than the majority vote, which is also known as hard
voting, a soft voting mechanism is used to classify an instance. In this regard, for a
given observation x, first the class probability of that observation belonging to each
class is calculated for each base tree—this probability is estimated as the proportion
of samples from a class in the leaf that contains x. Then x is assigned to the class
with the highest class probability averaged over all base trees in the forest.
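A minimal sketch (not from the book) of a random forest on the Iris data; max_features controls the size d of the random candidate feature set at each split, and predict_proba exposes the averaged (soft-voting) class probabilities.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100,    # number of bootstrap samples/trees B
                            max_features='sqrt', # d = sqrt(p) features per split
                            min_samples_leaf=1,
                            random_state=42)
rf.fit(X_train, y_train)
print('test accuracy = {:.3f}'.format(rf.score(X_test, y_test)))
print(rf.predict_proba(X_test[:3]))              # class probabilities averaged over trees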
8.5 Pasting
The underlying idea in pasting (Breiman, 1999) is that rather than training one
classifier on the entire dataset, which sometimes can be computationally challenging
for large sample size, we randomly (without replacement) pick a number of modest-
size subsets of training data (say “K” subsets), construct a predictive model for
each subset, and aggregate the results. The combination rule is similar to what we
observed in Section 8.3 for bagging. Assuming ψ j (x) and f j (x) denote the classifier
and regressor trained for the j th training subset, j = 1, . . . , K, the combination rule
is:
Classification:  ψ_pas(x) = argmax_{i∈{0,...,c−1}} Σ_{j=1}^K I_{ψ_j(x)=i} ,    (8.22)

Regression:  f_pas(x) = (1/K) Σ_{j=1}^K f_j(x) .    (8.23)
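The excerpt does not tie pasting to a specific scikit-learn class; one way to realize it (a sketch) is BaggingClassifier with bootstrap=False, so that each base model is trained on a modest-size subset drawn without replacement.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pas = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=20,     # K subsets/models
                        max_samples=0.3,     # modest-size subsets (30% of the data)
                        bootstrap=False,     # sample without replacement -> pasting
                        random_state=42)
pas.fit(X, y)
print(pas.predict(X[:5]))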
8.6 Boosting
8.6.1 AdaBoost
In AdaBoost (short for Adaptive Boosting), a sequence of weak learners are trained
on a series of iteratively weighted training data (Freund and Schapire, 1997). The
weight of each observation at each iteration depends on the performance of the
trained model at the previous iteration. After a certain number of iterations, these
models are then combined to make a prediction. The good empirical performance of
AdaBoost on many datasets has led some pioneers such as Leo Breiman to refer to this ensemble technique as “the most accurate general purpose classification algorithm available” (see (Breiman, 2004)). Boosting was originally developed for classification and then extended to regression. For classification, it was shown that even if the base classifiers are weak classifiers, which are efficient to train, boosting can produce powerful classifiers. A weak classifier is a model with a performance slightly better than random guessing (for example, its accuracy on balanced training data in binary classification problems is slightly better than 0.5). An example of a weak learner is a decision stump, which is a decision tree with only one split. In some extensions of AdaBoost to regression, for example, AdaBoost.R proposed in (Drucker,
1997), the “accuracy better than 0.5” restriction was carried over and changed to the
“average loss less than 0.5”.
Here we discuss AdaBoost.SAMME (short for Stagewise Additive Modeling us-
ing a Multi-class Exponential Loss Function) for classification (Zhu et al., 2009)
and AdaBoost.R2 proposed in (Drucker, 1997) for regression, which is a modifi-
cation of AdaBoost.R proposed earlier in (Freund and Schapire, 1997). Suppose a
sample of size n collected from c classes and a learning algorithm that can handle
multiclass classification and supports weighted samples (e.g., decision stumps [see
Section 7.2.5]) are available. AdaBoost.SAMME is implemented as in the following
algorithm.
1. Assign a weight w_j to each observation j = 1, 2, . . . , n and initialize them to w_j = 1/n.
2. For m from 1 to M:
   2.1. Apply the learning algorithm to the weighted training sample to train classifier ψ_m(x).
   2.2. Compute the error rate of the classifier ψ_m(x) on the weighted training sample:

        ε̂_m = Σ_{j=1}^n w_j I_{y_j ≠ ψ_m(x_j)} / Σ_{j=1}^n w_j .    (8.24)

   2.3. Compute the confidence coefficient of ψ_m(x):

        α_m = log((1 − ε̂_m)/ε̂_m) + log(c − 1) .    (8.25)

   2.4. Update the weight of each observation:

        w_j ← w_j × exp(α_m I_{y_j ≠ ψ_m(x_j)}) .    (8.26)

3. Given x, the final classification is made by

   ψ_boost(x) = argmax_{i∈{0,...,c−1}} Σ_{m=1}^M α_m I_{ψ_m(x)=i} .    (8.27)
In order for α_m to be positive, the base classifier ψ_m(x) should perform better than random guessing; that is, ε̂_m < 1 − 1/c. In binary classification, this means ε̂_m < 0.5. This is also used to introduce an early stopping criterion in AdaBoost; that is to say, the process of training the M base classifiers is terminated as soon as ε̂_m > 1 − 1/c. Another criterion for early stopping is when we achieve perfect classification in the sense of having ε̂_m = 0.
3. To make the final prediction by ψ_boost(x), those classifiers ψ_m(x), m = 1, 2, . . . , M, that had a lower error rate ε̂_m during training have a greater α_m, which means they have a higher contribution. In other words, α_m is an indicator of confidence in classifier ψ_m(x).
4. If at any point during training AdaBoost the error on the weighted training sample becomes 0 (i.e., ε̂_m = 0), we can stop training. This is because when ε̂_m = 0 the weights will not be updated and the base classifiers for the subsequent iterations remain the same. At the same time, because α_k = 0, k ≥ m, they are not added to ψ_boost(x), which means the AdaBoost classifier remains the same.
The AdaBoost.R2 (Drucker, 1997) algorithm for regression is presented on the
next page.
Scikit-learn implementation: AdaBoost.SAMME can be implemented using AdaBoostClassifier from the sklearn.ensemble module by setting the algorithm parameter to 'SAMME'. AdaBoost.R2 is implemented by AdaBoostRegressor from
the same module. Using base_estimator parameter, both AdaBoostClassifier
and AdaBoostRegressor can take as input a user-specified base model along
with its hyperparameters. If this parameter is not specified, by default a decision
stump (a DecisionTreeClassifier with max_depth of 1) will be used. The
n_estimators parameter (with default value 50) controls the maximum number of base models (M in the aforementioned formulations of these algorithms). In case of a perfect fit, the
training terminates before reaching M. Another hyperparameter is the learning rate
η (identified as learning_rate). The default value of this parameter, which is
1.0, leads to the original representation of AdaBoost.SAMME and AdaBoost.R2
presented in (Zhu et al., 2009) and (Drucker, 1997), respectively.
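A minimal sketch (not from the book) of AdaBoost.SAMME with decision stumps on the Iris data, using the parameters described above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # a decision stump
                         n_estimators=50,      # M, the maximum number of base models
                         learning_rate=1.0,    # the learning rate eta
                         algorithm='SAMME',
                         random_state=42)
ada.fit(X_train, y_train)
print('test accuracy = {:.3f}'.format(ada.score(X_test, y_test)))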
8.6.2 Gradient Boosting
In (Friedman et al., 2000), it was shown that one can write AdaBoost as a stagewise
additive modeling problem with a specific choice of loss function. This observation, along with the connection between stagewise additive modeling and steepest-descent minimization in function space, gave rise to a new family of boosting strategies known as gradient boosting (Friedman, 2001). Here we first present an intuitive development
of the gradient boosting technique, then present its formal algorithmic description
as originally presented in (Friedman, 2001, 2002), and finally its most common
implementation known as gradient boosted regression tree (GBRT).
Algorithm AdaBoost.R2 Algorithm for Regression
1. Assign a weight w_j to each observation j = 1, 2, . . . , n and initialize them to w_j = 1/n.
3. For m from 1 to M:
3.2. Apply the learning algorithm to the created sample to train regressor fm (x) .
The Intuition behind gradient boosting: Given a training data of size n, denoted
Str = {(x1, y1 ), (x2, y2 ), . . . , (xn, yn )}, we aim to use a sequential process to construct
a predictive model that can estimate the response y for a given feature vector x.
Suppose that in this sequential process, our estimate of y after M iterations is based on F_M(x), which has an additive expansion of the form:

F_M(x) = Σ_{m=0}^M β_m h_m(x) .    (8.33)
The set of h_m(x) is known as the set of basis functions, with β_m being the coefficient of h_m(x) in the expansion. From the boosted learning perspective, h_m(x) is a “base learner” (usually taken as a decision tree) and β_m is its contribution to the final prediction. Suppose we wish to implement a greedy algorithm where at iteration m, β_m h_m(x) is added to the expansion but we are not allowed to revise the decisions made in the previous iterations (i.e., β_0 h_0(x), . . . , β_{m−1} h_{m−1}(x)). Therefore, at iteration M, we can write

ŷ = F_M(x) = F_{M−1}(x) + β_M h_M(x) ,    (8.34)

where ŷ denotes the estimate of y. For any iteration m ≤ M, our estimate ŷ_m is obtained as

ŷ_m = F_m(x) = F_{m−1}(x) + β_m h_m(x) .    (8.35)
Suppose the function Fm−1 (x) = Fm−2 (x) + βm−1 hm−1 (x) is somehow obtained.
Of course this itself gives us an estimate of yi, i = 1, . . . , n, but now we would like to
estimate Fm (x) = Fm−1 (x) + βm hm (x) to have a better estimate of yi . As Fm−1 (x) is
already obtained, estimating Fm (x) is equivalent to estimating βm hm (x) (recall that
our algorithm is greedy). But how can we find βm hm (x) to improve the estimate? One
option is to learn hm (x) by constraining that to satisfy βm hm (xi ) = yi − Fm−1 (xi );
that is, βm hm (xi ) becomes the residual yi − Fm−1 (xi ). With this specific choice we
have
This means that with this specific choice of βm hm (x), we have at least perfect
estimates of response for training data through the imposed n contraints. However,
this does not mean βm hm (x) will lead to perfect estimate of response for any given
x. Although (8.36) looks trivial to some extent, there is an important observation to
make. Note that if we take the squared-error loss
L(y, F(x)) = (1/2)(y − F(x))² ,    (8.37)

then

∂L(y, F(x))/∂F(x) = F(x) − y ,    (8.38)

and therefore,

[∂L(y, F(x))/∂F(x)]_{F(x)=F_{m−1}(x), x=x_i, y=y_i} = F_{m−1}(x_i) − y_i .    (8.39)

The left side of (8.39) is sometimes written as ∂L(y_i, F_{m−1}(x_i))/∂F_{m−1}(x_i) but, for the sake of clarity, we prefer the notation used in (8.39). From (8.39) we can write (8.36) as

F_m(x_i) = F_{m−1}(x_i) − [∂L(y, F(x))/∂F(x)]_{F(x)=F_{m−1}(x), x=x_i, y=y_i} ,  i = 1, . . . , n .    (8.40)
This means that given Fm−1 (x), we move in the direction of negative of gradient of loss
with respect to the function. This is indeed the “steepest descent” minimization in the
function space. The importance of viewing the “correction” term in (8.36) as in (8.40)
is that we can now replace the squared-loss function with any other differentiable loss
function (for example, to alleviate the sensitivity of the squared-error loss to outliers). Although the negative of the gradient of the squared-error loss becomes the residual, for other loss functions it does not; therefore, r_im ≜ −[∂L(y, F(x))/∂F(x)]_{F(x)=F_{m−1}(x), x=x_i, y=y_i} is known as a “pseudo”-residual (Friedman, 2002).
Note that hm (x) is a member of a class of models (e.g., a weak learner). Therefore,
one remaining question is whether we can impose the n aforementioned constraints
on h_m(x_i); that is to say, can we enforce the following relation for an arbitrary differentiable loss function?

β_m h_m(x_i) = r_im ,  i = 1, . . . , n .    (8.41)

Perhaps we can not ensure (8.41) exactly; however, we can choose a member of this weak learner class (to learn it) that produces a vector h_m = [h_m(x_1), . . . , h_m(x_n)]^T, which is most parallel to r_m = [r_1m, r_2m, . . . , r_nm]^T. After all, having h_m be most parallel to r_m implies taking a direction that is as parallel as possible to the steepest descent direction across all n observations. In this regard, in (Friedman, 2001), h_m(x) is obtained from
h_m(x) = argmin_{h(x),ρ} Σ_{i=1}^n (r_im − ρ h(x_i))² .    (8.42)
As we can see, to learn h_m(x) we replace the responses y_i with r_im (the pseudo-residual at iteration m). Therefore, one may refer to these pseudo-residuals as pseudo-responses, as in (Friedman, 2001). Once h_m(x) is learned from (8.42), we can determine the step size β_m from

β_m = argmin_β Σ_{i=1}^n L(y_i, F_{m−1}(x_i) + β h_m(x_i)) .    (8.43)
Now we can use hm (x) and βm to update the current estimate Fm−1 (x) from (8.35).
For implementation of this iterative algorithm, we can initialize F0 (x) and terminate
the algorithm when we reach M. To summarize the gradient boosting algorithm, it is more convenient to first present the algorithm for regression, which aligns well with the aforementioned description.
Gradient Boosting Algorithm for Regression: Here we present an algorithmic
description of gradient boosting as originally presented in (Friedman, 2001, 2002).
In this regard, hereafter, we assume hm (x) is a member of a class of functions that is
parametrized by am = {am,1, am,2, . . .}. As a result, estimating hm (x) is equivalent
to estimating the parameter set am and, therefore, we can use notation h(x; am ) to
refer to the weak learner at iteration m.
2.2. Estimate (train) the weak learner parameter set using the pseudo-residuals and the least-squares criterion:

     a_m = argmin_{a,ρ} Σ_{i=1}^n (r_im − ρ h(x_i; a))² .
For the squared-error loss (L(y_i, a) = (1/2)(y_i − a)²), for example, lines 1 and 2.1 in the above algorithm simplify to F_0(x) = (1/n) Σ_{i=1}^n y_i and r_im = y_i − F_{m−1}(x_i), respectively. Using the absolute-error loss (L(y_i, a) = |y_i − a|), lines 1 and 2.1 are replaced with F_0(x) = median(y_i ∈ S_tr) and r_im = sign(y_i − F_{m−1}(x_i)), respectively, where sign(b) is 1 for b ≥ 0 and −1, otherwise.
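A minimal from-scratch sketch (not the book's code) of gradient boosting for regression with the squared-error loss: the model starts from the mean of the targets, and at each iteration a small regression tree is fitted to the residuals (the negative gradients) and added with a shrinkage factor nu. The data is synthetic.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)   # assumed synthetic data

M, nu = 100, 0.1                     # number of boosting iterations and shrinkage
F = np.full_like(y, y.mean())        # F0(x): mean of the targets (squared-error loss)
trees = []
for m in range(M):
    r = y - F                                             # pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)   # weak learner fitted to residuals
    F += nu * tree.predict(X)                             # update the current estimate
    trees.append(tree)

def predict(X_new, y_mean=y.mean()):
    # sum the (shrunk) contributions of all trees on top of the initial estimate
    return y_mean + nu * sum(t.predict(X_new) for t in trees)

print(predict(np.array([[0.5]])))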
8.6.3 Gradient Boosting Regression Tree
In GBRT, the weak learner at iteration m is a regression tree with J terminal nodes (leaves), which can be written as

h(x; b_m) = Σ_{j=1}^J b_jm I_{x∈R_jm} ,    (8.45)
where I {. } is the indicator function, b jm is the estimate of the target (also re-
ferred to as score) for any training sample point falling into region R jm , and
bm = {b1m, b2m, . . . , bJ m }. Using (8.45) as the update rule in line 2.4 of the Gradient
Boosting algorithm and taking γ jm = βm b jm leads to the following algorithm for
implementation of GBRT.
2.2. Train a J-terminal regression tree using Str = {(x1, r1m ), . . . , (xn, rnm )}
and a mean squared error (MSE) splitting criterion (see Section 7.3) to give
terminal regions R jm, j = 1, . . . , J .
In line 2.4 of GBRT, one can also use shrinkage, in which the update rule is

F_m(x) = F_{m−1}(x) + ν Σ_{j=1}^J γ_jm I_{x∈R_jm} ,    (8.47)

where 0 < ν ≤ 1.
Gradient Boosting Regression Tree (GBRT) for Classification: GBRT is ex-
tended into classification by growing one tree for each class (using squared-error
splitting criterion) at each iteration of boosting and using a multinomial deviance
loss (to compute the pseudo-residuals by its gradient). Without going into the details,
we present the algorithm for c-class classification (readers can refer to (Friedman,
2001) for details).
2. For m from 1 to M:

   p_km(x) = e^{F_{k(m−1)}(x)} / Σ_{l=0}^{c−1} e^{F_{l(m−1)}(x)} ,  k = 0, . . . , c − 1 .    (8.48)

3. The class labels are then found by computing p_k(x) = e^{F_{kM}(x)} / Σ_{l=0}^{c−1} e^{F_{lM}(x)} , k = 0, . . . , c − 1, and ŷ = argmax_k p_k(x), where ŷ is the predicted label for x.
XGBoost (short for eXtreme Gradient Boosting) is a popular and widely used method
that, since its inception in 2014, has helped many teams win in machine learning
competitions. XGBoost gained momentum in machine learning after Tianqi Chen
and Tong He won a special award named “High Energy Physics meets Machine
Learning” (HEP meet ML) for their XGBoost solution during the Higgs Boson
Machine Learning Challenge (Adam-Bourdarios et al., 2015; Chen and He, 2015).
The challenge was posted on Kaggle in May 2014 and lasted for about 5 months,
attracting more than 1900 people competing in 1785 teams. Although the XGBoost
did not achieve the highest score on the test data, Chen and He received the special
“HEP meet ML” award for their solution that was chosen as the most useful model
(judged by a compromise between performance and simplicity) for the ATLAS ex-
periment at CERN. From an optimization point of view, there are two properties that
distinguish XGBoost from GBRT:
where h(x; bm ) is the mapping that characterizes a regression tree learned at iteration
m and is defined similar to (8.45). Hereafter, and for the ease of notations, we denote
such a mapping that defines a regression tree as h(x), and one learned during the
sequential process at iteration m as hm (x). Therefore, for a tree with J leaf nodes,
h(x) is defined similarly to (8.45), by dropping the iteration index m; that is,
h(x) = Σ_{j=1}^J b_j I_{x∈R_j} ,    (8.50)

where {R_j}_{j=1}^J is the set of regions that partition the feature space and are identified
by the J leaf nodes in the tree, and b j is the score given to any training sample point
falling into region R j . Hereafter, we use h and hm to refer to a regression tree that is
characterized by mappings h(x) and hm (x), respectively. At iteration m, hm is found
by minimizing a regularized objective function as follows:
h_m = argmin_{h∈F} Σ_{i=1}^n L(y_i, F_{m−1}(x_i) + h(x_i)) + Ω(h) ,    (8.51)
where L is a loss function to measure the difference between yi and its estimate at
iteration m, Ω(h) measures the complexity of the tree h, and F is the space of all
regression trees. Including Ω(h) in the objective function penalizes the complexity
of trees. This is one way to guard against overfitting, which is more likely to happen
with a complex tree than a simple tree. Recall that we can approximate a real-valued function f(x + ∆x) using a second-order Taylor expansion as

f(x + ∆x) ≈ f(x) + (df(x)/dx) ∆x + (1/2)(d²f(x)/dx²) ∆x² .    (8.52)

Replacing f(·), x, and ∆x in (8.52) with L(y_i, F_{m−1}(x_i) + h(x_i)), F_{m−1}(x_i), and h(x_i), respectively, yields

L(y_i, F_{m−1}(x_i) + h(x_i)) ≈ L(y_i, F_{m−1}(x_i)) + g_{i,1} h(x_i) + (1/2) g_{i,2} h(x_i)² ,    (8.53)
where

g_{i,1} = [∂L(y, F(x))/∂F(x)]_{F(x)=F_{m−1}(x), x=x_i, y=y_i} ,    (8.54)

g_{i,2} = [∂²L(y, F(x))/∂F(x)²]_{F(x)=F_{m−1}(x), x=x_i, y=y_i} .    (8.55)
For simplicity, the partial derivatives in (8.54) and (8.55) are sometimes written as ∂L(y_i, F_{m−1}(x_i))/∂F_{m−1}(x_i) and ∂²L(y_i, F_{m−1}(x_i))/∂F_{m−1}(x_i)², respectively. Previously, in (8.39), we specified (8.54) for the squared-error loss. For the same loss, as an example, it is straightforward to see g_{i,2} = 1. Using (8.53) in (8.51) yields

h_m = argmin_{h∈F} Σ_{i=1}^n [ L(y_i, F_{m−1}(x_i)) + g_{i,1} h(x_i) + (1/2) g_{i,2} h(x_i)² ] + Ω(h) .    (8.56)
As L(y_i, F_{m−1}(x_i)) does not depend on h, it is treated as a constant term in the objective function and can be removed. Therefore,

h_m = argmin_{h∈F} Σ_{i=1}^n [ g_{i,1} h(x_i) + (1/2) g_{i,2} h(x_i)² ] + Ω(h) .    (8.57)
The summation in (8.57) is over the sample points. However, because each sample point belongs to only one leaf, we can rewrite the summation as a summation over leaves. To do so, let I_j denote the set containing the indices of training data in leaf j; that is, I_j = {i | x_i ∈ R_j}. Furthermore, let J_h denote the number of leaves in tree h. Then, using (8.50) we can write (8.57) as

h_m = argmin_{h∈F} Σ_{j=1}^{J_h} [ (Σ_{i∈I_j} g_{i,1}) b_j + (1/2)(Σ_{i∈I_j} g_{i,2}) b_j² ] + Ω(h) ,    (8.58)

which for simplicity can be written as

h_m = argmin_{h∈F} Σ_{j=1}^{J_h} [ G_{j,1} b_j + (1/2) G_{j,2} b_j² ] + Ω(h) ,    (8.59)

where

G_{j,1} = Σ_{i∈I_j} g_{i,1} ,   G_{j,2} = Σ_{i∈I_j} g_{i,2} .    (8.60)
There are various ways to define Ω(h), but one way that was originally proposed in (Chen and Guestrin, 2016; Chen and He, 2015) and works quite well in practice is to define

Ω(h) = γ J_h + (1/2) λ Σ_{j=1}^{J_h} b_j² ,    (8.61)

where γ and λ are two tuning parameters. The first term in (8.61) penalizes the number of leaves in the minimization (8.59) and the second term has a shrinkage effect on the scores similar to what we saw in Section 6.2.2 for logistic regression. In other words, large scores are penalized and, therefore, scores have less room to wiggle. At the same time, having less variation among the scores guards against overfitting the training data. Replacing (8.61) in (8.59) yields

h_m = argmin_{h∈F} obj_m ,    (8.62)

where

obj_m = Σ_{j=1}^{J_h} [ G_{j,1} b_j + (1/2)(G_{j,2} + λ) b_j² ] + γ J_h .    (8.63)
For a fixed tree structure, taking the derivative of obj_m with respect to b_j gives

∂obj_m/∂b_j = G_{j,1} + (G_{j,2} + λ) b_j .    (8.64)

Setting the derivative (8.64) to 0, we can find the optimal value of the scores for a fixed structure with J_h leaves as

b_j* = − G_{j,1}/(G_{j,2} + λ) ,  j = 1, . . . , J_h ,    (8.65)

and replacing (8.65) in (8.63) gives the optimal value of the objective function:

obj_m* = −(1/2) Σ_{j=1}^{J_h} G_{j,1}²/(G_{j,2} + λ) + γ J_h .    (8.66)
Now we return to the first part of the aforementioned strategy. It is generally in-
tractable to enumerate all possible tree structures. Therefore, we can use a greedy
algorithm similar to CART that starts from a root node and grow the tree. However,
in contrast with impurity measures used in CART (Chapter 7), in XGBoost (8.66) is
used to measure the quality of splits. Note that from (8.66), the contribution of each leaf node k to obj_m* is −(1/2) G_{k,1}²/(G_{k,2} + λ) + γ. We can write the objective function as

obj_m* = −(1/2) Σ_{j=1, j≠k}^{J_h} G_{j,1}²/(G_{j,2} + λ) + γ(J_h − 1) + [ −(1/2) G_{k,1}²/(G_{k,2} + λ) + γ ] ,    (8.67)

where the bracketed term is the contribution of node k.
Suppose node k is split into left and right nodes k_L and k_R. The objective function after this split is

obj_{m,split}* = −(1/2) Σ_{j=1, j≠k}^{J_h} G_{j,1}²/(G_{j,2} + λ) + γ(J_h − 1) + [ −(1/2) G_{kL,1}²/(G_{kL,2} + λ) + γ ] + [ −(1/2) G_{kR,1}²/(G_{kR,2} + λ) + γ ] ,    (8.68)

where the two bracketed terms are the contributions of nodes k_L and k_R, respectively. The gain of the split is therefore

gain = obj_m* − obj_{m,split}* = (1/2) [ G_{kL,1}²/(G_{kL,2} + λ) + G_{kR,1}²/(G_{kR,2} + λ) − G_{k,1}²/(G_{k,2} + λ) ] − γ .    (8.69)
To identify the best split at a node, one can enumerate all features and choose the split that results in the maximum gain obtained by (8.69). Similar to CART, if we assume the values of the training data ready for the split have v_j distinct values for feature x_j, j = 1, . . . , p, then the number of candidate splits (halfway between consecutive values) is (Σ_{j=1}^p v_j) − p. The gain obtained by (8.69) can be interpreted as the quality of the split. Depending on the regularization parameter γ, it can be negative for some splits. We can grow the tree as long as the gain for some split is still positive. A larger γ results in more splits having a negative gain and, therefore, a more conservative tree expansion. This way we can also think of γ as the minimum gain required to split a node.
Scikit-learn-style implementation of XGBoost: XGBoost is not part of scikit-
learn but its Python library can be installed and it has a scikit-learn API that
makes it accessible similar to many other estimators in scikit-learn. If Anaconda
has been installed as instructed in Chapter 2, one can use Conda package manager
to install xgboost package for Python (see instructions at [12]). As XGBoost is
essentially a form of GBRT, it can be used for both classification and regression.
The classifier and regressor classes can be imported as from xgboost import
XGBClassifier and from xgboost import XGBRegressor, respectively. Once
imported, they can be used similarly to scikit-learn estimators. Some of the im-
portant hyperparameters in these classes are n_estimators, max_depth, gamma,
reg_lambda, and learning_rate, which are the maximum number of base trees,
the maximum depth of each base tree, the minimum gain to split a node (see (8.61)
and (8.69)), the regularization parameter used in (8.61), and the shrinkage coefficient
ν that can be used in (8.49) to update Fm−1 (x) similar to (8.47), respectively.
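A minimal sketch (not from the book) of the scikit-learn-style XGBoost API, assuming the xgboost package has been installed as described above; the hyperparameters shown are the ones listed in the text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

xgb = XGBClassifier(n_estimators=100,    # maximum number of base trees
                    max_depth=3,         # maximum depth of each base tree
                    gamma=0.0,           # minimum gain required to split a node
                    reg_lambda=1.0,      # the lambda regularization in (8.61)
                    learning_rate=0.3)   # the shrinkage coefficient nu
xgb.fit(X_train, y_train)
print('test accuracy = {:.3f}'.format(xgb.score(X_test, y_test)))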
Exercises:
Exercise 1: Load the training and test Iris classification dataset that was prepared
in Section 4.6. Select the first two features (i.e., columns with indices 0, 1). Write
a program to train bagging, pasting, random forests, and boosting using CART for
four cases that are obtained by the combination of min_samples_leaf∈ {2, 10}
and n_estimators∈ {5, 20} and record (and show) their accuracies on the test
data. Note that a min_samples_leaf=10 will lead to a “shallower” tree than
min_samples_leaf=2. Using the obtained accuracies, fill the blanks in the fol-
lowing lines:
For bagging, the highest accuracy is obtained by a ______ tree (choose: Shallow or
Deep)
For pasting, the highest accuracy is obtained by a ______ tree (choose: Shallow or
Deep)
For random forest, the highest accuracy is obtained by a ______ tree (choose: Shal-
low or Deep)
For AdaBoost, the highest accuracy is obtained by a ______ tree (choose: Shallow
or Deep)
until the 10th iteration and stop training at the end of iteration 10. We refer to the error rates of ψ_boost,5(x) and ψ_boost,10(x) on a specific test dataset as the “test error”. Can we
This assumption ideally approximates the situation where having one less ob-
servation in data does not considerably affect the predictability of a classification
rule—this could be the case especially if n is relatively large. Let p_E denote the accuracy of ψ_E(x). Show that p_E > p.
Exercise 4: Show that

(1/M) E[(Σ_{i=1}^M e_i)²] ≤ Σ_{i=1}^M E[e_i²] .    (8.72)
Chapter 9
Model Evaluation and Selection
Let Str,n = {(x1, y1 ), (x2, y2 ), . . . , (xn, yn )} denote the training set where index n is
used in Str,n to highlight the sample size. We will discuss the following estimation
rules:
• Hold-out Estimator
• Resubstitution
• Bootstrap
• Cross-validation (CV)
– Standard K-fold CV
– Leave-one-out
– Stratified K-fold CV
– Shuffle and Split (Random Permutation CV)
ψl (x) is the outcome of applying the learning algorithm Ψ to Str,l . Viewing ε̂h
as the estimate of the performance of the learning algorithm outcome would help prevent confusion and misinterpretations, especially in cases where the learning process contains steps such as scaling and feature selection and extraction (Chapter 11). As
a result, we occasionally refer to the estimate of the performance of a trained model
as the performance of Ψ, which is the learning algorithm that led to the model.
Despite the intuitive meaning of hold-out estimator, its use in small sample (when
the sample size with respect to feature size is relatively small) could be harmful for
both the training stage and the evaluation process. This is because as compared with
large sample, in small sample each observation has generally (and not surprisingly) a
larger influence on the learning algorithm. In other words, removing an observation
from a small training dataset generally harms the performance of a learning algorithm more than removing an observation from a large training set. Therefore, when n is relatively small, holding out a portion of Str,n could severely hamper the performance of the learning algorithm. At the same time, we already have a small sample to work with and the hold-out estimate is obtained on an even smaller set (say 25% of n). Therefore, we can easily end up with an inaccurate and unreliable estimate of the performance, not only due to the small test size but also because the estimate could be heavily biased towards the specific set picked for testing. As we will see later in this
section, there are more elegant rules that can alleviate both aforementioned problems
associated with the use of hold-out estimator in small-sample settings.
Nonetheless, one way to remove the harmful impact of holding out a portion of
Str,n for testing in small-sample is that once ε̂h is obtained, we apply Ψ to the entire
dataset Str,n to train a classifier ψn (x) but still use ε̂h , which was estimated on Ste,m
using ψl (x), as the performance estimate of ψn (x). There is nothing wrong with this
practice: we are just trying to utilize all available data in training the classifier that
will be finally used in practice. The hold-out estimate ε̂h can be used as an estimate
of the performance of ψn (x) because ε̂h is our final judgement (at least based on
hold-out estimator) about the performance of Ψ in the context of interest. Not only
that, to train ψn (x) we add more data to the data that was previously used to train
ψl (x). With “well-behaved” learning algorithms, this should not generally lead to a
worse performance. Although one may argue that it makes more sense to use ε̂h as
an estimate of an upper bound on the error rate of ψn (x), on a given dataset and with
hold-out estimator, we have no other means to examine that and, therefore, we use
ε̂h as the estimate of error rate of ψn (x).
Resubstitution Estimator: The outcome of applying Ψ to a training set Str,n is
a classifier ψn (x). The proportion of samples within Str,n that are misclassified by
ψn (x) is the resubstitution estimate of the error rate (Smith, 1947) (also known
as apparent error rate). Resubstitution estimator will generally lead to an overly
optimistic estimate of the performance. This is not surprising because the learning
algorithm attempts to fit Str,n to some extent. Therefore, it naturally performs better
on Str,n rather than on an independent set of data collected from the underlying
feature-label distributions. As a result this estimator is not generally recommended
for estimating the performance of a predictive model.
following algorithm.
Algorithm E0 Estimator
1) Set B to an integer between 25 and 200 (based on a suggestion from Efron in (Efron, 1983)).

4) The ε̂_0^B error estimate is obtained as the ratio of the number of misclassified observations among the samples in the A_k's over the total number of samples in the A_k's; that is,

   ε̂_0^B = (Σ_{k=1}^B e_k) / (Σ_{k=1}^B |A_k|) .    (9.1)
There are other variants of bootstrap (e.g., 0.632 or 0.632+ bootstrap); however, we
keep our discussion on bootstrap estimators short because they are computationally
more expensive than K-fold CV (for small K)—recall that B in bootstrap is generally
between 25 and 200.
Cross-validation: Cross-validation (Lachenbruch and Mickey, 1968) is perhaps the
most common technique used for model evaluation because it provides a reasonable
compromise between computational cost and reliability of estimation. To understand
the essence of cross-validation (abbreviated as CV), we take a look at the hold-out
estimator again. In the hold-out, we split Str,n to a training set Str,l and a test
set Ste,m. We previously discussed the problems associated with the use of the hold-out estimator, especially in small-sample settings. In particular, our estimate could be biased to the choice of the held-out set Ste,m, which means that the hold-out estimator can have a large
variance from sample to sample (collected from the same population). To improve
the variance of hold-out estimator, we may decide to switch the role of training and
test sets; that is to say, train a new classifier on Ste,m , which is now the training set,
and test its performance on Str,l , which is now treated as the test set. This provides
us with another hold-out estimate based on the same split (except for switching the
roles) that we can use along with the previous estimate to estimate the performance
of the learning algorithm Ψ (e.g., by taking their average). Although to obtain the two
hold-out estimates we preserved the nice property of keeping the test data separate
from training, the second training set (i.e., Ste,m ) has a much smaller sample size with
respect to the entire data at hand—recall that m is typically ∼25% of n. Therefore,
the estimate of the performance that we see by using Ste,m as the training set can not generally be a good estimate of the performance of applying Ψ to Str,n, especially in situations where n is not large. To bring the training sample size used to obtain both
of these hold-out estimates as close as possible to n, we need to split the data equally
to training and test sets; however, even that can not resolve the problem for moderate
to small n because a significant portion of the data (i.e., 50%) is not still used for
training. As we will see next, CV also uses independent pairs of test and training
sets to estimate the performance. However, in contrast with the above procedure, the
procedure of training and testing is repeated multiple times (to reduce the variance)
and rather than 50% of the entire data generally larger training sets are used (to
reduce the bias).
Standard K-fold CV: This is the basic form of cross-validation and is implemented as in the following algorithm.

1. Split Str,n into K (approximately) equal-size folds, denoted Fold_k, k = 1, . . . , K.
2. For k from 1 to K:
   2.1. Hold out Fold_k from Str,n to obtain the Str,n − Fold_k dataset (i.e., the entire data excluding fold k).
   2.2. Apply Ψ to Str,n − Fold_k to train a surrogate classifier ψ_k(x).
   2.3. Classify observations in Fold_k using ψ_k(x). Let e_k denote the number of misclassified observations in Fold_k.
3. Compute the K-fold CV error estimate, denoted ε̂_cv^{K-fold}, as the ratio of the total number of misclassified observations over the total sample size n; that is,

   ε̂_cv^{K-fold} = (1/n) Σ_{k=1}^K e_k .    (9.2)
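A minimal sketch (not the book's code) of standard K-fold CV with scikit-learn; cross_val_score returns per-fold accuracies, and since the Iris folds have equal size, one minus their mean coincides with the estimate in (9.2).

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=False)                 # 5 consecutive folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print('per-fold accuracy:', scores)
print('5-fold CV error estimate = {:.3f}'.format(1 - scores.mean()))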
Fig. 9.1: The working mechanism of standard K-fold CV where the data is divided
into 5 consecutive folds (K = 5).
lower bias with respect to the K-fold CV (K < n) that has n/K observations less
than Str,n . That being said, loo has a larger variance (Kittler and DeVijver, 1982).
Furthermore, we note that the use of the loo estimator requires constructing n surrogate classifiers. Therefore, this estimator is practically feasible only when n is not large. In addition,
Stratified K-fold CV: In Section 4.4, we stated that it is generally a good practice
to split the data into training and test sets in a stratified manner to have the same
proportion of samples from different classes in both the training and the test sets.
This is because under a random sampling for data collection, the proportion of
samples from each class is itself an estimate of the prior probability of that class;
that is, an estimate of the probability that an observation from a class appears before
making any measurements. These prior probabilities are important because once
a classifier is developed for a problem with specific class prior probabilities, one
expects to employ the classifier in the future under a similar context where the prior
probabilities are similar. In other words, in a random sampling, if the proportion of
samples from each class is changed radically from the training set to the test set (this is especially common in the case of sampling the populations separately), we may naturally expect a failed behaviour from both the trained classifier (Shahrokh Esfahani and Dougherty, 2013) and its performance estimators (Braga-Neto et al., 2014).
In terms of model evaluation using CV, this means in order to have a realistic view
of the classifier performance on test data, in each iteration of CV we should keep the
proportion of samples from each class in the training set and the held out fold similar
to the full training data. This is achieved by another variant of cross-validation known
as stratified K-fold CV, which in classification problems is generally the preferred
method compared with the standard K-fold CV. Fig. 9.3 shows the training and
test sample indices for the stratified K-fold CV where the data is divided into 5
consecutive folds. The size of each held-out fold in this figure is similar to the size of
the folds in Fig. 9.1; however, the proportion of samples from each class that appears in each fold is kept approximately the same as in the original dataset. As before, we may
even shuffle data before splitting to reduce the chance of having a systematic bias in
the data. Fig. 9.4 shows a schematic representation of splits in this setting.
Fig. 9.3: The working mechanism of stratified K-fold CV where the data is divided
into a set of consecutive folds such that the proportion of samples from each class is
kept the same as in the full data (K = 5).
Fig. 9.4: The working mechanism of stratified K-fold CV with shuffling (K = 5).
Shuffle and Split (Random Permutation CV): Suppose we have a training dataset
that is quite large in terms of sample size and, at the same time, we have a compu-
tationally intensive Ψ to evaluate. In such a setting, even training one single model
using all the available dataset may take a relatively long time. Evaluating Ψ on this
data using leave-one-out is certainly not a good option because we need to repeat the
entire training process n times on almost the entire dataset. We may decide to use,
for example, 5-fold CV. However, even that may not be efficient because at each
iteration of 5-fold CV, the training data has a sample size of n − n/5, which is still
large and could be an impediment to repeating the training process several times.
In such a setting, we may still decide to train our final model on the entire training
9.1 Model Evaluation 245
data (to use all available data in training the classifier that will be finally used in
practice), but for evaluation, we may randomly choose a subset of the entire dataset
for training and testing, and repeat this procedure K times. This shuffle-and-split
evaluation process (also known as random permutation CV) does not guarantee that
an observation used once in a fold as a test observation does not appear in
other folds. Fig. 9.5 shows a schematic representation of the splits for this estimator
where we choose the size of the training data at each iteration to be 50% of the entire
dataset, the test data to be 20%, and the remaining 30% to be unused (neither used for
training nor for testing).
One might view the performance estimate obtained this way as an overly pes-
simistic estimate of our final trained model; after all, the final model is generally
trained on a much larger dataset, which means the final model has generally a better
performance compared to any surrogate model trained as part of this estimator. How-
ever, we need to note that removing a portion of samples in large-sample settings
has generally less impact on the performance of a learning algorithm than removing
a similar proportion in small-sample situations. At the same time, shuffle-and-split
is commonly used in large-sample settings, which to some extent limits the degree
of its pessimism.
Fig. 9.5: The working mechanism of shuffle and split where the training and test
data at each iteration are 50% and 20% of the data, respectively (the remainder is unused).
In contrast with other forms of cross-validation, here an observation could be used
for testing multiple times across various CV iterations.
X_train_shape: (120, 4)
X_test_shape: (30, 4)
y_train_shape: (120,)
y_test_shape: (30,)
K-fold CV iteration 1
Train indices: [ 1 2 3 5 6 7 8 13 14 16 17 19 20 21
,→23 25 27 28 29 32 33 34 35 37 38 39 41 43 46 48 49 50
,→51 52 54 57 58 59 60 61 63 66 67 68 69 71 72 74 75 77
,→79 80 81 82 83 84 85 86 87 90 92 93 94 95 98 99 100 101
,→102 103 105 106 108 111 112 113 115 116 117 119]
Test indices: [ 0 4 9 10 11 12 15 18 22 24 26 30 31 36
,→40 42 44 45 47 53 55 56 62 64 65 70 73 76 78 88 89 91
,→96 97 104 107
109 110 114 118]
K-fold CV iteration 2
Train indices: [ 0 1 2 4 9 10 11 12 14 15 18 20 21 22
,→23 24 26 29 30 31 32 36 37 40 41 42 44 45 47 48 51 52
,→53 55 56 57 58 59 60 61 62 63 64 65 70 71 73 74 75 76
,→78 79 81 82 86 87 88 89 91 92 93 96 97 99 101 102 103 104
,→105 106 107 108 109 110 112 114 115 116 118 119]
Test indices: [ 3 5 6 7 8 13 16 17 19 25 27 28 33 34
,→35 38 39 43 46 49 50 54 66 67 68 69 72 77 80 83 84 85
,→90 94 95 98
100 111 113 117]
K-fold CV iteration 3
Train indices: [ 0 3 4 5 6 7 8 9 10 11 12 13 15 16
,→17 18 19 22 24 25 26 27 28 30 31 33 34 35 36 38 39 40
,→42 43 44 45 46 47 49 50 53 54 55 56 62 64 65 66 67 68
,→69 70 72 73 76 77 78 80 83 84 85 88 89 90 91 94 95 96
,→97 98 100 104 107 109 110 111 113 114 117 118]
Test indices: [ 1 2 14 20 21 23 29 32 37 41 48 51 52 57
,→58 59 60 61 63 71 74 75 79 81 82 86 87 92 93 99 101 102 103
,→105 106 108 112 115 116 119]
Recall from Sections 4.4-4.6 that when both splitting and preprocessing are desired,
the data is first split into training and test sets and then preprocessing is applied to
the training set only (and the test set is transformed by statistics obtained on the training
set). This practice ensures that no information from the test set is used in training via
preprocessing. In CV, our training set after the “split” at each iteration is Str,n −Foldk
and the test data is Foldk . Therefore, if a preprocessing is desired, the preprocessing
must be applied to Str,n − Foldk at each iteration separately and Foldk should be
transformed by statistics obtained on Str,n −Foldk . In Chapter 11, we will see how this
could be done in scikit-learn. In the meantime, we have to realize that if we apply the
CV on the training data stored in np.load('data/iris_train_scaled.npz'),
some information of Foldk has already been used in Str,n − Foldk via preprocessing,
which is not legitimate (see Section 4.6 for a discussion on why this is an illegitimate
practice).
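To make this concrete, the following is a minimal sketch of per-fold standardization; it assumes X_train and y_train hold the raw (unscaled) training data. The scaler statistics are computed from Str,n − Foldk only and then used to transform Foldk:

from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.preprocessing import StandardScaler

errors = 0
skf = StratifiedKFold(n_splits=5)
for train_idx, fold_idx in skf.split(X_train, y_train):
    scaler = StandardScaler().fit(X_train[train_idx])     # statistics from Str,n - Fold_k only
    X_tr = scaler.transform(X_train[train_idx])
    X_fold = scaler.transform(X_train[fold_idx])          # Fold_k transformed by those statistics
    y_pred = KNN().fit(X_tr, y_train[train_idx]).predict(X_fold)
    errors += (y_pred != y_train[fold_idx]).sum()         # e_k in (9.2)
print("the 5-fold CV error estimate is: {:.3f}".format(errors/len(y_train)))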
Rather than implementing a for loop as shown above to train and test surrogate
classifiers at each iteration, scikit-learn offers a convenient way to obtain the CV
scores using the cross_val_score function from the sklearn.model_selection mod-
ule. Among the several parameters of this function, the following are especially
useful:
1) estimator is to specify the estimator (e.g., classifier or regressor) used within
CV;
2) X is the training data;
3) y is the values of target variable;
4) cv determines the CV strategy as explained here:
– setting cv to an integer K in a classification problem uses stratified K-fold
CV with no shuffling. To implement the standard K-fold CV for classification, we can
instead set the cv parameter to an object of the KFold class, and if shuffling is required,
the shuffle parameter of KFold should be set to True;
– setting cv to an integer K in a regression problem uses standard K-fold CV
with no shuffling (note that stratified K-fold CV is not defined for regression).
If shuffling is required, cv should be set to an object of KFold with its shuffle
parameter set to True;
5) n_jobs is used to specify the number of CPU cores that can be used in parallel
to compute the CV estimate. Setting n_jobs=-1 uses all processors; and
6) scoring: the default None value of this parameter leads to using the default
metric of the estimator score method; for example, in case of using the es-
timator DecisionTreeRegressor, R̂2 is the default scorer and in case of
DecisionTreeClassifier, the accuracy estimator is the default scorer.
cross_val_score uses a scoring function (the higher, the better) rather than
a loss function (the lower, the better). In this regard, a list of possible scoring
functions is available at (Scikit-metrics, 2023). We will also discuss some
of these scoring functions in detail later in this chapter. Here we use cross_val_score to
implement the standard K-fold CV. Note that the results are the same as those
obtained before.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier as KNN
# X_train, y_train, and the KFold splitter kfold are as defined in the earlier listings
knn = KNN()
cv_scores = cross_val_score(knn, X_train, y_train, cv=kfold)
print("the accuracy of folds are: ", cv_scores)
print("the overall 3-fold CV accuracy is: {:.3f}".format(cv_scores.mean()))
Similar to the KFold class that can be used to create the indices for the training
and test sets at each CV iteration, we also have the StratifiedKFold class from the
sklearn.model_selection module that can be used for creating indices. There-
fore, another way that gives us more control over stratified K-fold CV is to set the
cv parameter of cross_val_score to an object from the StratifiedKFold class.
For example, here we shuffle the data before splitting (similar to Fig. 9.4) and set the
random_state for reproducibility:
from sklearn.model_selection import StratifiedKFold
strkfold = StratifiedKFold(n_splits=K_fold, shuffle=True,
,→random_state=42)
To implement the shuffle and split method, we can use ShuffleSplit class from
sklearn.model_selection module that can be used for creating indices and then
set the cv parameter of cross_val_score to an object from ShuffleSplit class.
Two important parameters of this class are test_size and train_size, which
can be either a floating-point number between 0.0 and 1.0 or an integer. In the former case,
the value of the parameter is interpreted as a proportion of the entire
dataset, and in the latter case, it is the number of observations used.
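For instance, a shuffle-and-split scheme similar to Fig. 9.5 (50% training, 20% test, K = 5 iterations) could be set up as in the following sketch, assuming X_train and y_train are available as before:

from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier as KNN

shsp = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.2, random_state=42)
cv_scores = cross_val_score(KNN(), X_train, y_train, cv=shsp)
print("the shuffle-and-split accuracy estimates are: ", cv_scores)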
Our aim here is to use an evaluation rule to estimate the performance of a trained
classifier. In this section, we will cover different metrics to measure various aspects
of a classifier performance. We will discuss both the probabilistic definition of
these metrics as well as their empirical estimators. Although we distinguish the
probabilistic definition of a metric from its empirical estimator, it is quite common
in the machine learning literature to refer to an empirical estimate as the metric itself. For
example, the sensitivity of a classifier is a probabilistic concept. However, because
in practice we almost always need to estimate it from the data at hand, it is common to
refer to its empirical estimate, per se, as the sensitivity.
Suppose a training set Str,n is available and used to train a binary classifier ψ. In
particular, ψ maps realizations of random feature vector X to realizations of a class
variable Y that takes two values: “P” and “N”, which are short for “positive” and
“negative”, respectively; that is, ψ(X) : R p → {P, N} where p is the dimensionality
of X. Although assigning classes as P and N is generally arbitrary, the classification
emphasis is often geared towards identifying class P instances; that is to say, we
generally label the “atypical” class as positive and the “typical” one as negative. For
example, people with a specific disease (in contrast with healthy people) or spam
emails (in contrast with non-spam emails) are generally allocated to the positive
class.
The mapping produced by ψ(X) from R p to {P, N} has an intermediary step.
Suppose x is a realization of X. The classification is performed by first mapping
x ∈ R p to s(x), which is a score on a univariate continuum such as R. This score
could be, for example, the distance of an observation from the decision hyperplanes,
or the estimated probability that an instance belongs to a specific class. The score
s(x) is then compared to t, which is a specific value of threshold T. We assume scores
are oriented such that if s(x) exceeds t, then x is assigned to P and otherwise to N.
Formally, we write
(
P if s(x) > t ,
ψ(x) = (9.3)
N otherwise .
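As an illustration of (9.3), the following sketch thresholds the scores returned by a classifier's decision_function at a value t of our choosing; the synthetic data and the fitted classifier are purely illustrative, and t = 0 simply reproduces the default behaviour of a linear classifier:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)
t = 0.0                                  # the threshold of our choosing
s = clf.decision_function(X)             # scores s(x) on a univariate continuum
y_hat = np.where(s > t, "P", "N")        # assign to P if s(x) > t, and to N otherwise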
In what follows, we define several probabilities that can help measure differ-
ent aspects of a classifier. To discuss the empirical estimators of these proba-
bilities, we assume the trained classifier ψ is applied to a sample of size m,
Ste,m = {(x1, y1 ), (x2, y2 ), . . . , (xm, ym )} and observations x1, . . . , xm are classified as
9.1 Model Evaluation 251
ŷ1, . . . , ŷm , respectively. We can categorize the outcome of the classifier for each xi
as a true positive (TP), a false positive (FP), a true negative (TN), or a false negative
(FN), which mean:
1) TP: both ŷi and yi are P;
2) FP: ŷi is P but yi is N (so the positive label assigned by the classifier is false);
3) TN: both ŷi and yi are N; and
4) FN: ŷi is N but yi is P (so the negative label assigned by the classifier is false).
Furthermore, the number of true positives, false positives, true negatives, and false
negatives among instances of Ste,m that are classified by ψ are denoted nTP , nFP , nTN ,
and nFN , respectively. The actual number of positives and negatives within Ste,m are
denoted by nP and nN , respectively. Therefore, nP = nTP + nFN , nN = nTN + nFP , and
nTP + nFP + nTN + nFN = nP + nN = m. We can summarize the counts of these states
in a square matrix as follows, which is known as Confusion Matrix (see Fig. 9.6).
               predicted P    predicted N
actual P          nTP             nFN
actual N          nFP             nTN
True positive rate (tpr): the probability that a class P instance is correctly classified;
that is, tpr = P(s(X) > t | X ∈ P). It is also common to refer to tpr as sensitivity (as
in clinical/medical applications) or recall (as in the information retrieval literature). The
empirical estimator of tpr, denoted \widehat{tpr}, is obtained as

\widehat{tpr} = \frac{n_{TP}}{n_P} = \frac{n_{TP}}{n_{TP} + n_{FN}} \,. \qquad (9.4)
False positive rate (fpr): the probability that a class N instance is misclassified;
that is, fpr = P(s(X) > t | X ∈ N). Its empirical estimator, denoted \widehat{fpr}, is

\widehat{fpr} = \frac{n_{FP}}{n_N} = \frac{n_{FP}}{n_{TN} + n_{FP}} \,. \qquad (9.5)
True negative rate (tnr): the probability that a class N instance is correctly clas-
sified; that is, tnr = P(s(X) ≤ t | X ∈ N). In clinical and medical applications, it is
common to refer to tnr as specificity. Its empirical estimator is

\widehat{tnr} = \frac{n_{TN}}{n_N} = \frac{n_{TN}}{n_{TN} + n_{FP}} \,. \qquad (9.6)
False negative rate (fnr): the probability that a class P instance is misclassified;
that is, fnr = P(s(X) ≤ t | X ∈ P). Its empirical estimator is

\widehat{fnr} = \frac{n_{FN}}{n_P} = \frac{n_{FN}}{n_{TP} + n_{FN}} \,. \qquad (9.7)
Although four metrics are defined here, we have tpr + fnr = 1 and fpr + tnr = 1;
therefore, tpr and fpr together summarize all the information in these four metrics.
Positive predictive value (ppv): this metric is defined by switching the roles of
s(X) > t and X ∈ P in the conditional probability used to define tpr; that is to say,
the probability that an instance predicted as class P is truly from that class (i.e., is
correctly classified): ppv = P(X ∈ P | s(X) > t). It is also common to refer to ppv as
precision. The empirical estimator of ppv is

\widehat{ppv} = \frac{n_{TP}}{n_{TP} + n_{FP}} \,. \qquad (9.8)
False discovery rate (fdr)1: this metric is defined as the probability that an instance
predicted as class P is truly from class N (i.e., is misclassified): fdr = P(X ∈ N | s(X) > t) = 1 − ppv.
The empirical estimator of fdr is

\widehat{fdr} = \frac{n_{FP}}{n_{TP} + n_{FP}} \,. \qquad (9.9)
F1 score: to summarize the information of recall and precision using one metric,
we can use F1 measure (also known as F1 score), which is given by the harmonic
mean of recall and precision:
2 2 recall × precision
F1 = . (9.10)
1
recall + 1
precision
recall + precision
As can be seen from (9.10), if either precision or recall is low, F1 will be
low. The plug-in estimator of the F1 score is obtained by replacing recall and precision
in (9.10) with their estimates obtained from (9.4) and (9.8), respectively.
Error rate: As seen in Section 4.9, the error rate, denoted ε, is defined as the
probability of misclassification by the trained classifier; that is to say,
\varepsilon = P\big(\psi(X) \neq Y\big) \,. \qquad (9.11)
1 Inspired by (Sorić, 1989), in a seminal paper (Benjamini and Hochberg, 1995), Benjamini and
Hochberg coined the term “false discovery rate”, and they defined that as the expectation of (9.9).
The use of fdr here can be interpreted similar to the definition of (tail area) false discovery rate by
Efron in (Efron, 2007).
However, because misclassification can occur for instances of both the P and N classes, ε
can be equivalently written as

\varepsilon = fpr \times P(Y = \text{N}) + fnr \times P(Y = \text{P}) \,, \qquad (9.12)

where P(Y = i) is the prior probability of class i ∈ {P, N}. The empirical estimate of
ε, denoted ε̂, is obtained by using the empirical estimates of fpr and fnr given in (9.5)
and (9.7), respectively, as well as the empirical estimates of P(Y = P) and P(Y = N),
which under a random sampling assumption are nP/m and nN/m, respectively; that is,

\hat{\varepsilon} = \frac{n_{FP}}{n_N} \times \frac{n_N}{m} + \frac{n_{FN}}{n_P} \times \frac{n_P}{m} = \frac{n_{FP} + n_{FN}}{m} \equiv \frac{n_{FP} + n_{FN}}{n_{FP} + n_{FN} + n_{TP} + n_{TN}} \,, \qquad (9.13)
which is simply the proportion of misclassified observations (cf. Section 4.9). An
equivalent simple form that writes ε̂ directly as a function of y_j and ŷ_j is

\hat{\varepsilon} = \frac{1}{m} \sum_{j=1}^{m} I_{\{y_j \neq \hat{y}_j\}} \,, \qquad (9.14)
Replacing tnr, tpr, P(Y = N), and P(Y = P) by their empirical estimates yields the
accuracy estimate, denoted \widehat{acc}, given by

\widehat{acc} = \frac{n_{TN}}{n_N} \times \frac{n_N}{m} + \frac{n_{TP}}{n_P} \times \frac{n_P}{m} = \frac{n_{TN} + n_{TP}}{m} \equiv \frac{n_{TP} + n_{TN}}{n_{FP} + n_{FN} + n_{TP} + n_{TN}} \,, \qquad (9.17)

which is simply the proportion of correctly classified observations. This can be
equivalently written as

\widehat{acc} = \frac{1}{m} \sum_{j=1}^{m} I_{\{y_j = \hat{y}_j\}} \,. \qquad (9.18)
When classes are imbalanced, in particular, it is important to consider other metrics
of performance along with the accuracy. The following example demonstrates the
situation.
Example 9.1 Consider the following scenario in which we have three classifiers
to classify whether the topic of a given news article is politics (P) or not (N).
We train three classifiers, namely, Classifier 1, Classifier 2, and Classifier 3, and
evaluate them on a test sample Ste,100 that contains 5 political and 95 non-political
articles. The results of this evaluation for these classifiers are summarized in what follows.
Which of the three classifiers should we prefer? To answer this question, we first look into the accuracy estimate. However, before
doing so, we first obtain all elements of the confusion matrix for each classifier.
Since nP = nTP + nFN and nN = nTN + nFP, from the given information we have:
\widehat{acc}_{\text{Classifier 1}} = \frac{0 + 95}{0 + 95 + 0 + 5} = 0.95 \,,

\widehat{acc}_{\text{Classifier 2}} = \frac{4 + 91}{4 + 91 + 4 + 1} = 0.95 \,,

\widehat{acc}_{\text{Classifier 3}} = \frac{1 + 94}{1 + 94 + 1 + 4} = 0.95 \,.
In terms of accuracy, all classifiers show a high accuracy estimate of 95%. How-
ever, Classifier 1 seems quite worthless in practice as it classifies any news article to
N class! On the other hand, Classifier 2 seems quite effective because it can retrieve
four out of the five P instances in Ste,100, and in doing so it is fairly precise because
among the eight articles that it labels as P, four are actually correct (nTP = 4
and nFP = 4). As a result, it is inappropriate to judge the effectiveness of these
classifiers based on the accuracy alone. Next we examine the (empirical estimates
of) recall (tpr) and precision (ppv) for these classifiers:
\widehat{tpr}_{\text{Classifier 1}} = \frac{0}{0 + 5} = 0 \,,

\widehat{ppv}_{\text{Classifier 1}} = \frac{0}{0 + 0} \quad \text{(undefined)} \,,

\widehat{tpr}_{\text{Classifier 2}} = \frac{4}{4 + 1} = 0.8 \,,

\widehat{ppv}_{\text{Classifier 2}} = \frac{4}{4 + 4} = 0.5 \,,

\widehat{tpr}_{\text{Classifier 3}} = \frac{1}{1 + 4} = 0.2 \,,

\widehat{ppv}_{\text{Classifier 3}} = \frac{1}{1 + 0} = 1 \,.
As seen here, in estimating the ppv of Classifier 1, we encounter division by 0. In
these situations, one may dismiss the worth of the classifier or, if desired, define the
estimate as 0 in order to possibly calculate other metrics of performance that are
based on this estimate (e.g., F1 score). Nonetheless, here we perceive Classifier 1
as worthless and focus on comparing Classifier 2 and Classifier 3. As can be seen,
Classifier 2 has a higher estimated tpr than Classifier 3 but its estimated ppv is lower. How can we decide
which classifier to use in practice based on these two metrics?
Depending on the application, sometimes we care more about precision and
sometimes we care more about the recall. For example, if we have a cancer diagnostic
classification problem, we would prefer having a higher recall because it means
having a classifier that can better detect individuals with cancer—the cost of not
detecting a cancerous patient could be death. However, suppose we would like to
train a classifier that classifies an email as spam (positive) or not (negative) and based
on the assigned label it directs the email to the spam folder, which is not often seen,
or to the inbox. In this application, it would be better to have a higher precision so
that ideally none of our important emails is classified as spam and left unseen in the
spam folder. If we have no preference between recall and precision, we can use the
F1 score estimate to summarize both metrics in one. Using (9.10), the F1 score estimates for Classifier 2
and Classifier 3 are 2 × 0.5 × 0.8/(0.5 + 0.8) ≈ 0.62 and 2 × 1 × 0.2/(1 + 0.2) ≈ 0.33, respectively,
which again points to Classifier 2 as the more effective of the two.
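These values can also be verified with scikit-learn's f1_score; the label vectors below are reconstructed from the confusion-matrix counts of Classifier 2 in this example (a small sketch):

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ['P']*5 + ['N']*95                          # 5 political, 95 non-political articles
y_pred = ['P']*4 + ['N']*1 + ['P']*4 + ['N']*91      # Classifier 2: nTP=4, nFN=1, nFP=4, nTN=91
print(recall_score(y_true, y_pred, pos_label='P'),
      precision_score(y_true, y_pred, pos_label='P'),
      f1_score(y_true, y_pred, pos_label='P'))       # 0.8, 0.5, and approximately 0.615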
Area Under the Receiver Operating Characteristic Curve (ROC AUC): All the
aforementioned metrics of performance depend on t, which is the specific value
of threshold used in a classifier. However, depending on the context that a trained
classifier will be used in the future, we may decide to change this threshold at the final
stage of classification, which is a fairly simple adjustment, for example, to balance
recall and precision depending on the context. Then the question is whether we can
evaluate a classifier over the entire range of the threshold used in its structure rather
than a specific value of the threshold. To do so we can evaluate the classifier by the
Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve
(ROC AUC).
ROC curve is obtained by varying t over its entire range and plotting pairs of
( f pr, t pr) where f pr and t pr are values on the horizontal and vertical axes for
each t, respectively. Recall that given t pr and f pr, we also have f nr and tnr, which
means ROC curve is in fact a concise way to summarize the information in these four
metrics. A fundamental property of the ROC curve is that it is a monotonically increasing
function from (0, 0) to (1, 1). This is because as t → ∞, both P(s(X) > t | X ∈ P) and
P(s(X) > t | X ∈ N) approach zero. At the same time, these probabilities naturally
increase for a decreasing t. In the extreme case, when t → −∞, both of these
probabilities approach one.
Let p(s(X) | X ∈ P) and p(s(X) | X ∈ N) denote the probability density functions
(pdfs) of the scores given by a classifier to observations of class P and N, respectively.
Therefore, tpr and fpr, which are indeed functions of t, can be written as

tpr = \int_{t}^{\infty} p\big(s(X) \mid X \in \text{P}\big)\, ds(X) \,,

fpr = \int_{t}^{\infty} p\big(s(X) \mid X \in \text{N}\big)\, ds(X) \,.
Intuitively, a classification rule that produces larger scores for the P class than for the N
class is more effective than one producing similar scores for both classes. In other
words, the more p(s(X) | X ∈ P) and p(s(X) | X ∈ N) differ, the less likely the classifier is,
in general, to misclassify observations. In the extreme case where p(s(X) | X ∈ P) =
p(s(X) | X ∈ N), then P(s(X) > t | X ∈ P) = P(s(X) > t | X ∈ N), ∀t. This situation
corresponds to random classification and the ROC curve becomes the straight line
connecting (0, 0) to (1, 1) (the solid line in Fig. 9.7). In this case, the area under the
ROC curve (ROC AUC) is 0.5.
Now suppose there is a t0 such that P(s(X) > t0 | X ∈ P) = 1 and P(s(X) > t0 | X ∈ N) = 0.
This means that the entire masses of the p(s(X) | X ∈ P) and p(s(X) | X ∈ N)
pdfs are on the (t0, ∞) and (−∞, t0] intervals, respectively. This can also be interpreted
as perfect classification. This situation leads to the point (0, 1) on the ROC curve.
At the same time, it is readily seen that as t increases in (t0, ∞), fpr remains 0 but
tpr decreases until it approaches 0. On the other hand, for decreasing t in the range
(−∞, t0], tpr remains 1 while fpr increases until it approaches 1. This is the dashed
ROC curve in Fig. 9.7. In this case, the ROC AUC is 1.
To estimate the ROC curve, we can first rank the observations in a sample based on
their scores given by the classifier, then change the threshold of the classifier from
∞ to −∞, and plot the resulting curve of estimated tpr against estimated fpr. Once the ROC curve is generated, the
ROC AUC is simply the area under this curve. It can be shown that the ROC
AUC is equivalent to the probability that a randomly chosen member of the positive
class is ranked higher (i.e., receives a larger score) than a randomly chosen member of the
negative class. This probabilistic interpretation of the ROC AUC leads to an efficient
way to estimate it (the estimate is denoted \widehat{auc}) as follows (Hand and Till, 2001):
Fig. 9.7: The ROC curve (tpr versus fpr) for random (solid) and perfect (dashed) classification;
the points t → ∞, t = t0, and t → −∞ are marked along the curves.
\widehat{auc} = \frac{R_N - 0.5\, n_N (n_N + 1)}{n_N\, n_P} \,, \qquad (9.19)

where R_N is the sum of the ranks of the observations from the N class. Given nP and nN, the
maximum of \widehat{auc} is achieved for the maximum R_N, which happens when the maximum
score given to any observation from the N class is lower than the minimum score
given to any observation from the P class. In other words, there exists a threshold
t that “perfectly” discriminates instances of class P from class N. In this case,
R_N = \sum_{i=1}^{n_N} (n_P + i) = n_P n_N + 0.5\, n_N (n_N + 1), which leads to \widehat{auc} = 1.
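The following sketch applies (9.19) to some arbitrary (hypothetical) scores and compares the result with scikit-learn's roc_auc_score; ranks are assigned so that the largest score receives rank 1, the convention under which the perfect-separation case above yields R_N = Σ(n_P + i):

import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])             # 1 = P, 0 = N (illustrative labels)
scores = np.array([0.9, 0.3, 0.6, 0.4, 0.7, 0.2, 0.8, 0.5])
ranks = rankdata(-scores)                                # rank 1 is given to the largest score
R_N = ranks[y_true == 0].sum()                           # sum of ranks of the N-class observations
n_P, n_N = (y_true == 1).sum(), (y_true == 0).sum()
auc_hat = (R_N - 0.5*n_N*(n_N + 1)) / (n_N*n_P)
print(auc_hat, roc_auc_score(y_true, scores))            # both print 0.8125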
Example 9.2 The following table shows the scores given to 10 observations by a
classifier. The table also shows the true class of each observation and its rank
based on the given scores. We would like to generate the ROC curve and also find
\widehat{auc}.
The ROC curve is shown in Fig. 9.8. As we lower the threshold, we mark
some of the thresholds that lead to the points on the curve.
[Fig. 9.8: the estimated ROC curve (estimated tpr versus estimated fpr), with the thresholds
t = 0.95, 0.80, 0.65, and 0.30 marked along the curve between t → ∞ and t → −∞.]
Micro averaging: in this strategy, the per-class counts are first pooled over the c classes and
the metric is then computed from the pooled counts; for example,

\widehat{tpr}_{micro} = \frac{\sum_{i=0}^{c-1} n_{TP_i}}{\sum_{i=0}^{c-1} (n_{TP_i} + n_{FN_i})} = \frac{\sum_{i=0}^{c-1} n_{TP_i}}{\sum_{i=0}^{c-1} m_i} \,. \qquad (9.20)
Macro averaging: In this strategy, the metric is calculated for each binary clas-
sification problem and the results are then combined by

\widehat{tpr}_{macro} = \frac{1}{c} \sum_{i=0}^{c-1} \widehat{tpr}_i \,, \qquad (9.21)
Weighted averaging: in this strategy, each class-specific estimate is weighted by the
proportion of instances of that class; that is,

\widehat{tpr}_{weighted} = \sum_{i=0}^{c-1} \frac{m_i}{\sum_{j=0}^{c-1} m_j}\, \widehat{tpr}_i = \sum_{i=0}^{c-1} \frac{n_{TP_i} + n_{FN_i}}{\sum_{j=0}^{c-1} (n_{TP_j} + n_{FN_j})}\, \widehat{tpr}_i \,. \qquad (9.22)
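These three averaging strategies correspond to the average options of scikit-learn's classification metric functions (discussed next); a small sketch with hypothetical labels for a 3-class problem is shown below. Note that, by (9.20) and (9.22), the micro and weighted averages of recall always coincide:

from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]     # hypothetical actual labels
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 2]     # hypothetical predicted labels
for avg in ['micro', 'macro', 'weighted']:
    print(avg, recall_score(y_true, y_pred, average=avg))   # 0.7, 0.75, 0.7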
Other metrics such as recall, precision, and F1 score are defined for the positive
class (in case of multiclass, as mentioned before, OvR can be used to successively
treat each class as positive). Therefore, to use their scikit-learn classes, we should
pass actual labels and predicted labels, and can pass the label of the positive class by
pos_label parameter (in case of binary classification, pos_label=1 by default):
• For recall: use recall_score(y_true, y_pred)
• For precision: use precision_score(y_true, y_pred)
• For the F1 score: use f1_score(y_true, y_pred)
For multiclass classification, recall_score, precision_score, and f1_score
support micro, macro, and weighted averaging. To choose a specific averaging, the
average parameter can be set to either 'micro', 'macro', or 'weighted'.
Last but not least, to compute the ROC AUC, we need the actual labels and the
given scores. Classifiers in scikit-learn have either a decision_function method,
a predict_proba method, or both. The decision_function returns the scores given to obser-
vations; for example, for linear classifiers the score of an observation is proportional
to the signed distance of that observation from the decision boundary, which is a
hyperplane. As a result, the output is just a 1-dimensional array. On the other hand,
the predict_proba method is available when the classifier can estimate the probability
of a given observation belonging to each class. Therefore, the output is a 2-dimensional
array of size sample size × number of classes, where the sum of each row over the columns
is naturally 1. For the ROC AUC calculation, we can use:
• roc_auc_score(y_true, y_score)
y_score: for binary classification, this could be the output of decision_function
or predict_proba. In the multiclass case, it should be a probability array of shape
sample size × number of classes; therefore, it can only be the output of
predict_proba.
For multiclass classification, roc_auc_score supports macro, weighted, and
micro averaging (the latter in scikit-learn version 1.2.0). This is done by setting the
average to either 'macro', 'weighted', or 'micro'.
Example 9.3 Suppose we have the following actual and predicted labels identified
by y_true and y_pred, respectively,
y_true = [2, 3, 3, 3, 2, 2, 2]
y_pred = [3, 2, 2, 3, 2, 2, 2]
where label 2 is the positive class. In that case, nTP = 3, nFP = 2 and, from (9.8),
precision = 0.6. We can compute this metric as follows (note that we explicitly set
pos_label=2; otherwise, as this is a binary classification problem, by default it uses
pos_label=1, which is not a valid label here and leads to an error):
from sklearn.metrics import precision_score
y_true = [2, 3, 3, 3, 2, 2, 2]
y_pred = [3, 2, 2, 3, 2, 2, 2]
precision_score(y_true, y_pred, pos_label=2)
0.6
Example 9.4 In the following example, we estimate several metrics in the Iris clas-
sification problem both for the macro averaging and for each class individually (i.e.,
when each class is considered as the positive class):
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as LRR
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_train, X_test, y_train, y_test= train_test_split(iris.data, iris.
,→target, random_state=100, test_size=0.2, stratify=iris.target)
lrr = LRR(C=0.1)
y_test_pred = lrr.fit(X_train, y_train).predict(X_test)
np.set_printoptions(precision=3)
print("Class-specific Recall = {}".format(recall_score(y_test,
,→y_test_pred, average=None)))
print("Class-specific Precision = {}".format(precision_score(y_test,
,→y_test_pred, average=None)))
262 9 Model Evaluation and Selection
print(classification_report(y_test, y_test_pred)) #
,→classification_report is used to print them out nicely
X_train_shape: (120, 4)
X_test_shape: (30, 4)
y_train_shape: (120,)
y_test_shape: (30,)
Accuracy = 0.967
Confusion Matrix is
[[10 0 0]
[ 0 10 0]
[ 0 1 9]]
Macro Average Recall = 0.967
Macro Average Precision = 0.970
Macro Average F1 score = 0.967
Macro Average ROC AUC = 0.997
Class-specific Recall = [1. 1. 0.9]
Class-specific Precision = [1. 0.909 1. ]
Class-specific F1 score = [1. 0.952 0.947]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.91      1.00      0.95        10
           2       1.00      0.90      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
lrr = LRR(C=0.1)
cv_scores = cross_val_score(lrr, X_train, y_train, cv=strkfold,
,→scoring='roc_auc_ovr')
cv_scores
Alternatively, we can define a callable with the signature arbitrary_name(estimator, X, y)
that returns a dictionary containing multiple scores, and pass it through the
scoring parameter of cross_validate (see the following example).
Example 9.5 Here we would like to use cross-validation to evaluate and record the
performance of logistic regression in classifying two classes in Iris classification
dataset. As the performance metrics, we use the accuracy, the ROC AUC, and the
confusion matrix.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression as LRR
iris = datasets.load_iris()
X_train = iris.data[iris.target!=0,:]
y_train = iris.target[iris.target!=0]
print('X_train_shape: ' + str(X_train.shape) + '\ny_train_shape: ' +
,→str(y_train.shape) + '\n')
lrr = LRR()
strkfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
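# The custom scoring function passed to cross_validate below was defined earlier
# in the original listing; a possible (sketch) definition, consistent with the
# scoring-callable signature described above and with the output shown after
# this example, is:
def accuracy_roc_confusion_matrix(estimator, X, y):
    y_pred = estimator.predict(X)
    pos = estimator.classes_[-1]                    # treat the last class as positive
    y_score = estimator.predict_proba(X)[:, -1]     # estimated probability of that class
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    return {'accuracy': accuracy_score(y, y_pred),
            'roc_auc': roc_auc_score(y == pos, y_score),
            'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp}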
cv_scores = cross_validate(lrr, X_train, y_train, cv=strkfold,
,→scoring=accuracy_roc_confusion_matrix)
df = pd.DataFrame(cv_scores)
display(df)
print("the overall 5-fold CV accuracy is: {:.3f}".
,→format(cv_scores['test_accuracy'].mean()))
print("the overall 5-fold CV roc auc is: {:.3f}".
,→format(cv_scores['test_roc_auc'].mean()))
X_train_shape: (100, 4)
y_train_shape: (100,)
test_fn test_tp
0 0 10
1 0 10
2 1 9
3 1 9
4 0 10
MSE = E_{Y,X}\big[\big(Y - f(X)\big)^2\big] \,, \qquad (9.23)

where the expectation is with respect to the joint distribution of Y and X. The
empirical estimate of MSE, denoted \widehat{MSE}, is given by

\widehat{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 \,. \qquad (9.24)
The summation is also known as the residual sum of squares (RSS), which we have
seen previously in Section 5.2.2; that is,

RSS = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 \,. \qquad (9.25)
MAE = E_{Y,X}\big[\,\big|Y - f(X)\big|\,\big] \,. \qquad (9.26)

The estimate of MAE, denoted \widehat{MAE}, is obtained as

\widehat{MAE} = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i| \,. \qquad (9.27)
R^2 = \frac{E\big[\big(Y - E[Y]\big)^2\big] - E_{Y,X}\big[\big(Y - f(X)\big)^2\big]}{E\big[\big(Y - E[Y]\big)^2\big]} \,. \qquad (9.28)

The empirical estimate of R^2, denoted \hat{R}^2, is given by

\hat{R}^2 = 1 - \frac{RSS}{TSS} \,, \qquad (9.29)

where RSS is given in (9.25) and

TSS = \sum_{i=1}^{m} (y_i - \bar{y})^2 \,, \qquad (9.30)

where \bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i. Some points to remember:
• R2 is a scoring function (the higher, the better), while MSE and MAE are loss
functions (the lower, the better).
• R2 is a relative measure, while MSE and MAE are absolute measures; that is,
given a value of MSE, it is not possible to judge whether the model has a good
performance or not unless extra information on the scale of the target values is
given. This is also seen from (9.24): if we change the scale of the target values and
their estimates by multiplying y_i and ŷ_i by a constant c, \widehat{MSE} is multiplied by
c². In contrast, this change of scale has no impact on \hat{R}^2.
Scikit-learn implementation of evaluation metrics of regressors: Similar to clas-
sification metrics, regression metrics are also defined in sklearn.metrics module:
• For \hat{R}^2: use r2_score(y_true, y_pred)
• For \widehat{MSE}: use mean_squared_error(y_true, y_pred)
• For \widehat{MAE}: use mean_absolute_error(y_true, y_pred)
Similar to classification metrics, we can estimate these metrics with cross-
validation by setting the scoring parameter of cross_val_score to the corre-
sponding string that can be found from SCORERS.keys() (also listed at (Scikit-
metrics, 2023)). In this regard, an important point to remember is that cross-
validation implementation in scikit-learn accepts a scoring function, and not a
loss function. However, as we said, MSE and MAE are loss functions. That is the
reason that legitimate “scorers” retrieved by SCORERS.keys() include, for exam-
ple, 'neg_mean_squared_error' and 'neg_mean_absolute_error'. Setting
the scoring parameter of cross_val_score to 'neg_mean_squared_error'
internally multiplies the MSE by -1 to obtain a score (the higher, the better). There-
fore, if needed, we can use 'neg_mean_squared_error' and then multiply the
returned cross-validation score by -1 to obtain the MSE loss, which is non-negative.
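For example, a sketch of estimating the MSE of a regressor with 10-fold CV (the regressor, data, and fold count here are illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
neg_mse_scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                                 cv=10, scoring='neg_mean_squared_error')
mse_cv = -neg_mse_scores.mean()   # multiply by -1 to recover the (non-negative) MSE loss
print("the 10-fold CV MSE estimate is: {:.3f}".format(mse_cv))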
The model selection stage is a crucial step towards constructing successful predictive
models. Oftentimes, practitioners are given a dataset and are asked to construct a
classifier or regressor. In such a situation, the practitioner has the option to choose
from a plethora of models and algorithms. The practices that the practitioner follows
to select one final trained model that can be used in practice from such a vast number
of candidates are categorized under model selection.
From the perspective of the three-stage design process discussed in Section 1.1.6,
the model selection stage includes the initial candidate set selection (Step 1) and
selecting the final mapping (Step 3). For example, it is a common practice to initially
narrow down the number of plausible models based on
• practitioner’s prior experience with different models and their sample size-
dimensionality requirements;
• the nature of data at hand (for example, having time-series, images, or other
types); and
• if available, based on prior knowledge about the field of study from which the
data has been collected. This stage ipso facto is part of model selection, albeit
based on prior knowledge.
After this stage, the practitioner is usually faced with one or a few learning
algorithms. However, each of these learning algorithms generally has one or a few
hyperparameters that should be estimated. We have already seen hyperparameters
in the previous sections. These are parameters that are not estimated/given by the
learning algorithm using which a classifier/regressor is trained but, at the same time,
can significantly change the performance of the predictive model. Examples include k in
kNN (Chapter 5); C or ν in various forms of regularized logistic regression (Section
6.2.2); the minimum number of samples per leaf, the maximum depth, or even the choice of impurity
criterion in CART (Sections 7.2.2 and 7.2.4); and the number of estimators or the choice
of base estimator and its hyperparameters in AdaBoost (Section 8.6.1).
Training a final model involves using a learning algorithm as well as estimating its
hyperparameters—a process, which is known as hyperparameter tuning and is part
of model selection. Two common data-driven techniques for model selection are grid
search and random search.
Suppose after initial inspection (Step 1 of the design process), T learning al-
gorithms, denoted Ψi, i = 1, . . . , T, have been chosen. For each Ψi , we specify a
discrete hyperparameter space HΨDi . Depending on the estimation rules that are used
to estimate the metric of performance, there are two types of grid search that are
commonly used in practice:
• grid search using validation set
• grid search using cross-validation
Hereafter, we denote a predictive model (classifier/regressor) constructed by ap-
plying the learning algorithm Ψi with a specific set of hyperparameters θ ∈ HΨDi on
a training sample S1 by ψθ, i (S1 ). Furthermore, let ηψθ, i (S1 ) (S2 ) denote the value of a
scoring (the higher, the better) metric η obtained by evaluating ψθ, i (S1 ) on a sample
S2 .
Grid search using validation set: In our discussions in the previous chapters, we
have generally split a given sample into a training and a test set to train and evaluate
an estimator, respectively. To take into account the model selection stage, a common
strategy is to split the data into three sets known as training, validation, and test sets.
The extra set, namely, the validation set, is then used for model selection. In this
regard, one can first split the entire data into two smaller sets, for example, with a
size of 75% and 25% of the given sample, and use the smaller set as test set. The
larger set (i.e., the set of size 75%) is then split into a “training set” and a “validation
set” (we can use again the same ratio of 75% and 25% to generate the training and
validation sets).
Let Str , Sva , and Ste denote the training, validation, and test sets, respectively.
Furthermore, let Ψθ,va∗ denote the optimal learning algorithm-hyperparameter values
estimated using the grid search on the validation set. We have

\Psi^{*}_{\theta,va} = \underset{\Psi_i,\; i=1,2,\ldots,T}{\operatorname{argmax}}\; \eta_{\psi_{\theta_i^*,\, i}(S_{tr})}(S_{va}) \,, \qquad (9.31)

where

\theta_i^* = \underset{\theta \in H^{D}_{\Psi_i}}{\operatorname{argmax}}\; \eta_{\psi_{\theta,\, i}(S_{tr})}(S_{va}) \,, \quad i = 1, 2, \ldots, T \,. \qquad (9.32)
Although (9.31) and (9.32) can be combined in one equation, we have divided
them into two equations to better describe the underlying selection process. Let
us elaborate a bit on these equations. In (9.32), θ i∗ , which is the estimate of the
optimal values of hyperparameter for each learning algorithm Ψi, i = 1, 2, . . . , T, is
obtained. In this regard, for each θ ∈ HΨDi , Ψi is trained on Str to construct the
classifier ψθ, i (Str ). Then this model is evaluated on the validation set Sva to obtain
ηψθ, i (St r ) (Sva ). To estimate the optimal values of hyperparameters for Ψi , we then
compare all scores obtained for each θ ∈ HΨDi . At the end of (9.32), we have T
sets of hyperparameter values θ ∗i (one for each Ψi ). The optimal learning algorithm-
hyperparameter values is then obtained by (9.31) where the best scores of different
learning algorithms are compared and the one with maximum score is chosen. There
is no need to compute any new score in (9.31) because all required scores used in
the comparison have been already computed in (9.32).
Once Ψ*θ,va is determined from (9.31), we can apply it to Str to train our “final”
model. Nevertheless, this model has already been trained and used in (9.31), which
means that we can just pick it if it has been stored. Alternatively, we can combine Str
and Sva to build a larger set (Str ∪ Sva) that is used to train one final model. In
this approach, the final model is trained by applying the estimated optimal learning
algorithm-hyperparameter values Ψ*θ,va to Str ∪ Sva. There is nothing wrong with
this strategy (as a matter of fact, this approach more closely resembles the grid search
using cross-validation that we will see next); we are simply trying to utilize the full power of the
available data in training. With a “sufficient” sample size and well-behaved learning
algorithms, we should not generally see a worse performance.
Once the final model is selected/trained, it can be applied on Ste for evaluation.
Some remarks about this grid search on validation set:
• Sva simulates the effect of having test data within Str ∪ Sva. This means that
if a preprocessing stage is desired in the process of model selection, it should
be applied to Str, and Sva (similar to Ste) should be normalized using statistics
found based on Str. If we want to train the final model on Str ∪ Sva, another
normalization should be applied to Str ∪ Sva, and then Ste should be normalized
based on statistics found on Str ∪ Sva.
• One potential problem associated with grid search using a validation set is that the
estimate of the optimal learning algorithm-hyperparameter values obtained from
(9.31) could be biased towards the single sample chosen as the validation set. Therefore,
if major probabilistic differences exist between this single validation set and the
test data, the final constructed model may not generalize well and may show poor
performance on unseen data. This problem is naturally more pronounced in small-sample
settings, where the model selection could become completely biased towards the
specific observations that are in the validation set. Therefore, this method of model
selection is generally used in large-sample settings.
Grid search using cross-validation: In grid search using cross-validation, we look
for the combination of the learning algorithm-hyperparameter values that leads to
the highest score evaluated using cross-validation on Str . For a K-fold CV, this is
formally represented as:
\Psi^{*}_{\theta,cv} = \underset{\Psi_i,\; i=1,2,\ldots,T}{\operatorname{argmax}}\; \eta^{K\text{-}fold}_{\psi_{\theta_i^*,\, i}} \,, \qquad (9.33)

where

\theta_i^* = \underset{\theta \in H^{D}_{\Psi_i}}{\operatorname{argmax}}\; \eta^{K\text{-}fold}_{\psi_{\theta,\, i}} \,, \quad i = 1, 2, \ldots, T \,, \qquad (9.34)

and where

\eta^{K\text{-}fold}_{\psi_{\theta,\, i}} = \frac{1}{K} \sum_{k=1}^{K} \eta_{\psi_{\theta,\, i}(S_{tr} - Fold_k)}(Fold_k) \,, \qquad (9.35)

in which Fold_k denotes the set of observations that are in the k-th fold of CV. Once
Ψ*θ,cv is determined, a final model is then trained by applying the estimated optimal learning
algorithm-hyperparameter values Ψ*θ,cv to Str. Once this final model is trained, it
can be applied to Ste for evaluation.
Scikit-learn implementation of grid search model selection: To implement
the grid search using cross-validation, we can naturally write a code that goes
over the search space and use cross_val_score to compute the score for
each point in the grid. Nevertheless, scikit-learn has a convenient way to imple-
ment the grid search. For this purpose, we can use GridSearchCV class from
sklearn.model_selection module. A few important parameters of this class
are:
1) estimator: this is the object from the estimator class;
2) param_grid: this could be either a dictionary to define a grid or a list of
dictionaries to define multiple grids. Either way, the keys in a dictionary should
be parameters of the estimator that are used in model selection. The possible
values of each key used in the search are passed as a list of values for that key.
For example, the following code defines two grids to search for hyperparameter
tuning in logistic regression:
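A sketch of two such grids is shown below; the specific values are illustrative (note that the elastic-net penalty requires the 'saga' solver):

grids = [{'penalty': ['l2'], 'C': [0.01, 0.1, 1, 10, 100]},
         {'penalty': ['elasticnet'], 'C': [0.01, 0.1, 1, 10, 100],
          'l1_ratio': [0.2, 0.5, 0.8], 'solver': ['saga']}]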
If multiple metrics are used in scoring, the refit parameter should be set to a string that identifies
the metric used for refitting.
Once an instance of class GridSearchCV is created, we can treat it as a regular
classifier/regressor as it implements fit, predict, score and several other meth-
ods. Here fit trains an estimator for each given combination of hyperparameters.
The predict and score methods can be used for prediction and evaluation using
the best found estimator, respectively. Nevertheless, an object from GridSearchCV
class has also some important attributes such as:
1) cv_results_: a dictionary containing the scores obtained on each fold for
each hyperparameter combination, as well as some other information regarding
training (e.g., training time);
2) best_params_: the combination of hyperparameters that led to the highest
score;
3) best_score_: the highest score obtained on the given grids; and
4) best_estimator_: not available if refit=False; otherwise, this attribute
gives access to the estimator trained on the entire dataset using the best combi-
nation of hyperparameters—the predict and score methods of GridSearchCV
use this estimator as well.
Example 9.7 Let us consider the Iris classification dataset again. We wish to use
GridSearchCV class to tune hyperparameters of logistic regression over a hyper-
parameter space defined using the same grids as specified in the previous page. We
monitor both the accuracy and the weighted ROC AUC during the cross-validation
but we use the accuracy for hyperparameter tuning. Once the hyperparameter tuning
is completed, we also find the test accuracy (accuracy on the test data) of the best
model trained on the entire training data using the hyperparameter combination that
led to the highest CV accuracy. As for the cross-validation, we use stratified 3-fold
CV with shuffling of the data before splitting (all these tuning, training, and testing steps
can be implemented in a few lines of code using scikit-learn!)
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression as LRR
from sklearn.model_selection import GridSearchCV, StratifiedKFold,
,→train_test_split
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_train, X_test, y_train, y_test= train_test_split(iris.data, iris.
,→target, random_state=100, test_size=0.2, stratify=iris.target)
lrr = LRR(max_iter=2000)
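# The middle portion of this listing (the grids, the CV splitter, and the grid
# search itself) is not shown above; a sketch consistent with the description of
# this example (the grids are as defined earlier, and random_state=42 is an
# assumption made for reproducibility) is:
strkfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
gscv = GridSearchCV(estimator=lrr, param_grid=grids, cv=strkfold,
                    scoring=['accuracy', 'roc_auc_ovr_weighted'],
                    refit='accuracy', n_jobs=-1)
gscv.fit(X_train, y_train)
score_best_estimator = gscv.score(X_test, y_test)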
print('the accuracy of the best estimator on the test data is: {:.3f}'.
,→format(score_best_estimator))
df = pd.DataFrame(gscv.cv_results_)
df
X_train_shape: (120, 4)
X_test_shape: (30, 4)
y_train_shape: (120,)
y_test_shape: (30,)
To implement the grid search using a validation set, we can first use the function
train_test_split from sklearn.model_selection to set aside the test set,
if desired, and then use it once again to set aside the validation set. The indices
of the training and validation sets can then be created by using a cross-validation
generator such as the PredefinedSplit class from sklearn.model_selection
and used as the cv parameter of GridSearchCV, which accepts generators. To use
PredefinedSplit, we can set the indices of the validation set to an integer such as 0, but
the indices of the training set should be set to -1 to be excluded from the validation set
(see the following example).
Example 9.8 In this example, we use the same data, classifier, and hyperparameter
search space as in Example 9.7; however, instead of grid search cross-validation, we
use validation set.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression as LRR
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split, PredefinedSplit
iris = datasets.load_iris()
X, X_test, y, y_test= train_test_split(iris.data, iris.target,
,→random_state=100, stratify=iris.target)
indecis = np.arange(len(X))
X_train, X_val, y_train, y_val, indecis_train, indecis_val=
,→train_test_split(X, y, indecis, random_state=100, stratify=y)
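# The remainder of this listing is not shown above; a sketch of how the
# validation-set split and the search could be set up, consistent with the
# printed output that follows (grids is assumed to be defined as in Example 9.7), is:
lrr = LRR(max_iter=2000)
test_fold = np.full(len(X), -1)     # -1: the observation is only used for training
test_fold[indecis_val] = 0          # 0: the observation belongs to the validation set
print(test_fold)
ps = PredefinedSplit(test_fold)
print("the split used as training and validation sets:")
print(next(ps.split()))
gscv = GridSearchCV(estimator=lrr, param_grid=grids, cv=ps,
                    scoring=['accuracy', 'roc_auc_ovr_weighted'],
                    refit='accuracy', n_jobs=-1)
gscv.fit(X, y)
score_best_estimator = gscv.score(X_test, y_test)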
print('the accuracy of the best estimator on the test data is: {:.3f}'.
,→format(score_best_estimator))
df = pd.DataFrame(gscv.cv_results_)
df
0 -1 0 -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1 0 -1 -1 -1 -1 0 -1 0 -1
0 -1 0 0 -1 -1 -1 0 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 -1 -1
-1 -1 -1 -1 -1 -1 0 -1 0 -1 -1 -1 -1 -1 -1 -1]
the split used as training and validation sets:
(array([ 0, 1, 2, 3, 4, 5, 6, 10, 11, 13, 14, 15, 16,
18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32,
33, 34, 35, 38, 39, 41, 43, 45, 46, 47, 49, 51, 53,
54, 55, 57, 58, 59, 61, 62, 64, 65, 66, 67, 69, 71,
73, 76, 77, 78, 80, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 94, 95, 96, 97, 98, 99, 100, 101, 103, 105,
106, 107, 108, 109, 110, 111]), array([ 7, 8, 9, 12, 17,
,→21, 29,
36, 37, 40, 42, 44, 48,
50, 52, 56, 60, 63, 68, 70, 72, 74, 75, 79, 81, 93,
102, 104]))
the highest score is: 0.893
the best hyperparameter combination is: {'C': 0.1, 'l1_ratio': 0.8,
,→'penalty':
'elasticnet', 'solver': 'saga'}
the accuracy of the best estimator on the test data is: 0.974
Naturally, for a specific Ψ, if 1) the H^D_Ψ that is used in the grid search contains the
optimal set of hyperparameter values, and 2) the estimation rule can accurately estimate
Example 9.9 In this example, we use the same data, classifier, and hyperparame-
ter search space as in Example 9.7; however, instead of GridSearchCV, we use
RandomizedSearchCV. As in Example 9.7, we consider both l1 and l2 penalty.
However, C is assumed to be log-uniformly distributed between 0.01 and 100, and ν
(defined in Section 6.2.2) is assumed to be uniformly distributed between 0 and 1.
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression as LRR
from sklearn.model_selection import RandomizedSearchCV,
,→StratifiedKFold, train_test_split
from scipy.stats import loguniform, uniform
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_train, X_test, y_train, y_test= train_test_split(iris.data, iris.
,→target, random_state=100, test_size=0.2, stratify=iris.target)
print('X_train_shape: ' + str(X_train.shape) + '\nX_test_shape: ' +
,→str(X_test.shape)\
+ '\ny_train_shape: ' + str(y_train.shape) + '\ny_test_shape: '
,→+ str(y_test.shape) + '\n')
lrr = LRR(max_iter=2000)
distrs = [{'penalty': ['l2'], 'C': loguniform(0.01, 100)},
{'penalty': ['elasticnet'], 'C': loguniform(0.01, 100),
,→'l1_ratio':uniform(0, 1), 'solver':['saga']}]
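# The rest of this listing is not shown above; a sketch of the randomized search
# itself, consistent with the 10 sampled candidates displayed below (the CV
# splitter and random_state values are assumptions), is:
strkfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
rscv = RandomizedSearchCV(estimator=lrr, param_distributions=distrs, n_iter=10,
                          cv=strkfold, n_jobs=-1, random_state=42)
rscv.fit(X_train, y_train)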
X_train_shape: (120, 4)
X_test_shape: (30, 4)
y_train_shape: (120,)
y_test_shape: (30,)
params
0 {'C': 15.352246941973478, 'penalty': 'l2'}
1 {'C': 8.471801418819974, 'penalty': 'l2'}
2 {'C': 2.440060709081752, 'penalty': 'l2'}
3 {'C': 0.04207053950287936, 'l1_ratio': 0.05808...
4 {'C': 0.21618942406574426, 'l1_ratio': 0.14286...
5 {'C': 0.012087541473056955, 'penalty': 'l2'}
6 {'C': 7.726718477963427, 'l1_ratio': 0.9385527...
7 {'C': 0.05337032762603955, 'l1_ratio': 0.18340...
8 {'C': 2.7964859516062437, 'l1_ratio': 0.007066...
9 {'C': 0.14618962793704957, 'penalty': 'l2'}
Exercises:
Exercise 2: Use the same dataset as in Exercise 1 but this time randomly split the
dataset to 80% training and 20% test. Use cross-validation grid search to find the best
set of hyperparameters for random forest to predict the target based on all features.
To define the space of hyperparameters of random forest assume:
1. max_features is 1/3;
2. min_samples_leaf ∈ {2, 10};
3. n_estimators ∈ {10, 50, 100}
In the grid search, use 10-fold CV with shuffling, and compute and report
the estimates of both the MAE and R². Then evaluate the model that has the highest
R̂² determined from CV on the held-out test data and report its test R̂².
Exercise 3: In this exercise, use the same data, classifier, and hyperparameter search
space as in Example 9.9; however, instead of cross-validation, use a val-
idation set (for reproducibility, set the random_state in RandomizedSearchCV,
StratifiedKFold, and train_test_split to 42). Monitor and report both the
accuracy and the weighted ROC AUC on the validation set but use the accuracy for
hyperparameter tuning. Report the best estimated set of hyperparameters, its score
on the validation set, and the accuracy of the corresponding model on the test set.
Exercise 5: Which of the following best describes the use of resubstitution estimation
rule in a regression problem to estimate R2 ?
Exercise 7: Suppose we have the following training data. What is the leave-one-out
error estimate of the standard 3NN classification rule based on this data?
class         A     A     A     B     B     B
observation   0.45  0.2   0.5   0.35  0.45  0.4

A) 2/6   B) 3/6   C) 4/6   D) 5/6   E) 6/6
Exercise 8: Suppose we have a classifier that assigns to class positive (P) if the score
is larger than a threshold and to negative (N) otherwise. Applying this classifier to
some test data results in the scores shown in the following table where “class” refers
to the actual class of an observation. What is the estimated AUC of this classifier
based on this result?
observation  class  score        observation  class  score
1            P      0.25         6            P      0.35
2            P      0.45         7            N      0.38
3            N      0.48         8            N      0.3
4            N      0.4          9            P      0.5
5            N      0.33         10           N      0.2
Exercise 9:
Suppose in a binary classification problem, the class conditional densities are mul-
tivariate normal distributions with a common covariance matrix N (µ P, Σ) and
N (µ N, Σ), and prior probability of classes πP and πN . Show that for any linear
classifier in the form of
(
P if aT x + b > T
ψ(x) = (9.36)
N otherwise ,
where a and b determine the discriminant function, the optimal threshold, denoted Topt,
that minimizes the error rate

\varepsilon = \pi_N \times fpr + \pi_P \times fnr \,, \qquad (9.37)
is

T_{opt} = \frac{h(\mu_N) + h(\mu_P)}{2} - \frac{g(\Sigma)\, \ln\frac{\pi_N}{\pi_P}}{h(\mu_N) - h(\mu_P)} \,, \qquad (9.38)

where

h(\mu) = \mathbf{a}^T\mu + b \,, \qquad g(\Sigma) = \mathbf{a}^T \Sigma\, \mathbf{a} \,. \qquad (9.39)
Exercise 10: Suppose in a binary classification problem, the estimate of the ROC
AUC for a specific classifier on a test sample Ste is 1. Let nFP and nFN denote the
number of false positives and false negatives that are produced by applying this
classifier to Ste, respectively, and let \widehat{acc} denote its accuracy estimate based on Ste.
Which of the following cases is impossible?
A) \widehat{acc} = 1
B) \widehat{acc} < 1
C) nFP > 0 and nFN = 0
D) nFP > 0 and nFN > 0
Exercise 11: Use scikit-learn to produce the ROC curve shown in Example 9.2.
Hint: one way is to use the true labels (y_true) and the given scores (y_pred) as
inputs to sklearn.metrics.RocCurveDisplay.from_predictions(y_true,
y_pred). Another way is to first use the true labels and the given scores as inputs
to sklearn.metrics.roc_curve to compute the estimated fpr and tpr (we can also view the
thresholds at which they are computed), and then use the computed estimates as
the inputs to sklearn.metrics.RocCurveDisplay to create the ROC curve display
object, and plot the curve by its plot() method.
Chapter 10
Feature Selection
of the original feature space are known as dimensionality reduction methods. There
are three main reasons why one should reduce dimensionality:
• interpretability: it is often helpful to be able to see the relationships between
the inputs (feature vector) and the outputs in a given problem. For example, what
biomarkers (features) can possibly distinguish a patient with a bad prognosis from
another patient with a good prognosis? There are two layers to the interpretability
of a final constructed model. One layer is the internal structure of the predictive
model. For example, in Section 7.4, we observed that CART has a glass-box
model. For example, in Section 7.4, we observed that CART has a glass-box
structure that helps shed light on the input-output relationship. Another layer
is the input to the model per se. For example, if the original dataset has a
dimensionality of 20,000, using all features as the input to the CART would
impair interpretability regardless of its glass-box structure. In these situations,
it is helpful to focus on “important” features that contribute to the prediction;
observations. This entire process is repeated 20 times to find the average accuracy
of each classifier as a function of feature size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier as CART
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as
,→LDA
np.random.seed(42)
n_features = 100
mc_no = 20 # number of Monte Carlo simulation
acc_test = np.zeros((2, n_features)) # to record test accuracies for
,→both LDA and CART
cart = CART()
lda = LDA()
for j in np.arange(mc_no):
# by setting shuffle=False in make_classification, the informative
,→features are put first
X, y = make_classification(n_samples=1000, n_features=n_features,
,→n_informative=10, n_redundant=0, n_repeated=0, n_classes=2,
,→n_clusters_per_class=1, class_sep=1, shuffle=False)
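    # The remaining lines of this listing are not shown above; a sketch of the rest
    # of the Monte Carlo loop: split the data, then train and test each classifier
    # using only the first f features, for f = 1, ..., n_features.
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    for f in np.arange(1, n_features + 1):
        acc_test[0, f-1] += cart.fit(X_train[:, :f], y_train).score(X_test[:, :f], y_test)
        acc_test[1, f-1] += lda.fit(X_train[:, :f], y_train).score(X_test[:, :f], y_test)
acc_test /= mc_no   # average the accuracies over the Monte Carlo repetitions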
As we can see in Fig. 10.1, after all informative features are added, adding any
additional feature would be harmful to the average classifier performance, which is
a manifestation of the peaking phenomenon.
Fig. 10.1: Average test accuracy as a function of feature size: (a) CART; and (b)
LDA.
to name just a few criteria. Detailed descriptions of all aforementioned criteria are
beyond our scope. As a result, here we briefly discuss the main idea behind filter
methods using statistical tests—and even there our discussion is not meant to be a
complete guide on various aspects of statistical tests.
Statistical Tests: Statistical tests in feature selection are used to examine the “im-
portance” of a feature by testing some properties of populations based on available
data—for example, by testing the hypothesis that a single feature is unimportant in
distinguishing two or multiple classes from each other. To put it another way, the
definition of importance here depends on the type of hypothesis that we test. For
example, we may designate a feature as important if, based on the given sample, the hypothesis that there is no difference between the means of the class-conditional densities is rejected. If, based on the statistic (i.e., a quantity derived from the data) that is used in the test (the test statistic), we observe compelling evidence against the null hypothesis, denoted H0, we reject H0 in favor of the alternative hypothesis Ha and conclude that the feature is potentially important; otherwise, we continue to believe in
H0 . Therefore, in feature selection applications, H0 is generally the assertion that a
feature is unimportant and the alternative hypothesis Ha bears the burden of proof.
Formally, a test rejects a null hypothesis if its P-value (also known as the observed
significance level) is less than a significance level (usually denoted by α), which is
generally set to 0.05. For example, we reject H0 if the P-value is 0.02, which is
less than α = 0.05. The significance level of a test is the probability of a false
positive occurring, which for our purpose means claiming a feature is important
(i.e., rejecting H0 ) if H0 is true (i.e., the feature is not actually important). The
probability of a false positive is defined based on the assumption of repeating the
test on many samples of the same size collected from the population of interest; that
is to say, if H0 is true and the test procedure is repeated infinitely many times on
different samples of the same size from the population, the proportion of rejected
H0 is α. A detailed treatment of P-value and how it is obtained in a test is beyond the
scope here (readers can consult statistical textbooks such as (Devore et al., 2013)),
but the lower the P-value, the stronger the evidence provided by the observed test statistic against the null hypothesis (that is, more compelling evidence against the hypothesis that the feature is unimportant).
Common statistical tests that can be used for identifying important features include
t-test, which is for binary classification, and ANOVA (short for analysis of variance),
which is for both binary and multiclass classification. In ANOVA, a significant P-
value (i.e., a P-value that is less than α) means the feature would be important
in classifying at least two of the classes. In ANOVA, a statistic is computed that
under certain assumptions follows F-distribution (Devore et al., 2013) (therefore,
the statistic is known as F-statistic and the statistical test is known as F-test).
Whether the computed F-statistic is significant or not is based on having the P-value
less than α. In regression, the most common statistical test to check the importance
of a feature in predicting the target is based on a test that determines whether the
Pearson’s correlation coefficient is statistically significant. For this purpose, it also
conducts an F-test.
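As a small illustration of how these tests translate into code (the data below are synthetic placeholders, not from the book's examples), scipy.stats provides them directly:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# a single feature observed in two classes; class 1 has a shifted mean
x0 = rng.normal(loc=0.0, scale=1.0, size=50)
x1 = rng.normal(loc=0.7, scale=1.0, size=50)

t_stat, p_t = stats.ttest_ind(x0, x1)   # two-sample t-test (binary classification)
f_stat, p_f = stats.f_oneway(x0, x1)    # one-way ANOVA F-test (binary or multiclass)

# for regression: test the significance of Pearson's correlation between a feature and the target
feature = rng.normal(size=100)
target = 0.5 * feature + rng.normal(scale=1.0, size=100)
r, p_r = stats.pearsonr(feature, target)

print(f"t-test P-value: {p_t:.4f}, ANOVA P-value: {p_f:.4f}, correlation P-value: {p_r:.4f}")
# with alpha = 0.05, a feature is flagged as (potentially) important if its P-value < 0.05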
It is worthwhile to remember that many common statistical tests only take into
account the effect of one feature at a time regardless of its multivariate relationship
with other features. This is itself a major shortcoming of these tests because in
many situations features work collectively to result in an observed target value.
Nonetheless, insofar as statistical tests are concerned, it can be shown that when we have many features, and therefore many statistical tests to examine the importance of each feature one at a time, the probability of having at least one false positive among all tests, also known as the family-wise error rate (FWER), is more than α where α is
the significance level for each test (see Exercise 1). A straightforward approach to
control FWER is the well-known Bonferroni correction. Let m denote the number
of tests performed. In this approach, rather than rejecting a null hypothesis when
P-value ≤ α as it is performed in testing a single hypothesis, Bonferroni correction
involves rejecting the null hypotheses by comparing P-values to α/m (Exercise 2).
Despite its ease of use, the Bonferroni correction is a conservative approach; that is to say, the strict threshold for rejecting a null hypothesis increases the probability of not rejecting the null hypothesis when it is false (i.e., it increases the false negative rate). There are more powerful approaches than Bonferroni. All these approaches, which are essentially used to control a metric of interest such as the FWER, fpr, or fdr in the multiple testing problem, are collectively known as multiple testing corrections.
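To make the contrast concrete, the following small simulation (all numbers are illustrative) estimates the probability of at least one false positive among m tests of true null hypotheses, once with the per-test threshold α and once with the Bonferroni-corrected threshold α/m:

import numpy as np

rng = np.random.default_rng(0)
m, alpha, n_rep = 50, 0.05, 10000
# under a true null hypothesis, P-values are uniformly distributed between 0 and 1
pvals = rng.uniform(size=(n_rep, m))

fwer_uncorrected = np.mean((pvals <= alpha).any(axis=1))
fwer_bonferroni = np.mean((pvals <= alpha / m).any(axis=1))

print(f"estimated FWER with per-test alpha      : {fwer_uncorrected:.3f}")
print(f"estimated FWER with Bonferroni threshold: {fwer_bonferroni:.3f}")  # stays below alpha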
It would be instructive to see the detailed algorithm (and later the implementation) of at least one of these corrections. As a result, without any proof, we describe a well-known multiple testing correction due to Benjamini and Hochberg (Benjamini and Hochberg, 1995).
Benjamini–Hochberg (BH) false discovery rate procedure: This procedure is for
controlling the expected (estimate of the) false discovery rate in multiple tests; that
is, to control (see Section 9.1.2 for the definition of f dr):
$$E[\widehat{fdr}] = E\left[\frac{n_{FP}}{n_{TP} + n_{FP}}\right]. \qquad (10.1)$$
Suppose we have p features and, therefore, p null hypotheses denoted by H0,1, H0,2, . . . , H0,p (each null hypothesis asserts that the corresponding feature is unimportant). Therefore, we conduct p statistical tests to identify the null hypotheses that are rejected in favor of their alternative hypotheses. The BH fdr procedure for multiple testing correction is presented in the following algorithm.
2) Sort the P-values obtained from the tests and denote them by P(1) ≤ P(2) ≤ . . . ≤ P(p).
3) Let j denote the largest i such that P(i) ≤ (i/p)α. To find j, start from the largest P-value and check whether P(p) ≤ (p/p)α; if not, then check whether P(p−1) ≤ ((p−1)/p)α; if not, then check P(p−2) ≤ ((p−2)/p)α, and continue. Stop as soon as the inequality is met; j is the index of the P(i) that meets this condition for the first time.
4) Reject the null hypotheses H0,(1), H0,(2), . . . , H0,(j). This means all P(i), i ≤ j (P-values smaller than P(j)) are also significant even if P(i) ≤ (i/p)α is not met for some of them.
Example 10.2 Suppose we have 10 features and run an ANOVA test for each to identify important features. The following P-values are obtained, where Pi denotes the P-value for feature i = 1, 2, . . . , 10. Apply the BH procedure to determine the important features with α = 0.05.
P1 = 0.001, P2 = 0.1, P3 = 0.02, P4 = 0.5, P5 = 0.002, P6 = 0.005, P7 = 0.01, P8 = 0.04, P9 = 0.06, P10 = 0.0001
Step 2: Sort P-values: 0.0001 ≤ 0.001 ≤ 0.002 ≤ 0.005 ≤ 0.01 ≤ 0.02 ≤ 0.04 ≤
0.06 ≤ 0.1 ≤ 0.5.
Step 3:
is 0.5 ≤ (10/10) × 0.05 = 0.05? No.
is 0.1 ≤ (9/10) × 0.05 = 0.045? No.
is 0.06 ≤ (8/10) × 0.05 = 0.04? No.
is 0.04 ≤ (7/10) × 0.05 = 0.035? No.
is 0.02 ≤ (6/10) × 0.05 = 0.03? Yes.
Step 4: Reject any null hypothesis with a P-value less than or equal to 0.02; that is, the null hypotheses corresponding to 0.0001, 0.001, 0.002, 0.005, 0.01, and 0.02. This means the indices of the important features are 10, 1, 5, 6, 7, and 3.
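The steps above can be verified numerically; the following short sketch (a pure-NumPy illustration, not the book's listing) implements the BH procedure exactly as described:

import numpy as np

pvals = np.array([0.001, 0.1, 0.02, 0.5, 0.002, 0.005, 0.01, 0.04, 0.06, 0.0001])
alpha = 0.05
p = len(pvals)

order = np.argsort(pvals)                     # indices that sort the P-values
sorted_p = pvals[order]                       # P_(1) <= P_(2) <= ... <= P_(p)
thresholds = np.arange(1, p + 1) / p * alpha  # (i/p) * alpha
below = np.nonzero(sorted_p <= thresholds)[0]
if below.size > 0:
    j = below.max()                           # largest i with P_(i) <= (i/p) * alpha
    rejected = order[: j + 1]                 # reject H0 for all P-values up to P_(j)
    print("important features (1-based indices):", sorted(rejected + 1))  # [1, 3, 5, 6, 7, 10]
else:
    print("no null hypothesis is rejected")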
Using a multiple testing correction, we select a number of features that are important based on the P-values and on what we try to control using the correction procedure. This way, we generally do not know in advance how many features will be selected. In other words, given an α for the correction, which is a measure of the expected
Example 10.3 Here we wish to conduct ANOVA and then use BH procedure
to correct for multiple testing. For this purpose we use the Genomic dataset
GenomicData_orig.csv that was used in Exercise 1, Chapter 6. The data had 21050 features, and therefore that many statistical tests to determine the importance of each feature based on the F-statistic. For this purpose, we first apply ANOVA (using
f_classif) and “manually” implement the procedure outlined above for the BH
correction. Then we will see that using SelectFdr, we can easily use ANOVA,
correct for multiple testing based on BH procedure, and at the same time, implement
the filter (i.e., select our important features). Last but not least, SelectKBest is used
to pick the best 50 individual features based on ANOVA.
from sklearn.feature_selection import SelectFdr, f_classif, SelectKBest
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('data/GenomicData_orig.csv')
df.head()
df = df.drop(df.columns[[0,1]], axis=1)
# X_train, X_test, y_train, y_test are assumed to come from a train_test_split of this data;
# that part of the listing (along with the "manual" BH correction) is not shown in this excerpt
# to use SelectKBest
filter_KBest = SelectKBest(k=50).fit(X_train, y_train)
X_train_filtered_KBest = filter_KBest.transform(X_train)
print("The number of important features based on F statistic (using K=50 best individual feature):", X_train_filtered_KBest.shape[1])
X_test_filtered_KBest = filter_KBest.transform(X_test)
print("The shape of test data (the number of features should match with training): ", X_test_filtered_KBest.shape)
The number of important features based on F statistic (before multiple hypothesis correction): 7009
The maximum p-value considered significant before BH correction is: 0.0100
The number of important features based on F statistic (after multiple hypothesis correction): 5716
The maximum p-value considered significant after BH correction is: 0.0027
Let's look at the p-values for the first 20 features
[1.26230819e-06 2.57999668e-04 7.20697614e-01 4.61019523e-01
 1.44329139e-04 7.73450170e-07 7.88695806e-01 5.38203614e-01
 6.12687071e-01 6.08016364e-01 7.42496319e-10 7.22040641e-01
 6.38454543e-01 9.08196717e-03 5.42029503e-17 8.64539548e-02
 8.26923725e-09 3.88193129e-06 8.46980459e-02 3.57175908e-01]
The number of important features based on F statistic (after multiple hypothesis correction): 5716
The shape of test data (the number of features should match with training): (43, 5716)
The number of important features based on F statistic (using K=50 best individual feature): 50
The shape of test data (the number of features should match with training): (43, 50)
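The SelectFdr-based filtering that produced the BH-corrected counts above is not reproduced in this excerpt; a minimal sketch of how it could look (α = 0.01 is inferred from the reported pre-correction threshold, and X_train, y_train, X_test come from the split mentioned earlier) is:

filter_fdr = SelectFdr(score_func=f_classif, alpha=0.01).fit(X_train, y_train)
X_train_filtered = filter_fdr.transform(X_train)  # keeps only features surviving the BH correction
X_test_filtered = filter_fdr.transform(X_test)
print("The number of important features after BH correction:", X_train_filtered.shape[1])
# the univariate ANOVA P-values are available through filter_fdr.pvalues_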
These methods wrap around a specific learning algorithm and evaluate the impor-
tance of a feature subset using the predictive performance of the learner trained on
the feature subset. The learning algorithm that is used for wrapper feature selection
does not necessarily need to be the same as the learning algorithm that is used to
construct the final predictive model; for example, one may use an efficient algorithm
such as decision trees or Naive Bayes for feature selection but a more sophisticated
algorithm to construct the final predictive model based on selected features.
Commonly used wrapper methods include a search strategy using which the space
of feature subsets is explored. Each time a candidate feature subset is selected for
evaluation based on the search strategy, a model is trained and evaluated using an
evaluation rule along with an evaluation metric. In this regard, cross-validation is
the most popular evaluation rule, and the accuracy, the ROC AUC, or the F1 score are common metrics to use. At the end of the search process, the feature subset with the highest score is returned as the chosen feature subset. Although there are
many search strategies to use in this regard, here we discuss three approaches that
are straightforward to implement:
• Exhaustive Search
• Sequential Forward Search (SFS)
• Sequential Backward Search (SBS)
Exhaustive Search: Suppose we wish to select k out of p candidate features that
are “optimal” based on a score such as the accuracy estimate. In this regard, one
can exhaustively search the space of all possible feature subsets of size k out of p.
This means that there are $\binom{p}{k} = \frac{p!}{k!(p-k)!}$ feature subsets to choose from. However, this could be extremely large even for relatively modest values of p and k. For example, $\binom{50}{20} > 10^{13}$. In reality the situation is exacerbated because we do not even know how many features should be selected. To put it another way, there is generally no a priori evidence about any specific k. In other words, an exhaustive search should be conducted for all possible values of k = 1, 2, . . . , p. This means that to find the optimal feature subset, the number of subset evaluations by an exhaustive search is $\sum_{k=1}^{p} \binom{p}{k} = 2^p - 1$. Therefore, the number of feature subsets to evaluate grows exponentially and quickly becomes impractical as p grows; for example, for p = 50, more than $10^{15}$ feature subsets should be evaluated!
the optimal search strategy, the exponential complexity of the search space is the
main impediment in its implementation. As a result, we generally rely on suboptimal
search strategies as discussed next.
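As a toy illustration of exhaustive search before moving on (all parameters below are illustrative and deliberately small, so that the $\binom{8}{3} = 56$ subsets can actually be enumerated):

from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
k = 3
best_score, best_subset = -np.inf, None
for subset in combinations(range(X.shape[1]), k):  # evaluate every subset of size k
    score = cross_val_score(LDA(), X[:, list(subset)], y, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, subset
print("best subset of size", k, ":", best_subset, "with CV accuracy", round(best_score, 3))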
Sequential Forward Search (SFS): This is a greedy search strategy in which at
each iteration a single feature, which in combination with the current selected feature
subset leads to the highest score, is identified and added to the selected feature subset.
This type of search is known as greedy because the search does not revise the decision
made in previous iterations (i.e., does not revise the list of selected feature subsets
from previous iterations). Let fi, i = 1, . . . , p, and Ft denote the i-th feature in a p-dimensional feature vector and a subset of these features that is selected at iteration t, respectively. Formally, the algorithm is implemented in Algorithm “Sequential Forward Search”.
1.1. One stopping criterion could be the size of feature subsets to pick; for
example, one may decide to select k < p features.
1.2. Another stopping criterion that does not require specifying a predetermined
number of features to pick is to continue the search as long as there is a tangible
improvement in the score by adding one or a few features sequentially. For
example, one may decide to stop the search if after adding 3 features the score
is not improved more than a prespecified constant.
2. Find the fi ∉ Ft that maximizes the estimated scoring metric of the model trained using Ft ∪ fi. Denote that feature by fibest. In other words, fibest is the best univariate feature that, along with the features already in Ft, leads to the highest score.
3. Add fibest to the selected subset: Ft+1 = Ft ∪ fibest, and set t ← t + 1.
4. Repeat steps 2 and 3 until the stopping criterion is met (e.g., based on the first stopping criterion, if |Ft| = k). Once the search is stopped, the best feature subset is Ft.
2. Find the fi ∈ Ft that maximizes the estimated scoring metric of the model trained using Ft − fi. Denote that feature by fiworst. In other words, fiworst is the worst univariate feature, in the sense that its removal from the currently selected feature subset leads to the highest score.
3. Remove fiworst from the selected subset: Ft+1 = Ft − fiworst, and set t ← t + 1.
4. Repeat steps 2 and 3 until the stopping criterion is met. Once the search is stopped, the best feature subset is Ft.
# the imports below are not shown in this excerpt and are added for completeness
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold, train_test_split

n_features = 100
n_informative = 10
lda = LDA()
X, y = make_classification(n_samples=1000, n_features=n_features, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, class_sep=1, shuffle=False)
X_train_full, X_test_full, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.2)
strkfold = StratifiedKFold(n_splits=5, shuffle=True)

%%timeit -n 1 -r 1
np.random.seed(42)
sfs = SequentialFeatureSelector(lda, n_features_to_select=10, cv=strkfold, n_jobs=-1)  # by default the direction is 'forward'
sfs.fit(X_train_full, y_train)
print(sfs.get_support())  # True means the feature is selected, False means it is not
print("The number of truly informative features chosen by SFS is:", sum(sfs.get_support()[:10]))
[False True False True False True False True False True False False
False False False False False False False False False False False False
False False False False False False False False True False False False
False False False True False False False False False False False False
False False False False False False False False False False False False
False True False False False False False False False False False False
False False False False False False False False False False True False
True False False False False False False False False False False False
False False False False]
The number of truly informative features chosen by SFS is: 5
5.58 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n 1 -r 1
sbs = SequentialFeatureSelector(lda, n_features_to_select=10, cv=strkfold, direction='backward', n_jobs=-1)
sbs.fit(X_train_full, y_train)
print(sbs.get_support())
print("The number of truly informative features chosen by SBS is:", sum(sbs.get_support()[:10]))
[False True True True False True False True False True False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False True False
False True False True False False False False False False False False
False False False False False False False False True False False False
False False False False False False False False False False False False
False False False False]
The number of truly informative features chosen by SBS is: 6
50.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In this example, SBS picks more of the informative features than SFS, but it takes almost nine times longer than SFS to stop.
These are algorithms in which, once the predictive model is constructed, the structure of the model can be used for feature selection. In other words, the feature selection stage is embedded into the model training. In this regard, once the model is trained, the importance of features is judged by properties of the trained learner. For example, the magnitudes of the coefficients in a linear model trained on normalized data can be viewed as the influence of their corresponding features on the target variable. Now if, in the same scenario, a feature's coefficient is zero, it means that feature has no influence on the target variable in the given data. There are two types of embedded methods: 1) non-iterative; and 2) iterative.
Non-iterative Embedded Methods: Non-iterative embedded methods do not re-
quire an iterative search strategy to select features. Here, by “iteration”, we do not
refer to the iterative optimization method that is used to construct the underlying pre-
dictive models given a fixed set of features. Instead, we refer to an iterative procedure
for selecting features. An example of a non-iterative embedded method for feature
selection is logistic regression based on l1 regularization (lasso). As seen in Section
6.2.2, this classifier had the property of selecting features in a high-dimensional
setting as it sets the coefficients of many features to zero. Some other models such as decision trees and random forests also have an internal mechanism to define “importance” for a feature. A common metric to measure importance in random forests
is the Mean Decrease Impurity (MDI) (Louppe et al., 2013). The MDI for a feature
k ( fk ), denoted MDI fk , is defined as the normalized sum of impurity drops across
all nodes and all trees that use fk for splitting the nodes. Using similar notations as
in Section 7.2.2, for a trained random forest containing $n_{Tree}$ trees, this is given by

$$\mathrm{MDI}_{f_k} = \frac{1}{n_{Tree}} \sum_{Tree} \; \sum_{m \in Tree:\ \xi=(k,\cdot)} \frac{n_{S_m}}{n} \, \Delta_{m,\xi} \,, \qquad (10.2)$$

where $Tree$ is a variable indicating a tree in the forest, $n_{S_m}/n$ shows the proportion of samples reaching node $m$, $\xi = (k, \cdot)$ means the split at that node was based on $f_k$ with some threshold, and

$$\Delta_{m,\xi} = i(S_m) - \frac{n_{S^{Yes}_{m,\xi}}}{n_{S_m}} \, i\!\left(S^{Yes}_{m,\xi}\right) - \frac{n_{S^{No}_{m,\xi}}}{n_{S_m}} \, i\!\left(S^{No}_{m,\xi}\right). \qquad (10.3)$$
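In scikit-learn, these MDI values are exposed through the feature_importances_ attribute of a trained random forest; a minimal sketch (the dataset and parameters are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the (normalized) MDI of each feature
mdi = rf.feature_importances_
top5 = np.argsort(mdi)[::-1][:5]
print("top-5 features by MDI:", top5, "importances:", np.round(mdi[top5], 3))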
1.1. One stopping criterion could be the size of feature subsets to pick; for
example, one may decide to select k < p features.
1.2. Similar to SBS, another stopping criterion that does not require specifying
a predetermined number of features to select could be removing features as
long as there is a tangible improvement in the estimated performance score.
2. Use features in Ft to train a model. Identify the s features that are ranked lowest and call this subset Ftlowest.
3. Remove these features from the current subset: Ft+1 = Ft − Ftlowest, and set t ← t + 1.
4. Repeat steps 2 and 3 until the stopping criterion is met. Once the search is stopped, the best feature subset is Ft.
np.random.seed(42)
n_features = 100
n_informative = 10
lda = LDA()
X, y = make_classification(n_samples=1000, n_features=n_features, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, class_sep=1, shuffle=False)
X_train_full, X_test_full, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.2)

import time
from sklearn.feature_selection import RFE  # import added for completeness; not shown in this excerpt
start = time.time()
rfe = RFE(lda, n_features_to_select=10)
rfe.fit(X_train_full, y_train)
print(rfe.get_support())
print("The number of truly informative features chosen by RFE is:", sum(rfe.get_support()[:10]))
end = time.time()
print("The RFE execution time (in seconds): {:.4f}".format(end - start))
lda_score = rfe.score(X_test_full, y_test)
print("The score of the lda trained using selected features on test data is: {:.4f}".format(lda_score))
[ True True True False True False True True True True True False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False True False False]
The number of truly informative features chosen by RFE is: 8
The RFE execution time (in seconds): 0.2335
The score of the lda trained using selected features on test data is: 0.9500
import time
from sklearn.feature_selection import RFECV  # import added for completeness; not shown in this excerpt
start = time.time()
rfecv = RFECV(lda, cv=strkfold, n_jobs=-1)
rfecv.fit(X, y)
print(rfecv.get_support())
end = time.time()
print("The RFE CV execution time (in seconds): {:.4f}".format(end - start))
[ True True True True True False True True True True False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False True False False False True False False
False False False False False False True False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False]
The number of features selected by RFE CV is: 12
The number of truly informative features chosen by RFE is: 9
The RFE CV execution time (in seconds): 1.3327
The score of the lda trained using selected features on training data is: 0.9563
Efficiency of RFE with respect to SBS is an extremely useful property. The
computational burden incurred by SBS is a major impediment in the utility of SBS
for high-dimensional data. In examples 10.4 and 10.5 we use a relatively modest
dimensionality and we see that RFECV is about 50.1/1.33 ≈ 38 times faster to
compute than SBS (the exact ratio depends on the runs and the machine). This ratio,
which is an indicator of efficiency of RFECV with respect to SBS, becomes larger if
we increase the number of dimensions and/or training sample size (see exercises 4
and 5).
Exercises:
Exercise 1: Suppose in a dataset we have m features and for each feature we test a
null hypothesis H0 against an alternative hypothesis Ha to determine whether the
feature is important/significant (i.e., H0 is rejected in the favor of Ha ). Assuming
each test has a significance level of α = 0.05, what is FWER (the probability of
having at least one false positive)?
Exercise 2: Suppose a study involves m hypothesis tests. Show that if each null
hypothesis is rejected when the P-value for the corresponding test is less than or
equal to α/m (Bonferroni correction), where α is the significance level of each test,
then FWER ≤ α. To prove this, use the following well-known results: 1) P-values have a uniform distribution between 0 and 1 when the null hypothesis is true (Rice, 2006, p. 366); and 2) for events $A_i$, i = 1, . . . , m, $P\!\left(\cup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} P(A_i)$, which is known as the union bound.
Exercise 3: Assume we would like to choose a suboptimal feature subset of size p/2 out of p (an even number of) candidate features using both SFS and SBS.
1. write an expression in terms of p that shows the number of feature subsets that
must be evaluated using SFS (simplify your answer as much as possible).
2. write an expression in terms of p that shows the number of feature subsets that
must be evaluated using SBS (simplify your answer as much as possible).
Exercise 4: A) Run examples 10.4 and 10.5 on your machine and find the ratio of
execution time of RFE-CV to that of SBS.
B) Rerun part (A) but this time by increasing the number of candidate features
from 100 to 150 (i.e., setting n_features in make_classification to 150). How
does this change affect the execution time ratio found previously in part (A)?
C) As in part (A), use n_features = 100 but this time, change the size of
training set from 200 to 1000 and run the examples. How does this change affect the
execution time ratio found previously in part (A)?
Chapter 11
Assembling Various Learning Steps
So far we have seen that in order to achieve a final predictive model, several steps are often required. These steps include (but are not limited to) normalization, imputation, feature selection/extraction, model selection and training, and evaluation. The misuse of some of these steps in connection with others has been so common in various fields of study that it has occasionally triggered warnings and criticism from the machine learning community. This state of affairs can be attributed in large part
to the misuse of resampling evaluation rules such as cross-validation and bootstrap
for model selection and/or evaluation in connection with a combination of other
steps to which we refer as a composite process (e.g., combination of normalization,
feature extraction, and feature selection). In this chapter we focus primarily on the
proper implementation of such composite processes in connection with resampling
evaluation rules. To keep discussion succinct, we use feature selection and cross-
validation as typical representatives of the composite process and a resampling
evaluation rule, respectively. We then describe appropriate implementation of
1. Feature Selection and Model Evaluation Using Cross-Validation
2. Feature and Model Selection Using Cross-Validation
3. Nested Cross-Validation for Feature and Model Selection, and Evaluation
Goal:
1. learn Ξ from Str,n —learning Ξ from a given data Str,n means estimating all
transformations that are part of Ξ from Str,n . Let Ξtr denote the learned com-
posite process;
2. use Ξtr to transform Str,n into a new training data Ξtr(Str,n);
3. use the transformed training data Ξtr(Str,n) along with a classification rule to train a classifier ψtr(x); and
4. use K-fold CV to evaluate the trained classifier ψtr (x).
In a nutshell, the correct way of using K-fold CV to evaluate ψtr (x) is to apply K-
fold CV external to the composite process Ξ. In fact, implementation of the external CV is not new; it is merely the proper way of implementing the standard K-fold CV algorithm that was presented in Section 9.1.1, by realizing that the “learning algorithm” Ψ presented there is a combination of a composite process Ξ that transforms the data plus a classifier/regressor construction algorithm. Nevertheless, we present a separate algorithm, namely, “External” K-fold CV (Ambroise and McLachlan, 2002), to describe the implementation in more detail. Furthermore, hereafter we refer to Ψ as a “composite estimator” because it is the combination of a composite transformer and an estimator (classifier/regressor).
Note that in the “External” K-fold CV algorithm, although the general form
of processes used within Ξ1, Ξ2, . . . , ΞK are the same, the precise form of each
transformation is different because each Ξk is learned from a different data Str,n −
Foldk, k = 1, . . . , K. At the same time, we observe that throughout the algorithm, Foldk is not used in any way in learning Ξk or training ψk(x). Therefore, Foldk is truly treated as unseen data for the classifier trained on the iteration-specific training set
Str,n − Foldk . To better grasp and appreciate the essence of the “External” K-fold
CV algorithm in achieving the “Goal”, in the next few sections in this chapter we
use feature selection as a typical processing step used in Ξ; after all, it was mainly
the common misuse of feature selection and cross-validation that triggered several
warnings (Dupuy and Simon (2007); Ambroise and McLachlan (2002); Michiels
et al. (2005)). That being said, the entire discussion remains valid if we augment Ξ
with other transformations discussed before.
2. For k from 1 to K
2.1. Learn Ξ from Str,n − Foldk; let Ξtr,k denote the learned composite process in this iteration.
2.2. Use Ξtr,k to transform Str,n − Foldk into a new data Ξtr,k(Str,n − Foldk).
2.3. Use the transformed training data Ξtr,k(Str,n − Foldk) to train a surrogate classifier ψk(x).
2.4. Classify observations in the transformed held-out fold Ξtr,k(Foldk) using ψk(x).
Insofar as feature selection and cross-validation are concerned, the most common
mistake is to apply feature selection on the full data at hand, construct a classifier
using selected features, and then use cross-validation to evaluate the performance
of the classifier. This practice leads to selection bias, which is caused because the classifier is evaluated based on samples that were used in the first place to select the features that are part of the classifier (Ambroise and McLachlan, 2002). This mistake may stem from the discrepancy between the perception of the “Goal” and the actual practical steps required to reach it. Here is a common perception of the Goal:
Unfortunately, here intuition could lead us astray. The following example illus-
trates the issue with this common mistake.
Example 11.1 Here we create a synthetic training data that has the same size as
the genomic dataset (172 × 21050) used in Example 10.3. However, this synthetic dataset is a null dataset where all observations for all features across both classes are random samples from the standard Gaussian distribution. Naturally, all features are equally irrelevant for distinguishing between the two classes because samples from both classes are generated from the same distribution. By keeping the sample size across both classes equal, we should expect a 50% accuracy for any learning algorithm applied and evaluated on this null dataset.
We wish to use this null data to examine the effect of the aforementioned common
mistake. In this regard, once the training data is generated, we apply ANOVA (see
Chapter 10) on the training set and pick the best 10 features, and then apply cross-
validation on the training data to evaluate the performance of a CART classifier
constructed using the 10 selected features. At the same time, because the data come from a synthetic model, we can set the size of the test set to be almost two times larger than the training set. In particular, we choose 172 (training set size) + 328 (test set size) = 500. Furthermore, in order to remove any bias towards a specific realization
of the generated datasets or selected folds used in CV, we repeat this entire process
50 times (mc_no = 50 in the following code) and report the average 5-fold CV and
the average test accuracies at the end of this process.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier as CART

np.random.seed(42)
n_features = 21050
n_sample = 500  # size of training + test sets
ratio_train2full = 172/500  # the proportion of training data to n_sample
mc_no = 50  # number of Monte Carlo repetitions
kfold = 5  # K-fold CV
cv_scores_matrix = np.zeros((mc_no, kfold))
test_scores = np.zeros(mc_no)
for j in np.arange(mc_no):
    X = np.random.normal(0, 1, size=(n_sample, n_features))
    y = np.concatenate((np.ones(n_sample//2, dtype='int_'), np.zeros(n_sample//2, dtype='int_')))
    # the listing is truncated at this point in the excerpt; the remainder below is a sketch
    # of the common mistake described in the text (variable names and details are assumptions)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=ratio_train2full)
    # mistake: features are selected once using the full training data ...
    skb = SelectKBest(k=10).fit(X_train, y_train)
    X_train_s, X_test_s = skb.transform(X_train), skb.transform(X_test)
    # ... and CV is then applied to the same training data using the already-selected features
    strkfold = StratifiedKFold(n_splits=kfold, shuffle=True)
    cv_scores_matrix[j] = cross_val_score(CART(), X_train_s, y_train, cv=strkfold)
    test_scores[j] = CART().fit(X_train_s, y_train).score(X_test_s, y_test)
print("average CV accuracy: {:.3f}".format(cv_scores_matrix.mean()))
print("average test accuracy: {:.3f}".format(test_scores.mean()))
As we saw in the previous section, the correct approach to evaluate ψtr (x) is
to apply K-fold CV external to feature selection. In other words, to evaluate the
classifier trained based on features that are selected by applying the feature selection
on full training data, external K-fold CV implies:
We need to successively hold out one fold as part of cross-validation, apply fea-
ture selection on the remaining data to select features, construct a classifier using
these selected features, and evaluate the performance of the constructed classifier on
the held-out fold (and naturally take the average score at the end of cross-validation).
From a practical perspective, cross-validation can be used for model evaluation, for model selection, or for both at the same time. As a result, in what follows we discuss implementations of these cases separately.
taking the composite process Ξ as the feature selection. This can also be seen as
the application of the standard K-fold CV presented in Section 9.1.1 by taking
the composite estimator Ψ as feature selection (Ξ) + classifier/regressor learning
algorithm.
Scikit-Learn Pipeline class and implementation of feature selection and CV
model evaluation: As we know from Chapter 9, cross_val_score, for example,
expects a single estimator as the argument but here our estimator should be
a composite estimator ( Ψ ≡ feature selection + classifier construction rule). Can
we create such a composite estimator Ψ and treat it as a single estimator in scikit-
learn? The answer to this question is affirmative. In scikit-learn we can make a single
estimator out of a feature selection procedure (the object from the class implementing
the procedure) and a classifier (the object from the class implementing the classifier).
This is doable by the Pipeline class from sklearn.pipeline module.
The common syntax for using the Pipeline class is Pipeline(steps) where
steps is a list of tuples (arbitrary_name, transformer) with the last one
being a tuple of (arbitrary_name, estimator). In other words, all estimators
that are chained together in a Pipeline except for the last one must implement the
transform method so that the output of one is transformed and used as the input
to the next one (these transformers create the composite transformer Ξ). The last
estimator can be used to add the classification/regression learning algorithm to the Pipeline to create the composite estimator Ψ. Below is an example where two
transformers PCA() and SelectKBest(), and one estimator CART() are chained
together to create a composite estimator named pipe.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier as CART
from sklearn.feature_selection import SelectKBest
pipe = Pipeline(steps=[("dimen_reduc", PCA(n_components=10)), ("feature_select", SelectKBest(k=5)), ("classifier", CART())])
Example 11.2 Here we aim to apply the external K-fold CV under the same settings
as in Example 11.1. The external K-fold CV is used here to correct for the selection
bias that occurred previously in Example 11.1. As mentioned before, we should
create one composite estimator to implement feature selection + classifier learning
algorithm and use that estimator within each iteration of CV. As in Example 11.1,
at the end of the process, the average performance estimate obtained using cross-
validation and test set are compared.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline  # added for completeness; used below
from sklearn.tree import DecisionTreeClassifier as CART  # added for completeness; used below
np.random.seed(42)
n_features = 21050
n_sample = 500  # size of training + test set
ratio_train2full = 172/500
mc_no = 50  # number of Monte Carlo repetitions
kfold = 5  # K-fold CV
cv_scores_matrix = np.zeros((mc_no, kfold))
test_scores = np.zeros(mc_no)
strkfold = StratifiedKFold(n_splits=kfold, shuffle=True)
pipe = Pipeline(steps=[('kbest', SelectKBest(k=10)), ('clf', CART())])  # kbest and clf are arbitrary names
for j in np.arange(mc_no):
    X = np.random.normal(0, 1, size=(n_sample, n_features))
    y = np.concatenate((np.ones(n_sample//2, dtype='int_'), np.zeros(n_sample//2, dtype='int_')))
    # the listing is truncated at this point in the excerpt; the remainder below is a sketch
    # that follows the calls referenced in the surrounding text
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=ratio_train2full)
    # external CV: feature selection is re-learned within each fold via the pipeline
    cv_scores_matrix[j] = cross_val_score(pipe, X_train, y_train, cv=strkfold)
    final_cart = pipe.fit(X_train, y_train)
    test_scores[j] = final_cart.score(X_test, y_test)
print("average CV accuracy: {:.3f}".format(cv_scores_matrix.mean()))
print("average test accuracy: {:.3f}".format(test_scores.mean()))
As we can see here, the average CV estimate of the accuracy is very close to that of the test-set estimate, and both are about 50%, as they should be (recall that we had a null dataset). Here, by cross_val_score(pipe, X_train, y_train, cv=strkfold), we are training 5 CART surrogate classifiers where each is trained using a (possibly) different set of features of size 10 (because Str,n,p − Foldk is different in each iteration of CV), and evaluating each on its corresponding held-out fold (Foldk). This is a legitimate performance estimate of the outcome of applying the composite estimator Ψ to Str,n,p; that is to say, the performance estimate of our final constructed classifier, which itself is different from all surrogate classifiers. This is the classifier constructed using final_cart = pipe.fit(X_train, y_train); we name it “final_cart” because this is what we use for predicting unseen observations in the future. Here, as we could easily generate test data from our synthetic model, we evaluated and reported the performance of this classifier on test data using test_scores[j] = final_cart.score(X_test, y_test). However, in the absence of test data, the CV estimate itself obtained by cross_val_score(pipe, X_train, y_train, cv=strkfold) is a reasonable estimate of the performance of final_cart; as observed, the average difference between the CV estimate and the hold-out estimate is less than 1%.
As discussed in Section 9.2.1, a common approach for model selection is the grid
search cross-validation. However, once feature selection is desired, the search for the
best model should be conducted by taking into account the joint relationship between
the search and the feature selection; that is to say, the best feature-model is the one
that leads to the highest CV score across all examined combinations of features and
models (i.e., a joint feature-model selection).
Scikit-Learn implementation of feature selection and CV model selection: As
discussed in Section 9.2.1, the grid search cross-validation is conveniently im-
plemented using sklearn.model_selection.GridSearchCV class. Therefore,
a convenient way to integrate feature selection as part of model selection is using an
object of Pipeline in GridSearchCV. However, as Pipeline creates a composite
estimator out of several other estimators, there is a special syntax that we need to use
to instruct param_grid used in GridSearchCV about the specific estimators over
which we wish to conduct tuning (and of course the candidate parameter values for
those estimator). Furthermore, we need to instantiate the Pipeline using one pos-
sible choice of transformer/estimator and then use param_grid of GridSearchCV
to populate the choices used in steps of Pipeline. To illustrate the idea as well as
introduce the special aforementioned syntax, we look into the following instructive
example.
np.random.seed(42)
n_features = 1000
acc_test = np.zeros(n_features)
kfold = 5  # K-fold CV
X, y = make_classification(n_samples=1000, n_features=n_features, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, class_sep=1, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.2)
pipe = Pipeline(steps=[('kbest', SelectKBest()), ('clf', CART())])  # kbest and clf are arbitrary names
strkfold = StratifiedKFold(n_splits=kfold, shuffle=True)
param_grid = [
{'clf': [LRR()], 'clf__penalty': ['l2'], 'clf__C': [0.1, 1, 10],
'kbest': [SelectKBest()], 'kbest__k': [5, 10, 50]},
{'clf': [LRR()], 'clf__penalty': ['l1'], 'clf__C': [0.1, 1, 10],
'clf__solver': ['liblinear'], 'kbest': ['passthrough']},
{'clf': [CART(), RF()], 'clf__min_samples_leaf': [1, 5],
'kbest': [SelectKBest()], 'kbest__k': [5, 10, 50]}
]
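# (sketch) the instantiation and fitting of the grid search object gscv used below are not
# shown in this excerpt; under the settings above, they would presumably look like:
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(pipe, param_grid=param_grid, cv=strkfold, n_jobs=-1)
gscv.fit(X_train, y_train)
print(gscv.best_params_, gscv.best_score_)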
df = pd.DataFrame(gscv.cv_results_)
#df # uncomment this to see all 24 combinations examined in this example
Pipeline(steps=[('kbest',SelectKBest()),
('clf', KNeighborsClassifier())])
would have been very close to E[εθ ∗ ]—here each performance estimate is a legiti-
mate estimate. This is what is meant by “before selecting the best classifier”: we treat
θ ∗ as any other combination of hyperparameter values; that is, the same argument
applies to any other combination of hyperparameter values because all classifiers are
trained using the same combination.
Case 2. Similar to what we did in finding θ∗, let us train a number of classifiers on S′tr,n − Fold′k where each uses a unique combination of hyperparameter values. Then, by comparing their performance estimates on Fold′k, we select θ∗′, which is not necessarily the same as θ∗. Doing the same on S′′tr,n − Fold′′k and Fold′′k leads to θ∗′′, and so on. Here, to find each best classifier, we use each Foldk, Fold′k, Fold′′k, . . . in finding θ∗, θ∗′, θ∗′′, . . . . In this case, the average of the performance estimates of the best classifiers obtained on Foldk, Fold′k, Fold′′k, . . . is not necessarily close to any of E[εθ∗], E[εθ∗′], E[εθ∗′′], . . . . The problem is caused because the held-out folds are seen in training each best classifier (via selecting the best classifier), and each time we end up with a different classifier characterized by a unique θ. As a result, the performance estimates obtained are neither hold-out error estimates nor CV estimates if repeated on different splits.
Therefore, it is not legitimate to use the performance estimates obtained as part of the grid search CV for evaluating the performance of the best classifier. The consequence of violating this is not purely philosophical, as it could have serious implications in practice. To see this, in the following example we investigate the impact of such a malpractice using the null datasets of Example 11.2. In particular, we examine the impact of perceiving the CV score obtained as part of the grid search as the performance estimate of the best classifier.
Example 11.4 Here we apply the grid search CV using the same hyperparameter
space defined in Example 11.3 on the set of null datasets generated in Example
11.2. We examine whether the average CV accuracy estimates of the best classifiers
selected as part of the grid search is close to 50%, which should be the average true
accuracy of any classifier on these null datasets.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier as CART
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.linear_model import LogisticRegression as LRR
np.random.seed(42)
n_features = 21050
n_sample = 500
ratio_train2test = 172/500
mc_no = 20 # number of MC repetitions
kfold = 5 # K-fold CV
cv_scores = np.zeros(mc_no)
test_scores = np.zeros(mc_no)
pipe = Pipeline(steps=[('kbest', SelectKBest()), ('clf', CART())])  # kbest and clf are arbitrary names
# the grid definition is truncated in this excerpt; the remaining entries below are
# filled in from the identical hyperparameter space of Example 11.3
grids = [
    {'clf': [LRR()], 'clf__penalty': ['l2'], 'clf__C': [0.1, 1, 10],
     'kbest': [SelectKBest()], 'kbest__k': [5, 10, 50]},
    {'clf': [LRR()], 'clf__penalty': ['l1'], 'clf__C': [0.1, 1, 10],
     'clf__solver': ['liblinear'], 'kbest': ['passthrough']},
    {'clf': [CART(), RF()], 'clf__min_samples_leaf': [1, 5],
     'kbest': [SelectKBest()], 'kbest__k': [5, 10, 50]}
]
for j in np.arange(mc_no):
    X = np.random.normal(0, 1, size=(n_sample, n_features))
    y = np.concatenate((np.ones(n_sample//2, dtype='int_'), np.zeros(n_sample//2, dtype='int_')))
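    # (sketch) the remainder of the loop is not reproduced in this excerpt; the completion
    # below follows the surrounding description, and the exact details are assumptions
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=ratio_train2test)
    strkfold = StratifiedKFold(n_splits=kfold, shuffle=True)
    gscv = GridSearchCV(pipe, param_grid=grids, cv=strkfold, n_jobs=-1)
    gscv.fit(X_train, y_train)
    cv_scores[j] = gscv.best_score_              # CV accuracy of the selected (best) classifier
    test_scores[j] = gscv.score(X_test, y_test)  # accuracy on the independent test set
print("average CV accuracy of the best classifiers: {:.3f}".format(cv_scores.mean()))
print("average test accuracy of the best classifiers: {:.3f}".format(test_scores.mean()))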
As we can see, there is a difference of more than 6% between the average accuracy estimates of the best 20 classifiers (mc_no = 20) obtained as part of the grid search CV and their average true accuracy (estimated on independent hold-out sets). The 56.8% average estimate obtained on these balanced null datasets is an indicator of some predictability of the class labels from the given data, which is spurious because these are null datasets.
As we saw in the previous section, the performance estimates obtained as part of the grid search cross-validation cannot be used for evaluating the performance of the best classifier. The question is then how to estimate its performance. We have used test sets in Example 11.4 for evaluating the best classifier. In the absence of an independent test set, one can split the given data into training and test sets and use the hold-out estimator. However, as discussed in Section 9.1.1, in situations where the sample size is not large, the cross-validation estimator is generally preferred over the hold-out estimator. But can we use cross-validation to evaluate the performance of the classifier trained using the best combination of hyperparameter values obtained by the grid search CV?
Exercises:
Exercise 1: Here we would like to use the same data generating model and hyper-
parameter space that were used in Example 11.4. Keep all the following and other
parameters the same
n_features = 21050
n_sample = 500
ratio_train2test = 172/500
mc_no = 20
Apply nested cross-validation on X_train and y_train and report the average estimate of accuracy obtained by that (recall that the average is over the number of Monte Carlo simulations: mc_no = 20). For both CVs used as part of the model selection and the model evaluation, use stratified 5-fold CV. What is the average CV estimate obtained using nested cross-validation? Is it closer to 50% compared to what was achieved in Example 11.4?
Hint: in Example 11.4 after GridSearchCV is instantiated, use the object as estima-
tor used in cross_val_score.
A) f = f1 = f2 = f3
B) f ≠ f1 = f2 = f3
C) f ≠ f1 ≠ f2 ≠ f3
D) f = f1 ∪ f2 ∪ f3 and f1 = f2 = f3
E) f = f1 ∩ f2 ∩ f3 and f1 = f2 = f3
F) f = f1 ∪ f2 ∪ f3 and f1 ≠ f2 ≠ f3
G) f = f1 ∩ f2 ∩ f3 and f1 ≠ f2 ≠ f3
A) Both the goal and the action taken to achieve the goal are legitimate.
B) The goal is legitimate but the action taken to achieve the goal is not because the
use of CV in this way would make it even more pessimistically biased.
C) The goal is legitimate but the action taken to achieve the goal is not because the
use of CV in this way could make it even optimistically biased.
Chapter 12
Clustering
Let Sn = {x1, x2, . . . , xn } denote a given set of observations (training data) where
xi ∈ R p, i = 1, . . . , n, represents a vector including the values of p feature variables,
and n is the number of observations. Partitional clustering breaks Sn into K non-
empty subsets C1 , . . . , CK such that
• Ci ∩ Cj = ∅, i ≠ j, i, j = 1, . . . , K,
• C1 ∪ C2 ∪ . . . ∪ CK = Sn,
where ∩ and ∪ denote intersection and union operators, respectively. The value of
K may or may not be specified. The set P = {C1, . . . , CK } is known as a clustering
and each Ci is a cluster. A partitional clustering algorithm produces a clustering
when presented with data, and this is regardless of whether data actually contains
any cluster or not. Therefore, it is the responsibility of the user to examine whether
there is any merit to clustering before performing the analysis. Furthermore, the
lack of information about class labels makes clustering an essentially subjective process. The same data items (observations) can be clustered differently for different purposes. For example, consider the following set of four data items: S4 =
{Cherry, Mint, Mango, Chive}. Depending on the definition of clustering criterion,
below are some possible clusterings of S4 :
clustering criterion clustering
word initial {Cherry, Chive}, {Mint, Mango}
word length {Mint}, {Chive, Mango}, {Cherry}
botanical distinction {Cherry, Mango}, {Chive, Mint}
color {Cherry}, {Mint, Chive}, {Mango}
This shows the importance of incorporating domain knowledge in clustering. Such a
knowledge can be encoded in the form of: 1) feature vectors that are used to represent
each data item; 2) measures of similarity or dissimilarity between feature vectors;
3) grouping schemes (clustering algorithms) that are used to produce clusters from
feature vectors; and 4) ultimate interpretation, confirmation, or rejection of the results
obtained by a clustering algorithm.
Example 12.1 List all possible clusterings of a given data S3 = {x1, x2, x3 } by a
partitional clustering algorithm.
2) to add x4 to each cluster member of any clustering that already has three clusters;
that is to say,
Therefore, there are in total six ways to partition 4 observations into 3 clusters.
Example 12.2 sheds light on a systematic way to count the number of all possible
clusterings of n observations into K clusters, denoted S(n, K). Suppose we list all
possible clusterings of Sn−1 , which is a set containing n − 1 observations. There are
two ways to form K clusters of Sn = Sn−1 ∪ {xn }:
1. to add xn as a singleton cluster to each clustering of Sn−1 with K − 1 clusters—
there are S(n − 1, K − 1) of such clusterings;
2. to add xn to each cluster member of any clustering that already has K clusters—
there are S(n − 1, K) of such clusterings, and each has K clusters.
Therefore,

$$S(n, K) = S(n-1, K-1) + K\, S(n-1, K). \qquad (12.1)$$

The solution to this difference equation is the Stirling number of the second kind (see (Jain and Dubes, 1988) and references therein):

$$S(n, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^{K-i} \binom{K}{i} i^{n}. \qquad (12.2)$$
This number grows quickly as a function of n and K. For example, S(30, 5) > 1018 . As
a result, even if K is fixed a priori, it is impractical to devise a clustering algorithm
that enumerates all possible clusterings to find the most sensible one. Therefore, clustering algorithms only evaluate a small “reasonable” fraction of all possible clusterings to arrive at a sensible solution. One of the simplest and most popular
partitional clustering techniques is K-means (Lloyd, 1982; Forgy, 1965; MacQueen,
1967), which is described next.
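As a quick check of how fast S(n, K) grows, the recurrence (12.1) can be evaluated exactly with Python integers (a small illustrative script, not from the book):

from functools import lru_cache

@lru_cache(maxsize=None)
def S(n, K):
    """Number of clusterings of n observations into K non-empty clusters."""
    if K == 0:
        return 1 if n == 0 else 0
    if n == 0 or K > n:
        return 0
    # recurrence (12.1): x_n either forms a singleton cluster or joins an existing cluster
    return S(n - 1, K - 1) + K * S(n - 1, K)

print(S(4, 3))   # 6, matching the count in the text
print(S(30, 5))  # a 19-digit number, larger than 10**18 as stated in the text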
12.1.1 K -Means
The standard form of K-means clustering algorithm, also known as Lloyd’s algorithm
(Lloyd, 1982), is implemented in the algorithm presented next.
In (Selim and Ismail, 1984), it is shown that the K-means algorithm can be formu-
lated as a nonconvex mathematical programming problem where a local minimum
need not be the global minimum. That being said, it is shown that the algorithm con-
verges to a local minimum for the criterion defined in (12.3); however, if dE2 [xi, j , µ j ]
Algorithm K-Means
1. Set iteration counter t = 1, fix K, and randomly designate K data points as the K cluster centroids (means).
2. Assign each data point to the cluster with the nearest centroid.
3. Compute the squared error of the resulting clustering P = {C1, . . . , CK}:

$$e(S_n, P) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} d_E^2\!\left[\mathbf{x}_{i,j}, \boldsymbol{\mu}_j\right], \qquad (12.3)$$

where xi,j is the i-th data point belonging to cluster j having current mean µj and nj data points, and dE2[x, y] is the square of the Euclidean distance between two vectors x and y, which is a dissimilarity metric (see Section 5.1.3 for the definition of Euclidean distance).
4. Recompute the centroid for each cluster using the current data points belonging to that cluster, and set t ← t + 1.
5. Repeat steps 2-4 until a stopping criterion is met (e.g., no data point changes clusters or the squared error no longer changes).
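A minimal NumPy sketch of the algorithm above (an illustration, not the book's implementation; the initialization and stopping rule are simplified):

import numpy as np

def kmeans_lloyd(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: randomly designate K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each point to the cluster with the nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: squared error of the current clustering
        error = d2[np.arange(len(X)), labels].sum()
        # step 4: recompute centroids (keep the old one if a cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, error

# the six points of Example 12.3
X = np.array([[0, 1], [1, 1], [5, 1], [0, 0], [-1, 0], [-2, 0]], dtype=float)
labels, centroids, error = kmeans_lloyd(X, K=2)
print(labels, centroids, error)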
Algorithm K-Means++
1. Fix K and randomly designate a data point as the first cluster centroid.
Example 12.3 Consider six data points that are identified in Fig. 12.1 with • points.
Apply K-means clustering with K = 2 and take data points at [0, 1]T and [0, 0]T as
the initial centroids of the two clusters. Stop the algorithm either once no data points
change clusters or there is no change in the squared error.
Let x1 , x2 , x3 , x4 , x5 , and x6 represent data points at [0, 1]T , [1, 1]T , [5, 1]T , [0, 0]T ,
[−1, 0]T , and [−2, 0]T , respectively.
For t = 1:
• step 2: x1 , x2 , and x3 are assigned to cluster 1, denoted C1 , with the centroid
µ 1 = x1 , and x4 , x5 , and x6 are assigned to cluster 2, denoted C2 , with the
centroid µ 2 = x4 (see Fig. 12.2a);
• step 3: e(Sn, P) = 25 + 1 + 1 + 4 = 31;
• step 4: recomputing µ 1 and µ 2 yields µ 1 = [2, 1]T and µ 2 = [−1, 0]T .
For t = 2:
• step 2: x2 and x3 are assigned to C1 , and x1 , x4 , x5 , and x6 are assigned to C2
(see Fig. 12.2b);
• step 3: e(Sn, P) = 9 + 1 + (√2)² + 1 + 1 = 14;
• step 4: recomputing µ 1 and µ 2 yields µ 1 = [3, 1]T and µ 2 = [−3/4, 1/4]T .
For t = 3:
• step 2: x3 is assigned to C1 , and all other points are assigned to C2 (see Fig.
12.2c);
• step 3:
(c) (d)
Fig. 12.2: The clustering in Example 12.3 produced by K-means at: (a) t = 1; (b) t = 2;
(c) t = 3; and (d) t = 4; • and represent points belonging to C1 and C2 , respectively.
A × identifies a cluster centroid.
parameter. The init parameter can be used to set the centroids initialization to
'k-means++' (default), 'random' (for random assignments of data points to cen-
troids), or specific points (a specific array) specified by the user. By default (as of
version 1.2.2), the algorithm will run 10 times (can be changed by n_init) and
the best clustering in terms of having the lowest within-cluster sum of squares is
used. Because a KMeans object is an estimator, similar to other estimators it has a fit() method that can be used to construct the K-means clusterer. At the same time, KMeans implements the predict() method to assign data points to clusters.
To do so, an observation is assigned to the cluster with the nearest centroid. Never-
theless, the clustering of training data (assigned cluster labels from 0 to K − 1) can be
retrieved by the labels_ attribute of a trained KMeans object and the centroids are
obtained by cluster_centers_ attribute. Last but not least, KMeans implements
the score(X) method, which returns the negative of squared error computed for a
given data X and the identified centroids of clusters that are already part of a trained
KMeans object; that is to say, given a data X, it computes −e(X, P) defined in (12.3)
where P is already determined from the clustering algorithm. Here, the negative
is used to have a score (the higher, the better) as compared with loss. Nonetheless,
because (12.3) is not normalized with respect to the number of data points in X
(similar to the current implementation of the score(X) method), for a fixed P and
two datasets X1 and X2 where X1 ⊂ X2 , −e(X1, P) > −e(X2, P).
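A short usage sketch of the interface just described (the data and parameter values are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 1], [1, 1], [5, 1], [0, 0], [-1, 0], [-2, 0]], dtype=float)

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster label (0 to K-1) of each training point
print(kmeans.cluster_centers_)   # the identified centroids
print(kmeans.predict([[4, 1]]))  # assign a new observation to the nearest centroid
print(kmeans.score(X))           # negative of the squared error for X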
Remarks on K-Means Functionality: The use of the squared error makes large-scale features dominate others. As a result, it is quite common to normalize data before
applying K-means (see Section 4.6 for scaling techniques). At the same time, the use
of the centroids to represent clusters naturally works well when the actual underlying
clusters of data, if any, are isolated or compactly isotropic. As a result, when the
actual clusters are elongated and rather mixed, K-means algorithm may not identify
clusters properly. The situation is illustrated in the following example.
Example 12.4 In this example, we wish to train a K-means clusterer by treating the
Iris training dataset that was prepared (normalized) in Section 4.6 as an unlabeled
training data. Similar to Example 6.1, we only consider two features: sepal width
and petal length. As we already know the actual “natural” clusters in this data, which
are setosa, versicolor, and virginica Iris flowers, we set K = 3. Furthermore, we
compare the clustering of unlabeled data with the actual groupings to have a sense
of designated clusters.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.cluster import KMeans
# (completion sketch) loading the normalized Iris training data of Section 4.6;
# the 'y' key for the class labels is an assumption
arrays = np.load('data/iris_train_scaled.npz')
X_train_full, y_train = arrays['X'], arrays['y']
print('X shape = ' + str(X_train_full.shape))
X_train = X_train_full[:, [1, 2]]  # sepal width and petal length
K = 3
kmeans = KMeans(n_clusters=K, n_init=10).fit(X_train)
print('The score for K={} is {:.3f}'.format(K, kmeans.score(X_train)))
# a mesh grid over the feature space and an arbitrary 3-color map for the clusters
cmap = ListedColormap(['aquamarine', 'bisque', 'lightgrey'])
x_min, x_max = X_train[:, 0].min() - 0.5, X_train[:, 0].max() + 0.5
y_min, y_max = X_train[:, 1].min() - 0.5, X_train[:, 1].max() + 0.5
X, Y = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
coordinates = np.c_[X.ravel(), Y.ravel()]
fig, axs = plt.subplots(1, 2, figsize=(6, 3), dpi=150)
# plotting the cluster boundaries and the scatter plot of training data
Z = kmeans.predict(coordinates)
Z = Z.reshape(X.shape)
axs[0].tick_params(axis='both', labelsize=6)
axs[0].pcolormesh(X, Y, Z, cmap=cmap, shading='nearest')
axs[0].contour(X, Y, Z, colors='black', linewidths=0.5)
axs[0].plot(X_train[:, 0], X_train[:, 1], 'k.', markersize=4)
axs[0].set_title('$K$=' + str(K), fontsize=8)
axs[0].set_ylabel('petal length (normalized)', fontsize=7)
axs[0].set_xlabel('sepal width (normalized)', fontsize=7)
axs[1].tick_params(axis='both', labelsize=6)
axs[1].pcolormesh(X, Y, Z, cmap=cmap, shading='nearest')
axs[1].contour(X, Y, Z, colors='black', linewidths=0.5)
# scatter the actual groups: green (setosa), red (versicolor), black (virginica)
for c, col in zip([0, 1, 2], ['g.', 'r.', 'k.']):
    axs[1].plot(X_train[y_train==c, 0], X_train[y_train==c, 1], col, markersize=4)
axs[1].set_title('$K$=' + str(K), fontsize=8)
axs[1].set_xlabel('sepal width (normalized)', fontsize=7)
X shape = (120, 4)
The score for K=3 is -57.201
Fig. 12.3: The scatter plot of normalized Iris dataset (sepal width and petal length
features) and cluster boundaries determined with K-means for K = 3. The plot on
the left shows unlabeled data. The plot on the right shows the actual grouping of
data points, which is known for the Iris dataset: green (setosa), red (versicolor), and
black (virginica). Comparing plots on the left and the right shows that the K-means
has been able to cluster green data points pretty well but the other two groups are
quite mixed within identified clusters.
As we can see in Fig. 12.3, despite the elongated spread of green points, they are
quite isolated from others and are identified as one cluster by K-means. However,
the other two groups are quite mixed within the identified clusters. For the sake of
comparison, in Fig. 12.4 we plot the cluster boundaries obtained using another
combination of features, namely, petal width and petal length. Here all
clusters are compactly isotropic and K-means works pretty well in clustering data
points from all actual groups.
Fig. 12.4: The scatter plot of normalized Iris dataset (petal width and petal length) and
cluster boundaries determined with K-means for K = 3. The plot on the left shows
unlabeled data. The plot on the right shows the actual grouping of data points, which
is known for the Iris dataset: green (setosa), red (versicolor), and black (virginica).
Comparing plots on the left and the right shows that the K-means has performed
well in clustering data points from all actual groups.
Fig. 12.5: The scatter plot of normalized Iris dataset (sepal width and petal length
features) and cluster boundaries determined with K-means for: K = 2 (top row) and
K = 4 (bottom row). The left panel shows unlabeled data and the right panel shows
the actual grouping of data points, which is known for the Iris dataset: green
(setosa), red (versicolor), and black (virginica).
does not change slope from some K onwards (thus creating the shape of an “elbow”
in the curve). Such a point in the curve might be an indicator of the optimal number
of clusters because the subsequent small decreases are simply due to a decrease in
partition radius rather than the data configuration. In the method based on the elbow
phenomenon (“elbow method”), we plot the squared error for a range of K, which is
determined a priori. Then the method designates the K after which the curve does
not considerably change slope as the “appropriate” number of clusters. Naturally, such
a K is acceptable as long as it is not widely at odds with the domain knowledge. We
examine the method in the following example.
Example 12.5 Here we examine the elbow method for the K-means and the two
feature combinations in the Iris flower datasets that were considered in Example
12.4: Case 1) sepal width and petal length; and Case 2) petal width and petal length.
To plot the curve of squared error as a function of K, we simply run the K-means
algorithm for a range of K and plot -kmeans.score(X_train) (see Example 12.4).
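A minimal sketch of this procedure for Case 1 (assuming the normalized Iris training
data is loaded from the same file as in Example 12.4) could look like:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

arrays = np.load('data/iris_train_scaled.npz')
X_train = arrays['X'][:, [1, 2]]   # Case 1: sepal width and petal length
K_val = range(1, 8)
squared_errors = []
for K in K_val:
    kmeans = KMeans(n_clusters=K, n_init=10).fit(X_train)
    squared_errors.append(-kmeans.score(X_train))   # e(X, P) for each K
plt.plot(K_val, squared_errors, 'k.-')
plt.xlabel('$K$')
plt.ylabel('squared error')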
In Case 1 (left plot in Fig. 12.6), the elbow method points to K = 2. This is the point
after which the slope of the curve does not change markedly. This choice would
become more sensible by looking back more closely at Figs. 12.3 and 12.5. As we
Fig. 12.6: Curve of squared error of K-means as a function of K for (unlabeled) Iris
flower datasets: Case 1) sepal width and petal length features (left); and Case 2) petal
width and petal length (right).
observed in Fig. 12.5, for K = 2 data points are grouped into two distinct clusters
by K-means. It happens that one cluster (at the bottom of the figure) is populated
by Iris setosa flowers (green points), and the other (on top of the figure) is mainly a
mixture of Iris versicolor and virginica. However, when K = 3 in Fig. 12.3, one of
the clusters is still quite isolated but the other two seem neither compactly isotropic
nor isolated. Interestingly, looking into the types of Iris flowers within these two
clusters also shows that these two clusters are still quite mixed between Iris versicolor
and virginica flowers.
In Case 2 (right plot in Fig. 12.6), however, it seems more reasonable to consider
K = 3 as the right number of clusters because the slope of the curve does not change
considerably after that point. In this case, when K = 2, there are still two distinct
clusters similar to Case 1 (figure not shown): one cluster at the bottom left corner
and one on the top right corner. However, for K = 3, the newly created clusters
on top right corner are quite distinct in the left panel of Fig. 12.4: one cluster is
characterized by petal width and petal length in the range of [-0.27, 0.76] and [-0.26,
0.74], respectively, whereas the other cluster has a relatively higher range for these
two features. Here, it happens that we can associate some botanical factors to these
clusters—each cluster is distinctly populated by one type of Iris flower.
s(x_{i,j}) = \frac{b(x_{i,j}) - a(x_{i,j})}{\max\{a(x_{i,j}), b(x_{i,j})\}} ,   (12.4)

where

a(x_{i,j}) = \frac{1}{n_j - 1} \sum_{x_{l,j} \in C_j - \{x_{i,j}\}} d[x_{i,j}, x_{l,j}] ,   (12.5)

b(x_{i,j}) = \min_{r \neq j} \Big\{ \frac{1}{n_r} \sum_{x_{l,r} \in C_r} d[x_{i,j}, x_{l,r}] \Big\} ,   (12.6)
and where d[xi, j , xl, j ] is a dissimilarity measure between vectors xi, j and xl, j (e.g.,
Euclidean distance between two vectors). For a singleton cluster C j , s(xi, j ) is defined
as 0 (Rousseeuw, 1987). Assuming n j > 1, ∀ j, the value of a(xi, j ) shows the
average dissimilarity of xi, j from all other members of the cluster to which it is
assigned. In Fig. 12.7, this is the average length of all lines from the blue point in
cluster C1 to all other points in the same cluster (average of all solid lines). To find
b(xi, j ) in (12.6), we successively compute the average dissimilarities of xi, j to all
objects from other clusters and find the minimum. In Fig. 12.7, this means we find
the average length of all lines from the blue point in C1 to all members of cluster
C2 (i.e., the average of all dotted lines), and then the average length of all lines from
the blue point to all members of cluster C3 (i.e., the average of all dashed lines).
Because the former average is less than the latter, b(xi, j ) (for the blue point) becomes
the average of all dotted lines. This means that b(xi, j ) is a measure of dissimilarity
between xi, j and all members of the second-best cluster (after C j ) for xi, j . This also
implies that to compute b(xi, j ), we need to assume the availability of at least two
clusters.
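The following minimal sketch illustrates (12.4)-(12.6) by hand for a single point, using
the Euclidean distance and three small hypothetical clusters (the numbers are arbitrary
and only serve to make the computation concrete):

import numpy as np

# three small hypothetical clusters C1, C2, C3
C1 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
C2 = np.array([[4.0, 0.0], [5.0, 0.0]])
C3 = np.array([[0.0, 6.0], [0.0, 7.0]])
x = C1[0]                                                       # the point of interest
a = np.mean([np.linalg.norm(x - y) for y in C1[1:]])            # (12.5)
b = min(np.mean([np.linalg.norm(x - y) for y in C]) for C in (C2, C3))  # (12.6)
s = (b - a) / max(a, b)                                         # (12.4)
print(a, b, s)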
Fig. 12.7: Assuming xi,1 is the blue point in C1 , then: 1) a(xi,1 ) is the average length
of all lines from the blue point to all other points in C1 (i.e., the average of all solid
lines); and 2) b(xi,1 ) is the average length of all lines from the blue point to all
members of cluster C2 , which is the second-best (“neighboring”) cluster for xi,1 (i.e.,
the average of all dotted lines).
From the definition of s(xi, j ) in (12.4), it is clear that s(xi, j ) lies between -1 and 1.
We can consider three cases, which help understand the meaning of s(xi, j ):
• b(xi, j ) >> a(xi, j ): it means the average dissimilarity of xi, j to other data points
in its own cluster is much smaller than its average dissimilarity to objects in the
second-best cluster for xi, j . This implies that cluster C j is clearly the right cluster
for xi, j . In this case, from (12.4), s(xi, j ) is close to 1.
• b(xi, j ) << a(xi, j ): it means the average dissimilarity of xi, j to other data points
in its own cluster is much larger than its average dissimilarity to objects in the
second-best cluster for xi, j . This implies that cluster C j is not the right cluster
for xi, j . In this case, from (12.4), s(xi, j ) is close to -1.
• b(xi, j ) ≈ a(xi, j ): this is the case when in terms of average dissimilarities, as-
signing xi, j to C j or its second-best cluster does not make much difference. In
this case, from (12.4), s(xi, j ) is close to 0.
The values s(xi, j ) for data points in cluster C j are used to define cluster-specific
average silhouette width, denoted s̄(C j ), j = 1, . . . , K, which is given by
\bar{s}(C_j) = \frac{1}{n_j} \sum_{x_{i,j} \in C_j} s(x_{i,j}) .   (12.7)
A value of s̄(C j ) close to 1 implies having a distinct cluster C j . Finally, the overall
average silhouette width, denoted s̄(K), is defined as the average of all values s(xi, j )
across all clusters, which is equivalent to:
\bar{s}(K) = \frac{1}{\sum_{j=1}^{K} n_j} \sum_{\forall C_j} n_j \, \bar{s}(C_j) .   (12.8)
The value of s̄(K) can be seen as the average “within” and “between” dissimilarities
of data points in a given clustering (partition). At the same time, one way to select
an “appropriate” number of clusters is to find the K that maximizes s̄(K). This
maximum value is known as silhouette coefficient (Kaufman and Rousseeuw, 1990,
p. 87), denoted SC, and is given by SC = \max_{K} \bar{s}(K).
In (Kaufman and Rousseeuw, 1990, p. 87), authors state that SC is a “measure of the
amount of clustering structure that has been discovered” by the clustering algorithm.
They also propose a subjective interpretation of SCs, which is summarized in Table
12.1 (see (Kaufman and Rousseeuw, 1990, p. 88)):
Example 12.6 Here we consider the K-means and the two feature combinations in
the Iris flower datasets that were used in Example 12.5: Case 1) sepal width and
petal length; and Case 2) petal width and petal length. We find an appropriate value
for K from overall average silhouette width.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# loading the scaled data, and setting feature indices and range of K
arrays = np.load('data/iris_train_scaled.npz')
X_train_full = arrays['X']
feature_indices = [[1,2], [3,2]]
K_val = range(2,8)
fig, axs = plt.subplots(1, 2, figsize=(6, 3), dpi=150)
for j, f in enumerate(feature_indices):
    X_train = X_train_full[:,f] # selecting the two features
    # (completion sketch) overall average silhouette width for each K
    s_bar = []
    for K in K_val:
        kmeans = KMeans(n_clusters=K, n_init=10).fit(X_train)
        s_bar.append(silhouette_score(X_train, kmeans.labels_))
    print('Case {}: SC = {:.2f}'.format(j+1, max(s_bar)))
    axs[j].plot(K_val, s_bar, 'k.-')
    axs[j].set_title('Case ' + str(j+1), fontsize=8)
    axs[j].set_xlabel('$K$', fontsize=8)
    axs[j].set_ylabel(r'$\bar{s}(K)$', fontsize=8)
Fig. 12.8: The overall average silhouette width as a function of K for clusters obtained
using K-means in the (unlabeled) Iris flower datasets: Case 1) sepal width and petal
length features (left); and Case 2) petal width and petal length (right).
Here the use of overall average silhouette width points to K = 2 as the appropriate
number of clusters in both cases. Furthermore, for Case 1 and Case 2, we have
SC = 0.59 and SC = 0.74, respectively. Based on Table 12.1, this means for Case
1 and Case 2, ‘reasonable’ and ‘strong’ clustering structures have been discovered,
respectively.
In contrast with partitional clustering algorithms that create a partition of data points,
hierarchical clustering algorithms create a nested sequence of partitions. Given two
clusterings P1 and P2 of the same data items with K1 and K2 clusters, respectively,
where K2 < K1 , we say P1 is nested in P2 , denoted P1 @ P2 , if each cluster in P1
is a subset of a cluster in P2 . This also implies that each cluster in P2 is obtained by
merging some clusters in P1 .
Example 12.7 Which of the following clusterings are nested in partition P where
where dE2 [x, y] is the square of the Euclidean distance between two vectors x and
y, and µ i and µ j are centroids of Ci and C j , respectively.
1 They are called functions because, strictly speaking, there are cases where even for a metric d[x, y],
the dissimilarity of sets is neither a metric nor a measure (see (Theodoridis and Koutroumbas,
2009, pp. 620-621)).
An algorithm that uses the dissimilarity update rule (12.15) is known as the maximum
algorithm (Johnson, 1967), the furthest-neighbor method, or, more famously,
the complete linkage method (Hastie et al., 2001; Jain and Dubes, 1988; Theodoridis
and Koutroumbas, 2009).
This is because

d_{\text{avg}}[C_i \cup C_j, C_k] = \frac{1}{(n_i + n_j) n_k} \sum_{x \in C_i \cup C_j} \sum_{y \in C_k} d[x, y]
  = \frac{1}{(n_i + n_j) n_k} \Big[ \sum_{x \in C_i} \sum_{y \in C_k} d[x, y] + \sum_{x \in C_j} \sum_{y \in C_k} d[x, y] \Big]
  = \frac{1}{(n_i + n_j) n_k} \Big[ n_i n_k \, d_{\text{avg}}[C_i, C_k] + n_j n_k \, d_{\text{avg}}[C_j, C_k] \Big] ,   (12.18)
which proves (12.17). An algorithm that uses the dissimilarity update rule
(12.17) is known as unweighted pair group method using arithmetic averages
(UPGMA). The UPGMA (also known as group average linkage) is a specific
type of average linkage methods. Other average linkage methods are defined
based on various forms of “averaging”. A detailed description of these algorithms
is provided in (Sneath and Sokal, 1973; Jain and Dubes, 1988).
Nevertheless, due to the popularity of UPGMA, in the clustering literature
“average linkage” and UPGMA are sometimes used interchangeably.
Ward’s (minimum variance) linkage: Suppose we take the Ward dissimilar-
ity function defined in (12.13) as the pairwise cluster dissimilarity. Then the
dissimilarity update rule (12.14) becomes (see Exercise 5)
d_{\text{Ward}}[C_i \cup C_j, C_k] = \frac{n_i + n_k}{n_i + n_j + n_k} d_{\text{Ward}}[C_i, C_k]
  + \frac{n_j + n_k}{n_i + n_j + n_k} d_{\text{Ward}}[C_j, C_k]
  - \frac{n_k}{n_i + n_j + n_k} d_{\text{Ward}}[C_i, C_j] .
(12.19)
The algorithm that uses the dissimilarity update rule (12.19) is known as Ward’s
linkage method (Ward, 1963).
In (Lance and Williams, 1967), it was shown that all the aforementioned update
rules (as well as several others that are not covered here) are special forms of the
following general update rule:

d[C_i \cup C_j, C_k] = \alpha_i \, d[C_i, C_k] + \alpha_j \, d[C_j, C_k] + \beta \, d[C_i, C_j] + \gamma \, \big| d[C_i, C_k] - d[C_j, C_k] \big| .   (12.20)
linkage         α_i                        α_j                        β                        γ
single          1/2                        1/2                        0                        −1/2
complete        1/2                        1/2                        0                        1/2
group average   n_i/(n_i + n_j)            n_j/(n_i + n_j)            0                        0
Ward            (n_i + n_k)/(n_i + n_j + n_k)   (n_j + n_k)/(n_i + n_j + n_k)   −n_k/(n_i + n_j + n_k)   0

Table 12.2: Coefficients of the general update rule (12.20) for various linkage methods.
2 see (Vicari, 2014) and references therein for some cases of asymmetric dissimilarity matrices
2. Identify the pair of clusters with the least dissimilarity; that is, the tuple (i, j)
such that (here we assume there is no tie to identify the minimum in (12.21);
otherwise, a tie breaking strategy should be used and that, in general, can affect
the outcome):
4. Repeat steps 2 and 3 until the stopping criterion is met—a common stopping
criterion is to merge all data points into a single cluster. Once the search is
stopped, the sequence of obtained clusterings are P1 , P2 , . . . , Pt , which satisfy
P1 @ P2 @ . . . @ Pt .
they were merged at level (iteration) 1. “Cutting” the dendrogram at (right above)
level 1, leads to the clustering {{x2, x3 }, {x1 }, {x4 }, {x5 }}. This cut is shown by the
horizontal dashed line right above 0.4. In the second iteration, the least dissimilarity
was due to clusters {{x2, x3 }} and {{x4 }} and, as a result, they were merged at
level 2. Cutting the dendrogram at this level leads to {{x2, x3, x4 }, {x1 }, {x5 }}. This
is identified by the horizontal dashed line right above 0.8.
Example 12.8 Suppose S8 = {x1, x2, . . . , x8 } where x1 = [1.5, 1]T , x2 = [1.5, 2]T ,
x3 = [1.5, −1]T , x4 = [1.5, −2]T , x5 = −x1 , x6 = −x2 , x7 = −x3 , and x8 = −x4 .
Apply the matrix updating algorithm with both the single linkage and complete
linkage until there are two clusters. Use the Euclidean distance metric as the measure
of dissimilarity between data points.
Fig. 12.10a shows the scatter plot of these data points. We first consider the single
linkage. In this regard, using Euclidean distance metric, D1 becomes (because it is
symmetric we only show the upper triangular elements):
Fig. 12.10: The scatter plot of data points in Example 12.8, and merged clusters at
iteration: (b) 1; (c) 2; (d) 4; (e) 5; and (f) 6
12.2 Hierarchical Clustering 343
There are multiple pairs of clusters that minimize (12.21). Although we break the
ties randomly, here due to symmetry of the problem, the outcome at the stopping
point will remain the same no matter how the ties are broken.
For t = 1, the (selected) pair of clusters that minimizes (12.21) is (3, 4)—at this
stage they are simply {{x3 }} and {{x4 }}. Therefore, these two clusters are merged
to form a new cluster (see Fig. 12.10b), and D1 is updated to obtain D2 (at t = 2):
For t = 2, the (selected) pair of clusters that minimizes (12.21) is (5, 6)—at this
stage they are simply {{x5 }} and {{x6 }}. Therefore, these two clusters are merged
to form a new cluster (see Fig. 12.10c), and D2 is updated to obtain D3 :
For t = 5, the (selected) pair of clusters that minimizes (12.21) are the two clusters
on the right part of Fig. 12.10e that are merged. D5 is updated to obtain D6 :
D_6 = \begin{bmatrix} 0 & 3.00 & 3.00 \\ \cdot & 0 & 2.00 \\ \cdot & \cdot & 0 \end{bmatrix} .   (12.26)
For t = 6, the pair of clusters that minimizes (12.21) are the two clusters on the
left part of Fig. 12.10f that are merged. D6 is updated to obtain D7 :
D_7 = \begin{bmatrix} 0 & 3.00 \\ \cdot & 0 \end{bmatrix} .   (12.27)
At this stage, we have two clusters and the stopping criterion is met. Thus, we
stop the algorithm. Fig. 12.10f shows the nested sequence of clusterings obtained by
the algorithm and Fig. 12.11 shows its corresponding dendrogram.
Fig. 12.11: The dendrogram for the single linkage clustering implemented in Exam-
ple 12.8.
Repeating the aforementioned steps for the complete linkage results in Fig. 12.12
(steps are omitted and are left as an exercise [Exercise 4]).
Single linkage vs. complete linkage vs. Ward’s linkage vs. group average
linkage: In the single linkage clustering, two clusters are merged if a “single”
dissimilarity (between their data points) is small regardless of dissimilarity of other
observations within the two clusters. This causes grouping very dissimilar data
points at early stages (low dissimilarities in the dendrogram) if there is some chain
of pairwise rather similar data points between them (Hansen and Delattre, 1978).
This is known as chaining effect and causes clusters produced by single linkage to
be long and “straggly” (Sneath and Sokal, 1973; Jain and Dubes, 1988). This is in
contrast with the complete linkage and the Ward’s linkage that create clusters that are
Fig. 12.12: Complete linkage clustering of data points used in Example 12.8.
relatively compact and hyperspherical (Jain and Dubes, 1988). In the case of complete
linkage clustering, this is an immediate consequence of the max dissimilarity function
used in the complete linkage: two clusters are merged if all pairs of data points that
are part of them are “relatively similar”. This causes these observations to generally be
grouped at higher dissimilarities in the dendrogram. The group average linkage
seems to have an intermediate effect between the single linkage and the complete
linkage clusterings.
There are several Monte Carlo studies to compare the rate at which these methods
recover some known clusters in synthetically generated data. In this regard, the results
of (Blashfield, 1976) obtained by multivariate normally distributed data indicated
that the Ward’s linkage method appeared to be the best followed by the complete
linkage, then the group average linkage, and finally, by the single linkage. The
results also show that in all cases, the “recovery rate” was negatively affected by the
ellipticity of clusters. The results of (Kuiper and Fisher, 1975) using multivariate
normal distributions with identity covariance matrices generally corroborate the
results of (Blashfield, 1976) by placing the Ward’s linkage and the complete linkage
as the best and runner up methods, and the single linkage as the poorest. Another
study generated null data sets using univariate uniform and normal distributions
and examined the performance of the methods (excluding the Ward’s method) in
terms of not discovering unnecessary clusters (Jain et al., 1986). The results of
this study placed the complete linkage as the best method, followed by the single
linkage, and then by the group average. In contrast with the previous studies that used
normally distributed data, another work studied the recovery rate of these methods
for synthetically generated data sets that satisfy the ultrametric inequality (Milligan
and Isaac, 1980). The results placed the group average linkage first, followed by
the complete linkage, then the Ward’s method, and finally the single linkage as
the poorest. There are many other studies that are not covered here. Despite all
these comparative analyses and recommendations, the consensus conclusion is that
there is no single best clustering approach. Different methods are good for different
applications (Jain and Dubes, 1988). For example, if there is a priori information
that there are elongated clusters in the data, the single linkage chaining “defect” can
be viewed as an advantage to discover such cluster structures.
Python implementation of agglomerative clustering: There are two common
ways to implement agglomerative clustering in Python. One is based on the scikit-
learn AgglomerativeClustering class from sklearn.cluster module, and
the other is based on the linkage function from scipy.cluster.hierarchy
module. In the first approach, similar to any other scikit-learn estimator, we can
instantiate the class and use the fit() method to train the estimator. Two stop-
ping criteria that can be used are n_clusters (number of clusters to be found
[at the level that the algorithm stops]) and distance_threshold (the dissimilar-
ity at or above which no merging occurs). The type of linkage and the distance
metric between data points can be set by the linkage (possible choices: 'ward',
'complete', 'average', 'single') and the metric (e.g., 'euclidean') param-
eters, respectively. Cluster labels can be found by the labels_ attribute of a trained
AgglomerativeClustering object. In the following code, we use this approach
for both the single and the complete linkage clusterings of data points presented in
Example 12.8.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
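# (sketch completion, not the book's exact listing) the eight data points
# from Example 12.8
X = np.array([[1.5, 1], [1.5, 2], [1.5, -1], [1.5, -2],
              [-1.5, -1], [-1.5, -2], [-1.5, 1], [-1.5, 2]])
# single and complete linkage, stopping when two clusters remain
for link in ['single', 'complete']:
    agg = AgglomerativeClustering(n_clusters=2, linkage=link,
                                  metric='euclidean').fit(X)
    print(link + ' linkage labels: ' + str(agg.labels_))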
dissimilarity for merged clusters (the third column); and 3) the number of data
points in the newly formed cluster (the fourth column). It is common to use this
numpy array as the input to scipy.cluster.hierarchy.dendrogram function to
plot the clustering dendrogram. In the following code, we plot the dendrogram for
single linkage clustering in Example 12.8.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
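# (sketch completion, assuming X holds the eight points from Example 12.8)
X = np.array([[1.5, 1], [1.5, 2], [1.5, -1], [1.5, -2],
              [-1.5, -1], [-1.5, -2], [-1.5, 1], [-1.5, 2]])
linkage_matrix = linkage(X, method='single', metric='euclidean')
print('linkage matrix=')
print(linkage_matrix)
fig = plt.figure(figsize=(4, 3), dpi=150)
dendrogram(linkage_matrix)
plt.ylabel('dissimilarity', fontsize=8)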
linkage matrix=
[[ 0. 1. 1. 2.]
[ 2. 3. 1. 2.]
[ 4. 5. 1. 2.]
[ 6. 7. 1. 2.]
[ 8. 9. 2. 4.]
[10. 11. 2. 4.]
[12. 13. 3. 8.]]
Fig. 12.13: The dendrogram for the single linkage clustering of data points used in
Example 12.8 (the data point indices are presented next to the leaf nodes in the
tree).
Exercises:
Exercise 1: Generate six sets of training data such that each training data is pro-
duced by setting a constant µ to one of the values {3, 1.5, 1.2, 1.1, 0.9, 0.0} and then
randomly sampling four bivariate Gaussian distributions with an identical identity
covariance matrix and means: µ 0 = [µ, µ]T , µ 1 = [−µ, µ]T , µ 2 = [−µ, −µ]T , and
µ 3 = [µ, −µ]T . Generate 1000 observations from each distribution, and then con-
catenate them to create each training data (thus each training data has a size of 4000
× 2). The training data that is created in this way has naturally four distinct clusters
when µ = 3 and becomes a “null” data (all data points from a single cluster) when
µ = 0.
(A) Show the scatter plots of each training data, and plot the curve of overall av-
erage silhouette width as a function of K for K = 2, 3, . . . , 40. Describe your
observations about the behavior of the tail of these curves as µ decreases.
(B) What is the estimated optimal number of clusters based on the silhouette coeffi-
cient in each case?
(C) What is the silhouette coefficient for the case where µ = 3? What is the inter-
pretation of this value based on Table 12.1?
Exercise 2: Apply the single linkage and the complete linkage clustering methods
to the Iris training dataset that was used in Example 12.4 in this chapter. By merely
looking at the dendrogram, which method, in your opinion, has produced “healthier”
clusters?
at each iteration, and observe that the achieved squared errors (inertia) are the same
as the hand calculations obtained in Example 12.3. (Hint: set verbose parameter to
1 to observe inertia).
Exercise 4: Use hand calculations to apply the matrix updating algorithm with the
complete linkage to the data presented in Example 12.8.
Exercise 5: Prove the dissimilarity update rule presented in (12.19) for Ward’s
dissimilarity function.
A) s/2 B) s C) −s D) 0
Exercise 7: A clustering algorithm has produced the two clusters depicted in Fig.
12.14 where points belonging to cluster 1 and cluster 2 (a singleton cluster) are
identified by • and ×, respectively. Using Euclidean distance as the measure of
dissimilarity between data points, what is the overall average silhouette width for
this clustering (use hand calculations)?
A) 13/10    B) 13/20    C) 13/30    D) 13/40    E) 13/60
Chapter 13
Deep Learning with Keras-TensorFlow
Deep learning is a subfield of machine learning and has been applied in various
tasks such as supervised, unsupervised, semi-supervised, and reinforcement learning.
Among all families of predictive models that are used in machine learning, by deep
learning we exclusively refer to a particular class of models known as multilayered
Artificial Neural Networks (ANNs), which are partially inspired by our understanding
of biological neural circuits. Generally, an ANN is considered shallow if it has one or
two layers; otherwise, it is considered a deep network. Each layer in an ANN performs
some type of transformation of its input. As a result, multiple successive
layers in an ANN could potentially capture and mathematically represent complex
input-output relationships in a given dataset. In this chapter, we introduce some of the
key principles and practices used in learning deep neural networks. In this regard,
we use multi-layer perceptrons as a typical ANN and postpone other architectures to
later chapters. In terms of software, we switch to Keras with TensorFlow backend as
they are well-optimized for training and tuning various forms of ANN and support
various forms of hardware including CPU, GPU, or TPU.
Although the term “deep learning” was coined for the first time by Rina Dechter
in 1986 (Dechter, 1986), the history of its development goes well beyond that point.
The first attempt at a mathematical representation of the nervous system as a network
of (artificial) neurons was the seminal work of McCulloch and Pitts in 1943
(McCulloch and Pitts, 1943). They showed that under certain conditions, any arbitrary
Boolean function can be represented by a network of triggering units with a
threshold (neurons). Nevertheless, the strengths of interconnections between neurons
(synaptic weights) were not “learned” in their networks and had to be adjusted
manually. Rosenblatt studied a similar network of threshold units in 1962 under the
name perceptrons (Rosenblatt, 1962). He proposed an algorithm to learn the weights
of the perceptron. The first attempts at successfully implementing and learning a deep
neural network can be found in studies conducted by Ivakhnenko and his colleagues
in the mid 60s and early 70s (Ivakhnenko and Lapa, 1965; Ivakhnenko, 1971). The work
of Ivakhnenko et al. is essentially an implementation of a multi-layer perceptron-type
network. A few years later, some other architectures, namely, Cognitron (Fukushima,
1975) and Neocognitron (Fukushima, 1980), were introduced and became the foundation
of the modern convolutional neural network, which is itself one of the major
driving forces behind the current momentum that we observe today in deep learning.
In this section, we introduce the multilayer perceptron (MLP), which is one of the
basic forms of ANN with a long history and a strong track record of success in
many applications. The structure of the MLP is closely related to the Kolmogorov–Arnold
representation theorem (KA theorem). The KA theorem states that we can write any
continuous function f of p variables, f : [0, 1]^p → R, as a finite sum of compositions
of univariate functions; that is to say, for f there exist univariate functions h_i and
g_{ij} such that

f(x) = \sum_{i=1}^{2p+1} h_i \Big( \sum_{j=1}^{p} g_{ij}(x_j) \Big) ,   (13.1)

where x = [x_1, x_2, \dots, x_p]^T . However, the KA theorem does not provide a recipe for the
form of hi and gi j . Suppose we make choices as follows:
• gi j (x j ) takes the following form:
g_{ij}(x_j) = a_{ij} x_j + \frac{a_{i0}}{p} , \quad j = 1, 2, \dots, p ,   (13.2)
where ai0 (known as “biases”) and ai j are some unknown constants.
• hi (t) takes the following form:
h_i(t) = b_i \phi(t) + \frac{b_0}{k} , \quad i = 1, 2, \dots, k ,   (13.3)
where b0 (bias) and bi are some unknown constants and in contrast with (13.1),
rather than taking i from 1 to 2p + 1, we take it from 1 to k, which is an arbitrary
\sigma_{\text{logistic}}(t) = \frac{1}{1 + e^{-t}} .   (13.4)
Another commonly used form of sigmoid function in the context of neural
networks is the hyperbolic tangent, which is given by

\sigma_{\tanh}(t) = \tanh(t) = \frac{e^{t} - e^{-t}}{e^{t} + e^{-t}} .   (13.5)
Although less common compared with the previous two sigmoids, the inverse tangent
is another example of a sigmoid function:

\sigma_{\arctan}(t) = \frac{2}{\pi} \arctan(t) .   (13.6)
An important non-sigmoid activation function is ReLU (short for Rectified
Linear Unit), given by

\text{relu}(t) = \max\{0, t\} .   (13.7)
Substituting (13.2) and (13.3) into (13.1) yields

f(x) \approx b_0 + \sum_{i=1}^{k} b_i \, \phi\Big( a_{i0} + \sum_{j=1}^{p} a_{ij} x_j \Big) ,   (13.8)

which can be written compactly as

f(x) \approx b_0 + \sum_{i=1}^{k} b_i u_i ,   (13.9)

where

u_i = \phi\Big( a_{i0} + \sum_{j=1}^{p} a_{ij} x_j \Big) .   (13.10)
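To make (13.8)-(13.10) concrete, the following minimal NumPy sketch evaluates the
forward pass of a one-hidden-layer MLP with arbitrary random weights, using the
logistic sigmoid of (13.4) as φ:

import numpy as np

rng = np.random.default_rng(0)
p, k = 3, 4                           # number of features and hidden neurons
x = rng.normal(size=p)                # an arbitrary input vector
A = rng.normal(size=(k, p))           # the coefficients a_ij
a0 = rng.normal(size=k)               # the biases a_i0
b = rng.normal(size=k)                # the coefficients b_i
b0 = rng.normal()                     # the bias b_0
phi = lambda t: 1 / (1 + np.exp(-t))  # logistic sigmoid (13.4)
u = phi(a0 + A @ x)                   # (13.10)
f = b0 + b @ u                        # (13.9), equivalently (13.8)
print(f)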
Fig. 13.1: MLP with one hidden layer and k neurons. This is known as a three-layer
network with one input layer, one hidden layer, and one output layer.
As observed in Fig. 13.1 (also see (13.8)), each input feature is multiplied by a
coefficient and becomes part of the input to each of the k neurons in the hidden layer.
The outputs of these neurons are also multiplied by some weights, which, in turn, are
used as inputs to the output layer. Although in Fig. 13.1 we take y as the sum of the
inputs to the single neuron in that layer, we may add a sigmoid and/or a threshold
function after the summation if classification is desired. Furthermore, the output
layer itself can have a collection of artificial neurons, for example, if multiple target
variables should be estimated for a given input.
The network in Fig. 13.1 is also known as a three-layer MLP because of the three
layers used in its structure: 1) the input layer; 2) the hidden layer; and 3) the output
layer. A special case of this network is where there is no hidden layer and each input
feature is multiplied by a weight and is used as the input to the neuron in the output
layer. That leads to the structure depicted in Fig. 13.2, which is known as the perceptron.
Fig. 13.2: The perceptron: each input feature is multiplied by a weight and fed directly
to a single output neuron.
There are two important differences between (13.8) and (13.1) that are highlighted
here:
f(x) \approx c_0 + \sum_{l=1}^{k_2} c_l u_l ,   (13.11)

u_l = \phi\Big( b_{l0} + \sum_{i=1}^{k_1} b_{li} v_i \Big) , \quad 1 \le l \le k_2 ,   (13.12)

v_i = \phi\Big( a_{i0} + \sum_{j=1}^{p} a_{ij} x_j \Big) , \quad 1 \le i \le k_1 .   (13.13)
Fig. 13.3: MLP with two hidden layers and k 1 and k 2 neurons in the first and second
hidden layers, respectively.
Fig. 13.4: Schematic of the training process: the layers (linear and nonlinear operations)
map an observation to a predicted target, a loss function scores the predicted target
against the actual target, backprop computes the gradient of the loss, and the optimizer
uses it to update the weights.
So far we did not specify how the unknown weights in an MLP are adjusted. As in
many other learning algorithms, these weights are estimated from training data in
the training phase. In this regard, we use the iterative gradient descent process that
uses steepest descent direction to update the weights. Fig. 13.4 presents a schematic
description of this training process. In particular, the current weights of the network,
which are initially assigned to some random values, result in a predicted target. The
distance between the predicted target and the actual target is measured using a loss
function, and this distance is used as a feedback signal to adjust the weights.
To find the steepest descent direction, we need the gradient of the loss function
with respect to the weights. This is done efficiently using a two-stage algorithm known
as backpropagation (also referred to as backprop), which is essentially based on
the chain rule of calculus. The two stages in backprop include some computations
from the first to the last layer of the network (the first stage: forward computations)
and some computations from the last layer to the first (the second stage: backward
computations). Once the gradient is found using backprop, the adjustment to weights
in the direction of steepest descent is performed using an optimizer; that is to say,
the optimizer is the update rule for the weight estimates. There are two ways that we
can update the weights in the aforementioned process:
1. We can compute the value of the gradient of the loss function across all training
data, and then update the weights. This is known as batch gradient descent.
In this case, for a single update, all training data should be used in the backprop
and the optimizer, which then takes one single step in the direction of steepest
descent. At the same time, a single presentation of the entire training set to the
backprop is called an epoch. Therefore, in batch gradient descent, one update
for the weights occurs at the end of each epoch. In other words, the number of
epochs indicates the number of times the training data is presented to the
network, which is perceived as the amount of training the ANN receives. The number
of epochs is generally considered a hyperparameter and should be tuned during
model selection.
2. We can compute the value of the gradient of the loss function for a mini-batch
(a small subset of the training data), and then update the weights. This is known
as mini-batch gradient descent. Nevertheless, the concept of epoch is still the
same: a single presentation of the entire training set to the backprop. This means
that if we partition a training set of size n into mini-batches of size K << n,
then in an epoch we will have ⌈n/K⌉ weight updates (see the short check after
this list). In this context, the size of each mini-batch is called the batch size. A good
choice, for example, is 32, but in general it is a hyperparameter and should be tuned
during model selection.
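As a quick check of this arithmetic (the numbers correspond to the 45,000-image MNIST
training split and the batch size of 32 used later in this chapter):

import math

n, batch_size = 45000, 32           # training set size and mini-batch size
print(math.ceil(n / batch_size))    # 1407 weight updates per epoch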
Although we can use scikit-learn to train certain types of ANNs, we will not use it
for that purpose because:
1. training “deep” neural networks requires estimating and adjusting many param-
eters and hyperparameters, which is a computationally expensive process. As a
result, successfully training various forms of neural networks highly relies on parallel
computations, which are realized using Graphical Processing Units (GPUs)
or Tensor Processing Units (TPUs). However, scikit-learn currently does not
support using GPU or TPU.
2. scikit-learn does not currently implement some popular forms of neural networks
such as convolutional neural networks or recurrent neural networks.
As a result, we switch to Keras with TensorFlow backend as they are particularly
designed for training and tuning various forms of ANN and support various forms
of hardware including CPU, GPU, or TPU.
As previously mentioned in Chapter 1, simplicity and readability along with a
number of third-party packages have made Python the most popular programming
language for data science. When it comes to deep learning, the Keras library has become
so popular that its inventor, Francois Chollet, refers to Keras
as “the Python of deep learning” (Chollet, 2021). In a recent survey conducted by
Kaggle in 2021 on “State of Data Science and Machine Learning” (Kaggle-survey,
2021), Keras usage among data scientists was ranked fourth (47.3%) only preceded
by xgboost library (47.9%), TensorFlow (52.6%), and scikit-learn (82.3%). But what
is Keras?
Keras is a deep learning API written in Python. It provides high-level “building
blocks” for efficiently training, tuning, and evaluating deep neural networks. How-
ever, Keras relies on a backend engine, which does all the low-level operations such
as tensor products, differentiation, and convolution. Keras was originally released to
support the Theano backend, which was developed at Université de Montréal. Several
months later, when TensorFlow was released by Google, Keras was refactored to
support both Theano and TensorFlow. In 2017, Keras even supported some
additional backends such as CNTK and MXNet. Later, in 2018, TensorFlow adopted
Keras as its high-level API. In September 2019, TensorFlow 2.0 was released and
Keras 2.3 became its last multi-backend release and it was announced that: “Going
forward, we recommend that users consider switching their Keras code to tf.keras
in TensorFlow 2.0. It implements the same Keras 2.3.0 API (so switching should be
as easy as changing the Keras import statements)” (Keras-team, 2019).
If Anaconda was installed as explained in Section 2.1, TensorFlow can easily be
installed using conda by executing the following command line in a terminal (Tensor-
flow, 2023): conda install -c conda-forge tensorflow. We can also check
their version as follows:
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)
2.9.2
2.9.0
1.0.2
By default, running codes in Colab is done over CPUs provided by Google servers.
However, to utilize GPU or even TPU, we can go to “Runtime –> Change runtime
type” and choose GPU/TPU as the Hardware accelerator (see Fig. 13.7). However,
training ANN models on a TPU requires some extra work that is not
needed when a GPU is used. Because of that, and because for our purpose here (and
in many other applications) a GPU suffices, we skip those details
(details for that purpose can be found at (Tensorflow-tpu, 2023)).
Once a GPU is selected as the Hardware accelerator, we can ensure the GPU is used
as follows:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.
,→list_physical_devices('GPU')))
The MNIST dataset has a training set of 60,000 grayscale 28 × 28 images of hand-
written digits, and an additional test set of size 10,000. It is part of the keras.datasets
module and can be loaded as four numpy arrays as follows (in the code below, and to
have reproducible results as we proceed with this application, we also set the random
seeds for Python, NumPy, and TensorFlow):
seed_value= 42
# set the seed for Python built-in pseudo-random generator
import random
random.seed(seed_value)
# set the seed for numpy pseudo-random generator
import numpy as np
np.random.seed(seed_value)
# set the seed for tensorflow pseudo-random generator
import tensorflow as tf
tf.random.set_seed(seed_value)
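# (sketch completion) load MNIST as four numpy arrays and inspect the training part;
# the quantities printed here are assumed from the output shown below
(X_train_val, y_train_val), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
print(type(X_train_val))
print(X_train_val.dtype)
print(X_train_val.shape)
print(X_train_val.ndim)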
<class 'numpy.ndarray'>
uint8
(60000, 28, 28)
3
As can be seen, the data type is uint8, which represents unsigned integers from 0 to 255.
Before proceeding any further, we first scale down the data to [0, 1] by dividing
each entry by 255—here rescaling is not data-driven and we use prior knowledge in
doing so. The default data type of such a division would be double precision. As the
memory itself is an important factor in training deep ANNs, we also change the data
type to float32.
X_train_val, X_test = X_train_val.astype('float32')/255, X_test.astype('float32')/255
print(X_train_val.dtype)
float32
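Before the train/validation split, each 28 × 28 image also has to be flattened into a
784-dimensional vector, as the shapes printed next confirm; a minimal sketch of that
step, assuming plain NumPy reshaping is used (rather than a Flatten layer, which is
discussed later), is:

# flatten each 28 x 28 image into a vector of length 784
X_train_val = X_train_val.reshape(-1, 28 * 28)
X_test = X_test.reshape(-1, 28 * 28)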
X_train_val[0].shape
(784,)
For the model selection purposes, we randomly split X_train_val into training and
validation sets:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                  stratify=y_train_val, test_size=0.25)
X_train.shape
(45000, 784)
All models in Keras are objects of tf.keras.Model class. There are three ways to
build object of this class (to which we simply refer as models): 1) Sequential API;
2) Functional API; and 3) Subclassing API. Sequential models are used when a stack
of standard layers is desired, where each layer has one input tensor and one output
tensor. This is what we are going to use hereafter. The functional and subclassing
are used for cases when multiple inputs/outputs (for example to locate and classify
an object in an image) or more complex uncommon architectures are desired. There
are two ways to use Sequential models:
• Method 1: pass a list of layers to keras.Sequential class constructor to
instantiate.
• Method 2: instantiate keras.Sequential first with no layer and then use add()
method to add layers one after another.
There are many layers available in keras.layers API. Each layer consists of a
recipe for mapping the input tensor to the output tensor. At the same time, many layers
have state, which means they hold weights that are learned from a given dataset by the
learning algorithm. However, some layers are stateless—they are only used for reshaping an
input tensor. A complete list of layers in Keras can be found at (Keras-layers, 2023).
Next we examine the aforementioned two ways to build a Sequential model.
Method 1:
from tensorflow import keras
from tensorflow.keras import layers
mnist_model = keras.Sequential([
layers.Dense(128, activation="sigmoid"),
layers.Dense(64, activation="sigmoid"),
layers.Dense(10, activation="softmax")
])
Method 2:
from tensorflow import keras
from tensorflow.keras import layers
mnist_model = keras.Sequential()
mnist_model.add(layers.Dense(128, activation="sigmoid"))
mnist_model.add(layers.Dense(64, activation="sigmoid"))
mnist_model.add(layers.Dense(10, activation="softmax"))
Let us elaborate on this code. A Dense layer, also known as a fully connected layer,
is indeed the same as the hidden layers that we have already seen in Figs. 13.1 and 13.3
for MLP. They are called fully connected layers because every input to a layer is
connected to every neuron in the layer. In other words, the inputs to each layer, which
could be the outputs of neurons in the previous layer, go to every neuron in the layer.
While the use of activation functions (here logistic sigmoid) in both hidden layers
is based on (13.12) and (13.13), the use of softmax in the last layer is due to the
multiclass (single label) nature of this classification problem. To understand this,
first note that the number of neurons in the output layer is the same as the number of
digits (classes). This means that for each input feature vector x, the model will output
a c-dimensional vector (here c = 10). The softmax function takes this c-dimensional
vector, say a = [a0, a1, ..., a(c−1) ]T , and normalizes it into a probability vector
softmax(a) = p = [p0, p1, . . . , p(c−1) ]T (whose elements add up to 1) according to:
p_i = \frac{e^{a_i}}{\sum_{l=0}^{c-1} e^{a_l}} , \quad i = 0, \dots, c - 1 .   (13.14)
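As a quick numeric illustration of (13.14), using an arbitrary three-dimensional output
vector a:

import numpy as np

a = np.array([2.0, 1.0, 0.1])      # an arbitrary output vector
p = np.exp(a) / np.exp(a).sum()    # softmax as in (13.14)
print(p, p.sum())                  # a probability vector adding up to 1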
mnist_model = keras.Sequential([
    layers.Dense(128, activation="sigmoid", input_shape = X_train[0].shape),
    layers.Dense(64, activation="sigmoid"),
    layers.Dense(10, activation="softmax")
])
...,
[-1.50822967e-01, 6.51585162e-02, 1.69166803e-01,
-1.23789206e-01, 1.93499446e-01, 2.29302317e-01,
2.81352669e-01, -2.16220707e-01, -5.38927913e-02,
-2.35425845e-01]], dtype=float32)>,
<tf.Variable 'dense_239/bias:0' shape=(10,) dtype=float32,
,→numpy=array([0., 0.,
We can use the summary() method, which tabulates the output shape and the total
number of parameters that are/should be estimated in each layer.
mnist_model.summary()
Model: "sequential_76"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_237 (Dense) (None, 128) 100480
_________________________________________________________________
dense_238 (Dense) (None, 64) 8256
_________________________________________________________________
dense_239 (Dense) (None, 10) 650
=================================================================
Total params: 109,386
Trainable params: 109,386
Non-trainable params: 0
_________________________________________________________________
same rank-3 tensor as the input (without reshaping). In that case the model
is defined as follows:
# mnist_model = keras.Sequential([
#     layers.Flatten(),
#     layers.Dense(128, activation="sigmoid"),
#     layers.Dense(64, activation="sigmoid"),
#     layers.Dense(10, activation="softmax")
# ]) # here we have not specified the input_shape but we can do so
#    # by layers.Flatten(input_shape=(28,28))
After building the model structure in Keras, we need to compile the model. Compiling
in Keras is the process of configuring the learning process via compile() method
of keras.Model class. In this regard, we set the choice of optimizer, loss function,
and metrics to be computed through the learning process by setting the optimizer,
loss, and metrics parameters of compile() method, respectively. Here we list
some possible common choices for classification and regression:
1. optimizer can be set by replacing name in optimizer = name, for exam-
ple, with one of the following strings where each refers to an optimizer known
with the same name: adam, rmsprop, sgd, or adagrad (for a complete list
of optimizers supported in Keras see (Keras-optimizers, 2023)). Doing so
uses default values of parameters in the class implementing each optimizer.
If we want to have more control over the parameters of an optimizer, we need
to set the optimizer parameter to an object of an optimizer class; for example,
optimizer = keras.optimizers.Adam(learning_rate=0.01), using
which we change the 0.001 default value of learning_rate used in the Adam
optimizer to 0.01.
2. metrics can be set by replacing name(s) in metrics = [name(s)] with
a list of strings where each refers to a metric with the same name; for exam-
ple, accuracy (for classification), or mean_squared_error (for regression).
A more flexible way is to use the class implementing the metric; for exam-
ple, metrics = ["mean_squared_error", "mean_absolute_error"] is
equivalent to
metrics = [keras.metrics.MeanSquaredError(),
keras.metrics.MeanAbsoluteError()]
For a complete list of metrics refer to (Keras-metrics, 2023).
3. loss can be set by replacing name in loss = name with one of the following
strings where each refers to a loss known with the same name:
• binary_crossentropy (for binary classification as well as multilabel classification,
in which more than one label is assigned to an instance);
• categorical_crossentropy (for multiclass classification when target
values are one-hot encoded);
• sparse_categorical_crossentropy (for multiclass classification when
target values are integer encoded),
• mean_squared_error and mean_absolute_error for regression.
For a complete list of losses supported in Keras see (Keras-losses, 2023). We
can also pass an object of the class implementing the loss function; for exam-
ple, loss = keras.losses.BinaryCrossentropy() is equivalent to loss
= binary_crossentropy.
In case of binary classification (single-label) and assuming the labels are already
encoded as integers 0 and 1, if we use binary_crossentropy, in the last layer we
should use one neuron and sigmoid as the activation function. The output of activa-
tion function in the last layer in this case is the probability of class 1. Alternatively, the
problem can also be treated as a multiclass classification using two neurons in the last layer
with a softmax activation and then using sparse_categorical_crossentropy. In
this case, the output of the last layer is the probability of classes 0 and 1. Here are
examples to see this equivalence:
true_classes = [0, 1, 1] # actual labels
class_probs_per_obs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]] # the outputs of softmax for two neurons
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(true_classes, class_probs_per_obs).numpy()
0.27977657
true_classes = [0, 1, 1]
class_1_prob_per_obs = [0.1, 0.8, 0.6] # the outputs of sigmoid for one neuron
bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce(true_classes, class_1_prob_per_obs).numpy()
0.2797764
Answer to Q2: Integer encoding, also known as ordinal encoding, treats un-
ordered categorical values as if they were numeric integers. For example,
a class label, which could be either “Red”, “Blue”, or “Green”, is treated
as 0, 1, and 2. This type of encoding raises questions regarding the rela-
tive magnitudes assigned to categories after numeric transformations; after
all, categorical features have no relative “magnitude” but numbers do. On
the other hand, one-hot encoding replaces a categorical variable with N
possible categories by the standard basis vectors of the N-dimensional Euclidean
vector space. For example, “Red”, “Blue”, and “Green” are replaced by
[1, 0, 0]T , [0, 1, 0]T , and [0, 0, 1]T , respectively. As a result, these orthogonal
binary vectors are equidistant with no specific order, which compared with
the outcomes of integer encoding, align better with the intuitive meaning of
categorical variables. However, because each utilized standard basis vector
lives in an N dimensional vector space, one-hot encoding can blow up the
size of data matrix if many variables with many categories are present.
Answer to Q3: The answer to this question lies in the nature of backprop.
As we said before, in backprop we efficiently evaluate the steepest descent
direction for which we need the gradient of the loss function with respect
to weights. However, using a loss such as 1 − accuracy is not suitable for
this purpose because this is a piecewise constant function of weights, which
means the gradient is either zero (the weights do not move) or undefined.
Therefore, other loss functions, such as categorical_crossentropy for
classification, are commonly used. The use of the mean_absolute_error loss
function for regression is justified by defining its derivative at zero as a
constant such as zero.
We now compile the previously built mnist_model. Although the labels stored
in y_train, y_val, and y_test are already integer encoded (and we can use
sparse_categorical_crossentropy to compile the model), here we show
the utility of to_categorical() for one-hot encoding and, therefore, we use
categorical_crossentropy to compile the model:
mnist_model.compile(optimizer="adam", loss="categorical_crossentropy",
                    metrics=["accuracy"])
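# (sketch completion) one-hot encode the integer labels with to_categorical;
# the shapes printed below correspond to y_train and its one-hot encoded version
from tensorflow.keras.utils import to_categorical
y_train_1_hot = to_categorical(y_train)
y_val_1_hot = to_categorical(y_val)
print(y_train.shape)
print(y_train_1_hot.shape)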
(45000,)
(45000, 10)
categorical_crossentropy and sparse_categorical_crossentropy:
As we said before, depending on whether we use one-hot encoded labels or integer
encoded labels, we need to use either the categorical_crossentropy or the
sparse_categorical_crossentropy. But what is the difference between them?
Suppose for a multiclass (single label) classification with c classes, a training
data Str = {(x1, y1 ), (x2, y2 ), ..., (xn, yn )} is available. Applying one-hot encoding on
y j , j = 1, . . . , n, leads to label vectors y j = [y j0, ..., y j(c−1) ] where y ji ∈ {0, 1}, i =
0, . . . , c − 1. Categorical cross-entropy (also known as cross-entropy) is defined as
e(\theta) = -\frac{1}{n} \sum_{j=1}^{n} \sum_{i=0}^{c-1} y_{ji} \log P(Y_j = i \,|\, \mathbf{x}_j ; \theta) ,   (13.15)
where θ denotes ANN parameters (weights). Comparing this expression with the
cross-entropy expression presented in Eq. (6.39) in Section 6.2.2 shows that (13.15)
is indeed the extension of the expression presented there to multiclass classification
with one-hot encoded labels. Furthermore, P(Yj = i|x j ; θ) denotes the probability
that a given observation x j belongs to class i. In multiclass classification using ANN,
we take this probability as the output of the softmax function when x j is used as the
input to the network; that is to say, we estimate P(Yj = i|x j ; θ) by the value of pi
defined in (13.14) for the given observation. Therefore, we can rewrite (13.15) as:
e(\theta) = -\frac{1}{n} \sum_{j=1}^{n} \mathbf{y}_j^T \log \mathbf{p}_j ,   (13.16)
where here log(.) is applied element-wise and p j = [p j0, p j1, . . . , p j(c−1) ]T is the
vector output of softmax for a given input x j . This is what is implemented by
categorical_crossentropy. The sparse_categorical_crossentropy im-
plements a similar function as in (13.16) except that rather than one-hot encoded
labels, it expects integer encoding with labels y j in the range of 0, 1, . . . , c − 1. This
way, we can write (sparse) cross-entropy as:
e(\theta) = -\frac{1}{n} \sum_{j=1}^{n} \log p_{j y_j} .   (13.17)
Example 13.1 Suppose we have a class label with possible values, Red, Blue,
and Green. We use one-hot encoding and represent them as Red : [1, 0, 0]T ,
Blue : [0, 1, 0]T , and Green : [0, 0, 1]T . Two observations x1 and x2 are presented
to an ANN with some values for its weights. The true classes of x1 and x2 are
Green and Blue, respectively. The output of the network after the softmax for x1 is
[0.1, 0.4, 0.5]T and its output for x2 is [0.05, 0.8, 0.15]T . What is the cross-entropy?
The classes of x1 and x2 are [0, 0, 1]T and [0, 1, 0]T , respectively; that is, y1 =
[0, 0, 1]T and y2 = [0, 1, 0]T . At the same time, p1 = [0.1, 0.4, 0.5]T and p2 =
[0.05, 0.8, 0.15]T . Therefore, (13.16) yields
e(\theta) = -\frac{1}{2}\big[\, \mathbf{y}_1^T \log \mathbf{p}_1 + \mathbf{y}_2^T \log \mathbf{p}_2 \,\big] = -\frac{1}{2}\big[ \log(0.5) + \log(0.8) \big] = 0.458 .   (13.18)
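The same value can be obtained programmatically; a minimal sketch using Keras'
CategoricalCrossentropy class (not shown in the surrounding listings) with the one-hot
encoded labels and softmax outputs above is:

from tensorflow import keras

labels_1_hot = [[0., 0., 1.], [0., 1., 0.]]
class_probs_per_obs = [[0.1, 0.4, 0.5], [0.05, 0.8, 0.15]]
cce = keras.losses.CategoricalCrossentropy()
cce(labels_1_hot, class_probs_per_obs).numpy()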
0.45814535
For the sparse categorical cross-entropy, the labels are expected to be integer
encoded; that is to say, y1 = 2 and y2 = 1. Therefore, we can pick the corresponding
probabilities at index 2 and 1 (assuming the first element in each probability vector
has index 0) from p1 and p2 , respectively, to write
e(\theta) = -\frac{1}{2}\big[ \log(0.5) + \log(0.8) \big] = 0.458 .   (13.19)
We can also use keras.losses.SparseCategoricalCrossentropy() to com-
pute this:
labels = [2, 1]
class_probs_per_obs = [[0.1, 0.4, 0.5], [0.05, 0.8, 0.15]]
scce = keras.losses.SparseCategoricalCrossentropy()
scce(labels, class_probs_per_obs).numpy()
0.45814538
As we can see in both cases (using categorical cross-entropy with one-hot encoded
labels and sparse categorical cross-entropy with integer encoded labels), the loss is
the same.
——————————————————————————————-
13.5.4 Fitting
Once a model is compiled, training the model is performed by calling the fit()
method of the keras.Model class. There are several important parameters of the fit()
method that warrant particular attention:
1. x: represents the training data matrix;
2. y: represents the training target values;
3. batch_size: can be used to set the batch size used in the mini-batch gradient
descent (see Section 13.2);
4. epochs: can be used to set the number of epochs to train the model (see Section
13.2);
5. validation_data: can be used to specify the validation set and is generally in
the form of (x_val, y_val);
6. steps_per_epoch: this is used to set the number of batches processed per
epoch. The default value is None, which, for a batch size of K, sets the number
of batches per epoch to the conventional value of ⌈n/K⌉, where n is the sample
size. This option would be helpful, for example, when n is such a large number
that each epoch may take a long time if ⌈n/K⌉ is used;
7. callbacks: a list of callback objects. A callback is an object that is called
at various stages of training to take an action. There are several callbacks imple-
mented in Keras, but a combination of the following two is quite handy in many
cases (each of these accepts multiple parameters [see their documentation], but
here we present them along with a few of their useful parameters):
• keras.callbacks.EarlyStopping(monitor="val_loss",patience=0)
Using this callback, we stop training when a monitored metric (the monitor param-
eter) has not improved for a certain number of epochs (patience). Why do
we need to stop training? On one hand, training an ANN with many epochs may
lead to overfitting on the training data. On the other hand, too few epochs could
lead to underfitting, which is the situation where the model is not capable of
capturing the relationship between input and output even on training data. Using
early stopping we can set a large number of epochs that could be potentially used
for training the ANN but stop the training process when no improvement in the
monitored metric is observed after a certain number of epochs.
• keras.callbacks.ModelCheckpoint(filepath="file_path.keras",
monitor="val_loss", save_best_only=False, save_freq="epoch")
By default, this callback saves the model at the end of each epoch. This default
behaviour can be changed by setting save_freq to an integer, in which case
the model is saved at the end of that many batches. Furthermore, we often wish
to save the best model at the end of an epoch only if the monitored metric (monitor)
has improved. This behaviour is achieved by save_best_only=True, in which case
the saved model is overwritten only when the monitored metric improves. We
can also set verbose to 1 to see additional information about when the model
is/is not saved.
Here we first prepare a list of callbacks and then train our previously compiled model:
import time
my_callbacks = [
keras.callbacks.EarlyStopping(
monitor="val_accuracy",
patience=20),
keras.callbacks.ModelCheckpoint(
filepath="model/best_model.keras",
monitor="val_loss",
save_best_only=True,
verbose=1)
]
start = time.time()
history = mnist_model.fit(x = X_train,
y = y_train_1_hot,
batch_size = 32,
epochs = 200,
validation_data = (X_val, y_val_1_hot),
callbacks = my_callbacks)
end = time.time()
training_duration = end - start
print("training duration = {:.3f}".format(training_duration))
Epoch 1/200
1407/1407 [==============================] - 2s 1ms/step - loss: 0.9783 -
accuracy: 0.7587 - val_loss: 0.2579 - val_accuracy: 0.9225
model/best_model.keras
Epoch 2/200
1407/1407 [==============================] - 2s 1ms/step - loss: 0.2295 -
accuracy: 0.9330 - val_loss: 0.1773 - val_accuracy: 0.9485
...,
Epoch 13/200
1407/1407 [==============================] - 2s 1ms/step - loss: 0.0137 -
accuracy: 0.9971 - val_loss: 0.0888 - val_accuracy: 0.9753
...,
...,
Epoch 67/200
1407/1407 [==============================] - 1s 1ms/step - loss: 0.0011 -
accuracy: 0.9997 - val_loss: 0.1728 - val_accuracy: 0.9756
epoch_count = range(1, len(history.history['loss']) + 1)
plt.subplot(121)
plt.plot(epoch_count, history.history['loss'], 'b', label='training loss')
plt.plot(epoch_count, history.history['val_loss'], 'r', label='validation loss')
Fig. 13.8: The loss (left) and the accuracy (right) of mnist_model as functions of
epoch on both the training set and the validation set.
Here we put together all the aforementioned code for this MNIST classification
problem in one place:
import time
import random
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split
seed_value = 42
random.seed(seed_value)
# data preprocessing
(X_train_val, y_train_val), (X_test, y_test) = mnist.load_data()
X_train_val, X_test = X_train_val.astype('float32')/255, X_test.astype('float32')/255
X_train_val, X_test = X_train_val.reshape(60000, 28*28), X_test.reshape(10000, 28*28)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
    stratify=y_train_val, test_size=0.25)
y_train_1_hot = keras.utils.to_categorical(y_train, num_classes=10)
y_val_1_hot = keras.utils.to_categorical(y_val, num_classes=10)
y_test_1_hot = keras.utils.to_categorical(y_test, num_classes=10)
# building model
mnist_model = keras.Sequential([
    layers.Dense(128, activation="sigmoid", input_shape=X_train[0].shape),
    layers.Dense(64, activation="sigmoid"),
    layers.Dense(10, activation="softmax")
])
# compiling model
mnist_model.compile(optimizer="adam", loss="categorical_crossentropy",
                    metrics=["accuracy"])
# training model
my_callbacks = [
keras.callbacks.EarlyStopping(
monitor="val_accuracy",
patience=20),
keras.callbacks.ModelCheckpoint(
filepath="best_model.keras",
monitor="val_loss",
save_best_only=True,
verbose=1)
]
start = time.time()
history = mnist_model.fit(x = X_train,
y = y_train_1_hot,
batch_size = 32,
epochs = 200,
validation_data = (X_val, y_val_1_hot),
callbacks=my_callbacks)
end = time.time()
training_duration = end - start
print("training duration = {:.3f}".format(training_duration))
print(history.history.keys())
print(mnist_model.summary())
Note that the model saved by ModelCheckpoint with save_best_only=True coincides
with the model at the best epoch found by EarlyStopping only if the two callbacks
monitor the same metric; otherwise, this is not necessarily the case. In the above
application, for example, the best model (in terms of val_loss) is saved at epoch 13,
while the simulations stopped at epoch 67 (based on val_accuracy with a patience of 20).
We can evaluate the performance of a trained network on given test data using
the evaluate() method of the keras.Model class. It returns the loss and the metric(s)
that were specified when compiling the model. For efficiency purposes (and to facilitate
parallel computation on a GPU), the computation occurs over batches, which by default
have a size of 32 (this can be changed).
Furthermore, when we talk about evaluating a model, we generally wish to
use the “best” model; that is, the model that led to the best value of the metric
used in callbacks.ModelCheckpoint(monitor=metric). Note that, when ModelCheckpoint
is used with save_best_only=True, this is the model that has been saved to file by the
end of the training process, and it is generally different from the fitted model as it
stands at the end of the training process. For example, in the MNIST classification that
we have seen so far, we would want to use best_model.keras, and not mnist_model,
which holds the weights obtained at the end of the training process. Therefore, we need to load and use
the saved best model for prediction:
# load and evaluate the "best" model
best_mnist_model = keras.models.load_model("best_model.keras")
loss, accuracy = best_mnist_model.evaluate(X_test, y_test_1_hot, verbose=1)
Similarly, we would like to use the “best” model for predicting targets for new
observations (inference); in the above code, once the model is loaded, it is called
best_mnist_model. In this regard, we have a couple of options, as discussed next:
1. we can use the predict() method of the keras.Model class to obtain the probabilities
of an observation belonging to each class. The computation is done over batches,
which is useful for large-scale inference:
prob_y_test_pred = best_mnist_model.predict(X_test)
print("the size of predictions is (n_sample x n_classes):", prob_y_test_pred.shape)
print("class probabilities for the first instance:\n", prob_y_test_pred[0])
print("the assigned class for the first instance is: ",
      prob_y_test_pred[0].argmax())  # the one with the highest probability
print("the actual class for the first instance is: ", y_test_1_hot[0].argmax())
y_test_pred = prob_y_test_pred.argmax(axis=1)
print("predicted classes for the first 10 instances:", y_test_pred[:10])
At this stage, as we can compute the class probabilities (scores) and/or the classes
themselves, we may want to use various functions of sklearn.metrics to evaluate
the final model using different metrics:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
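A minimal sketch of how the metrics reported below could be computed with these functions (variable names follow the previous snippets; the exact code used here is not shown in full):

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

print("Accuracy = {:.4f}".format(accuracy_score(y_test, y_test_pred)))
print("Confusion Matrix is")
print(confusion_matrix(y_test, y_test_pred))
print("Macro Average ROC AUC = {:.3f}".format(
    roc_auc_score(y_test, prob_y_test_pred, multi_class='ovr', average='macro')))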
Accuracy = 0.9779
Confusion Matrix is
[[ 971 0 1 0 1 1 1 3 1 1]
[ 0 1124 2 3 0 1 2 1 2 0]
[ 4 1 1005 8 2 0 2 3 7 0]
[ 0 0 4 989 0 5 0 6 4 2]
[ 1 0 2 0 950 0 6 3 3 17]
[ 3 0 0 6 1 873 4 0 4 1]
[ 8 3 1 0 2 4 938 0 2 0]
[ 2 4 9 4 0 0 0 1005 1 3]
[ 4 1 1 3 3 5 2 3 949 3]
[ 3 5 0 4 8 5 0 7 2 975]]
Macro Average ROC AUC = 1.000
It is quite likely that executing the MNIST classification code on a typical modern
laptop/desktop with no GPU hardware/software installed (i.e., using only CPUs) is
more efficient (faster) than using GPUs (e.g., available through Google Colab). Why
is that the case?
GPUs are best suited to cases where a large amount of computation can
be done in parallel. At the same time, the computations in training neural networks
are embarrassingly parallelizable. However, we may not see the advantage of using
GPUs with respect to CPUs if only “small”-scale computations are required (for example,
training a small neural network as in this example). To examine the advantage
of using a GPU with respect to a CPU, in Exercise 1 we scale up the size of the neural
network used in this example and compare the execution times.
Overfitting: Here is a comment about overfitting from renowned figures in the field
of machine learning and pattern recognition (Duda et al., 2000, p. 16):
While an overly complex system may allow perfect classification of the training samples, it
is unlikely to perform well on new patterns. This situation is known as “overfitting”.
A natural question is then how we can judge overfitting using the data at hand. Recall what we do in model selection when
we try to tune a hyperparameter on a validation set (or using cross-validation if we
desire to remove the bias to a specific validation set): we monitor the performance
of the model on the validation set as a function of the hyperparameter and pick the
value of the hyperparameter that leads to the best performance. The reason we use a
validation set is that the performance on it serves as a proxy for the performance of
the model on the test set. Therefore, we can use it to judge the possibility of overfitting
as well. In this regard, Duda et al. write (Duda et al., 2000, p. 16):
For most problems, the training error decreases monotonically during training [as a func-
tion of hyperparameter],. . . . Typically, the error on the validation set decreases, but then
increases, an indication that the classifier may be overfitting the training data. In validation,
training or parameter adjustment is stopped at the first minimum of the validation error.
With this explanation about overfitting, let us examine the performance curves
obtained before for the MNIST application. Looking into the loss curve as a function
of epoch in Fig. 13.8 shows that after epoch 13, the validation loss increases, which
is an indicator that the classifier may be overfitting. There are various ways to prevent
possible overfitting. A popular approach is the concept of dropout that we discuss
next.
Dropout: Since its conception in (Srivastava et al., 2014), dropout has become a
popular approach to guard against overfitting. The idea is to randomly omit each
neuron in hidden layers, or even each input feature, with a pre-specified probability
known as the dropout rate. Although the dropout rate is typically set between 0.2 and 0.5,
it is itself a hyperparameter that should ideally be tuned in the process of model
selection. That said, the dropout rate for an input feature is typically lower (e.g.,
0.2) than that for a neuron in a hidden layer (e.g., 0.5). By omitting a unit, we naturally
omit any connection to or from that unit as well. Fig. 13.9 shows what is meant by
omitting some neurons/features along with their incoming and outgoing connections.
To facilitate implementation, the same dropout rate is generally assumed for all
units (neurons or input features) in the same layer.
Fig. 13.9: (a) an MLP when no dropout is used; (b) an MLP when a dropout is used
The mini-batch gradient descent that was discussed before is used to train neural
networks with dropout. The main difference lies in training where for each training
instance in a mini-batch, a network is sampled by dropping out some units from the
full architecture. Then the forward and backward computations that are part of the
backprop algorithm are performed for that training instance in the mini-batch over
the sampled network. The gradient for each weight, which is used by the optimizer to
update that weight, is averaged over the training instances in that mini-batch. In this regard,
if for a training instance, its corresponding sampled network does not have a weight,
the contribution of that training instance towards the gradient is set to 0. At the end
of this process, we obtain estimates of weights for the full network (i.e., all units and
their connections with no dropout). In the testing stage, the full network is used (i.e.,
all connections with no dropout). However, the outgoing weights from a unit are
multiplied by (1 − p), where p is the dropout rate for that unit. This approximation
accounts for the fact that the unit was active during training with probability (1 − p).
In other words, by this simple approximation and using the full network in testing
we do not need to use many sampled networks with shared weights that were used
in training.
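As a tiny numerical illustration of this (1 − p) scaling (this only sketches the expectation argument in NumPy; it is not how Keras implements dropout internally):

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                        # dropout rate of the unit
unit_output, w_out = 2.0, 0.8  # the unit's activation and one outgoing weight

# training: the unit is kept (active) with probability (1 - p)
kept = rng.random(100_000) > p
avg_train_contribution = np.mean(kept * unit_output * w_out)

# testing: full network, but the outgoing weight is scaled by (1 - p)
test_contribution = unit_output * w_out * (1 - p)
print(avg_train_contribution, test_contribution)  # both are close to 0.8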
In Keras, there is a specific layer, namely the keras.layers.Dropout class, that
can be used to implement dropout. In this regard, if dropout for a specific layer
is desired, a Dropout layer should be added right after that layer. Let us now reimplement
the MNIST application (for brevity part of the output is omitted):
import time
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split
seed_value = 42
# data preprocessing (as before)
(X_train_val, y_train_val), (X_test, y_test) = mnist.load_data()
X_train_val, X_test = X_train_val.astype('float32')/255, X_test.astype('float32')/255
X_train_val, X_test = X_train_val.reshape(60000, 28*28), X_test.reshape(10000, 28*28)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
    stratify=y_train_val, test_size=0.25)
y_train_1_hot = keras.utils.to_categorical(y_train, num_classes=10)
y_val_1_hot = keras.utils.to_categorical(y_val, num_classes=10)
y_test_1_hot = keras.utils.to_categorical(y_test, num_classes=10)
# building model (a dropout layer after each hidden dense layer)
mnist_model = keras.Sequential([
    layers.Dense(128, activation="sigmoid", input_shape=X_train[0].shape),
    layers.Dropout(0.3),
    layers.Dense(64, activation="sigmoid"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax")
])
# compiling model
mnist_model.compile(optimizer="adam", loss="categorical_crossentropy",
,→metrics=["accuracy"])
# training model
my_callbacks = [
keras.callbacks.EarlyStopping(
monitor="val_accuracy",
patience=20),
keras.callbacks.ModelCheckpoint(
filepath="best_model.keras",
monitor="val_loss",
save_best_only=True,
verbose=1)
]
start = time.time()
history = mnist_model.fit(x = X_train,
y = y_train_1_hot,
batch_size = 32,
epochs = 200,
validation_data = (X_val, y_val_1_hot),
callbacks=my_callbacks)
end = time.time()
training_duration = end - start
print("training duration = {:.3f}".format(training_duration))
print(history.history.keys())
print(mnist_model.summary())
epoch_count = range(1, len(history.history['loss']) + 1)
plt.subplot(121)
plt.plot(epoch_count, history.history['loss'], 'b', label='training loss')
plt.plot(epoch_count, history.history['val_loss'], 'r', label='validation loss')
plt.legend()
plt.ylabel('loss')
plt.xlabel('epoch')
plt.subplot(122)
plt.plot(epoch_count, history.history['accuracy'], 'b', label='training accuracy')
plt.plot(epoch_count, history.history['val_accuracy'], 'r', label='validation accuracy')
plt.legend()
plt.ylabel('accuracy')
plt.xlabel('epoch')
Epoch 1/200
1407/1407 [==============================] - 3s 2ms/step - loss: 1.2182 -
accuracy: 0.6240 - val_loss: 0.2984 - val_accuracy: 0.9124
...
Epoch 74/200
1407/1407 [==============================] - 2s 1ms/step - loss: 0.0308 -
accuracy: 0.9904 - val_loss: 0.1008 - val_accuracy: 0.9779
Fig. 13.10: The loss (left) and the accuracy (right) of mnist_model trained using
dropout as functions of epoch on both the training set and the validation set. The
figure shows that with dropout the performance curves over training and validation
sets do not diverge as in Fig. 13.8.
best_mnist_model = keras.models.load_model("best_model.keras")
loss, accuracy = best_mnist_model.evaluate(X_test, y_test_1_hot, verbose=1)
print('Test accuracy of the dropout model with lowest loss (the best model) = {:.3f}'.format(accuracy))
Here we observe that using dropout slightly improved the performance on the
test data: 97.7% (with dropout) vs. 96.9% (without dropout). At the same time,
comparing Fig. 13.10 with Fig. 13.8 shows that the performance curves over training
and validation sets are closer, which indicates that the dropout is indeed guarding
against overfitting.
As mentioned in Section 13.1, a major bottleneck in using deep neural networks is the
large number of hyperparameters to tune. Although Keras-TensorFlow comes with
several tuning algorithms, here we discuss the use of grid search cross-validation im-
plemented by sklearn.model_selection.GridSearchCV that was discussed in
Section 9.2.1. The entire procedure can be used with random search cross-validation
discussed in Section 9.2.2 if GridSearchCV and param_grid are replaced with
RandomizedSearchCV and param_distributions, respectively.
To treat a Keras classifier/regressor as a scikit-learn estimator and use it along
with scikit-learn classes, we can use
• keras.wrappers.scikit_learn.KerasClassifier
• keras.wrappers.scikit_learn.KerasRegressor
Similar wrappers are also implemented by scikeras library (Scikeras-wrappers,
2023), which supports Keras functional and subclassing APIs as well. As the number
of layers and neurons are (structural) hyperparameters to tune, we first write a
function, called construct_model, which constructs and compiles a sequential
Keras model depending on the number of layers and neurons in each layer. Then
treating construct_model as the argument of KerasClassifier, we create the
classifier that can be treated as a scikit-learn estimator; for example, to use it within
GridSearchCV or, if needed, as the classifier used within Pipeline class to create
a composite estimator along with other processing steps. In what follows, we also
treat the dropout rate and the learning rate of the optimizer, as well as the number
of epochs, as additional hyperparameters to tune; in contrast with the EarlyStopping
callback, here we determine the number of epochs that, jointly with the other
hyperparameters, leads to the highest CV score. For that reason, and to reduce the
computational complexity of the grid search, we assume candidate epoch values are in the set {10, 30, 50, 70}.
def construct_model(hidden_layers=1, neurons=32, dropout_rate=0.25, learning_rate=0.001):
    # building model
    model = keras.Sequential()
    for i in range(hidden_layers):
        model.add(layers.Dense(units=neurons, activation="sigmoid"))
        model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(10, activation="softmax"))
    # compiling model
    model.compile(loss='categorical_crossentropy',
                  optimizer=keras.optimizers.Adam(learning_rate),
                  metrics=['acc'])
    return model
import time
import pandas as pd
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
seed_value = 42
import random
random.seed(seed_value)
# data preprocessing
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train.astype('float32')/255, X_test.astype('float32')/255
X_train, X_test = X_train.reshape(60000, 28*28), X_test.reshape(10000, 28*28)
# training model
score_best_estimator = gscv.fit(X_train, y_train).score(X_test, y_test)
Epoch 1/10
1250/1250 [==============================] - 1s 690us/step - loss: 1.2428
,→- acc:
0.6646
Epoch 2/10
1250/1250 [==============================] - 1s 674us/step - loss: 0.4253
,→- acc:
0.8824
...
'neurons': 64}
the accuracy of the best estimator on the test data is: 0.978
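The construction of the gscv object used in the listing above is not shown in full; a possible setup, consistent with construct_model and the hyperparameters discussed in this section, is sketched below. Apart from the epoch set {10, 30, 50, 70}, the particular grid values, batch size, and cross-validation settings are assumptions.

# a possible construction of gscv (grid values other than 'epochs' are assumptions)
model = KerasClassifier(build_fn=construct_model, batch_size=32, verbose=1)
param_grid = {
    'hidden_layers': [1, 2],
    'neurons': [32, 64],
    'dropout_rate': [0.25, 0.5],
    'learning_rate': [0.001, 0.01],
    'epochs': [10, 30, 50, 70],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed_value)
gscv = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv)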
Exercises:
Exercise 1: In the MNIST application presented in this chapter, use the same parameters
for compiling and fitting except for the following three changes:
1. replace the MLP with another MLP having 5 hidden layers with 2048 neurons
in each layer.
2. remove the early stopping callback and instead train the model for 10 epochs.
3. modify the code to add a dropout layer with a rate of 0.2 to the input feature vector
(flattened input). Do not use dropout in any other layer. Hint: Use Flatten()
layer to flatten input images. This allows adding a dropout layer before the first
hidden layer.
(A) Execute the code only on CPUs available in a personal machine. What is the
execution time?
(B) Execute the code on GPUs provided by Colab. What is the execution time?
(C) Suppose each weight in this network is stored in memory in 4 bytes (float32).
Approximately what is the minimum amount of memory (RAM), in megabytes
(MB), required to store all parameters of the network? (Answers to questions of
this type are especially insightful when, for example, we wish to deploy a trained
network on memory-limited embedded devices.)
Exercise 3: We have the perceptron depicted in Fig. 13.11 that takes binary inputs
x1 and x2; that is to say, xi = 0, 1, i = 1, 2, and leads to an output y. With choices
of weights indicated below, the network leads to an approximate implementation of
which logical gates (OR gate, AND gate, NAND gate, XOR gate)?
A) with a1 = a2 = 50 and a0 = −75 the network implements _______ gate
B) with a1 = a2 = −50 and a0 = 75 the network implements _______ gate
Exercise 4: Suppose the perceptron depicted in Fig. 13.12 is used in a binary
classification problem. What activation function, what threshold T, and what loss
function should be used to make the classifier ψ(x) equivalent to logistic regression?
Fig. 13.11 and Fig. 13.12: diagrams of the perceptrons referenced in Exercises 3 and 4 (inputs, weights, a summation unit, and an activation function).
A) 500 B) 50 C) 20 D) 1
amount of memory (in MB) require to store all parameters of the model?
A) 20 MB B) 44 MB C) 88 MB D) 176 MB E) 196 MB
A) -1 B) 0 C) 1 D) 101 E) 10101
Chapter 14
Convolutional Neural Networks
The embedded feature extraction mechanism, along with training efficiency and
performance, has empowered CNNs to achieve breakthrough results in a wide range
of areas such as image classification (Krizhevsky et al., 2012; Simonyan and Zisser-
man, 2015), object detection (Redmon et al., 2016), electroencephalogram (EEG)
classification (Lawhern et al., 2018), and speech recognition (Abdel-Hamid et al.,
2014), to just name a few. Consider an image classification application. The con-
ventional way to train a classifier is to use some carefully devised feature extraction
methods such as HOG (Histogram of Oriented Gradients) (Dalal and Triggs, 2005),
SIFT (Scale Invariant Feature Transform) (Lowe, 1999), or SURF (Speeded-Up Ro-
bust Feature) (Bay et al., 2006), and use the extracted features in some conventional
classifiers such as logistic regression or kNN. This way, we are restricted to a handful
of feature extraction algorithms that can heavily influence the performance of our
classifiers. However, in training CNNs, features are extracted in the process of model
construction. A CNN is essentially an ANN that uses convolution operation in at
least one of its layers. Therefore, to understand the working mechanism of CNN, it
is important to have a grasp of the convolution operation.
Suppose x[i] and k[i] are two discrete 1-dimensional signals, where i is an
arbitrary integer. In the context of convolutional neural networks, x[i] is the input
signal, and k[i] is referred to as kernel, albeit as we will see later both have generally
higher dimensions than one. The convolution of x and k, denoted x ∗ k, is
y[m] \triangleq (x * k)[m] = \sum_{i=-\infty}^{\infty} x[i]\, k[m-i], \quad \forall m.   (14.1)
This can be seen as flipping the kernel, sliding it over the input, and summing the
products at each position. One can easily show that the convolution operation is commutative; that
is, (x ∗ k) = (k ∗ x). This is due to the fact that in the definition of convolution we
flip one signal with respect to the other. If we do not flip one, we obtain the
cross-correlation between x and k given by
y[m] \triangleq (x \star k)[m] = \sum_{i=-\infty}^{\infty} x[i]\, k[i-m], \quad \forall m.   (14.2)
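To make the distinction concrete, here is a minimal NumPy sketch: np.convolve flips the kernel (convolution), while np.correlate does not (cross-correlation). The signals below are arbitrary.

import numpy as np

x = np.array([1., 2., 3., 4.])   # input signal x[i]
k = np.array([1., 0., -1.])      # kernel k[i]

conv = np.convolve(x, k)                 # (x * k): kernel is flipped before sliding
corr = np.correlate(x, k, mode='full')   # cross-correlation: no flipping
print(conv)   # [ 1.  2.  2.  2. -3. -4.]
print(corr)   # [-1. -2. -2. -2.  3.  4.]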
In practice, 1D, 2D, and 3D convolutions are typically used for processing multivariate time-series, images, and videos, respectively. The
distinguishing factor between these types of convolution is the number of dimensions
through which the kernel slides. For example, in 1D convolution, the kernel is slid
in one dimension (through time steps), whereas in 2D convolution, the kernel is slid
in two dimensions (e.g., in both height and width directions of an image). This also
implies that, if desired, we can even use 2D convolution with sequential data such
as multivariate time-series.
As the majority of applications are focused on the utility of 2D convolutions
(this does not mean others are not useful), in what follows we elaborate in detail
the working principles of 2D convolutions. The use of “2D” convolution should not
be confused with the dimensionality of input and kernel tensors used to hold the
data and the kernel. For example, in the 2D convolution that is used in CNN, each
input image is stored in a 3D tensor (rank-3 tensor: height, width, and depth) and the
kernel is a 4D tensor (rank-4 tensor: height, width, input depth, and output depth).
To better understand how the 2D convolution operates on a 3D tensor of input and
4D tensor of kernel, we first start with a simpler case, which is the convolution of a
2D input tensor with a 2D kernel tensor.
Fig. 14.1: The “convolution” operation for a 2D input signal X and a 2D kernel K
to create the output Y . The dotted area in the middle of X shows elements of X at
which K could be centered to have a legitimate convolution.
Fig. 14.2: The “convolution” with zero padding for a 2D input signal X and a 2D
kernel K to create an output Y of the same size as X. The dotted area in the middle of
X shows elements of X at which K could be centered to have a legitimate convolution.
The output feature map from one layer is used as the input to the next layer in a
multi-layer CNN.
Parameter sharing: Parameter sharing means that all neurons in the same layer
share the same weights (kernel). This is because the small-size kernel (due to sparse
connectivity) slides over the entire input tensor, which, in turn, produces the inputs
to each neuron by multiplying different patches of the input tensor with the same
kernel. Besides the sparse connectivity, this property (i.e., parameter/kernel sharing
between neurons in the same layer) is another factor that leads to a small number
of parameters to estimate. At the same time, a kernel, when learned, serves as a
feature detector. For example, if the learned kernel serves as a vertical edge detector,
convolving that over an image and looking at the outputs of neurons (i.e., feature
maps) tells us the areas of the image at which this feature is located. This is because
we apply the same feature detector over the entire image.
Strides: Rather than the regular convolution described earlier, in which the
kernel is slid with a step (stride) of 1 in both directions, we can have a strided
convolution in which the stride is greater than 1. We can even have different stride
values across height and width directions. Fig. 14.4 shows a strided convolution
Fig. 14.3: Each artificial neuron in a CNN layer produces one “pixel” in the output
feature map.
where the strides in the height and width directions are 2 and 3, respectively. If h_s and
w_s denote the strides in the height and width directions, respectively, then the output
of the strided convolution is of size (⌊(h_X − h_K)/h_s⌋ + 1) × (⌊(w_X − w_K)/w_s⌋ + 1). As we can see, the
effect of a stride greater than 1 is downsampling the feature map by a rate equal to
the stride value.
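A quick helper to check this output-size formula (the example numbers in the call below are arbitrary):

import math

def strided_conv_output_size(h_X, w_X, h_K, w_K, h_s, w_s):
    """Height and width of a strided 'valid' convolution (no padding)."""
    return (math.floor((h_X - h_K) / h_s) + 1,
            math.floor((w_X - w_K) / w_s) + 1)

print(strided_conv_output_size(7, 7, 3, 3, 2, 3))  # (3, 2)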
Max-pooling: In CNN, it is typical to perform pooling once output feature maps
are generated. This is done by using a pooling layer right after the convolutional
layer. Pooling replaces the values of the feature map over a small window with a
summary statistic obtained over that window. There are various types of pooling that
can be used; for example, max-pooling replaces the values of the feature map within
the pooling window with one value, which is their maximum. Average-pooling, on
the other hand, replaces these values with their average.
Regardless of the actual function used for pooling, it is typical to use windows of
2 × 2 that are shifted with a stride of 2—the concept of “stride” is the same as what
we saw before for the convolution operation and the magnitude of strides used for
pooling does not need to be the same as the convolutional strides. Similar to the use
of convolutional stride that was discussed before, using pooling in this way leads
to smaller feature maps; that is to say, feature maps are downsampled with a rate
of 2. From a computational perspective, downsampling helps reduce the amount of computation in subsequent layers.
Fig. 14.4: Strided convolution. The effect of a stride greater than 1 is downsampling
the feature map.
(Figure: a numerical example of the “convolution” of two 2D inputs X1 and X2 with a 2D kernel K, followed by 2 × 2 max-pooling of the resulting feature maps Y1 and Y2.)
The actual “2D” convolution that occurs in a layer of a CNN is the convolution of
a 3D input tensor and a 4D kernel tensor. To formalize the idea, let X, K, and Y
denote the 3D input tensor, 4D kernel tensor, and 3D output tensor, respectively.
Let X[i, j, k] denote the element in row i, column j, and channel k of the input X
for 1 ≤ i ≤ hX (input height), 1 ≤ j ≤ wX (input width), and 1 ≤ k ≤ cX (the
number of channels or depth). For an image encoded in RGB format, cX = 3 where
each channel refers to one of the red, green, and blue channels. Furthermore, here
we used the channels-last convention to represent the channels as the last dimension
in X. Alternatively, we could have used the channels-first convention, which uses the first
dimension to refer to channels.
Furthermore, let K[i, j, k, l] denote the 4D kernel tensor where 1 ≤ i ≤ hK
(kernel height), and 1 ≤ j ≤ wK (kernel width), 1 ≤ k ≤ cK = cX (input
channels or input depth), and 1 ≤ l ≤ cY (output channels or output depth).
The sparse connectivity assumption stated before means that generally hK << hX
and wK << wX . Furthermore, note that cK is the same as cX . The fourth di-
mension can be understood easily by viewing K as containing cY filters, each of size
hK [smaller than input] × wK [smaller than input] × cX [same as input]. The convo-
lution (to be precise, cross-correlation) of X and K (with no padding of the input) is
obtained by computing the 3D output feature map tensor Y with elements Y[m, n, l],
where 1 ≤ m ≤ hY = hX − hK + 1, 1 ≤ n ≤ wY = wX − wK + 1, and 1 ≤ l ≤ cY, as
Y[m, n, l] = \sum_{i, j, k} X[m + i - 1, n + j - 1, k]\, K[i, j, k, l],   (14.3)
where the summation is over all valid indices. Fig. 14.6 shows the working mecha-
nism of the 2D convolution in (14.3). A few remarks about the 2D convolution
as shown here:
1. l in the third dimension of Y[m, n, l] refers to each feature map (matrix), and all
feature maps have the same size hY × wY ;
2. assuming one bias term for each kernel, training a single layer in CNN means
estimating cY (cX hK wK +1) parameters for the kernel using the back-propagation
algorithm;
3. when there are multiple convolutional layers in one network, the output feature
map of one layer serves as the input tensor to the next layer;
4. all channels of a kernel are shifted with the same stride (see Fig. 14.7); and
5. the pooling, if used, is applied to every feature map separately;
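To make (14.3) concrete, the following is a naive NumPy sketch of this valid (no padding, stride 1) operation; it only mirrors the equation and is not the optimized implementation used by Keras:

import numpy as np

def conv2d_valid(X, K):
    """Cross-correlation of a 3D input X (hX, wX, cX) with a 4D kernel
    K (hK, wK, cX, cY), as in (14.3): no padding, stride 1."""
    hX, wX, cX = X.shape
    hK, wK, cK, cY = K.shape
    assert cK == cX
    hY, wY = hX - hK + 1, wX - wK + 1
    Y = np.zeros((hY, wY, cY))
    for m in range(hY):
        for n in range(wY):
            patch = X[m:m + hK, n:n + wK, :]          # (hK, wK, cX)
            for l in range(cY):
                Y[m, n, l] = np.sum(patch * K[:, :, :, l])
    return Y

X = np.random.rand(5, 5, 3)      # e.g., a 5x5 "image" with 3 channels
K = np.random.rand(3, 3, 3, 4)   # 4 filters of size 3x3x3
print(conv2d_valid(X, K).shape)  # (3, 3, 4)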
(Fig. 14.6 and Fig. 14.7: the 2D convolution of a 3D input tensor with a 4D kernel tensor of shape height × width × input depth × output depth, producing one output feature map per output-depth slice; all channels of the kernel are shifted with the same strides.)
1. the input data: in Section 13.5.1, we vectorized images to create feature vectors.
This was done because MLP accepts feature vectors; however, in using CNN, we
do not need to do this as CNNs can work with images directly. That being said,
each input (image) to a 2D convolution should have the shape of height×width×
channels (depth). However, our raw images have shape of 28 × 28. Therefore,
we use np.newaxis to add the depth axis;
2. the layers of CNN: to add 2D convolutional layer in Keras, we can use
keras.layers.Conv2D class. The number of filters (number of output fea-
ture maps), kernel size (hK × wK ), the strides (along the height and the width),
and padding are set through the following parameters:
keras.layers.Conv2D(filters, kernel_size,
strides=(1, 1), padding="valid")
3. the pooling layer: similarly, a 2D max-pooling layer can be added using the
keras.layers.MaxPooling2D class:
keras.layers.MaxPooling2D(pool_size=(2, 2),
strides=(1, 1), padding="valid").
pool_size can be either a tuple/list of 2 integers that shows the height and
width of the pooling window, or one integer that is then considered as the size
in both dimensions;
4. although we can use any activation functions presented in Chapter 13, a recom-
mended activation function in convolutional layers is ReLU;
5. the “classification” part: convolutional layers act like feature detectors (extrac-
tors). Regardless of whether we use one or more convolutional layers, we end up
with a number of feature maps, which is determined by the number of filters in
the last convolutional layer. The question is how to perform classification with
these feature maps. There are two options:
• flattening the feature maps of the last convolutional layer using a flattening
layer (layers.Flatten) and then use the results as feature vectors to train
fully connected layers similar to what we have seen in Chapter 13. Although
this practice is common, there are two potential problems. One is that due to
large dimensionality of these feature vectors, the fully connected layer used
here could potentially overfit the data. For this reason, it is recommended to
use dropout in this part to guard against overfitting. Another problem is that
the fully connected layers used in this part have their own hyperparameters
and tuning them entails an increase in the computational needs;
• apply the global average pooling (Lin et al., 2014) after the last convolutional
layer. This can be done using keras.layers.GlobalAveragePooling2D.
At this stage, we have two options for the number of filters in the last convolu-
tional layer: 1) we can set the number of filters in the last layer to the number
of categories and then use a softmax layer (keras.layers.Softmax())
to convert the global average of each feature map as the probability of the
corresponding class for the input; and 2) use the number of filters in the last
layer as we generally use in CNNs (e.g., increasing the number of filters as
we go deeper in the CNN), and then apply the global average pooling. As
this practice leads to a larger vector than the number of classes (one value
per feature map), we can then use that as the input to a fully connected layer
that has a softmax activation and a number of neurons equal to the number
of classes. In what follows, we will examine training a CNN classifier on
MNIST dataset using these three approaches.
Case 1: Flattened feature maps of the last convolutional layer as the input to two
fully connected layers in the “classification” part of the network trained for the
MNIST application.
For faster execution, use GPU, for example, provided by the Colab. This case is
implemented as follows (for brevity, part of the code output is omitted):
import time
import numpy as np
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split
seed_value= 42
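# data preprocessing: a minimal sketch (assumed here, since this part of the
# listing is abbreviated); it follows the earlier MNIST preprocessing, except
# that images are kept as 28x28 arrays and np.newaxis adds the depth axis
(X_train_val, y_train_val), (X_test, y_test) = mnist.load_data()
X_train_val, X_test = X_train_val.astype('float32')/255, X_test.astype('float32')/255
X_train_val, X_test = X_train_val[..., np.newaxis], X_test[..., np.newaxis]
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
    stratify=y_train_val, test_size=0.25)
y_train_1_hot = keras.utils.to_categorical(y_train, num_classes=10)
y_val_1_hot = keras.utils.to_categorical(y_val, num_classes=10)
y_test_1_hot = keras.utils.to_categorical(y_test, num_classes=10)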
# building model
mnist_model = keras.Sequential([
layers.Conv2D(16, 3, activation="relu",
,→input_shape = X_train[0].shape),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, activation="relu"),
layers.MaxPooling2D(),
layers.Flatten(),
layers.Dropout(0.25),
layers.Dense(32, activation="relu"),
layers.Dropout(0.25),
layers.Dense(10, activation="softmax")
])
# compiling model
mnist_model.compile(optimizer="adam", loss="categorical_crossentropy",
,→metrics=["accuracy"])
# training model
my_callbacks = [
keras.callbacks.EarlyStopping(
monitor="val_accuracy",
patience=5),
keras.callbacks.ModelCheckpoint(
filepath="best_model.keras",
monitor="val_loss",
save_best_only=True,
verbose=1)
]
start = time.time()
history = mnist_model.fit(x = X_train,
y = y_train_1_hot,
batch_size = 32,
epochs = 200,
validation_data = (X_val, y_val_1_hot),
callbacks=my_callbacks)
end = time.time()
training_duration = end - start
print("training duration = {:.3f}".format(training_duration))
print(history.history.keys())
print(mnist_model.summary())
epoch_count = range(1, len(history.history['loss']) + 1)
plt.subplot(121)
plt.plot(epoch_count, history.history['loss'], 'b', label='training loss')
plt.plot(epoch_count, history.history['val_loss'], 'r', label='validation loss')
plt.legend()
plt.ylabel('loss')
plt.xlabel('epoch')
plt.subplot(122)
plt.plot(epoch_count, history.history['accuracy'], 'b', label='training accuracy')
plt.plot(epoch_count, history.history['val_accuracy'], 'r', label='validation accuracy')
plt.legend()
plt.ylabel('accuracy')
plt.xlabel('epoch')
best_mnist_model = keras.models.load_model("best_model.keras")
loss, accuracy = best_mnist_model.evaluate(X_test, y_test_1_hot, verbose=1)
print('Test accuracy of the model with the lowest loss (the best model) = {:.3f}'.format(accuracy))
Epoch 1/200
1407/1407 [==============================] - 13s 9ms/step - loss: 0.6160 -
accuracy: 0.8051 - val_loss: 0.0902 - val_accuracy: 0.9727
...
Epoch 24/200
1407/1407 [==============================] - 12s 8ms/step - loss: 0.0218 -
accuracy: 0.9928 - val_loss: 0.0442 - val_accuracy: 0.9909
Fig. 14.8: The output of the code in Case 1: (left) loss; and (right) accuracy, as
functions of epoch.
Case 2: Global average pooling with the number of filters in the last convolutional
layer being equal to the number of classes in the MNIST application.
For this purpose, we replace the “building model” part in the previous code with
the following code snippet:
# building model
mnist_model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=X_train[0].shape),
    layers.MaxPooling2D(),
    layers.Conv2D(10, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    keras.layers.Softmax()
])
Epoch 1/200
1407/1407 [==============================] - ETA: 0s - loss: 1.6765 -
,→accuracy: 0.4455
Epoch 1: val_loss improved from inf to 1.23908, saving model to
,→best_model.keras
1407/1407 [==============================] - 16s 11ms/step - loss: 1.6764
accuracy: 0.4454 - val_loss: 1.2391 - val_accuracy: 0.6179
...
Epoch 40/200
1407/1407 [==============================] - ETA: 0s - loss: 0.3431 -
,→accuracy: 0.8950
Epoch 40: val_loss did not improve from 0.36904
1407/1407 [==============================] - 15s 11ms/step - loss: 0.3432
accuracy: 0.8950 - val_loss: 0.3699 - val_accuracy: 0.8851
training duration = 630.068
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_4 (Conv2D) (None, 26, 26, 32) 320
=================================================================
Total params: 3,210
Trainable params: 3,210
Non-trainable params: 0
_________________________________________________________________
None
313/313 [==============================] - 1s 2ms/step - loss: 0.3394 -
accuracy: 0.9008
Test accuracy of the dropout model with lowest loss (the best model) = 0.
,→901
Case 3: Global average pooling with a “regular” number of filters in the last convo-
lutional layer and then a dense layer with the same number of neurons as the number
of classes.
Fig. 14.9: The output of the code for Case 2: (left) loss; and (right) accuracy, as
functions of epoch.
For this purpose, we replace the “building model” part in Case 1 with the following
code snippet:
# building model
mnist_model = keras.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=X_train[0].shape),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax")
])
Epoch 1/200
1407/1407 [==============================] - ETA: 0s - loss: 1.1278 -
,→accuracy: 0.6446
...
Epoch 26/200
1407/1407 [==============================] - ETA: 0s - loss: 0.0924 -
,→accuracy: 0.9721
=================================================================
Total params: 10,090
Trainable params: 10,090
Non-trainable params: 0
_________________________________________________________________
None
313/313 [==============================] - 1s 2ms/step - loss: 0.0775 -
accuracy: 0.9767
Test accuracy of the model with the lowest loss (the best model) = 0.977
Fig. 14.10: Output of the code for Case 3: (left) loss; and (right) accuracy, as functions
of epoch.
Exercises:
choose an increasing pattern for the number of filters that depends on the depth of
the convolutional base of the network. In particular, the number of filters to examine
are:
1. for a CNN with one layer: 4 filters;
2. for a CNN with two layers: 4 and 8 filters in the first and second layers, respec-
tively; and
3. for a CNN with three layers: 4, 8, and 16 filters in the first, second, and third
layers, respectively;
In all cases: 1) the output of the (last) convolutional layer is flattened and used as
the input to two fully connected layers with 32 neurons in the first layer; 2) there is a
2×2 max-pooling right after each convolutional layer and a dropout layer with a rate
of 0.25 before each dense layer; and 3) all models are trained for 5 epochs.
(A) write and execute a code that performs hyperparameter tuning using 3-fold
stratified cross-validation with shuffling. For reproducibility purposes, set the
seed for numpy, tensorflow, and Python built-in pseudo-random generators to
42. What is the highest CV score and the pattern achieving that score?
(B) what is the estimated test accuracy of the selected (“best”) model trained on the
full training data?
(C) the following lines show parts of the code output:
Epoch 1/5
1250/1250 [==============================] - ...
Epoch 2/5
1250/1250 [==============================] - ...
Epoch 3/5
1250/1250 [==============================] - ...
Epoch 4/5
1250/1250 [==============================] - ...
Epoch 5/5
1250/1250 [==============================] - ...
625/625 [==============================] - ...
What do “1250” and “625” represent? Explain how these numbers are deter-
mined.
(D) the following lines show parts of the code output:
Epoch 1/5
1875/1875 [==============================] - ...
Epoch 2/5
1875/1875 [==============================] - ...
Epoch 3/5
1875/1875 [==============================] - ...
Epoch 4/5
1875/1875 [==============================] - ...
Epoch 5/5
Exercise 3: In a 2D convolutional layer, the input height, width, and depth are 30, 30,
and 32, respectively. We use a strided convolution (stride being 3 in all directions),
no padding, and a kernel that has height, width, and the output depth of 5, 5, and
128, respectively. What is the shape of the output tensor?
A) 9 × 9 × 128 B) 10 × 10 × 128 C) 9 × 9 × 32 D) 10 × 10 × 32
2 -1 -1 -1
-1 2 -1 -1
-1 -1 2 -1
-1 -1 -1 2
In the convolutional layer, we use a kernel that has height, width, and output depth
of 3, 3, and 32, respectively, stride 1, and the ReLU activation function. This layer is
followed by the global average pooling as the input to a fully connected layer to esti-
mate the value of the output. Assuming all weights and biases used in the structure
of the network are 1, what is the output for the given input (use hand calculations)?
A) -17 B) -15 C) 0 D) 1 E) 15 F) 17
A) 1 B) 17 C) 33 D) 49 E) 65 F) 81
Exercise 6: Use Keras to build the models specified in Exercise 4 and 5 and compute
the output for the given inputs.
Hints:
• Use the kernel_initializer and bias_initializer in the convolutional
and dense layers to set the weights.
• Pick an appropriate initializer from (keras-initializers, 2023).
• The 1D convolutional layer that is needed for the model specified in Exercise 5
can be implemented by keras.layers.Conv1D. Similarly, the global average
pooling for that model is layers.GlobalAveragePooling1D.
• To predict the output for the models specified in Exercise 4 and 5, the input
should be 4D and 3D, respectively (the first dimension refers to “samples”, and
here we have only one sample point as the input).
Chapter 15
Recurrent Neural Networks
The information flow in a classical ANN is from the input layer to hidden layers
and then to the output layer. An ANN with such a flow of information is referred
to as a feed-forward network. Such a network, however, has a major limitation in
that it is not able to capture dependencies among sequential observations such as
time-series. This is because the network has no memory and, as a result, observations
are implicitly assumed to be independent.
Recurrent neural networks (RNNs) are fundamentally different from the feed-
forward networks in that they operate not only based on input observations, but also
on a set of internal states. The internal states capture the past information in sequences
that have already been processed by the network. That is to say, the network includes
a cycle that allows keeping past information for an amount of time that is not fixed
a-priori and, instead, depends on its input observations and weights. Therefore, in
contrast with other common architectures used in deep learning, RNN is capable
of learning sequential dependencies extended over time. As a result, it has been
extensively used for applications involving the analysis of sequential data such as time-
series, machine translation, and text and handwriting generation. In this chapter, we
first cover fundamental concepts in RNNs including standard RNN, stacked RNN,
long short-term memory (LSTM), gated recurrent unit (GRU), and vanishing and
exploding gradient problems. We then present an application of RNNs for sentiment
classification.
The standard RNN unit computes, for t = 1, . . . , T,
h_t = f_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h),   (15.1)
y_t = W_{hy} h_t + b_y,   (15.2)
where W_{xh} (input weight matrix), W_{hh} (recurrent weight matrix), and W_{hy} denote
l × p, l × l, and q × l weight matrices, b_h and b_y denote bias terms, and f_h(.) denotes
an element-wise nonlinear hidden layer function (an activation function) such as the
logistic sigmoid or hyperbolic tangent. As in previous neural networks, depending
on the application, one may use an activation function in (15.2) to obtain y_t; for
example, in a multiclass single-label classification problem, W_{hy} h_t + b_y is used as
the input to a softmax to obtain y_t. Fig. 15.1 depicts the working mechanism of
the standard RNN unit characterized by the recursive application of equations (15.1) and
(15.2); in Keras, this is implemented via the keras.layers.SimpleRNN class.
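As a minimal sketch of defining such a unit in Keras (the dimensions below are arbitrary), a single SimpleRNN layer with l units applied to sequences of p features has l·p + l·l + l parameters:

from tensorflow import keras
from tensorflow.keras import layers

p, l = 8, 16  # features per time step, number of RNN units
rnn = keras.Sequential([
    layers.Input(shape=(None, p)),  # (time steps, features); variable-length sequences
    layers.SimpleRNN(l),            # returns the last hidden state h_T
])
print(rnn.count_params())  # l*p + l*l + l = 400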
A basic RNN unit is already temporally deep because if it is unfolded in time,
it becomes a composite of multiple nonlinear computational layers (Pascanu et al.,
2014). Besides the depth of an RNN in a temporal sense, Pascanu et al. explored
various other ways to define the concept of depth in RNN and to construct deep
RNNs (Pascanu et al., 2014). A common way to define deep RNNs is to simply
stack multiple recurrent hidden layers on top of each other (Pascanu et al., 2014).
Nonetheless, various forms of stacking have been proposed. The standard stacked
RNN is formed by using the hidden state of a layer as the input to the next layer.
Assuming N hidden layers, the hidden states for layer j and the output of the network
yt are iteratively computed for j = 1, . . . , N, and t = 1, . . . , T, as (Graves et al., 2013)
h_t^j = f_h\big(W_{h^{j-1} h^j} h_t^{j-1} + W_{h^j h^j} h_{t-1}^j + b_h^j\big),   (15.3)
y_t = W_{h^N y} h_t^N + b_y,   (15.4)
where h_t^0 = x_t, h_t^j denotes the state vector of the j-th hidden layer, and an
identical hidden layer function f_h(.) is assumed across all layers. Fig. 15.2a depicts
the working mechanism of the stacked RNN characterized by (15.3) and (15.4).
The recurrence relation (15.3) was also used in (Hermans and Schrauwen, 2013),
but the output is computed from all the hidden states. This form of stacked RNN is
characterized by (Hermans and Schrauwen, 2013)
h_t^j = f_h\big(W_{h^{j-1} h^j} h_t^{j-1} + W_{h^j h^j} h_{t-1}^j + b_h^j\big),   (15.5)
y_t = \sum_{j=1}^{N} W_{h^j y} h_t^j + b_y,   (15.6)
where h0t = 0.
Fig. 15.1: A schematic diagram for the working mechanism of the standard RNN
unit.
The models presented so far are known as unidirectional RNNs because to predict
yt they only consider information from the present and the past context; that is to
say, the flow of information is from past to present and future observations are
not used to predict yt . Schuster et al. (Schuster and Paliwal, 1997) proposed the
bidirectional RNN that can utilize information from both the past and the future in
predicting y_t. In this regard, two sets of hidden layers are defined: 1) forward hidden
states, denoted h_t^f, which, similar to the standard RNN, are computed iteratively
from t = 1 to T (i.e., using the sequence (x_1, x_2, . . . , x_T)); and 2) backward hidden
states, denoted h_t^b, which are computed iteratively from t = T to 1 (i.e., using the
sequence (x_T, x_{T-1}, . . . , x_1)). Once the hidden states are computed, the output sequence
is obtained by a linear combination of the forward and backward states. In particular, we
have
h_t^f = f_h\big(W_{xh}^f x_t + W_{hh}^f h_{t-1}^f + b_h^f\big),   (15.9)
h_t^b = f_h\big(W_{xh}^b x_t + W_{hh}^b h_{t+1}^b + b_h^b\big),   (15.10)
y_t = W_{hy}^f h_t^f + W_{hy}^b h_t^b + b_y,   (15.11)
where W_{xh}^f, W_{hh}^f, W_{xh}^b, W_{hh}^b, W_{hy}^f, W_{hy}^b, b_h^f, b_h^b, and b_y are weight matrices and
bias terms of compatible size.
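A minimal Keras sketch of a bidirectional recurrent layer (the dimensions are arbitrary); note that keras.layers.Bidirectional combines the forward and backward outputs by concatenation by default, rather than the sum in (15.11):

from tensorflow import keras
from tensorflow.keras import layers

birnn = keras.Sequential([
    layers.Input(shape=(None, 8)),
    layers.Bidirectional(layers.SimpleRNN(16)),  # forward and backward SimpleRNNs
])
birnn.summary()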
Fig. 15.2: A schematic diagram for the working mechanism of stacked RNN proposed
in (Graves et al., 2013) (a), and (Graves, 2013) (b).
15.2 Vanishing and Exploding Gradient Problems
Despite the powerful working principles of standard RNN, it does not exhibit the
capability of capturing long-term dependencies when trained in practice. This state
of affairs is generally attributed to the vanishing gradient and the exploding gradient
problems described in (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997).
h5 = fh (wxh x5 + whh h4 ) ,
h4 = fh (wxh x4 + whh h3 ) ,
h3 = fh (wxh x3 + whh h2 ) ,
h2 = fh (wxh x2 + whh h1 ) ,
h1 = fh (wxh x1 ) . (15.14)
or, equivalently,
h_5 = f_h\Big(w_{xh} x_5 + w_{hh} f_h\big(w_{xh} x_4 + w_{hh} f_h\big(w_{xh} x_3 + w_{hh} f_h\big(w_{xh} x_2 + w_{hh} f_h(w_{xh} x_1)\big)\big)\big)\Big).   (15.15)
\frac{\partial h_{j+1}}{\partial w_{hh}} = \Big(h_j + w_{hh}\frac{\partial h_j}{\partial w_{hh}}\Big) f_j', \quad j = 1, . . . , t_0 - 1,   (15.18)
\frac{\partial h_1}{\partial w_{hh}} = 0,   (15.19)
where, for ease of notation, we denote the derivative of a function g with respect
to its input by g' and denote f_h'(w_{xh} x_{j+1} + w_{hh} h_j) by f_j'. Recursively computing
(15.18) starting from (15.19) for j = 1, . . . , t_0 - 1, yields¹
\frac{\partial h_{t_0}}{\partial w_{hh}} = \sum_{k=1}^{t_0-1} w_{hh}^{t_0-1-k} h_k \prod_{j=k}^{t_0-1} f_j'.   (15.20)
For instance,
\frac{\partial h_5}{\partial w_{hh}} = h_4 f_4' + w_{hh} h_3 f_4' f_3' + w_{hh}^2 h_2 f_4' f_3' f_2' + w_{hh}^3 h_1 f_4' f_3' f_2' f_1'.   (15.21)
To analyze the effect of repetitive multiplications by whh , we assume a linear activa-
tion function fh (u) = u. This could be perceived simply as an assumption (e.g., see
(Pascanu et al., 2013)) or, alternatively, the first (scaled) term of Maclaurin series
for some common activation functions (and therefore, an approximation around 0).
Therefore, f_j' = 1, j = 1, . . . , t_0 - 1. Using this assumption to recursively compute
h_j from (15.12) yields
h_j = w_{xh} \sum_{k=1}^{j} w_{hh}^{j-k} x_k, \quad j = 1, . . . , t_0 - 1.   (15.22)
Substituting (15.22) into (15.20) then gives
\frac{\partial h_{t_0}}{\partial w_{hh}} = w_{xh} \sum_{k=1}^{t_0-1} (t_0 - k)\, w_{hh}^{t_0-1-k} x_k.   (15.23)
1 Here, to write the expression (15.20), we have used the regular chain rule of calculus, which
is well known. We can also arrive at this relationship using the concept of immediate partial
derivatives, which was introduced by Werbos (Werbos, 1990) and used, for example, in (Pascanu
et al., 2013) to treat the vanishing and exploding gradient phenomena.
If |w_hh| < 1 (or if |w_hh| > 1), high powers of w_hh will vanish (or explode), which
causes the vanishing (or exploding) gradient problem.
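A tiny numerical illustration of this effect:

# powers of w_hh over increasing time lags:
# |w_hh| < 1 shrinks towards 0 (vanishing), |w_hh| > 1 blows up (exploding)
for w_hh in (0.9, 1.1):
    print(w_hh, [w_hh ** t for t in (1, 10, 50, 100)])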
The problem of exploding gradients associated with the standard RNN can be resolved
by gradient clipping; for example, by clipping the norm of the gradient when
it goes over a threshold (e.g., the average norm over a number of updates) (Pascanu
et al., 2013). However, to alleviate the vanishing gradient problem, more sophisticated
recurrent network architectures, namely, long short-term memory (LSTM)
units (Hochreiter and Schmidhuber, 1997) and, more recently, gated recurrent
units (GRUs) (Cho et al., 2014), have been proposed. In both of these networks, the
mapping from the pair of x_t and h_{t-1} to h_t, which in the standard RNN was based on a
single activation function, is replaced by a more sophisticated composite function. In
particular, LSTM implements the activation function f_h(.) in (15.1) by the following
composite function:²
2 Compared with the original form of LSTM (Hochreiter and Schmidhuber, 1997), this version is
due to (Gers et al., 1999) and has an additional gate (the forget gate).
are learned in (15.30)-(15.32) (keras-gru, 2023). This makes the total number of
parameters learned in a GRU layer 3(lp + l^2 + 2l).
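As a quick check of this count, we can build a single GRU layer in Keras and compare count_params() with 3(lp + l^2 + 2l); the dimensions below are arbitrary, and the count corresponds to the default reset_after=True implementation:

from tensorflow import keras
from tensorflow.keras import layers

p, l = 8, 16  # input features per time step, number of GRU units
gru = keras.Sequential([layers.Input(shape=(None, p)), layers.GRU(l)])
print(gru.count_params(), 3 * (l * p + l ** 2 + 2 * l))  # both 1248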
In this application, we intend to use text data to train an RNN that can classify a
movie review, represented as a string of words, as a “positive review” or a “negative
review”; this is an example of sentiment classification. In this regard, we use the
“IMDB dataset”, which contains 25,000 positive and the same number of negative
movie reviews collected from the IMDB website (Maas et al., 2011). Half of the
entire dataset is designated for training and the other half for testing. Although the
(encoded) dataset comes bundled with the Keras library and could be easily loaded
by from tensorflow.keras.datasets import imdb and then imdb.load_data(), we
download it in its raw form from (IMDB, 2023) to illustrate some preprocessing
steps that would be useful in the future when working with raw text data.
import time
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
from tensorflow.keras import layers
#from tensorflow.keras.datasets import imdb
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
seed_value= 42
The keras.utils.text_dataset_from_directory function used below loads the text
files in a directory into a TensorFlow Dataset object. We then use its concatenate
method to add the negative reviews to the positive reviews.
# load the training (+ validation) data for positive and negative
# reviews, concatenate them, and generate the labels
directory = ".../aclImdb/train/pos"
X_train_val_pos = keras.utils.text_dataset_from_directory(
    directory=directory, batch_size=None, label_mode=None, shuffle=False)
directory = ".../aclImdb/train/neg"
X_train_val_neg = keras.utils.text_dataset_from_directory(
    directory=directory, batch_size=None, label_mode=None, shuffle=False)
X_train_val = X_train_val_pos.concatenate(X_train_val_neg)
y_train_val = np.array([0]*len(X_train_val_pos) + [1]*len(X_train_val_neg))
# load the test data for positive and negative reviews, concatenate
# them, and generate the labels
directory = ".../aclImdb/test/pos"
X_test_pos = keras.utils.text_dataset_from_directory(
    directory=directory, batch_size=None, label_mode=None, shuffle=False)
directory = ".../aclImdb/test/neg"
X_test_neg = keras.utils.text_dataset_from_directory(
    directory=directory, batch_size=None, label_mode=None, shuffle=False)
X_test = X_test_pos.concatenate(X_test_neg)
y_test = np.array([0]*len(X_test_pos) + [1]*len(X_test_neg))
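The step that learns the vocabulary is not included in this excerpt, although the code below relies on it. A minimal sketch of what it might look like is given here; the layer name tv is taken from the code that follows, while max_tokens=5000 and the default standardization are assumptions (5,000 tokens is consistent with the embedding parameter count reported later in the model summary):

from tensorflow.keras import layers

# a hedged sketch of the assumed vocabulary-learning step: a TextVectorization
# layer ("tv", as used below) adapted on the raw training reviews
tv = layers.TextVectorization(max_tokens=5000, output_mode="int")
tv.adapt(X_train_val)   # learn the vocabulary from the training Dataset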
Next, we use the learned vocabulary to transform each sequence of words that is
part of X_train_val into a corresponding sequence of integers.
# encode each sequence ("seq") that is part of the X_train_val Dataset
# object using the learned vocabulary
X_train_val_int_encoded_var_len = []
for seq in X_train_val:
    seq_encoded = tv([seq]).numpy()[0]
    X_train_val_int_encoded_var_len.append(seq_encoded)

decoded_sentence   # assumed to hold the first review decoded back to words via the vocabulary
X_train_val_int_encoded_var_len[0]
'[UNK] high is a cartoon comedy it ran at the same time as some other
,→[UNK] about school life such as teachers my [UNK] years in the
,→teaching [UNK] lead me to believe that [UNK] [UNK] satire is much
,→closer to reality than is teachers the [UNK] to survive [UNK] the
,→[UNK] students who can see right through their pathetic teachers
,→[UNK] the [UNK] of the whole situation all remind me of the schools i
,→knew and their students when i saw the episode in which a student
,→repeatedly tried to burn down the school i immediately [UNK] at high
,→a classic line inspector im here to [UNK] one of your teachers
,→student welcome to [UNK] high i expect that many adults of my age
,→think that [UNK] high is far [UNK] what a pity that it isnt'
array([ 10, 418, 3, 208, 11, 18, 226, 311, 101, 107, 1,
6, 33, 4, 164, 340, 5, 1946, 527, 934, 12, 10,
14, 1, 6, 68, 9, 80, 36, 49, 10, 672, 5,
1, 1, 27, 14, 61, 491, 6, 82, 220, 10, 14,
359, 1, 251, 2, 109, 5, 3216, 1, 53, 74, 3,
1785, 1, 251, 1001, 1, 17, 136, 1, 2, 1890, 5,
4, 50, 18, 7, 12, 9, 69, 3006, 17, 254, 1400,
11, 29, 117, 593, 12, 2, 417, 740, 60, 14, 2940,
46, 14, 3000, 33, 2162, 299, 2, 86, 357, 5, 2,
18, 3, 66, 1614, 6, 1670, 299, 2, 326, 357, 131,
1, 2, 740, 10, 22, 61, 208, 106, 362, 8, 1670,
19, 106, 371, 2269, 346, 15, 74, 258, 2660, 22, 6,
372, 253, 68, 98, 2570, 11, 18, 14, 85, 3, 10,
1405, 12, 23, 138, 68, 9, 156, 23, 1856])
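The analogous encoding of the test reviews is not shown in this excerpt. A minimal sketch, assumed by symmetry with the training set, that produces the X_test_int_encoded_var_len list used shortly is:

# encode the test reviews with the same learned vocabulary (assumed step)
X_test_int_encoded_var_len = []
for seq in X_test:
    X_test_int_encoded_var_len.append(tv([seq]).numpy()[0])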
So far, the encoded reviews have different lengths; for training, however, we need
sequences of the same length. Here we keep 200 tokens (which are, of course, integer
indices into the learned vocabulary) from each sequence. In this regard, we pad the
sequences: reviews that are shorter are (by default) pre-padded with zeros to become
200 long, while longer reviews are truncated (also from the beginning, by default).
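Before applying this to the reviews, the following toy snippet (with arbitrary values) illustrates the default padding and truncation behavior of keras.utils.pad_sequences:

from tensorflow import keras

seqs = [[5, 6], [1, 2, 3, 4, 5, 6, 7]]
print(keras.utils.pad_sequences(seqs, maxlen=5))
# [[0 0 0 5 6]     <- shorter sequence pre-padded with zeros
#  [3 4 5 6 7]]    <- longer sequence truncated from the beginning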
# unify the length of each encoded review to sequence_len. If the
# review is longer, it is cropped, and if it is shorter, it is padded
# with zeros (by default)
sequence_len = 200
X_train_val_padded_fixed_len = keras.utils.pad_sequences(
    X_train_val_int_encoded_var_len, maxlen=sequence_len)
X_test_padded_fixed_len = keras.utils.pad_sequences(
    X_test_int_encoded_var_len, maxlen=sequence_len)
X_train_val_padded_fixed_len[0:2]
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 322, 7, 4,
1078, 220, 9, 2085, 31, 2, 167, 62, 15, 47, 81,
1, 43, 400, 119, 136, 15, 4894, 56, 1, 148, 8,
2, 4946, 1, 480, 70, 6, 256, 12, 1, 1, 1972,
7, 73, 2363, 6, 641, 71, 7, 4894, 2, 1, 6,
2031, 1, 2, 1, 1422, 37, 69, 68, 205, 141, 65,
1216, 4894, 1, 2, 1, 5, 2, 219, 904, 32, 2929,
70, 5, 2, 4711, 10, 672, 3, 65, 1422, 51, 10,
208, 2, 383, 8, 60, 4, 1474, 3622, 775, 6, 3580,
187, 2, 400, 10, 1192, 1, 31, 322, 4, 350, 363,
2971, 142, 132, 6, 1, 29, 5, 123, 4894, 1474, 2409,
6, 1, 322, 10, 517, 12, 106, 1471, 5, 56, 582,
103, 12, 1, 322, 7, 234, 1, 49, 4, 2292, 12,
9, 207],
[ 1, 228, 335, 2, 1, 1, 33, 4, 1, 101, 30,
420, 21, 25, 1, 113, 1, 879, 81, 102, 571, 4,
246, 33, 2, 398, 5, 4676, 1, 2039, 3841, 34, 1,
37, 183, 4415, 156, 2236, 40, 345, 3, 40, 1, 1,
2151, 4635, 3, 1, 1, 2587, 37, 24, 449, 334, 6,
2, 1889, 493, 4164, 1, 207, 228, 22, 334, 6, 4463,
1, 1, 39, 27, 272, 117, 51, 107, 1004, 113, 30,
537, 42, 2805, 502, 42, 28, 1, 13, 131, 2, 116,
1961, 193, 4676, 3, 1, 268, 1659, 6, 114, 10, 249,
119, 4495, 6, 28, 29, 5, 3691, 2834, 1, 95, 113,
2545, 6, 107, 4, 220, 9, 265, 4, 4244, 506, 1054,
6, 25, 2602, 161, 136, 15, 1, 1, 182, 1, 42,
1, 16, 2, 549, 6, 120, 49, 30, 39, 252, 140,
4428, 156, 2236, 9, 2, 367, 262, 42, 21, 2, 81,
545, 245, 4, 375, 2081, 39, 32, 1004, 83, 82, 51,
35, 90, 118, 49, 6, 82, 17, 65, 276, 270, 35,
139, 192, 9, 6, 2, 3230, 297, 5, 747, 9, 39,
1, 1, 13, 42, 270, 11, 20, 77, 1, 23, 6,
328, 377]], dtype=int32)
Now we are in a position to start building and training the model. However, for
applications related to natural language processing, there is an important layer that
can significantly boost the performance and, at the same time, speed up the training.
The core principle behind using this layer, which is known as a word embedding
layer, is to represent each word by a relatively low-dimensional vector that is
learned during the training process. The idea of representing a word by a vector is
not really new; one-hot encoding was an example of that. Suppose we have a
vocabulary of 15 words and are given the following sequence of four words, all of
which belong to our vocabulary: “hi how are you”. One way to use one-hot encoding
to represent the sequence is shown in Fig. 15.3. However, using this approach we
lose information about the order of words within the sequence. As a result, it is
not suitable for models such as RNNs and CNNs that
are used with sequential data. Another way to keep the order of words using one-hot
encoding is depicted in Fig. 15.4. However, this approach leads to considerably
sparse, high-dimensional matrices. In word embedding, which is implemented by
keras.layers.Embedding, we represent each word with a low-dimensional vector
(the dimension is a hyperparameter) whose elements are estimated through the
learning algorithm (see Fig. 15.5). Given a batch of word sequences of size
“samples × sequence length”, the embedding “layer” in Keras returns a tensor of size
“samples × sequence length × embedding dimensionality”. Next, we use a
16-dimensional embedding layer, a GRU with 32 hidden units, and an early-stopping
callback with a patience parameter of 3 to train our classifier. The classifier shows
an accuracy of 86.4% on the test set.
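The returned shape is easy to verify with a toy example (the indices below are simply the positions of “hi how are you” in the 15-word vocabulary of Fig. 15.3):

import tensorflow as tf
from tensorflow.keras import layers

emb = layers.Embedding(input_dim=15, output_dim=5)  # vocabulary of 15 words, 5-dimensional vectors
batch = tf.constant([[6, 9, 7, 11]])                # one sequence of 4 word indices
print(emb(batch).shape)                             # (1, 4, 5): samples x sequence length x embedding dim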
        we  why  when  what  hey  were  hi  are  bye  how  they  you  is  not  yes
seq      0    0     0     0    0     0   1    1    0    1     0    1   0    0    0

Fig. 15.3: The use of one-hot encoding without preserving the order of words within
the sequence identified by “seq”: hi how are you.

        we  why  when  what  hey  were  hi  are  bye  how  they  you  is  not  yes
hi       0    0     0     0    0     0   1    0    0    0     0    0   0    0    0
how      0    0     0     0    0     0   0    0    0    1     0    0   0    0    0
are      0    0     0     0    0     0   0    1    0    0     0    0   0    0    0
you      0    0     0     0    0     0   0    0    0    0     0    1   0    0    0

Fig. 15.4: The use of one-hot encoding to preserve the order of words within the
sequence.
Fig. 15.5: An embedding of dimension 5 (in practice, the dimension is usually set
between 8 and 1024).
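The code that defines the network itself does not appear in this excerpt. The following is a minimal sketch that is consistent with the description above and with the 84,833 total parameters reported in the model summary below; the vocabulary size of 5,000 tokens and the single sigmoid output unit are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

# a hedged sketch of the assumed architecture: a 16-dimensional embedding over
# a 5,000-token vocabulary, a GRU with 32 hidden units, and one sigmoid output
imdb_model = keras.Sequential([
    layers.Embedding(input_dim=5000, output_dim=16),  # 5000*16 = 80,000 parameters
    layers.GRU(32),                                   # 3(16*32 + 32^2 + 2*32) = 4,800 parameters
    layers.Dense(1, activation="sigmoid"),            # 32 + 1 = 33 parameters
])
imdb_model.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])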
# training model
my_callbacks = [
keras.callbacks.EarlyStopping(
monitor="val_accuracy",
patience=3),
keras.callbacks.ModelCheckpoint(
filepath="best_model.keras",
monitor="val_accuracy",
save_best_only=True,
verbose=1)
]
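The call that launches training is likewise not shown. A hedged sketch that would produce the history object used later and a log of the form below is given here; batch_size=32 and validation_split=0.25 are assumptions inferred from the 586 training steps per epoch in the output:

# a hedged sketch of the assumed training call
history = imdb_model.fit(
    X_train_val_padded_fixed_len, y_train_val,
    validation_split=0.25, batch_size=32, epochs=100,
    callbacks=my_callbacks)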
Epoch 1/100
586/586 [==============================] - ETA: 0s - loss: 0.4868 - accuracy: 0.7461
Epoch 1: val_accuracy improved from -inf to 0.83408, saving model to best_model.keras
586/586 [==============================] - 32s 52ms/step - loss: 0.4868 - accuracy: 0.7461 - val_loss: 0.3796 - val_accuracy: 0.8341
Epoch 2/100
586/586 [==============================] - ETA: 0s - loss: 0.2927 - accuracy: 0.8811
Epoch 2: val_accuracy improved from 0.83408 to 0.86656, saving model to best_model.keras
586/586 [==============================] - 30s 50ms/step - loss: 0.2927 - accuracy: 0.8811 - val_loss: 0.3194 - val_accuracy: 0.8666
Epoch 3/100
586/586 [==============================] - ETA: 0s - loss: 0.2374 - accuracy: 0.9073
Epoch 3: val_accuracy did not improve from 0.86656
586/586 [==============================] - 30s 50ms/step - loss: 0.2374 - accuracy: 0.9073 - val_loss: 0.3407 - val_accuracy: 0.8578
Epoch 4/100
586/586 [==============================] - ETA: 0s - loss: 0.2070 - accuracy: 0.9202
Epoch 4: val_accuracy did not improve from 0.86656
586/586 [==============================] - 31s 53ms/step - loss: 0.2070 - accuracy: 0.9202 - val_loss: 0.3395 - val_accuracy: 0.8558
Epoch 5/100
586/586 [==============================] - ETA: 0s - loss: 0.1751 - accuracy: 0.9348
Epoch 5: val_accuracy did not improve from 0.86656
586/586 [==============================] - 32s 55ms/step - loss: 0.1751 - accuracy: 0.9348 - val_loss: 0.3844 - val_accuracy: 0.8664
print(history.history.keys())
print(imdb_model.summary())
best_imdb_model = keras.models.load_model("best_model.keras")
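The evaluation call that produces the test-set figures below is not shown either; a minimal sketch, assuming the padded test sequences and labels constructed earlier, is:

# a minimal sketch of the assumed evaluation step
test_loss, test_acc = best_imdb_model.evaluate(X_test_padded_fixed_len, y_test)
print("Test accuracy = {:.3f}".format(test_acc))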
=================================================================
Total params: 84,833
Trainable params: 84,833
Non-trainable params: 0
_________________________________________________________________
None
782/782 [==============================] - 8s 9ms/step - loss: 0.3210 -
accuracy: 0.8637
Test accuracy = 0.864
Fig. 15.6: The loss (left) and the accuracy (right) of the GRU classifier with a word
embedding layer as functions of epoch for both the training and the validation sets.
Exercises:
(B) Replace the GRU layer with an LSTM layer with the same number of hidden
units. What is the accuracy on the test set (make sure to use the embedding layer
as in the original network)?
(C) In this part, we wish to show the utility of CNNs for analyzing sequen-
tial data. In this regard, replace the GRU layer with a 1D convolutional
layer (layers.Conv1D) with 32 filters and kernel of size 7. Use a one-
dimensional max-pooling (layers.MaxPooling1D) with a pooling size of
2 (default), followed by a (one-dimensional) global average pooling layer
(GlobalAveragePooling1D). What is the accuracy on the test set (use the
embedding layer as in the original network)?
Exercise 2: In a regression problem, suppose we have an input sequence $(x_1, \ldots, x_T)$
where $x_t = t - 2$ for $t = 1, \ldots, T$ and $T \ge 2$. We would like to use the standard
RNN with a “tanh” activation function and a 2-dimensional hidden state $h_t$ for
$t = 1, \ldots, T$ to estimate an output variable at time $t$, denoted $y_t$, where $t = 1, \ldots, T$. Assume
$h_0 = [0, 0]^T$ and that all weight matrices/vectors and biases used in the structure of the
RNN are of appropriate dimensions with all elements being 1. What is $y_2$ (i.e., $y_t$ for $t = 2$)?
Exercise 4: Consider a GRU with two layers that is constructed by the standard RNN
stacking approach. In each layer we assume we have only one hidden unit. Let $h_t^j$
denote the hidden state at time $t$ for layer $j = 1, 2$. Suppose $x_1 = 0.5$, $h_0^j = 0$, $\forall j$, and
all weight coefficients are 1 and all biases are $-0.5$. What is $h_1^2$?
We use this network to classify word sequences of length 200. How many parameters
in total are learned for this network?
References
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., and Yu, D. (2014).
Convolutional neural networks for speech recognition. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, 22(10):1533–1545.
Abibullaev, B. and Zollanvari, A. (2019). Learning discriminative spatiospectral fea-
tures of erps for accurate brain-computer interfaces. IEEE Journal of Biomedical
and Health Informatics, 23(5):2009–2020.
Adam-Bourdarios, C., Cowan, G., Germain, C., Guyon, I., Kégl, B., and Rousseau,
D. (2015). The Higgs boson machine learning challenge. In Cowan, G., Germain,
C., Guyon, I., Kégl, B., and Rousseau, D., editors, Proceedings of the NIPS
2014 Workshop on High-energy Physics and Machine Learning, volume 42 of
Proceedings of Machine Learning Research, pages 19–55.
Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction on the
basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA, 99:6562–
6566.
Anaconda (2023). https://www.anaconda.com/products/individual, Last
accessed on 2023-02-15.
Anderson, T. (1951). Classification by multivariate analysis. Psychometrika, 16:31–
50.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seed-
ing. In SODA ’07: Proceedings of the eighteenth annual ACM-SIAM symposium
on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied
Mathematics.
Bache, K. and Lichman, M. (2013). UCI machine learning repository. University of
California, Irvine, School of Information and Computer Sciences.
Bay, H., Tuytelaars, T., and Van Gool, L. (2006). Surf: Speeded up robust features.
In Leonardis, A., Bischof, H., and Pinz, A., editors, Computer Vision – ECCV
2006, pages 404–417, Berlin, Heidelberg. Springer Berlin Heidelberg.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependen-
cies with gradient descent is difficult. IEEE Transactions on Neural Networks,
5(2):157–166.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human de-
tection. In 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), volume 1, pages 886–893.
Dechter, R. (1986). Learning while searching in constraint-satisfaction-problems.
In Proceedings of AAAI.
Devore, J. L., Farnum, N. R., and Doi, J. A. (2013). Applied Statistics for Engineers
and Scientists. Cengage Learning, third edition.
Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern
Recognition. Springer, New York.
Dougherty, E. R., Kim, S., and Chen, Y. (2000). Coefficient of determination in
nonlinear signal processing. Signal Processing, 80(10):2219–2235.
Drucker, H. (1997). Improving regressors using boosting techniques. In International
Conference on Machine Learning.
Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. Wiley.
Dudani, S. A. (1976). The distance-weighted k-nearest-neighbor rule. IEEE Trans-
actions on Systems, Man, and Cybernetics, SMC-6(4):325–327.
Dupuy, A. and Simon, R. M. (2007). Critical review of published microarray studies
for cancer outcome and guidelines on statistical analysis and reporting. J Natl.
Cancer Inst., 99:147–157.
Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on
cross-validation. Journal of the American Statistical Association, 78(382):316–
331.
Efron, B. (2007). Size, power and false discovery rates. The Annals of Statistics,
35(4):1351–1377.
f2py (2023). https://www.numfys.net/howto/F2PY/, Last accessed on 2023-
02-15.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Ann.
Eugen., 7:179–188.
Fisher, R. (1940). The precision of discriminant function. Ann. Eugen., 10:422–429.
Fix, E. and Hodges, J. (1951). Discriminatory analysis, nonparametric discrimi-
nation: consistency properties. Technical report. Randolph Field, Texas: USAF
School of Aviation Medicine.
Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency versus inter-
pretability of classification. Biometrics, 21(3):768–769.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-
line learning and an application to boosting. Journal of Computer and System
Sciences, 55(1):119–139.
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive Logistic Regression: a
Statistical View of Boosting. The Annals of Statistics, 38(2).
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting ma-
chine. Annals of Statistics, 29:1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics &
Data Analysis, 38:367–378.
Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network.
Biological Cybernetics, 20:121–136.
Jain, N. C., Indrayan, A., and Goel, L. R. (1986). Monte carlo comparison of six
hierarchical clustering methods on random data. Pattern Recognition, 19(1):95–
99.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3):241–
254.
Kaggle-survey (2021). https://www.kaggle.com/kaggle-survey-2021, Last
accessed on 2023-02-15.
Kass, G. V. (1980). An exploratory technique for investigating large quantities
of categorical data. Journal of the Royal Statistical Society. Series C (Applied
Statistics), 29(2):119–127.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction
to Cluster Analysis. John Wiley.
Kay, S. M. (2013). Fundamentals of Statistical Signal Processing, Volume III,
Practical Algorithm Development. Prentice Hall.
Kelley Pace, R. and Barry, R. (1997). Sparse spatial autoregressions. Statistics &
Probability Letters, 33(3):291–297.
keras-gru (2023). https://github.com/tensorflow/tensorflow/blob/
e706a0bf1f3c0c666d31d6935853cfbea7c2e64e/tensorflow/python/
keras/layers/recurrent.py#L1598, Last accessed on 2023-02-15.
keras-initializers (2023). https://keras.io/api/layers/initializers/,
Last accessed on 2023-02-15.
Keras-layers (2023). https://keras.io/api/layers/, Last accessed on 2023-
02-15.
Keras-losses (2023). https://keras.io/api/losses/, Last accessed on 2023-
02-15.
Keras-metrics (2023). https://keras.io/api/metrics/, Last accessed on
2023-02-15.
Keras-optimizers (2023). https://keras.io/api/optimizers/, Last accessed
on 2023-02-15.
Keras-team (2019). https://github.com/keras-team/keras/releases/
tag/2.3.0, Last accessed on 2023-02-15.
Kim, H. and Loh, W.-Y. (2001). Classification trees with unbiased multiway splits.
Journal of the American Statistical Association, 96(454):589–604.
Kittler, J. and DeVijver, P. (1982). Statistical properties of error estimators in per-
formance assessment of recognition systems. IEEE Trans. Pattern Anal. Machine
Intell., 4:215–220.
Krijthe, J. H. and Loog, M. (2014). Implicitly constrained semi-supervised linear
discriminant analysis. In Proceedings of the 22nd International Conference on
Pattern Recognition, pages 3762–3767.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with
deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and
Weinberger, K. Q., editors, Advances in Neural Information Processing Systems
25, pages 1097–1105.
Kruskal, J. (1964). Multidimensional scaling by optimizing goodness of fit to a
nonmetric hypothesis. Psychometrika, 29:1–27.
Kuiper, F. K. and Fisher, L. (1975). 391: A monte carlo comparison of six clustering
procedures. Biometrics, 31(3):777–783.
Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discrimi-
nant analysis. Technometrics, 10:1–11.
Lance, G. N. and Williams, W. T. (1967). A general theory of classificatory sorting
strategies 1. Hierarchical systems. The Computer Journal, 9(4):373–380.
Lawhern, V. J., Solon, A. J., Waytowich, N. R., Gordon, S. M., Hung, C. P., and
Lance, B. J. (2018). Eegnet: a compact convolutional neural network for eeg-based
brain-computer interfaces. Journal of Neural Engineering, 15(5):056013.
Lemoine, B. (2022). https://cajundiscordian.medium.com/
is-lamda-sentient-an-interview-ea64d916d917, Last accessed on
2023-02-15.
Lin, M., Chen, Q., and Yan, S. (2014). Network in network. In International
Conference on Learning Representations (ICLR).
Lloyd, S. P. (1982). Least squares quantization in pcm. IEEE Trans. Inf. Theory,
28(2):129–136.
Loh, W.-Y. and Shih, Y.-S. (1997). Split selection methods for classification trees.
Statistica Sinica, 7(4):815–840.
Loog, M. (2014). Semi-supervised linear discriminant analysis through moment-
constraint parameter estimation. Pattern Recognition Letters, 37:24–31.
Loog, M. (2016). Contrastive pessimistic likelihood estimation for semi-supervised
classification. IEEE Transactions on Pattern Analysis and Machine Intelligence,
38(3):462–475.
Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding variable
importances in forests of randomized trees. In Burges, C., Bottou, L., Welling,
M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information
Processing Systems, volume 26.
Lowe, D. (1999). Object recognition from local scale-invariant features. In Proceed-
ings of the Seventh IEEE International Conference on Computer Vision, volume 2,
pages 1150–1157 vol.2.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011).
Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language Tech-
nologies, pages 142–150, Portland, Oregon, USA. Association for Computational
Linguistics.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate
observations. In Cam, L. M. L. and Neyman, J., editors, Proc. of the fifth Berkeley
Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297.
University of California Press.
matplotlib-colors (2023). https://matplotlib.org/stable/gallery/color/
named_colors.html, Last accessed on 2023-02-15.
matplotlib-imshow (2023). https://matplotlib.org/stable/api/_as_gen/
matplotlib.axes.Axes.imshow.html, Last accessed on 2023-02-15.
matplotlib-markers (2023). https://matplotlib.org/stable/api/markers_
api.html#module-matplotlib.markers, Last accessed on 2023-02-15.
ensemble.GradientBoostingRegressor, 229
ensemble.RandomForestClassifier, 218
ensemble.RandomForestRegressor, 218
feature_selection.RFECV, 299
feature_selection.RFE, 299
feature_selection.SelectFdr, 290
feature_selection.SelectFpr, 290
feature_selection.SelectFwe, 290
feature_selection.SelectKBest, 290
feature_selection.f_classif, 290
feature_selection.f_regression, 290
feature_selection.SequentialFeatureSelector, 295
impute.SimpleImputer, 181
linear_model.ElasticNet, 180
linear_model.Lasso, 180
linear_model.LinearRegression, 179
linear_model.Ridge, 179
metrics.RocCurveDisplay, 281
metrics.accuracy_score, 259
metrics.confusion_matrix, 259
metrics.f1_score, 260
metrics.precision_score, 260
metrics.recall_score, 260
metrics.roc_auc_score, 260
metrics.roc_curve, 281
metrics.silhouette_samples, 333
metrics.silhouette_score, 333
model_selection.GridSearchCV, 271, 274
model_selection.KFold, 245
model_selection.LeaveOneOut, 246
model_selection.PredefinedSplit, 274
model_selection.RandomizedSearchCV, 276
model_selection.ShuffleSplit, 249
model_selection.StratifiedKFold, 249
model_selection.cross_val_score, 248, 262, 263
model_selection.cross_validate, 263
model_selection.train_test_split, 116
neighbors.KNeighborsClassifier, 123
neighbors.KNeighborsRegressor, 144
pipeline.Pipeline, 308, 312
preprocessing.LabelEncoder, 130
preprocessing.MinMaxScaler, 122
preprocessing.OrdinalEncoder, 130
preprocessing.StandardScaler, 122
tree.DecisionTreeClassifier, 199
tree.DecisionTreeRegressor, 199
utils.Bunch, 113
sklearn, see scikit-learn
sparse_categorical_crossentropy, 367, 369
tensorflow, 17, 358
xgboost, 17, 234
accuracy, 127, 253
AdaBoost, 220
AdaBoost.R2, 222, 223
AdaBoost.SAMME, 220, 222
agent, 5
agglomerative clustering, 336, 346
alternative hypothesis, 287
Anaconda, 21
analysis of variance, 287, 290
ANOVA, see analysis of variance
apparent error, 239