Machine Learning Q and AI
Contents

Preface
    Who Is This Book For?
    What Will You Get Out of This Book?
    How To Read This Book
    Sharing Feedback and Supporting This Book
    Acknowledgements
    About the Author
    Copyright and Disclaimer
    Credits
Introduction
Afterword
Acknowledgements
Writing a book is an enormous undertaking. This project would not
have been possible without the help of the open source and machine
learning communities who collectively created the technologies
that this book is about. Moreover, I want to thank everyone who
encouraged me to share my flashcard decks, as this book is an
improved and polished version of these.
I also want to thank the following readers for helpful feedback on
the manuscript:
Credits
Cover image by ECrafts / stock.adobe.com.
Introduction
Thanks to rapid advancements in deep learning, we have seen a
significant expansion of machine learning and AI in recent years.
On the one hand, this rapid progress is exciting if we expect these
advancements to create new industries, transform existing ones,
and improve the quality of life for people around the world.
On the other hand, the rapid emergence of new techniques can make keeping up challenging and time-consuming. Nonetheless, staying current with the latest developments in AI and deep learning is essential for professionals and organizations that use these technologies.
With this in mind, I began writing this book in the summer of 2022
as a resource for readers and machine learning practitioners who
want to advance their understanding and learn about useful tech-
niques that I consider significant and relevant but often overlooked
in traditional and introductory textbooks and classes.
I hope readers will find this book a valuable resource for obtaining
new insights and discovering new techniques they can implement
in their work.
Happy learning,
Sebastian
Chapter 1. Neural Networks and Deep Learning
Q7. Multi-GPU Training Paradigms
Data parallelism
Data parallelism has been the default mode for multi-GPU train-
ing for several years. Here, we divide a minibatch into smaller
microbatches. Then, each GPU processes a microbatch separately
to compute the loss and loss gradients for the model weights. After
the individual devices process the microbatches, the gradients are
combined to compute the weight update for the next round.
An advantage of data parallelism over model parallelism is that the
GPUs can run in parallel – each GPU processes a portion of the
training minibatch, a microbatch. However, a caveat is that each
GPU requires a full copy of the model. This is obviously not feasible
if we have large models that don’t fit into the GPU’s VRAM.
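To make this concrete, here is a minimal sketch of data parallelism using PyTorch's DistributedDataParallel. It assumes the script is launched with torchrun (one process per GPU), and a toy linear model and random microbatch stand in for a real network and data loader.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, etc.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 2).to(local_rank)    # each GPU holds a full model copy
    ddp_model = DDP(model, device_ids=[local_rank])  # DDP all-reduces gradients during backward
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # Toy microbatch for this GPU; a real run would use a DistributedSampler
    x = torch.randn(8, 10, device=local_rank)
    y = torch.randint(0, 2, (8,), device=local_rank)

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()        # gradients are averaged across all GPUs here
    optimizer.step()       # every replica applies the same weight update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()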
Tensor parallelism
Tensor parallelism (also referred to as intra-op parallelism) is a more efficient form of model parallelism (inter-op parallelism). Here, the weight and activation matrices are spread across the devices instead of distributing whole layers across devices. Specifically, the individual matrices are split, so a single matrix multiplication is carried out jointly across GPUs.
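As a toy illustration of this idea (a single process, two hypothetical "devices," and one linear layer without bias), the weight matrix below is split column-wise so that each device computes only part of the matrix multiplication:

import torch

x = torch.randn(4, 8)                   # input activations: batch of 4, hidden size 8
W = torch.randn(8, 16)                  # full weight matrix of one layer

W_dev0, W_dev1 = W.chunk(2, dim=1)      # split the weight matrix column-wise across devices

y_dev0 = x @ W_dev0                     # each device multiplies the (replicated) input
y_dev1 = x @ W_dev1                     # with its own weight shard

y = torch.cat([y_dev0, y_dev1], dim=1)  # gather the partial results (an all-gather on real GPUs)

assert torch.allclose(y, x @ W, atol=1e-6)  # same result as the unsharded matrix multiplication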
Pipeline parallelism
Pipeline parallelism is definitely an improvement over model parallelism, even though it is not perfect, and there will be idle bubbles. However, for modern architectures that are too large to fit into GPU memory, it is nowadays more common to use a blend of data parallelism and tensor parallelism techniques (as opposed to model parallelism).
Sequence parallelism
Sequence parallelism is a new concept developed for transformer
models¹¹. One shortcoming of transformers is that the self-attention mechanism (the original scaled dot-product attention) scales quadratically with the input sequence length. There are, of course, more efficient alternatives to the original attention mechanism that scale linearly¹²¹³; however, they are less popular, and most people prefer the original scaled dot-product attention mechanism.
Sequence parallelism splits the input sequence into smaller chunks
that can be distributed across GPUs to work around memory
limitations as illustrated in the figure below.
¹¹Li, Xue, Baranwal, Li, and You (2021). Sequence Parallelism: Long Sequence Training from
[a] System[s] Perspective, https://arxiv.org/abs/2105.13120.
¹²Tay, Dehghani, Bahri, and Metzler (2020). Efficient Transformers: A Survey,
https://arxiv.org/abs/2009.06732.
¹³Zhuang, Liu, Pan, He, Weng, and Shen (2023). A Survey on Efficient Training of Trans-
formers, https://arxiv.org/abs/2302.01107.
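To make the idea concrete, the following toy sketch assumes two hypothetical devices, a single attention head, and keys and values that are fully visible to each device (a real implementation exchanges key/value chunks between GPUs). The queries are split along the sequence dimension, and each device computes attention only for its own chunk:

import torch

def scaled_dot_product_attention(q, k, v):
    # standard scaled dot-product attention
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(123)
seq_len, d = 8, 16
Q, K, V = torch.randn(seq_len, d), torch.randn(seq_len, d), torch.randn(seq_len, d)

# Reference: attention over the full sequence on a single device
full_out = scaled_dot_product_attention(Q, K, V)

# Sequence parallelism: each "device" handles one chunk of the query sequence
chunks = [scaled_dot_product_attention(q_chunk, K, V) for q_chunk in Q.chunk(2, dim=0)]
combined = torch.cat(chunks, dim=0)

assert torch.allclose(full_out, combined, atol=1e-6)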
Q16. “Self”-Attention
This content is not available in the sample book. The book can be purchased on Leanpub at http://leanpub.com/machine-learning-q-and-ai.
Q17. Encoder- And Decoder-Style Transformers
In the figure above, the input text (that is, the sentences of the text that is to be translated) is first tokenized into individual word tokens, which are then encoded via an embedding layer¹⁷ before they enter the encoder part. Then, after adding a positional encoding vector to each embedded word, the embeddings go through a multi-head self-attention layer. The multi-head attention layer is followed by an “Add & normalize” step, which performs a layer normalization and adds the original embeddings via a skip connection (also known as a residual or shortcut connection). Finally, after entering a “fully connected layer,” which is a small multilayer perceptron
¹⁷See Q1 for more details about embeddings.
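For reference, here is a minimal PyTorch sketch of one such encoder block (the layer sizes follow the original transformer's defaults; dropout and masking are omitted for brevity):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(              # the "fully connected layer" (a small MLP)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                      # x: token embeddings + positional encodings
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + attn_out)           # "Add & normalize" with a skip connection
        x = self.norm2(x + self.mlp(x))        # MLP followed by another "Add & normalize"
        return x

# Usage: a batch of 2 sequences, each with 10 embedded tokens
tokens = torch.randn(2, 10, 512)
out = EncoderBlock()(tokens)                   # shape: (2, 10, 512)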
Figure 17.3. Illustration of the next-sentence prediction task used in BERT.
Figure 17.4. An overview of some of the most popular large language trans-
formers organized by architecture type and developers.
Q21. Data-Centric AI
This content is not available in the sample book. The book can be purchased on Leanpub at http://leanpub.com/machine-learning-q-and-ai.
Q30. Limited Labeled Data
4) Self-supervised learning
Self-supervised learning is similar to transfer learning in that the model is pretrained on a different task before it is finetuned to a target task for which only limited data exists. However, in contrast to transfer learning, self-supervised learning usually relies on label information that can be directly and automatically extracted from unlabeled data. Hence, self-supervised learning is also often called unsupervised pretraining. Common examples include “next word” prediction (e.g., used in GPT) or “masked word” prediction (e.g., used in BERT) in language modeling. An intuitive example from computer vision is inpainting: predicting a missing part of an image that was randomly removed. (For more details about self-supervised learning, also see Q2.)
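The following toy sketch (with made-up token IDs) shows how such self-supervised labels can be derived directly from unlabeled text, for both next-word and masked-word prediction:

import torch

token_ids = torch.tensor([11, 42, 7, 99, 3, 25])   # an unlabeled token sequence

# Next-word prediction (GPT-style): inputs and targets are shifted views of the same data
inputs = token_ids[:-1]    # the model sees:     [11, 42,  7, 99,  3]
targets = token_ids[1:]    # the model predicts: [42,  7, 99,  3, 25]

# Masked-word prediction (BERT-style): hide random tokens and predict only those
mask_token_id = 0
mask = torch.rand(token_ids.shape) < 0.15          # mask roughly 15% of the tokens
masked_inputs = token_ids.clone()
masked_inputs[mask] = mask_token_id
masked_targets = token_ids[mask]                   # the hidden tokens serve as the labels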
²⁶While decision trees for incremental learning are not commonly
implemented, algorithms for training decision trees in an iterative fashion do exist
(https://en.wikipedia.org/wiki/Incremental_decision_tree).
5) Active learning
In active learning, we typically involve manual labelers or users for
feedback during the learning process. However, instead of labeling
the entire dataset upfront, active learning includes a prioritization
scheme for suggesting unlabeled data points for labeling that
maximize the machine learning model’s performance.
The name active learning refers to the fact that the model is actively
selecting data for labeling in this process. For example, the simplest
form of active learning selects data points with high prediction
uncertainty for labeling by a human annotator (also referred to as
an oracle).
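A minimal sketch of this uncertainty-based selection is shown below; the predicted probabilities are placeholders, whereas in practice they would come from the current model applied to the unlabeled pool:

import torch

def select_for_labeling(probas: torch.Tensor, budget: int) -> torch.Tensor:
    # probas: predicted class probabilities for the unlabeled pool, shape (num_examples, num_classes)
    entropy = -(probas * probas.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.topk(budget).indices                # indices of the most uncertain examples

probas = torch.softmax(torch.randn(1000, 3), dim=1)    # placeholder model outputs
to_label = select_for_labeling(probas, budget=10)      # send these examples to the oracle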
6) Few-shot learning
8) Weakly supervised learning
In short, weakly supervised learning is an approach for increasing
the number of labeled instances in the training set. Hence, other
techniques, such as semi-supervised, transfer, active, and zero-shot
learning, are fully compatible with weakly supervised learning.
9) Semi-supervised learning
Semi-supervised learning is closely related to weakly supervised
learning described above: we create labels for unlabeled instances
in the dataset. The main difference between weakly supervised and
semi-supervised learning is how we create the labels²⁸.
In weak supervision, we create labels using an external labeling function that is often noisy or inaccurate, or that covers only a subset of the data. In semi-supervision, we do not use an external labeling function but instead leverage the structure of the data itself.
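As a toy contrast between the two approaches (the spam example, keyword rule, and feature values below are made up), a weak-supervision labeling function assigns noisy labels by rule, whereas semi-supervision derives labels from the data's structure, for example via nearest labeled neighbors:

import torch

# Weak supervision: a noisy, hand-written labeling function (1 = spam, None = abstain)
def labeling_function(text: str):
    return 1 if "free" in text.lower() else None

unlabeled_emails = ["WIN a FREE prize now!!!", "Meeting moved to 3 pm", "free $$$ click here"]
weak_labels = [labeling_function(e) for e in unlabeled_emails]   # [1, None, 1]

# Semi-supervision: exploit the structure of the data itself, e.g., assign an
# unlabeled point the label of its nearest labeled neighbor (toy 2D features)
labeled_x = torch.tensor([[0.0, 0.0], [5.0, 5.0]])
labeled_y = torch.tensor([0, 1])
unlabeled_x = torch.tensor([[0.2, 0.1], [4.8, 5.3]])

dists = torch.cdist(unlabeled_x, labeled_x)        # pairwise distances to the labeled points
semi_labels = labeled_y[dists.argmin(dim=1)]       # tensor([0, 1])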
Figure 30.8. Illustration of the two main types of multi-task learning. For simplicity, the figure depicts only two tasks, but multi-task learning can be used for any number of tasks.
The figure above illustrates the difference between hard and soft
parameter sharing. In hard parameter sharing, only the output
layers are task-specific, while all tasks share the same hidden
layers and neural network backbone architecture. In contrast, soft
parameter sharing uses separate neural networks for each task, but
regularization techniques such as distance minimization between
parameter layers are applied to encourage similarity among the
networks.
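For illustration, here is a minimal PyTorch sketch of hard parameter sharing (toy layer sizes; a classification head and a regression head stand in for arbitrary tasks):

import torch
import torch.nn as nn

class HardSharingMultiTaskNet(nn.Module):
    def __init__(self, num_features=32, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(          # hidden layers shared by all tasks
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head_task1 = nn.Linear(hidden, 3)  # e.g., a 3-class classification task
        self.head_task2 = nn.Linear(hidden, 1)  # e.g., a regression task

    def forward(self, x):
        shared = self.backbone(x)               # one shared representation ...
        return self.head_task1(shared), self.head_task2(shared)  # ... two task-specific outputs

x = torch.randn(8, 32)
logits_task1, pred_task2 = HardSharingMultiTaskNet()(x)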
12) Multi-modal learning
While multi-task learning involves training a model with multiple
tasks and loss functions, multi-modal learning focuses on incorpo-
rating multiple types of input data.
Common examples of multi-modal learning are architectures that
take both image and text data as input³¹. Depending on the task,
³⁰Ruder (2017). An Overview of Multi-Task Learning in Deep Neural Networks.
https://www.ruder.io/multi-task/.
³¹Multi-modal learning is not restricted to only two modalities and can be used for any number of input modalities.
The figure above shows image and text encoders as separate com-
ponents. The image encoder can be a convolutional backbone or
a vision transformer, and the language encoder can be a recurrent
neural network or language transformer. However, it’s common
nowadays to use a single transformer-based module that can simul-
taneously process image and text data³².
Optimizing a matching loss, as shown in the previous figure, can
be useful for learning embeddings that can be applied to various
tasks, such as image classification or summarization. However, it is
also possible to directly optimize the target loss, like classification
or regression, as the figure below illustrates.
³²For example, VideoBERT is a model with a joint module that processes both video and text for action classification and video captioning. Reference: Sun, Myers, Vondrick, Murphy, and Schmid (2019). VideoBERT: A Joint Model for Video and Language Representation Learning, https://arxiv.org/abs/1904.01766.
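A minimal sketch of such a dual-encoder setup with a contrastive matching loss is shown below; the encoders are reduced to single linear projections over precomputed features, and the temperature value is an arbitrary choice:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)   # stand-in for an image encoder
        self.text_proj = nn.Linear(text_dim, embed_dim)     # stand-in for a text encoder

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt

def matching_loss(img, txt, temperature=0.07):
    # Contrastive matching loss: the i-th image should match the i-th text
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.shape[0])
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

image_feats, text_feats = torch.randn(4, 2048), torch.randn(4, 768)
img, txt = DualEncoder()(image_feats, text_feats)
loss = matching_loss(img, txt)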