Learning Representatio For Images Heirarchical Labels

arXiv:2004.00909v2 [cs.
LG] 11 Apr 2020
Learning Representations For Images

With Hierarchical Labels
Master’s Thesis
Ankit Dhall
2019
Advisors: Anastasia Makarova, Dr. Octavian-Eugen Ganea, Dario Pavllo

Prof. Dr. Andreas Krause
Department of Computer Science, ETH Zürich
Abstract
Image classification has been studied extensively but there has been limited
work in the direction of using non-conventional, external guidance other than
traditional image-label pairs to train such models. In this thesis we present
a set of methods to leverage information about the semantic hierarchy in-
duced by class labels. In the first part of the thesis, we inject label-hierarchy
knowledge to an arbitrary classifier and empirically show that availability of
such external semantic information in conjunction with the visual semantics
from images boosts overall performance. Taking a step further in this direc-
tion, we model more explicitly the label-label and label-image interactions
by using order-preserving embedding-based models, prevalent in natural lan-
guage, and tailor them to the domain of computer vision to perform image
classification. Although, contrasting in nature, both the CNN-classifiers in-
jected with hierarchical information, and the embedding-based models out-
perform a hierarchy-agnostic model on the newly presented, real-world ETH
Entomological Collection image dataset [11].
i
Acknowledgements
I would like to thank Prof. Dr. Andreas Krause and Anastasia Makarova for
believing in me and granting me the opportunity to work on this thesis in
collaboration with the Institute of Machine Learning at ETH Zurich. I am
grateful to Dr. Octavian-Eugen Ganea and Dario Pavllo for coming on-board
the project and sharing their ideas and insight. It was great collaborating and
brainstorming with all my supervisors who made this an extremely enriching
experience.
I would extend my gratitude to Dr. Michael Greeff from the ETH Entomologi-
cal Collection for allowing access to their collection and Maximiliane Okonnek
from the ETH Library Lab for ensuring that this project has a meaningful and
significant impact for the scientific community long after its completion.
I would like to thank my friends for their support. I am grateful to my family
for their understanding, support and the unconditional freedom to pursue
my goals.
ii
Contents
Contents iii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Leveraging label-label interactions . . . . . . . . . . . . . . . . 1
1.1.2 Long-tailed data distributions . . . . . . . . . . . . . . . . . . 2
1.1.3 Visual similarity does not imply semantic similarity . . . . . 2
1.1.4 Uncovering the black-box model . . . . . . . . . . . . . . . . . 3
1.2 Predicting Taxonomy for Scientific Collections . . . . . . . . . . . . . 3
1.2.1 ETHEC dataset: a new entomological image dataset with
label-hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Injecting label-hierarchy information to improve CNN classi-
fiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Performing image classification by jointly embedding labels
and images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Contributions Summary . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Problem Statement & Background 11

2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Embedding based models for text and language . . . . . . . . 11
2.1.2 Embedding based models for images . . . . . . . . . . . . . . 13
2.1.3 Convolutional Neural Networks based models . . . . . . . . 14
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Order-embeddings . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Euclidean Cones . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Hyperbolic Cones . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 Optimization in Hyperbolic Space . . . . . . . . . . . . . . . . 17
2.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Hierarchical CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Hierarchical Fashion MNIST . . . . . . . . . . . . . . . . . . . 18
2.3.3 ETH Entomological Collection . . . . . . . . . . . . . . . . . . 18
2.4 CNN-backbones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
iii
Contents
2.4.1 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Methods: Injecting label-hierarchy into CNN classifiers 23

3.1 Hierarchy-agnostic classifier . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Per-class decision boundary (PCDB) . . . . . . . . . . . . . . . 24
3.1.2 One-fits-all decision boundary (OFADB) . . . . . . . . . . . . 24
3.2 Per-level classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Marginalization (bottom up) . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Masked Per-level classifier . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Hierarchical Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 Empirical Analysis: Injecting label-hierarchy into CNN classifiers 31

4.1 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Hierarchical CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.1 Per-level classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.2 Hierarchy-agnostic classifier . . . . . . . . . . . . . . . . . . . 33
4.3 Hierarchical Fashion MNIST . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 ETHEC Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.2 Per-level classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.3 Marginalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.4 Masked Per-level classifier . . . . . . . . . . . . . . . . . . . . 46
4.4.5 Hierarchical Softmax . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4.6 Results Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Methods: Order-preserving embedding-based models 51

5.1 Cosine Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Order-Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Euclidean Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 Hyperbolic Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5 Embedding Label-Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6 Jointly Embedding Images with Label-Hierarchy . . . . . . . . . . . 56
6 Empirical Analysis: Order-preserving embedding-based models 59

6.1 Embedding Label-Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1.1 Graph reconstruction quality for label-embeddings . . . . . . 60
6.1.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Jointly Embedding Images with Label-Hierarchy . . . . . . . . . . . 63
6.2.1 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2.2 Hierarchical level-wise classification performance . . . . . . . 65
6.2.3 W’s model capacity . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2.4 Sampling strategy . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.5 Choice of Optimizer . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.6 Label initialization for joint-embeddings . . . . . . . . . . . . 67
6.2.7 Inverted cosine embeddings and euclidean cones . . . . . . . 67
iv
Contents
7 Conclusions 69
7.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Bibliography 73
v
Chapter 1
Introduction
1.1 Motivation
In machine learning, the task of classification is traditionally performed using soft-
max and one compares class scores and returns the highest scoring label as the
prediction. Such an approach safely assumes that categories might not be corre-
lated among each other. Contrary to this assumption, in many commonly used
datasets, labels are correlated and can be agglomerated to create more abstract
concepts which are made up of a collection of relatively specific concepts. For
instance jeans, t-shirt, rain-jacket and ball-gown are all dresses. Only a
handful of previous works have used hierarchical information in the context of
computer vision. Among them, in [34] the label-hierarchy from WordNet [30]
is used to consolidate data across various datasets. On another occasion, [10]
show how to optimize the trade-off between accuracy and fine-grained-ness of the
predicted class, but their proposed method only considers the label-hierarchy (=se-
mantic similarity) and therefore disregards the visual similarity when performing
this optimization.
Even though a classifier might not be able to distinguish between two breeds
of dogs, it can still predict a more abstract yet correct label, dog. Predicting la-
bels at different levels of abstractions can help catch errors when predicting more
fine-grained labels and hence provide more meaningful predictions. Labels with
varying levels of abstraction may also be beneficial for further downstream tasks
that involve both natural language and computer vision such as image captioning,
scene graph generation and visual-question answering (VQA). This work tries to
exploit semantic information available in the form of hierarchical labels. We show
that visual models when provided such guidance outperform a hierarchy-agnostic
model. We also show how these models can be made more interpretable by using
more explicity representation models such as embeddings for the task of image
classification.
1.1.1 Leveraging label-label interactions

Image classification models are usually designed as flat N-way classifiers. Origi-
nally, these models relied on hand-crafted features but nowadays use learnable
1
1. Introduction
convolutional filters to extract image features. These convolutional layers are

tuned during the training procedure to maximize classification performance. Ini-
tial convolution layers contain simpler, more generic feature extractors for edges
and blobs and as one moves through the cascade of filters, these meld together
to extract more complex visual features such as textures and patters and eventu-
ally parts of objects and finally whole objects themselves. Such models perform
classification solely on the basis of visual signals. These models only capture the
label-image interactions and do not use additional information available about
the inter-label interaction that could boost performance and additionally make
the model more understandable.
1.1.2 Long-tailed data distributions

Data imbalance is a common sight in the real-world machine learning setting. It is
often the case that only a handful of images are available for some of the classes.
A plausible explanation could be when the object of interest occurs infrequently
in the domain from which the data is collected. In life science it could be a rarely
occurring anomaly while in image-based datasets it could be an object that is seen
less often than others.
If one were to arrange the labels in the form of a hierarchy or a directed acyclic
graph (DAG), classes that represent more abstract data would usually occupy the
upper levels while more specific classes would be their descendants, forming the
lower levels of the hierarchy. The data distribution is such that there are fewer
classes in the upper levels but, on average, have a larger number of data points for
a given label. The distribution gradually change trends as one traverses down to
the lower levels in the label hierarchy. At the bottom level, the complete opposite
holds, the levels have a large number of labels but with least amount of data per
label. This leads to the formation of a long tailed distribution with the classes
in the upper level contributing to a large number of samples while classes in the
lower level forming the long tail.
Such long tail distributions are not best-suited for machine learning models whose
generalization capabilities rely largely on the availability of large amounts of data
for each label. Leveraging auxiliary information could be of help in the presence
of long tail distributions. Using this, coupled with visual features the model is
able to relate classes across different levels and can exploit information from the
data-rich upper levels in the hierarchy.
For instance, if a particular label lower in the hierarchy has only a handful of data,
it can still share visual information about labels via its siblings (i.e. which share
the same parent) if information about the hierarchy is injected into the model.
Usually concepts that are siblings or belong to the same sub-tree of the hierarchy
have commonalities among them this can be exploited by the model if information
about label-label interactions is used.
1.1.3 Visual similarity does not imply semantic similarity

Visual models rely on image based features to distinguish between different ob-
jects. But more often than not semantically related classes might exhibit marked
2
1.2. Predicting Taxonomy for Scientific Collections
visual dissimilarity. Sometimes it might even be the case that the intra-class vari-
ance of visual features for a single label is larger than the inter-class variance. In
such scenarios learned representations for two instance with different visual ap-
pearance would be coerced away from each other, indirectly affecting the image
understanding capability of the model. In fig. 1.1 one can notice how semantic sim-
ilarity and visual similarity are different concepts but are both essential to achieve
better visual understanding.
(a) orange (b) clock (c) clock (d) clock
Figure 1.1: Although an orange and a basketball-themed clock have visual simi-
larity, they are semantically unrelated. On the other hand, the digital, analog and
basketball-themed clock are all visually distinct from each other but semantically
similar as all of them are instances of clock. By introducing auxiliary information
in the form of the label hierarchy such confusion could be avoided by models that
only pay attention to visual features. Image credits: Wikimedia, Lucky retail, Amazon,
Pixabay
1.1.4 Uncovering the black-box model

If a human is tasked with classifying an image, the natural way to proceed is
to identify the membership of the image to abstract concepts or labels and then
move to increasingly detailed labels that provide more fine-grained understanding
of the object in question. Even if an untrained eye cannot tell apart an Alaskan
Malamute from a Siberian Husky, it is more likely to at least get the concept of
mammal and its sub-concept dog correct.
Similarly, using the label hierarchy to guide the classification models we are able to
bridge the gap in the way machines and humans deal with visual understanding.
Incorporating such auxiliary information positively affects the explainability and
interpretability of image understanding models.
1.2 Predicting Taxonomy for Scientific Collections

One of the main goals of this work is to assist natural collections, museums and
other similar organizations that maintain a large library of biodiversity including
both the flora and fauna. A lot of amateur collectors maintain their personal
collections of insects and butterflies over their lifetime. Eventually most of these
are donated and end up at collections and museums. With more than 2,000,000
3
1. Introduction
Figure 1.2: The stark resemblance between an Alaskan Malamute and a Siberian
Husky would make the life of an image classification model tough as it relies
solely on visual features. Image credits: Karin Newstrom, Animal Photography; Sally Anne
Thompson, Animal Photography
specimens, the ETH Zürich Entomological Collection is one of the largest insect
collections in Central Europe.
The collection needs to sort these specimens according to their taxonomies. The
process involves hiring of an external specialist who specializes in particular fam-
ilies of these organisms. The process of sorting these is not only expensive but is
also constrained by the number of available specialists. If this resource intensive
task could be preceded by a pre-sorting procedure where these specimens are cat-
egorized based on their family, sub-family, genus and species in that order, it
would make the complete process more economical.
With the help of data and machine learning, such a repetitive can be facilitated by
non-specialists, largely cutting the costs. For example, in Switzerland, from 120
CHF per hour to 28 CHF per hour.
Annually, 40,000 specimens are donated to the ETHEC by the public. If this tech-
nology is accessible to the general public, the collection will already receive pre-
sorted specimen, making their task simpler. A 100 million euros initiative begin-
ning in 2019 will develop standards to integrate digitization across European insti-
tutions (DiSSCo [1]). In Switzerland a similar initiative is underway (SwissCollNet
[13]).
In this work, we particularly focus on the entomology of insects, more specifically

the butterflies. A digitized version from the ETH’s collection was used to create a
dataset [11] to perform empirical analysis using the methods investigated in this
work.
4
1.2. Predicting Taxonomy for Scientific Collections
1.2.1 ETHEC dataset: a new entomological image dataset with label-

hierarchy
We present a new dataset with images and the corresponding inter-label relation-
ships in addition to the generally provided image-label relationships. The chal-
lenging dataset provides a good foundation to build upon for the rest of the work
by using it to evaluate experiments.
The ETH Entomological Collection (ETHEC) dataset [11] has been directly taken
from the field and is representative of a real-world dataset with imbalance not only
in terms of the images per sample but also there is a significant disparity between
classes and their descendant sub-classes as some of these sub-trees are dispropor-
tionately sized in terms of nodes. In fig. 1.3 we illustrate the data distribution for
each label in the ETHEC hierarchy.
Figure 1.3: The diagram shows the image distribution across each labels from the
4 levels of the hierarchy: 6 family, 21 sub-family, 135 genus and 550 species.
The x-axis represents the number of images for a particular label and the ticks on
the y-axis represent each label. For clarity, we have omitted the labels for genus
and species.
With these peculiarities the dataset is representative of real-world scenarios and

is more realistic as compared to the optimistic CIFAR [22] or ImageNet [9] that
have balanced classes. The proposed dataset provides a challenging addition to
the multi-label image classification task.
The ETHEC dataset is the digitized form of a subset of the collection which con-
tains 47,978 butterfly specimens with 723 labels spread across 4 levels of hierar-
chical labels: 6 family, 21 sub-family, 135 genus and 550 species (561 genus +
species combinations). They are interesting to look at from a computer vision
perspective as they are the most visual specimens in the collection and these cues
can be used as features to make label prediction and distinguishing specimens.
5
1. Introduction
Figure 1.4: Sample images and their 4-level labels from the ETHEC dataset. The
dataset consists of 47,978 butterfly specimens with 723 labels spread across 4 levels
of hierarchical labels: 6 family, 21 sub-family, 135 genus and 550 species.
The images are taken from the digitized collection at the ETH Entomological Col-
lection. We pre-process them to remove any visual signals (barcodes, text labels,
markings) that might leak label information about the specimen to a visual model.
We also crop the images to lie at the center of the image and resize them to
448 × 448. We also provide metadata and labels for each of the 4 levels in the
label-hierarchy. The dataset is split into train, val and test as 80-10-10. For
labels with fewer than 10 images, we split the images equally between the three
sets.
The dataset is been made publicly available, and can be found at the open-access
link: https://www.research-collection.ethz.ch/handle/20.500.11850/365379.
1.3 Contributions
1.3.1 Injecting label-hierarchy information to improve CNN classifiers
The work proposes, in addition to a hierarchy-agnostic baseline, 4 different meth-

ods of passing on knowledge about the label-hierarchy to a classifier to boost
performance over a hierarchy agnostic classifier. Each method differs in the way
they make this information available to the classifier and also the kind of infor-
mation injected. So on top of the traditionally available image-label pairs during
training, the methods provide additional label-label information as well.
The proposed methods are agnostic to the kind of features used or in general
the feature extractor and can be easily extended to any classifier whose labels are
arranged in a hierarchy. Since, the work tackles image classification, we use well-
known visual feature extractor convolutional neural networks (CNNs) [18, 23, 36]
in our experiments. Although, there are works that propose modifications [19]
directly to the CNN architecture, we refrain from doing so such that these methods
are model-agnostic can be used with any general classifier.
6
1.3. Contributions
1.3.2 Performing image classification by jointly embedding labels and

images
Order-preserving embeddings have shown great promise for capturing relations
between concepts and tokens in the field of natural language processing [16, 39,
38, 25]. This work explores them in the context of computer vision to solve the
task of image classification. We embed the labels (which arrange themselves as a
hierarchy) and the images in a joint embedding space. Relations between labels
and a given image can be used to predict labels and classify a given image.
In contrast to state-of-the-art approaches that use classical cross-entropy inspired
classification loss function on CNN-based feature extraction backbones for images,
we use embedding models to more explicitly represent label-label and label-image
interactions. The idea is to allow the CNN to benefit from label-hierarchy infor-
mation. In our experiments we show that a model trained with a classical cross-
entropy inspired loss function performs worse than embedding-based classifiers
that exploit the label-hierarchy.
Depending on the geometry that the parameters use and the space in which
the embeddings live, these models can be categorized into Euclidean and non-
Euclidean models.
Euclidean models
The field of natural language usually deals with modeling concepts as hierarchi-
cal structures and learning embeddings from unstructured text. Recent works
[39, 16] model them as DAGs and suggest to embed them in order to preserve
their asymmetric entailment relations. This information is usually lost if sym-
metric distance functions are used. Order-embeddings [39] propose propose an
asymmetric distance function that arranges the embedded concepts in an order-
preserving manner. A more recent approach, entailment cones [16], use a more
generalized version of the order-embeddings that are more space efficient and per-
form better. In contrast to the above approaches that have been proposed in the
context of natural language we propose to jointly embed images and their labels
and use their interactions to predict labels for unseen images.
Non-Euclidean models
Unlike the Euclidean models, non-euclidean models exploit non-zero curvature
of their geometries. Hyperbolic geometry has negative curvature and can accom-
modate tree like structures (such as DAGs) with ease in comparison to Euclidean
geometry. In hyperbolic space the volume of a ball grows exponentially with the
radius [32] unlike the polynomial growth that we are aware of in Euclidean space.
A set of works have [32, 16, 38] proposed to exploit spaces of negative curvature
to better embed concepts and create state-of-the-art models to embed hierarchies.
We use a model similar to the hyperbolic entailment cones [16] where in addition
to the labels we embed the images as well, treating the problem in a joint manner.
Generally embedding models and CNN-based classifiers are hard to compare be-
cause of the vastly different use-case and domain they are generally applied to.
7
1. Introduction
We use the embedding models as image classifiers and are able to make a fair
performance comparison between different model categories. In addition to the
image classification and joint embedding of labels and images, for the embedding
based models, we also look at the quality of the embedding of the label-hierarchy
itself. We report the performance on the ETH Entomological Collection (ETHEC)
dataset [11].
1.3.3 Contributions Summary
• We show how order-preserving embedding models, which are generally

used for NLP tasks, can be extended for computer vision tasks such as
image classification. We compare embedding-based classifiers with the label-
hierarchy injected CNN-based classifiers. Both the Euclidean and non-Euclidean
variants of embedding models are implemented and outperform the hierarchy-
agnostic baseline. This shows promise for modeling and tackling down-
stream tasks that lie at the intersection of computer vision and natural lan-
guage in a joint fashion.
• We compare a multi-label hierarchy-agnostic classifier as the baseline and 4
different methods detailed in the thesis to inject label-hierarchy knowledge
into a classifier. Each of these methods takes into account hierarchical infor-
mation at different levels of abstraction such as: the depth of hierarchy, edge
connections and sub-tree relations.
• Although CNNs and embedding-based models are based on distinct paradigms
we investigate the performance boost obtained by incorporating label-hierarchy
information. Using the ETHEC dataset presented with this work, our ex-
periments show that irrespective of the type of model being used, exploit-
ing label-hierarchy leads to better image classification performance for both
CNN-based and embedding-based classifier.
1.4 Outline
The remainder of this thesis is structured as follows:
• In chapter 1 the motivation behind the methods and the need to exploit
information from hierarchically organized labels is outlined.
• We skim over the relevant work in a similar direction as the one proposed in
this manuscript in chapter 2. It provides mathematical background for meth-
ods that this work extends for joint label-image embedding for image clas-
sification. It also contains information regarding datasets and CNN-based
feature extractors (CNN-backbones).
• In chapter 3 we discuss in detail label-hierarchy injection into CNN-based
models, probability distributions computation over the labels and finally
how the predictions are made. We first discuss the baseline that disregards
any external information than the image-label interactions. For the rest of
the models, with each model, more information regarding the hierarchy is
8
1.4. Outline
made available to the classifier. For clarity we separate out and compile the
empirical analysis for the CNN-based models in chapter 4
• In chapter 5 we sketch the details for embeddings based models both Eu-
clidean and non-Euclidean variants. The chapter also discusses label-embeddings
before jointly embeddings labels together with images. We present the em-
pirical results of the embeddings-based models in chapter 6.
• Concluding remarks and possible directions for future work are discussed
in chapter 7.
9
Chapter 2
Problem Statement & Background
2.1 Related Work

2.1.1 Embedding based models for text and language
An embedding is a mapping that maps discrete objects such as images, words or
concepts to a relatively compact representation in the form of a vector living in
low-dimensional embedding space.
For instance, words in a particular language can be represented using one-hot
encoding in an V-dimensional space where V would be the vocabulary size for that
particular language. However, this representation would contain little information
or semantic meaning due to the inherent sparsity of the one-hot encoding. In
addition to the embeddings lying in low-dimensional space, ideally, one would
want these embeddings to arrange themselves in a manner such that the objects
embedded close together represent high semantic similarity among themselves.
Traditionally, embeddings use a symmetric distance function to measure similarity
between two objects. When one tries to embed concepts that have an asymmetric
relation between them then using symmetric distance functions this detail is lost.
One needs to use an asymmetric distance function to capture this relationship.
Order-embeddings. In [39], the authors tackle embedding of a semantic hierar-

chy as a partial ordering. Their work embeds a visual-semantic hierarchy that
is anti-symmetric in nature. Instead of considering Euclidean or Manhattan dis-
tance, between two concepts as a measure of similarity the authors propose to use
an asymmetric distance function when representing a hierarchy over images and
text via embeddings. The work proposes a function that measures the presence of
a parent-concept child-concept relation if the child-concept lives within a part of
the subspace that is owned by the parent-concept. The distance metric is designed
such that it defines a sub-space where it is valid for a child-concept to lie. This
valid space is the positive orthant translated such that its origin is at the location
(coordinates) of the embedding of the particular concept.
As opposed to the distance-preserving nature (which is generally the case), the
order-preserving nature of order-embeddings ensures that anti-symmetric and
11
2. Problem Statement & Background
transitive relations can be captured well without having to rely on physical close-
ness between points. Instead, the embeddings are learned by minimizing a loss
that penalizes order violations. In [39] the authors tackle two tasks: hypernymy
prediction and image-caption retrieval. A hypernym is a pair of concepts where
the first concept is more generic or abstract than the second. For instance, (fruit,
mango) or (emotion, happiness). The hypernymy prediction task has a natural
hierarchy to the concepts, however, for the image-caption they create a two-level
hierarchy where the captions form the more abstract, upper level while the images
being more detailed form the lower level.
Euclidean cones. One major restriction of the representation and indirectly the
distance function proposed in [39] is that each concept occupies a large volume
in the embeddings space (the coordinates of each embedding own a translated
orthant irrespective of the number of descendants they have) and also suffers from
heavy orthant intersections. This ill-effect is amplified especially in extremely low
dimensions such as R2 . To ameliorate such affects, the authors in [16] propose a
generalized version of order-embeddings called the entailment cones. These are
more flexible and the region owned by a concept is not restricted to be a translated
orthant but a convex cone. The cone that is owned by a concept originates at the
location of the concept’s embedding with its apex lying at these coordinates. Any
concept that falls within the cone is considered as a sub-concept in context of
hypernymy prediction.
Hyperbolic cones. In addition to the Euclidean cones, [16] takes advantage of

non-Euclidean geometry by learning embeddings in the hyperbolic space where
the volume of a ball grows exponentially with the radius as compared to poly-
nomially in Euclidean space. This property allows one to embed directed-acyclic
graphs (DAGs); especially trees that grow exponentially with the height of the tree
(height = logbranchingFactor ( Nnodes )), quite well even in very low-dimensional space
[16].
The authors use a version of the Stochastic Gradient Descent (SGD) [4] that is for
optimizing parameters on the Riemannian manifold, the Riemannian SGD to op-
timize embeddings in non-Euclidean manifolds. In their work, they propose the
non-Euclidean entailment cones living in the hyperbolic space as well as their Eu-
clidean variant. They focus on the task of hypernymy prediction on the WordNet
hierarchy [30] by embedding a directed-acyclic graph using hyperbolic entailment
cones and use it to classify whether a pair of concepts is a hypernym pair.
Hyperbolic Neural Networks. In a more recent work [15] the authors propose
to have feed-forward neural networks to be parameterized in hyperbolic space.
This allows downstream tasks to use hyperbolic embeddings for natural language
processing (NLP) tasks in a more principled and natural fashion. They derive
hyperbolic variants of logistic regression, feed-forward neural networks and recur-
rent neural networks. These are then used to take as input hyperbolic embeddings
and are seen to perform at par or better than their Euclidean counterparts.
12
2.1. Related Work
Disk embeddings. [38] proposes a generalization of order-embeddings [39] and

entailment cones [16] for embedding DAGs with exponentially increasing nodes.
The work focuses on the task of hypernymy prediction on the WordNet hierarchy
[30] given a pair of concepts.
Other embedding methods. The work proposed in [3] maps images onto class
embeddings where pairwise dot product is used as a measure of similarity. To em-
bed the class labels they use a deterministic algorithm to compute class centroids
by using hierarchical information from WordNet [30] to guide the embeddings se-
mantically. They conjecture that semantics are complicated and are hard to learn
only from visual cues. The class embeddings are pre-computed using the hierar-
chy. The image embeddings are mapped to the fixed class embeddings using a
CNN with a combination of image classification and embedding loss. Their work
focuses on the image retrieval task. A drawback of such an approach is that the
label embeddings are fixed when training on the image embeddings. The labels
might be embedded properly however they might not be arranged in a way that
puts visually similar labels together. Fixing them when learning image embed-
dings prevents the combination of visual and semantic similarity to re-arrange the
label embeddings in a manner that is better suited.
[25] combines the idea of Hearst patterns and hyperbolic embeddings to infer is-a
relationship from text such as is-a(car, object) or is-a(Paris, city). They propose
to create a graph with the help of Hearst patterns and consequently embed it in
low-dimensional hyperbolic space. They focus on different hypernymy tasks for
text given a pair of concepts (u, v): (1) if u is a hypernym of v, (2) is u more general
than v, and (3) to what degree u is a v.
2.1.2 Embedding based models for images

Visual-semantic embeddings, proposed in [12], defines a similarity measure in-
stead of a function that classifies a given pair as positive or negative. They calcu-
late similarity scores and return the closest concept in the embedding space for
a given query. They map features for language via an LSTM (Long short-term
memory) and images via a CNN and map to the joint embeddings space through
a linear mapping and measure similarity in this space using the inner product.
They minimize a hinge-based triplet loss term and emphasize on hard-negatives
by computing the loss for the closest negative (=hard-negative) instead of sum-
ming over all negatives. They focus on the task of cross-modal retrieval: caption
retrieval given an images and image retrieval given a caption.
In one particular work, embeddings have been used for image classification [21].
The work uses order-embeddings to embed labels and images together for classi-
fication of the hierarchy. The work proposes to embed the labels first and then
uses the transitivity of the embeddings across levels in the hierarchy to implic-
itly predict the upper levels after explicitly predicting the lower-most level. We
extend this and use non-Euclidean models and also propose CNN-based models
that exploit the hierarchy in varying degrees.
In contrast to general CNNs for image classification, the work done in [14] extracts
and exploits external information in the form of unannotated text in addition to
13
the labeled images. They use a single unified models with embeddings and trans-
fer knowledge from the text-domain to a model for visual recognition. They addi-
tionally perform zero-shot classification on classes extended on top of the ones in
the ImageNet dataset [9]. The proposed work uses a combination of inner prod-
uct to measure similarity and the hinge loss. With this approach they generalize
well to unseen labels and are able to make relevant prediction even if the model
classifies an image incorrectly (compared to the ground truth) for unseen classes
from ImageNet 21K.
2.1.3 Convolutional Neural Networks based models

Kumar et al. [24] predict labels based on a tree formed from types of clothing.
They create the hierarchy on the basis of detection errors, more specifically from
the commonly confused classes by using a matrix of false positives. They use their
methodology to classify clothing types by creating a 2-level hierarchy. To account
for the hierarchy they predict conditional probability P(child|parent) as outputs
from their classifier and multiply probabilities together to make predict labels for
both the levels. They perform experiments with a 2-level hierarchy with a handful
of labels in total and we observe the issues arise when the hierarchy is extensive
and the data is scarce. A drawback of their method is the fact that the hierarchy is
formed on the basis of confusion while predicting solely based on visual cues. This
implies that the constructed hierarchy might not really have semantic similarity as
the guiding principle but rather visual similarity. Our methods, on the other hand
jointly incorporate both visual and semantic similarity to the model via injecting
information about both image-label and label-label interactions.
In work done by Chen et al. [7] they propose to predict labels for different levels
in a hierarchy. Their work is closest to ours in the sense that it tries to predict
labels for each level in the hierarchy to which the images belong. They develop a
sophisticated CNN architecture that uses a common feature extractor which then
uses separate neural networks where each specializes to predict labels for each
level. The fact that they use completely separate networks to predict labels for
each level makes the model prone to over-fitting when the dataset is small and
computationally intensive as well. They present a dataset with a 4-level hierarchy
with images of butterflies across 200 species similar to the ETHEC dataset and
construct hierarchies for existing Caltech UCSD Birds dataset [41]. They compare
performance for the final level in the hierarchy with many baseline methods but
these methods only predict labels for the the most fine-grained label category
(=the final level in the hierarchy) and not the others.
In tasks relating to fine-grained image classification, it is common to have class

labels with only a handful of images. Instead of fine-tuning models pre-trained
on all classes of large dataset like the ImageNet, the work done in [8] proposes
to select a subset of top-K labels to be used for pre-training based on domain
similarity between the source and target domains. After pre-training on a subset
of the large dataset that is visually similar, the transfer-learning then yields better
performance than models which are fine-tuned after pre-training on the entirety
of the large dataset. They propose a better method that is more efficient when
14
2.2. Background
performing pre-training. In a similar direction, [37] use hierarchy information to

perform transfer learning.
Hu et al.[19] work on the task of classifying fine-grained visual classification where
intra-class variance is large and inter-class variance is small due to the visual
similarity of objects in images. They propose a CNN architecture which tries to
learn discriminative regions in the image via attention maps. This is then used
to refine the prediction of the model by looking at it closely with the help of the
learned attention maps and draw attention to discriminative parts of the object. In
addition to this, they also propose using unsupervised image data augmentation
strategy guided by the attention maps by zooming, cropping and erasing parts of
the image in order to generalize better. The proposed model uses attention maps
to help focus on smaller details however, unlike our proposed CNN-models, it
does not use any information about the hierarchy in which the labels are arranged.
One could consider it as a complementary approach to the ones proposed in the
thesis, based only on visual cues.
There have been a set of other works [27, 27, 40, 6] that use attention based mech-
anisms to focus on discriminative regions in an image.
[28, 5] also explore the idea of exploiting external information by using part-based
attributes to help models during the learning process.
2.2 Background
2.2.1 Order-embeddings
Typically a symmetric distance is used to ascertain semantic similarity between
concepts in the embedding space. Order-embeddings [39] propose to learn a map-
ping that cares about preserving the order between objects than distance and intro-
duce the problem of partial order completion. From a set of known ordered-pairs
P and unordered-pairs N the goal is to determine if an arbitrary, unseen pair is
ordered or not.
They propose to use a reversed product order on R N due to its desirable properties.
This is defined in eq. (2.1).
N
^
y x if and only if yi ≥ xi (2.1)
i =1
The reversed order means that smaller coordinates represent a ”higher” or more
abstract position in the partial ordering.
Instead of having a hard-constraint they introduce an approximate order-embedding
to violate them as less as possible.
E( x, y) = ||max(0, x − y)|| (2.2)
L = ∑ E( f (u), f (v)) + ∑ max(0, α − E( f (u0 ), f (v0 ))) (2.3)

(u,v)∈P (u0 ,v0 )∈N
15
where, P and N represent positive and negative edges respectively in the dataset
X . α ∈ R+ is the margin. f is a function that maps a concept to it’s embedding.
E( f (u), f (v)) is the energy that defines the severity of the order-violation for a
given pair (u, v) and is given by eq. (2.2).
According to the energy E( x, y) = 0 ⇐⇒ y x. For positive pairs where y
is-a x, one would like embeddings such that E( x, y) = 0. a is-a b implies that
a is a sub-concept of b or equivalently b is more abstract than a and that is its
generalization.
2.2.2 Euclidean Cones

Euclidean cones [16] are a generalization of order-embeddings [39]. For each vec-
tor x in R N , the aperture of the cone (with its apex located at this point) is based
solely on the Euclidean norm of the vector, || x ||, [16] and is given by ψ( x ) in
eq. (2.4). One of the properties of these cones is that a cone can have a maximum
aperture of π/2 [16]. In addition to this, to ensure continuity and transitivity, the
aperture should be a smooth, non-increasing function. To satisfy properties men-
tioned in [16], the domain of the aperture function has to be restricted to (ε, 1] for
some ε. ε = f (K ) where K is a hyper-parameter.

K
ψ( x ) = arcsin (2.4)
|| x ||
Ξ( x, y) computes the minimum angle between the axis of the cone at x and the vec-
tor y. E( x, y) measures the cone-violation which is the minimum angle required
to rotate the axis of the cone at x to bring y into the cone.
||y||2 − || x ||2 − || x − y||2

Ξ( x, y) = arccos (2.5)
2 || x || || x − y||
E( x, y) = max(0, Ξ( x, y) − ψ( x )) (2.6)
2.2.3 Hyperbolic Cones

The Poincaré ball is defined by the manifold Dn = { x ∈ Rn : || x || < 1}. The
distance between two points x, y ∈ Dn is given by [16]:
|| x − y||2

dD ( x, y) = arccosh 1 + 2 (2.7)
(1 − || x ||2 )(1 − ||y||2 )
and the Poincaré norm is defined as [16]:
|| x ||D = dD (0, x ) = 2 arctanh(|| x ||) (2.8)
Angles in hyperbolic space is the angle between the initial tangents of the geodesics.
The angle between two tangent vectors u, v ∈ Tx Dn is given by cos(∠(u, v)) =
hu,vi
||u|| ||v||
[16].
16
2.2. Background
One needs to replace the cosine law and the exponential map to obtain the hy-
perbolic formulation for the cones in hyperbolic space [16]. ||.|| represents the
Euclidean norm, h., .i represents the Euclidean scalar-product and the unit vector
x̂ = x/|| x ||.
The aperture of the cone is given by ψ( x ).
1 − || x ||2

ψ( x ) = arcsin K (2.9)
|| x ||
Ξ( x, y) computes the minimum angle between the axis of the cone at x and the vec-
tor y. E( x, y) measures the cone-violation which is the minimum angle required
to rotate the axis of the cone at x to bring y into the cone.
h x, yi(1 + || x ||2 ) − || x ||2 (1 + ||y||2 )

Ξ( x, y) = arccos p (2.10)
|| x || || x − y|| 1 + || x ||2 ||y||2 − 2h x, yi
E( x, y) = max(0, Ξ( x, y) − ψ( x )) (2.11)
2.2.4 Optimization in Hyperbolic Space

In order to perform optimization, one cannot simply use the Euclidean gradients.
For a given parameter u one generally performs the following usual Euclidean
gradient update:
u ← u − η ∇u L (2.12)
Instead, for parameters living in hyperbolic space, one should compute the Rie-
mannian gradient and update it using the gradient direction in the tangent space
and move u along the corresponding geodesic in the hyperbolic space with the
following update rule [16] (Riemannian Stochastic Gradient Descent):
u ← expu (η ∇uR L ) (2.13)
where, ∇uR L is the Riemannian gradient for parameter u and is computed by re

scaling the Euclidean gradient by the inverse of the metric tensor given by:
∇uR L = (1/λu )2 ∇u L (2.14)
λu is different for each parameter and is computed as λu = 2/(1 − ||u||2 ) [16].

The exponential-map at a point x, expx (v) : Tx Dn → Dn , maps a point v in the
tangent space to the hyperbolic space and is defined as [16]:
17
λ x (cosh(λ x ||v||) + h x, v̂i sinh(λ x ||v||))

expx (v) = x
1 + (λ x − 1)cosh(λ x ||v||) + λ x h x, v̂i sinh(λ x ||v||)
(1/||v||) sinh(λ x ||v||)
+v (2.15)
1 + (λ x − 1)cosh(λ x ||v||) + λ x h x, v̂i sinh(λ x ||v||)
2.3 Datasets
2.3.1 Hierarchical CIFAR-10
We also perform experiments with the CIFAR-10 dataset [22]. There are 10 classes
with 6000 32x32 images per class. In total, the dataset has 50000 images for training
and 10000 for testing. To be consistent with our experiments, we use a 80% −
10% − 10% split for training, validation and testing respectively. All fine-tuning
is performed on the validation set, the test set is only used to report the model’s
performance.
The original dataset does not have a label hierarchy associated. Instead each image
has a single ground truth label. Additional labels are added to introduce a 3-level
hierarchy. Each image now is associated with 3 labels. The original labels are the
leaves of this hierarchy. The root of the hierarchy is entity, the first level of hier-
archy splits into (living, non-living). living entities are divided between (mammal,
non-mammal). non-living entities are divided between (vehicle, craft). The original
classes are {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck}. mammal
is a parent of (cat, deer, dog, horse) and non-mammal of (bird, frog). vehicle is a parent
of (automobile, truck) while craft is a parent of (airplane, ship).
2.3.2 Hierarchical Fashion MNIST

Fashion MNIST [42] is a dataset similar to MNIST where instead of hand-written
digits it consists of 10 classes of clothing images. The dataset has 60,0000 training
and 10,000 test samples. We split the training set such that 50,000 are used for
training while the remaining 10,000 are used for validation. Each image is a 28x28
gray-scale image.
As in the case of CIFAR-10, here too, a 2-level hierarchy is introduced. The root
of the hierarchy is fashion-wear. The first level consists of top-wear, bottom-wear
and accessories and footwear. top-wear has t-shirt, pullover, dress, coat and shirt as
the descendants. bottom-wear and accessories has trousers and bag as descendants.
footwear has sandal, sneaker, ankle-boot as descendants.
2.3.3 ETH Entomological Collection

ETH Library’s IMAGO project
For experiments, we use data provided by ETH Entomological Collection abbrevi-
ated as ETHEC. The dataset, associated with ETH Library’s IMAGO project, is an
extensive collection of Lepidoptera specimens that have been carefully curated and
digitized with accurate metadata.
18
2.3. Datasets
Figure 2.1: Hierarchy of labels from the ETHEC (Merged) dataset. It consists of
labels arranged across 4 levels: family (blue), sub-family (aqua), genus (brown)
and species. This visualisation depicts the first 3 levels. The name of the family is
displayed next to its sub-tree.
There exists metadata from 197052 specimen samples with all samples having
labels spread across various hierarchical levels. For our experiments we make use
of 4 such levels: 25 unique families, 91 unique subfamilies, 842 unique genera and
2429 unique specific epithets labels. The average branching factors are 25, 3.64, 9.25
and 2.88 for the respective levels. The label hierarchy has 3537 edges and 3387
nodes.
In the hierarchy, the maximum number of descendants belong to the family Noc-
tuidae 19, to the subfamily Noctuinae 155 and to the genus Eupithecia 79. Maximum
specimens belong to Geometridae 48635 (family), Noctuinae 29555 (subfamily), Zy-
gaena 17243 (genus) and filipendulae 2456 (specific epithet).
Since this data is much larger as compared to other datasets discussed in this work,
to better understand the data, we visualize the dataset using as an interactive
graph using JavaScript. The visualization has basic functionality to view relations
between nodes, the number of samples per label and the hierarchy level for a
particular label. The nodes have size proportional to the order of magnitude of the
number of samples for that label and are also color coded based on their hierarchy
19
level. We visualize a subset of the complete hierarchy of the ETHEC dataset in

fig. 2.1. More specifically, it is the subset that is used for our experiments.
ETHEC Merged (ETHEC dataset)

According to the way the nomenclature is defined, the specific epithet (species)
name associated with a specimen may not be unique. For instance, two samples
with the following set of labels, (Pieridae, Coliadinae, Colias, staudingeri) and (Ly-
caenidae, Polyommatinae, Cupido, staudingeri) have the same specific epithet but differ
in all the other label levels - family, subfamily and genus. However, the combination
of the genus and specific epithet is unique. To ensure that the hierarchy is a tree
structure and each node has a unique parent, we define a version of the database
where there is a 4-level hierarchy - family (6), subfamily (21), genus (135) and genus
+ specific epithet (561) with a total of 723 labels. We call this version of the ETHEC
dataset as ETHEC Merged dataset. We decide to keep the genus level as according
to experts in the field, information about genera helps distinguish among samples
and result in a better performing model. For our experiments we use the merged
version of the dataset to ensure that the hierarchy is a tree. The first 3 levels of the
hierarchy are visualized in fig. 2.1.
ETHEC Small dataset

In order to allow for debugging and checking algorithms, we additionally use a
smaller subset of the original ETHEC dataset, called the ETHEC Small dataset.
table 2.1 enumerates all labels across the 4 levels in the hierarchy for this the
ETHEC Small dataset.
2.4 CNN-backbones
We use convolutional neural networks to extract visual features from the images
to perform classification. The CNN-based models are optimized using SGD [4]
with a learning rate of 0.01 for 100 epochs and a batch-size of 64 unless specified
otherwise.
2.4.1 AlexNet
AlexNet [23] proposed in 2012 shot to fame after exceptional performance on the
ImageNet [9] challenge. It consists of 8 layers in total: the first 5 being convo-
lutional layers and the remaining 3 being fully-connected layers. The original
architecture outputs logits for 1000 class labels from the ImageNet challenge [9].
2.4.2 VGG
VGGNet [36] comprises of 16 convolutional layers and with 138 million parame-
ters, is much larger than AlexNet [23]. They propose to use smaller filters (3 x 3) as
opposed to larger filter size in previous CNNs. This reduces the effective number
of parameters to achieve the same receptive field and in addition also incorporates
more than one non-linearity.
20
2.4. CNN-backbones
Level Label name

Family Hesperiidae
Riodinidae
Lycaenidae
Papilionidae
Pieridae
Subfamily Hesperiinae
Pyrginae
Nemeobiinae
Polyommatinae
Parnassiinae
Pierinae
Genus Ochlodes
Hesperia
Pyrgus
Spialia
Hamearis
Polycaena
Agriades
Parnassius
Aporia
Genus + species Ochlodes venata
Hesperia comma
Pyrgus alveus
Spialia sertorius
Hamearis lucina
Polycaena tamerlana
Agriades lehanus
Parnassius jacquemonti
Aporia crataegi
Aporia procris
Aporia potanini
Aporia nabellica
Table 2.1: ETHEC Small dataset, a subset of the ETHEC dataset.
21
2.4.3 ResNet
ResNet [18] showed that increasing depth improves network performance. They
introduce skip-connections between groups of layers allowing the model to learn
identity functions thus ensuring that the performance is as good as that of a shal-
lower network. This facilitates better convergence rates than plain networks. The
skip connections or shortcut connections do not increase the number of parame-
ters in comparison to the original network (without skip connections). Even with
the remarkable increase in depth ResNet-152 (152 layers) has fewer parameters
than the VGG-16/19 [36]. ResNets (pre-trained on the ImageNet dataset) are a
popular choice as feature extractors for image related tasks.
22
Chapter 3
Methods: Injecting label-hierarchy into

CNN classifiers
In this chapter we propose CNN-based models that use a convolutional layers to

extract visual features and classify images. The architecture of the CNN in itself
is not modified, rather the work focuses more on how different probability dis-
tribution formulations (across the labels) can be used to incrementally pass more
information to the model about the label hierarchy. The chapter describes in detail
5 models where the first model is a baseline that is agnostic to any information
from the label-hierarchy whatsoever. The remaining 4 models gradually make
more information available to the model about such as the number of levels in the
hierarchy and the edges between different labels.
3.1 Hierarchy-agnostic classifier
Figure 3.1: Model schematic for the hierarchy-agnostic classifier. The model is a
multi-label classifier and does not utilize any information about the presence of
an explicit hierarchy in the labels.
As a baseline, we use a state-of-the-art convolutional neural network (CNN) image

classification. For this, we use the residual network models proposed by [18].
The baseline is agnostic to any information available in the form of the label hier-
archy present in the dataset. In other words, it treats class labels from different
23
3. Methods: Injecting label-hierarchy into CNN classifiers
levels in an unrelated manner with only the image being available for the model
to predict a label for each level in the hierarchy. Labels across levels do not hold
any special meaning and are treated indifferently.
The model performs Ntotal -way classification. Ntotal = ∑iL=1 Ni represents labels
across all L levels and Ni are the number of distinct labels on the i-th level. It uses
the one-versus-rest strategy for each of the Ntotal labels.
Ntotal
1 1
L ( x, y) = −
Ntotal
∗ ∑ y j ∗ log
(1 + exp(− x j ))
j =1
exp(− x j )

+ (1 − y j ) ∗ log (3.1)
(1 + exp(− x j ))
where, x ∈ R Ntotal , y ∈ {0, 1} Ntotal and y T y = L.

F (I) = x, where x are the logits (normalized and interpreted as a probability
distribution) from the last layer of a model F which takes as input image I .
3.1.1 Per-class decision boundary (PCDB)

Since, each image would be associated with more than one label we would apply
a multi-label approach where multiple predictions for a single image are valid.
For each class, the threshold is tuned based on the micro-F1 performance on the
validation set [44].
During evaluation time for the L classes, label L j is assigned to an image if x [ j] >
θ j , ∀ j. θ j is tuned separately for label j and is set to the value that maximizes
F1-score performance on the validation set for that class.
3.1.2 One-fits-all decision boundary (OFADB)

In this variant, instead of having a different decision boundary for each class only
a single decision boundary is used. To tune for a single threshold across classes,
predicted scores and the ground truth labels across classes are used together to
find the threshold that maximizes the micro-F1 score on the validation set.
During evaluation time for the L classes, label L j is assigned to an image if x [ j] >
θ, ∀ j. Instead of having a per-class θ j this version of the model uses a global
threshold θ. This is set to the value that maximizes F1-score performance on the
entire validation set across all the L classes.
3.2 Per-level classifier

For the next method more information from the label hierarchy would be made
available to the network. Instead of a single Ntotal -way classifier as in the case of
the hierarchy-agnostic model, we replace it with L Ni -way classifiers where each
of the L classifiers handles all the Ni labels present in level Li . For this task we use
a multi-label soft-margin loss.
24
3.3. Marginalization (bottom up)
Figure 3.2: Model schematic for the per-level classifier (=L Ni -way classifiers). The
model use information about the label-hierarchy by explicitly predicting a single
label per level for a given image.
L
L ( x, τ ) = ∑ Li (xi , τi ) (3.2)
i =1
  !
Ni
exp( xi [τi ])
Li ( xi , τi ) = − log  N
 = − xi [τi ] + log ∑ exp(xi [ j]) (3.3)
∑ j=i 1 exp( xi [ j]) j =1
where, τi is the true label for the i-th level. xi ∈ R Ni , τ ∈ I+

L.
F (I) = x where, x are the logits from the last layer of a model F which takes
as input image I . xi is a continuous sub-sequence of the predicted logits x, i.e.
xi = ( xi [ Ni−1 + 1], xi [ Ni−1 + 2], ..., xi [ Ni−1 + Ni ]).
3.3 Marginalization (bottom up)

In the baseline, the hierarchy-agnostic classifier, the model disregards the existence
of a hierarchy in the labels. In the per-level classifier, the fact that each sample
has exactly L labels (due to the L-level hierarchy) is built into the design of the
model by using L such separate multi-class classifiers. The per-level classifier is
still unaware of how these levels are ordered and is indifferent to the relation
present between nodes from different levels. With the Marginalization method
the information about the parents of each node is made available to the model.
The L-classifiers are replaced by a single classifier that outputs a probability dis-
tribution over the final level in the hierarchy. Instead of having classifiers for the
remaining L − 1 levels, we compute the probability distribution over each one of
these by summing the probability of the children nodes. Although, the network
does not explicitly predict these scores, the models is still penalized for incorrect
predictions across the L-level hierarchy using the cross-entropy loss.
L L
L ( x, τ ) = ∑ Li (xi , τi ) = − ∑ log ( pi [τi ]) (3.4)
i =1 i =1
25
Figure 3.3: Model schematic for the Marginalization method. Instead of predicting
a label per level, the model outputs a probability distribution over the leaves of
the hierarchy. Probability for non-leaf nodes is determined by marginalizing over
the direct descendants. The Marginalization method models how different nodes
are connected among each other in addition to the fact that there are L levels in
the label-hierarchy.

L.
F (I) = x where, x are the logits from the last layer of a model F which takes as
input image I .
This is the same loss from eq. (3.3) however, the manner in which each xi is com-
puted is different. The difference is that here the model predicts a probability
distribution only over the leaf labels. To obtain a probability of a label that is a
non-leaf label, the probabilities of the direct children are summed over and this
marginalization results in the probability of the parent label. This way a valid
probability distribution is obtained for each level in the hierarchy.
∑
j
pi [ j] = P(vi |I) = P(c|I), ∀i ∈ 1, 2, ..., ( L − 1) (3.5)
j
c∈childrenOf(vi )
j
where, vi is the j-th vertex (node) in the i-th level.
All but the last level use eq. (3.5) to compute the probabilities for their labels.
!
j exp( x j )
p L [ j] = P(v L |I) = N
(3.6)
∑k=L 1 exp( xk )
For the final level, we compute the probability distribution over the leaf nodes by
directly using the logits output from the model, F . This computation is indicated
in eq. (3.6) using softmax. Once p L is determined, p L−1 can be calculated. For this
reason we compute the probabilities for the complete hierarchy in a bottom up
fashion: starting from the bottom-most layer and moving to the upper levels.
26
3.4. Masked Per-level classifier
3.4 Masked Per-level classifier
On the upper levels of the hierarchy one has more data per label and fewer la-
bels to choose from. Naturally, this makes classifying relatively accurate closer to
the root of the hierarchy. This model exploits knowledge about the parent-child
relationship between nodes in a top down manner.
Figure 3.4: Model schematic for the Masked Per-level classifier. The model is
trained exactly like the L Ni -way classifier. While predicting, one assumes the
model performs better for upper levels than lower levels. Keeping this in mind,
when predicting a label for a lower level, the model’s prediction for the level above
is used to mask all infeasible descendant nodes, assuming the model predicts cor-
rectly for the level above. This results in competition only among the descendants
of the predicted label in the level above.
Unlike Marginalization (bottom up), here, we have L-classifiers, one for each hier-
archical level. For the first level, the model predicts the class with the highest score
among the logits. For consequent level li , the information about the models belief
i.e. it’s prediction for the li−1 level is leveraged. Instead of naively predicting the
label with the highest score for level li (comparing among all possible logits), all
nodes except the children of the predicted label for level li−1 are masked. This
translates to computing the loss over a subset of the original nodes in level li .
With the availability of the parent-child relationship and assuming that the model
predicts correctly the parent label (on level li−1 , the only possible labels are the
children of this predicted parent. As mentioned earlier, classification in the upper
levels is more accurate and since we perform this in a top down fashion, this is a
reasonable assumption. Another work has shown this to be the case [21]. For the
last L-1 levels, only a subset of the logits (formed by the children of the predicted
parent) are compared against each other, ignoring the rest.
While training, the loss is computed over the children of the parent conforming
to the ground truth. Even if the model predicts the parent incorrectly, we still use
the ground truth to penalize its prediction for the children.
For data with unknown ground truth i.e. during evaluation, the model uses the
predictions from level li−1 to make infer about level li by masking nodes that
correspond to labels that are not possible.
27
L
L ( x, τ ) = ∑ Li (xi , τi ) (3.7)
i =1
! !
exp( xi [τi ])
L ( xi , τi ) = − log
∑ j∈C exp( xi [ j])
= − xi [τi ] + log ∑ exp(xi [ j]) (3.8)
j∈C

L . C = childrenOf( v i−1 ). τ
i −1
j τ
vi is the j-th vertex (node) in the i-th level and consequently, vi−i−11 is the node
corresponding to the ground-truth on level (i − 1).
F (I) = x where, x are the logits from the last layer of a model F which takes
as input image I . xi is a continuous sub-sequence of the predicted logits x, i.e.
xi = ( xi [ Ni−1 + 1], xi [ Ni−1 + 2], ..., xi [ Ni−1 + Ni ]).
3.5 Hierarchical Softmax

For this the model predicts logits for every node in the hierarchy. The logits are
predicted by dedicated linear layers for each group of siblings and consequently
there is a separate probability distribution over each one of these groups. This
j ji−1 ji
is probability conditioned on the parent node i.e. p(vii |vi− 1 ), ∀ vi ∈ C, such that
i −1 j
C = childrenOf(vi− 1 ).
In the context of natural language processing something similar has been dis-
cussed in previous work [31, 29]. But their main goal is to reduce the computa-
tional complexity over very large vocabularies. In the context of computer vision
this is relatively unexplored and we propose to decompose the probability distri-
bution and predict conditional distributions for each set of direct descendants in
the hierarchy, in order to exploit the label-hierarchy and boost performance.
 
exp( x ji −1 [ ji ])
j ji−1 vi −
p ( vi i | vi − 1  ∀v ji ∈ C, x ∈ R| C |
1) = j (3.9)

i
∑k∈C exp( x
i −1
ji −1 [ k ]) vi − 1
v i −1
The vector x j i −1 represents the logits that exclusively correspond to all the children
vi − 1
j
i −1
of node vi− 1 . With this in place, for the set of children of a give node, a conditional
probability distribution is output by the model F .
F (I) = p(·) where, p(·) is the conditional probability for every child node given
j ji−1
the parent, p(vii |vi− 1 ). F takes as input image I .
In order to calculate the joint distribution over the leaves, probabilities along the
path from the root to each leaf are multiplied.
j j j L −1) j j j j j j L −1)
p(v11 , v22 , ..., v((L−1) , v LL ) = p(v11 ) p(v22 |v11 )...p(v LL |v((L−1) ) (3.10)
28
3.5. Hierarchical Softmax
j ( i +1) j
where, vii is the parent node of vi+ 1 . The nodes belonging to the i-th level and
the (i+1)-st level respectively.
The cross-entropy loss is directly computed only over the leaves but since the dis-
tribution over the leaves implicitly uses the internal nodes for calculation, all levels
are optimized over indirectly and the performance gradually improves (across all
levels).
j L −1)

j j
L ( x, τ ) = − log p(v11 , v22 , ..., v((L−1) , vτLL ) = − log p(v1τ1 , v2τ2 , ..., v( LL−−11) , vτLL )
τ
(3.11)
L.
eq. (3.11) can be re-written as eq. (3.12) because when τL is known the path to the
root is unique and the remaining τi , ∀i ∈ 1, 2, ..., ( L − 1) are determined.

L ( x, τ ) = − log p(v1τ1 , v2τ2 , ..., v( LL−−11) , vτLL )
τ
(3.12)
29
Chapter 4
Empirical Analysis: Injecting

label-hierarchy into CNN classifiers
In this Chapter, we describe the numerical experiments used to evaluate our meth-
ods that help classifiers exploit label-hierarchies. Before going into the experi-
mental details, we discuss the choice of performance metrics to compare across
different models.
4.1 Performance metrics

In order to quantify the performance we use micro and macro averaged scores. Al-
though micro scores take contributions in proportion to the size of the class they
end up overshadowing classes that occur less frequently. Such patterns are very
much part of dataset with hierarchical labels as class higher up the hierarchy ab-
stract their descendants and have more samples as compared to classes below with
the leaves of the hierarchy having the least number of samples. The macro scores
in contrast take an un-weighted average over the scores computed individually
for all classes.
Consider a dataset as shown in table 4.1. When using a classifier for each level
in the hierarchy, the classifier prefers to blindly predict the majority label to boost
its micro score. By always predicting Hesperiidae, Pyrginae and Pyrgus alveus
it obtains a micro-averaged precision, recall and F1-score of (0.5, 0.5, 0.5). This
type of behavior is undesirable. However, the macro-averaged scores are (0.1364,
0.2727, 0.1724) which reflect the poor performance of the classifier.
To get better insight about where the model under-performs micro and macro
averaged scores are also computer per level in the hierarchy.
True positive rate True positive rate (TPR) is the fraction of actual positives
predicted correctly by the method.
tp
TPR = (4.1)
totalPositives
31
4. Empirical Analysis: Injecting label-hierarchy into CNN classifiers
Label name Level Frequency

Hesperiidae Family 76
Riodinidae Family 16
Hesperiinae Subfamily 36
Pyrginae Subfamily 40
Nemeobiinae Subfamily 16
Ochlodes venata Genus + Species 18
Hesperia comma Genus + Species 18
Pyrgus alveus Genus + Species 22
Spialia sertorius Genus + Species 18
Hamearis lucina Genus + Species 14
Polycaena tamerlana Genus + Species 2
Table 4.1: A subset of the ETHEC dataset to demonstrate the pros and cons of
using macro and micro scoring.
True negative rate True negative rate (TNR) is the fraction of actual negatives
predicted correctly by the method.
tn
TNR = (4.2)
totalNegatives
Precision Precision computes what fraction of the labels predicted true by the
model are actually true.
tp
P= (4.3)
tp + f p
Recall Recall computes what fraction of the true labels were predicted as true.
tp
R= (4.4)
tp + f n
F1-score
2∗P∗R
F1 = (4.5)
P+R
Hit@k
N
1
∑ 1[labeli
gt
Hit@K = ∈ SortedPredictions(i )] (4.6)
N i =1
pred pred pred pred
where, SortedPredictions(i ) = {label0 , label1 , ..., labelk−1 , labelk } is the set
of the top-K predictions for the i-th data sample.
Macro-averaged score A macro-averaged score for a metric is calculated by av-

eraging the metric across all labels.
N
1
M-metric =
N ∑ metric(labeli ) (4.7)
i =1
32
4.2. Hierarchical CIFAR-10
Micro-averaged score A micro-averaged score for a metric is calculated by ac-

cumulating contributions (to the performance metric) across all labels and these
accumulated contributions are used to calculate the micro score.
4.2 Hierarchical CIFAR-10

To see the details of the hierarchy creation please refer to section 2.3.
4.2.1 Per-level classifier

Influence of training set size on performance
To test the influence of dataset size to performance, we run experiments by chang-
ing the size of the train set. We randomly pick 3 differently sized subsets of the
dataset to see the effect on the classification performance and as a sanity check of
the implementation.
We choose 3 configurations of the training set (all samples, 1000 samples, 100 samples)
and train for 100 epochs. We keep the validation and test set same through out
as discussed in section 2.3. Refer to table 4.2 for performance comparison. The
numbers are reported on the unseen test set.
m-P m-R m-F1 M-P M-R M-F1

ResNet-50 (update all weights)
100 samples 0.6129 0.6129 0.6129 0.5336 0.4924 0.4724
1000 samples 0.7947 0.7947 0.7947 0.7217 0.7234 0.7190
all samples 0.9358 0.9358 0.9358 0.9124 0.9101 0.9108
Table 4.2: Performance metrics for Per-level classifier on the Hierarchical CIFAR-10
data when varying the amount of training data. The models used in this experi-
ment are pre-trained on the 1000-class ImageNet data set. All weights are updated
with a learning rate of 0.01 and input spatial dimensions are 224x224. P, R and
F1 represent Precision, Recall and F1-score. Metrics prefixed with m are micro-
averaged while the ones with M are macro-averaged. The top performing models
are in bold-face.
4.2.2 Hierarchy-agnostic classifier

In table 4.3 we summarize the results of hierarchy-agnostic classifier for the Hier-
archical CIFAR-10 dataset. We show performance for different CNN models for
feature extraction. In addition to that, models have either only their last layer fine-
tuned (keeping the rest of the weights fixed) or all the weights in the model are
updated.
33
m-Precision m-Recall m-F1 M-F1

Alexnet
Per-class decision boundary 0.7311 0.7863 0.7577 0.6814
One-fits-all decision boundary 0.7687 0.7564 0.7625 0.6708
Alexnet (update all weights)
VGG
VGG (update all weights)
ResNet-18
ResNet-50
Table 4.3: Performance metrics for the hierarchy-agnostic classifier on the Hierar-
chical CIFAR-10 data. The models used in this experiment are pre-trained on the
1000-class ImageNet data set. For these experiments, only the last layer is fine-
tuned (unless mentioned otherwise), fixing the rest of the weights with a learning
rate of 0.01 and input spatial dimensions of 224x224 for 100 epochs. Metrics pre-
fixed with m are micro-averaged while the ones with M are macro-averaged. The
top performing models are in bold-face.
34
4.3. Hierarchical Fashion MNIST
4.3 Hierarchical Fashion MNIST

To see the details of the hierarchy creation please refer to section 2.3.

As for the Hierarchical CIFAR-10, we also use the same hierarchy-agnostic model
on the Hierarchical Fashion MNIST dataset with results tabulated in table 4.4. In
both cases the ResNet [18] backbone seems to be the best performing for micro
and macro averaged F1-score.
m-Precision m-Recall m-F1 M-F1

Alexnet
Alexnet (update all weights)
VGG
VGG (update all weights)
ResNet-18
ResNet-50
Table 4.4: Performance metrics for the hierarchy-agnostic classifier on the Hierar-
chical FMNIST data. The models used in this experiment are pre-trained on the
1000-class ImageNet data set. For these experiments, only the last layer and the
first layer is fine-tuned (unless mentioned otherwise), fixing the rest of the weights
with a learning rate of 0.01 and input spatial dimensions of 224x224. Metrics pre-
35
4.4 ETHEC Dataset

This sections describes the experiments and their results on the ETHEC dataset
[11]. In order to keep the section compact we perform experiments with best
performing CNN backbones on the Hierarchical CIFAR-10 and FMNIST which is
the ResNet-50.
The presence of a hierarchy introduces imbalance in the number of samples across
labels in different levels of the hierarchy. In addition to this kind of imbalance that
the ETHEC dataset also suffers from different number of samples for labels in the
same level, unlike CIFAR-10 and FMNIST.
To tackle such imbalance in the ETHEC dataset, experiments are performed where
the dataset is re-sampled and the term corresponding to every class in the loss
function is re-weighed.
Re-sampling is performed proportional to the inverse frequency of the occurrence
of a label in the dataset which are then normalize to sum to 1 to sample using a
multinomial distribution. For more details please refer to PyTorch documentation
[33]. Similarly, the loss is re-weighted using the per-class inverse frequencies as
the to scale the corresponding terms contributed by data belonging to a particular
class.

Before, running the experiments on the complete dataset, a small subset of the
original dataset was created ETHEC Merged Small. This dataset was used to
debug the learning framework and ensure that we can learn to distinguish when
there are a handful of classes in the 4-level structure. The labels used at different
levels can be found in table 2.1.
The hierarchy-agnostic classifier disregards the hierarchy in the dataset and treats
all levels in the same manner. This classifier model essentially flattens the hierar-
chical structure and performs multi-label classification to predict the correct labels
given an image of a specimen. For every label, the model predicts whether the
image is a member or not for that class. There is no restriction on the number
of predictions that the model should make; as information about the number of
levels (in the hierarchy) is unavailable to the model.
Per-class decision boundary (PCDB) models

The ill-effects of such free rein are reflected in table 4.5. Models with a high
average number of predictions, especially the per-class decision boundary (PCDB)
models, have high recall as they predict a lot more than just 4 labels for a given
image. Predicting the image’s membership in a lot of classes improves the chances
of predicting the correct label but at the cost of a large number of false positives.
The models have high recall and that is due to them predicting a lot of false
positives as the precision is on the other extreme of the spectrum and generally
very poor in comparison to the recall scores. The (min, max), µ ± σ column clearly
shows the reckless behavior of the model predicting a maximum of 718 labels
36
4.4. ETHEC Dataset
cw rs m-P m-R m-F1 M-P M-R M-F1 (min, max), µ ± σ

ResNet-50 - Per-class decision boundary
7 7 0.0355 0.7232 0.0677 0.3066 0.4053 0.2195 (3, 351), 81.42 ± 69.51
7 3 0.7159 0.7543 0.7346 0.4402 0.4362 0.3718 (0, 13), 4.21 ± 2.07
3 7 0.0077 0.8702 0.0153 0.0120 0.8397 0.0183 (84, 718), 451.14 ± 136.69
3 3 0.0081 0.7519 0.0161 0.0105 0.5909 0.0165 (33, 714), 369.96 ± 120.55
ResNet-50 - One-fits-all decision boundary
7 7 0.9324 0.7235 0.8147 0.1913 0.1462 0.1568 (0, 7), 3.10 ± 1.16
7 3 0.9500 0.6564 0.7763 0.1078 0.0947 0.0959 (0, 5), 2.76 ± 0.60
3 7 0.2488 0.2960 0.2704 0.0021 0.0067 0.0030 (4, 9), 4.76 ± 0.76
3 3 0.1966 0.3800 0.2591 0.0027 0.0110 0.0037 (4, 10), 7.73 ± 0.61
Table 4.5: Performance metrics for the hierarchy-agnostic classifier on the ETHEC
Merged dataset. The models used in this experiment are pre-trained on the 1000-
class ImageNet data set. All weights are updated with a learning rate of 0.01, a
batch-size of 64 and input spatial dimensions are 224x224 for 100 epochs. P, R
and F1 represent Precision, Recall and F1-score; cw and rs represent class weight
and re-sampling. Metrics prefixed with m are micro-averaged while the ones with
M are macro-averaged. The top performing models are in bold-face. Since, the
model can predict any number of labels (between 0 and Ntotal ), the table includes
the minimum and the maximum number of labels predicted (min, max) as well as
the number of labels predicted on average µ ± σ. These statistics, like the rest, are
calculated for samples in the test set.
for one such sample and 451.14 ± 136.69 on average for the worst performing
multi-label model in our experiments.
One-fits-all decision boundary (OFADB) models

On the other hand, the one-fits-all decision boundary (OFADB) models have a
high micro scores but the macro scores are worse than those of the PCDB. Since,
models with a better micro-F1 score are preferred, and they all share a common
decision boundary for every label, the model sets this decision boundary in such
a way that labels with a large number of samples are predicted correctly because
that boosts the micro scores. Due to a commonly shared global threshold the
models are more conservative in exercising the free rein as is evident from the
(min, max ), µ ± σ column.
The one-fits-all decision boundary (OFADB) performs better than the same model
with per-class decision boundaries (PCDB). We believe that the OFADB prevents
over-fitting, especially in the case when many labels have very few data samples
to learn from, which is the case for the ETHEC database. Here too, the nature
of the multi-label setting allows the model to predict as many labels as it wants
however, there is a marked difference between the (min, max), µ ± σ statistics
when comparing between the OFADB and PCDB. The best performing OFADB
model predicts 3.10 ± 1.16 labels on average. This is close to the correct number
of labels per specimen which is equal to the 4 levels in the label hierarchy.
37
Loss reweighing and Data re-sampling

Both data re-sampling and loss re-weighing remedy imbalance across different
labels but via different paradigms. Instead of modifying what the model sees
during training, reweighing the loss instead penalizes different data points differ-
ently. We choose to use the inverse-frequency of the label as weights that scale
loss corresponding to the data point belonging to a particular label.
re-sampling involves by presenting the model a modified training set. This is
done by choosing to show some samples multiple times while omitting others by
over-sampling and under-sampling. Since, we wish to prevent the model from
being biased by the population of data belonging to a particular label, ideally, we
would like to present equal number of samples for each label in the dataset. We
perform re-sampling based on the inverse-frequency of a label in the train set. If
a label has an exceedingly large number of samples (more than the average) then
it would be under-sampled while data belonging to less frequent labels would be
over-sampled.
In the context of boosting, [35] show that empirically, re-sampling is a better strat-
egy than reweighing when the dataset consists of over-represented and under-
represented classes. In our experiments as well, re-sampling significantly outper-
forms loss reweighing.
Performance breakdown
From fig. 4.1 one can see the relation between number of training samples for a
particular label and it’s F1 performance. The 5 out of the 6 families have more
than 1,000 training data and the model performs well for these classes. However,
the importance of a larger dataset is visible in the performance versus training
data size plots in fig. 4.1b, fig. 4.1c and fig. 4.1d. There seems to be a “S-shaped”
or sigmoid like trend in performance as the number of training data increases.
Labels, across the 4 levels, that have more than 1,000 training samples have very
high F1 scores while labels that have less than 100 training points are on the lower
end of the spectrum.
The micro-F1 performance for the levels: family and subfamily is comparable to the
best performing multi-level classifier (see table 4.8). But moving down to the lower
levels in the hierarchy the performance is much worse than the best performing
multi-level classifier as detailed in section 4.4.2.
38
4.4. ETHEC Dataset
1 2
0 2 0
1.0 10 103 104 1.0 1.0 101 102 103 1041.0
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6

Score
Score
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
0.0 0.0 0.0 0.0

102 103 104 0 1 101 102 103 104 0.0 2.5
Training data size Training data size
(a) family (b) subfamily
10
50
0 0
1.0 10
0 101 102 103 1.0 1.0 10
0 101 102 103 1.0
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6

Score
Score
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
0.0 0.0 0.0 0.0

100 101 102 103 0 50 100 101 102 103 0 250 500
(c) genus (d) genus + specific epithet
Figure 4.1: Per-label F1 performance across levels plotted against number of training
samples for hierarchy-agnostic classifier using ResNet-50 (cw: 7, rs: 7). For each panel, a
point represents a label with its position indicating the number of samples corresponding
to that label in the train set and the F1-score that particular label achieve on the test set.
To better visualize, the distribution of the samples in the train set and also the
distribution of the performance, we display the marginal histograms for both training
data size (on top of each scatter plot) and the F1-score (on the right of each scatter plot).
Level Ni m-P m-R m-F1 M-P M-R M-F1

ResNet-50 (OFADB) with resampler (cw: 7, rs: 7)
family 6 0.9861 0.9012 0.9417 0.9718 0.8801 0.9173
subfamily 21 0.9860 0.9065 0.9446 0.7941 0.6548 0.6968
genus 135 0.9290 0.7518 0.8311 0.3918 0.2961 0.3212
genus + specific epithet 561 0.7249 0.3345 0.4578 0.1121 0.0832 0.0888
Table 4.6: Performance metrics for hierarchy-agnostic classifier on the ETHEC

Merged dataset with the best performing configuration in its category. The model
used in this experiment are pre-trained on the 1000-class ImageNet data set. All
weights are updated with a learning rate of 0.01, a batch-size of 64 and input
spatial dimensions are 224x224 for 100 epochs. P, R and F1 represent Precision,
Recall and F1-score. Metrics prefixed with m are micro-averaged while the ones
with M are macro-averaged.
39
4.4.2 Per-level classifiers

For the experiments reported here, we use the complete dataset instead of a subset.
As mentioned in section 4.4.1, due to the imbalanced nature of the ETHEC dataset
we perform experiments with a combination of data re-sampling and loss re-
weighing. The results are summarized in table 4.7.
Right off the bat, it is empirically seen that this model outperforms the model
from section 4.4.1. Not only does it perform better on the global metrics but also
the 4-level split metrics shown in table 4.8 in comparison to table 4.6.
The fact that the model is informed about the presence of different levels and that
each level has one and only one correct label makes a significant difference in the
performance. Instead of being able to predict as many labels as it wants (which
was the case in the multi-label classifier) here, the model is designed such that is
constrained to predict a single label per level in the hierarchy.
Additionally, only logits belonging to the same hierarchical level compete against
each other. By incorporating more information about the hierarchy the model
outperforms the hierarchy-agnostic classifier. The best hierarchy-agnostic classi-
fier has a m-F1 and M-F1 of 0.8147 and 0.3718 (from two separate models) while
the best performing per-level classifier has a m-F1 and M-F1 of 0.9084 and 0.6888
(from the same model).
cw rs m-P m-R m-F1 M-P M-R M-F1

ResNet-50
3 7 0.8483 0.8483 0.8483 0.6648 0.6789 0.6411
7 7 0.8930 0.8930 0.8930 0.6854 0.7094 0.6677
7 3 0.9084 0.9084 0.9084 0.7134 0.7223 0.6888
3 3 0.8760 0.8760 0.8760 0.6782 0.6874 0.6537
7 sqrt 0.9067 0.9067 0.9067 0.6941 0.7073 0.6700
Table 4.7: Performance metrics for Per-level classifier on the ETHEC Merged
dataset. The models used in this experiment are pre-trained on the 1000-class
ImageNet data set. All weights are updated with a learning rate of 0.01, a batch-
size of 64 and input spatial dimensions are 224x224 for 100 epochs. P, R and F1
represent Precision, Recall and F1-score; cw and rs represent class weight and re-
sampling. Metrics prefixed with m are micro-averaged while the ones with M are
macro-averaged. The top performing models are in bold-face. sqrt in the “rs”
denotes re-sampling using the inverse of the square-root of the class frequency as
weights, in other cases the inverse of the class frequency is used.
It is important to notice the trend between the macro and micro-averaged scores.
In table 4.7 all micro-averaged scores are always higher than their macro-averaged
counterparts which is a direct result of optimizing the model for best performance
measured by the micro-F1 score, which is a function of the micro precision and
micro recall.
Micro-averaged scores are calculated by computing a “combined” confusion ma-
40
4.4. ETHEC Dataset
trix across all labels and then computing the (averaged) score, in this case the
precision, recall and F1-score. Here, the notion of “averaging” comes from the fact
that individual, label-wise confusion matrix elements (true positive, false positives,
true negatives and false negatives) are combined in to a single global confusion
matrix. While for the macro-averaged scores, the per-label scores i.e. precision,
recall and F1-score are calculated and then their mean across all labels results in
the macro-averaged scores.
If the micro scores are higher than the macro scores it is an indicator of the classes
with fewer samples being classified incorrectly while the more popular classes
are being classified correctly and eventually inflating the micro score. On the
other hand, higher macro scores would indicate that labels with a large number
of samples are being misclassified.
In our case it is the former, where the micro scores dominate the macro scores
implying that the model performs well for classes with more training data which
is intuitive as the model has more data to train on and sees a variety of samples
for the same label in contrast to classes with only a handful of samples.
Refer to fig. 4.3d for the macro-F1, fig. 4.3c for the micro-F1 score and fig. 4.3b
for the training, validation and testing loss over 100 epochs. The micro-F1 per-
formance for experiments with different combinations of class weights and re-
sampling are compared in fig. 4.3a.
sqrt: square root resampler

In order to magnify the extremely under-represented classes, instead of using the
inverse of the frequency for a particular label we use the square-root of the inverse
of the frequency to perform re-sampling.
The use of an inverse square-root is to perform an even more aggressive re-sampling
for classes with only a handful of data points. However, we notice empirically that
this does not make a significant difference in the model’s performance.
Performance breakdown
In order to see where the model under-performs we break down the best per-
forming model’s score such that we have metrics across different hierarchical lev-
els. From table 4.8 it is evident that the performance is much better when there
are less classes since classes higher up the hierarchy agglomerate the descendant
nodes together, ending up with more data and less labels to differentiate between.
fig. 4.2 gives more insight into the data distribution for each level and the corre-
sponding performance measured by the F1 score. The performance of the model
deteriorates as one moves to the lower levels in the hierarchy. At the leaves the
performance is the worst among the four levels due to extensive branching in
the hierarchy and only several data samples per leaf label. With the help of in-
verse frequency weighted re-sampling the ill-effect of data deficiency is mitigated
to an extent with an improved performance as compared to when there is no
re-sampling at all (refer to table 4.7).
41
2 3
2
1
1
0 0
103 104 102 103 104
1.0 1.0 1.0 1.0
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6

Score
Score
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
0.0 0.0 0.0 0.0

103 104 0 1 2 102 103 104 0.0 2.5 5.0
15 60
10 40
5 20
0 0
102 103 5 × 101 6 × 101 7 × 101 8 × 101 9 × 101 102
1.0 1.0 1.0 1.0
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6

Score
Score
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
0.0 0.0 0.0 0.0

102 103 0 20 40 5 × 101 6 × 101 7 × 101 8 × 101 9 × 101 102 0 100
Figure 4.2: Per-label F1 performance across 4 hierarchical levels plotted against number
of training samples for Per-level classifier using ResNet-50 with resampler (cw: 7, rs: 3).
It is important to note that the population statistics, especially of lower levels in the
hierarchy (fig. 4.2c and fig. 4.2d) are skewed to the higher end as an effect of re-sampling
the less frequent classes.
42
4.4. ETHEC Dataset
Micro-F1 performance of multi-level models train loss, val loss and test loss
5 train loss
val loss
0.8 test loss
4
0.6 3
0.4 2
test micro_f1 resampler 1

0.2 test micro_f1 class weights
test micro_f1 class weights + resampler
test micro_f1 x 0
0 20 40 60 80 100 0 20 40 60 80 100
Epoch Epoch
(a) Micro-F1 performance over 100 epochs (b) Loss evolution over 100 epochs on the
for ResNet-50 multi-level classifier with dif- train, val and test datasets for ResNet-
ferent combinations of resampler and class 50 multi-level classifier with resampler (cw:
weights. Legend: x = cw: 7, rs: 7 7, rs: 3).
train micro_f1, val micro_f1 and test micro_f1 train macro_f1, val macro_f1 and test macro_f1
1.00 1.0
0.95 0.9
0.90 0.8
0.85 0.7
0.80 0.6
0.75 0.5
train micro_f1 train macro_f1
0.70 val micro_f1 0.4 val macro_f1
test micro_f1 test macro_f1
0 20 40 60 80 100 0 20 40 60 80 100
Epoch Epoch
(c) train, val and test micro-F1 perfor- (d) train, val and test macro-F1 perfor-
mance over 100 epochs for ResNet-50 multi- mance over 100 epochs for ResNet-50 multi-
level classifier with resampler (cw: 7, rs: level classifier with resampler (cw: 7, rs:
3). 3).
Figure 4.3: Comparing micro-F1 performance across different combinations of class

weights and resampler for the multi-level classifier. Refer to table 4.7 for best scores. Loss,
micro-F1 and macro-F1 for ResNet-50 with resampler (cw: 7, rs: 3) across 100 epochs.
43

ResNet-50 with resampler (cw: 7, rs: 3)
family 6 0.9766 0.9766 0.9766 0.9005 0.9328 0.9152
subfamily 21 0.9661 0.9661 0.9661 0.9433 0.9542 0.9424
genus 135 0.9204 0.9204 0.9204 0.8845 0.8482 0.8497
Table 4.8: Performance metrics for Per-level classifier on the ETHEC Merged
dataset when using a resampler which is the best performing model in its category.
The models used in this experiment are pre-trained on the 1000-class ImageNet
data set. All weights are updated with a learning rate of 0.01, a batch-size of 64
and input spatial dimensions are 224x224 for 100 epochs. P, R and F1 represent
Precision, Recall and F1-score. Metrics prefixed with m are micro-averaged while
the ones with M are macro-averaged.
44
4.4. ETHEC Dataset
4.4.3 Marginalization
model m-P m-R m-F1 M-P M-R M-F1

Models trained using grayscale images
ResNet-50 0.8586 0.8586 0.8586 0.6071 0.6070 0.5765
Models trained using normal color images
ResNet-50 0.9223 0.9223 0.9223 0.7095 0.7231 0.6927
ResNet-101 0.9110 0.9110 0.9110 0.7327 0.7262 0.7023
ResNet-152 0.9162 0.9162 0.9162 0.7181 0.7271 0.6954
Table 4.9: Performance metrics for Marginalization classifiers on the ETHEC

class ImageNet data set. All weights are updated with a learning rate of 10−5 , a
batch-size of 64 and input spatial dimensions are 224x224 for 100 epochs. P, R and
F1 represent Precision, Recall and F1-score. For all models, data re-sampling pro-
portional to the inverse of the class frequency is performed while training. Metrics
prefixed with m are micro-averaged while the ones with M are macro-averaged.
The top performing models are in bold-face. All these models use level weights
[1, 1, 1, 1].
L1 L2 L3 L4 m-F1 m-F1 L1 m-F1 L2 m-F1 L3 m-F1 L4

term Li in loss Per-level micro-F1
3 0.9137 0.9814 0.9638 0.9134 0.7962
3 3 0.9070 0.9774 0.9626 0.9077 0.7804
3 3 3 0.9207 0.9891 0.9733 0.9255 0.7948
3 3 3 3 0.9223 0.9887 0.9758 0.9273 0.7972
Table 4.10: We compare performance of Marginalization models when trained

with and without losses from each hierarchical level. Computing the losses over
more levels improves model performance. The models used in this experiment are
pre-trained on the 1000-class ImageNet data set. All weights are updated with a
learning rate of 10−5 , a batch-size of 64 and input spatial dimensions are 224x224
for 100 epochs. P, R and F1 represent Precision, Recall and F1-score. Metrics
This section compiles the general results for image classification using the marginal-
ization model. The ResNet-50 is the best performing model. This model predicts
the non-leaf labels in the hierarchy by marginalizing over children labels whose
are probabilities explicitly predicted by the model. We also notice that a huge
performance boost is obtained when normal color images are used as compared
to grayscale images. It is not just about the patterns but also the colors on the
specimen that help distinguish them.
We also show the best performing model’s performance when different loss terms
are used to compute the loss. We train models where we sum up classification
45
losses across different levels in the hierarchy and observe that when losses are
computed over more levels it yields better performance.
4.4.4 Masked Per-level classifier

This model predicts in a top-down manner where it assumes prediction made on
the upper levels are more accurate than the ones made for lower levels in the
label-hierarchy. A similar phenomenon is observed in a previous work [21].
Here, the model masks out all immediate non-descendants of the previous label
(which is a part of the level above the current in the hierarchy). This scheme is
made to trickle down all the way down to the leaf nodes. When predicting the
label for any given level, the model chooses the label with the best score among
all direct descendants of the label predicted for the level above the current.

Models trained using grayscale images
ResNet-50 0.8443 0.8443 0.8443 0.6002 0.5931 0.5619
Models trained using normal color images
ResNet-50 0.9173 0.9173 0.9173 0.7107 0.7227 0.6915
ResNet-101 0.9169 0.9169 0.9169 0.7119 0.7260 0.6921
ResNet-152 0.9152 0.9152 0.9152 0.7167 0.7281 0.6958
Table 4.11: Performance metrics for Masked Per-level classifier on the ETHEC
class ImageNet data set. All weights are updated with a learning rate of 10−5 , a
batch-size of 64 and input spatial dimensions are 224x224 for 200 epochs. P, R
and F1 represent Precision, Recall and F1-score. For all models, data re-sampling
proportional to the inverse of the class frequency is performed while training.
Metrics prefixed with m are micro-averaged while the ones with M are macro-
averaged. The top performing models are in bold-face. In addition to the normal
experiments, we also include results from models trained on grayscale images.
This is the second best performing CNN-based model and we also look at the
level-wise performance split in table 4.12.
46
4.4. ETHEC Dataset

ResNet-50 Performance Breakdown
family 6 0.9828 0.9828 0.9828 0.9735 0.9361 0.9495
subfamily 21 0.9701 0.9701 0.9701 0.9684 0.9252 0.9356
genus 135 0.9233 0.9233 0.9233 0.8916 0.8432 0.8525
Table 4.12: Performance metrics for Masked Per-level classifier on the ETHEC
Merged dataset for the best performing model. The models used in this experi-
ment are pre-trained on the 1000-class ImageNet data set. All weights are updated
with a learning rate of 10−5 , a batch-size of 64 and input spatial dimensions are
224x224 for 200 epochs. P, R and F1 represent Precision, Recall and F1-score.
Metrics prefixed with m are micro-averaged while the ones with M are macro-
averaged.
L1 L2 L3 L4 m-F1 m-F1 L1 m-F1 L2 m-F1 L3 m-F1 L4

term Li in loss Per-level micro-F1
3 0.0633 0.2325 0.0162 0.0022 0.0022
3 3 0.1043 0.3058 0.0410 0.0386 0.0319
3 3 3 0.0848 0.0970 0.0919 0.0879 0.0622
3 3 3 3 0.9098 0.9808 0.9616 0.9116 0.7853
Table 4.13: We compare performance of Masked Per-level classifier when trained

with and without losses from each hierarchical level. Computing the losses over
more levels improves model performance. The models used in this experiment are
47
4.4.5 Hierarchical Softmax

ResNet-50 0.9055 0.9055 0.9055 0.6899 0.7049 0.6723
ResNet-101 0.9122 0.9122 0.9122 0.7049 0.7072 0.6780
ResNet-152 0.9180 0.9180 0.9180 0.7119 0.7174 0.6869
Table 4.14: Performance metrics for Hierarchical Softmax on the ETHEC Merged
dataset. The models used in this experiment are pre-trained on the 1000-class Im-
ageNet data set. All weights are updated with a learning rate of 10−5 , a batch-size
of 64 and input spatial dimensions are 224x224 for 100 epochs. P, R and F1 repre-
sent Precision, Recall and F1-score. For all models, data re-sampling proportional
to the inverse of the class frequency is performed while training. Metrics prefixed
with m are micro-averaged while the ones with M are macro-averaged. The top
performing models are in bold-face.
The Hierarchical Softmax seems to be less prone to over-fitting and has the best
performing model with the ResNet-152 backbone. To recall, the model predicts
conditional distribution p(child|parent) for each label in the hierarchy. The joint
distribution is calculated by multiplying probabilities of all labels along a specific
path to obtain the probability for the leaf.

ResNet-152 with Hierarchical Softmax — Performance Breakdown
family 6 0.9879 0.9879 0.9879 0.9605 0.9452 0.9522
subfamily 21 0.9731 0.9731 0.9731 0.9605 0.9452 0.9522
genus 135 0.9253 0.9253 0.9253 0.8972 0.8504 0.8574
Table 4.15: Performance metrics for Hierarchical Softmax on the ETHEC Merged
dataset for the best performing model. The models used in this experiment are
48
4.4. ETHEC Dataset
2 2
1 1
0 2 0
10 103 104 102 103 104
1.0 1.0 1.0 1.0
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6

Score
Score
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
0.0 2 0.0 0.0 0.0

10 103 104 0 1 2 102 103 104 0 5 10
15 40
10
20
5
0 0
102 103 5 × 101 6 × 101 7 × 101 8 × 101 9 × 101
1.0 1.0 1.0 1.0
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6

Score
Score
0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2
0.0 0.0 0.0 0.0

102 103 0 20 40 5 × 101 6 × 101 7 × 101 8 × 101 9 × 101 0 100
Figure 4.4: Per-label F1 performance across 4 hierarchical levels plotted against number
of training samples for ResNet-152 with Hierarchical Softmax and resampler (cw: 7, rs:
3). It is important to note that the population statistics, especially of lower levels in the
hierarchy (fig. 4.2c and fig. 4.2d) are skewed to the higher end as an effect of re-sampling
the less frequent classes.
49
4.4.6 Results Summary
baseline model m-P m-R m-F1 M-P M-R M-F1

hierarchy-agnostic classifier ResNet-50 0.9324 0.7235 0.8147 0.1913 0.1462 0.1568
Per-level classifier ResNet-50 0.9084 0.9084 0.9084 0.7134 0.7223 0.6888
Marginalization ResNet-50 0.9223 0.9223 0.9223 0.7095 0.7231 0.6927
Masked Per-level classifier ResNet-50 0.9173 0.9173 0.9173 0.7107 0.7227 0.6915
Hierarchical softmax ResNet-152 0.9180 0.9180 0.9180 0.7119 0.7174 0.6869
Table 4.16: Comparing best performing baseline classifiers on the ETHEC Merged
dataset. The models used in this experiment are pre-trained on the 1000-class Im-
ageNet data set. P, R and F1 represent Precision, Recall and F1-score. For details
regarding the specific baselines please refer to the respective sections. Metrics pre-
50
Chapter 5
Methods: Order-preserving
embedding-based models
5.1 Cosine Embeddings
Figure 5.1: The latent space of the modified the per-level model is used to extract
label embeddings. An additional layer is added after the final layer in the original
model. The model is trained exactly like the L Ni -way classifier. The weights of
the layer labeled N × Li hold the N-dimensional representations of the labels Here,
N=2, however it can be extended to any N-dimensional embedding space.
For cosine embeddings we extract label representations from the latent space of
a CNN trained for image classification. The learned representations are a by-
product of the model being explicitly trained only for image classification. It is
important to note that cosine embeddings are not necessarily order-preserving but
are presented in this chapter with all the other embedding based models.
51
5. Methods: Order-preserving embedding-based models
We modify the Per-level classifier model by adding a linear layer that projects
the final fully-connected layer of the original model to a latent space which is
interpreted as the label embedding.
The additional layer projects onto the N-dimensional embedding space for every
label. In the fig. 5.1, the weights of the layer labeled 2 × Li holds the 2-dimensional
representation, one for each label for the i-th level.
When performing image classification, matrix multiplication of layer weights and
image representation from the upper layer yield the logits for each label. The
weights in the last layer represent the label embeddings. The label logits that
represent the similarity between an image and the labels are computed by the dot-
product between the image’s representation and the representation of each label.
The larger the magnitude of the dot-product between the representations of an
image and a particular label, the larger the corresponding logit. A larger logit
implies the likelihood of the image belonging to the particular label is high.
5.2 Order-Embeddings
In this part we introduce learning representations for both concepts and images
via embeddings. Recent advances show how unconventional loss functions (in-
stead of the widely used vanilla inner-product or their p-norm distance) can model
the asymmetric relations between concepts. Directed graphs also model asymmet-
ric relations between two nodes connected by a directed edge.
We treat our label hierarchy as a directed-acyclic graph. More specifically, due to
the nature of defining the taxonomy and how we create the dataset, it is a directed
tree graph. The dataset X consists of entailment relations (u, v) connected via a
directed edge from u to v. (following the definition in [16]). These directed edges
or hypernym links convey that v is a sub-concept of u.
We train our model using the max-margin loss proposed in [39]. Unlike in the
work [39] we do not restrict the embeddings to have positive coordinates only
and get rid of the absolute function used by [39], granting the embeddings more
freedom.
5.3 Euclidean Cones

Euclidean cones is a generalization of the order-embeddings method. If v is-a u
then in an ideal order-embeddings f (v) lies within that orthant that has its origin
at f (u). Euclidean cones is a generalization of order-embeddings and allows for
more flexible volumes in R N to define is-a relations.
A concept that is more abstract than others and consequently entails a lot more
concepts can have more volume than the concepts that are entailed by it.
The general form of the loss is the same as in eq. (2.3). The violation penalty, E,
differs from that of order-embeddings and takes a more general form as defined
by eq. (2.6).
52
5.4. Hyperbolic Cones
Figure 5.2: Visualizing the latent 2D space of the modified per-level model for the
ETHEC dataset [11]. Each label’s embedding is plotted. The weights of the layer
labeled 2 × Li hold the 2-dimensional representations of the labels. We visualize
the relations between nodes by adding edges from the original label hierarchy.
Legend=cyan: family, magenta: subfamily, yellow: genus, black: genus+species.
We see that for labels from a given level, the model is more confident about those
that have a larger norm in the embedding space. For the lower levels (genus,
genus+species) the labels form roughly a circular pattern meaning that the model
has the same confident across the labels. We also see that cyan nodes are collapsed
towards the center even though they have most samples per label (as they are the
top-most level in the hierarchy) however since they capture images with a large
intra-label variance they have a smaller norm than the magenta nodes.
5.4 Hyperbolic Cones

Instead of the cones and the embedding space conforming to Euclidean geometry
one can formulate it to live in non-Euclidean space. Hyperbolic and spherical
geometry both are non-Euclidean geometries (non-zero curvature) with negative
and positive sectional curvature respectively.
The hyperbolic space can be modeled in 5 different ways. We use the Poincaré
ball like previous work [32, 16].
The hyperbolic cones like the rest of the models are implemented in PyTorch [33].
We follow the schemes from [15] to avoid numerical instabilities when learning
parameters in the hyperbolic space, more specifically the Poincaré ball.
53
Figure 5.3: An illustration of the order-embedding space. In an ideal scenario, if

(u, v) is a positive edge i.e. v is a sub-concept of u, then, all such sub-concepts lie
within a translated orthant in the embedding space (a quadrant in R2 ) originat-
ing at u’s embedding. Both the immediate and non-immediate descendants (of a
parent concept) should lie within the translated orthant that belongs to the parent
concept.
Figure 5.4: An illustration of euclidean cones. In an ideal scenario, if (u, v) is a

positive edge i.e. v is a sub-concept of u, then, all such sub-concepts lie within a
cone in the embedding space originating at u’s location in the embedding space.
Both the immediate and non-immediate descendants (of a parent concept) should
lie within the cone that belongs to the parent concept. Euclidean Cones is a gener-
alization of order-embeddings. Like the orthants in order-embeddings the cones
here extend to infinitely.
54
5.5. Embedding Label-Hierarchy
Label embeddings. For our implementation of the hyperbolic cones, the label-
embeddings live in the hyperbolic space Dn and are optimized using the RSGD as
per eq. (2.14) and eq. (2.13) with the help of the exponential-map from eq. (2.15).
RSGD is implemented by modifying the SGD gradients in PyTorch as it is not a
part of the standard library.
Image embeddings. For images, features from the final layer of the backbone of
the best performing CNN-based model are used (∈ R2048 ). In order to map them
to Dn we use a linear transform W ∈ R2048×n and then apply a projection into
Dn via the exponential-map at zero which is equivalent to exp0 ( x ). This bring
the image embeddings to the hyperbolic space with Euclidean parameters. This
allows for optimizing the parameters with well know optimization schemes such
as Adam [20].
5.5 Embedding Label-Hierarchy

First we begin by learning to represent the hierarchy alone with this model. Con-
sidering only the label-hierarchy and excluding the images (momentarily), we
model this problem as hypernym prediction where a hypernym pair represents
two concepts ( x, y) such that y is-a x.
Hypernyms occur naturally in our ETHEC dataset [11] where edges through dif-
ferent levels in the label-hierarchy represent the is-a relation with a directed edge
from node x to node y representing y is-a x.
Data splitting
We split the data into train, val and test in a similar manner to that of [16]. They
first compute the transitive reduction of the directed-acyclic graph. However, since
it is a tree it is already in the most minimal form and we use the tree to form the
“basic” edges for which the transitive closure can be fully recovered. If these edges
are not present in the train set, the information about them is unrecoverable and
therefore they are always included in the train set. Now, we randomly pick edges
from the transitive closure (=1974 edges) minus the “basic” edges (=723 edges) to
form a set of “non-basic” edges (=1257 edges). We use the “non-basic” edges to
create val (5%) (=62 edges) and test (5%) (=62 edges) splits and a proportion of
the rest are reserved for training (see Training details).
Training details
We follow the training details from [16]. We augment both the validation and
test set by generating 5 negative pairs each for ( x, y) (a positive pair): of the type
( x 0 , y) and ( x, y0 ) with a randomly chosen edge that is not present in the full
transitive closure of the graph. Generating 10 negatives for each positive. For
the training set, negative pairs are generated on-the-fly in the same manner. We
report performance on different training set sizes. We vary the training set to
include 0%, 10%, 25%, 50% of the “non-basic” edges selected randomly. We train
for 500 epochs with a batch size of 10 and a learning rate of 0.01. We run two sets
55
of experiments: one, we fix α = 1.0 as mentioned in [39] and two, tune α based on
the F1-score on the val set [16].
pick-per-level strategy
During the experiments, we found a better strategy to sample negative edges. In-
stead of sampling a negative edge ( x 0 , y) from candidates where x 0 is any node
that makes ( x 0 , y) a negative edge, we pick each x 0 from a different level in the
hierarchy. This serves a dual purpose. Because the hierarchy is a tree, 78.24%
of the nodes belong to the final level in the hierarchy and if a pick-per-level
strategy is not applied one would always sample edges where the corrupted end
would be from the last level majority of the times. This makes training and con-
vergence excruciatingly slow. Secondly, with this pick-per-level strategy we
are able to sample nodes that give hard negatives edges from the same level as
the non-corrupted node y, helping embeddings to disentangle and spread out in
space.
Optimization details
We use Adam optimizer [20] for order-embeddings and Euclidean cones. For
hyperbolic cones we use Riemannian SGD [16].
5.6 Jointly Embedding Images with Label-Hierarchy

In the order-embeddings paper [39], the authors propose a two-level hierarchy
for image-caption retrieval task. In their formulation, using a 2-level hierarchy,
images are put on the lower-level and the captions on the upper level; the reason
being: images are more detailed while captions represent concepts more abstract
than the image itself. They also use a different loss from the one proposed in the
hypernym prediction task.
For jointly embedding the images together with the labels we use the same hyper-
nym loss from eq. (2.3). The only change being that now in addition to the labels,
G (the graph representing the hierarchy) also contains images as nodes as leaves
at the lowest level.
G constitutes of two types of edges: an edge (u, v) can be such that u, v ∈ labels or
u ∈ labels, v ∈ images. This is not of concern as the only difference is the way the
embeddings are computed for images and labels. In the end, both f i and f l map
respective inputs to the same space.
Classification with Embeddings

Since our problem does not concern hypernym prediction but rather assigning
multiple labels to an image; instead of performing edge prediction (as the case
would be in a hypernym prediction task) we use the embeddings for the task of
classification.
To classify an image we compute the order-violation in eq. (2.2) between the given
image and each label and pick the label corresponding to the minimum violation.
56
5.6. Jointly Embedding Images with Label-Hierarchy
arg min E( f l (l ), f i (i )), ∀l ∈ labels (5.1)

l
Image and Label Embeddings

To generate image embeddings we use the best performing CNN model trained on
the ETHEC dataset [11] and use it extract fc7-features from the penultimate layer.
We use a learnable linear transformation, a matrix W, on top of the fc7-features to
be able to adjust the fc7-features and map them into the joint embedding space in
R N for Euclidean models and D N for hyperbolic models (Poincaré disk).
f i (i ) = W ∗ CNN(i ) ∈ R N (5.2)
where, CNN(i ) represent the fc7-features from our best performing CNN model
and W is a matrix. The weights of the CNN are frozen to calculate the fc7-features
with only W that can be learned.
For the labels, f l (l ) is just a lookup table that stores vectors in R N .
Data splitting
For these experiments, we split the data into train (80%), val (10%) and test
(10%) based on images only as done for the CNN-based models. Since, now we
embed images together with the labels, we create a combined graph G to represent
both. The graph contains directed edges from each label that “describes” the
image to the image itself as well as edges between related labels.
Training details
Let G represent the graph to be embedded. All edges in Gtc , the transitive clo-
sure of G , are considered as positive edges. To obtain negative edges, Gneg is
constructed by removing the edges in Gtc from a fully-connected di-graph with
the same nodes as G .
While training, for each positive edge ( x, y) 10 negatives are sampled, 5 each by
randomly corrupting either side of the positive edge (5 × ( x 0 , y) + 5 × ( x, y0 )). We
generate negatives by corrupting the edge with nodes from each level in the hier-
archy including the images (the lowest level) because the images outnumber the
labels and we would like to embed the label-hierarchy in addition to the images.
We use the pick-per-level strategy as described in the previous section.
We make sure that we do not sample a negative edge such that either side of the
edge is an image. This is to ensure that two images are not forced apart unless
their labels require them to do so because the images for the final level and there
are no cones that are nested with the cone formed by the image embedding (as it
is the last level in the graph G ).
For the validation and test set, we generate 5 negative pairs each for ( x, y) (a
positive pair): of the type ( x 0 , y) and ( x, y0 ) with a randomly chosen edge that is
57
not present in the graph Gneg . The validation and test sets are generated in the
beginning and are fixed during training. We follow the training details from [16].
Negatives are sampled only for the training set and are generated on-the-fly. The
model’s performance is measured on its ability to classify images correctly. Since
negative edges are not required for measuring classification performance, negative
edges are sampled only during training. For validation and testing, we measure
the model’s classification the val and test set images respectively.
Graph reconstruction task. In addition to the classification task, we also check

the quality of reconstruction of the label-hierarchy itself. Here, all the edges in G
that correspond to edges between labels are treated as positive edges, while the
the edges in Gneg that correspond to edges between labels are treated as negative
edges. We compute E(u, v) ∀e ∈ P ∪ N where e = (u, v) and choose a threshold
to classify edges as positive and negative using that yields the best F1-score on this
label-hierarchy reconstruction task. This task does not use any edges that have an
image on any side to check the quality of reconstruction.
For W we use a linear transformation, a matrix R2048× N . Non-linearity is not
applied to the output that maps to the embedding space.
Optimization details
For jointly embedding labels and images, we empirically found using vanilla
Adam [20] optimizer instead of the Riemannian SGD. The drawback being that
the label embeddings are parameterized in the Euclidean space and we use the ex-
ponential map at 0 from eq. (2.15) to map them to the hyperbolic space. This was
observed to be more stable and help converge the joint embeddings. Also, with
this implementation of the hyperbolic cones, for both labels and joint embeddings,
it was not necessary to initialize the embeddings with the Poincaré embeddings
[32] as suggested in [16].
We even notice that when training jointly one does not need to initialize the la-
bels with separate labels-only embedding. The model is still able to attain decent
image classification performance when the label embeddings are randomly initial-
ized. However, a performance boost is obtained when initialized with values from
embedding only the label-hierarchy.
58
Chapter 6
Empirical Analysis: Order-preserving

embedding-based models
6.1 Embedding Label-Hierarchy

We first embed the label-hierarchy exclusively. We report results where we per-
form embed only the complete ETHEC Merged Dataset label-hierarchy without
any images. From table 6.1 one can see that even with very few dimensions, one
is able to achieve a high F1 score on the hidden edges from the test set. Also,
in the same low-dimensions the entailment cones perform better than the order-
embeddings confirming that they are a generalization of the latter. The metrics in
table 6.1 are calculated on a set of hidden edges that form the test set of edges.
These metrics are different from the ones in the section below that discusses the
graph reconstruction task.
Figure 6.1: Visualization of the label-hierarchy using order-embeddings in 2

dimensions. The cyan nodes represent family, the magenta nodes represent
sub-family, the yellow nodes genus. genus+species nodes are omitted to vi-
sualize better.
59
6. Empirical Analysis: Order-preserving embedding-based models
Figure 6.2: Visualization of the label-hierarchy using Euclidean cones in 2 dimen-

sions. The cyan nodes represent family, the magenta nodes represent sub-family,
the yellow nodes genus. genus+species nodes are omitted to visualize better.
Model d=2 d=3 d=5 d=10 d=100

Order-embeddings 0.8271 0.9302 0.9457 0.9920 0.9920
Euclidean Cones 0.8550 0.9979 0.9593 0.9919 0.9752
Table 6.1: Micro-F1 score on the test set for embeddings on the label hierarchy of
ETHEC Merged dataset. We find the classification threshold that yields the best
val set performance. For these experiments we train for 200 epochs with α = 1.0
for order-embeddings and α = 0.01 for Euclidean cones, a batch-size of 10 and a
learning-rate of 0.1 with Adam optimizer. The F1-score corresponding to the test
set for the epoch with the best F1-score on the val set is reported. train set is
composed of all the “basic” edges and an additional 50% of the “non-basic” edges
for all experiments reported here. We vary the dimensionality of the embedding
space, d = {2, 3, 5, 10, 100}.
6.1.1 Graph reconstruction quality for label-embeddings
In addition to the entailment prediction task between two given concepts [16, 32,
38] we also check the reconstruction of the complete graph which basically checks
60
6.1. Embedding Label-Hierarchy
(a) Order-embeddings L=4, b=3 (b) Order-embeddings L=3, b=7
(c) Euclidean cones L=4, b=3 (d) Euclidean cones L=3, b=7
Figure 6.3: We embed 2 different toy graphs. One with 4 levels and a branching factor
of 4 and another one with 3 levels and a branching factor of 7. The model is trained for
1000 epochs with Adam (learning rate of 0.01). The toy graphs are embedded using both
order-embeddings and euclidean cones in R2 . We draw an edge between each node that
is connected in the original in order to better visualize the embedding quality. Nodes
from different levels are colored differently. The illustrations show the levels and
branching factor, the edges are split into train, val and test and report F1-score,
precision, recall and accuracy; and the threshold to decide if a pair of nodes have a
directed edge or equivalently if they are hypernyms.
61
the ability of the embedding to reproduce the asymmetric relations in the original
label hierarchy. This consists of positive and negative directed edges from the
original labels where the positive edges are present in the label-hierarchy while
negative edges are non-existent edges. It is important to note that only a handful
of edges are positive while a vast majority of the edges in the fully-connected
digraph for the set of negative, non-existent edges (in the original label-hierarchy).
For the ETHEC dataset, to obtain the full-F1 (measuring the reconstruction of the
label-hierarchy) we classify all the 723 positive and 521,289 negative edges. Due
to this very large imbalance between the negative and positive edges we refrain
from using accuracy or micro/macro F1 score and used the TPR, TNR and full-F1
instead.
d=2 d=100 d=1000

TPR/ TNR/ (full-F1) TPR/ TNR/ (full-F1) TPR/ TNR/ (full-F1)
OE 0.2309 / 0.9708 / (0.1372) 0.4686 / 0.9880 / (0.3894) 0.3788 / 0.9878 / (0.3489)
EC 0.3617 / 0.9975 / (0.3573) 0.4802 / 0.9985 / (0.4151) 0.5790 / 0.9973 / (0.4091)
HC 0.4443 / 0.9907 / (0.2296) 0.9336 / 0.9986 / (0.8060) 0.9721 / 0.9986 / (0.8257)
Table 6.2: Graph reconstruction performance. Here we measure the ability to

reconstruct all positive and negative edges in the original label-hierarchy using
the embeddings. We choose the threshold boundary by picking the one that
maximizes the full-F1 score. This is the F1-score on all positive and negative
edges in the (full) label-hierarchy. Due to the imbalance the F1-score may not
completely represent the reconstruction ability so we also report the true posi-
tive rate (TPR) and true negative rate (TNR). The cells in table are formatted as
TPR/ TNR/ (full-F1). The model is trained with labels only for 5000 epochs. HC
is trained using RSGD with a lr = 0.001, OE and EC are optimized with Adam[20]
with a lr = 0.1. For both the margin used is α = 0.1 and sampling 10 negatives for
each positive and a batch size of 100 for HC and 10 for OE and HC. Legend: OE:
Order-embeddings, EC: Euclidean cones, HC: Hyperbolic cones.
In table 6.2 we compare the embedding performance for labels-only from the
ETHEC dataset. We employ order-preserving embedding techniques as discussed
in previous chapters from order-embeddings [39] and euclidean and hyperbolic
variants of [16].
Positive edges constitute of only about 0.1% of the total edges in the DAG repre-
senting the label-hierarchy; making it very difficult to predict these. This is also
evident empirically as the TNR is quite high for extremely low-dimensions. For 2D
order-embeddings the TNR=0.9708 despite being the lowest for 2D embeddings.
The performance boost achieved by entailment cones, a generalization of order-
embeddings, can be seen from the table where the euclidean variant is always
better than order-embeddings. By parameterizing the cones to live in hyperbolic
space and use the corresponding hyperbolic geometry, they are able to achieve
almost twice the TPR of OE and EC in 100 dimensions. Further increasing the
dimensions to 1000 dimensions improves performance over 100-dimensional HC
from 0.8060 to 0.8267 in full-F1 score, exhibiting the representative capacity of HC.
62
We also note that for 1000-D EC and OE, the model seems to overfit and perform
worse than the 100-D counterparts. However the HC improves with increase in
the embedding dimensionality.
(a) Hyperbolic Cones 100-D (b) Hyperbolic Cones 1000-D
Figure 6.4: Projected visualization of labels embedded using hyperbolic cones in

100 and 1000 dimensions. The cyan nodes represent family, the magenta nodes
represent sub-family, the yellow nodes genus and black nodes genus+species.
This resembles a flower-like shape where the more generic concepts are closer to
the origin and at the base of this flower-like shape and most specific concepts
at the tip of the petals which forms the periphery are a visible the most (=black
nodes).
6.1.2 Optimization
Initially, the experiments used a batch-size of 10 and the models still had room
for improvement at the end of 5000 epochs. However, with a batch-size of 100
the models converged faster and also performed better. In general it was easier
to find hyperparameters for euclidean models that the non-euclidean ones. We
also noticed better, more stable training when parameterizing the euclidean cones
in cosine space rather than angle space. For euclidean cones experiments we
implement and use the cosine space formulation to compute the energy E from
eq. (2.6).
6.2 Jointly Embedding Images with Label-Hierarchy

When performing joint embedding, we simply add images to the existing graph
of the label-hierarchy. The notion behind which is to treat the images and labels
in the same manner. This allows to re-use the label-embeddings to additionally
embed images for the task of image classification.
To perform image classification using embeddings, the least violating order-penalty
63
E( f l (l ), f i (i )) across all possible labels for a given image is considered as the pre-
dicted label. For every level in the label-hierarchy this is done once, with the labels
from that particular level.
Figure 6.5: Visualization of jointly embedding labels and images using Euclidean
cones in 2 dimensions. The cyan nodes represent family, the magenta nodes
represent sub-family, the yellow nodes genus and black nodes genus+species.
The images are depicted using semi-transparent nodes which are accumulated
away from the origin around the periphery. To better visualize this, we clamp the
norms of image embeddings (to 500 units) in order to visualize them with labels
which are embedded much closer to the origin (due to them being more abstract).
6.2.1 Optimization
The EC models use α = 1.0 trained for 200 epochs with a learning-rate of 10−2 for
the label embeddings, 10−3 for the image embeddings using Adam. For the HC
we train for 100 epochs with a learning-rate of 10−4 for the label embeddings, 10−3
for the image embeddings using Adam. We use a batch-size of 100 for all with 10
negatives per positive pick-per-level sampling. We initialize the models’ label
embeddings using label-only embeddings for all.
64
classify test set images graph reconstruction

Model m-F1 hit@3 hit@5 TPR TNR full-F1
Euclidean Cones
d=10 0.7795 0.8893 0.9204 0.8045 0.9982 0.7040
d=100 0.8350 0.9018 0.9425 0.9630 0.9986 0.8210
d=1000 0.8013 0.8971 0.9278 0.8146 0.9981 0.7073
Hyperbolic Cones
d=100 0.8404 0.9200 0.9386 0.6418 0.9978 0.5756
d=1000 0.8045 0.9023 0.9281 0.5233 0.9973 0.4832
Table 6.3: The table summarizes the embedding model performance when used
to classify images. The metrics are reported on the hidden images of the test set
of the ETHEC dataset. The joint image and label embeddings live in Rd or Dd
(d=dimensionality of embedding space). The main metric to look at is the m-F1
for image classification performance using the embeddings. This is directly com-
parable to the CNN-based models and the hierarchy-agnostic model (baseline) as
well. Since these models are embedding based, in addition, to the classification
task, we report the quality of the reconstruction for the label-hierarchy obtained
during the joint embedding. The EC models use α = 1.0 trained for 200 epochs
with a learning-rate of 10−2 for the label embeddings, 10−3 for the image embed-
dings using Adam. For the HC we train for 100 epochs with a learning-rate of
10−4 for the label embeddings, 10−3 for the image embeddings using Adam. We
use a batch-size of 100 for all with 10 negatives per positive pick-per-level sam-
pling. We initialize the models’ label embeddings using label-only embeddings
for all. Best models are bold-faced. EC: euclidean cones, HC: hyperbolic cones.
6.2.2 Hierarchical level-wise classification performance

In this section we compare level-wise classification performance for the order-
preserving embedding models. For easier, side-by-side comparison, performance
of CNN-based models proposed earlier is also included.
table 6.4 shows that the hierarchy-agnostic baseline is beaten by all models which
use hierarchical information in one form or the other.
Embeddings that are completely different class of models and are seen to be
widely used in the context of natural language but are relatively unexplored for
image classification, are also able to outperform the baseline CNN classifier. The
embeddings also perform better for most of the levels in the level-wise m-F1 score
column.
6.2.3 W’s model capacity

We also use a simple matrix W that transforms fc7 image features to the embed-
ding space. A more elaborate 4-layer feed-forward neural network was also used
but we found it to be worse performing and hard to optimizer. Instead of using the
CNN to extract features one could instead have all its parameters learn-able dur-
ing the joint-embedding training procedure but in our experiments we observed it
65
Per-level micro-F1
Model m-F1 m-F1 L1 m-F1 L2 m-F1 L3 m-F1 L4
CNN-based methods
Hierarchy-agnostic (baseline) 0.8147 0.9417 0.9446 0.8311 0.4578
Per-level classifier 0.9084 0.9766 0.9661 0.9204 0.7704
Marginalization classifier 0.9223 0.9887 0.9758 0.9273 0.7972
Masked Per-level classifier 0.9173 0.9828 0.9701 0.9233 0.7930
Hierarchical-softmax 0.9180 0.9879 0.9731 0.9253 0.7855
Order-preserving (joint) embedding models
Euclidean cones d=100 0.8350 0.9728 0.9370 0.8336 0.5967
Hyperbolic cones d=100∗ 0.7627 0.9695 0.9205 0.7523 0.4246
Hyperbolic cones d=100 0.8404 0.9800 0.9439 0.8477 0.5977
Table 6.4: Comparing level-wise performance across different models both CNN-
based and embeddings based as proposed in the main body of the work. All
models that exploit any information from the hierarchy outperform the hierarchy-
agnostic classifier baseline. All scores are micro-averaged F1. We also include the
overall m-F1 in addition to the separate m-F1 across the 4 levels in the ETHEC
dataset. The best overall model is underlined and the best model in the model
category is bold-faced. Label embeddings for all joint-embeddings models are ini-
tialized using labels-only embeddings. ∗ = randomly initialized label embeddings.
to drastically over-fit and have high performance on the train set and extremely
low performance on the unseen test set.
6.2.4 Sampling strategy

To perform sampling of negative edges, we use the pick-per-level strategy with
5 (u0 , v) + 5 (u, v0 ) negative edges for each positive edge. Different number of neg-
ative edges were tried with 1+1 and 50+50 but the 5+5 (=10:1 negative to positive
edge ratio) worked best.
For joint-embedding the ETHEC dataset, since the images (around 50,000) outnum-
ber the labels (723) we thought it might be useful to randomly sample negative
edges such that the ratio of negative nodes have a proportion to be 50%:50% for
images:label ratio however, the original strategy works better.
6.2.5 Choice of Optimizer

Initial experiments for the hyperbolic cones (HC) used the RSGD optimizer as it
seemed to work for labels-only embeddings hyperbolic cones. When using the
same to optimize over the labels for the joint-embedding model, we noticed that
the label hierarchy moves towards the image labels and ends up collapsing from
a very good initialization (taken from the labels-only embeddings). The collapse
leads to entanglement between nodes from different labels and images, which
leads it to a point of no return and the performance worsens due to the label-
hierarchy becoming disarranged and its inability to recover. We believe that the
66
reason for its inability to rearrange is due to there being a two different types of
objects being embedded (and also being computed differently) and it compounded
by using different optimizers.
In our experiments we obtain best results when using the Adam optimizer even
if it means the update step for parameters living in hyperbolic space has to be
performed in an approximate manner. Adam optimizer with an approximate
update step works better in practice than RSGD with its mathematically more
precise update step.
6.2.6 Label initialization for joint-embeddings

Using RSGD we observed that if the labels are not initialized with the labels-
only embedding then the joint model finds it difficult to disentangle the label
embeddings and eventually this effect is cascaded to the images causing the image
classification performance to not improve.
With the RSGD replaced by the Adam optimizer, in experiments where we ran-
domly initialized the label-embeddings, we observed them to disentangle and
form entailment cones even with the images being involved and making the opti-
mization more complex. The joint-model still works decently well with random
label initialization and achieves an image classification m-F1 score of 0.7611 and
even outperforms the hierarchy-agnostic CNN in the m-F1 for L1 labels (see ta-
ble 6.4 for details).
[16] recommends to use Poincaré embeddings [32] to initialize the hyperbolic
cones model. The fact that the joint model as well as the labels-only hyperbolic
cones have great performance without any special initialization scheme is interest-
ing. We conjecture that this could be because of using an approximate yet better
optimizer.
6.2.7 Inverted cosine embeddings and euclidean cones

We notice the similarity between embedding label-hierarchy with Euclidean cones
and the inverted cosine embeddings. We invert the cosine embeddings in order
to have the most abstract concepts close to the origin and more specific concepts
farther away. The cosine embeddings arrange themselves in an inverted manner
due to them being more accurate and confident about labels from the upper levels
in the hierarchy due to the dot product nature of a fully-connected layer in a
neural network (=matrix multiplication and a non-linearity).
We invert the cosine embeddings by re-scaling them such that the one furthest are
close to the origin and vice-versa.
r ∗ x ∗ || xmax ||
xinverted = (6.1)
|| x ||
where, xmax is the cosine embedding with the largest norm and r ∈ R is the
minimum norm that any inverted embedding should have.
67
Figure 6.6: Visualization of inverted cosine embeddings (labels) in 2 dimensions.

The cyan nodes represent family, the magenta nodes represent sub-family, the
yellow nodes genus and black nodes genus+species. These closely resemble the
euclidean cones.
One can notice that the inverted cosine embeddings from fig. 6.6 (does not show
the images) closely resemble the Euclidean cones of the label-embeddings from
fig. 6.5
68
Chapter 7
Conclusions
7.1 Future work

In this section, we discuss possible future directions. We split them based on the
domain these directions pertain to.
• Optimization. During the experiments we noticed that using the Adam op-
timizer with an approximate update step for hyperbolic parameters worked
better than a mathematically more accurate Riemannian SGD optimizer. From
the point-of-view of optimizer, it is an interesting trade-off between strong
but approximate and weak but accurate optimizers. We believe that this was
observed because Adam is better-suited for optimizing such problems with
highly non-convex landscape as the one we try to solve with embeddings-
based models.
In the results section we show that joint-embeddings when trained with
labels-only embedding perform best. But randomly initialized labels also
converge and have a decent performance to show for on the image classi-
fication task. In their experiments [16] observed high-performing models
when initialized with Poincaré embeddings from [32] as they conjecture that
without such initialization the model does not perform as expected due the
problem being difficult to optimize.
• Datasets. We show different methods that can be used to inject label-hierarchy
knowledge to an arbitrary classifier. A direction would be to extend the pro-
posed methods and the use of embeddings models for image classification
to a variety of different datasets.
• Label accuracy vs. label specificity trade-off. On top of the proposed em-
bedding and classification methods, one can think of adding an additional
module/method to select a sweet-spot to trade-off between accuracy of the
predicted labels and the specificity of the predicted labels such as the work
proposed in [10]. If the classifier is less confident about a prediction it can
predict a more abstract label that is less informative but still correct.
• Model complexity. For embeddings-based models, one could use a more
sophisticated model to map from fc7 image features to the embedding space
69
7. Conclusions
for images. Obviously, this comes with possible issues like over-fitting and
difficulty to optimize parameters living in non-euclidean space. The W ma-
trix used to map in our experiments lives in the Euclidean space but very
recent work on hyperbolic neural networks [15] could be used as a pointer
to model feed-forward networks that replace the matrix W. This W would
then live as well as be optimized in hyperbolic space.
• Applications. We empirically show that a classifier that exploits its label-
hierarchy outperforms a model that is hierarchy-agnostic. This provides
a good base to improve existing image classification model to use this un-
tapped information source.
In addition to this, the learnt joint-embeddings can be used for downstream
tasks such as image captioning, scene understanding and scene graph gen-
eration [26, 17, 43] and visual question-answering (VQA) [2]. Models that
tackle these tasks lie at the intersection of images and natural language con-
cepts. Our work moves in the direction to bridge the two different fields and
treat them in a joint manner. Generally, for such tasks, the classifiers/CNN-
backbones are used for visual feature extraction and object proposals while
the semantics are obtained from a separate module (such as an LSTM [26]
or RNN [17, 43]) that models natural language and semantics. Because our
proposed methods are aware of both visual similarity and semantic similar-
ity (via the label-hierarchy information) this could improve performance by
virtue of modeling visual features and semantics jointly.
7.2 Summary
• We propose 4 different methods to pass label-hierarchy information to a clas-

sifier. The Marginalization model, Masked L-classifiers and the Hierarchical
Softmax all perform better with a higher m-F1 score as compared to the meth-
ods. These models include information not exclusively about the number of
levels in the hierarchy (like the per-level classifier) but also provide addi-
tional information information to the model (in different forms) about how
labels are related to each other. Information about connections between la-
bels is made available by (1) performing marginalization over child probabil-
ities to obtain a probability distribution for upper levels in the hierarchy, (2)
masking infeasible labels based on upper-level predictions to narrow down
possible labels, and, (3) predict conditional distributions over the hierarchy
and compute joint probabilities using the chain rule for probabilities.
• Order-preserving embeddings which have shown great promise for natural
language related tasks are applied in the context of computer vision in this
work. We break away from traditional softmax based classifiers and propose
to use embeddings to perform image classification. These embeddings-based
models outperform the hierarchy-agnostic classifier which has even been
explicitly trained to perform multi-label image classification.
This furthers the case that irrespective of the model paradigm (a classifier
or an embedding) label-hierarchy information boosts performance on the
70
7.2. Summary
image classification task. This shows promise for other tasks that encom-
pass computer vision alone or computer vision in conjunction with natural
language, where exploiting label-hierarchy could benefit the models.
• During the experiments with embeddings we realized that optimization is a
relatively tricky process and sometimes methods are delicate and sensitive
to the choice of initialization, optimizers and hyper-parameters.
• We provide extensive experiments, empirical analysis, visualizations for the
proposed methods and observe that the proposed label-hierarchy injection
methods as well as the order-preserving joint-embeddings outperform the
hierarchy-agnostic image classifier baseline on the introduced ETHEC dataset
[11] with 47,978 images and 723 labels spread across 4 hierarchical levels.
The implementation for all the methods will be made publicly available.
71
Bibliography
[1] Wouter Addink, Dimitris Koureas, and Ana Casino. Dissco: The physical and
data infrastructure for europe’s natural science collections. In EGU General
Assembly Conference Abstracts, volume 20, page 16356, 2018.
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv
Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering.
In Proceedings of the IEEE international conference on computer vision, pages 2425–
2433, 2015.
[3] Björn Barz and Joachim Denzler. Hierarchy-based image embeddings for
semantic image retrieval. arXiv preprint arXiv:1809.09924, 2018.
[4] Léon Bottou. Large-scale machine learning with stochastic gradient descent.
In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
[5] Tianshui Chen, Liang Lin, Riquan Chen, Yang Wu, and Xiaonan Luo.
Knowledge-embedded representation learning for fine-grained image recog-
nition. arXiv preprint arXiv:1807.00505, 2018.
[6] Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. Recurrent at-
tentional reinforcement learning for multi-label image recognition. In Thirty-
Second AAAI Conference on Artificial Intelligence, 2018.
[7] Tianshui Chen, Wenxi Wu, Yuefang Gao, Le Dong, Xiaonan Luo, and Liang
Lin. Fine-grained representation learning and recognition by exploiting hier-
archical semantic embedding. arXiv preprint arXiv:1808.04505, 2018.
[8] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large
scale fine-grained categorization and domain-specific transfer learning. In
Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4109–4118, 2018.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet:
A large-scale hierarchical image database. In 2009 IEEE conference on computer
vision and pattern recognition, pages 248–255. Ieee, 2009.
73
Bibliography
[10] Jia Deng, Jonathan Krause, Alexander C Berg, and Li Fei-Fei. Hedging your
bets: Optimizing accuracy-specificity trade-offs in large scale visual recogni-
tion. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages
3450–3457. IEEE, 2012.
[11] Ankit Dhall. Eth entomological collection (ethec) dataset

[palearctic macrolepidoptera, spring 2019] https://www.research-
collection.ethz.ch/handle/20.500.11850/365379, 2019.
[12] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++:
Improving visual-semantic embeddings with hard negatives. arXiv preprint
arXiv:1707.05612, 2017.
[13] Holger Frick, Pia Stieger, and Christoph Scheidegger. Swisscollnet–a national
initiative for natural history collections in switzerland. Biodiversity Information
Science and Standards, 2019.
[14] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas
Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances
in neural information processing systems, pages 2121–2129, 2013.
[15] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural
networks. In Advances in neural information processing systems, pages 5345–
5355, 2018.
[16] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyper-

bolic entailment cones for learning hierarchical embeddings. arXiv preprint
arXiv:1804.01882, 2018.
[17] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang
Ling. Scene graph generation with external knowledge and image recon-
struction. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1969–1978, 2019.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 770–778, 2016.
[19] Tao Hu and Honggang Qi. See better before looking closer: Weakly super-
vised data augmentation network for fine-grained visual classification. arXiv
preprint arXiv:1901.09891, 2019.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.
[21] Tasho Kjosev. Deep learning for generating template pictorial and textual
representations. Thesis, 2018.
[22] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on
cifar-10. Unpublished manuscript, 40(7), 2010.
74
Bibliography
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifica-
tion with deep convolutional neural networks. In Advances in neural informa-
tion processing systems, pages 1097–1105, 2012.
[24] Suren Kumar and Rui Zheng. Hierarchical category detector for clothing
recognition from visual data. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2306–2312, 2017.
[25] Matt Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximil-
ian Nickel. Inferring concept hierarchies from text corpora via hyperbolic
embeddings. arXiv preprint arXiv:1902.00913, 2019.
[26] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene
graph generation from objects, phrases and region captions. In Proceedings of
the IEEE International Conference on Computer Vision, pages 1261–1270, 2017.
[27] Lingbo Liu, Hongjun Wang, Guanbin Li, Wanli Ouyang, and Liang Lin.
Crowd counting using deep recurrent spatial-aware network. arXiv preprint
arXiv:1807.00601, 2018.
[28] Xiao Liu, Jiang Wang, Shilei Wen, Errui Ding, and Yuanqing Lin. Localizing
by describing: Attribute-guided attention localization for fine-grained recog-
nition. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean.
Distributed representations of words and phrases and their compositionality.
In Advances in neural information processing systems, pages 3111–3119, 2013.
[30] George A Miller. Wordnet: a lexical database for english. Communications of

the ACM, 38(11):39–41, 1995.
[31] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network
language model. In Aistats, volume 5, pages 246–252. Citeseer, 2005.
[32] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hi-
erarchical representations. In Advances in neural information processing systems,
pages 6338–6347, 2017.
[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam
Lerer. Automatic differentiation in pytorch. Software Library, 2017.
[34] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[35] Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano.
Resampling or reweighting: A comparison of boosting implementations. In
2008 20th IEEE International Conference on Tools with Artificial Intelligence, vol-
ume 1, pages 445–451. IEEE, 2008.
75
Bibliography
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] Nitish Srivastava and Ruslan R Salakhutdinov. Discriminative transfer learn-

ing with tree-based priors. In Advances in neural information processing systems,
pages 2094–2102, 2013.
[38] Ryota Suzuki, Ryusuke Takahama, and Shun Onoda. Hyperbolic disk embed-
dings for directed acyclic graphs. arXiv preprint arXiv:1902.04335, 2019.
[39] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-
embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.
[40] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. Multi-
label image recognition by recurrently discovering attentional regions. In
Proceedings of the IEEE international conference on computer vision, pages 464–
472, 2017.
[41] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Per-

ona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California
Institute of Technology, 2010.
[42] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel im-
age dataset for benchmarking machine learning algorithms. Unpublished
manuscript/dataset, 2017.
[43] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph gen-
eration by iterative message passing. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 5410–5419, 2017.
[44] Yiming Yang. A study of thresholding strategies for text categorization. In

Proceedings of the 24th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 137–145. ACM, 2001.
76

Learning Representatio For Images Heirarchical Labels

Uploaded by

Copyright:

Available Formats

Learning Representatio For Images Heirarchical Labels

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Learning Representatio For Images Heirarchical Labels

Uploaded by

Copyright:

Available Formats

arXiv:2004.00909v2 [cs.

LG] 11 Apr 2020

Learning Representations For Images

Advisors: Anastasia Makarova, Dr. Octavian-Eugen Ganea, Dario Pavllo

2 Problem Statement & Background 11

3 Methods: Injecting label-hierarchy into CNN classifiers 23

4 Empirical Analysis: Injecting label-hierarchy into CNN classifiers 31

5 Methods: Order-preserving embedding-based models 51

6 Empirical Analysis: Order-preserving embedding-based models 59

1.1.1 Leveraging label-label interactions

convolutional filters to extract image features. These convolutional layers are

1.1.2 Long-tailed data distributions

1.1.3 Visual similarity does not imply semantic similarity

(a) orange (b) clock (c) clock (d) clock

1.1.4 Uncovering the black-box model

1.2 Predicting Taxonomy for Scientific Collections

In this work, we particularly focus on the entomology of insects, more specifically

1.2.1 ETHEC dataset: a new entomological image dataset with label-

With these peculiarities the dataset is representative of real-world scenarios and

1.3.1 Injecting label-hierarchy information to improve CNN classifiers

The work proposes, in addition to a hierarchy-agnostic baseline, 4 different meth-

1.3.2 Performing image classification by jointly embedding labels and

1.3.3 Contributions Summary

• We show how order-preserving embedding models, which are generally

Problem Statement & Background

2.1 Related Work

Order-embeddings. In [39], the authors tackle embedding of a semantic hierar-

Hyperbolic cones. In addition to the Euclidean cones, [16] takes advantage of

Disk embeddings. [38] proposes a generalization of order-embeddings [39] and

2.1.2 Embedding based models for images

2.1.3 Convolutional Neural Networks based models

In tasks relating to fine-grained image classification, it is common to have class

performing pre-training. In a similar direction, [37] use hierarchy information to

E( x, y) = ||max(0, x − y)|| (2.2)

L = ∑ E( f (u), f (v)) + ∑ max(0, α − E( f (u0 ), f (v0 ))) (2.3)

2.2.2 Euclidean Cones

||y||2 − || x ||2 − || x − y||2

2.2.3 Hyperbolic Cones

and the Poincaré norm is defined as [16]:

|| x ||D = dD (0, x ) = 2 arctanh(|| x ||) (2.8)

h x, yi(1 + || x ||2 ) − || x ||2 (1 + ||y||2 )

2.2.4 Optimization in Hyperbolic Space

u ← expu (η ∇uR L ) (2.13)

where, ∇uR L is the Riemannian gradient for parameter u and is computed by re

∇uR L = (1/λu )2 ∇u L (2.14)

λu is different for each parameter and is computed as λu = 2/(1 − ||u||2 ) [16].

λ x (cosh(λ x ||v||) + h x, v̂i sinh(λ x ||v||))

2.3.2 Hierarchical Fashion MNIST

2.3.3 ETH Entomological Collection

level. We visualize a subset of the complete hierarchy of the ETHEC dataset in

ETHEC Merged (ETHEC dataset)

ETHEC Small dataset

Level Label name

Table 2.1: ETHEC Small dataset, a subset of the ETHEC dataset.

Methods: Injecting label-hierarchy into

In this chapter we propose CNN-based models that use a convolutional layers to

3.1 Hierarchy-agnostic classifier

As a baseline, we use a state-of-the-art convolutional neural network (CNN) image

where, x ∈ R Ntotal , y ∈ {0, 1} Ntotal and y T y = L.