Learning Representations for Images with Hierarchical Labels
Master’s Thesis
Ankit Dhall
2019
Abstract

Image classification has been studied extensively, but there has been limited work on using non-conventional, external guidance other than traditional image-label pairs to train such models. In this thesis we present a set of methods to leverage information about the semantic hierarchy induced by class labels. In the first part of the thesis, we inject label-hierarchy knowledge into an arbitrary classifier and empirically show that the availability of such external semantic information, in conjunction with the visual semantics from images, boosts overall performance. Taking a step further in this direction, we model the label-label and label-image interactions more explicitly using order-preserving embedding-based models, prevalent in natural language processing, and tailor them to the domain of computer vision to perform image classification. Although contrasting in nature, both the CNN classifiers injected with hierarchical information and the embedding-based models outperform a hierarchy-agnostic model on the newly presented, real-world ETH Entomological Collection image dataset [11].
Acknowledgements
I would like to thank Prof. Dr. Andreas Krause and Anastasia Makarova for
believing in me and granting me the opportunity to work on this thesis in
collaboration with the Institute of Machine Learning at ETH Zurich. I am
grateful to Dr. Octavian-Eugen Ganea and Dario Pavllo for coming on-board
the project and sharing their ideas and insight. It was great collaborating and
brainstorming with all my supervisors who made this an extremely enriching
experience.
I would like to extend my gratitude to Dr. Michael Greeff from the ETH Entomologi-
cal Collection for allowing access to their collection and Maximiliane Okonnek
from the ETH Library Lab for ensuring that this project has a meaningful and
significant impact for the scientific community long after its completion.
I would like to thank my friends for their support. I am grateful to my family
for their understanding, support and the unconditional freedom to pursue
my goals.
Contents

1 Introduction
    1.1 Motivation
        1.1.1 Leveraging label-label interactions
        1.1.2 Long-tailed data distributions
        1.1.3 Visual similarity does not imply semantic similarity
        1.1.4 Uncovering the black-box model
    1.2 Predicting Taxonomy for Scientific Collections
        1.2.1 ETHEC dataset: a new entomological image dataset with label-hierarchy
    1.3 Contributions
        1.3.1 Injecting label-hierarchy information to improve CNN classifiers
        1.3.2 Performing image classification by jointly embedding labels and images
        1.3.3 Contributions Summary
    1.4 Outline
    2.4.1 AlexNet
    2.4.2 VGG
    2.4.3 ResNet
7 Conclusions
    7.1 Future work
    7.2 Summary
Bibliography
Chapter 1
Introduction
1.1 Motivation
In machine learning, the task of classification is traditionally performed with a softmax over class scores, returning the highest-scoring label as the prediction. Such an approach implicitly assumes that categories are uncorrelated. Contrary to this assumption, in many commonly used datasets labels are correlated and can be agglomerated to create more abstract concepts, each made up of a collection of relatively specific concepts. For instance, jeans, t-shirt, rain-jacket and ball-gown are all articles of clothing. Only a handful of previous works have used hierarchical information in the context of computer vision. Among them, [34] uses the label-hierarchy from WordNet [30] to consolidate data across various datasets. [10] show how to optimize the trade-off between accuracy and fine-grainedness of the predicted class, but their proposed method only considers the label-hierarchy (=semantic similarity) and therefore disregards the visual similarity when performing this optimization.
Even though a classifier might not be able to distinguish between two breeds
of dogs, it can still predict a more abstract yet correct label, dog. Predicting la-
bels at different levels of abstractions can help catch errors when predicting more
fine-grained labels and hence provide more meaningful predictions. Labels with
varying levels of abstraction may also be beneficial for further downstream tasks
that involve both natural language and computer vision such as image captioning,
scene graph generation and visual-question answering (VQA). This work tries to
exploit semantic information available in the form of hierarchical labels. We show
that visual models when provided such guidance outperform a hierarchy-agnostic
model. We also show how these models can be made more interpretable by using
more explicit representation models, such as embeddings, for the task of image
classification.
visual dissimilarity. Sometimes it might even be the case that the intra-class variance of visual features for a single label is larger than the inter-class variance. In such scenarios, learned representations for two instances with different visual appearance would be coerced away from each other, indirectly affecting the image
understanding capability of the model. In fig. 1.1 one can notice how semantic sim-
ilarity and visual similarity are different concepts but are both essential to achieve
better visual understanding.
Figure 1.1: Although an orange and a basketball-themed clock have visual simi-
larity, they are semantically unrelated. On the other hand, the digital, analog and
basketball-themed clock are all visually distinct from each other but semantically
similar as all of them are instances of clock. By introducing auxiliary information
in the form of the label hierarchy, such confusion in models that pay attention only to visual features can be avoided. Image credits: Wikimedia, Lucky retail, Amazon,
Pixabay
Similarly, using the label hierarchy to guide the classification models we are able to
bridge the gap in the way machines and humans deal with visual understanding.
Incorporating such auxiliary information positively affects the explainability and
interpretability of image understanding models.
Figure 1.2: The stark resemblance between an Alaskan Malamute and a Siberian
Husky would make the life of an image classification model tough as it relies
solely on visual features. Image credits: Karin Newstrom, Animal Photography; Sally Anne
Thompson, Animal Photography
specimens, the ETH Zürich Entomological Collection is one of the largest insect
collections in Central Europe.
The collection needs to sort these specimens according to their taxonomies. The process involves hiring external specialists, each of whom focuses on particular families of these organisms. Sorting is not only expensive but also constrained by the number of available specialists. If this resource-intensive task could be preceded by a pre-sorting procedure, where specimens are categorized by their family, sub-family, genus and species, in that order, the complete process would become more economical. With the help of data and machine learning, such a repetitive task can be carried out by non-specialists, cutting costs considerably: in Switzerland, for example, from 120 CHF per hour to 28 CHF per hour.
Annually, 40,000 specimens are donated to the ETHEC by the public. If this technology is accessible to the general public, the collection will already receive pre-sorted specimens, making its task simpler. A 100-million-euro initiative beginning in 2019 will develop standards to integrate digitization across European institutions (DiSSCo [1]); in Switzerland a similar initiative is underway (SwissCollNet [13]).
Figure 1.3: The diagram shows the image distribution across the labels from the
4 levels of the hierarchy: 6 family, 21 sub-family, 135 genus and 550 species.
The x-axis represents the number of images for a particular label and the ticks on
the y-axis represent each label. For clarity, we have omitted the labels for genus
and species.
Figure 1.4: Sample images and their 4-level labels from the ETHEC dataset. The
dataset consists of 47,978 butterfly specimens with 723 labels spread across 4 levels
of hierarchical labels: 6 family, 21 sub-family, 135 genus and 550 species.
The images are taken from the digitized collection at the ETH Entomological Col-
lection. We pre-process them to remove any visual signals (barcodes, text labels,
markings) that might leak label information about the specimen to a visual model.
We also crop the images so that the specimen lies at the center, and resize them to
448 × 448. We also provide metadata and labels for each of the 4 levels in the
label-hierarchy. The dataset is split into train, val and test as 80-10-10. For
labels with fewer than 10 images, we split the images equally between the three
sets.
The dataset has been made publicly available and can be found at the open-access
link: https://www.research-collection.ethz.ch/handle/20.500.11850/365379.
1.3 Contributions
The proposed methods are agnostic to the kind of features used, or in general to the feature extractor, and can be easily extended to any classifier whose labels are arranged in a hierarchy. Since the work tackles image classification, we use well-known convolutional neural networks (CNNs) [18, 23, 36] as visual feature extractors in our experiments. Although there are works that propose modifications [19] directly to the CNN architecture, we refrain from doing so, such that these methods remain model-agnostic and can be used with any general classifier.
Euclidean models
The field of natural language processing often models concepts as hierarchical structures and learns embeddings from unstructured text. Recent works [39, 16] model such hierarchies as DAGs and suggest embedding them so as to preserve their asymmetric entailment relations. This information is usually lost if symmetric distance functions are used. Order-embeddings [39] propose an asymmetric distance function that arranges the embedded concepts in an order-preserving manner. A more recent approach, entailment cones [16], uses a generalized version of order-embeddings that is more space efficient and performs better. In contrast to the above approaches, which have been proposed in the context of natural language, we propose to jointly embed images and their labels and use their interactions to predict labels for unseen images.
Non-Euclidean models
Unlike Euclidean models, non-Euclidean models exploit the non-zero curvature of their geometries. Hyperbolic geometry has negative curvature and can accommodate tree-like structures (such as DAGs) with ease in comparison to Euclidean geometry. In hyperbolic space the volume of a ball grows exponentially with its radius [32], unlike the polynomial growth familiar from Euclidean space. A set of works [32, 16, 38] has proposed to exploit spaces of negative curvature to better embed concepts and create state-of-the-art models for embedding hierarchies. We use a model similar to the hyperbolic entailment cones [16] where, in addition to the labels, we embed the images as well, treating the problem in a joint manner. Generally, embedding models and CNN-based classifiers are hard to compare because of the vastly different use-cases and domains they are applied to.
We use the embedding models as image classifiers and are able to make a fair
performance comparison between different model categories. In addition to the
image classification and joint embedding of labels and images, for the embedding
based models, we also look at the quality of the embedding of the label-hierarchy
itself. We report the performance on the ETH Entomological Collection (ETHEC)
dataset [11].
1.4 Outline
The remainder of this thesis is structured as follows:
• In chapter 1 the motivation behind the methods and the need to exploit
information from hierarchically organized labels is outlined.
• In chapter 2 we review relevant work in directions similar to the one proposed in this manuscript. The chapter provides mathematical background for methods that this work extends for joint label-image embedding for image classification. It also contains information regarding datasets and CNN-based feature extractors (CNN-backbones).
• In chapter 3 we discuss in detail label-hierarchy injection into CNN-based models, how probability distributions over the labels are computed, and finally how predictions are made. We first discuss a baseline that disregards any external information other than the image-label interactions. For the remaining models, with each model more information regarding the hierarchy is made available to the classifier. For clarity, we separate out and compile the empirical analysis for the CNN-based models in chapter 4.
• In chapter 5 we sketch the details of the embedding-based models, both the Euclidean and non-Euclidean variants. The chapter also discusses label embeddings before jointly embedding labels together with images. We present the empirical results of the embedding-based models in chapter 6.
• Concluding remarks and possible directions for future work are discussed
in chapter 7.
Chapter 2
Problem Statement & Background
transitive relations can be captured well without having to rely on physical close-
ness between points. Instead, the embeddings are learned by minimizing a loss
that penalizes order violations. In [39] the authors tackle two tasks: hypernymy
prediction and image-caption retrieval. A hypernym is a pair of concepts where
the first concept is more generic or abstract than the second. For instance, (fruit,
mango) or (emotion, happiness). The hypernymy prediction task has a natural
hierarchy over the concepts; for image-caption retrieval, however, they create a two-level hierarchy where the captions form the more abstract, upper level while the images, being more detailed, form the lower level.
Euclidean cones. One major restriction of the representation, and indirectly of the distance function, proposed in [39] is that each concept occupies a large volume in the embedding space (the coordinates of each embedding own a translated orthant irrespective of the number of descendants they have), which also leads to heavy orthant intersections. This ill-effect is amplified in extremely low dimensions such as $\mathbb{R}^2$. To ameliorate such effects, the authors in [16] propose a generalized version of order-embeddings called entailment cones. These are more flexible: the region owned by a concept is not restricted to be a translated orthant but is a convex cone whose apex lies at the concept's embedding. Any concept that falls within the cone is considered a sub-concept in the context of hypernymy prediction.
The authors use a variant of Stochastic Gradient Descent (SGD) [4] designed for optimizing parameters on a Riemannian manifold, Riemannian SGD, to optimize embeddings in non-Euclidean manifolds. In their work, they propose the non-Euclidean entailment cones living in hyperbolic space as well as their Euclidean variant. They focus on the task of hypernymy prediction on the WordNet hierarchy [30], embedding a directed-acyclic graph using hyperbolic entailment cones and using it to classify whether a pair of concepts is a hypernym pair.
Hyperbolic Neural Networks. In a more recent work [15] the authors propose feed-forward neural networks parameterized in hyperbolic space. This allows downstream tasks to use hyperbolic embeddings for natural language processing (NLP) in a more principled and natural fashion. They derive hyperbolic variants of logistic regression, feed-forward neural networks and recurrent neural networks. These take hyperbolic embeddings as input and are seen to perform on par with or better than their Euclidean counterparts.
Other embedding methods. The work proposed in [3] maps images onto class
embeddings where pairwise dot product is used as a measure of similarity. To em-
bed the class labels they use a deterministic algorithm to compute class centroids
by using hierarchical information from WordNet [30] to guide the embeddings se-
mantically. They conjecture that semantics are complicated and are hard to learn
only from visual cues. The class embeddings are pre-computed using the hierar-
chy. The image embeddings are mapped to the fixed class embeddings using a
CNN with a combination of image classification and embedding loss. Their work
focuses on the image retrieval task. A drawback of such an approach is that the
label embeddings are fixed when training on the image embeddings. The labels
might be embedded properly, yet not arranged in a way that puts visually similar labels together. Fixing them when learning image embeddings prevents the combination of visual and semantic similarity from re-arranging the label embeddings in a better-suited manner.
[25] combines the idea of Hearst patterns and hyperbolic embeddings to infer is-a relationships from text, such as is-a(car, object) or is-a(Paris, city). They propose
to create a graph with the help of Hearst patterns and consequently embed it in
low-dimensional hyperbolic space. They focus on different hypernymy tasks for
text given a pair of concepts (u, v): (1) if u is a hypernym of v, (2) is u more general
than v, and (3) to what degree u is a v.
the labeled images. They use a single unified model with embeddings and transfer knowledge from the text domain to a model for visual recognition. They additionally perform zero-shot classification on classes extended on top of those in the ImageNet dataset [9]. The proposed work uses a combination of an inner product to measure similarity and the hinge loss. With this approach they generalize well to unseen labels and are able to make relevant predictions, even when the model classifies an image incorrectly (compared to the ground truth) for unseen classes from ImageNet 21K.
Chen et al. [7] propose to predict labels for different levels in a hierarchy. Their work is closest to ours in the sense that it tries to predict labels for each level in the hierarchy to which the images belong. They develop a sophisticated CNN architecture with a common feature extractor followed by separate neural networks, each specializing in predicting labels for one level. The fact that they use completely separate networks for each level makes the model prone to over-fitting when the dataset is small, and computationally intensive as well. They present a dataset with a 4-level hierarchy with images of butterflies across 200 species, similar to the ETHEC dataset, and construct hierarchies for the existing Caltech UCSD Birds dataset [41]. They compare performance for the final level in the hierarchy with many baseline methods, but these methods only predict labels for the most fine-grained label category (=the final level in the hierarchy) and not the others.
2.2 Background
2.2.1 Order-embeddings
Typically, a symmetric distance is used to ascertain semantic similarity between concepts in an embedding space. Order-embeddings [39] instead propose to learn a mapping that cares about preserving the order between objects rather than distance, and introduce the problem of partial order completion: from a set of known ordered pairs P and unordered pairs N, the goal is to determine whether an arbitrary, unseen pair is ordered or not.
They propose to use a reversed product order on $\mathbb{R}^N$ due to its desirable properties, defined in eq. (2.1):

$$y \preceq x \iff \bigwedge_{i=1}^{N} \left( y_i \geq x_i \right) \qquad (2.1)$$
The reversed order means that smaller coordinates represent a "higher" or more abstract position in the partial ordering. Instead of imposing this as a hard constraint, they learn an approximate order-embedding that violates the order as little as possible.
where P and N represent positive and negative edges respectively in the dataset X, $\alpha \in \mathbb{R}_+$ is the margin, and f is a function that maps a concept to its embedding. $E(f(u), f(v))$ is the energy that defines the severity of the order violation for a given pair (u, v) and is given by eq. (2.2). According to the energy, $E(x, y) = 0 \iff y \preceq x$; for positive pairs where y is-a x, one would like embeddings such that $E(x, y) = 0$. Here, a is-a b means that a is a sub-concept of b or, equivalently, that b is more abstract than a, i.e. its generalization.
$\Xi(x, y)$ computes the minimum angle between the axis of the cone at x and the vector y. $E(x, y)$ measures the cone violation, i.e. the minimum angle required to rotate the axis of the cone at x to bring y into the cone:

$$E(x, y) = \max(0, \Xi(x, y) - \psi(x)) \qquad (2.6)$$

The distance between two points of the Poincaré ball is given by

$$d_{\mathbb{D}}(x, y) = \operatorname{arccosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right) \qquad (2.7)$$
An angle in hyperbolic space is the angle between the initial tangents of the geodesics. The angle between two tangent vectors $u, v \in T_x\mathbb{D}^n$ is given by

$$\cos(\angle(u, v)) = \frac{\langle u, v \rangle}{\|u\|\,\|v\|} \quad \text{[16]}$$
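As a concrete reference, a minimal PyTorch sketch of these two quantities, the Poincaré distance of eq. (2.7) and the angle between tangent vectors; the clamping for numerical safety is our own addition, not part of [16]:

```python
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Distance between points inside the Poincare ball, eq. (2.7)."""
    sq_norm_x = (x * x).sum(-1)
    sq_norm_y = (y * y).sum(-1)
    sq_dist = ((x - y) ** 2).sum(-1)
    # The arccosh argument is >= 1 for points strictly inside the unit ball.
    arg = 1 + 2 * sq_dist / ((1 - sq_norm_x) * (1 - sq_norm_y))
    return torch.acosh(arg.clamp(min=1 + 1e-7))

def angle_between(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Angle between tangent vectors; the conformal metric of the Poincare
    ball preserves Euclidean angles, so the Euclidean formula applies."""
    cos = (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1))
    return torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
```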
One needs to replace the cosine law and the exponential map to obtain the formulation of the cones in hyperbolic space [16]. Here $\|\cdot\|$ represents the Euclidean norm, $\langle \cdot, \cdot \rangle$ the Euclidean scalar product, and $\hat{x} = x/\|x\|$ the unit vector. The aperture of the cone is given by $\psi(x)$:

$$\psi(x) = \arcsin\left(K\,\frac{1 - \|x\|^2}{\|x\|}\right) \qquad (2.9)$$
As before, $\Xi(x, y)$ computes the minimum angle between the axis of the cone at x and the vector y (now measured in hyperbolic space), and $E(x, y)$ measures the cone violation, i.e. the minimum angle required to rotate the axis of the cone at x to bring y into the cone:

$$E(x, y) = \max(0, \Xi(x, y) - \psi(x)) \qquad (2.11)$$
In Euclidean space, parameters are updated with the standard SGD rule

$$u \leftarrow u - \eta \nabla_u \mathcal{L} \qquad (2.12)$$

Instead, for parameters living in hyperbolic space, one should compute the Riemannian gradient, take the gradient direction in the tangent space, and move u along the corresponding geodesic in hyperbolic space with the following update rule [16] (Riemannian Stochastic Gradient Descent):
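The closed-form update referenced here (eqs. (2.13)-(2.15)) is lost to a page break in this excerpt, so the sketch below follows the standard formulation for the Poincaré ball with curvature -1 [16, 32]: the Möbius addition and the conformal rescaling of the Euclidean gradient are the textbook forms, and the final projection back into the ball is a numerical-safety step of our own.

```python
import torch

def mobius_add(x, y, eps=1e-7):
    # Mobius addition on the Poincare ball (curvature -1).
    xy = (x * y).sum(-1, keepdim=True)
    xx = (x * x).sum(-1, keepdim=True)
    yy = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * xy + yy) * x + (1 - xx) * y
    return num / (1 + 2 * xy + xx * yy).clamp(min=eps)

def rsgd_step(u, euclid_grad, lr=0.01, eps=1e-5):
    # Riemannian gradient: rescale the Euclidean gradient by the inverse of
    # the conformal metric factor, i.e. by (1 - ||u||^2)^2 / 4.
    sq_norm = (u * u).sum(-1, keepdim=True)
    riem_grad = ((1 - sq_norm) ** 2 / 4) * euclid_grad
    # Move along the geodesic via the exponential map exp_u(-lr * grad).
    v = -lr * riem_grad
    v_norm = v.norm(dim=-1, keepdim=True).clamp(min=eps)
    lam = 2 / (1 - sq_norm).clamp(min=eps)          # conformal factor at u
    new_u = mobius_add(u, torch.tanh(lam * v_norm / 2) * v / v_norm)
    # Numerical safety: keep the point strictly inside the unit ball.
    n = new_u.norm(dim=-1, keepdim=True)
    return torch.where(n >= 1, new_u / n * (1 - eps), new_u)
```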
2.3 Datasets
2.3.1 Hierarchical CIFAR-10
We also perform experiments with the CIFAR-10 dataset [22]. There are 10 classes with 6,000 32×32 images per class; in total, the dataset has 50,000 images for training and 10,000 for testing. To be consistent with our other experiments, we use an 80%-10%-10% split for training, validation and testing respectively. All fine-tuning is performed on the validation set; the test set is only used to report the model's performance.
The original dataset does not have an associated label hierarchy; each image has a single ground-truth label. Additional labels are added to introduce a 3-level hierarchy, so that each image is now associated with 3 labels. The original labels are the leaves of this hierarchy. The root of the hierarchy is entity; the first level splits into (living, non-living). living entities are divided between (mammal, non-mammal), and non-living entities are divided between (vehicle, craft). The original classes are {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck}. mammal is a parent of (cat, deer, dog, horse) and non-mammal of (bird, frog); vehicle is a parent of (automobile, truck) while craft is a parent of (airplane, ship).
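Since the hierarchy is fully specified above, it can be written down directly; a minimal sketch in Python:

```python
# 3-level label hierarchy imposed on CIFAR-10; the original 10 classes are
# the leaves, and "entity" is the implicit root above the first level.
hierarchy = {
    "living": {
        "mammal": ["cat", "deer", "dog", "horse"],
        "non-mammal": ["bird", "frog"],
    },
    "non-living": {
        "vehicle": ["automobile", "truck"],
        "craft": ["airplane", "ship"],
    },
}

def labels_for(leaf):
    """Derive the 3 labels for a leaf class, e.g. "dog" -> ("living", "mammal", "dog")."""
    for level1, children in hierarchy.items():
        for level2, leaves in children.items():
            if leaf in leaves:
                return (level1, level2, leaf)
    raise KeyError(leaf)
```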
Figure 2.1: Hierarchy of labels from the ETHEC (Merged) dataset. It consists of
labels arranged across 4 levels: family (blue), sub-family (aqua), genus (brown)
and species. This visualisation depicts the first 3 levels. The name of the family is
displayed next to its sub-tree.
There exists metadata for 197,052 specimen samples, all of which have labels spread across various hierarchical levels. For our experiments we make use of 4 such levels: 25 unique families, 91 unique subfamilies, 842 unique genera and 2,429 unique specific-epithet labels. The average branching factors are 25, 3.64, 9.25 and 2.88 for the respective levels. The label hierarchy has 3,537 edges and 3,387 nodes.
In the hierarchy, the family with the most descendants is Noctuidae with 19, the subfamily with the most is Noctuinae with 155, and the genus with the most is Eupithecia with 79. The most specimens belong to Geometridae with 48,635 (family), Noctuinae with 29,555 (subfamily), Zygaena with 17,243 (genus) and filipendulae with 2,456 (specific epithet).
Since this dataset is much larger than the other datasets discussed in this work, to better understand it we visualize the dataset as an interactive graph using JavaScript. The visualization has basic functionality to view the relations between nodes, the number of samples per label and the hierarchy level of a particular label. The nodes have a size proportional to the order of magnitude of the number of samples for that label and are also color-coded based on their hierarchy level.
2.4 CNN-backbones
We use convolutional neural networks to extract visual features from the images
to perform classification. The CNN-based models are optimized using SGD [4]
with a learning rate of 0.01 for 100 epochs and a batch-size of 64 unless specified
otherwise.
2.4.1 AlexNet
AlexNet [23] proposed in 2012 shot to fame after exceptional performance on the
ImageNet [9] challenge. It consists of 8 layers in total: the first 5 being convo-
lutional layers and the remaining 3 being fully-connected layers. The original
architecture outputs logits for 1000 class labels from the ImageNet challenge [9].
2.4.2 VGG
VGGNet [36] comprises 16 convolutional layers and, with 138 million parameters, is much larger than AlexNet [23]. It uses smaller filters (3×3) as opposed to the larger filter sizes in previous CNNs; this reduces the number of parameters needed to achieve the same receptive field and, in addition, incorporates more than one non-linearity.
2.4.3 ResNet
ResNet [18] showed that increasing depth improves network performance. They
introduce skip-connections between groups of layers allowing the model to learn
identity functions thus ensuring that the performance is as good as that of a shal-
lower network. This facilitates better convergence rates than plain networks. The
skip connections or shortcut connections do not increase the number of parame-
ters in comparison to the original network (without skip connections). Even with
the remarkable increase in depth, ResNet-152 (152 layers) has fewer parameters than VGG-16/19 [36]. ResNets (pre-trained on the ImageNet dataset) are a
popular choice as feature extractors for image related tasks.
Chapter 3
Methods: Injecting label-hierarchy into CNN classifiers
Figure 3.1: Model schematic for the hierarchy-agnostic classifier. The model is a
multi-label classifier and does not utilize any information about the presence of
an explicit hierarchy in the labels.
levels in an unrelated manner with only the image being available for the model
to predict a label for each level in the hierarchy. Labels across levels do not hold
any special meaning and are treated indifferently.
The model performs $N_{total}$-way classification, where $N_{total} = \sum_{i=1}^{L} N_i$ represents the labels across all L levels and $N_i$ is the number of distinct labels on the i-th level. It uses the one-versus-rest strategy for each of the $N_{total}$ labels:

$$\mathcal{L}(x, y) = -\frac{1}{N_{total}} \sum_{j=1}^{N_{total}} \left[ y_j \log\!\left(\frac{1}{1 + \exp(-x_j)}\right) + (1 - y_j) \log\!\left(\frac{\exp(-x_j)}{1 + \exp(-x_j)}\right) \right] \qquad (3.1)$$
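Eq. (3.1) is the mean binary cross-entropy over per-label sigmoids, which is exactly what PyTorch's `BCEWithLogitsLoss` computes; a minimal sketch with illustrative (toy) target indices:

```python
import torch
import torch.nn as nn

# Hierarchy-agnostic baseline (eq. 3.1): N_total-way one-vs-rest classification
# with a per-label sigmoid. The level sizes follow the ETHEC levels.
levels = [6, 21, 135, 550]            # family, subfamily, genus, species
n_total = sum(levels)

criterion = nn.BCEWithLogitsLoss()    # mean binary cross-entropy, as in eq. (3.1)

logits = torch.randn(64, n_total)     # CNN head output for a batch of 64 images
targets = torch.zeros(64, n_total)    # multi-hot: one positive label per level
targets[:, 0] = 1.0                   # toy example: family index 0 ...
targets[:, 6 + 2] = 1.0               # ... and subfamily index 2, etc.

loss = criterion(logits, targets)
```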
Figure 3.2: Model schematic for the per-level classifier (=L $N_i$-way classifiers). The model uses information about the label-hierarchy by explicitly predicting a single label per level for a given image.
$$\mathcal{L}(x, \tau) = \sum_{i=1}^{L} \mathcal{L}_i(x_i, \tau_i) \qquad (3.2)$$

$$\mathcal{L}_i(x_i, \tau_i) = -\log\!\left(\frac{\exp(x_i[\tau_i])}{\sum_{j=1}^{N_i} \exp(x_i[j])}\right) = -x_i[\tau_i] + \log \sum_{j=1}^{N_i} \exp(x_i[j]) \qquad (3.3)$$
$\mathcal{F}(\mathcal{I}) = x$, where x are the logits from the last layer of a model $\mathcal{F}$ which takes as input an image $\mathcal{I}$. $x_i$ is the contiguous sub-sequence of the predicted logits x that corresponds to level i, i.e. $x_i = (x[N_{i-1} + 1], x[N_{i-1} + 2], \dots, x[N_{i-1} + N_i])$.

$$\mathcal{L}(x, \tau) = \sum_{i=1}^{L} \mathcal{L}_i(x_i, \tau_i) = -\sum_{i=1}^{L} \log(p_i[\tau_i]) \qquad (3.4)$$
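A minimal sketch of this per-level loss in PyTorch; the level sizes shown are the ETHEC ones, and the slicing layout follows the definition of $x_i$ above:

```python
import torch
import torch.nn.functional as F

# Per-level classifier loss (eqs. 3.2-3.4): slice the flat logit vector into
# one contiguous block per level (the x_i above) and sum the per-level
# softmax cross-entropies. `tau` is a LongTensor of shape (batch, L) holding
# the true label index for each level.
def per_level_loss(logits, tau, levels=(6, 21, 135, 550)):
    loss, offset = 0.0, 0
    for i, n_i in enumerate(levels):
        x_i = logits[:, offset:offset + n_i]           # logits for level i
        loss = loss + F.cross_entropy(x_i, tau[:, i])  # eq. (3.3)
        offset += n_i
    return loss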
3.3 Marginalization (bottom up)
Figure 3.3: Model schematic for the Marginalization method. Instead of predicting
a label per level, the model outputs a probability distribution over the leaves of
the hierarchy. Probability for non-leaf nodes is determined by marginalizing over
the direct descendants. The Marginalization method models how different nodes
are connected among each other in addition to the fact that there are L levels in
the label-hierarchy.
$\mathcal{F}(\mathcal{I}) = x$, where x are the logits from the last layer of a model $\mathcal{F}$ which takes as input an image $\mathcal{I}$. This is the same loss as in eq. (3.3); however, the manner in which each $x_i$ is computed differs. Here the model predicts a probability distribution only over the leaf labels. To obtain the probability of a non-leaf label, the probabilities of its direct children are summed; this marginalization yields the probability of the parent label. This way a valid probability distribution is obtained for each level in the hierarchy.
$$p_i[j] = P(v_i^j \mid \mathcal{I}) = \sum_{c \,\in\, \text{childrenOf}(v_i^j)} P(c \mid \mathcal{I}), \quad \forall i \in \{1, 2, \dots, L-1\} \qquad (3.5)$$

where $v_i^j$ is the j-th vertex (node) in the i-th level.
All but the last level use eq. (3.5) to compute the probabilities for their labels.
$$p_L[j] = P(v_L^j \mid \mathcal{I}) = \frac{\exp(x_j)}{\sum_{k=1}^{N_L} \exp(x_k)} \qquad (3.6)$$
For the final level, we compute the probability distribution over the leaf nodes directly from the logits output by the model $\mathcal{F}$, using the softmax indicated in eq. (3.6). Once $p_L$ is determined, $p_{L-1}$ can be calculated. For this reason we compute the probabilities for the complete hierarchy in a bottom-up fashion: starting from the bottom-most level and moving to the upper levels.
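A minimal sketch of this bottom-up computation; the `levels_nodes` and `children` structures are hypothetical stand-ins for the label-hierarchy bookkeeping:

```python
import torch

# Bottom-up probability computation (eqs. 3.5-3.6). `levels_nodes` is a list
# of node-id lists, one per level (top to bottom, leaves last); `children`
# maps an internal node id to its direct descendants.
def level_probabilities(leaf_logits, levels_nodes, children):
    p_leaf = torch.softmax(leaf_logits, dim=-1)              # eq. (3.6)
    probs = {leaf: p_leaf[:, j] for j, leaf in enumerate(levels_nodes[-1])}
    for level in reversed(levels_nodes[:-1]):                # bottom-up sweep
        for node in level:
            # eq. (3.5): a parent's probability is the sum over its children.
            probs[node] = sum(probs[c] for c in children[node])
    return probs
```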
3.4 Masked Per-level classifier
On the upper levels of the hierarchy one has more data per label and fewer labels to choose from. Naturally, this makes classification relatively more accurate closer to the root of the hierarchy. This model exploits knowledge about the parent-child relationship between nodes in a top-down manner.
Figure 3.4: Model schematic for the Masked Per-level classifier. The model is
trained exactly like the L Ni -way classifier. While predicting, one assumes the
model performs better for upper levels than lower levels. Keeping this in mind,
when predicting a label for a lower level, the model’s prediction for the level above
is used to mask all infeasible descendant nodes, assuming the model predicts cor-
rectly for the level above. This results in competition only among the descendants
of the predicted label in the level above.
Unlike Marginalization (bottom up), here we have L classifiers, one for each hierarchical level. For the first level, the model predicts the class with the highest score among the logits. For each subsequent level $l_i$, the model's belief, i.e. its prediction for level $l_{i-1}$, is leveraged. Instead of naively predicting the label with the highest score for level $l_i$ (comparing among all possible logits), all nodes except the children of the predicted label for level $l_{i-1}$ are masked. This translates to computing the loss over a subset of the original nodes in level $l_i$. With the availability of the parent-child relationships, and assuming that the model predicts the parent label (on level $l_{i-1}$) correctly, the only possible labels are the children of this predicted parent. As mentioned earlier, classification in the upper levels is more accurate, and since we proceed in a top-down fashion, this is a reasonable assumption; another work has shown this to be the case [21]. For the last $L-1$ levels, only a subset of the logits (formed by the children of the predicted parent) are compared against each other, ignoring the rest.
While training, the loss is computed over the children of the parent conforming to the ground truth. Even if the model predicts the parent incorrectly, we still use the ground truth to penalize its prediction for the children. For data with unknown ground truth, i.e. during evaluation, the model uses the prediction for level $l_{i-1}$ to infer labels for level $l_i$ by masking all nodes that correspond to infeasible labels.
$$\mathcal{L}(x, \tau) = \sum_{i=1}^{L} \mathcal{L}_i(x_i, \tau_i) \qquad (3.7)$$

$$\mathcal{L}_i(x_i, \tau_i) = -\log\!\left(\frac{\exp(x_i[\tau_i])}{\sum_{j \in C} \exp(x_i[j])}\right) = -x_i[\tau_i] + \log \sum_{j \in C} \exp(x_i[j]) \qquad (3.8)$$

where C is the set of children of the relevant parent node.
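A minimal sketch of the masking step at prediction time; the `children_of` lookup is a hypothetical stand-in for the parent-child bookkeeping described above:

```python
import torch

# Top-down masked prediction: at level i, only the children of the label
# predicted at level i-1 compete (eq. 3.8 restricted to C). Here,
# children_of[i][parent_label] -> list of feasible logit indices at level i.
def masked_predict(logits_per_level, children_of):
    preds = [logits_per_level[0].argmax(dim=-1)]      # top level: plain argmax
    for i in range(1, len(logits_per_level)):
        x_i = logits_per_level[i]
        mask = torch.full_like(x_i, float("-inf"))
        for b, parent in enumerate(preds[-1].tolist()):
            mask[b, children_of[i][parent]] = 0.0     # unmask children of the parent
        preds.append((x_i + mask).argmax(dim=-1))     # compete within C only
    return preds
```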
3.5 Hierarchical Softmax

In the context of natural language processing, something similar has been discussed in previous work [31, 29], but there the main goal is to reduce the computational complexity over very large vocabularies. In the context of computer vision this is relatively unexplored, and we propose to decompose the probability distribution and predict conditional distributions for each set of direct descendants in the hierarchy, in order to exploit the label-hierarchy and boost performance.
$$p(v_i^{j_i} \mid v_{i-1}^{j_{i-1}}) = \frac{\exp\!\left(x_{v_{i-1}^{j_{i-1}}}[j_i]\right)}{\sum_{k \in C} \exp\!\left(x_{v_{i-1}^{j_{i-1}}}[k]\right)}, \quad \forall v_i^{j_i} \in C, \; x_{v_{i-1}^{j_{i-1}}} \in \mathbb{R}^{|C|} \qquad (3.9)$$

The vector $x_{v_{i-1}^{j_{i-1}}}$ represents the logits that exclusively correspond to the children C of node $v_{i-1}^{j_{i-1}}$. With this in place, for the set of children of a given node, a conditional probability distribution is output by the model $\mathcal{F}$.
$\mathcal{F}(\mathcal{I}) = p(\cdot)$, where $p(\cdot)$ is the conditional probability for every child node given its parent, $p(v_i^{j_i} \mid v_{i-1}^{j_{i-1}})$; $\mathcal{F}$ takes as input an image $\mathcal{I}$. In order to calculate the joint distribution over the leaves, the probabilities along the path from the root to each leaf are multiplied:

$$p(v_1^{j_1}, v_2^{j_2}, \dots, v_{L-1}^{j_{L-1}}, v_L^{j_L}) = p(v_1^{j_1})\, p(v_2^{j_2} \mid v_1^{j_1}) \cdots p(v_L^{j_L} \mid v_{L-1}^{j_{L-1}}) \qquad (3.10)$$
where $v_i^{j_i}$ is the parent node of $v_{i+1}^{j_{i+1}}$, the two nodes belonging to the i-th and the (i+1)-st level respectively.
The cross-entropy loss is computed directly only over the leaves, but since the distribution over the leaves implicitly uses the internal nodes in its calculation, all levels are optimized indirectly and the performance gradually improves across all levels.
$$\mathcal{L}(x, \tau) = -\log p(v_1^{j_1}, v_2^{j_2}, \dots, v_{L-1}^{j_{L-1}}, v_L^{\tau_L}) \qquad (3.11)$$

where $\tau_i$ is the true label for the i-th level, $x_i \in \mathbb{R}^{N_i}$ and $\tau \in \mathbb{I}_+^L$. Eq. (3.11) can be re-written as eq. (3.12) because, when $\tau_L$ is known, the path to the root is unique and the remaining $\tau_i, \forall i \in \{1, 2, \dots, L-1\}$, are determined:

$$\mathcal{L}(x, \tau) = -\log p(v_1^{\tau_1}, v_2^{\tau_2}, \dots, v_{L-1}^{\tau_{L-1}}, v_L^{\tau_L}) \qquad (3.12)$$
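Since the path probability of eq. (3.10) factorizes, its negative log decomposes into one softmax cross-entropy per internal node on the ground-truth path; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# Hierarchical Softmax loss (eqs. 3.10-3.12): one softmax cross-entropy per
# internal node on the ground-truth path. `path` is a hypothetical list of
# (node_logits, child_index) pairs for one sample, where node_logits are
# the |C| logits over that node's children.
def hierarchical_softmax_loss(path):
    loss = 0.0
    for node_logits, child_index in path:
        loss = loss - F.log_softmax(node_logits, dim=-1)[child_index]
    return loss
```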
Chapter 4
Empirical Analysis: Injecting label-hierarchy into CNN classifiers
In this chapter, we describe the numerical experiments used to evaluate our methods, which help classifiers exploit label-hierarchies. Before going into the experimental details, we discuss the choice of performance metrics used to compare different models.
Consider a dataset as shown in table 4.1. When using a classifier for each level in the hierarchy, the classifier may blindly predict the majority label to boost its micro score: by always predicting Hesperiidae, Pyrginae and Pyrgus alveus it obtains a micro-averaged precision, recall and F1-score of (0.5, 0.5, 0.5). This type of behavior is undesirable. The macro-averaged scores, however, are (0.1364, 0.2727, 0.1724), which reflect the poor performance of the classifier.
To get better insight into where the model under-performs, micro- and macro-averaged scores are also computed per level in the hierarchy.
True positive rate. The true positive rate (TPR) is the fraction of actual positives predicted correctly by the method:

$$\text{TPR} = \frac{tp}{\text{totalPositives}} \qquad (4.1)$$
Table 4.1: A subset of the ETHEC dataset to demonstrate the pros and cons of
using macro and micro scoring.
True negative rate. The true negative rate (TNR) is the fraction of actual negatives predicted correctly by the method:

$$\text{TNR} = \frac{tn}{\text{totalNegatives}} \qquad (4.2)$$

Precision. Precision computes what fraction of the labels predicted true by the model are actually true:

$$P = \frac{tp}{tp + fp} \qquad (4.3)$$

Recall. Recall computes what fraction of the true labels were predicted as true:

$$R = \frac{tp}{tp + fn} \qquad (4.4)$$

F1-score. The F1-score is the harmonic mean of precision and recall:

$$F1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (4.5)$$
Hit@K.

$$\text{Hit@K} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\text{label}_i^{gt} \in \text{SortedPredictions}(i)\right] \qquad (4.6)$$

where $\text{SortedPredictions}(i) = \{\text{label}_1^{pred}, \text{label}_2^{pred}, \dots, \text{label}_K^{pred}\}$ is the set of the top-K predictions for the i-th data sample.
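A minimal sketch of Hit@K as defined in eq. (4.6):

```python
import torch

# Hit@K (eq. 4.6): the fraction of samples whose ground-truth label appears
# among the K highest-scoring predictions.
def hit_at_k(logits: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    topk = logits.topk(k, dim=-1).indices            # (N, K) top-K label ids
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```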
Table 4.2: Performance metrics for Per-level classifier on the Hierarchical CIFAR-10
data when varying the amount of training data. The models used in this experi-
ment are pre-trained on the 1000-class ImageNet data set. All weights are updated
with a learning rate of 0.01 and input spatial dimensions are 224x224. P, R and
F1 represent Precision, Recall and F1-score. Metrics prefixed with m are micro-
averaged while the ones with M are macro-averaged. The top performing models
are in bold-face.
Table 4.3: Performance metrics for the hierarchy-agnostic classifier on the Hierar-
chical CIFAR-10 data. The models used in this experiment are pre-trained on the
1000-class ImageNet data set. For these experiments, only the last layer is fine-
tuned (unless mentioned otherwise), fixing the rest of the weights with a learning
rate of 0.01 and input spatial dimensions of 224x224 for 100 epochs. Metrics pre-
fixed with m are micro-averaged while the ones with M are macro-averaged. The
top performing models are in bold-face.
Table 4.4: Performance metrics for the hierarchy-agnostic classifier on the Hierar-
chical FMNIST data. The models used in this experiment are pre-trained on the
1000-class ImageNet data set. For these experiments, only the first and the last layers are fine-tuned (unless mentioned otherwise), fixing the rest of the weights
with a learning rate of 0.01 and input spatial dimensions of 224x224. Metrics pre-
fixed with m are micro-averaged while the ones with M are macro-averaged. The
top performing models are in bold-face.
Table 4.5: Performance metrics for the hierarchy-agnostic classifier on the ETHEC
Merged dataset. The models used in this experiment are pre-trained on the 1000-
class ImageNet data set. All weights are updated with a learning rate of 0.01, a
batch-size of 64 and input spatial dimensions are 224x224 for 100 epochs. P, R
and F1 represent Precision, Recall and F1-score; cw and rs represent class weight
and re-sampling. Metrics prefixed with m are micro-averaged while the ones with
M are macro-averaged. The top performing models are in bold-face. Since the
model can predict any number of labels (between 0 and Ntotal ), the table includes
the minimum and the maximum number of labels predicted (min, max) as well as
the number of labels predicted on average µ ± σ. These statistics, like the rest, are
calculated for samples in the test set.
for one such sample and 451.14 ± 136.69 on average for the worst performing
multi-label model in our experiments.
Performance breakdown
From fig. 4.1 one can see the relation between the number of training samples for a particular label and its F1 performance. Five of the 6 families have more than 1,000 training samples, and the model performs well for these classes. The importance of a larger dataset is visible in the performance versus training-data-size plots in fig. 4.1b, fig. 4.1c and fig. 4.1d: there is an "S-shaped", sigmoid-like trend in performance as the amount of training data increases. Labels, across the 4 levels, that have more than 1,000 training samples have very high F1 scores, while labels with fewer than 100 training points are on the lower end of the spectrum.
The micro-F1 performance for the family and subfamily levels is comparable to the best performing multi-level classifier (see table 4.8), but moving down to the lower levels in the hierarchy the performance is much worse than the best performing multi-level classifier, as detailed in section 4.4.2.
Figure 4.1: Per-label F1 performance across levels plotted against the number of training samples for the hierarchy-agnostic classifier using ResNet-50 (cw: ✗, rs: ✗). In each panel, a point represents a label, with its position indicating the number of samples for that label in the train set and the F1-score that label achieves on the test set. To better visualize the distribution of samples in the train set as well as the distribution of performance, we display marginal histograms for both the training data size (on top of each scatter plot) and the F1-score (to the right of each scatter plot).
Table 4.7: Performance metrics for Per-level classifier on the ETHEC Merged
dataset. The models used in this experiment are pre-trained on the 1000-class
ImageNet data set. All weights are updated with a learning rate of 0.01, a batch-
size of 64 and input spatial dimensions are 224x224 for 100 epochs. P, R and F1
represent Precision, Recall and F1-score; cw and rs represent class weight and re-
sampling. Metrics prefixed with m are micro-averaged while the ones with M are
macro-averaged. The top performing models are in bold-face. sqrt in the “rs”
denotes re-sampling using the inverse of the square-root of the class frequency as
weights, in other cases the inverse of the class frequency is used.
It is important to notice the trend between the macro- and micro-averaged scores. In table 4.7 the micro-averaged scores are always higher than their macro-averaged counterparts, a direct result of optimizing the model for the best performance as measured by the micro-F1 score, which is a function of the micro precision and micro recall.
Micro-averaged scores are calculated by computing a "combined" confusion matrix across all labels and then computing the (averaged) score, in this case the precision, recall and F1-score. Here, the notion of "averaging" comes from the fact that the individual, label-wise confusion matrix elements (true positives, false positives, true negatives and false negatives) are combined into a single global confusion matrix. For the macro-averaged scores, the per-label scores, i.e. precision, recall and F1-score, are calculated first, and their mean across all labels then gives the macro-averaged scores.
If the micro scores are higher than the macro scores, this indicates that classes with fewer samples are being classified incorrectly while the more popular classes are being classified correctly, inflating the micro score. Conversely, higher macro scores would indicate that labels with a large number of samples are being misclassified. In our case it is the former: the micro scores dominate the macro scores, implying that the model performs well for classes with more training data. This is intuitive, as the model has more data to train on and sees a variety of samples for the same label, in contrast to classes with only a handful of samples.
Refer to fig. 4.3d for the macro-F1, fig. 4.3c for the micro-F1 score and fig. 4.3b
for the training, validation and testing loss over 100 epochs. The micro-F1 per-
formance for experiments with different combinations of class weights and re-
sampling are compared in fig. 4.3a.
Performance breakdown
In order to see where the model under-performs, we break down the best performing model's score into metrics across the different hierarchical levels. From table 4.8 it is evident that performance is much better when there are fewer classes, since classes higher up the hierarchy agglomerate the descendant nodes, ending up with more data and fewer labels to differentiate between. Fig. 4.2 gives more insight into the data distribution for each level and the corresponding performance measured by the F1 score. The performance of the model deteriorates as one moves to the lower levels in the hierarchy. At the leaves the performance is the worst among the four levels, due to extensive branching in the hierarchy and only a few data samples per leaf label. With the help of inverse-frequency-weighted re-sampling, the ill-effect of data deficiency is mitigated to an extent, with improved performance compared to no re-sampling at all (refer to table 4.7).
Figure 4.2: Per-label F1 performance across 4 hierarchical levels plotted against the number of training samples for the Per-level classifier using ResNet-50 with resampler (cw: ✗, rs: ✓). It is important to note that the population statistics, especially of the lower levels in the hierarchy (fig. 4.2c and fig. 4.2d), are skewed to the higher end as an effect of re-sampling the less frequent classes.
Figure 4.3: (a) Micro-F1 performance over 100 epochs for the ResNet-50 multi-level classifier with different combinations of resampler and class weights (the legend entry x denotes cw: ✗, rs: ✗). (b) Loss evolution over 100 epochs on the train, val and test datasets for the ResNet-50 multi-level classifier with resampler (cw: ✗, rs: ✓). (c) train, val and test micro-F1 performance over 100 epochs for the ResNet-50 multi-level classifier with resampler (cw: ✗, rs: ✓). (d) train, val and test macro-F1 performance over 100 epochs for the ResNet-50 multi-level classifier with resampler (cw: ✗, rs: ✓).
Table 4.8: Performance metrics for the Per-level classifier on the ETHEC Merged dataset when using a resampler, the best performing model in its category.
The models used in this experiment are pre-trained on the 1000-class ImageNet
data set. All weights are updated with a learning rate of 0.01, a batch-size of 64
and input spatial dimensions are 224x224 for 100 epochs. P, R and F1 represent
Precision, Recall and F1-score. Metrics prefixed with m are micro-averaged while
the ones with M are macro-averaged.
4.4.3 Marginalization
This section compiles the general results for image classification using the Marginalization method; ResNet-50 is the best performing model. This model predicts the non-leaf labels in the hierarchy by marginalizing over children labels whose probabilities are explicitly predicted by the model. We also notice that a huge performance boost is obtained when normal color images are used instead of grayscale images: it is not just the patterns but also the colors on the specimens that help distinguish them.
We also show the best performing model’s performance when different loss terms
are used to compute the loss. We train models where we sum up classification
losses across different levels in the hierarchy, and observe that computing losses over more levels yields better performance.
Table 4.11: Performance metrics for Masked Per-level classifier on the ETHEC
Merged dataset. The models used in this experiment are pre-trained on the 1000-
class ImageNet data set. All weights are updated with a learning rate of $10^{-5}$, a
batch-size of 64 and input spatial dimensions are 224x224 for 200 epochs. P, R
and F1 represent Precision, Recall and F1-score. For all models, data re-sampling
proportional to the inverse of the class frequency is performed while training.
Metrics prefixed with m are micro-averaged while the ones with M are macro-
averaged. The top performing models are in bold-face. In addition to the normal
experiments, we also include results from models trained on grayscale images.
This is the second best performing CNN-based model and we also look at the
level-wise performance split in table 4.12.
Table 4.12: Performance metrics for Masked Per-level classifier on the ETHEC
Merged dataset for the best performing model. The models used in this experi-
ment are pre-trained on the 1000-class ImageNet data set. All weights are updated
with a learning rate of $10^{-5}$, a batch-size of 64 and input spatial dimensions are
224x224 for 200 epochs. P, R and F1 represent Precision, Recall and F1-score.
Metrics prefixed with m are micro-averaged while the ones with M are macro-
averaged.
Table 4.14: Performance metrics for Hierarchical Softmax on the ETHEC Merged
dataset. The models used in this experiment are pre-trained on the 1000-class Im-
ageNet data set. All weights are updated with a learning rate of $10^{-5}$, a batch-size
of 64 and input spatial dimensions are 224x224 for 100 epochs. P, R and F1 repre-
sent Precision, Recall and F1-score. For all models, data re-sampling proportional
to the inverse of the class frequency is performed while training. Metrics prefixed
with m are micro-averaged while the ones with M are macro-averaged. The top
performing models are in bold-face.
The Hierarchical Softmax seems to be less prone to over-fitting and has its best performing model with the ResNet-152 backbone. To recall, the model predicts a conditional distribution p(child | parent) for each label in the hierarchy; the joint distribution is calculated by multiplying the probabilities of all labels along a specific path, yielding the probability of the leaf.
Table 4.15: Performance metrics for Hierarchical Softmax on the ETHEC Merged
dataset for the best performing model. The models used in this experiment are
pre-trained on the 1000-class ImageNet data set. All weights are updated with a
learning rate of $10^{-5}$, a batch-size of 64 and input spatial dimensions are 224x224
for 200 epochs. P, R and F1 represent Precision, Recall and F1-score. Metrics
prefixed with m are micro-averaged while the ones with M are macro-averaged.
Figure 4.4: Per-label F1 performance across 4 hierarchical levels plotted against the number of training samples for ResNet-152 with Hierarchical Softmax and resampler (cw: ✗, rs: ✓). It is important to note that the population statistics, especially of the lower levels in the hierarchy (fig. 4.4c and fig. 4.4d), are skewed to the higher end as an effect of re-sampling the less frequent classes.
Table 4.16: Comparing best performing baseline classifiers on the ETHEC Merged
dataset. The models used in this experiment are pre-trained on the 1000-class Im-
ageNet data set. P, R and F1 represent Precision, Recall and F1-score. For details
regarding the specific baselines please refer to the respective sections. Metrics pre-
fixed with m are micro-averaged while the ones with M are macro-averaged. The
top performing models are in bold-face.
Chapter 5
Methods: Order-preserving
embedding-based models
Figure 5.1: The latent space of the modified per-level model is used to extract label embeddings. An additional layer is added after the final layer in the original model. The model is trained exactly like the L $N_i$-way classifier. The weights of the layer labeled $N \times L_i$ hold the N-dimensional representations of the labels. Here, N = 2; however, this can be extended to any N-dimensional embedding space.
For cosine embeddings we extract label representations from the latent space of a CNN trained for image classification. The learned representations are a by-product of the model being explicitly trained only for image classification. It is important to note that cosine embeddings are not necessarily order-preserving, but they are presented in this chapter together with all the other embedding-based models.
We modify the Per-level classifier model by adding a linear layer that projects the final fully-connected layer of the original model to a latent space which is interpreted as the label embedding. The additional layer projects onto the N-dimensional embedding space for every label. In fig. 5.1, the weights of the layer labeled $2 \times L_i$ hold the 2-dimensional representations, one for each label on the i-th level.
When performing image classification, matrix multiplication of the layer weights and the image representation from the preceding layer yields the logits for each label; the weights in the last layer represent the label embeddings. The label logits, which represent the similarity between an image and the labels, are thus computed by the dot product between the image's representation and the representation of each label. The larger the dot product between the representations of an image and a particular label, the larger the corresponding logit, and a larger logit implies a higher likelihood of the image belonging to that label.
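A minimal sketch of this read-out: the rows of the final layer's weight matrix double as label embeddings, and the logits are dot products (the dimensions below are illustrative):

```python
import torch

# Classification via label embeddings: the last layer's weight matrix holds
# one N-dimensional embedding per label, and the logits are dot products
# between the image representation and each label embedding. Here N = 2.
n_dim, n_labels = 2, 21                  # e.g. the subfamily level
label_emb = torch.nn.Linear(n_dim, n_labels, bias=False)

image_repr = torch.randn(64, n_dim)      # projected CNN features for a batch
logits = label_emb(image_repr)           # equivalently image_repr @ W.T

# After training, the rows of W are the 2-D label embeddings plotted in fig. 5.2.
embeddings = label_emb.weight.detach()   # shape (n_labels, n_dim)
```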
5.2 Order-Embeddings
In this part we introduce learning representations for both concepts and images via embeddings. Recent advances show how unconventional loss functions (instead of the widely used vanilla inner product or p-norm distances) can model asymmetric relations between concepts. Directed graphs also model asymmetric relations between two nodes connected by a directed edge.
We treat our label hierarchy as a directed-acyclic graph; more specifically, due to the nature of the taxonomy and how we create the dataset, it is a directed tree. The dataset X consists of entailment relations (u, v) connected via a directed edge from u to v (following the definition in [16]). These directed edges, or hypernym links, convey that v is a sub-concept of u.
We train our model using the max-margin loss proposed in [39]. Unlike [39], we do not restrict the embeddings to have positive coordinates only and drop the absolute-value function used there, granting the embeddings more freedom.
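Eq. (2.2) itself does not survive in this excerpt, but under the reversed order of eq. (2.1) the order-violation energy of [39] takes the form $E(x, y) = \|\max(0, x - y)\|^2$; a minimal sketch of the resulting max-margin objective, without the positivity constraint, as described above:

```python
import torch

# Order-violation energy under the reversed order of eq. (2.1):
# E(x, y) = || max(0, x - y) ||^2, zero exactly when y_i >= x_i for all i.
def energy(x, y):
    return torch.clamp(x - y, min=0).pow(2).sum(-1)

# Max-margin loss of [39]: pull positive (is-a) pairs to zero energy, push
# negative pairs to an energy of at least alpha. The embeddings themselves
# are unconstrained (no absolute value / positivity), as described above.
def max_margin_loss(pos_u, pos_v, neg_u, neg_v, alpha=1.0):
    pos = energy(pos_u, pos_v).sum()
    neg = torch.clamp(alpha - energy(neg_u, neg_v), min=0).sum()
    return pos + neg
```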
Figure 5.2: Visualizing the latent 2D space of the modified per-level model for the ETHEC dataset [11]. Each label's embedding is plotted; the weights of the layer labeled $2 \times L_i$ hold the 2-dimensional representations of the labels. We visualize the relations between nodes by adding edges from the original label hierarchy. Legend: cyan: family, magenta: subfamily, yellow: genus, black: genus+species. For labels from a given level, the model is more confident about those that have a larger norm in the embedding space. For the lower levels (genus, genus+species) the labels form a roughly circular pattern, meaning the model has similar confidence across these labels. The cyan nodes are collapsed towards the center even though they have the most samples per label (as they are the top-most level in the hierarchy); since they capture images with a large intra-label variance, they have a smaller norm than the magenta nodes.
The hyperbolic space can be modeled in five different ways; like previous work
[32, 16], we use the Poincaré ball. The hyperbolic cones, like the rest of the
models, are implemented in PyTorch [33]. We follow the schemes from [15] to
avoid numerical instabilities when learning parameters in hyperbolic space, more
specifically on the Poincaré ball.
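One such safeguard, sketched here in PyTorch (the margin constant is an
assumption in the spirit of [15], not their exact value), re-projects points
that drift to or beyond the ball's boundary:

import torch

EPS = 1e-5  # assumed boundary margin

def project_to_ball(x, eps=EPS):
    # rescale any point at or beyond the unit sphere back inside the Poincare
    # ball, keeping a small margin so gradients of the metric stay finite
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    scale = torch.clamp(norm, max=1.0 - eps) / norm
    return x * scale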
5.5 Embedding Label-Hierarchy
Label embeddings. For our implementation of the hyperbolic cones, the label
embeddings live in the hyperbolic space D^n and are optimized using RSGD as per
eq. (2.14) and eq. (2.13), with the help of the exponential map from eq. (2.15).
RSGD is implemented by modifying the SGD gradients in PyTorch, as it is not part
of the standard library.
Image embeddings. For images, features from the final layer of the backbone of
the best-performing CNN-based model are used (∈ R^2048). To map them to D^n we
apply a linear transform W ∈ R^{2048×n} followed by a projection into D^n via
the exponential map at the origin, exp_0(Wx). This brings the image embeddings
into hyperbolic space while keeping the parameters Euclidean, which allows
optimizing them with well-known schemes such as Adam [20].
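A sketch of this mapping (assuming curvature -1, for which the exponential map
at the origin reduces to exp_0(v) = tanh(||v||) v/||v||; the module name is
illustrative):

import torch
import torch.nn as nn

class ImageToPoincare(nn.Module):
    """Maps frozen 2048-d CNN features into the Poincare ball D^n."""
    def __init__(self, n):
        super().__init__()
        self.W = nn.Linear(2048, n, bias=False)   # Euclidean parameters: Adam-friendly

    def forward(self, feats):
        v = self.W(feats)                         # v = Wx, still Euclidean
        norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
        return torch.tanh(norm) * v / norm        # exponential map at the origin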
Data splitting
We split the data into train, val and test in a manner similar to [16]: first
compute the transitive reduction of the directed acyclic graph. Since our graph
is a tree it is already minimal, so we use the tree edges as the “basic” edges,
from which the transitive closure can be fully recovered. If these edges are
missing from the train set, the information they carry is unrecoverable, so they
are always included in the train set. We then randomly pick edges from the
transitive closure (=1974 edges) minus the “basic” edges (=723 edges) to form a
set of “non-basic” edges (=1257 edges). We use the “non-basic” edges to create
the val (5%) (=62 edges) and test (5%) (=62 edges) splits, and a proportion of
the rest is reserved for training (see Training details).
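A sketch of this split using networkx (the toy edge list is illustrative; real
edges come from the ETHEC hierarchy, and the 5% fractions follow the text):

import random
import networkx as nx

edges = [("A", "B"), ("A", "C"), ("B", "D")]   # toy hierarchy, hypernym -> sub-concept
G = nx.DiGraph(edges)
basic = set(G.edges())                          # tree edges: always kept in train
closure = set(nx.transitive_closure(G).edges())
non_basic = list(closure - basic)               # edges implied by transitivity

random.shuffle(non_basic)
n = int(0.05 * len(non_basic))                  # 5% each for val and test
val, test = non_basic[:n], non_basic[n:2 * n]
remainder = non_basic[2 * n:]                   # a chosen proportion of these joins train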
Training details
We follow the training details from [16]. We augment both the validation and
test sets by generating 5 negative pairs for each positive pair (x, y): of the
type (x′, y) and (x, y′), with the corrupting node chosen randomly such that the
resulting edge is not present in the full transitive closure of the graph. This
yields 10 negatives per positive. For
the training set, negative pairs are generated on-the-fly in the same manner. We
report performance on different training set sizes. We vary the training set to
include 0%, 10%, 25%, 50% of the “non-basic” edges selected randomly. We train
for 500 epochs with a batch size of 10 and a learning rate of 0.01. We run two
sets of experiments: in one we fix α = 1.0 as in [39]; in the other we tune α
based on the F1-score on the val set [16].
pick-per-level strategy
During the experiments we found a better strategy to sample negative edges.
Instead of sampling a negative edge (x′, y) where x′ is any node that makes
(x′, y) a negative edge, we pick each x′ from a different level in the hierarchy.
This serves a dual purpose. First, because the hierarchy is a tree, 78.24% of the
nodes belong to the final level; without the pick-per-level strategy, the
corrupted end of a sampled edge would come from the last level the vast majority
of the time, making training and convergence excruciatingly slow. Second, the
pick-per-level strategy lets us sample hard negative edges whose corrupted node
comes from the same level as the non-corrupted node y, helping the embeddings
disentangle and spread out in space.
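A sketch of such a sampler (the data structures are assumptions: nodes_by_level
maps each hierarchy level to its node ids, closure_edges is the set of true
edges):

import random

def sample_negative(y, nodes_by_level, closure_edges):
    """Corrupt the tail of a positive edge (x, y) with pick-per-level sampling."""
    while True:
        level = random.choice(list(nodes_by_level))        # choose a level uniformly...
        x_neg = random.choice(nodes_by_level[level])       # ...then a node within it
        if x_neg != y and (x_neg, y) not in closure_edges:  # must not be a true edge
            return (x_neg, y)

# toy usage: without pick-per-level, level 4 (~78% of nodes) would dominate
nodes_by_level = {1: ["f1"], 2: ["s1", "s2"], 3: ["g1"], 4: ["sp1", "sp2"]}
print(sample_negative("s1", nodes_by_level, {("f1", "s1")}))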
Optimization details
We use the Adam optimizer [20] for order-embeddings and Euclidean cones. For
hyperbolic cones we use Riemannian SGD [16].
5.6 Jointly Embedding Images with Label-Hierarchy

The image-embedding function f_i maps an image i into the joint space:

    f_i(i) = W · CNN(i) ∈ R^N        (5.2)
where CNN(i) represents the fc7-features from our best-performing CNN model and
W is a matrix. The weights of the CNN are frozen when computing the fc7-features;
only W is learned.
For the labels, f_l(l) is simply a lookup table that stores vectors in R^N.
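A sketch of the two embedding functions (assuming a backbone that returns the
2048-dimensional fc7-features; class and attribute names are illustrative):

import torch
import torch.nn as nn

class JointEmbedder(nn.Module):
    """f_i(i) = W . CNN(i) with a frozen backbone; f_l is a lookup table (eq. 5.2)."""
    def __init__(self, backbone, n_labels, N):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():     # freeze: only W and f_l are learned
            p.requires_grad_(False)
        self.W = nn.Linear(2048, N, bias=False)  # maps fc7-features (2048-d) to R^N
        self.f_l = nn.Embedding(n_labels, N)     # label vectors in R^N

    def f_i(self, images):
        with torch.no_grad():
            feats = self.backbone(images)        # assumed to return fc7-features
        return self.W(feats)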
Data splitting
For these experiments we split the data into train (80%), val (10%) and test
(10%) based on images only, as done for the CNN-based models. Since we now
embed images together with the labels, we create a combined graph G to represent
both: it contains directed edges from each label that “describes” an image to
the image itself, as well as edges between related labels.
Training details
Let G represent the graph to be embedded. All edges in Gtc, the transitive
closure of G, are considered positive edges. To obtain negative edges, Gneg is
constructed by removing the edges of Gtc from a fully-connected di-graph over
the same nodes as G.
While training, for each positive edge (x, y), 10 negatives are sampled, 5 each
by randomly corrupting either side of the positive edge (5 × (x′, y) + 5 × (x, y′)).
We generate negatives by corrupting the edge with nodes from each level in the
hierarchy, including the images (the lowest level), because the images outnumber
the labels and we would like to embed the label-hierarchy in addition to the
images. We use the pick-per-level strategy described in the previous section.
We make sure never to sample a negative edge whose both ends are images. This
ensures that two images are not forced apart unless their labels require it:
the images form the final level of the graph G, so no cone is nested within the
cone formed by an image embedding.
For the validation and test sets, we generate 5 negative pairs each for a
positive pair (x, y): of the type (x′, y) and (x, y′), with a randomly chosen
edge that is not present in the transitive closure Gtc (i.e., an edge of Gneg).
The validation and test sets are generated at the beginning and are fixed during
training. We follow the training details from [16]: negatives for the training
set are generated on-the-fly. The model's performance is measured by its ability
to classify images correctly; since negative edges are not needed to measure
classification performance, they are sampled only during training. For validation
and testing, we measure the model's classification performance on the val and
test set images, respectively.
Optimization details
For jointly embedding labels and images, we empirically found it better to use
the vanilla Adam optimizer [20] instead of Riemannian SGD. The trade-off is that
the label embeddings are parameterized in Euclidean space and mapped to
hyperbolic space via the exponential map at 0 from eq. (2.15). This was observed
to be more stable and to help the joint embeddings converge. Also, with this
implementation of the hyperbolic cones, for both label-only and joint embeddings,
it was not necessary to initialize the embeddings with Poincaré embeddings [32]
as suggested in [16].
We also notice that when training jointly one does not need to initialize the
labels with a separate labels-only embedding: the model still attains decent
image classification performance when the label embeddings are randomly
initialized. However, a performance boost is obtained when they are initialized
with values from embedding the label-hierarchy alone.
Chapter 6

Empirical Analysis: Order-preserving embedding-based models

6.1 Embedding Label-Hierarchy
Table 6.1: Micro-F1 score on the test set for embeddings of the label hierarchy
of the ETHEC Merged dataset. We find the classification threshold that yields
the best val set performance. For these experiments we train for 200 epochs with
α = 1.0 for order-embeddings and α = 0.01 for Euclidean cones, a batch size of
10 and a learning rate of 0.1 with the Adam optimizer. We report the test set
F1-score for the epoch with the best F1-score on the val set. The train set is
composed of all the “basic” edges plus an additional 50% of the “non-basic”
edges for all experiments reported here. We vary the dimensionality of the
embedding space, d = {2, 3, 5, 10, 100}.
In addition to the entailment prediction task between two given concepts [16,
32, 38], we also check the reconstruction of the complete graph, which measures
the ability of the embedding to reproduce the asymmetric relations in the
original label hierarchy.
(c) Euclidean cones L=4, b=3 (d) Euclidean cones L=3, b=7
Figure 6.3: We embed two different toy graphs: one with 4 levels and a branching
factor of 4, and another with 3 levels and a branching factor of 7. The model is
trained for 1000 epochs with Adam (learning rate 0.01). The toy graphs are
embedded using both order-embeddings and Euclidean cones in R^2. We draw an edge
between each pair of nodes connected in the original graph in order to better
visualize the embedding quality; nodes from different levels are colored
differently. The illustrations show the levels and branching factor; the edges
are split into train, val and test; and we report F1-score, precision, recall,
accuracy, and the threshold used to decide whether a pair of nodes has a directed
edge, i.e., whether they are hypernyms.
This reconstruction task consists of positive and negative directed edges among
the original labels: positive edges are present in the label-hierarchy, while
negative edges are non-existent ones. It is important to note that only a handful
of edges are positive, while the vast majority of edges in the fully-connected
digraph form the set of negative, non-existent edges (absent from the original
label-hierarchy).
For the ETHEC dataset, to obtain the full-F1 (measuring the reconstruction of
the label-hierarchy) we classify all 723 positive and 521,289 negative edges.
Due to this very large imbalance between negative and positive edges, we refrain
from using accuracy or the micro/macro F1-score and use the TPR, TNR and full-F1
instead.
In table 6.2 we compare the embedding performance for labels-only embeddings of
the ETHEC dataset. We employ the order-preserving embedding techniques discussed
in previous chapters: order-embeddings [39] and the Euclidean and hyperbolic
variants of [16].
Positive edges constitute only about 0.1% of the total edges in the DAG
representing the label-hierarchy, making them very difficult to predict. This is
also evident empirically, as the TNR is quite high even in extremely low
dimensions: for 2D order-embeddings TNR = 0.9708, despite this being the lowest
among the 2D embeddings. The performance boost achieved by entailment cones, a
generalization of order-embeddings, can be seen in the table, where the
Euclidean variant is always better than order-embeddings. By parameterizing the
cones to live in hyperbolic space and using the corresponding hyperbolic
geometry, hyperbolic cones achieve almost twice the TPR of OE and EC in 100
dimensions. Further increasing the dimensionality to 1000 improves the full-F1
score over the 100-dimensional HC from 0.8060 to 0.8267, exhibiting the
representational capacity of HC.
We also note that for 1000-D EC and OE the models seem to overfit and perform
worse than their 100-D counterparts, whereas HC improves as the embedding
dimensionality increases.
6.1.2 Optimization
Initially the experiments used a batch size of 10, and the models still had room
for improvement at the end of 5000 epochs. With a batch size of 100, however,
the models converged faster and also performed better. In general it was easier
to find hyperparameters for the Euclidean models than for the non-Euclidean
ones. We also noticed better, more stable training when parameterizing the
Euclidean cones in cosine space rather than in angle space; for the Euclidean
cones experiments we therefore implement and use the cosine-space formulation to
compute the energy E from eq. (2.6).
6.2 Jointly Embedding Images with Label-Hierarchy
The label that minimizes the energy E(f_l(l), f_i(i)) across all possible labels
for a given image is taken as the predicted label. This is done once for every
level in the label-hierarchy, using the labels from that particular level.
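A sketch of this per-level decision rule (energy stands in for the E of the
respective model; all names are illustrative):

import torch

def predict_per_level(image_emb, labels_per_level, energy):
    """Pick, at every level, the label whose energy with the image is minimal."""
    predictions = {}
    for level, label_embs in labels_per_level.items():
        # label_embs: iterable of label embeddings for this level
        energies = torch.stack([energy(l, image_emb) for l in label_embs])
        predictions[level] = int(energies.argmin())   # lowest energy = best match
    return predictions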
Figure 6.5: Visualization of jointly embedding labels and images using Euclidean
cones in 2 dimensions. The cyan nodes represent family, the magenta nodes
subfamily, the yellow nodes genus and the black nodes genus+species. The images
are depicted as semi-transparent nodes, which accumulate away from the origin
around the periphery. To better visualize this, we clamp the norms of the image
embeddings (to 500 units) so they can be shown together with the labels, which
are embedded much closer to the origin (as they are more abstract).
6.2.1 Optimization
The EC models use α = 1.0 and are trained for 200 epochs with a learning rate of
10^-2 for the label embeddings and 10^-3 for the image embeddings, using Adam.
The HC models are trained for 100 epochs with a learning rate of 10^-4 for the
label embeddings and 10^-3 for the image embeddings, using Adam. We use a batch
size of 100 for all models, with 10 negatives per positive using pick-per-level
sampling, and initialize the label embeddings of all models with label-only
embeddings.
Table 6.3: Summary of the embedding models' performance when used to classify
images. The metrics are reported on the held-out images of the test set of the
ETHEC dataset. The joint image and label embeddings live in R^d or D^d (d =
dimensionality of the embedding space). The main metric is the m-F1 for image
classification performance using the embeddings, which is directly comparable to
the CNN-based models and the hierarchy-agnostic model (baseline). Since these
models are embedding-based, in addition to the classification task we report the
quality of the label-hierarchy reconstruction obtained during the joint
embedding. The EC models use α = 1.0 and are trained for 200 epochs with a
learning rate of 10^-2 for the label embeddings and 10^-3 for the image
embeddings, using Adam. The HC models are trained for 100 epochs with a learning
rate of 10^-4 for the label embeddings and 10^-3 for the image embeddings, using
Adam. We use a batch size of 100 for all models, with 10 negatives per positive
using pick-per-level sampling, and initialize the label embeddings of all models
with label-only embeddings. Best models are bold-faced. EC: Euclidean cones,
HC: hyperbolic cones.
Per-level micro-F1

Model                               m-F1     m-F1 L1   m-F1 L2   m-F1 L3   m-F1 L4

CNN-based methods
Hierarchy-agnostic (baseline)       0.8147   0.9417    0.9446    0.8311    0.4578
Per-level classifier                0.9084   0.9766    0.9661    0.9204    0.7704
Marginalization classifier          0.9223   0.9887    0.9758    0.9273    0.7972
Masked Per-level classifier         0.9173   0.9828    0.9701    0.9233    0.7930
Hierarchical-softmax                0.9180   0.9879    0.9731    0.9253    0.7855

Order-preserving (joint) embedding models
Euclidean cones d=100               0.8350   0.9728    0.9370    0.8336    0.5967
Hyperbolic cones d=100∗             0.7627   0.9695    0.9205    0.7523    0.4246
Hyperbolic cones d=100              0.8404   0.9800    0.9439    0.8477    0.5977
Table 6.4: Comparison of level-wise performance across the different models,
both CNN-based and embedding-based, proposed in the main body of this work. All
models that exploit any information from the hierarchy outperform the
hierarchy-agnostic classifier baseline. All scores are micro-averaged F1; we
include the overall m-F1 in addition to the separate m-F1 across the 4 levels
of the ETHEC dataset. The best overall model is underlined and the best model
per category is bold-faced. Label embeddings for all joint-embedding models are
initialized using labels-only embeddings. ∗ = randomly initialized label
embeddings.
to drastically over-fit and have high performance on the train set and extremely
low performance on the unseen test set.
reason for its inability to rearrange is that two different types of objects are
being embedded (and computed differently), which is compounded by the use of
different optimizers.
In our experiments we obtain the best results when using the Adam optimizer,
even though the update step for parameters living in hyperbolic space then has
to be performed approximately. In practice, Adam with an approximate update step
works better than RSGD with its mathematically more precise update step.
    x_inverted = (r · ||x_max|| · x) / ||x||^2        (6.1)

where x_max is the cosine embedding with the largest norm and r ∈ R is the
minimum norm that any inverted embedding should have.
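A sketch of this inversion (note that the squared norm in the denominator is our
reading of eq. (6.1), chosen so that the largest-norm embedding receives norm
exactly r; the function name is illustrative):

import torch

def invert_cosine_embeddings(X, r):
    # X: (num_labels, d) cosine embeddings; r: minimum norm after inversion.
    # Norms are inverted: the largest-norm input ends up with norm exactly r,
    # smaller-norm inputs end up farther from the origin.
    norms = X.norm(dim=-1, keepdim=True)
    return r * norms.max() * X / norms.pow(2)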
One can notice that the inverted cosine embeddings in fig. 6.6 (which does not
show the images) closely resemble the Euclidean cones of the label embeddings
in fig. 6.5.
Chapter 7
Conclusions
for images. Obviously, this comes with possible issues like over-fitting and
the difficulty of optimizing parameters that live in non-Euclidean space. The
matrix W used for the mapping in our experiments lives in Euclidean space, but
very recent work on hyperbolic neural networks [15] points towards feed-forward
networks that could replace the matrix W; this W would then live, and be
optimized, in hyperbolic space.
• Applications. We empirically show that a classifier exploiting its label-
hierarchy outperforms a hierarchy-agnostic model. This provides a good basis
for improving existing image classification models with this untapped
information source.
In addition, the learnt joint embeddings can be used for downstream tasks such
as image captioning, scene understanding and scene-graph generation [26, 17, 43],
and visual question-answering (VQA) [2]. Models tackling these tasks lie at the
intersection of images and natural-language concepts, and our work moves towards
bridging the two fields and treating them jointly. Generally, for such tasks,
classifiers/CNN-backbones are used for visual feature extraction and object
proposals, while the semantics are obtained from a separate module (such as an
LSTM [26] or RNN [17, 43]) that models natural language. Because our proposed
methods are aware of both visual similarity and semantic similarity (via the
label-hierarchy information), they could improve performance by modeling visual
features and semantics jointly.
7.2 Summary
image classification task. This shows promise for other tasks that encom-
pass computer vision alone or computer vision in conjunction with natural
language, where exploiting label-hierarchy could benefit the models.
• During the experiments with embeddings we realized that optimization is a
relatively tricky process; the methods are sometimes delicate and sensitive to
the choice of initialization, optimizer and hyper-parameters.
• We provide extensive experiments, empirical analysis and visualizations for
the proposed methods, and observe that both the proposed label-hierarchy
injection methods and the order-preserving joint embeddings outperform the
hierarchy-agnostic image-classifier baseline on the introduced ETHEC dataset
[11], with 47,978 images and 723 labels spread across 4 hierarchical levels.
The implementation of all methods will be made publicly available.
Bibliography
[1] Wouter Addink, Dimitris Koureas, and Ana Casino. DiSSCo: The physical and
data infrastructure for Europe's natural science collections. In EGU General
Assembly Conference Abstracts, volume 20, page 16356, 2018.
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv
Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering.
In Proceedings of the IEEE international conference on computer vision, pages
2425–2433, 2015.
[3] Björn Barz and Joachim Denzler. Hierarchy-based image embeddings for
semantic image retrieval. arXiv preprint arXiv:1809.09924, 2018.
[4] Léon Bottou. Large-scale machine learning with stochastic gradient descent.
In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
[5] Tianshui Chen, Liang Lin, Riquan Chen, Yang Wu, and Xiaonan Luo.
Knowledge-embedded representation learning for fine-grained image recog-
nition. arXiv preprint arXiv:1807.00505, 2018.
[6] Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. Recurrent at-
tentional reinforcement learning for multi-label image recognition. In Thirty-
Second AAAI Conference on Artificial Intelligence, 2018.
[7] Tianshui Chen, Wenxi Wu, Yuefang Gao, Le Dong, Xiaonan Luo, and Liang
Lin. Fine-grained representation learning and recognition by exploiting hier-
archical semantic embedding. arXiv preprint arXiv:1808.04505, 2018.
[8] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large
scale fine-grained categorization and domain-specific transfer learning. In
Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4109–4118, 2018.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on
computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[10] Jia Deng, Jonathan Krause, Alexander C Berg, and Li Fei-Fei. Hedging your
bets: Optimizing accuracy-specificity trade-offs in large scale visual recogni-
tion. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages
3450–3457. IEEE, 2012.
[12] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++:
Improving visual-semantic embeddings with hard negatives. arXiv preprint
arXiv:1707.05612, 2017.
[13] Holger Frick, Pia Stieger, and Christoph Scheidegger. SwissCollNet – a
national initiative for natural history collections in Switzerland. Biodiversity
Information Science and Standards, 2019.
[14] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas
Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances
in neural information processing systems, pages 2121–2129, 2013.
[15] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural
networks. In Advances in neural information processing systems, pages 5345–
5355, 2018.
[17] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang
Ling. Scene graph generation with external knowledge and image recon-
struction. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1969–1978, 2019.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 770–778, 2016.
[19] Tao Hu and Honggang Qi. See better before looking closer: Weakly super-
vised data augmentation network for fine-grained visual classification. arXiv
preprint arXiv:1901.09891, 2019.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.
[21] Tasho Kjosev. Deep learning for generating template pictorial and textual
representations. Thesis, 2018.
[22] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on
CIFAR-10. Unpublished manuscript, 40(7), 2010.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet
classification with deep convolutional neural networks. In Advances in neural
information processing systems, pages 1097–1105, 2012.
[24] Suren Kumar and Rui Zheng. Hierarchical category detector for clothing
recognition from visual data. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2306–2312, 2017.
[25] Matt Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximil-
ian Nickel. Inferring concept hierarchies from text corpora via hyperbolic
embeddings. arXiv preprint arXiv:1902.00913, 2019.
[26] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene
graph generation from objects, phrases and region captions. In Proceedings of
the IEEE International Conference on Computer Vision, pages 1261–1270, 2017.
[27] Lingbo Liu, Hongjun Wang, Guanbin Li, Wanli Ouyang, and Liang Lin.
Crowd counting using deep recurrent spatial-aware network. arXiv preprint
arXiv:1807.00601, 2018.
[28] Xiao Liu, Jiang Wang, Shilei Wen, Errui Ding, and Yuanqing Lin. Localizing
by describing: Attribute-guided attention localization for fine-grained recog-
nition. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean.
Distributed representations of words and phrases and their compositionality.
In Advances in neural information processing systems, pages 3111–3119, 2013.
[31] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network
language model. In AISTATS, volume 5, pages 246–252. Citeseer, 2005.
[32] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hi-
erarchical representations. In Advances in neural information processing systems,
pages 6338–6347, 2017.
[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam
Lerer. Automatic differentiation in pytorch. Software Library, 2017.
[34] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[35] Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano.
Resampling or reweighting: A comparison of boosting implementations. In
2008 20th IEEE International Conference on Tools with Artificial Intelligence, vol-
ume 1, pages 445–451. IEEE, 2008.
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] Ryota Suzuki, Ryusuke Takahama, and Shun Onoda. Hyperbolic disk embed-
dings for directed acyclic graphs. arXiv preprint arXiv:1902.04335, 2019.
[39] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-
embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.
[40] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. Multi-
label image recognition by recurrently discovering attentional regions. In
Proceedings of the IEEE international conference on computer vision, pages 464–
472, 2017.
[42] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image
dataset for benchmarking machine learning algorithms. Unpublished
manuscript/dataset, 2017.
[43] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph gen-
eration by iterative message passing. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 5410–5419, 2017.