
Deep Multi-Modal Sets

Austin Reiter (Facebook AI Research), Menglin Jia (Cornell University), Pu Yang (Facebook AI Research), Ser-Nam Lim (Facebook AI Research)

arXiv:2003.01607v1 [cs.CV] 3 Mar 2020

Abstract

Many vision-related tasks benefit from reasoning over multiple modalities to leverage complementary views of data in an attempt to learn robust embedding spaces. Most deep learning-based methods rely on a late fusion technique whereby multiple feature types are encoded and concatenated and then a multi-layer perceptron (MLP) combines the fused embedding to make predictions. This has several limitations, such as an unnatural enforcement that all features be present at all times as well as constraining only a constant number of occurrences of a feature modality at any given time. Furthermore, as more modalities are added, the concatenated embedding grows. To mitigate this, we propose Deep Multi-Modal Sets: a technique that represents a collection of features as an unordered set rather than one long, ever-growing, fixed-size vector. The set is constructed so that we have invariance both to permutations of the feature modalities and to the cardinality of the set. We will also show that with particular choices in our model architecture, we can yield interpretable feature importance such that during inference time we can observe which modalities are most contributing to the prediction. With this in mind, we demonstrate a scalable, multi-modal framework that reasons over different modalities to learn various types of tasks. We demonstrate new state-of-the-art performance on two multi-modal datasets (Ads-Parallelity [34] and MM-IMDb [1]).

1. Introduction

Traditional vision tasks typically formulate problems with a single input (e.g., an image) to infer a desired output (e.g., classifying a scene, detecting objects, etc). More recently, the advantages of multiple complementary inputs to achieve a desired output have become more popular, especially for higher-level reasoning tasks such as Visual Question Answering (VQA) and video-based tasks. These types of models are referred to as multi-modal models.

In multi-modal models, the goal is to construct a model that is able to leverage different types of information with a common goal in mind. A very typical manifestation of this is to combine both visual and textual information. Here, the model seeks a way to fuse these data sources, usually by leveraging their individual discrimination capabilities, and then combining them together into a single representation for the final prediction task. As such, an image model (e.g., a CNN) may learn to represent raw images as feature embeddings while a text model (e.g., Text-CNN, LSTM, etc) is similarly learned to represent raw text, and these two are "fused" (or concatenated) into a single feature. This fused feature is used for any final down-stream tasks.

With this model architecture in mind, we sought to explore the following question: what if we approached visual modeling as a multi-modal problem? More specifically, rather than encoding all visual information from a photo into a single, all-encompassing feature embedding (as is typical, for example, with ImageNet-like tasks), we instead look to break an image down into more compositional components, such as objects, faces, overlaid text, etc, and then learn individual representations for each to be fused into a multi-modal framework. Can we do better this way? In order to explore an answer to this question, we first must realize that a standard concatenation model is too limiting to support features such as objects or faces, where there may be 0, 1, or a variable number of occurrences in any given frame, and this does not stay constant across samples. Therefore, a new way of thinking about how to fuse various modalities is required. Further, any modalities besides the image components mentioned, such as accompanying text and even audio, should also be easily added to the final embeddings.


Figure 1: Overview of Deep Multi-Modal Sets. An image (bottom-left), possibly along with additional raw inputs (e.g., the raw text "Be an Angel for Animals!"), is processed through various types of detectors, such as face and object detectors, as well as OCR extraction. Additionally, we encode raw inputs such as the entire image through a CNN and raw text through an NLP encoding model. Each is collected and re-encoded to a common space and then pooled to a fixed-size embedding. We call this the "Set Encoding". When we use an interpretable feature pooling scheme, such as max-pooling, we obtain a metric of feature importance (see more in the text).

To answer this question, this paper offers the following three primary contributions:

• A new multi-modal architecture which utilizes unordered sets; this adds much more flexibility in terms of the number of feature inputs as well as the number of feature occurrences per sample.

• This new architecture allows us to start to think about compositional reasoning via multi-modal modeling.

• An interpretable feature importance metric, allowing us to inspect our model at inference time for which feature modalities are most contributing to a given prediction.

The remainder of the paper is laid out as follows: first we review prior work on multi-modal modeling within the vision and machine learning literatures. Then we provide an overview of our method, called Deep Multi-Modal Sets, and conclude with experimental results and conclusions.

2. Related Work

Early and late fusion methods for multi-modal models have been explored for several years [2], for example, by combining low-level features with prediction-level features. It has been shown in some scenarios that late fusion methods outperform early fusion methods, but not in all cases [27]. "Late fusion" is commonly defined as a combination of the output scores (or embeddings) from each unimodal branch of a multi-modal model. Before deep learning, methods for combining these outputs ranged from weighted averaging [21] to rank minimization [31].

Combinations of early and late fusion methods were proposed in [30], whereby low-level and high-level features were fused via boosting. Attention models have recently become popular, rooted in early work from [10] showing how a model is able to pick from a selection of networks via gating, based on the input. Multi-modal attention has also been extended to temporal problems, as shown in [9, 18].

As much of the field has moved towards deep learning, given superior results in all domains from computer vision to natural language processing, it makes sense to focus on methods built on deep models. [26] proposed a method to search for an optimal neural architecture that optimizes fusing feature modalities, posing the neural architecture search problem as a combinatorial search.

The work in [12] is particularly interesting and relevant to the current proposed work. In that paper, the authors investigate various methods to combine text, as a discrete representation, along with a vision-based, more continuous modality, in a multi-modal framework. The target of that work was to develop methods that are appropriate for large quantities of data that must be processed with light-weight, computationally-efficient operations. In doing so, the authors touched upon fusion approaches that are very similar to unordered sets. We note the parallels between that work and ours, where we are further generalizing and pushing these ideas towards a more formal framework for multi-modal fusion. As a follow-up to this work, [11] introduces a multi-modal bitransformer model which seeks to highlight the strengths of the text signal while supplementing with CNN-type image-only features, showing significant SOTA performance on difficult multi-modal benchmarks.
In [34], the question is explored as to whether two modalities, images and text, are complementary to one another and how to better understand their relationship. An ensemble of SVM predictors is proposed to exploit the relationship between an image and its associated text, all with a goal towards capturing "parallelity". Multiple modalities modeled with deep neural networks were explored in [22] and [28]. Related to this, gated mechanisms and compact bilinear pooling were proposed as alternative ways to fuse modalities in [1, 8].

Multi-modal models have also recently been proposed as a way to provide explanations for model decisions. In [24], two datasets and a new model are proposed to provide joint textual rationales with attentional justifications for model decisions in tasks that ask questions about content in photos. Other similar work [30, 29] has similarly focused on attention models in a multi-modal fashion for VQA tasks.

3. Method Overview

We begin with an overview of our Deep Multi-Modal Sets methodology. First, we describe the generic multi-modal problem as well as the baseline approach to it. Next, we give an overview of Deep Sets, and conclude with Deep Multi-Modal Sets.

3.1. The Multi-Modal Problem

A multi-modal problem is one in which more than one feature type, or modality, is given. The goal is to combine these features together within a single model towards a common task. More formally, consider that we have modalities i ∈ I, represented by feature embeddings X_i. In general, each X_i may be of dimensions (N_i × M_i), for N_i ≥ 0 occurrences of modality i, each of which is (1 × M_i) in its embedding space. In this way, a feature modality may occur 0, 1, or more times in a given data sample.

In the most simple form of a multi-modal model, a concatenated feature is formed by appending all features into a single vector: X_C = concat([X_1, ..., X_I]), where concat is a function that appends all 1-D vectors to one another. In this way, X_C ∈ R^M, where M = Σ_{i=1}^{I} N_i · M_i. We then typically attach an MLP to the end of X_C to predict the target task. There are two complications that may arise with this model in mind: 1) if N_i = 0 for any i ∈ I; and 2) if N_i > 1 for any i ∈ I. For 1), it is typical to use a placeholder, such as zeros, to indicate that the feature is missing; however, this can be unnatural to force into the model. As for 2), there is no widespread method to deal with multiple occurrences of a given feature modality, unless the count is constant throughout. If this count varies across data samples, it is unknown how best to deal with it (perhaps one pads to the maximum number of occurrences, which is quite wasteful). We will describe our technique to address these issues in the following sub-sections.

Another common issue with this formulation for multi-modal modeling arises when there is a large imbalance in feature dimensions amongst the modalities. When this happens, it is possible that feature types with higher dimensions dominate the model over those with lower. This can be addressed by encoding each modality individually before concatenating. For the purposes of this work, we will denote φ_i(X_i) as an encoder acting on modality i. The goal here is then to encode each modality to a common dimensionality D: φ_i(X_i) : R^{M_i} → R^D for all i ∈ I.

Finally, if we consider the case where M is very large, either due to many different modalities or many different occurrences of individual modalities, the final multi-modal model can get quite large due to the fact that an MLP is fully-connected. To counter this, we would instead prefer an architecture that scales better with the cardinality of I (denoted as |I|) and with M. Towards this end, we describe a new class of models that we may adopt to the multi-modal domain.
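To make the baseline concrete, the sketch below (ours, not from the paper; all dimensions, module names, and the two-class MLP head are illustrative assumptions) shows the standard late-fusion recipe: each modality is projected to a common dimensionality D and the projections are concatenated before an MLP. Note that the MLP input layer is sized for a fixed, fully-populated slot per modality, which is exactly the constraint the set-based formulation removes.

```python
import torch
import torch.nn as nn

# Illustrative late-fusion baseline (not the paper's code); all dimensions are made up.
D = 32
modality_dims = {"image": 2048, "text": 768, "ocr": 768}          # M_i per modality i

# phi_i: encode each modality to the common dimensionality D
encoders = nn.ModuleDict({name: nn.Linear(m, D) for name, m in modality_dims.items()})

# The concatenated embedding X_C has a fixed size |I| * D, so the MLP input layer
# is tied to exactly one occurrence of every modality.
mlp = nn.Sequential(nn.Linear(len(modality_dims) * D, 64), nn.ReLU(), nn.Linear(64, 2))

features = {                                  # one sample, one occurrence of each modality
    "image": torch.randn(1, 2048),
    "text": torch.randn(1, 768),
    "ocr": torch.randn(1, 768),               # if OCR were missing, a zero placeholder would be needed
}
x_c = torch.cat([encoders[k](v) for k, v in features.items()], dim=-1)   # (1, |I| * D)
logits = mlp(x_c)                             # (1, 2)
```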
3.2. Deep Sets

Deep Sets [32] refer to a class of models for machine learning tasks that are defined on sets, in contrast to traditional approaches that operate on fixed-dimensional vectors. The main contribution of [32] is to define an architecture with properties that guarantee invariance both to permutations of a collection of features and to the cardinality (e.g., the number of elements) of that collection; this collection is referred to as a set.

Consider, as input to such a model, a set of vectors X = {x_1, ..., x_S}. We construct a function ψ(X) that is "indifferent" to the ordering as well as the count of the elements in X. Such a function would be able to map any number of elements to a fixed-dimensional representation for use in a down-stream modeling task. One example is the well-known sum-pooling operator:

ψ_sum(X) = Σ_{x ∈ X} φ(x)    (1)

where φ(x) is a transformation of input x to an embedded space.

In general, any operator applied to X that yields an identical result under any permutation π, i.e., ψ({x_1, ..., x_S}) = ψ({x_{π(1)}, ..., x_{π(S)}}), is sufficient for this model. Another example is the max-pooling operator:

ψ_max(X) = max_{x ∈ X} φ(x)    (2)

Here, the max operator is element-wise, in the sense that a set of S elements, each of which is (1 × D), when operated on by the function in Eq. 2, yields a (1 × D) vector at all times. In the end, the output of the pooling operation is input to a down-stream model, such as an MLP, to classify label y from input X as:

y = ρ(ψ(X))    (3)
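As a minimal illustration of Eqs. (1)-(3) (our sketch, with an arbitrary φ and ρ), the pooled representation below is identical under any permutation of the set elements and has the same size regardless of how many elements the set contains.

```python
import torch
import torch.nn as nn

D = 16
phi = nn.Linear(8, D)                                               # per-element transformation phi(x)
rho = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, 3))  # down-stream predictor

def psi_sum(X):                             # Eq. (1): sum-pooling over the set
    return phi(X).sum(dim=0, keepdim=True)

def psi_max(X):                             # Eq. (2): element-wise max-pooling over the set
    return phi(X).max(dim=0, keepdim=True).values

X = torch.randn(5, 8)                       # a set of S = 5 elements, each (1 x 8)
perm = X[torch.randperm(X.shape[0])]        # any permutation of the same set

assert torch.allclose(psi_sum(X), psi_sum(perm), atol=1e-6)    # permutation invariance
assert psi_max(X).shape == psi_max(X[:2]).shape                # cardinality invariance

y = rho(psi_max(X))                         # Eq. (3): y = rho(psi(X))
```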
3.3. Deep Multi-Modal Sets

The primary contribution of this work is to incorporate the concepts of Deep Sets into the multi-modal modeling problem. All of the shortcomings of the baseline model methodologies described in Sec. 3.1 can be addressed by applying the concepts described in Sec. 3.2, with a few additional advantages which we describe later on. We call these Deep Multi-Modal Sets (see Fig. 1).

The idea of concatenating feature modalities to one another can be thought of as a form of "pooling operator", except that concatenation does not conform to order invariance and is very sensitive to the number of modalities. However, at its core, concatenation serves the purpose of combining several modalities into a single representation in order to perform down-stream modeling tasks. With this in mind, we propose replacing the concatenation operation with a Deep Set; we collect all feature modalities into a single set, and we pool them in order to extract a single embedding that jointly represents all features. In this way, this model has the following advantages:

• No matter how many modalities we have, or how many occurrences of individual modalities, the size of the model stays constant. This scales very nicely with an increasing number of features and modalities.

• If a particular feature is not present for a given frame, we do not force a place-holder; we simply do not include it in the set.

Using the notation from Sec. 3.1, given our feature modalities, we construct the set X = {X_i, i ∈ I}. Given that many feature modalities may naturally have different dimensionalities, we enforce encoders φ_i(X_i) for each X_i so that every feature occurrence in X has dimensionality (1 × D). For any modality with multiple occurrences, each occurrence becomes an individual element in X. In this way, the number of elements in X is equal to Σ_{i ∈ I} N_i.

The goal of the multi-modal model is then to learn a pooling operator ψ(X). In the end, we attach an MLP to the output of this operator, denoted as ρ(ψ(X)), to perform the final down-stream target task.
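The following sketch (ours; the module names, hidden sizes, and the use of ELU/dropout are assumptions loosely based on Sec. 4.3) puts the pieces together: per-modality encoders φ_i map a variable number of occurrences of each modality to the common dimensionality D, all encodings are stacked into one set, pooled with a permutation-invariant ψ, and passed to the predictor ρ. Missing modalities are simply omitted and variable counts need no padding.

```python
import torch
import torch.nn as nn

class DeepMultiModalSet(nn.Module):
    """Illustrative sketch of the set-based fusion model (not the authors' code)."""

    def __init__(self, modality_dims, d_common=32, n_classes=2, pool="max"):
        super().__init__()
        # phi_i: one small encoder per modality, all mapping to the common dimension D
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(m, d_common), nn.ELU(), nn.Dropout(0.25))
            for name, m in modality_dims.items()
        })
        self.pool = pool
        # rho: set predictor operating on the fixed-size pooled embedding
        self.rho = nn.Sequential(nn.Linear(d_common, 32), nn.ELU(), nn.Linear(32, n_classes))

    def forward(self, features):
        # features: dict of modality name -> (N_i, M_i) tensor; keys may be missing
        # and each N_i may be 0, 1, or many -- no placeholders or padding needed.
        encoded = [self.encoders[name](x) for name, x in features.items() if x.numel() > 0]
        set_tensor = torch.cat(encoded, dim=0)          # (sum_i N_i, D)
        if self.pool == "max":
            pooled = set_tensor.max(dim=0).values       # (D,)
        else:
            pooled = set_tensor.sum(dim=0)              # (D,)
        return self.rho(pooled)

model = DeepMultiModalSet({"image": 2048, "face": 256, "text": 768})
sample = {"image": torch.randn(1, 2048), "face": torch.randn(3, 256)}  # text missing, 3 faces
logits = model(sample)    # model size is the same regardless of how many elements were present
```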
erage accuracy following [34]. Additionally we also report
3.3.1 Feature Importance ROC AUC score.
MM-IMDb. Multimodal IMDb (MM-IMDb) dataset [1]
There are several choices for the pooling operator that one contains 25,959 movies including their plots, posters, and
can use. We listed two above in the sum and max pool- other metadata. The task is to predict genres for each movie.
ing approaches. Other popular choices are mean, min, and Since one movie can have multiple genres, this task is a
median pooling. However, there are side effects to certain multi-label classification problem. We use the original split
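A sketch of the importance computation described above (ours, under the same assumptions as the previous snippet): max-pool the (N × d) set, take the per-dimension argmax over the N elements, and count how often each modality "wins" a dimension.

```python
import torch
from collections import Counter

def max_pool_with_importance(set_tensor, element_modalities):
    """set_tensor: (N, d) encoded set; element_modalities: list of N modality names."""
    pooled, argmax = set_tensor.max(dim=0)          # pooled: (d,), argmax: (d,) element indices
    # For each of the d dimensions, record which modality supplied the surviving value.
    winners = [element_modalities[i] for i in argmax.tolist()]
    importance = Counter(winners)                   # modality -> number of dimensions "won"
    return pooled, importance

# Example: 1 whole-image feature, 2 faces, and 1 OCR feature, already encoded to d = 8
set_tensor = torch.randn(4, 8)
names = ["whole_image", "face", "face", "ocr"]
pooled, importance = max_pool_with_importance(set_tensor, names)
print(importance)    # e.g. Counter({'face': 4, 'whole_image': 3, 'ocr': 1})
```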
4. Experiments and Discussions

The proposed Deep Multi-Modal Sets are evaluated on two multi-modality datasets, namely MM-IMDb [1] and the Ads-Parallelity Dataset [34].

4.1. Datasets

Ads-Parallelity Dataset. This dataset contains images and slogans from persuasive advertisements, for understanding the implicit relationship (parallel and non-parallel) between these two modalities. The binary classification task is to predict whether the image and text in the same advertisement convey the same message. A total of 327 samples are used (after removing duplicated images). We use 5-fold cross-validation, and report both overall and per-class average accuracy following [34]. Additionally, we also report the ROC AUC score.

MM-IMDb. The Multimodal IMDb (MM-IMDb) dataset [1] contains 25,959 movies, including their plots, posters, and other metadata. The task is to predict genres for each movie. Since one movie can have multiple genres, this is a multi-label classification problem. We use the original split from the dataset, and report the F1 scores (micro, macro, and samples) on the given test set.

Features             | Input types
WSL                  | image (i), detected bounding boxes (bbox)
Face                 | image (i)
OCR                  | image (i)
RoBERTa              | text (t), OCR text (ocr)
Index Embedding (IE) | object classes (obj), image naturalness (nat), text concreteness (con)

Table 1: Features and input types used for our experiments. Image naturalness and text concreteness are metadata provided by the Ads-Parallelity Dataset.

4.2. Features

We extract various features for both images and text using off-the-shelf models. Table 1 summarizes all the input types for the different feature types.

WSL. We use the output from the second-to-last FC layer of the ResNeXt-101 model (32 × 16d, pretrained on 940 million public images and the ImageNet1K dataset) [20] as what we call the whole-image feature. The dimension of this feature is 2048.

Face. We use MTCNN [33] to detect faces in images; for each detected face, we extract that part of the image from the face bounding box and obtain per-detection face embeddings (dimension 256) from a pretrained model (SE-ResNet-50-256D [5]), trained on the VGGFace2 dataset [4].
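A hedged sketch of this face pipeline: the paper uses MTCNN and an SE-ResNet-50-256D VGGFace2 model, while the snippet below substitutes the facenet_pytorch MTCNN implementation and a placeholder face_embedder module, so treat those names and the crop size as assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from facenet_pytorch import MTCNN      # stand-in MTCNN implementation (an assumption)

mtcnn = MTCNN(keep_all=True)           # keep_all=True returns every detected face
# Placeholder for the pretrained 256-D face embedding network used in the paper.
face_embedder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))

image = Image.open("ad.jpg").convert("RGB")        # hypothetical input image
boxes, probs = mtcnn.detect(image)                 # boxes: (num_faces, 4) array, or None

face_embeddings = []
if boxes is not None:
    for (x1, y1, x2, y2) in boxes.tolist():
        crop = image.crop((int(x1), int(y1), int(x2), int(y2))).resize((160, 160))
        tensor = torch.from_numpy(np.asarray(crop)).permute(2, 0, 1).float() / 255.0
        face_embeddings.append(face_embedder(tensor.unsqueeze(0)))   # one (1, 256) element per face
# Zero detected faces simply yields an empty list; no placeholder enters the set.
```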
Optical Character Recognition (OCR). We use an existing OCR model from [3] to extract overlaid text from images.

Objects (obj) and bounding boxes (bbox). We use Faster R-CNN with a ResNet-50 backbone and FPN [14], trained on the COCO dataset [16] with the 1x training schedule from the detectron2 model zoo (https://github.com/facebookresearch/detectron2). We extract the COCO categories with a classification score threshold of 0.65, and extract the WSL embeddings of the associated bounding boxes.
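A hedged sketch of this object-feature step using detectron2's off-the-shelf predictor (our approximation of the described setup; the wsl_encoder helper is a placeholder for the pretrained ResNeXt WSL feature extractor and the file name is hypothetical).

```python
import cv2
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.65       # classification score threshold from the paper
predictor = DefaultPredictor(cfg)

def wsl_encoder(crop_bgr):
    # Placeholder for the pretrained ResNeXt WSL extractor (2048-D embedding per crop).
    return torch.zeros(2048)

image = cv2.imread("ad.jpg")                       # hypothetical input image (BGR)
instances = predictor(image)["instances"].to("cpu")
boxes = instances.pred_boxes.tensor                # (num_detections, 4) in xyxy format
classes = instances.pred_classes                   # COCO category indices -> inputs to IE(obj)

box_features = []
for (x1, y1, x2, y2) in boxes.tolist():
    crop = image[int(y1):int(y2), int(x1):int(x2)]
    box_features.append(wsl_encoder(crop))         # one WSL(bbox) embedding per detection
```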
Sentence embeddings (RoBERTa). To encode raw text, we employ RoBERTa (using the BERT-base architecture) [17] to extract sentence embeddings from the text as well as from any detected OCR text. We use the first output (dimension 768) of the final layer of the model downloaded from fairseq [23]. We concatenate all sentences as one for an individual sample, clipping the maximum sentence length to 512 tokens.
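A hedged sketch of this sentence-embedding step via the fairseq RoBERTa hub interface (our reading of the described setup; the example text and the exact model variant are assumptions, and the pretrained weights are downloaded on first use).

```python
import torch

# Pretrained RoBERTa (BERT-base architecture) released with fairseq.
roberta = torch.hub.load("pytorch/fairseq", "roberta.base")
roberta.eval()

text = "Be an Angel for Animals! " + "Share a Coke with a friend"   # all sentences concatenated
tokens = roberta.encode(text)[:512]                # clip to a maximum of 512 tokens

with torch.no_grad():
    features = roberta.extract_features(tokens)    # (1, seq_len, 768), final layer
sentence_embedding = features[0, 0, :]             # first output (the <s> token), dimension 768
```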
Index Embeddings (IE). For certain types of features, the representation is a simple class index: an integer in a finite range (e.g., an object class index). To represent these as embeddings, we construct the encoder by borrowing from the text modeling domain. When we encode text, we typically build a vocabulary of unique words and map each word to a corresponding integer index. These integer indices are then mapped to dense feature embeddings by means of a learned lookup table such that each index is a row in a |V| × D dense matrix, for a vocabulary of size |V|. These are typically referred to as word embeddings, as a way to map raw text words to dense embeddings; in the same way, we construct index embeddings as a way to map discrete class indices to dense embeddings. For situations where there is more than one index for a sample, we use the model architecture in [13] to encode this sequence further into a single embedding. For example, in the Ads-Parallelity Dataset, there are "image naturalness" and "text concreteness" discrete labels; these are modeled as described here.
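A minimal sketch of the index-embedding encoder (ours; the vocabulary size is illustrative, and the Text-CNN of [13] used for multi-index inputs is replaced here with a simple mean over the looked-up rows).

```python
import torch
import torch.nn as nn

class IndexEmbedding(nn.Module):
    """Map discrete class indices to dense D-dim embeddings via a learned |V| x D lookup table."""

    def __init__(self, vocab_size, d_common=32):
        super().__init__()
        self.table = nn.Embedding(vocab_size, d_common)   # the |V| x D dense matrix

    def forward(self, indices):
        # indices: (num_indices,) LongTensor, e.g. detected object classes for one sample
        rows = self.table(indices)                        # (num_indices, D)
        # Stand-in for the sequence encoder of [13]: collapse multiple indices to one embedding.
        return rows.mean(dim=0, keepdim=True)             # (1, D)

ie_obj = IndexEmbedding(vocab_size=80)                    # e.g. 80 COCO object categories
embedding = ie_obj(torch.tensor([0, 16]))                 # e.g. a "person" and a "dog" detection
```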
4.3. Implementation

We use PyTorch [25] to implement and train all the models on a single NVIDIA V100 GPU. Adam optimization with decoupled weight decay [19] is used. The learning rate is warmed up linearly from 0 to 0.001 during the first five epochs, and then decayed with a cosine annealing schedule over the remaining epochs. Sigmoid cross-entropy loss is used for both the single- and multi-label classification tasks. For imbalanced training datasets (MM-IMDb), we use class weights (the inverse square root of class frequency [20]) to balance the loss. To stabilize the training process, the bias of the last linear classification layer is initialized with b = −log((1 − π)/π) [15, 7], where the prior probability π is set to 0.01.
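A sketch of this training setup (ours; the warmup/cosine schedule is implemented with a LambdaLR, and the model, epoch counts, weight decay value, and class frequencies are placeholders).

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(32, 23)                      # placeholder for rho(psi(.)) with 23 output classes
warmup_epochs, total_epochs, base_lr = 5, 50, 1e-3

# Bias init b = -log((1 - pi) / pi) with prior probability pi = 0.01.
prior_pi = 0.01
nn.init.constant_(model.bias, -math.log((1 - prior_pi) / prior_pi))

# Adam with decoupled weight decay; the weight decay magnitude here is a placeholder.
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=1e-4)

def lr_lambda(epoch):
    if epoch < warmup_epochs:                  # linear warmup from 0 to base_lr
        return epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine annealing afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Sigmoid cross-entropy; one way to apply inverse-square-root class-frequency weights.
class_counts = torch.randint(10, 1000, (23,)).float()    # placeholder label frequencies
criterion = nn.BCEWithLogitsLoss(pos_weight=1.0 / class_counts.sqrt())
```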
Modality Encoders: Each feature modality encoder φ is trained to encode to a common space with dimensionality D so that the various modalities can be pooled together. For these experiments, we use D_Ads = 32 for the Ads-Parallelity dataset and D_IMDb = 1024 for MM-IMDb. These are chosen empirically based on the amount of data available for training as well as the complexity of the problem. For each φ (i.e., for each feature modality), we use a single fully-connected layer with an ELU activation [6] and a dropout of 0.25, with a final dimensionality of D for the given dataset. (Note: for some datasets, there are too many instances of a particular modality, for example, objects or faces. In these cases, to keep computational constraints in mind, we sub-sample down to 10 bounding box detections when needed.)

Deep Set Pooling: We study various pooling schemes for ψ, including max, sum, and min. Though max is the proposed choice for interpretable feature importance, in some cases other pooling schemes yield better results. This represents a common trade-off between model interpretability and performance. We discuss more about the pooling operators below.
Set Predictors: For each set encoding following ψ, we construct MLPs to represent ρ for the final down-stream tasks. The architectures are as follows: ρ_Ads: [D_Ads → 32 → 2], and ρ_IMDb: [D_IMDb → 256 → 128 → 32]. In each case, the last output layer represents the total number of predicted classes for that dataset's task.

4.4. Results and Discussions

Ads-Parallelity: Table 2 shows our results using Deep Multi-Modal Sets on the Ads-Parallelity dataset. Here, we show ablation studies using various combinations of features and pooling schemes to best understand how different modalities interact with one another for this particular task, and compare each to the current state-of-the-art shown in the top row [34]. The metrics used here are overall accuracy and AUC, as well as accuracy for the non-parallel and parallel tasks individually. The best metrics are shown in bold and reflect a noticeable increase in performance over the current SOTA.

We find that, in general, and perhaps unsurprisingly, more features working together do help boost performance over uni-modal approaches. This is an advantage of the proposed method. It allows us, in a very flexible manner, to support just about any type of feature as well as any number of instances of each modality, without increasing the model complexity. In this way, the model can leverage more types of modalities jointly. We stress that, though many features, such as WSL or OCR, typically have one occurrence in a given sample, modalities such as object detections can occur any number of times in an image, and the model need not know this beforehand; each is simply an instance in the set, and there may be as few or as many instances as naturally occur, without padding or place-holders. This requires less up-front manual feature engineering. In the end, our boost in performance ranges from +11.213% on overall accuracy to +21.344% on the parallel equivalence task.

We show some sample results from this experiment in Fig. 2. Here, the top row shows sample images from the dataset. Below that, we show the various features either extracted (e.g., number of faces, OCR overlaid text) or given (e.g., concreteness and naturalness labels, supplemental raw text input). In the last row, we show the feature importance (FI) for each modality for that individual test sample after prediction, as it is considered in various feature combinations. On the x-axis of the FI matrices (FIM), we label the individual feature modalities that are considered. On the y-axis, we show the predicted label for each combination of features (shown per row). The top of each FIM shows the ground truth for that given test sample.

In the first column, we show an example "Mad Men" advertisement along with the supplemental text "For your consideration". The ground truth label is given as non-parallel, meaning that the image and text are not conveying the same message. When looking at the FIM, all but one feature combination gets the prediction correct. On further inspection, it seems that faces are quite important when included, as are WSL(bbox) and the naturalness and concreteness index labels. This seems reasonable, as faces are the major feature of the image, and comparing them to the text would be a natural way to determine whether they are parallel or not. For the next column over (the Coors Light ad), face importance goes down (empty means zero importance), which makes sense since the actor's face is present but barely visible. Here, WSL(i) as well as naturalness and concreteness dominate importance. In the "White Castle" ad, WSL(bbox) (e.g., objects) as well as WSL(i) work well together with naturalness and concreteness, which seems reasonable given the content of the ad. Finally, the right-most column shows a mistake: most predictions get the label wrong. When we inspect this, we see that faces were incorrectly detected and that RoBERTa(t) is given zero importance too often. This shows what happens when bad features occur; however, the FI gives us insight into why it is happening.

MM-IMDb: In Table 3, we show results on MM-IMDb. For this, we compare against 4 existing SOTA methods [1, 12, 26, 11]. To evaluate, as in those works, we use the F1-Micro, F1-Macro, and F1-Samples metrics. Here, we can observe some slight trade-offs amongst the different input modalities when comparing image to text features. We note that the combination of image and text modalities outperforms either on its own, which highlights the need for multi-modal models whenever possible. We are able to show improvement over all 4 prior methods on this dataset, with the best performing model using WSL(i), RoBERTa(t) and WSL(bbox) from objects.

We similarly show example predictions from this dataset in Fig. 3, as we did above. The left 3 columns show examples of correct predictions, where several of the feature combinations get most (or all) of the labels from this multi-label problem correct. Faces are very helpful for the middle-left example, where clearly there are many faces present in the image. For the middle-right example (Prom Night III), faces are also important, as is WSL(i) along with the text/plot description encoding. The right-most column shows an example where the model performs poorly. Here, no combination of features seems able to get all labels correct, though some cases are able to pick up on one of them. The labels "Family" and "Biography" do not seem indicated in any of the visual features, and the signal from the raw text was not able to make up for those mistakes enough to get all 3 labels correct. Again, the analysis provided by this interpretable FI allows us to reason about model performance in a natural way.
Method | Modalities | Features | Pool Type | Overall Accuracy | AUC | Non-Parallel | Parallel-Equiv | Parallel-Non-Equiv
[34] | img + txt | - | - | 65.500 | - | 63.300 | 70.200 | 65.500
Ours | img | WSL(i) | - | 57.726 | 54.508 | 47.803 | 76.103 | 59.273
Ours | img | WSL(i), RoBERTa(ocr) | ψmax | 61.721 | 59.188 | 46.553 | 81.912 | 71.283
Ours | img | WSL(i), RoBERTa(ocr), Face(i) | ψmax | 62.043 | 57.681 | 51.667 | 75.956 | 68.766
Ours | img | WSL(i), RoBERTa(ocr), Face(i), CNN-text(objs) | ψmax | 62.029 | 58.648 | 55.985 | 76.912 | 58.913
Ours | img | WSL(i), RoBERTa(ocr), Face(i), CNN-text(objs), WSL(bbox) | ψmax | 61.707 | 59.402 | 40.284 | 84.412 | 81.217
Ours | img | WSL(i), RoBERTa(ocr), Face(i), IE(objs), WSL(bbox), IE(nat) | ψmax | 63.909 | 61.300 | 58.371 | 72.426 | 66.846
Ours | txt | RoBERTa(t) | - | 46.777 | 55.889 | 81.487 | 36.718 | 28.294
Ours | txt | RoBERTa(t), IE(con) | ψmax | 72.167 | 65.720 | 67.140 | 87.941 | 65.776
Ours | img + txt | WSL(i), IE(nat), RoBERTa(ocr), RoBERTa(t), IE(con) | ψmax | 71.552 | 73.326 | 64.583 | 91.544 | 65.171
Ours | img + txt | WSL(i), IE(nat), RoBERTa(ocr), RoBERTa(t), IE(con) | ψmin | 73.058 | 70.113 | 65.170 | 90.294 | 71.275
Ours | img + txt | WSL(i), IE(nat), RoBERTa(ocr), RoBERTa(t), IE(con) | ψsum | 76.713 | 77.788 | 75.152 | 88.971 | 67.451
Ours | img + txt | WSL(i), IE(nat), WSL(bbox), RoBERTa(ocr), Face(i), RoBERTa(t), IE(con) | ψmax | 70.311 | 67.464 | 60.246 | 86.691 | 73.840
Ours | img + txt | WSL(i), IE(nat), WSL(bbox), RoBERTa(ocr), Face(i), RoBERTa(t), IE(con) | ψmin | 71.561 | 67.982 | 63.409 | 91.471 | 67.590
Ours | img + txt | WSL(i), IE(nat), WSL(bbox), RoBERTa(ocr), Face(i), RoBERTa(t), IE(con) | ψsum | 73.716 | 75.460 | 69.545 | 91.471 | 63.709
Ours | img + txt | WSL(i), IE(nat), objs, bbox, RoBERTa(ocr), Face(i), transcription (r), IE(con) | ψmax | 71.845 | 73.334 | 63.371 | 90.147 | 70.229
Ours | img + txt | WSL(i), IE(nat), objs, bbox, RoBERTa(ocr), Face(i), transcription (r), IE(con) | ψmin | 71.262 | 73.428 | 60.928 | 90.221 | 72.377
Ours | img + txt | WSL(i), IE(nat), objs, bbox, RoBERTa(ocr), Face(i), transcription (r), IE(con) | ψsum | 74.923 | 75.045 | 72.064 | 91.544 | 64.118

Table 2: Ads-Parallelity results. Average accuracy and area under the ROC curve over the entire dataset, and accuracies for three fine-grained classes (Non-Parallel: text and image convey different meanings; Parallel-Equiv: they are completely equivalent; Parallel-Non-Equiv: they express the same ideas but at different levels of detail) are reported. The proposed Deep Multi-Modal Sets model outperforms the baseline method by a large margin.

Figure 2: Selected examples from Ads-Parallelity. Inputs for this dataset include the image, text, faces, OCR text, etc. The last row of the figure presents the Feature Importance Matrices (FIM) for each model (referenced in Table 2). Best viewed digitally.

Pooling: For Ads-Parallelity, it seems sum pooling slightly outperforms the other pooling schemes on overall accuracy and AUC; however, on the individual tasks max-pooling occasionally edges out the others. The difference is typically marginal, but this does showcase how different pooling schemes should be considered when designing the model architecture. A single pooling method does not always outperform the others in all cases.

In the case of MM-IMDb, max pooling yields the top results, interestingly with the same feature combination for all three metrics. Again, the differences between pooling schemes are not large, but this study is still educational towards understanding pooling schemes. Because the encoders are trained end-to-end with a particular pooling scheme in mind, they will attempt to optimize performance given the pooling operator.
Method | Modalities | Features | Pool Type | F1-Micro | F1-Macro | F1-Samples
[1] | img + txt | - | - | 0.6300 | 0.5410 | 0.6300
[12] | img + txt | - | - | 0.6230 | - | -
[26] | img + txt | - | - | - | 0.5568 | -
[11] | img + txt | - | - | 0.6640 | 0.6110 | -
Ours | img | WSL(i) | - | 0.5253 | 0.3791 | 0.5227
Ours | img | WSL(i), Face(i), IE(obj) | ψmax | 0.4945 | 0.3570 | 0.4931
Ours | img | WSL(i), Face(i), IE(obj), WSL(bbox) | ψmax | 0.4939 | 0.3566 | 0.4933
Ours | img | WSL(i), Face(i), IE(obj), WSL(bbox), IE(ocr), RoBERTa(ocr) | ψmax | 0.5150 | 0.3856 | 0.5173
Ours | txt | RoBERTa(t) | ψmax | 0.6699 | 0.6011 | 0.6714
Ours | img + txt | WSL(i), RoBERTa(t) | ψmax | 0.6623 | 0.5961 | 0.6637
Ours | img + txt | WSL(i), RoBERTa(t) | ψmin | 0.6709 | 0.5929 | 0.6710
Ours | img + txt | WSL(i), RoBERTa(t) | ψsum | 0.6716 | 0.5885 | 0.6721
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox) | ψmax | 0.6773 | 0.6133 | 0.6763
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox) | ψmin | 0.6673 | 0.5871 | 0.6690
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox) | ψsum | 0.6677 | 0.5965 | 0.6665
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i) | ψmax | 0.6474 | 0.5671 | 0.6502
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i) | ψmin | 0.6302 | 0.5532 | 0.6310
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i) | ψsum | 0.6664 | 0.5948 | 0.6665
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i), IE(obj) | ψmax | 0.6345 | 0.5459 | 0.6328
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i), IE(obj) | ψmin | 0.6544 | 0.5688 | 0.6526
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i), IE(obj) | ψsum | 0.6688 | 0.5884 | 0.6666
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i), IE(obj), RoBERTa(ocr) | ψmax | 0.6353 | 0.5621 | 0.6363
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i), IE(obj), RoBERTa(ocr) | ψmin | 0.6410 | 0.5530 | 0.6430
Ours | img + txt | WSL(i), RoBERTa(t), WSL(bbox), Face(i), IE(obj), RoBERTa(ocr) | ψsum | 0.6750 | 0.6044 | 0.6760

Table 3: MM-IMDb results. Micro F1, Macro F1, and Samples F1 scores are reported. The proposed Deep Multi-Modal Sets model outperforms the other methods.


Figure 3: Selected examples from MM-IMDb. We list the inputs for each sample, including the poster, faces, OCR text, and plot. Plots and OCR are abbreviated for visual effect. The last row of the figure presents the Feature Importance Matrices (FIM) for each model (referenced in Table 3). Predicted labels on the y-axis are abbreviated as: Action (AC), Adventure (AD), Biography (BI), Comedy (CO), Crime (CR), Documentary (DO), Drama (DR), Family (FAM), Fantasy (FA), Horror (HO), Music (MU), Musical (MUL), Mystery (MY), Romance (RO), Sci-Fi (SC), Thriller (TH), War (WA). Best viewed digitally.

5. Conclusions

In this paper we introduced a new model architecture for reasoning about multiple modalities that is more natural and less restrictive than previous concatenation-based fusion approaches. Our method utilizes unordered sets in which any number of features are pooled together for down-stream tasks. When features aren't present, we do not need any unnatural placeholders, and when features occur a variable number of times, we do not need padding. We demonstrate new SOTA performance on challenging datasets and offer an interpretable measure of feature importance when using max-pooling as the fusion scheme. This allows us to reason about both correct and incorrect predictions at inference time. Future work will include extension to video, as this is a common use of multi-modal modeling.

References

[1] J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F.A. González. Gated multimodal units for information fusion. In ICLR, 2017. 1, 3, 4, 6, 8
[2] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345–379, 2010. 2
[3] F. Borisyuk, A. Gordo, and V. Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018. 5
[4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018. 5
[5] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018. 5
[6] D.A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR 2016 Conference Track), 2015. 5
[7] Y. Cui, M. Jia, T.Y. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019. 5
[8] A. Fukui, D.H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. CoRR abs/1606.01847, 2016. 3
[9] C. Hori, T. Hori, T.Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi. Attention-based multimodal fusion for video description. In International Conference on Computer Vision, 2017. 2
[10] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 1991. 2
[11] D. Kiela, S. Bhooshan, H. Firooz, and D. Testuggine. Supervised multimodal bitransformers for classifying images and text, 2019. 2, 6, 8
[12] D. Kiela, E. Grave, A. Joulin, and T. Mikolov. Efficient large-scale multi-modal classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 2, 6, 8
[13] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014. 5
[14] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 5
[15] T.Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. PAMI, 2018. 5
[16] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 5
[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. 5
[18] X. Long, C. Gan, G. de Melo, X. Liu, Y. Li, F. Li, and S. Wen. Multimodal keyless attention fusion for video classification. In AAAI, 2018. 2
[19] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. 5
[20] D.K. Mahajan, R.B. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018. 5
[21] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, R. Prasad, and P. Natarajan. Multimodal feature fusion for robust event detection in web videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2012. 2
[22] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A.Y. Ng. Multimodal deep learning. In Proceedings of ICML, 2011. 3
[23] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019. 5
[24] D.H. Park, L.A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 3
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017. 5
[26] J.M. Pérez-Rúa, V. Vielzeuf, S. Pateux, M. Baccouche, and F. Jurie. MFAS: Multimodal fusion architecture search. In CVPR, 2019. 2, 6, 8
[27] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In ACMM, 2005. 2
[28] N. Srivastava and R.R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, 2012. 3
[29] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, 2016. 3
[30] X. Yang, P. Molchanov, and J. Kautz. Multilayer and multimodal fusion of deep neural networks for video classification. In ACMM, 2016. 2, 3
[31] G. Ye, D. Liu, I.H. Jhuo, and S.F. Chang. Robust late fusion with rank minimization. In IEEE Conference on Computer Vision and Pattern Recognition, 2012. 2
[32] M. Zaheer, S. Kottur, S. Ravanbhakhsh, B. Poczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In Conference on Neural Information Processing Systems, 2017. 3
[33] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016. 5
[34] M. Zhang, R. Hwa, and A. Kovashka. Equal but not the same: Understanding the implicit relationship between persuasive images and text. arXiv preprint arXiv:1807.08205, 2018. 1, 2, 4, 6, 7
[35] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition, 2015. 4
