
More Cat than Cute? Interpretable Prediction of Adjective-Noun Pairs

Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes


arXiv:1708.06039v1 [cs.CV] 21 Aug 2017

Dèlia Fernández* (Vilynx, Barcelona, Catalonia/Spain), Alejandro Woodward (Universitat Politècnica de Catalunya, Barcelona, Catalonia/Spain), Víctor Campos (Barcelona Supercomputing Center, Barcelona, Catalonia/Spain), Xavier Giró-i-Nieto (Universitat Politècnica de Catalunya, Barcelona, Catalonia/Spain), Brendan Jou (Columbia University, New York City, New York), Shih-Fu Chang (Columbia University, New York City, New York)

* Work partially developed while Dèlia Fernández was a visiting scholar at Columbia University.

ABSTRACT

The increasing availability of affect-rich multimedia resources has bolstered interest in understanding sentiment and emotions in and from visual content. Adjective-noun pairs (ANP) are a popular mid-level semantic construct for capturing affect via visually detectable concepts such as "cute dog" or "beautiful landscape". Current state-of-the-art methods approach ANP prediction by considering each of these compound concepts as individual tokens, ignoring the underlying relationships in ANPs. This work aims at disentangling the contributions of the adjectives and nouns in the visual prediction of ANPs. Two specialised classifiers, one trained for detecting adjectives and another for nouns, are fused to predict 553 different ANPs. The resulting ANP prediction model is more interpretable, as it allows us to study the contributions of the adjective and noun components.

CCS CONCEPTS

· Information systems → Multimedia information systems; · Computing methodologies → Scene understanding

KEYWORDS

affective computing, convolutional neural networks, compound concepts, adjective noun pairs, interpretable models

ACM Reference format:
Dèlia Fernández, Alejandro Woodward, Víctor Campos, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang. 2017. More cat than cute? Interpretable Prediction of Adjective-Noun Pairs. In Proceedings of MUSA2'17, Mountain View, CA, USA, October 27, 2017, 9 pages. https://doi.org/10.1145/3132515.3132520

Figure 1: Examples of object-oriented and scene-oriented ANPs. The hypothesis that the adjective or the noun contributes differently depending on the ANP rests on the different visual relevance of one or the other across ANPs. We distinguish between noun-oriented ANPs (top row) and adjective-oriented ANPs (bottom row).

1 INTRODUCTION

Computers are acquiring an increasing ability to understand high-level visual concepts such as objects and actions in images and videos, but they often lack an affective comprehension of such content.
Technologies have largely obviated emotion from data, while neurology demonstrates how emotions are fundamental to human experience by influencing cognition, perception and everyday tasks such as learning, communication and decision-making [18]. During the last decade, with the growing availability and popularity of opinion-rich resources such as social networks, interest in the computational analysis of sentiment has increased. Every day, Internet users post and share billions of multimedia items on online platforms to express sentiment and opinions about several topics [13], which has motivated research on automated affect understanding for large-scale multimedia [3, 12, 16]. The ability to analyze and understand this kind of information opens the door to behavioral sciences and applications such as brand monitoring or measuring advertisement effect [22].

One of the main challenges for automated affect understanding in visual media is overcoming the affective gap between low-level visual features and high-level affective semantics. Such a task goes beyond overcoming the semantic gap, i.e. recognizing the objects in an image, and poses a challenging problem in computer vision. In [3], adjective-noun pair (ANP) semantics are proposed as a mid-level representation that conveys strong affective content while being visually detectable by traditional computer vision methods, e.g. "happy dog", "misty morning" or "little girl". It has been argued that the noun in an ANP grounds the visual appearance of the concept, while the adjective works as a bias carrying most of the conveyed affect [16]. However, we hypothesize that for some ANPs the adjective may carry most of the visual cues that are key for its detection, as depicted by the examples in Figure 1. While the most salient visual cues for "cute dog" or "delicious cupcake" are related to "dog" and "cupcake", we expect that in other cases, such as "dark night" or "bright day", the visual features related to "dark" and "bright" contribute more to the detection of the ANP. The analysis developed in this paper allows classifying between noun-oriented and adjective-oriented ANPs by comparing the contribution of the adjective and noun concepts to the final prediction.

The examples depicted in Figure 1 point at a second factor influencing the relative contributions of adjectives and nouns. Focusing on the nouns only, it can be observed that some of them are related to the objects depicted in the images (e.g. dog and cupcake), while other nouns are more related to the whole scene represented by the image (e.g. night and day). Our analysis will also discuss the behavior of adjective and noun contributions from this perspective.

The prediction of these adjective and noun structured labels has traditionally been addressed with single-branch classifiers, ignoring the particular structure of these pairs. In order to verify our hypothesis, we propose fusing specialized adjective and noun detectors and then analyzing their contributions by means of state-of-the-art methods. The proposed two-stage training process allows us to decompose the decision of the final classifier in terms of the contribution of different understandable concepts, i.e. the adjectives and nouns in the dataset.
Given these contribution results, a thorough analysis is performed in order to understand how the classifier leverages the information coming from each of the branches and to shed some light on the particularities of the ANP detection task.

The contributions of this work include (1) a new ANP classifier architecture that provides performance comparable to the state of the art while allowing interpretability, (2) a method to evaluate adjective and noun contributions in ANP prediction, and (3) an extended analysis of the contributions from a numerical and semantic point of view.

This paper is structured as follows. Section 2 reviews the related models in ANP prediction and the previous works on decomposing this task into adjective and noun classification. We present our interpretable model ANPnet in Section 3, which is trained with the dataset and parameters described in Section 4. Accuracy results are presented in Section 5, while the contributions in terms of adjectives and nouns are discussed in Section 6. Final conclusions and discussions are contained in Section 7. Source code and models are available at https://imatge-upc.github.io/affective-2017-musa2/.

2 RELATED WORK

Automated sentiment and emotion detection from visual media has received increasing interest from the multimedia community during the past years. These tasks have been addressed with traditional low-level feature based techniques, such as color features [11], SIFT-based Bag of Words [19] and aesthetic effects [25]. Due to the success of deep learning techniques in reference vision benchmarks [9, 17, 23], Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have replaced and surpassed handcrafted features for affect-related tasks [4, 26, 27].

Methods for detecting adjective-noun pairs (ANP) have strongly relied on state-of-the-art vision algorithms. Similarly to other computer vision tasks, early approaches based on handcrafted features [2] were soon replaced with CNNs [5, 15, 16]. CNNs have proven their efficiency for large-scale image datasets [9, 10, 17, 23]. DeepSentiBank [5] presented the first application of CNNs for ANP prediction. MVSO detector banks [15] showed a performance improvement by using a more modern architecture, GoogLeNet [23], which also reduced the number of parameters of the model.

The multi-task nature of the ANP detection task has been exploited by using a fan-out architecture, where a first set of layers is shared for all tasks and then splits into different network heads that specialize on each task [14]. Inter-task influence is increased through the use of cross-residual connections between the different heads. Although this approach improves on the single-task models, the hierarchical structure of ANPs is not explicitly encoded in the architecture, and the influence of the adjective and noun branches on the ANP detection lacks interpretability as compared to the approach presented in this work.

Factorized nets [21] explicitly leverage the hierarchical nature of ANPs in the model architecture. Decomposing the ANP detection task into factorized adjective and noun classification problems allows the model to classify unseen adjective and noun combinations that are not available in the training set. However, the use of an M-dimensional latent space for adjectives and nouns complicates the interpretability of their combination in terms of understandable semantic concepts.
The task of sentiment analysis is addressed with Deep Coupled Adjective and Noun neural networks (DCAN) [24] by learning a mid-level visual representation from two convolutional networks jointly trained with VSO. One of these networks is specialized in adjectives and the other one in nouns. The learned representations show superior performance in the task of sentiment analysis, but do not provide an interpretation of which concepts triggered the predictions. The model proposed in this work follows a fan-in architecture which allows us to understand and decompose the final classification in terms of the contribution of specific adjectives and nouns.

3 ANPNET

This section presents ANPnet, an interpretable architecture for ANP detection constructed by fusing the outputs of two specialized networks for adjective and noun classification. Given an input image x, we estimate its corresponding adjective, noun and ANP labels, y_adj, y_noun and y_ANP, as

ŷ_adj = f_adj(x)    (1)
ŷ_noun = f_noun(x)    (2)
ŷ_ANP = g(ŷ_adj, ŷ_noun)    (3)

where f_adj, f_noun and g are parametrized by neural networks, namely AdjNet, NounNet and the Fusion network. We aim at studying the contribution of the different adjectives and nouns by analyzing the behavior of g with respect to its inputs. The method for computing such contributions is described in Section 6.

The architecture of the specialized networks is based on the well-known ResNet-50 model [9]. Residual Networks are convolutional neural network architectures that introduce residual functions with reference to the layer inputs, achieving better results than their non-residual counterparts. All residual layers use "option B" shortcut connections as described in [9], where projections are only used when matching dimensions (1 × 1 convolutions with stride 2) and the other shortcuts are identity. This architecture represents a good trade-off between classification accuracy and computational cost. Besides, it allows a comparison in terms of accuracy with the ResNet-50 network trained for ANP detection in [14]. The last layer in ResNet-50, originally predicting the 1,000 classes in ImageNet, is replaced in AdjNet and NounNet to predict, respectively, the adjectives and nouns considered in our dataset. Each of these two networks is trained separately with the same images and the corresponding adjective or noun labels.

The probability outputs of AdjNet and NounNet are fused by means of a fully connected neural network with a ReLU non-linearity. On top of that, a softmax linear classifier predicts the detection probabilities for each ANP class considered in the model. Figure 2 depicts the deeper layers of ANPnet, showing how AdjNet and NounNet are fused by a fully connected layer of 1,024 neurons. This size is chosen to allow the network to learn sparse representations between the neighboring layers of smaller sizes. Inputs to the fusion layer are whitened by computing the mean and standard deviation over all the samples in the training set. The number of output neurons in AdjNet and NounNet is determined by the number of adjective and noun classes in our dataset, which are 117 and 167, respectively. The next sections present how ANPnet was trained to simultaneously provide competitive and interpretable results.

Figure 2: Interpretable model for ANP detection.
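As a concrete illustration of this fusion stage, a minimal tf.keras sketch is given below. It is not the authors' released TensorFlow code: the layer names are invented, and the whitening statistics are placeholders that would be computed over the training set.

```python
import numpy as np
import tensorflow as tf

NUM_ADJ, NUM_NOUN, NUM_ANP = 117, 167, 553

# Placeholder whitening statistics; in practice these are the mean and standard
# deviation of the concatenated AdjNet/NounNet probabilities over the training set.
train_mean = np.zeros(NUM_ADJ + NUM_NOUN, dtype="float32")
train_std = np.ones(NUM_ADJ + NUM_NOUN, dtype="float32")

# Probability outputs of the two specialized networks.
adj_probs = tf.keras.Input(shape=(NUM_ADJ,), name="adjnet_probs")
noun_probs = tf.keras.Input(shape=(NUM_NOUN,), name="nounnet_probs")

# Concatenate and whiten the inputs to the fusion layer.
concat = tf.keras.layers.Concatenate(name="adj_noun_concat")([adj_probs, noun_probs])
whitened = (concat - train_mean) / train_std

# Fusion network: a fully connected layer of 1,024 units with ReLU,
# followed by a softmax linear classifier over the 553 ANPs.
hidden = tf.keras.layers.Dense(1024, activation="relu", name="fusion_fc")(whitened)
anp_probs = tf.keras.layers.Dense(NUM_ANP, activation="softmax", name="anp_softmax")(hidden)

fusion_net = tf.keras.Model([adj_probs, noun_probs], anp_probs, name="anpnet_fusion")
fusion_net.summary()
```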
4 EXPERIMENTAL SETUP

The ANPnet network presented in Section 3 was trained with a subset of the Visual Sentiment Ontology (VSO) [3] dataset. Firstly, AdjNet and NounNet were trained independently for adjective and noun prediction, respectively, and in a second stage the fusion layers were trained to predict the ANPs in the input image. This section describes in detail the dataset used and the training parameters of the whole architecture.

4.1 Dataset

The presented work uses a subset of the Visual Sentiment Ontology (VSO) [3] dataset, the same part used in [14], to facilitate the comparison in terms of accuracy. The original VSO dataset contains over 1,200 different ANPs and about 500k images retrieved from the social multimedia platform Flickr (https://www.flickr.com). Those images were found by querying the Flickr search engine with keywords from Plutchik's Wheel of Emotions [7], a well-known emotion model derived from psychological studies. This wheel contains 24 basic emotions, such as joy, fear or anger. The discovery of affect-related ANPs was based on their co-occurrences with the emotion tags. The initial list of ANP candidates was manually filtered in order to ensure semantic correctness, sentiment strength and popular usage on Flickr. Finally, each resulting ANP concept was used to query Flickr again and build in this way a collection of images and metadata associated with the ANP.

The full VSO dataset presents certain limitations already pointed out in [14]. First, some adjective-noun pair concepts are singletons and do not share any adjectives or nouns with other concept pairs. Also, some nouns are massively over-represented and there are far fewer adjectives to compose the adjective-noun pairs. Due to these drawbacks, our experiments are based on a subset of VSO built according to the more restrictive constraints proposed in [14]. The considered subset of ANPs satisfies the following conditions: (1) the adjective is paired with at least two more different nouns, (2) the nouns are not overwhelmingly biasing or abstract, and (3) all ANPs are associated with 500 or more images. The final VSO subset contains 167 nouns and 117 adjectives that form 553 adjective-noun pairs over 384,258 Flickr images. A stratified 80-20 split is performed, resulting in 307,185 images for training and 77,073 for test. The partition used in our experiments is the same for which results in [14] are reported (dataset splits were obtained through personal communication with the authors of [14]).

Figure 3: Examples of two of the best detected ANPs (a: dying rose, b: young deer) and two of the worst detected ANPs (c: nice scene, d: peaceful morning). The visual variance of the ANPs can be noticed with contrasting examples.

Figure 4: Histogram of top-5 accuracy for ANP prediction with ANPnet.

4.2 Training

All CNNs were trained using stochastic gradient descent with a momentum of 0.9, on batches of 128 samples and with a learning rate of 0.01. Data augmentation, consisting of random crops and/or horizontal flips of the input images, together with ℓ2-regularization with a weight decay rate of 10^-4, is used in order to reduce overfitting. When possible, layers were initialized using weights from a model pre-trained on ImageNet [6]. Otherwise, weights were initialized following the method proposed by Glorot and Bengio [8] and the initial value for the biases was set to 0.

The training is performed in two stages. First, AdjNet and NounNet are trained independently for the tasks of adjective and noun classification, respectively. The learned weights are then frozen in order to train the fusion network on top of the specialized CNNs. Thanks to this two-step training strategy, the inputs to the fusion layer become an intermediate and semantically interpretable representation. All experiments were run using two NVIDIA GeForce GTX Titan X GPUs and implemented with TensorFlow [1].
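A rough tf.keras sketch of this two-stage recipe follows. It only illustrates the training strategy under the hyperparameters listed above; data pipelines and augmentation are omitted, and all names are placeholders rather than the authors' code.

```python
import tensorflow as tf

def build_branch(num_classes, name):
    """ResNet-50 backbone pre-trained on ImageNet, with the original 1,000-way
    classifier replaced by a softmax over the adjective (117) or noun (167) classes."""
    backbone = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
    head = tf.keras.layers.Dense(num_classes, activation="softmax",
                                 kernel_regularizer=tf.keras.regularizers.l2(1e-4))
    return tf.keras.Sequential([backbone, head], name=name)

adj_net, noun_net = build_branch(117, "AdjNet"), build_branch(167, "NounNet")
make_sgd = lambda: tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Stage 1: train AdjNet and NounNet independently on the same images,
# with adjective and noun labels respectively (datasets not shown).
adj_net.compile(optimizer=make_sgd(), loss="categorical_crossentropy", metrics=["accuracy"])
noun_net.compile(optimizer=make_sgd(), loss="categorical_crossentropy", metrics=["accuracy"])
# adj_net.fit(adj_dataset, ...); noun_net.fit(noun_dataset, ...)

# Stage 2: freeze both branches so their probability outputs stay interpretable,
# then train only the fusion layers for ANP classification.
adj_net.trainable = False
noun_net.trainable = False

image = tf.keras.Input(shape=(224, 224, 3))
fused = tf.keras.layers.Concatenate()([adj_net(image), noun_net(image)])  # whitening omitted
hidden = tf.keras.layers.Dense(1024, activation="relu")(fused)
anp_probs = tf.keras.layers.Dense(553, activation="softmax")(hidden)

anp_net = tf.keras.Model(image, anp_probs, name="ANPnet")
anp_net.compile(optimizer=make_sgd(), loss="categorical_crossentropy", metrics=["accuracy"])
# anp_net.fit(anp_dataset, batch_size=128, ...)
```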
5 ANP PREDICTION

The performance of our interpretable model in terms of ANP detection is presented in Table 1. This table shows a decrease in accuracy of ANPnet with respect to a ResNet-50 fine-tuned end-to-end for ANP prediction. The table includes the results of two sets of ResNet-50-based models: the ones published in [14] and new ones obtained with the training parameters described in Section 4.2. The similar values obtained by our models with respect to the ones reported in [14] confirm that the training hyperparameters were appropriate.

Model               Task   Classes   top-1   top-5
AdjNet [14]         Adj    117       28.45   57.87
AdjNet              Adj    117       27.70   57.00
NounNet [14]        Noun   167       41.64   69.81
NounNet             Noun   167       41.50   69.20
ResNet-50 [14]      ANP    553       22.68   47.82
ResNet-50           ANP    553       23.40   48.20
Non-Interpretable   ANP    553       21.80   46.00
ANPnet              ANP    553       20.67   43.28

Table 1: Detection accuracy for adjectives, nouns and ANPs, in %.

According to the baseline set by our ResNet-50 for ANP prediction, the loss of accuracy associated with building ANPnet is of 3.8% for top-1 accuracy and 4.7% for top-5. The loss is also present when comparing ANPnet with a non-interpretable version of the same architecture, for which the output layers of AdjNet and NounNet are also initialized randomly and trained. In this case the drop in accuracy is of 2.2% when considering top-1 and 2.5% for top-5. With this setup, the network is not forced to use adjective and noun probabilities as an intermediate representation and has additional degrees of freedom to optimize for the target task of ANP classification. The decrease in accuracy of the interpretable model with respect to this latter configuration is therefore expected. These results quantify the price to pay in terms of accuracy for making our model interpretable.

A more general overview of ANPnet performance in terms of top-5 accuracy is presented in Figure 4 as a histogram of values over the full test dataset. The distribution of accuracies across the different ANPs follows a Gaussian-like distribution centered around the average score of 43.28%.

The results in Table 1 report accuracy in terms of both top-1 and top-5 because ANP prediction can be highly affected by synonyms. Adjective concepts such as "smiling" and "happy", or noun concepts like "cat" and "animal", are considered completely different by the accuracy metric, so relaxing the detection criterion by considering the top-5 predictions may provide a metric that better expresses the obtained results [15]. This observation motivates the use of top-5 accuracy as the reference metric for evaluating the correctness of the predictions in the remaining sections of this work.
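For reference, top-5 accuracy can be computed with a few lines of NumPy; this is a generic sketch, not the evaluation code released by the authors.

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """probs: (num_images, num_classes) predicted probabilities;
    labels: (num_images,) ground-truth class indices.
    Returns the fraction of images whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]        # indices of the k largest scores per image
    hits = (top_k == labels[:, None]).any(axis=1)    # True if the ground truth is among them
    return hits.mean()

# Example with random scores over the 553 ANP classes:
rng = np.random.default_rng(0)
print(top_k_accuracy(rng.random((1000, 553)), rng.integers(0, 553, size=1000), k=5))
```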
The accuracy results in Table 1 also show that adjectives are more difficult to detect than nouns: their accuracy values are lower even though there are fewer adjective classes than noun classes. There are two reasons for this gap in performance. The first one is that adjectives usually describe more abstract concepts than nouns, with a larger associated visual variance. For example, there may be a wide range of visual features required to describe the concept "happy". The second reason is that ResNet-50 was initially trained for object classification on ImageNet, a type of concept that is associated with nouns.

A closer look at the results allows us to distinguish which of the 553 considered ANPs can be better detected and which ones present more problems. Table 2 presents the ANPs with the best and worst top-5 accuracy predictions with ANPnet, comparing their accuracy per ANP with the individual accuracies of their composing adjectives and nouns. Qualitative results for two of the best and worst detected ANPs are depicted in Figure 3. These results show how the best detected ANPs correspond to object-oriented nouns, i.e. well-defined entities with a low variation from a visual perspective and usually represented by a localized region within the image. These would be the cases of "river", "deer" or "mushrooms". On the other hand, the worst predictions are associated with scene-oriented ANPs, which depict more abstract concepts and thus have a larger visual variance, such as "places", "view" or "scene" itself. The top-5 accuracy tends to be significantly better in the case of object-oriented nouns than in the case of scene-oriented ones. This difference in performance may be related to using a ResNet-50 pre-trained on the ImageNet dataset for AdjNet and NounNet. ImageNet is a dataset built for object classification, so our model is more specialized in this type of class than in those related to scenes [28].

ANPnet allows a finer analysis in the form of a co-detection matrix. The contents of the matrix provide the percentage of images for which the adjective, noun or ANP is correctly detected (columns) among those images for which the adjective, noun or ANP has been correctly predicted (rows). As previously stated, a detection is considered correct when the ground truth label is among the top-5, whether adjective, noun or ANP. The diagonal of the co-detection matrix corresponds to 100%, as it represents the ratio of the whole set of detected adjectives, nouns or ANPs with respect to itself. The rest of the values do not necessarily reach 100% as, for example, a correct detection of an ANP among the top-5 predictions does not imply that the composing adjective or noun was among the top-5 predicted adjectives or nouns. The co-detection matrix allows us to study the correlation between ANPs and their composing parts, providing insights into how a correct detection of adjectives and nouns is related to a correct detection of an ANP.

The co-detection matrix of ANPnet is presented in Table 3. Its results indicate that the detection of an ANP also implies in many cases a correct detection of the adjective, in 87.47% of the cases, and of the noun, in 95.76% of the cases. On the other hand, a correct detection of the adjective or noun does not necessarily imply a correct detection of the ANP, as the ANP top-5 accuracy for these cases is 66.38% for adjectives and 59.87% for nouns. In addition, the matrix also indicates that a detection of the adjective is related to a correct prediction of the noun in 84.58% of the samples, but the inverse situation only holds for 69.67% of the considered images.
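The co-detection matrix itself reduces to conditional frequencies of per-image top-5 hits. A small NumPy sketch, with random booleans standing in for the real per-image evaluation results, is shown below.

```python
import numpy as np

def co_detection_matrix(adj_hit, noun_hit, anp_hit):
    """Each argument is a boolean array of shape (num_images,) telling whether the
    ground-truth adjective, noun or ANP was among the top-5 predictions for that image.
    Entry (row, col) is the percentage of images with a correct 'col' detection among
    those with a correct 'row' detection."""
    hits = np.stack([adj_hit, noun_hit, anp_hit])           # shape (3, num_images)
    matrix = np.zeros((3, 3))
    for r in range(3):
        for c in range(3):
            matrix[r, c] = 100.0 * hits[c][hits[r]].mean()  # P(col correct | row correct)
    return matrix  # rows/cols ordered as [Adj, Noun, ANP]

# Toy example with random hits (the real values come from the test-set evaluation):
rng = np.random.default_rng(0)
adj, noun, anp = (rng.random(1000) < p for p in (0.57, 0.69, 0.43))
print(np.round(co_detection_matrix(adj, noun, anp), 2))
```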
6 ADJECTIVE AND NOUN CONTRIBUTIONS

Understanding neural networks has attracted much research interest. For CNNs in particular, most methods rely on extracting the most salient points in the original image space that triggered a particular decision. The particularities of the ANPnet model presented in Section 3 allow us to interpret the ANP predictions obtained in Section 5 in terms of adjective and noun contributions.

We adopted Deep Taylor Decomposition [20] to compute the contribution of each element in ŷ_adj and ŷ_noun to the final ANP prediction ŷ_ANP (Equation 3). This method consists in computing and propagating relevance backwards in the network (i.e. from the output to the input), where layer-wise relevance is approximated by a domain-specific Taylor decomposition. Two different rules are derived for ReLU layers, namely the z+-rule and the zB-rule, which apply to layers with unbounded and bounded inputs, respectively [20]. In our model, the zB-rule is used for the relevance model between the fully connected layer in the Fusion network and the bounded adjective and noun probabilities, whereas the z+-rule is used otherwise.

6.1 Adjective Noun Ratio

This first analysis explores the nature of different ANPs depending on how much their prediction is influenced by the adjective and noun classes that compose them. For this purpose, we define the Adjective-to-Noun Ratio (ANR) as the normalized contribution of the adjectives with respect to the nouns during the prediction of an ANP. These normalized contributions are computed by summing the individual contributions of all adjectives and nouns considered by AdjNet and NounNet, and normalizing them by the total number of adjectives and nouns, respectively. Based on this definition, a uniform distribution of activations at the 117 outputs of AdjNet and another uniform distribution of activations at the 167 outputs of NounNet result in an ANR equal to one.

We present two types of analysis from the ANR perspective: one considering only correct predictions of the ANP and its composing adjective and noun, and one considering every prediction in the top-5 for every image. The ANPs with the highest and lowest ANRs are presented in Table 4, together with their top-5 accuracy values, which allows interpreting the ANR values in the context of the predictability of the adjective, noun and ANP.
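To make these rules concrete, the sketch below propagates relevance from a single ANP output back to the adjective and noun probabilities with the z+ and zB rules, and derives the ANR defined above. It is only an illustration of the rules in [20] applied to the two fusion layers: the weight matrices W1, b1, W2, b2 are hypothetical exports of the trained fusion network, the relevance is initialised with the pre-softmax score of the analysed ANP, and the inputs are treated directly as probabilities bounded in [0, 1] (the whitening step is ignored).

```python
import numpy as np

NUM_ADJ, NUM_NOUN = 117, 167

def relevance_and_anr(x, anp_index, W1, b1, W2, b2, eps=1e-9):
    """Propagate relevance from one ANP output back to the fusion inputs.

    x  : concatenated adjective+noun probabilities, shape (284,), assumed in [0, 1]
    W1 : (284, 1024) fusion FC weights, b1: (1024,) biases
    W2 : (1024, 553) ANP classifier weights, b2: (553,) biases
    Returns the relevance over the 284 inputs and the Adjective-to-Noun Ratio.
    Biases are ignored during redistribution, a common simplification in LRP sketches.
    """
    # Forward pass through the fusion network.
    h = np.maximum(0.0, x @ W1 + b1)      # hidden ReLU activations
    z = h @ W2 + b2                       # ANP pre-softmax scores

    # Relevance is initialised at the analysed ANP output.
    R_out = np.zeros_like(z)
    R_out[anp_index] = z[anp_index]

    # z+ rule: ANP layer -> hidden layer (unbounded, non-negative ReLU inputs).
    W2p = np.maximum(W2, 0.0)
    R_h = h * (W2p @ (R_out / (h @ W2p + eps)))

    # zB rule: hidden layer -> inputs bounded in [low, high] = [0, 1].
    W1p, W1m = np.maximum(W1, 0.0), np.minimum(W1, 0.0)
    denom = x @ W1 - 0.0 * W1p.sum(axis=0) - 1.0 * W1m.sum(axis=0)
    s = R_h / (denom + eps)
    R_x = x * (W1 @ s) - 0.0 * (W1p @ s) - 1.0 * (W1m @ s)

    # ANR: average adjective contribution over average noun contribution.
    anr = (R_x[:NUM_ADJ].sum() / NUM_ADJ) / (R_x[NUM_ADJ:].sum() / NUM_NOUN + eps)
    return R_x, anr

# Toy usage with random weights; the real W1, b1, W2, b2 would be exported from ANPnet.
rng = np.random.default_rng(0)
W1, b1 = 0.05 * rng.normal(size=(284, 1024)), np.zeros(1024)
W2, b2 = 0.05 * rng.normal(size=(1024, 553)), np.zeros(553)
R_x, anr = relevance_and_anr(rng.random(284), anp_index=42, W1=W1, b1=b1, W2=W2, b2=b2)
print(round(anr, 3))
```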
6.1.1 Average ANR over correct predictions. A first analysis considers high-quality predictions from two perspectives: focusing on the correctness of the ANP only, or requiring a correct prediction of the adjective and/or noun as well. As in previous sections, a prediction is considered correct if the ground truth adjective, noun or ANP is among the top-5 predictions. This analysis corresponds to columns 2 to 5 in Table 4.

Top-5 accuracies for the best detected ANPs:

ANP              Adj     Noun    ANP
gentle river     72.22   55.97   92.45
tiny bathroom    70.82   88.35   91.26
young deer       52.49   89.04   89.81
wild deer        64.34   89.04   86.49
misty road       79.42   80.16   86.49
dying rose       68.12   80.07   85.00
icy grass        78.29   69.31   84.16
tiny mushrooms   70.82   82.52   84.00
golden statue    64.56   77.12   83.61
empty train      69.72   65.42   76.60

Top-5 accuracies for the worst detected ANPs:

ANP                   Adj     Noun    ANP
abandoned places      84.58   33.93   8.21
beautiful landscape   60.49   42.68   7.41
beautiful earth       60.49   5.00    5.71
charming places       4.46    33.93   5.36
bad view              38.83   67.86   5.22
peaceful morning      26.22   53.18   4.96
peaceful places       26.22   33.93   4.59
nice scene            45.20   25.65   4.39
serene scene          18.93   25.65   3.60
bright sky            52.84   67.93   3.20

Table 2: Top-5 accuracies for the best and worst detected ANPs, together with the top-5 accuracies of their composing adjectives and nouns.

        Adj      Noun     ANP
Adj     100.00   84.58    66.38
Noun    69.67    100.00   59.87
ANP     87.47    95.76    100.00

Table 3: Adjective, noun and ANP co-detection matrix. The contents of the matrix provide the percentage of images for which the adjective, noun or ANP is correctly detected (columns) among those images for which the adjective, noun or ANP has been correctly predicted (rows).

A first observation is that the ANR takes values both higher and lower than one, indicating that some ANP predictions are more influenced by the adjective branch of ANPnet, while others are more influenced by the noun branch. This diversity allows us to classify ANPs as adjective-oriented or noun-oriented, depending on whether the ANR is higher or lower than one, respectively. A second observation is that the difficulty of predicting the adjective pushes most ANPs towards being noun-oriented: we can observe in Table 4 that all five ANPs with the lowest ANR (noun-oriented) have an adjective prediction accuracy lower than the adjective accuracies of the five ANPs with the highest ANR. Finally, filtering the results by forcing a correct prediction of the adjective and/or noun has almost no impact on the final estimated ANR. This lack of variation can be explained by the co-detection matrix previously presented in Table 3. The third row of the matrix indicates that, when an ANP is correctly detected, the adjective is also well predicted in 87.47% of the cases and the noun in 95.76%, which leaves little room for variation when samples are filtered to force that the adjective and/or noun must also be detected.

6.1.2 Average ANR over all images. The results of Section 6.1.1 indicate how each ANP prediction is affected by its composing adjective and noun in the case of correct predictions. We extend this analysis to a more generic setting where ground truth labels are not considered, so that all top-5 predicted ANPs for each image are used in estimating the ANR values. This analysis provides more samples because each image in the dataset now contributes five ANR values, one for each of the top-5 predicted ANPs. In the case of Section 6.1.1, each image could only contribute to estimating the ANR of the ground truth ANP, and only if that ANP was predicted among the top-5. The results of this analysis correspond to the sixth column in Table 4. The estimated values show slight variations with respect to the ANRs computed over correct predictions only, but the conclusions about adjective- and noun-oriented ANPs remain the same for the cases depicted in the table.

6.2 Visually equivalent ANPs

The interpretation of ANP predictions in terms of their contributing adjectives and nouns permits a novel description approach for ANPs. We define as visually equivalent those ANPs whose top-5 adjective and noun contributions are identical. Table 5 contains two pairs of visually equivalent ANPs. In the first example, "happy dog" and "smiling dog" have an identical noun and semantically very similar adjectives, while the second case presents the opposite situation, in which the same adjective "golden" is used to build the "golden autumn" and "golden leaves" ANPs. In both examples the two sets of top-5 contributing adjectives and nouns are identical, although not in the exact same order. Also, from a visual perspective, the presented examples show how the two ANPs are visually described by the same class of images.
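Given a matrix of average adjective and noun contributions per ANP (for instance accumulated from the relevance scores of the previous section), such pairs can be flagged automatically. The sketch below is illustrative only: the contribution matrices and ANP names are placeholders, and the top-5 sets are compared ignoring order, as in the examples above.

```python
import numpy as np

def top_k_set(contributions, k=5):
    """Indices of the k most contributing concepts (as a set, order ignored)."""
    return set(np.argsort(contributions)[-k:].tolist())

def visually_equivalent_pairs(adj_contrib, noun_contrib, anp_names, k=5):
    """adj_contrib: (num_anps, 117) mean adjective contributions per ANP,
    noun_contrib: (num_anps, 167) mean noun contributions per ANP.
    Two ANPs are reported as visually equivalent when their top-k contributing
    adjective sets and noun sets both coincide."""
    pairs = []
    for i in range(len(anp_names)):
        for j in range(i + 1, len(anp_names)):
            same_adj = top_k_set(adj_contrib[i], k) == top_k_set(adj_contrib[j], k)
            same_noun = top_k_set(noun_contrib[i], k) == top_k_set(noun_contrib[j], k)
            if same_adj and same_noun:
                pairs.append((anp_names[i], anp_names[j]))
    return pairs

# Toy check: two ANPs with identical contribution profiles are reported as equivalent.
rng = np.random.default_rng(0)
adj = np.tile(rng.random(117), (2, 1))
noun = np.tile(rng.random(167), (2, 1))
print(visually_equivalent_pairs(adj, noun, ["happy dog", "smiling dog"]))
```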
A more extensive list of visually equivalent ANPs is provided in Table 6, together with the ANR of each ANP. These visually equivalent ANPs also share in all cases the adjective or the noun. Notice how they also present similar ANR values, an expected behavior since the most contributing adjectives and nouns are the same for each member of the pair. These results show how our interpretable model is able to identify equivalent ANPs, which are a common case given the subjectivity and richness of an affective-aware dataset. These observations reinforce the choice of a top-5 accuracy metric instead of a more rigid top-1, as the boundaries between classes are very dim and often overlap.

                     ANR over correct predictions                     ANR over     Top-5 accuracy
ANP                  ANP      ANP+Adj   ANP+Noun   ANP+Adj+Noun       all top-5    Adj      Noun     ANP
sexy model           1.161    1.162     1.163      1.163              1.122        76.52    62.77    59.63
misty trees          1.139    1.140     1.138      1.139              1.146        79.42    71.74    71.88
abandoned places     1.121    1.121     1.033      1.033              1.018        84.58    33.93    8.21
sexy body            1.118    1.118     1.117      1.117              1.110        76.52    57.89    56.44
wild horse           1.117    1.117     1.116      1.117              1.109        54.04    88.50    58.06
innocent eyes        0.787    0.788     0.787      0.788              0.788        43.23    76.44    16.07
incredible view      0.785    0.786     0.785      0.786              0.809        30.71    67.86    39.02
tired eyes           0.776    0.778     0.776      0.788              0.784        56.13    76.44    37.50
laughing baby        0.769    0.769     0.769      0.769              0.773        72.57    83.74    69.03
chubby baby          0.764    0.764     0.764      0.764              0.786        48.00    83.74    45.60

Table 4: Highest and lowest ANRs, computed for all top-5 predictions and only for the correct predictions among the top-5.

Happy Dog        top-5 adjectives: happy, smiling, friendly, playful, funny       top-5 nouns: dog, animals, pets, grass, eyes
Smiling Dog      top-5 adjectives: smiling, happy, friendly, funny, playful       top-5 nouns: dog, eyes, pets, blonde, animals
Golden Autumn    top-5 adjectives: golden, sunny, colorful, falling, bright       top-5 nouns: autumn, leaves, trees, sunlight, tree
Golden Leaves    top-5 adjectives: golden, sunny, falling, colorful, bright       top-5 nouns: leaves, autumn, trees, sunlight, tree

Table 5: Pairs of ANPs with identical top-5 adjective and noun contributions.

ANP                        ANR      ANP                 ANR
ancient architecture       1.044    ancient building    1.075
dead fly                   1.045    dead bug            1.080
traditional architecture   1.027    traditional house   1.004
dry tree                   0.962    dying tree          0.832
tiny boat                  0.924    little boat         0.909
weird bug                  0.964    ugly bug            0.925
heavy clouds               0.921    dark clouds         0.948
beautiful clouds           0.895    beautiful sky       0.920
angry cat                  0.820    evil cat            0.816

Table 6: Pairs of visually equivalent ANPs.

6.3 Related Adjectives and Nouns

The most contributing adjectives and nouns detected by ANPnet can also be used as semantic labels themselves. In this way, our model can detect in a single pass additional concepts related to the predicted ANP, with applications to image tagging, captioning or retrieval. Table 7 shows the top-5 related adjectives and nouns of four ANPs. We verify that the top contributing adjectives and nouns correspond to the image contents by randomly picking six images in the dataset for each of the considered ANPs. In example a) it can be seen how ANPnet learned that the concepts most related to an "elegant wedding" scene are the nouns "cake", "rose", "dress" and "lady", and the adjectives "outdoor", "fresh", "tasty" and "delicious". Notice the high contribution of food adjectives such as "fresh", "tasty" and "delicious", which apply to the wedding cake and wedding meal.
In example b) we show the highest contributions for a more scene-oriented ANP, "charming places". We notice how the network has learned that adjectives describing "charming places" are often also related to "comfortable", "excellent", "traditional" and "expensive", and that elements appearing in these types of scenes are "hotel", "house", "home" and "food". Examples c) and d) show additional cases of adjectives and nouns that match the contents of the images. For example, to describe "delicious food" we could use adjectives such as "traditional", "excellent", "tasty" or "yummy", and for "golden hair" images other related concepts are "shiny", "pretty", "sexy", "smooth", "lady", "blonde", "sunlight" and "girl".

a) Elegant Wedding   top-5 adjectives: elegant, outdoor, fresh, tasty, delicious                 top-5 nouns: wedding, cake, rose, dress, lady
b) Charming Places   top-5 adjectives: charming, comfortable, excellent, traditional, expensive  top-5 nouns: hotel, places, house, home, food
c) Delicious Food    top-5 adjectives: delicious, traditional, excellent, tasty, yummy           top-5 nouns: food, cake, mushrooms, market, drink
d) Golden Hair       top-5 adjectives: golden, shiny, pretty, sexy, smooth                       top-5 nouns: hair, lady, blonde, sunlight, girl

Table 7: Examples of co-occurring concepts for the ANPs a) "elegant wedding", b) "charming places", c) "delicious food" and d) "golden hair". Notice how the most contributing concepts match the images in the dataset for a given ANP.

7 CONCLUSIONS AND FUTURE WORK

This work has presented ANPnet, an interpretable model capable of disentangling the adjective and noun contributions in the prediction of Adjective-Noun Pairs (ANPs). This tool has allowed us to validate our hypothesis that the contribution of adjectives and nouns varies depending on the ANP, and we have introduced the Adjective-to-Noun Ratio (ANR) as a measure to quantify it.

ANPnet is based on the fusion of two specialized networks for adjective and noun detection. Keeping the interpretability of the model when fusing the two specialized networks has also caused a loss of accuracy, establishing a trade-off between interpretability and performance. It has been observed that better detection accuracies are often associated with object-oriented nouns, while worse ones are related to scene-oriented nouns. As future work, one may explore pre-training AdjNet and NounNet not only with an object-oriented dataset such as ImageNet, but also with a scene-oriented one such as Places [28].

The unbalanced contributions of adjectives and nouns in ANP predictions also allow a classification into adjective- and noun-oriented ANPs. Adjective-oriented ANPs tend to be harder to detect because adjectives themselves are also harder to detect than nouns. As in the case of scene-oriented nouns, adjective-oriented ANPs may be difficult to predict because AdjNet and NounNet were pre-trained with the objects in ImageNet, which are a type of noun. In addition, qualitative results also indicate that adjective concepts are much more visually diverse than noun concepts.

Our work has also shown how different ANPs may have the same top adjective and noun contributions, allowing the detection of visually equivalent ANPs. As a final analysis, we have shown how ANPnet can also be used to generate adjective and noun labels that enrich the semantic description of the images.
The presented work, while focused on affective-aware ANPs, could be extended to any other problem of adjective and noun detection, and even to more complex cases with more composing concepts. Our interpretable model aims at contributing to the field of understanding why deep neural networks produce their predictions, in terms of intermediate activations with a straightforward semantic interpretation. On the other hand, the relationships between concepts could be used to modify the network loss during training.

As future work, the accuracy gap between the interpretable model and the baseline should be reduced. One option is to introduce multi-task learning and weight sharing between AdjNet and NounNet to reduce the number of parameters. A different architecture, where the detection of an adjective or noun could be conditioned on the prior detection of the other (or vice versa), could reduce the visual variance in the detection and help improve its performance. Finally, the introduction of external knowledge bases could be explored to exploit prior knowledge on the domain.

The source code and models used in this work are publicly available at https://imatge-upc.github.io/affective-2017-musa2/.

ACKNOWLEDGMENTS

Dèlia Fernández is funded by contract 2017-DI-011 of the Industrial Doctorate Programme of the Government of Catalonia. This work was partially supported by the Spanish Ministry of Economy and Competitiveness under contract TIN2012-34557, by the BSC-CNS Severo Ochoa program (SEV-2011-00067), and by contracts TEC2013-43935-R and TEC2016-75976-R. It has also been supported by grants 2014-SGR-1051 and 2014-SGR-1421 of the Government of Catalonia, and by the European Regional Development Fund (ERDF). We gratefully acknowledge the support of NVIDIA Corporation for the donation of the GPUs used in this work.

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems.
[2] Damian Borth, Tao Chen, Rongrong Ji, and Shih-Fu Chang. 2013. SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In ACM MM.
[3] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. 2013. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM MM.
[4] Víctor Campos, Brendan Jou, and Xavier Giró-i-Nieto. 2017. From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction. Image and Vision Computing (2017).
[5] Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. 2014. DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. arXiv:1410.8586 (2014).
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
[7] Charles E. Osgood, George J. Suci, and Percy H. Tannenbaum. 1957. The Measurement of Meaning. University of Illinois Press, Urbana.
[8] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[10] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In CVPR.
[11] Jia Jia, Sen Wu, Xiaohui Wang, Peiyun Hu, Lianhong Cai, and Jie Tang. 2012. Can we understand van Gogh's mood? Learning to infer affects from images in social networks. In ACM MM.
[12] Yu-Gang Jiang, Baohan Xu, and Xiangyang Xue. 2014. Predicting Emotions in User-Generated Videos. In AAAI.
[13] Brendan Jou. 2016. Large-scale affective computing for visual multimedia. Ph.D. Dissertation. Columbia University.
[14] Brendan Jou and Shih-Fu Chang. 2016. Deep Cross Residual Learning for Multitask Visual Recognition. In ACM MM.
[15] Brendan Jou and Shih-Fu Chang. 2016. Going Deeper for Multilingual Visual Sentiment Detection. arXiv preprint arXiv:1605.09211 (2016).
[16] Brendan Jou, Tao Chen, Nikolaos Pappas, Miriam Redi, Mercan Topkara, and Shih-Fu Chang. 2015. Visual affect around the world: A large-scale multilingual visual sentiment ontology. In ACM MM.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
[18] Richard D. Lane and Lynn Nadel. 2002. Cognitive neuroscience of emotion. Oxford University Press, USA.
[19] Bing Li, Songhe Feng, Weihua Xiong, and Weiming Hu. 2012. Scaring or pleasing: exploit emotional impact of an image. In ACM MM.
[20] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. 2017. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition (2017).
[21] Takuya Narihira, Damian Borth, Stella X. Yu, Karl Ni, and Trevor Darrell. 2015. Mapping Images to Sentiment Adjective Noun Pairs with Factorized Neural Nets. arXiv preprint arXiv:1511.06838 (2015).
[22] Rosalind W. Picard. 1997. Affective Computing. MIT Press, Cambridge.
[23] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In CVPR.
[24] Jingwen Wang, Jianlong Fu, Yong Xu, and Tao Mei. 2016. Beyond Object Recognition: Visual Sentiment Analysis with Deep Coupled Adjective and Noun Neural Networks. In IJCAI.
[25] Xiaohui Wang, Jia Jia, Peiyun Hu, Sen Wu, Jie Tang, and Lianhong Cai. 2012. Understanding the emotional impact of images. In ACM MM.
[26] Quanzeng You, Liangliang Cao, Hailin Jin, and Jiebo Luo. 2016. Robust Visual-Textual Sentiment Analysis: When Attention meets Tree-structured Recursive Neural Networks. In ACM MM.
[27] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2015. Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks. In AAAI.
[28] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using Places database. In NIPS.