ABSTRACT
We participated in the matching and ranking subtask of the video-to-text description task in TRECVID 2018. The goal of this task is to return, for each video in the test set, a ranked list of the text descriptions most likely to correspond to it. We trained joint visual-semantic embedding models on image-text pairs from an image-captioning dataset and applied them to the video-text retrieval task using key frames extracted from the videos by a sparse subset selection approach. Our retrieval system performed reasonably well across all the test sets. Our best system, which uses a late fusion of the similarity scores obtained from the key frames of a video, achieved a mean inverted rank (MIR) score of 0.225 on test set C, and we ranked 4th overall on this task.
INTRODUCTION
Joint embeddings are widely used in multimedia data analysis and retrieval because they can bridge the gap between different modalities [1,2,3,4,5,6]. A joint embedding is learned by projecting semantically associated inputs from two or more domains (e.g., images and text) into a common space, so that the embedding captures the underlying correspondence between the domains. In this work, we focus on the cross-modal video-text retrieval task using joint image-text embeddings, and we build on the weighted pair-wise ranking loss of [5] to train them. Performance is evaluated using the mean inverted rank (MIR) at which the annotated item is found.
Existing video-text datasets are very small relative to the diversity of the visual world and the enormous variety of descriptions humans can compose. Our retrieval approach is therefore based on joint embeddings trained on image-captioning datasets, which are significantly larger and more varied than video-captioning datasets. We believe that models trained on image-captioning sets are more likely to generalize across datasets to the TRECVID 2018 test set than models trained on the smaller video-captioning datasets. Moreover, the TRECVID test set contains short Vine videos, and a few key frames are often enough to summarize most of them. In this work, we extract a fixed number of key frames from each video and employ a joint image-text embedding model to perform retrieval using those frames.
SYSTEM OVERVIEW
We treat the problem as matching key frames from a video to text descriptions in a joint image-text embedding space, following [7]. We adopt the approach proposed in [5] to learn the joint embedding using the MS-COCO image-captioning dataset [8]. At retrieval time, given the key frames of a query video, we compute a similarity score between each frame and every sentence in the dataset using the joint embedding model, and use a fusion of these per-frame scores for the final ranking. The key frames are extracted from each video with a dissimilarity-based subset selection approach [9].
Training Joint Embedding
Joint visual-semantic embedding models are trained to project visual and textual features into a common space [3,10,5,11]. The embedding is learned such that similarity in the joint space reflects the semantic closeness between images and their corresponding text. In this work, we follow the pair-wise ranking loss based approach of [5] for training the joint space. The network is trained by minimizing a weighted ranking loss that emphasizes hard negatives: it maximizes the similarity between an image embedding $x^{(v)}$ and its corresponding text embedding $x^{(t)}$, while minimizing the similarity to the non-matching embedding with the highest similarity score. The optimization problem can be written as
$$
\min_{\theta} \sum_{(x^{(v)},\, x^{(t)})} L(r_v)\big[\alpha - S(x^{(v)}, x^{(t)}) + S(x^{(v)}, x^{(t)}_n)\big]_+ + L(r_t)\big[\alpha - S(x^{(v)}, x^{(t)}) + S(x^{(v)}_n, x^{(t)})\big]_+ \qquad (1)
$$

Here, in Eqn. 1, $[f]_+ = \max(0, f)$ and $L(\cdot)$ is a weighting function. For an image embedding $x^{(v)}$, $r_v$ is the rank of the matching sentence $x^{(t)}$ among all compared sentences; similarly, for a text embedding $x^{(t)}$, $r_t$ is the rank of the matching image embedding $x^{(v)}$ among all compared images in the batch. The weighting function is defined as $L(r) = 1 + \beta/(N - r + 1)$, where $N$ is the number of compared images and $\beta$ is the weighting factor. For a positive pair $(x^{(v)}, x^{(t)})$, the hardest negative text sample $x^{(t)}_n$ is the negative text with the highest similarity score to the image embedding $x^{(v)}$ in the batch; similarly, the hardest negative image sample $x^{(v)}_n$ is the negative image with the highest similarity score to $x^{(t)}$ in the batch. $\alpha$ is the margin of the loss function, and $S(\cdot,\cdot)$ is the similarity function used to measure the similarity between images and text in the embedding space.
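To make the loss concrete, below is a minimal PyTorch-style sketch of a weighted hard-negative ranking loss of this form. It assumes that the image and text embeddings are already L2-normalized so a dot product gives cosine similarity, and the function and argument names (`weighted_ranking_loss`, `margin`, `beta`) are illustrative rather than taken from our implementation.

```python
import torch

def weighted_ranking_loss(img_emb, txt_emb, margin=0.2, beta=8.0):
    """Weighted pair-wise ranking loss with hardest negatives (sketch).

    img_emb, txt_emb: (N, D) L2-normalized embeddings of matching pairs,
    so row i of img_emb corresponds to row i of txt_emb.
    """
    N = img_emb.size(0)
    sim = img_emb @ txt_emb.t()                # (N, N) cosine similarities
    pos = sim.diag().view(N, 1)                # similarity of matching pairs

    # Rank of the matching item (1 = best) and the weights L(r) = 1 + beta/(N - r + 1).
    rank_v = (sim > pos).sum(dim=1) + 1        # rank of matching sentence per image
    rank_t = (sim > pos.t()).sum(dim=0) + 1    # rank of matching image per sentence
    w_v = 1.0 + beta / (N - rank_v.float() + 1.0)
    w_t = 1.0 + beta / (N - rank_t.float() + 1.0)

    # Mask out the positives before searching for the hardest negatives.
    mask = torch.eye(N, dtype=torch.bool, device=sim.device)
    neg_txt = sim.masked_fill(mask, -1e9).max(dim=1).values  # hardest negative text per image
    neg_img = sim.masked_fill(mask, -1e9).max(dim=0).values  # hardest negative image per text

    loss_v = torch.clamp(margin - pos.squeeze(1) + neg_txt, min=0)
    loss_t = torch.clamp(margin - pos.squeeze(1) + neg_img, min=0)
    return (w_v * loss_v + w_t * loss_t).mean()
```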
The embedding model is trained on image-caption pairs from the MS-COCO dataset [12] using a two-branch network, where one branch takes visual features as input and the other takes text features. In this work, ResNet-152 is used to encode visual features [13] and a GRU-based text encoder is used to encode captions [14]. Cosine similarity is used to measure the similarity between the embedded vectors.
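For reference, a minimal sketch of such a two-branch embedding network is shown below. It assumes a recent torchvision with a frozen, pre-trained ResNet-152 as the image encoder and a single-layer GRU over word embeddings as the text encoder; the class name, layer sizes, and hyperparameters are placeholders, not the exact configuration we used.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageTextEmbedding(nn.Module):
    """Two-branch joint embedding: image branch (ResNet-152) and text branch (GRU)."""

    def __init__(self, vocab_size, word_dim=300, embed_dim=1024):
        super().__init__()
        # Image branch: ResNet-152 features projected into the joint space.
        resnet = models.resnet152(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
        for p in self.cnn.parameters():
            p.requires_grad = False                               # keep the CNN frozen
        self.img_fc = nn.Linear(2048, embed_dim)

        # Text branch: word embeddings followed by a GRU encoder.
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224), captions: (B, T) token ids
        img = self.cnn(images).flatten(1)           # (B, 2048)
        img = self.img_fc(img)                      # (B, embed_dim)
        _, hidden = self.gru(self.embed(captions))  # hidden: (1, B, embed_dim)
        txt = hidden.squeeze(0)

        # L2-normalize so cosine similarity is a dot product in the joint space.
        img = nn.functional.normalize(img, dim=1)
        txt = nn.functional.normalize(txt, dim=1)
        return img, txt
```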
Key frame Extraction
Key frame extraction is another major step in our retrieval pipeline. The goal of this step is to find a small subset of representative frames from a video: the selected frames should cover the entire video while having enough variety among themselves. Recently, sparse coding based techniques have been shown to be highly successful at finding an informative subset of a large number of data points [15,9]. In this work, we adopt the approach proposed in [9], which uses sparse coding to find a representative subset of a source set that describes a target set, given pairwise relationships between the sets. Here, we consider the special case where the source and target sets are the same, i.e., the problem of finding representatives of a set X given the pairwise dissimilarities D between the elements of X.
Following [9], subset selection is formulated as a row-sparsity regularized trace minimization problem, where the regularization parameter controls the trade-off between the number of representatives and the cost of encoding the original set via those representatives. The algorithm returns a small set of representative frames together with confidence scores indicating how strongly the frames of the original video are associated with each representative. For the frames of a video, we extract features using a pre-trained AlexNet CNN [16], and we compute dissimilarity scores with a Euclidean distance based measure. Since we are dealing with short videos, we limit the number of representatives to four.
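The sketch below illustrates this step under simplifying assumptions: generic per-frame CNN features stand in for the exact AlexNet layer we used, and a greedy k-medoids-style selection over the Euclidean dissimilarity matrix stands in for the row-sparsity regularized optimization of [9]. It selects at most four representative frames per video and does not model the confidence scores returned by the actual optimization.

```python
import numpy as np

def select_keyframes(features, num_keyframes=4):
    """Pick representative frames from per-frame CNN features (greedy stand-in for [9]).

    features: (num_frames, feat_dim) array, e.g. AlexNet activations per frame.
    Returns the indices of up to `num_keyframes` representative frames.
    """
    # Pairwise Euclidean dissimilarity between frames.
    diff = features[:, None, :] - features[None, :, :]
    D = np.linalg.norm(diff, axis=-1)            # (num_frames, num_frames)

    selected = []
    for _ in range(min(num_keyframes, len(features))):
        best_idx, best_cost = None, np.inf
        for i in range(len(features)):
            if i in selected:
                continue
            # Encoding cost if frame i is added: every frame is represented
            # by its closest already-selected representative.
            cost = D[:, selected + [i]].min(axis=1).sum()
            if cost < best_cost:
                best_idx, best_cost = i, cost
        selected.append(best_idx)
    return selected
```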
Dataset
The TRECVID test dataset [17] contains 1,921 randomly selected Vine videos. The videos are short, with a duration of less than 10 seconds in most cases, and each video is annotated with sentences by 5 different annotators. We did not use the Vine videos provided by NIST for training our joint embedding model. Instead, we train the joint image-text embedding model on the MS-COCO dataset [12], whose training set contains about 82K images, each paired with 5 captions.
Video-Text Retrieval Performance
We submitted four runs for the matching and ranking task, all based on the key frames extracted from the videos. Method-1 ranks videos by the average of the similarity scores of their key frames. Method-2 ranks videos by the maximum similarity score over their key frames. Method-3 reports the MIR obtained using key frame 2 only, and Method-4 the MIR obtained using key frame 4 only. Table 1 reports the performance of our approach on the TRECVID video-to-text (VTT) annotation sets A, B, C, D, and E. We observe that our method performs consistently across the test sets, and that Method-1, which averages the scores of all key frames of a video for the final ranking, performs best.
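As an illustration of the fusion and scoring step, the sketch below ranks candidate sentences for one video by either averaging or taking the maximum of the per-keyframe similarity scores (corresponding to Method-1 and Method-2), and computes the MIR over a set of videos. The function names and the assumption of L2-normalized embeddings are illustrative.

```python
import numpy as np

def rank_sentences(frame_embs, sent_embs, fusion="mean"):
    """Rank candidate sentences for one video from its key-frame embeddings.

    frame_embs: (K, D) L2-normalized key-frame embeddings of the video.
    sent_embs:  (S, D) L2-normalized sentence embeddings of all candidates.
    fusion:     "mean" (Method-1 style) or "max" (Method-2 style).
    Returns sentence indices sorted from best to worst match.
    """
    scores = frame_embs @ sent_embs.T            # (K, S) cosine similarities
    fused = scores.mean(axis=0) if fusion == "mean" else scores.max(axis=0)
    return np.argsort(-fused)

def mean_inverted_rank(ranked_lists, ground_truth):
    """MIR: average of 1 / (rank of the annotated sentence) over all videos."""
    inv_ranks = []
    for ranking, gt in zip(ranked_lists, ground_truth):
        rank = int(np.where(ranking == gt)[0][0]) + 1   # 1-based rank of the ground truth
        inv_ranks.append(1.0 / rank)
    return float(np.mean(inv_ranks))
```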
Table 1: Model performance on the TRECVID VTT test sets.
CONCLUSION
This work focused on utilizing visual-semantic embedding models for the TRECVID video-to-text matching and ranking task. We proposed an approach that employs a joint image-text embedding model together with a few key frames extracted from each video. Experiments on the TRECVID 2018 test sets demonstrate that this simple yet efficient approach is promising, as it consistently achieves performance comparable to state-of-the-art methods.