2017
We participated in the matching and ranking subtask of the TRECVid 2017 challenge. The task was to return a ranked list of the most likely text descriptions corresponding to each video. We adopted a joint visual-semantic embedding approach for image-text retrieval and applied it to the video-text retrieval task, using key-frames extracted by a dissimilarity-based sparse subset selection approach. We trained our system on the MS-COCO dataset and tested on the TRECVid dataset. Our approach achieved an average mean inverted rank score of 0.255 across the 4 sets of testing data, and we ranked 3rd overall in the challenge on this task.
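Mean inverted rank rewards placing the correct description near the top of each video's ranked list. The sketch below (plain Python; the dictionary inputs are illustrative assumptions, not the authors' data structures) shows how such a score is computed: a correct caption ranked first contributes 1.0, ranked fourth contributes 0.25.

```python
def mean_inverted_rank(ranked_lists, ground_truth):
    """ranked_lists: {video_id: [caption_id, ...]} best-first;
       ground_truth: {video_id: correct caption_id}.
       Assumes every correct caption appears somewhere in its list."""
    total = 0.0
    for vid, ranking in ranked_lists.items():
        rank = ranking.index(ground_truth[vid]) + 1  # 1-based rank
        total += 1.0 / rank
    return total / len(ranked_lists)

# Correct caption ranked 4th for the only video -> score 0.25
print(mean_inverted_rank({"v1": ["c3", "c7", "c2", "c1"]}, {"v1": "c1"}))
```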
2018
This paper describes our participation in the ad-hoc video search and video-to-text tasks of TRECVID 2018. In ad-hoc video search, we adapted an image-based visual semantic embedding approach and trained our model on the combined MS COCO and Flickr30k datasets. We extracted multiple keyframes from each shot and performed similarity search using the computed embeddings. In the video-to-text description generation task, we trained a video captioning model with multiple features using a reinforcement learning method on the combination of the MSR-VTT and MSVD video captioning datasets. For the matching and ranking subtask, we trained two types of image-based ranking models on the MS COCO dataset.
1 Ad-hoc Video Search (AVS)
In the ad-hoc video search task, we are given 30 free-text queries and are required to return the top 1000 shots from the test set videos [1, 2]. The queries are given in Appendix A. The test set contains 4593 Internet Archive videos of 600 hours with 450K shots (publicly availabl...
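A minimal sketch of the keyframe-based similarity search described above, assuming L2-normalized embeddings in a shared space (the array shapes and the max-over-keyframes pooling are assumptions; the excerpt does not spell out the pooling):

```python
import numpy as np

def rank_shots(query_vec, shot_keyframe_vecs, top_k=1000):
    """query_vec: (d,) L2-normalized text embedding.
       shot_keyframe_vecs: {shot_id: (n_keyframes, d) normalized embeddings}."""
    scores = {
        shot_id: float(np.max(frames @ query_vec))  # score shot by its best keyframe
        for shot_id, frames in shot_keyframe_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```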
2019
In this paper we present an overview of our participation in TRECVID 2019 [1]. We participated in the Ad-hoc Video Search (AVS) task and in the Description Generation and Matching and Ranking subtasks of the Video to Text (VTT) task. First, for the AVS task, we develop a system architecture that we call “Word2AudioVisualVec++” (W2AVV++), based on Word2VisualVec++ (W2VV++) [11], which in addition to deep visual features of videos also uses deep audio features obtained from pre-trained networks. Second, for the VTT Matching and Ranking task, we develop another deep learning model based on Word2VisualVec++, extracting temporal information from the video using Dense Trajectories [16] and a clustering approach to encode them into a single vector representation. Third, for the VTT Description Generation task, we develop an Encoder-Decoder model incorporating semantic states into the Encoder phase.
1 Ad-hoc Video Search
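As a rough sketch of the W2AVV++ idea of pairing deep visual features with deep audio features before embedding (the mean pooling and concatenation below are assumptions for illustration; the pre-trained networks themselves are not modeled):

```python
import numpy as np

def audiovisual_vector(frame_feats, audio_feats):
    """frame_feats: (n_frames, dv) deep visual features from a pre-trained CNN.
       audio_feats: (n_windows, da) deep audio features from a pre-trained net.
       Returns one video-level vector for the embedding model."""
    visual = frame_feats.mean(axis=0)   # mean-pool frames over time
    audio = audio_feats.mean(axis=0)    # mean-pool audio windows
    vec = np.concatenate([visual, audio])
    return vec / (np.linalg.norm(vec) + 1e-12)  # L2-normalize for cosine scoring
```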
2021 TREC Video Retrieval Evaluation, 2021
ArXiv, 2020
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, metrics-based evaluation. Over the last twenty years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID has been funded by NIST (National Institute of Standards and Technology) and other US government agencies. In addition, many organizations and individuals worldwide contribute significant time and effort. TRECVID 2020 represented a continuation of four tasks and the addition of two new tasks. In total, 29 teams from various research organizations worldwide completed one or more of the following six tasks: 1. Ad-hoc Video Search (AVS), 2. Instance Search (INS), 3. Disaster Scene Description and Indexing (DSDI), 4. Vide...
2015
This year EURECOM participated in the TRECVID 2015 Semantic INdexing (SIN) task [24], submitting four different runs for 60 concepts, and in the Video Hyperlinking (LNK) task [24] with four submissions. Our submission to the SIN task builds on the runs submitted for the 2013 and 2014 SIN tasks, the details of which can be found in [20] and [19], while the LNK submissions are based on our previous experiments in [28] and [9]. The major changes for 2015 are the use of new deep network models to produce extra descriptors for the video shots, and the introduction of various fusion schemes at all levels of the processing to reduce the problem of overfitting. This year we did not use our uploader model, partly because of lack of time, and partly because initial experiments showed only marginal improvement after the new features were added. For the LNK task, our approach aimed to connect, within one framework, the textual stream of the videos in the collection and its vocabulary context, as defined by the word2vec algorithm, with the output of visual concept detection tools for the corresponding hyperlink candidates. We combined visual concept detection confidence scores with the distances between the corresponding word vectors in order to rerank the baseline text-based search. The reranked runs did not outperform the baseline; however, they showed the potential of our method for further improvement. Besides this participation, EURECOM took part in the collaborative IRIM submission; the details of that contribution are included in the corresponding publication from the IRIM group. The remainder of this paper briefly describes the descriptors we have been using, the training and the various fusion schemes, and the content of the submitted runs, as well as the framework of confidence score combination used for reranking in the LNK task.
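A hedged sketch of the reranking step: combine the baseline text-search score with visual-concept confidences weighted by word2vec similarity between each detected concept and the anchor text (the linear fusion and the `alpha` weight are assumptions for illustration, not the paper's exact scheme):

```python
def rerank_score(baseline_score, concept_confidences, w2v_similarity, alpha=0.5):
    """baseline_score: text-based retrieval score for a hyperlink candidate.
       concept_confidences: {concept: visual detector confidence in the target}.
       w2v_similarity: {concept: cosine similarity of the concept's word vector
                        to the anchor text's vocabulary context}."""
    visual = sum(conf * w2v_similarity.get(c, 0.0)
                 for c, conf in concept_confidences.items())
    return (1 - alpha) * baseline_score + alpha * visual  # linear late fusion
```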
2019
This paper presents our system developed for the Ad-hoc Video Search (AVS) task in TRECVID 2019. Our system is based on embeddings that map visual and textual information into a common space to measure the relevance of each shot to a topic. We devise three embedding models built on two sources of training data, MS-COCO [1] and Flickr 30k [2]. The image feature extractors and region detector used internally in these models are pre-trained on ImageNet [3] and Visual Genome [4], respectively. The following five variants of our system were submitted: 1) F_M_C_D_kindai_kobe.19_1: This run is an ensemble of three embedding models. The first and second models are respectively trained on MS-COCO and Flickr 30k to perform different coarse-grained embeddings between frames and a topic. The last model forms a fine-grained embedding between regions in frames and words in a topic. 2) F_M_C_D_kindai_kobe.19_2: This run is the same as F_M_C_D_kindai_kobe.19_1 except that the fine-grained embedding model norm...
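A minimal sketch of fusing the three models into one ensemble run (equal weights and min-max normalization are assumptions; the exact fusion is not given in this excerpt):

```python
import numpy as np

def ensemble_scores(score_matrices):
    """score_matrices: list of (n_shots, n_topics) relevance matrices,
       one per embedding model."""
    # Put each model's scores on a comparable [0, 1] scale first.
    norm = [(m - m.min()) / (m.max() - m.min() + 1e-12) for m in score_matrices]
    return np.mean(norm, axis=0)  # equal-weight late fusion
```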
2018
This paper provides an overview of the runs submitted to TRECVID 2017 by ITI-CERTH. ITI-CERTH participated in the Ad-hoc Video Search (AVS), Multimedia Event Detection (MED), Instance Search (INS) and Surveillance Event Detection (SED) tasks. Our AVS task participation is based on a method that combines linguistic analysis of the query with concept-based and semantic-embedding representations of video fragments. Regarding the MED task, this year we participated in the Pre-Specified and Ad-Hoc sub-tasks, exploiting both motion-based and DCNN-based features. The INS task is performed by employing VERGE, an interactive retrieval application that integrates retrieval functionalities that mainly consider visual information. For the SED task, we deploy a novel activity detection algorithm based on human detection in video frames, goal descriptors, dense trajectories, Fisher vectors and a discriminative action segmentation scheme.
ArXiv, 2020
This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is challenging, since the features and representations of text and image are not directly comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network that learns vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that the images and tweets are not standardized in the way benchmark datasets are. Furthermore, there can be a higher semantic correlation between the pictures and tweets, in contrast to benchmarks in which the descriptions are well organized. Experimental results on MS-COCO benchm...
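The hinge-based triplet ranking loss mentioned above encourages a matching image-text pair to outscore a mismatched pair by a margin. A minimal sketch, assuming cosine similarity over L2-normalized embeddings (the margin value is illustrative):

```python
import numpy as np

def triplet_hinge_loss(img, pos_txt, neg_txt, margin=0.2):
    """img, pos_txt, neg_txt: (d,) L2-normalized embedding vectors."""
    s_pos = float(img @ pos_txt)  # similarity of the matching pair
    s_neg = float(img @ neg_txt)  # similarity of a mismatched pair
    # Zero loss once the positive beats the negative by at least `margin`.
    return max(0.0, margin - s_pos + s_neg)
```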
2019
In this paper, we describe the systems developed for the Ad-hoc Video Search (AVS) task at TRECVID 2019 [1] and the results achieved. Ad-hoc Video Search (AVS): We merge three video search systems for AVS: two concept-based video search systems, which analyse the query using linguistic approaches and then select and fuse the concepts, and a video retrieval model, which learns a joint embedding space of the textual queries and the videos for matching. With this setting, we plan to analyze the advantages and shortcomings of these video search approaches. We submitted seven runs in total: four automatic runs, two manual runs, and one novelty run. We briefly describe our runs as follows: • F_M_C_D_VIREO.19_1: This automatic run achieves a mean xinfAP of 0.034 using a concept-based video search system with a bank of ∼16.6k concepts covering objects, persons, activities, and places. We parse the queries with the Stanford NLP parsing tool [2], keep the keywords, and categorize the keywords into three groups: object/person, action, and place. Correspondingly, concepts from the different groups in the concept bank are selected and fused.
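As an illustration of concept-based scoring in such a run, a hedged sketch: keywords parsed from the query select concepts per group, and a shot's score fuses the matched detectors' confidences (the max-within-group / mean-across-groups fusion is an assumption, not necessarily VIREO's exact scheme):

```python
def concept_score(shot_confidences, selected_concepts):
    """shot_confidences: {concept: detector confidence for this shot}.
       selected_concepts: {'object/person': [...], 'action': [...], 'place': [...]},
       as produced by parsing the query and matching against the concept bank."""
    group_scores = []
    for group, concepts in selected_concepts.items():
        hits = [shot_confidences.get(c, 0.0) for c in concepts]
        if hits:
            group_scores.append(max(hits))  # best matching concept per group
    return sum(group_scores) / len(group_scores) if group_scores else 0.0
```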
2011
The Text Retrieval Conference’s (TREC’s) Video Retrieval