2018
We participated in the video-to-text description matching and ranking task in TRECVID 2018. The goal of this task is to return a ranked list of the most likely text descriptions that correspond to each video in the test set. We trained joint visual-semantic embedding models using image-text pairs from an image-captioning dataset and applied them to the video-text retrieval task, using key frames of videos extracted by a sparse subset selection approach. Our retrieval system performed reasonably across all the testing sets. Our best system, which uses a late fusion of similarity scores obtained from the key frames of a video, achieved a mean inverted rank score of 0.225 on testing set C, and we ranked 4th overall on this task.
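A rough sketch of that pipeline in Python (illustrative names, assuming the late fusion simply averages per-keyframe similarity scores and that the metric is the mean reciprocal rank of the correct caption; neither detail is spelled out in the abstract):

    import numpy as np

    def fuse_keyframe_scores(keyframe_scores):
        # keyframe_scores: (num_keyframes, num_captions) array where row k holds
        # the embedding similarity of keyframe k to every candidate caption.
        # Averaging is one plausible fusion rule; the report does not state the exact one.
        return np.mean(keyframe_scores, axis=0)

    def mean_inverted_rank(rankings, ground_truth):
        # Mean of 1/rank of the correct caption over all test videos.
        reciprocal = []
        for ranked_captions, true_caption in zip(rankings, ground_truth):
            rank = list(ranked_captions).index(true_caption) + 1  # 1-based rank
            reciprocal.append(1.0 / rank)
        return float(np.mean(reciprocal))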
2018
This paper describes our participation in the ad-hoc video search and video-to-text tasks of TRECVID 2018. In ad-hoc video search, we adapted an image-based visual-semantic embedding approach and trained our model on the combined MS COCO and Flickr30k datasets. We extracted multiple keyframes from each shot and performed similarity search using the computed embeddings. In the video-to-text description generation task, we trained a video captioning model with multiple features using a reinforcement learning method on the combination of the MSR-VTT and MSVD video captioning datasets. For the matching and ranking subtask, we trained two types of image-based ranking models on the MS COCO dataset. 1 Ad-hoc Video Search (AVS) In the ad-hoc video search task, we are given 30 free-text queries and required to return the top 1000 shots from the test set videos [1, 2]. The queries are given in Appendix A. The test set contains 4593 Internet Archive videos of 600 hours with 450K shots (publicly availabl...
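A minimal sketch of the shot-ranking step, assuming cosine similarity between a query embedding and precomputed keyframe embeddings, with the best-matching keyframe deciding each shot's score (the max rule and all names are illustrative, not taken from the paper):

    import numpy as np

    def rank_shots(query_emb, shot_keyframe_embs, top_k=1000):
        # query_emb: (d,) embedding of the free-text query.
        # shot_keyframe_embs: list of (n_i, d) arrays, one per shot.
        q = query_emb / np.linalg.norm(query_emb)
        scores = []
        for kf in shot_keyframe_embs:
            kf = kf / np.linalg.norm(kf, axis=1, keepdims=True)
            scores.append(float((kf @ q).max()))  # best keyframe sets the shot score
        order = np.argsort(scores)[::-1][:top_k]  # highest-scoring shots first
        return order, [scores[i] for i in order]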
2019
In this paper we present an overview of our participation in TRECVID 2019 [1]. We participated in the Ad-hoc Video Search (AVS) task and in the Description Generation and Matching and Ranking subtasks of the Video to Text (VTT) task. First, for the AVS task, we develop a system architecture that we call “Word2AudioVisualVec++” (W2AVV++), based on Word2VisualVec++ (W2VV++) [11], which in addition to deep visual features of videos also uses deep audio features obtained from pre-trained networks. Second, for the VTT Matching and Ranking task, we develop another deep learning model based on Word2VisualVec++, extracting temporal information from the video using Dense Trajectories [16] and a clustering approach to encode them into a single vector representation. Third, for the VTT Description Generation task, we develop an encoder-decoder model incorporating semantic states into the encoder phase. 1 Ad-hoc Video Search
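One simple way to realize “a clustering approach to encode them into a single vector representation” is a bag-of-words histogram over a k-means codebook of trajectory descriptors; the sketch below is an assumption about the encoding, not the authors' exact method, and the cluster count is arbitrary:

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_codebook(training_descriptors, num_clusters=64):
        # Fit a k-means codebook on dense-trajectory descriptors pooled from training videos.
        return KMeans(n_clusters=num_clusters, n_init=10).fit(training_descriptors)

    def encode_video(codebook, traj_descriptors):
        # Encode one video's (N, d) trajectory descriptors as a single
        # L1-normalized histogram over the codebook clusters.
        labels = codebook.predict(traj_descriptors)
        hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)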
Symmetry
Traditionally, searching for videos on popular streaming sites like YouTube is performed by taking into consideration the keywords, titles, and descriptions that are already tagged along with the video. However, the video content itself is not utilized to answer the user’s query, because of the difficulty of encoding the events in a video and comparing them to the search query. One solution to this problem is to encode the events in a video and then compare them to the query in the same space. One way of encoding the meaning of a video is video captioning. The captioned events in the video can be compared to the user’s query, giving an optimal search space for the videos. There have been many developments over the past few years in modeling video-caption generators and sentence embeddings. In this paper, we exploit an end-to-end video captioning model and various sentence embedding techniques that collectively help in building the proposed video-...
ArXiv, 2020
This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is a challenging task, since the features and representations of text and image are not directly comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network for learning vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that the images and tweets are not standardized in the same way as the benchmarks. Furthermore, there can be a higher semantic correlation between the pictures and tweets, contrary to benchmarks in which the descriptions are well organized. Experimental results on MS-COCO benchm...
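The hinge-based triplet ranking objective mentioned here is typically implemented with in-batch negatives; a minimal PyTorch sketch (the margin value and the absence of hard-negative mining are assumptions, not details from the paper):

    import torch

    def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
        # img_emb, txt_emb: (B, d) L2-normalized embeddings; row i of each is a
        # positive pair, every other row in the batch acts as a negative.
        scores = img_emb @ txt_emb.t()                       # (B, B) similarities
        pos = scores.diag().view(-1, 1)                      # matched-pair scores
        cost_txt = (margin + scores - pos).clamp(min=0)      # image-to-text direction
        cost_img = (margin + scores - pos.t()).clamp(min=0)  # text-to-image direction
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()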
ArXiv, 2016
We present our approaches for the four video-to-language tasks of LSMDC 2016, including movie description, fill-in-the-blank, multiple-choice test, and movie retrieval. Our key idea is to adopt the semantic attention mechanism; we first build a set of attribute words that are consistently discovered on video frames, and then selectively fuse them with input words for a more semantic representation and with output words for more accurate prediction. We show that our implementation of semantic attention indeed improves the performance of multiple video-to-language tasks. Specifically, the presented approaches participated in all four tasks of LSMDC 2016 and won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval.
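A minimal sketch of the input-side semantic attention described here: the decoder state scores each detected attribute word, and the attended attribute context is fused with the current word embedding (dimensions and the additive fusion are assumptions, not the authors' exact formulation):

    import torch
    import torch.nn as nn

    class SemanticAttention(nn.Module):
        def __init__(self, hid_dim, emb_dim):
            super().__init__()
            self.score = nn.Linear(hid_dim + emb_dim, 1)

        def forward(self, hidden, attr_embs, word_emb):
            # hidden: (B, hid_dim); attr_embs: (B, K, emb_dim); word_emb: (B, emb_dim)
            h = hidden.unsqueeze(1).expand(-1, attr_embs.size(1), -1)
            weights = torch.softmax(self.score(torch.cat([h, attr_embs], dim=-1)), dim=1)
            context = (weights * attr_embs).sum(dim=1)  # weighted attribute context
            return word_emb + context                   # fused decoder input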
2020
In this paper we present an overview of our participation in the TRECVID 2020 Video to Text Description challenge [1]. Specifically, we participated in the Description Generation subtask by extending our recent paper [21]. We address a limitation of previous video captioning methods: they depend strongly on the effectiveness of semantic representations learned from visual models, but often produce syntactically incorrect sentences, which harms their performance on standard datasets. We consider syntactic representation learning an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation that depends only on the video, and a syntactic representation that depends only on the Part-of-Speech (POS) tagging structure of the video description. We integrate this joint representation into an encoder-decoder architecture that we call the Visual-Semantic-Syntactic Aligned Network (SemSynAN), which gui...
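For reference, the syntactic branch only sees the Part-of-Speech structure of a caption; a tiny sketch of extracting such a POS sequence with NLTK (the choice of tagger is an assumption, and the tagger data must be downloaded first):

    import nltk  # requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

    def pos_sequence(caption):
        # Reduce a caption to its POS-tag sequence, e.g. "a man plays guitar"
        # becomes something like ['DT', 'NN', 'VBZ', 'NN'].
        tokens = nltk.word_tokenize(caption)
        return [tag for _, tag in nltk.pos_tag(tokens)]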
2020
In this paper we summarize our TRECVID 2020 video retrieval work. We participated in the Ad-hoc Video Search (AVS) task. For the AVS task, we developed our solutions based on W2VV++, a super version of Word2VisualVec (W2VV), by optimizing its hyperparameters and further augmenting it with attention-based caption generation for text-to-text matching. 1. Approach We attempt to augment the state-of-the-art W2VV++ implementation, which won the TRECVID 2018 AVS task and marked the shift toward concept-free video retrieval. First, we experimented with hyperparameter optimization and different optimizers; second, we explored query-to-caption similarity to re-rank the output of W2VV++. 1.1 Model optimization We attempt to improve the training performance of the W2VV++ model. Multiple optimizers were tried, along with different learning-rate values and strategies. a. Optimizers The W2VV++ model is trained using different optimizers. Following optimizer t...
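A sketch of how such an optimizer comparison might be set up in PyTorch; the candidate list and learning rate are illustrative, since the report only says that multiple optimizers and learning-rate strategies were tried:

    import torch

    def build_optimizer(model, name="adam", lr=1e-4):
        # Return one of several optimizers to compare for W2VV++-style training.
        candidates = {
            "sgd": lambda: torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9),
            "adam": lambda: torch.optim.Adam(model.parameters(), lr=lr),
            "rmsprop": lambda: torch.optim.RMSprop(model.parameters(), lr=lr),
        }
        return candidates[name]()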
ArXiv, 2021
Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning model that incorporates memory in a transformer encoder and uses a novel feature-fusion method that ensures due importance is given to more significant representations. We illustrate the gains in performance realized by applying WordPiece tokenization and the popular REINFORCE algorithm. Finally, we benchmark our model on two datasets, obtaining a CIDEr of 92.4 on MSVD and a METEOR of 0.091 on the ActivityNet Captions dataset.
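A minimal sketch of the weighted additive fusion idea: each feature stream is projected and the streams are combined with learned, softmax-normalized scalar weights (layer shapes and the softmax weighting are assumptions, not the paper's exact design):

    import torch
    import torch.nn as nn

    class WeightedAdditiveFusion(nn.Module):
        def __init__(self, num_streams, dim):
            super().__init__()
            self.weights = nn.Parameter(torch.zeros(num_streams))
            self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_streams)])

        def forward(self, streams):
            # streams: list of (B, dim) tensors, one per visual feature extractor,
            # assumed already projected to a common dimensionality.
            w = torch.softmax(self.weights, dim=0)
            return sum(w[i] * self.proj[i](s) for i, s in enumerate(streams))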
Check the effectiveness of this book by answering this questionnaire. If you have not read the book, it will help you gauge your starting point as a negotiator; if you have already read it, you will see how far you have come. Please answer true or false to these 20 questions.
Heliyon, 2019
This study describes the social and demographic profile of the first generation of users of marketed virtual reality (VR) viewers in Spain and, subsequently, assesses their interest in its use as a learning tool. For that purpose, an online questionnaire created ad hoc was administered to a sample of 117 participants. The relationship between twelve variables was analysed, comparing means through Snedecor's F distribution and the contingency tables through the Chi-squared test and Somers' D. Among other findings, it was concluded that the current virtual reality user profile corresponds to a person older than 36, mainly male, with higher education, who acquired their viewer no longer than one year ago. Concerning the interest of virtual reality users in it as a learning tool, only a few of them currently use virtual reality for this purpose, but they mainly show an interest in using virtual reality as a learning method and feel optimism regarding the future use of this technology as a learning tool. However, this is not the case among users of video game consoles (PSVR), who are mainly men not interested in its use as a learning tool at present. Finally, it can be stated that current use as a learning tool among teachers and students is occasional and preferably via smartphones.
Trabalho & Educação, 2024
Huerta, R. (2022). Funció plàstica de les lletres en les publicacions periòdiques il·lustrades espanyoles dels anys '50 [The plastic function of lettering in Spanish illustrated periodicals of the 1950s]. Doctoral thesis. Universitat Politècnica de València. ISBN 84-370-0976-6
The Journal of Emergency Medicine, 1984
Regional Environmental Change, 2019
La traducción y la interpretación en tiempos de pandemia, 2024
INTERNATIONAL JOURNAL OF EDUCATION, PSYCHOLOGY AND COUNSELLING (IJEPC), 2024
Lucentum, 2022
"Rara volumina", Rivista di studi sull'editoria e il libro illustrato, 2021
arXiv (Cornell University), 2013
Food Control, 2018
Biocyt biología, ciencia y tecnología, 2023
Arthroscopy: The Journal of Arthroscopic & Related Surgery, 2013
medRxiv (Cold Spring Harbor Laboratory), 2023