Papers by Manish Shrivastava
Cornell University - arXiv, Aug 2, 2018
In order to expand their reach and increase website ad revenue, media outlets have started using clickbait techniques to lure readers to click on articles on their digital platforms. Once the user has been enticed into opening it, the article fails to satisfy their curiosity and serves only to boost click-through rates. Initial methods for this task were dependent on feature engineering, which varies with each dataset. Industry systems have relied on an exhaustive set of rules to get the job done. Neural networks have barely been explored for this task. We propose a novel approach considering different textual embeddings of a news headline and the related article. We generate sub-word level embeddings of the title using Convolutional Neural Networks and use them to train a bidirectional LSTM architecture. An attention layer allows the significance of each term towards the nature of the post to be calculated. We also generate Doc2Vec embeddings of the title and article text and model how they interact; the result is concatenated with the output of the previous component. Finally, this representation is passed through a neural network to obtain a score for the headline. We test our model on 2538 posts (having trained it on 17000 records) and achieve an accuracy of 83.49%, outscoring previous state-of-the-art approaches.
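The attention step described in this abstract can be illustrated with a minimal sketch. It assumes the common softmax-weighted pooling form; the abstract does not give the paper's exact scoring function, so the per-term scores are taken as inputs here, and the names `softmax` and `attention_pool` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw per-term scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, scores):
    """Weighted sum of per-term hidden states (e.g. BiLSTM outputs).

    hidden_states: list of equal-length vectors, one per term.
    scores: one raw relevance score per term (assumed to be given).
    Returns the attention weights and the pooled vector.
    """
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[i] for w, h in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, pooled
```

The weights can be read as the significance of each term, which is the interpretive benefit the abstract attributes to the attention layer.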
Cornell University - arXiv, Jun 14, 2018
The tremendous amount of user-generated data on social networking sites has made automatic text classification increasingly popular in computational linguistics over the past decade. Within this domain, one problem that has drawn the attention of many researchers is automatic humor detection in texts. Detecting humor requires in-depth semantic understanding of the text, which makes the problem difficult to automate. With the increase in the number of social media users, many multilingual speakers switch between languages while posting on social media, which is called code-mixing. It introduces challenges for the linguistic analysis of social media content (Barman et al., 2014), such as spelling variations and non-grammatical sentence structures. Past research includes detecting puns in texts (Kao et al., 2016) and humor in one-liners (Mihalcea et al., 2010) in a single language, but with the tremendous amount of code-mixed data available online, there is a need to develop techniques that detect humor in code-mixed tweets. In this paper, we analyze the task of humor detection in texts and describe a freely available corpus containing English-Hindi code-mixed tweets annotated with humorous (H) or non-humorous (N) tags. We also tagged the words in the tweets with language tags (English/Hindi/Others). Moreover, we describe the experiments carried out on the corpus and provide a baseline classification system which distinguishes between humorous and non-humorous texts.
Text, Speech, and Dialogue, 2020
Conversational and task-oriented dialogue systems aim to interact with the user using natural responses through multi-modal interfaces, such as text or speech. These desired responses are in the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full-length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer. This is the first attempt towards generating full-length natural answers from a graph input (confusion network) to the best of our knowledge. We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and corresponding full-length textual answers. Following our proposed approach, we achieve comparable performance with best ASR hypothesis.
2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2018
In this paper, we model the similar question retrieval task as a binary classification problem. We propose a novel approach, "1D-Siamese LSTM for cQA (1D-SLcQA)", to find the semantic similarity between a new question and existing question(s). In 1D-SLcQA, we use a combination of twin LSTM networks and a contrastive loss function to effectively memorize long-term dependencies, i.e., to capture semantic similarity even when the answers/questions are very long (200 words). The similarity of the questions is modeled using a single network with 1D (feature) convolution between feature vectors learned from the twin LSTM layers. Experiments on a large-scale real-world Yahoo Answers dataset show that 1D-SLcQA outperforms the state-of-the-art Siamese cQA (SCQA) approach.
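For readers unfamiliar with the contrastive loss mentioned in this abstract, here is a minimal sketch of its widely used margin-based form, applied to the two vectors produced by twin encoders. This is an assumption for illustration: the abstract does not state the exact distance measure or margin the paper uses, and the function names and the default `margin=1.0` are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, is_similar, margin=1.0):
    """Margin-based contrastive loss for one pair of encodings.

    Similar pairs are penalised by their squared distance, pulling
    them together. Dissimilar pairs are penalised only while they
    are closer than `margin`, pushing them apart.
    """
    d = euclidean_distance(u, v)
    if is_similar:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In a Siamese setup both questions pass through the same weights, so minimising this loss shapes a single embedding space in which distance tracks question similarity.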
ArXiv, 2018
Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces a blending of South Asian languages with English. Such blends of multiple languages, known as code-mixed data, have recently become popular in research communities for various NLP tasks. Code-mixed data contain anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words, whereby different spelling variations of a word share similar contexts in large, noisy social media text. We capture different variations of words belonging to the same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing the code-mixed dataset with our approach improves performance in state-of-the-art part-of-speech tagging (POS tagging) and sentiment analysis tasks.
The advent of social media has greatly increased the number of opinions and arguments voiced on the internet. These virtual debates often present cases of aggression. While research has largely focused on analyzing aggression and stance in isolation from each other, this work is the first attempt to gain an extensive and fine-grained understanding of patterns of aggression and figurative language use when voicing opinion. We present a Hindi-English code-mixed dataset of opinion on the politico-social issue of the "2016 India banknote demonetisation" and annotate it across multiple dimensions such as aggression, hate speech, emotion arousal and figurative language usage (such as sarcasm/irony, metaphors/similes, puns/word-play).
Code-mixing, the use of two or more languages in a single sentence, is produced by multilingual speakers across the world. The phenomenon presents itself prominently in social media discourse. Consequently, there is a growing need for translating code-mixed hybrid language into standard languages. However, due to the lack of gold parallel data, existing machine translation systems fail to properly translate code-mixed text. In an effort to initiate the task of machine translation of code-mixed content, we present a newly created parallel corpus of code-mixed English-Hindi and English. We selected previously available English-Hindi code-mixed data as a starting point for our parallel corpus, and 4 human translators, fluent in both English and Hindi, translated the 6,096 code-mixed English-Hindi sentences into English. With the help of the created parallel corpus, we analyzed the structure of English-Hindi code-mixed data and present a technique to augment run-of-the-mill machine transla...
ABSTRACT: This paper studies existing techniques in the field of communication systems, and proposes and experimentally demonstrates a multiuser wavelength-division-multiplexing passive optical network (WDM-PON) system combined with the orthogonal frequency-division multiplexing (OFDM) technique. A tunable multiwavelength optical comb is designed to provide flat optical lines that support the configuration of a WDM-OFDM-PON system with multiple source-free optical network units over standard single-mode fiber (SSMF). In fiber-optic communications, wavelength-division multiplexing is a technology that multiplexes a number of optical carrier signals onto one fiber by using different wavelengths of laser light. This technique allows bidirectional communication over one strand of fiber, as well as multiplication of capacity. We calculate the BER (bit error rate) and OSNR (optical signal-to-noise ratio); finally, a comparison of by experiment...
International Journal For Science Technology And Engineering, 2016
In this new era, the growing vehicle population in all developing and developed countries calls for major improvement and innovation in existing traffic signalling systems. The most widely used traffic control is a simple time-based system that works on a fixed time interval, which is inefficient for random and non-uniform traffic. The proposed system uses congestion-based dynamic timing. It does not require any equipment in vehicles, so it can be implemented in any traffic system easily, quickly and at low cost. The system uses wireless sensor network technology to sense vehicles and a microcontroller-based routing algorithm programmed for effective traffic management.
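The contrast between fixed-interval and congestion-based timing can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: it assumes each approach reports a sensed vehicle count, and it splits one signal cycle in proportion to those counts while guaranteeing a minimum green time. The function name, `cycle_seconds` and `min_green` are hypothetical.

```python
def allocate_green_times(vehicle_counts, cycle_seconds=120, min_green=10):
    """Split one signal cycle across approaches by sensed congestion.

    vehicle_counts: vehicles detected on each approach (assumed to
    come from roadside sensors).
    Every approach gets at least `min_green` seconds; the remaining
    time is shared in proportion to the counts.
    """
    n = len(vehicle_counts)
    spare = cycle_seconds - n * min_green
    if spare < 0:
        raise ValueError("cycle too short for the minimum green times")
    total = sum(vehicle_counts)
    if total == 0:
        # No traffic sensed: fall back to an equal, fixed-time split.
        return [cycle_seconds / n] * n
    return [min_green + spare * count / total for count in vehicle_counts]
```

A fixed-time controller would give each of four approaches 30 seconds regardless of demand, whereas here `allocate_green_times([30, 10, 0, 0])` shifts most of the cycle to the busiest approach.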
ArXiv, 2016
Community Question Answering (cQA) forums have become a popular medium for soliciting direct answers to specific questions of users from experts or other experienced users on a given topic. However, for a given question, users sometimes have to sift through a large number of low-quality or irrelevant answers to find the answer which satisfies their information need. To alleviate this, the problem of Answer Quality Prediction (AQP) aims to predict the quality of an answer posted in response to a forum question. Current AQP systems learn models using either a) various hand-crafted features (HCF) or b) deep learning (DL) techniques which automatically learn the required feature representations. In this paper, we propose a novel approach for AQP known as "Deep Feature Fusion Network (DFFN)" which leverages the advantages of both hand-crafted features and deep learning based systems. Given a question-answer pair along with its metadata, DFFN independently - a) learn...
Eye blinking is a physiological necessity for humans. This method automatically locates the user's eye by detecting eye blinks. One application is improving driver attentiveness and reducing accidents: the driver's face is tracked while driving, and the driver is warned if there is a warning sign that could lead to an accident, such as sleepy eyes or looking away from the road. Furthermore, with a facial feature tracker, it becomes possible to animate a synthesized avatar so that it imitates the expressions of the performer. For a user who is unable to use their hands, a facial expression controller may be a way to send limited commands to a computer. Eye blink detection is thus relevant to many real-world problems. The process of blink detection consists of two phases: eye tracking followed by detection of the blink. Work that has been carried out for eye tracking only is not suitable for eye blink detection. Therefore some approaches had bee...
This paper describes an end-to-end dialog system created using sequence-to-sequence learning and memory networks for Telugu, a low-resource language. We automatically generate dialog data for Telugu in the tourist domain, using a knowledge base that provides tourist place, type, tour time, etc. Using this data, we train a sequence-to-sequence model to learn system responses in the dialog. In order to add query prediction for information retrieval (through API calls), we train a memory network. We also handle cases requiring updates to API calls and querying for additional information. Using the combination of sequence-to-sequence learning and a memory network, we successfully create an end-to-end dialog system for Telugu.
ArXiv, 2017
We present a language-independent, unsupervised method for building word embeddings using morphological expansion of text. Our model handles the problem of data sparsity and yields improved word embeddings by training them on artificially generated sentences. We evaluate our method using small training sets on eleven test sets for the word similarity task across seven languages. Further, for English, we evaluated the impact of our approach using a large training set on three standard test sets. Our method improved results across all languages.
Community Question Answering (cQA) forums have become a popular medium for soliciting direct answers to specific questions of users from experts or other experienced users on a given topic. However, for a given question, users sometimes have to sift through a large number of low-quality or irrelevant answers to find the answer which satisfies their information need. To alleviate this, the problem of Answer Quality Prediction (AQP) aims to predict the quality of an answer posted in response to a forum question. Current AQP systems learn models using either a) various hand-crafted features (HCF) or b) Deep Learning (DL) techniques which automatically learn the required feature representations. In this paper, we propose a novel approach for AQP known as "Deep Feature Fusion Network (DFFN)" which combines the advantages of both hand-crafted features and deep learning based systems. Given a question-answer pair along with its metadata, the DFFN architecture independently - a) lea...
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, 2021
Code-mixing is a common phenomenon in multilingual societies around the world and is especially common in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks since we can use transfer learning from state-of-the-art English models for processing the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains in languages that were not in its pre-training corpus.
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019
Code-mixing is the phenomenon of mixing the vocabulary and syntax of multiple languages in the same sentence. It is an increasingly common occurrence in today's multilingual society and poses a big challenge when encountered in different downstream tasks. In this paper, we present a hybrid architecture for the task of Sentiment Analysis of English-Hindi code-mixed data. Our method consists of three components, each seeking to alleviate different issues. We first generate subword-level representations for the sentences using a CNN architecture. The generated representations are used as inputs to a Dual Encoder Network which consists of two different BiLSTMs, the Collective Encoder and the Specific Encoder. The Collective Encoder captures the overall sentiment of the sentence, while the Specific Encoder utilizes an attention mechanism in order to focus on individual sentiment-bearing sub-words. This, combined with a Feature Network consisting of orthographic features and specially trained word embeddings, achieves state-of-the-art results (83.54% accuracy and 0.827 F1 score) on a benchmark dataset.
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 2018
In the past few years, bullying and aggressive posts on social media have grown significantly, causing serious consequences for victims/users of all demographics. The majority of the work in this field has been done for English only. In this paper, we introduce a deep learning based classification system for Facebook posts and comments in Hindi-English code-mixed text to detect the aggressive behaviour of/towards users. Our work focuses on text from users mainly in the Indian subcontinent. The dataset that we used for our models is provided by TRAC-1 in their shared task. Our classification model assigns each Facebook post/comment to one of three predefined categories: "Overtly Aggressive", "Covertly Aggressive" and "Non-Aggressive". We experimented with 6 classification models, and our CNN model gave the best result under 10-fold cross-validation, with a prediction accuracy of 73.2%.
Lecture Notes in Computer Science, 2016
Internet users today prefer getting precise answers to their questions rather than sifting through a bunch of relevant documents provided by search engines. This has led to the huge popularity of Community Question Answering (cQA) services like Yahoo! Answers, Baidu Zhidao, Quora, StackOverflow etc., where forum users respond to questions with precise answers. Over time, such cQA archives become rich repositories of knowledge encoded in the form of questions and user generated answers. In cQA archives, retrieval of similar questions, which have already been answered in some form, is important for improving the effectiveness of such forums. The main challenge while retrieving similar questions is the "lexico-syntactic" gap between the user query and the questions already present in the forum. In this paper, we propose a novel approach called "Deep Structured Topic Model (DSTM)" to bridge the lexico-syntactic gap between the question posed by the user and forum questions. DSTM employs a two-step process consisting of initially retrieving similar questions that lie in the vicinity of the query and latent topic vector space and then re-ranking them using a deep layered semantic model. Experiments on large scale real-life cQA dataset show that our approach outperforms the state-of-the-art translation and topic based baseline approaches.
Proceedings of the 24th International Conference on World Wide Web, 2015
Code-Mixing (CM) is defined as the embedding of linguistic units such as phrases, words, and morphemes of one language into an utterance of another language. CM is a natural phenomenon observed in many multilingual societies. It helps speed up communication and allows a wider variety of expression, due to which it has become a popular mode of communication in social media forums like Facebook and Twitter. However, current Question Answering (QA) research and systems only support expressing a question in a single language, which is an unrealistic and hard proposition, especially for certain domains like health and technology. In this paper, we take the first step towards the development of a full-fledged QA system in CM language, which is building a Question Classification (QC) system. The QC system analyzes the user question and infers the expected Answer Type (AType). The AType helps in locating and verifying the answer as it imposes certain type-specific constraints. We learn a basic Support Vector Machine (SVM) based QC system for English-Hindi CM questions. Due to the inherent complexities involved in processing CM language and the unavailability of language processing resources such as POS taggers, chunkers and parsers, we design our current system using only word-level resources such as language identification, transliteration and lexical translation. To reduce data sparsity and leverage resources available in a resource-rich language, instead of extracting features directly from the original CM words, we translate them into English and then perform featurization. We created an evaluation dataset for this task, and our system achieves accuracies of 63% and 45% in the coarse-grained and fine-grained categories of the question taxonomy, respectively. The idea of translating features into English indeed helps in improving accuracy over the uni-gram baseline.
The Domain Name System (DNS) is a hierarchical distributed database that facilitates the conversion of human-readable host names into IP addresses and vice versa. Security services are provided in DNS by the DNSSEC system, which relies on cryptography. Symmetric key cryptography can be applied to provide security services in DNSSEC. We have introduced several modifications in the process so as to extend the security services in cryptography.