The Clickbait Challenge targets spoiling clickbaits: generating short pieces of information, known as spoilers, that satisfy the curiosity induced by a clickbait post. The large context of the article associated with the clickbait and the differences in spoiler forms make the task challenging. To tackle the large context, we propose an Information Condensation-based approach that prunes away unnecessary context. Given an article, our filtering module, optimized with a contrastive learning objective, first selects the paragraphs most relevant to the corresponding clickbait. The resulting condensed article is then fed to the two downstream tasks of spoiler type classification and spoiler generation. We demonstrate and analyze the gains from this approach on both tasks. Overall, we win the spoiler type classification task and achieve competitive results on spoiler generation.
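A minimal sketch of the condensation step. The paper's filtering module is trained with a contrastive objective; as a stand-in, this sketch ranks paragraphs by TF-IDF cosine similarity to the clickbait post and keeps the top-k. The function name `condense_article` and the choice of TF-IDF are illustrative assumptions, not the paper's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def condense_article(clickbait: str, paragraphs: list[str], top_k: int = 3) -> list[str]:
    """Keep the top_k paragraphs most similar to the clickbait post."""
    vectorizer = TfidfVectorizer()
    # Fit on the post and the paragraphs together so they share a vocabulary.
    matrix = vectorizer.fit_transform([clickbait] + paragraphs)
    post_vec, para_vecs = matrix[0], matrix[1:]
    scores = cosine_similarity(post_vec, para_vecs).ravel()
    # Keep the k highest-scoring paragraphs, preserving document order.
    keep = sorted(sorted(range(len(paragraphs)), key=lambda i: -scores[i])[:top_k])
    return [paragraphs[i] for i in keep]
```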
Depression is a common and serious medical illness that negatively affects how you feel, the way you think, and how you act. Detecting depression is essential, as it must be treated early to avoid painful consequences. Nowadays, people broadcast how they feel via posts and comments, so we can extract many depression-related comments from social media and apply NLP techniques to detect depression. This work presents the submission of the DepressionOne team at LT-EDI-2022 for the shared task on detecting signs of depression from social media text. The depression dataset is small and imbalanced, so we used oversampling and undersampling methods such as SMOTE and RandomUnderSampler to balance the data. We then trained machine learning models to detect the signs of depression.
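A minimal sketch of the resampling setup using the tools the abstract names (SMOTE and RandomUnderSampler from imbalanced-learn). The TF-IDF features and logistic-regression classifier are illustrative assumptions; the team's exact models are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000)),
    ("oversample", SMOTE(random_state=42)),                # synthesize minority examples
    ("undersample", RandomUnderSampler(random_state=42)),  # trim the majority class
    ("clf", LogisticRegression(max_iter=1000)),
])

# texts: list of post strings, labels: list of class ids from the shared-task data.
# pipeline.fit(texts, labels); resampling is applied only during fitting.
```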
Identifying named entities is, in general, a practical and challenging task in the field of Natural Language Processing. Named Entity Recognition (NER) on code-mixed text is further challenging due to the linguistic complexity resulting from the nature of the mixing. This paper describes the submission of team CM-NEROne to the SemEval 2022 shared task 11, MultiCoNER. The code-mixed NER task aimed to identify named entities in a code-mixed dataset. Our work performs NER on the code-mixed dataset by leveraging multilingual data. We achieved a weighted average F1 score of 0.7044, i.e., 6% greater than the baseline.
Named Entity Recognition (NER) is a successful and well-researched problem in English due to the availability of resources. Transformer models, specifically masked language models (MLMs), have shown remarkable performance on NER in recent times. With growing data on different online platforms, there is a need for NER in other languages too, yet NER remains underexplored in Indian languages due to the lack of resources and tools. Our contributions in this paper include: (i) two annotated NER datasets for the Telugu language in multiple domains, a Newswire Dataset (ND) and a Medical Dataset (MD), which we combine to form a Combined Dataset (CD); (ii) a comparison of fine-tuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); and (iii) a further investigation of the performance of the Telugu pretrained transformer models against the multilingual models mBERT (Devlin et al., 2018), XLM-R (Conneau et al., 2020), and IndicBERT (Kakwani et al., 2020). We find that the pretrained Telugu language models (BERT-Te and RoBERTa-Te) outperform the existing pretrained multilingual and baseline models on NER. On a large dataset (CD) of 38,363 sentences, BERT-Te achieves a high F1-score of 0.80 (entity-level) and 0.75 (token-level). Further, these pretrained Telugu models show state-of-the-art performance on various Telugu NER datasets. We open-source our dataset, pretrained models, and code.
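A minimal fine-tuning sketch for Telugu NER with a HuggingFace token-classification head. The checkpoint name "bert-te" is a placeholder for the released BERT-Te model, not a verified hub id, and the label count is illustrative.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

MODEL = "bert-te"  # placeholder for the open-sourced BERT-Te checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=9)

args = TrainingArguments(output_dir="ner-te", num_train_epochs=3,
                         per_device_train_batch_size=16)
# train_ds / eval_ds: tokenized datasets with input_ids, attention_mask, and
# per-token labels (continuation sub-word pieces typically set to -100).
trainer = Trainer(model=model, args=args)  # pass train_dataset=train_ds, eval_dataset=eval_ds
# trainer.train()
```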
Code-mixing (CM) is a frequently observed phenomenon in which multiple languages are used in an utterance or sentence. Code-mixing observes no strict grammatical constraints and features non-standard spelling variations. The linguistic complexity resulting from these factors makes computational analysis of code-mixed language a challenging task. Language identification (LI) and part-of-speech (POS) tagging are the fundamental steps that help analyze the structure of code-mixed text, and the two tasks are often interdependent in the code-mixing scenario. We frame the problem of dealing with multilingualism and grammatical structure while analyzing a code-mixed sentence as a joint learning task. In this paper, we jointly train and optimize language identification and POS tagging models in the code-mixed scenario, using a Transformer combined with a convolutional neural network architecture. We train the joint model on code-mixed social media text obtained from the ICON shared task.
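A minimal sketch of the joint learning idea: one shared encoder feeds two per-token heads (POS and LI), trained on a summed loss. The BiLSTM encoder and all sizes here are illustrative stand-ins; the paper uses a Transformer with a convolutional architecture.

```python
import torch
import torch.nn as nn

class JointTagger(nn.Module):
    def __init__(self, vocab=30000, dim=256, n_pos=17, n_lang=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.pos_head = nn.Linear(2 * dim, n_pos)   # POS tag per token
        self.li_head = nn.Linear(2 * dim, n_lang)   # language id per token

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))     # shared representation
        return self.pos_head(h), self.li_head(h)

model = JointTagger()
loss_fn = nn.CrossEntropyLoss()
# pos_logits, li_logits = model(batch_tokens)       # (B, T, C) each
# loss = loss_fn(pos_logits.flatten(0, 1), pos_gold.flatten()) \
#      + loss_fn(li_logits.flatten(0, 1), li_gold.flatten())
```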
Sentiment Analysis is an important task for analysing online content across languages, supporting applications such as content moderation and opinion mining. Though a significant amount of resources is available for Sentiment Analysis in several Indian languages, no large-scale, open-access corpora exist for Gujarati. Our paper presents and describes the Gujarati Sentiment Analysis Corpus (GSAC), which has been sourced from Twitter and manually annotated by native speakers of the language. We describe our collection and annotation processes in detail and conduct extensive experiments on our corpus to provide reliable baselines for future work using our dataset.
The presented work aims at generating a systematically annotated corpus that can support the enhancement of sentiment analysis tasks in Telugu using word-level sentiment annotations. From OntoSenseNet, we extracted 11,000 adjectives, 253 adverbs, and 8,483 verbs, and the sentiment annotation was done by language experts. We discuss the methodology followed for the polarity annotations and validate the developed resource. This work aims at developing a benchmark corpus, as an extension to SentiWordNet, and a baseline accuracy for a model where lexeme annotations are applied for sentiment prediction. The fundamental aim of this paper is to validate and study the possibility of utilizing machine learning algorithms with word-level sentiment annotations for automated sentiment identification. Furthermore, accuracy is improved by annotating the bi-grams extracted from the target corpus. (This work was presented at the Student Research Workshop at the 56th Annual Meeting of the Association for Computational Linguistics, ACL.)
International Joint Conference on Artificial Intelligence, 2016
In this paper, an approach to detect the sentiment of a song based on its multi-modal nature (text and audio) is presented. The textual lyric features are extracted with a bag-of-words representation, and Doc2Vec uses these features to generate a single vector for each song. A Support Vector Machine (SVM), a Naive Bayes (NB) classifier, and a combination of the two are developed to classify sentiment from the textual lyric features. Audio features, used as an add-on to the lyrical ones, include prosody, temporal, spectral, tempo, and chroma features. Gaussian Mixture Models (GMMs), an SVM, and a combination of the two are developed to classify sentiment from the audio features. GMMs are known for capturing the distribution of the features, and SVMs are known for discriminating between them; hence these models are combined to improve the performance of sentiment analysis. Performance is further improved by combining the text and audio feature domains. The text and audio features are extracted at the beginning, at the end, and over the whole song. From our experimental results, we observe that the first 30 seconds (30s) of a song give better performance for detecting its sentiment than the last 30s or the whole song.
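A minimal sketch of the GMM + SVM combination: one GMM is fitted per class, its per-class likelihoods are normalized into probabilities, and these are fused with the SVM's probabilities by simple averaging. The equal-weight averaging and component count are assumptions; the paper's exact fusion rule may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def fit_gmm_svm(X, y, n_classes=2, n_components=4):
    gmms = [GaussianMixture(n_components).fit(X[y == c]) for c in range(n_classes)]
    svm = SVC(probability=True).fit(X, y)
    return gmms, svm

def predict(gmms, svm, X):
    # Column c holds the log-likelihood of each sample under class c's GMM.
    loglik = np.column_stack([g.score_samples(X) for g in gmms])
    gmm_prob = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    gmm_prob /= gmm_prob.sum(axis=1, keepdims=True)   # normalize to probabilities
    combined = 0.5 * gmm_prob + 0.5 * svm.predict_proba(X)
    return combined.argmax(axis=1)
```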
Recent technological advancements in Internet and social media usage have resulted in the evolution of faster and more efficient communication platforms. These platforms span visual, textual, and speech mediums and have brought about a unique social phenomenon called Internet memes: images with witty, catchy, or sarcastic text descriptions. In this paper, we present a multi-modal sentiment analysis system using deep neural networks that combines Computer Vision and Natural Language Processing. Our aim differs from the usual sentiment analysis goal of predicting whether a text expresses positive or negative sentiment; instead, we aim to classify an Internet meme as positive, negative, or neutral, identify the type of humor expressed, and quantify the extent to which a particular effect is being expressed. Our system, developed using a CNN and an LSTM, outperformed the baseline score.
The exponential rise of social media networks has allowed the production, distribution, and consumption of data at a phenomenal rate. Moreover, the social media revolution has brought a unique phenomenon to these platforms: Internet memes, among the most popular content on social media, typically images with a witty, catchy, or satirical text description. In this paper, we deal with the propaganda that has often appeared in Internet memes in recent times. Propaganda is communication that frequently includes psychological and rhetorical techniques to manipulate or influence an audience to act or respond as the propagandist wants. To detect propaganda in Internet memes, we propose a multimodal deep learning fusion system that fuses text and image feature representations and outperforms individual models based solely on either the text or the image modality.
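A minimal sketch of the fusion step: text and image embeddings are concatenated and passed to a small classifier head. The upstream feature extractors (a text encoder and an image CNN) are assumed to run separately, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),  # fused representation
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, n_classes),             # propaganda vs. not
        )

    def forward(self, text_feat, image_feat):
        return self.head(torch.cat([text_feat, image_feat], dim=-1))
```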
Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of-the-art results on various NLP tasks. Pre-training is usually independent of the downstream task, and previous works have shown that pre-training alone might not be sufficient to capture task-specific nuances. We propose a way to tailor a pre-trained BERT model to the downstream task via task-specific masking before the standard supervised fine-tuning. First, a word list specific to the task is collected; for example, if the task is sentiment classification, we collect a small sample of words representing both positive and negative sentiments. Next, a word's importance for the task, called its task score, is measured using the word list, and each word is assigned a probability of masking based on its task score. We experiment with different masking functions that map task scores to masking probabilities. The BERT model is further trained on the MLM objective with this masking strategy, followed by standard supervised fine-tuning on the downstream tasks. Results on these tasks show that the selective masking strategy outperforms random masking, indicating its effectiveness.
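A minimal sketch of task-guided masking: a token's masking probability grows with its task score, here reduced to a binary hit against the task word list. The scoring and masking functions in the paper may be more elaborate; the probabilities below are illustrative.

```python
import random

def mask_tokens(tokens, task_words, base_p=0.15, boost_p=0.5, mask="[MASK]"):
    """Mask task-relevant words with probability boost_p, others with base_p."""
    masked, labels = [], []
    for tok in tokens:
        p = boost_p if tok.lower() in task_words else base_p
        if random.random() < p:
            masked.append(mask)
            labels.append(tok)      # MLM target: predict the original token
        else:
            masked.append(tok)
            labels.append(None)     # not a prediction target
    return masked, labels

sentiment_words = {"good", "great", "terrible", "awful", "love", "hate"}
# mask_tokens("the food was great but service was awful".split(), sentiment_words)
```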
2022 International Joint Conference on Neural Networks (IJCNN), Jul 18, 2022
Graph Convolutional Networks (GCNs) have achieved state-of-the-art results on single text classification tasks like sentiment analysis and emotion detection. However, this performance has been demonstrated by testing and reporting on resource-rich languages like English, and applying GCNs to multi-task text classification remains unexplored. Moreover, training a GCN or adapting an English GCN to Indian languages is often limited by data availability, rich morphological variation, syntax, and semantic differences. In this paper, we study the use of GCNs for the Telugu language in single- and multi-task settings for four natural language processing (NLP) tasks, viz. sentiment analysis (SA), emotion identification (EI), hate speech (HS), and sarcasm detection (SAR). To evaluate the performance of GCNs on an Indian language, Telugu, we analyze GCN-based models with extensive experiments on the four downstream tasks. In addition, we create an annotated Telugu dataset, TEL-NLP, for the four NLP tasks. Further, we propose a supervised graph reconstruction method, Multi-Task Text GCN (MT-Text GCN), for Telugu that simultaneously (i) learns low-dimensional word and sentence graph embeddings via word-sentence graph reconstruction using a graph autoencoder (GAE) and (ii) performs multi-task text classification using these latent sentence graph embeddings. We show that our proposed MT-Text GCN achieves significant improvements on TEL-NLP over existing Telugu pretrained word embeddings [1] and the multilingual pretrained Transformer models mBERT [2] and XLM-R [3]. On TEL-NLP, we achieve high F1-scores on the four NLP tasks: SA (0.84), EI (0.55), HS (0.83), and SAR (0.66). Finally, we present quantitative and qualitative analyses of our model on the four NLP tasks in Telugu. We open-source our TEL-NLP dataset, pretrained models, and code.
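A minimal graph-autoencoder sketch: one GCN layer encodes the nodes, and the decoder reconstructs the adjacency as sigmoid(Z Zᵀ). This illustrates only the reconstruction objective at the core of the method, not the full MT-Text GCN pipeline with its multi-task heads.

```python
import torch
import torch.nn as nn

class GAE(nn.Module):
    def __init__(self, in_dim, hid_dim=128):
        super().__init__()
        self.weight = nn.Linear(in_dim, hid_dim, bias=False)

    def encode(self, X, A_hat):
        # One GCN layer: Z = ReLU(A_hat X W); A_hat is the normalized adjacency.
        return torch.relu(A_hat @ self.weight(X))

    def forward(self, X, A_hat):
        Z = self.encode(X, A_hat)
        return torch.sigmoid(Z @ Z.T)   # reconstructed adjacency

# model = GAE(in_dim=X.size(1))
# loss = nn.BCELoss()(model(X, A_hat), A)   # graph reconstruction objective
```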
In this paper, we present our work on building a Natural Language Interface to Database (NLIDB) system using an intermediate-query approach. The approach is demonstrated with a movie-domain chatbot and can be extended to other domains. The need for NLIDB systems has increased in this fast-paced world, where more and more users access databases through their smartphones and web browsers. An NLIDB system maps a user's natural language query to a database query, allowing the user to extract information without any prior experience with databases. The results obtained are very promising, and the system can handle most user queries regarding the target database.
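A toy sketch of the intermediate-query idea: the natural language question is first mapped to a structured intermediate form, which is then rendered as SQL. The slot pattern, schema (a `movies` table with `title` and `director` columns), and function names are all invented for illustration.

```python
import re

def to_intermediate(question: str) -> dict:
    """Map a supported NL question to a structured intermediate query."""
    m = re.search(r"movies directed by (.+)", question.lower())
    if m:
        return {"select": "title", "from": "movies",
                "where": ("director", m.group(1).strip().rstrip("?"))}
    raise ValueError("unsupported question pattern")

def to_sql(iq: dict) -> str:
    """Render the intermediate query as SQL."""
    col, val = iq["where"]
    return f"SELECT {iq['select']} FROM {iq['from']} WHERE {col} = '{val}'"

print(to_sql(to_intermediate("Which movies directed by Satyajit Ray?")))
# SELECT title FROM movies WHERE director = 'satyajit ray'
```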
Document summarization aims to create a precise and coherent summary of a text document. Many deep learning summarization models are developed mainly for English, often requiring a large training corpus and efficient pre-trained language models and tools. Summarization models for low-resource Indian languages, however, are often limited by rich morphological variation, syntax, and semantic differences. In this paper, we propose GAE-ISUMM, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, GAE-ISUMM uses a Graph Autoencoder (GAE) to jointly learn text representations and a document summary. We also provide a manually annotated Telugu summarization dataset, TELSUM, to experiment with GAE-ISUMM. Further, we experiment with most of the publicly available Indian-language summarization datasets to investigate the effectiveness of GAE-ISUMM on other Indian languages. Our experiments with GAE-ISUMM in seven languages yield the following observations: (i) it is competitive with or better than state-of-the-art results on all datasets, (ii) it reports benchmark results on TELSUM, and (iii) the inclusion of positional and cluster information in the proposed model improves the quality of summaries. We open-source our dataset and code.
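A minimal sketch of the extraction step: sentences are scored by cosine similarity between their embeddings and the document centroid, and the top-k are returned in document order. GAE-ISUMM learns its sentence embeddings jointly with the summary; here any precomputed embeddings stand in, and the centroid scoring is an illustrative simplification.

```python
import numpy as np

def extract_summary(sent_embs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k sentences closest to the document centroid."""
    centroid = sent_embs.mean(axis=0)
    norms = np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(centroid)
    scores = sent_embs @ centroid / np.maximum(norms, 1e-9)  # cosine similarity
    return sorted(np.argsort(-scores)[:k].tolist())          # keep document order
```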
Text classification is a fundamental problem in natural language processing. It mainly focuses on giving more importance to the relevant features that help classify textual data; beyond these, the text can contain redundant or highly correlated features, which increase the complexity of the classification algorithm. Thus, many dimensionality reduction methods have been proposed for use with traditional machine learning classifiers, and this combination has achieved good results. In this paper, we propose a hybrid feature selection method that obtains relevant features by combining various filter-based feature selection methods with the fastText classifier. We then present three ways of implementing a feature selection and neural network pipeline. We observe a reduction in training time when feature selection methods are used along with neural networks, as well as a slight increase in accuracy on some datasets.
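A minimal sketch of hybrid filter-based selection: features are ranked under two filters (chi-squared and mutual information), the ranks are summed, and the best k features are kept before training a classifier. The rank-sum combination rule is an assumption for illustration; the paper's exact hybrid scheme may differ.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def hybrid_select(X, y, k=1000):
    """Keep the k features with the best combined rank under two filters."""
    chi_scores, _ = chi2(X, y)                  # requires non-negative features
    mi_scores = mutual_info_classif(X, y)
    # argsort of argsort turns scores into ranks (0 = best under that filter).
    ranks = np.argsort(np.argsort(-chi_scores)) + np.argsort(np.argsort(-mi_scores))
    keep = np.argsort(ranks)[:k]
    return X[:, keep], keep
```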
Code-mixing is a frequently observed phenomenon in multilingual communities where a speaker uses multiple languages in an utterance or sentence. Code-mixed texts are abundant, especially on social media, and pose a problem for NLP tools, as these are typically trained on monolingual corpora. Recently, sentiment analysis of code-mixed text has been attempted by researchers in the SentiMix SemEval 2020 and Dravidian-CodeMix FIRE 2020 shared tasks. Mostly, the attempts include traditional methods, long short-term memory networks, convolutional neural networks, and transformer models for code-mixed sentiment analysis (CMSA); however, no study has explored graph convolutional neural networks for CMSA. In this paper, we propose graph convolutional networks (GCN) for sentiment analysis on code-mixed text, using the datasets from Dravidian-CodeMix FIRE 2020. Our experimental results on multiple CMSA datasets demonstrate that the GCN model with multi-headed attention improves the classification metrics.
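A minimal sketch of a single GCN layer over a text graph: H' = ReLU(Â H W), where Â is the symmetrically normalized adjacency with self-loops. The multi-headed attention used in the paper is not shown; this illustrates only the graph convolution itself.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))                  # add self-loops
        deg_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)         # D^{-1/2}
        A_norm = deg_inv_sqrt[:, None] * A_hat * deg_inv_sqrt[None, :]
        return torch.relu(A_norm @ self.linear(H))        # ReLU(Â H W)
```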