Papers by Md Saiful Islam
21st International Conference of Computer and Information Technology (ICCIT), 2018
Automatic text categorization is a primary step in information retrieval, where it is necessary to find the most relevant documents in an enormous volume. It is also useful in a wide range of web domains, from portal sites to news indexing and from spam filtering to genre tagging. A significant amount of research has been carried out in this field, and it is mostly dominated by Support Vector Machine (SVM) models. Although these models have been very successful, they require careful feature engineering to achieve optimum results. In this paper, we propose a model for Bengali text categorization that doesn't require feature engineering and is able to capture nonlinearity in data. We first find a lower-dimensional representation of the tf-idf vector of each document using denoising autoencoders, and then feed this transformed vector into a deep feedforward network to find its most plausible category. We also show empirically that our model achieves 94.05% accuracy for 12 categories, which surpasses the best existing models on Bengali text categorization.
Applied Soft Computing Journal, 2019
Recommender systems play an important role in quickly identifying and recommending the most acceptable products to users. The latent user factors and item characteristics determine the degree of user satisfaction with an item. While many of the methods in the literature assume that these factors are linear, some other methods treat them as nonlinear, but they do so in a more implicit way. In this paper, we investigate the effect of the true nature (i.e., nonlinearity) of the user factors and item characteristics, and of their complex layered relationship, on rating prediction. We propose a new deep feedforward network that learns both the factors and their complex relationship concurrently. The aim of our study was to automate the construction of user profiles and item characteristics without using any demographic information, and then to use these constructed features to predict the degree of acceptability of an item to a user. We constructed the user and item factors by using separate learner weights at the lower layers, and modeled their complex relationship in the upper layers. Constructing the user profiles and the item characteristics solely from rating triples (i.e., user id, item id, rating) removes the requirement that explicit demographic information be given to the system. We tested our model on three real-world datasets: Jester, Movielens, and Yahoo music. Our model produces better rating predictions than some of the state-of-the-art methods that use demographic information. The root mean squared errors incurred by our model on these datasets are 4.0873, 0.8110, and 0.9408, respectively. These errors are smaller than those of the best existing models on these datasets.
The results show that our system can be integrated into any web store where developing hand-engineered features for recommending products is less feasible due to heavy traffic, and where demographic information about the users and the items is lacking.
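The results above are reported as root mean squared error (RMSE). As a minimal sketch of how that metric is computed from predicted and observed ratings (the function name `rmse` and the sample values are illustrative and not taken from the paper):

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the squared differences, then take the square root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings, for illustration only.
print(rmse([3.0, 4.0], [1.0, 4.0]))
```

Because RMSE is expressed in the units of the rating scale, values from datasets with different rating ranges are not directly comparable with one another.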
A Neural Network Approach for Bangla POS Tagger, 2018
In the field of sentiment classification, the opinions or sentiments of people are analyzed. Sentiment analysis systems are being applied on social platforms and in almost every business, because opinions and sentiments reflect the beliefs, choices and activities of people. With these systems it is possible to make decisions ranging from business matters to political agendas. In recent times, a huge number of people share their opinions across the Internet in Bengali. In this paper, a new way of classifying the sentiment of Bengali text using a Recurrent Neural Network (RNN) is presented. Using a deep recurrent neural network with BiLSTM, an accuracy of 85.67% is achieved.
A Neural Network Approach for Bangla POS Tagger, 2018
Although much research has been done on the Bangla language, the results of Bangla word POS tagging have not improved as much as expected. Some rule-based POS taggers already exist for Bangla. Here, a neural network method is proposed for POS tagging. This method also uses a dynamic programming technique to reduce its overall time complexity.
International Conference on Bangla Speech and Language Processing(ICBSLP), 2018
Student learning pedagogy detection requires a huge amount of data from students. An efficient process for collecting the data is a major factor here. This paper proposes an approach based on a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) and mainly shows a method of using this OCR system to extract the information a student has filled in on a specialized form. This form contains 170 cells. Some of these cells are to be filled with capital English letters and others with English numerals. This paper discusses a method for feature extraction and the use of a CNN to identify each cell. Using this method, we could predict 96.87% of numeric data and 94.36% of alphabetic data accurately.
21st International Conference of Computer and Information Technology (ICCIT), 2018
Word embeddings can be used by the deep layers of neural networks to extract features and learn the stylometric patterns of authors, based on the context and co-occurrence of words, in the field of authorship attribution. In this paper, we investigate the effects of different types of word embeddings on authorship attribution of Bengali literature, specifically the skip-gram and continuous-bag-of-words (CBOW) models generated by Word2Vec and fastText, along with the word vectors generated by GloVe. We experiment with dense neural network models, such as convolutional and recurrent neural networks, analyse how different word embedding models affect the performance of the classifiers, and discuss their properties in this classification task. The experiments are performed on a data set we prepared, consisting of 2400 online blog articles from 6 authors of recent times.
IEEE International Conference on Imaging, Vision & Pattern Recognition (icIVPR), 2017
This paper presents the development process of the SUST-Bangla Handwritten Numeral Database (SUST-BHND). We extracted handwritten Bengali digits from twenty-one hundred pre-designed forms filled in by different people. After data retrieval, cleaning, processing and error analysis, we created a database consisting of 101065 sample images. It provides a basic database for Bangla OCR and script identification research. Finally, a deep convolutional neural network was trained on the database, which led to an accuracy of around 99.4%.
Speech recognition has received less attention in the Bengali literature due to the lack of a comprehensive dataset. In this paper, we describe the development process of the first comprehensive Bengali speech dataset on real numbers. It covers all the possible words that may arise in uttering any Bengali real number. The corpus has ten native Bengali speakers from different regions. It comprises more than two thousand speech samples with a total duration of close to four hours. We also provide a deep analysis of our corpus, highlight some of its notable features, and finally evaluate the performance of two notable Bengali speech recognizers on it.
The Rohingya Movement and Crisis caused a huge uproar in the political and economic state of Bangladesh. Refugee movement is a recurring event, and a large amount of data in the form of opinions remains on social media such as Facebook, with very little analysis done on it. To analyse the comments on all Rohingya-related posts, we had to create and modify a classifier based on the Support Vector Machine algorithm. The code is implemented in Python and uses the scikit-learn library. A dataset for Rohingya analysis is not currently available, so we had to use our own data set of 2500 positive and 2500 negative comments. We specifically used a support vector machine with a linear kernel. We previously performed an experiment on the same dataset using the naïve Bayes algorithm, but that did not yield impressive results.
This paper explores the authorship attribution problem in modern Bengali literature. By scrutinizing the writings of six contemporary Bangladeshi columnists using some established and modified stylometric features, the writing patterns of these writers were observed. Through statistical analysis of a corpus containing around seven hundred articles, the most effective style markers that create a significant difference among authors were identified. Based on these features, a classification model and a voting system were developed to identify the original author of an unknown document. The developed voting system achieved a 90.67% accuracy rate on a test corpus of three hundred articles.
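The abstract does not say how the voting system combines its votes. As a sketch under the assumption of plain majority voting, where each style marker (or classifier) casts one vote for a candidate author, the combination step could look like this (`majority_vote` and the author labels are hypothetical):

```python
from collections import Counter


def majority_vote(votes):
    """Return the label with the most votes; ties go to the earliest vote."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    top = max(counts.values())
    # Scan in the original order so that ties are broken deterministically.
    for label in votes:
        if counts[label] == top:
            return label


# Hypothetical votes cast by three style markers for one unknown document.
print(majority_vote(["author_A", "author_B", "author_A"]))
```

A weighted scheme, in which more discriminative style markers count for more, would be an equally plausible reading of the abstract.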
This paper presents an approach to categorizing Bangla-language questions into predefined coarse-grained categories that represent the expected answer type of each question. A support vector machine was used with different kernel functions to increase the accuracy of the existing Bangla question classification system. Both a predefined feature set and a stream of unigrams based on their frequency in the data set were considered to build the feature matrix. With five-fold cross-validation, an average accuracy of 89.14% was achieved using the 380 most frequent words as features, which outperformed the existing single-model-based Bangla question classification system. With the same cross-validation, 88.62% accuracy was achieved with a combination of wh-word, wh-word position and question length as the feature set.
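The second feature set (wh-word, wh-word position, question length) can be illustrated with a small extractor. This is only a sketch: the wh-word list, whitespace tokenization, and the function name `question_features` are assumptions made for illustration, not the paper's actual preprocessing.

```python
# A small illustrative list of Bangla wh-words (assumed, not from the paper).
DEFAULT_WH_WORDS = ("কি", "কে", "কখন", "কেন", "কত")


def question_features(question, wh_words=DEFAULT_WH_WORDS):
    """Extract the wh-word, its token position, and the question length."""
    tokens = question.split()
    for position, token in enumerate(tokens):
        # Strip trailing punctuation (question mark, Bangla danda) first.
        word = token.strip("?।")
        if word in wh_words:
            return {"wh_word": word, "wh_position": position, "length": len(tokens)}
    # No wh-word found: mark its position as -1.
    return {"wh_word": None, "wh_position": -1, "length": len(tokens)}


print(question_features("তুমি কখন আসবে?"))
```

Before being passed to a support vector machine, the categorical wh-word would still need a numeric encoding such as one-hot.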
Computers are now smart enough to interact with humans in different ways. This interaction will be more acceptable for both human and computer if it is based on a recognition process. In this article, the author's concern is to integrate and develop a student recognition system using existing algorithms. Among various face recognition methods, the author uses a deep-learning-based face recognition method. This method uses Convolutional Neural Networks (CNN) to generate a low-dimensional representation called embeddings. Those embeddings are then used to classify the person's facial image. With this system, different types of applications, such as student attendance systems and building security, can be developed.
Exponential growth in information has made it practically impossible to manually find a relevant product quickly, entailing the need for an automated recommendation system that remembers users and recommends the most suitable items. Most approaches to such systems first find similarities among users or among items, and then exploit these similarities to recommend products. These methods produce better results when demographic information about users and items is given to them. In this paper, we propose a deep neural network model which does not require any information other than the rating triples. We created spurious user profiles and item characteristics by using separate learner weights at the bottommost layer. The weights in the upper layers took this information, created by the weights at the bottommost layer, to produce a real-valued rating. Our model produced an RMSE of 4.1824 on the Jester 4-million dataset, which shows that our deep network is comparable to state-of-the-art models.
Handwritten character recognition is a nontrivial task, as it seeks to recognize the correct class for user-independent handwritten characters. This problem becomes even more challenging for a language like Bengali, whose characters are highly stylized, morphologically complex, and potentially juxtapositional. As a result, the improvements over the years in Bengali character recognition are significantly smaller than for other languages. In this paper, we propose a convolutional deep model to recognize Bengali handwritten characters. We first learn a useful set of features by using kernels and local receptive fields, and then employ densely connected layers for the discrimination task. Our system has been tested on the BanglaLekha-Isolated dataset. It achieves 98.66% accuracy on numerals (10 character classes), 94.99% accuracy on vowels (11 character classes), 91.60% accuracy on compound letters (20 character classes), 91.23% accuracy on alphabets (50 character classes), and 89.93% accuracy on almost all Bengali characters (80 character classes). Most of the errors incurred by our model in the recognition task are due to extreme proximity in shape among characters. A significant number of errors were caused by mislabeled, irrecoverably distorted, and illegal data examples.
Wisdom of Crowds is often considered a very powerful tool for prediction. In this paper, we explore the power of public sentiment in predicting the success of movies. In short, we differentiate between positive and negative comments using a Support Vector Machine and then use statistical reasoning to predict movie success. We used a nonlinear RBF kernel for our sentiment classifier, which achieved better accuracy than classifiers that use linear kernels on the well-known IMDB Movie Review Dataset (89.51% accuracy) and on the Pang and Lee Movie Review Dataset (86.86% accuracy). Using our system, we can predict whether a movie will be successful with an accuracy of 90.3%. We also compare our approach with those of other authors in the literature.
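For readers unfamiliar with the RBF kernel mentioned above, it scores two feature vectors as exp(-gamma * ||x - y||^2), so identical vectors score 1 and distant vectors approach 0. A minimal sketch follows; the function name, the `gamma` default, and the sample vectors are illustrative choices, not values from the paper.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give 1.0; the value decays as the vectors move apart.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
print(rbf_kernel([0.0, 0.0], [1.0, 0.0]))
```

Unlike a linear kernel, this similarity is not a plain dot product, which is what lets the resulting classifier draw nonlinear decision boundaries.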
Speech recognition may be an intuitive process for humans, but making a computer recognize speech automatically turns out to be daunting. Although recent progress in speech recognition has been very promising in other languages, Bengali lacks such progress. Very little research has been published on Bengali speech recognizers. In this paper, we investigate a long short-term memory (LSTM) approach, a type of recurrent neural network, to recognizing individual Bengali words. We divided each word into a number of frames, each containing 13 mel-frequency cepstral coefficients (MFCC), providing us with a useful set of distinctive features. We trained a deep LSTM model with the frames to recognize the most plausible phonemes. The final layer of our deep model is a softmax layer with as many units as there are phonemes. We picked the most probable phoneme for each time frame. Finally, we passed these phonemes through a filter, which produced individual words as the output. Our system achieves a word detection error rate of 13.2% and a phoneme detection error rate of 28.7% on the Bangla-Real-Number audio dataset.
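The per-frame decoding step described above can be sketched as follows. The abstract does not specify the filter, so this example assumes the simplest possible one: take the most probable phoneme in each frame, then collapse consecutive repeats. The phoneme labels and probabilities are made up for illustration.

```python
from itertools import groupby


def decode_frames(frame_probs, phonemes):
    """Pick the most probable phoneme per frame, then merge consecutive repeats."""
    best_per_frame = []
    for probs in frame_probs:
        # Index of the largest softmax output for this frame.
        best_index = max(range(len(probs)), key=probs.__getitem__)
        best_per_frame.append(phonemes[best_index])
    # groupby merges runs of the same phoneme spanning several frames.
    return [phoneme for phoneme, _ in groupby(best_per_frame)]


# Hypothetical softmax outputs for three frames over three phoneme classes.
labels = ["a", "k", "sil"]
frames = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05]]
print(decode_frames(frames, labels))
```

Mapping the collapsed phoneme sequence to a word would then require a pronunciation lexicon, which is outside the scope of this sketch.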
Speech recognition is a widely researched topic around the world. It is the process of converting speech to text. Many scientists and researchers are working to increase the performance of speech recognition systems. Most of the world's languages have speech recognizers of their own, but our mother tongue, Bangla, has no working speech recognizer. This work is a small attempt to build a Bengali speech recognizer to enrich our language. In this paper, we propose a novel approach to developing an automatic Bangla real number recognizer and analyze the performance of this recognition system using the most popular speech recognizer API, CMU Sphinx 4, and a popular Bangla Unicode-based writing software called Avro.
The vector representation of Bengali words using the word2vec model (Mikolov et al. (2013)) plays an important role in Bengali sentiment classification. It is observed that words from the same context stay closer together in the vector space of the word2vec model and are more similar than other words. In this article, a new approach to sentiment classification of Bengali comments using word2vec and sentiment extraction of words is presented. Combining the word2vec word co-occurrence score with the sentiment polarity score of the words, the accuracy obtained is 75.5%.
Document categorization is a technique through which the category of a document is determined. This paper deals with the automatic classification of Bangla documents. In the proposed categorization system, a support vector machine is used to classify a document into twelve predefined categories. In this classification model, TF-IDF (term frequency-inverse document frequency) weighting with length normalization is used for feature selection after preprocessing of the data set is complete. It is shown that the results achieved by applying an SVM to classify the category of a Bangla document are very promising compared to conventional methods where features are chosen on the basis of bag-of-words. The accuracy of the proposed methodology is 92.57% for twelve categories.
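As a sketch of TF-IDF weighting with length normalization, the example below uses one common variant: raw term counts, idf = log(N / df), and L2 normalization of each document vector. The paper may use a different variant, and the function name and toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Compute L2-normalized TF-IDF weights for a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        weights = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Length normalization: divide by the Euclidean norm of the vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors


# Toy corpus with English stand-in tokens, for illustration only.
docs = [["cricket", "match", "cricket"], ["election", "match"]]
for vector in tfidf_vectors(docs):
    print(vector)
```

Terms that occur in every document (here `match`) receive zero weight, which is why TF-IDF tends to suppress uninformative words without a hand-built stop list.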
Sentiment analysis is one of the most important and challenging research topics in the field of natural language processing and opinion mining. In this article, six different approaches to determining the actual sentiment of a sentence are discussed and their performance is analyzed. In the parts-of-speech ratio method, the parts of speech (POS) of the queries are tagged, and the POS ratio and the Hamming distance between the positive classifier and the query, and between the negative classifier and the query, are computed. To detect the sentiment more accurately, cosine similarity using TF-IDF is applied, which involves computing TF, DF and IDF and then calculating the positive vector, negative vector and query vector. In cosine similarity using custom TF-IDF, a custom POS tagger is used and TF, DF and IDF are computed. Another method, a Naïve Bayes model using unigrams and a stemmer, also gives good performance. In this approach, the prior and conditional probabilities are calculated and the root words are extracted. The Naïve Bayes model using bigrams, a stemmer and a normalizer performs better than the other models. The last method discussed is word embedding with Hellinger PCA, which uses a word co-occurrence matrix and skip-gram to determine the actual contexts of the words, and Hellinger PCA to determine the most similar words and generate a sliding window of the most probable context words around each word.
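The cosine-similarity approach compares a query vector against a positive vector and a negative vector. A minimal sketch of that comparison is given below, assuming sparse vectors stored as term-to-weight dictionaries; the weights, the tie-breaking rule, and the function names are illustrative and not taken from the article.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (term -> weight dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def classify(query, positive, negative):
    """Label the query by whichever class vector it is closer to."""
    pos_score = cosine_similarity(query, positive)
    neg_score = cosine_similarity(query, negative)
    # Ties are resolved in favor of "positive" (an arbitrary choice).
    return "positive" if pos_score >= neg_score else "negative"


# Hypothetical TF-IDF weights with English stand-in tokens.
positive_vector = {"good": 1.0, "great": 1.0}
negative_vector = {"bad": 1.0}
print(classify({"good": 1.0}, positive_vector, negative_vector))
```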