Background: Customer churn prediction (CCP) refers to detecting which customers are likely to can... more Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services provided by a service provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when there is a huge difference in the samples of positive class compared to the negative class. It is one of the major obstacles in CCP as it deteriorates performance in the classification process. Utilizing data sampling techniques (DSTs) helps to resolve the CIP to some extent. Methods: In this paper, we review the effect of using DSTs on algorithmic fairness, i.e., to investigate whether the results pose any discrimination between male and female groups and compare the results before and after using DSTs. Three real-world datasets with unequal balancing rates were prepared and four ubiquitous DSTs were applied to them. Six popular classification techniques were utilized in the classification process. Both classifier's performance and algorithmic fairness are evaluated with notable metrics. Results: The results indicated that Random Forest classifier outperforms other classifiers in all three datasets and, using SMOTE and ADASYN techniques cause more discrimination in the female group. The rate of unintentional discrimination seems to be higher in the original data of extremely unbalanced datasets under the following classifiers: Logistics Regression, LightGBM, and XGBoost. Conclusions: Algorithmic fairness has become a broadly studied area in recent years, yet there is a very little systematic study on the effect of using DSTs on algorithmic fairness. This study presents important findings to further the use of algorithmic fairness in CCP research.
Malaysian Journal of Computer Science, Apr 24, 2020
Due to the mass availability of textual data on Web, text classification (TC), classifying texts ... more Due to the mass availability of textual data on Web, text classification (TC), classifying texts into predetermined sets becomes a spotlight for researchers. A number of TC applications have been proposed yet very few studies reported an overview of TC research area in a proper and systematic manner. This paper aims to provide an overview of TC research trends and gaps by structuring and analyzing research patterns, encountered problems and problem-solving methods in TC. In other words, this study highlights problem types, data sources, choice of language of text and types of applied techniques in TC. An intensive systematic study is conducted by applying guidelines proposed by Petersen and colleagues in 2007. In this paper, ninety-six literatures from five electronic databases from 2006 to 2017 were systematically reviewed and followed each and every step properly in accordance with systematic mapping study. Nine main problems in TC research area were identified and significant findings which highlighted the evolution of TC research within the past 12 years were investigated. Different from other review articles, this paper highlighted issues and technical gaps of TC area in a useful and effective manner.
Background: Customer churn prediction (CCP) refers to detecting which customers are likely to can... more Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services provided by a service provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when there is a huge difference in the samples of positive class compared to the negative class. It is one of the major obstacles in CCP as it deteriorates performance in the classification process. Utilizing data sampling techniques (DSTs) helps to resolve the CIP to some extent. Methods: In this paper, we review the effect of using DSTs on algorithmic fairness, i.e., to investigate whether the results pose any discrimination between male and female groups and compare the results before and after using DSTs. Three real-world datasets with unequal balancing rates were prepared and four ubiquitous DSTs were applied to them. Six popular classification techniques were utilized in the classification process. Both classifier’s performance and algorithmi...
This paper presents a new set of features for land-line customer churn prediction, including 2 si... more This paper presents a new set of features for land-line customer churn prediction, including 2 six-month Henley segmentation, precise 4-month call details, line information, bill and payment information, account information, demographic profiles, service orders, complain information, etc. Then the seven prediction techniques (Logistic Regressions, Linear Classifications, Naive Bayes, Decision Trees, Multilayer Perceptron Neural Networks, Support Vector Machines and the Evolutionary Data Mining Algorithm) are applied in customer churn as predictors, based on the new features. Finally, the comparative experiments were carried out to evaluate the new feature set and the seven modelling techniques for customer churn prediction. The experimental results show that the new features with the six modelling techniques are more effective than the existing ones for customer churn prediction in the telecommunication service field.
Due to the mass availability of textual data on Web, text classification (TC), classifying texts ... more Due to the mass availability of textual data on Web, text classification (TC), classifying texts into predetermined sets becomes a spotlight for researchers. A number of TC applications have been proposed yet very few studies reported an overview of TC research area in a proper and systematic manner. This paper aims to provide an overview of TC research trends and gaps by structuring and analyzing research patterns, encountered problems and problem-solving methods in TC. In other words, this study highlights problem types, data sources, choice of language of text and types of applied techniques in TC. An intensive systematic study is conducted by applying guidelines proposed by Petersen and colleagues in 2007. In this paper, ninety-six literatures from five electronic databases from 2006 to 2017 were systematically reviewed and followed each and every step properly in accordance with systematic mapping study. Nine main problems in TC research area were identified and significant fin...
Background: Detecting hateful contents on social media becomes a broad and important research are... more Background: Detecting hateful contents on social media becomes a broad and important research area along with the popularity of social media. Objective: This paper aims primarily to understand the different techniques applied within the scope of detecting the use of hateful language on social media, their strengths and challenges to provide a solid and concrete reference to future researchers and practitioners. Methodology: In this paper, we investigated previous researches done in the domain of hateful contents detection on social media. We selected relevant published journal articles and conference proceedings from 2010 to 2015. Results: We observed that Support Vector Machine (SVM) algorithm is the most frequently applied for text classification. Data ambiguity problem, classification of sarcastic sentences and lack of necessary resources are identified as the difficulties for researchers in detecting the use of hateful contents. Conclusion: Future researchers must pay more atten...
Background: Customer churn prediction (CCP) refers to detecting which customers are likely to can... more Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services provided by a service provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when there is a huge difference in the samples of positive class compared to the negative class. It is one of the major obstacles in CCP as it deteriorates performance in the classification process. Utilizing data sampling techniques (DSTs) helps to resolve the CIP to some extent. Methods: In this paper, we review the effect of using DSTs on algorithmic fairness, i.e., to investigate whether the results pose any discrimination between male and female groups and compare the results before and after using DSTs. Three real-world datasets with unequal balancing rates were prepared and four ubiquitous DSTs were applied to them. Six popular classification techniques were utilized in the classification process. Both classifier's performance and algorithmic fairness are evaluated with notable metrics. Results: The results indicated that Random Forest classifier outperforms other classifiers in all three datasets and, using SMOTE and ADASYN techniques cause more discrimination in the female group. The rate of unintentional discrimination seems to be higher in the original data of extremely unbalanced datasets under the following classifiers: Logistics Regression, LightGBM, and XGBoost. Conclusions: Algorithmic fairness has become a broadly studied area in recent years, yet there is a very little systematic study on the effect of using DSTs on algorithmic fairness. This study presents important findings to further the use of algorithmic fairness in CCP research.
Malaysian Journal of Computer Science, Apr 24, 2020
Due to the mass availability of textual data on Web, text classification (TC), classifying texts ... more Due to the mass availability of textual data on Web, text classification (TC), classifying texts into predetermined sets becomes a spotlight for researchers. A number of TC applications have been proposed yet very few studies reported an overview of TC research area in a proper and systematic manner. This paper aims to provide an overview of TC research trends and gaps by structuring and analyzing research patterns, encountered problems and problem-solving methods in TC. In other words, this study highlights problem types, data sources, choice of language of text and types of applied techniques in TC. An intensive systematic study is conducted by applying guidelines proposed by Petersen and colleagues in 2007. In this paper, ninety-six literatures from five electronic databases from 2006 to 2017 were systematically reviewed and followed each and every step properly in accordance with systematic mapping study. Nine main problems in TC research area were identified and significant findings which highlighted the evolution of TC research within the past 12 years were investigated. Different from other review articles, this paper highlighted issues and technical gaps of TC area in a useful and effective manner.
Background: Customer churn prediction (CCP) refers to detecting which customers are likely to can... more Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services provided by a service provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when there is a huge difference in the samples of positive class compared to the negative class. It is one of the major obstacles in CCP as it deteriorates performance in the classification process. Utilizing data sampling techniques (DSTs) helps to resolve the CIP to some extent. Methods: In this paper, we review the effect of using DSTs on algorithmic fairness, i.e., to investigate whether the results pose any discrimination between male and female groups and compare the results before and after using DSTs. Three real-world datasets with unequal balancing rates were prepared and four ubiquitous DSTs were applied to them. Six popular classification techniques were utilized in the classification process. Both classifier’s performance and algorithmi...
This paper presents a new set of features for land-line customer churn prediction, including 2 si... more This paper presents a new set of features for land-line customer churn prediction, including 2 six-month Henley segmentation, precise 4-month call details, line information, bill and payment information, account information, demographic profiles, service orders, complain information, etc. Then the seven prediction techniques (Logistic Regressions, Linear Classifications, Naive Bayes, Decision Trees, Multilayer Perceptron Neural Networks, Support Vector Machines and the Evolutionary Data Mining Algorithm) are applied in customer churn as predictors, based on the new features. Finally, the comparative experiments were carried out to evaluate the new feature set and the seven modelling techniques for customer churn prediction. The experimental results show that the new features with the six modelling techniques are more effective than the existing ones for customer churn prediction in the telecommunication service field.
Due to the mass availability of textual data on Web, text classification (TC), classifying texts ... more Due to the mass availability of textual data on Web, text classification (TC), classifying texts into predetermined sets becomes a spotlight for researchers. A number of TC applications have been proposed yet very few studies reported an overview of TC research area in a proper and systematic manner. This paper aims to provide an overview of TC research trends and gaps by structuring and analyzing research patterns, encountered problems and problem-solving methods in TC. In other words, this study highlights problem types, data sources, choice of language of text and types of applied techniques in TC. An intensive systematic study is conducted by applying guidelines proposed by Petersen and colleagues in 2007. In this paper, ninety-six literatures from five electronic databases from 2006 to 2017 were systematically reviewed and followed each and every step properly in accordance with systematic mapping study. Nine main problems in TC research area were identified and significant fin...
Background: Detecting hateful contents on social media becomes a broad and important research are... more Background: Detecting hateful contents on social media becomes a broad and important research area along with the popularity of social media. Objective: This paper aims primarily to understand the different techniques applied within the scope of detecting the use of hateful language on social media, their strengths and challenges to provide a solid and concrete reference to future researchers and practitioners. Methodology: In this paper, we investigated previous researches done in the domain of hateful contents detection on social media. We selected relevant published journal articles and conference proceedings from 2010 to 2015. Results: We observed that Support Vector Machine (SVM) algorithm is the most frequently applied for text classification. Data ambiguity problem, classification of sarcastic sentences and lack of necessary resources are identified as the difficulties for researchers in detecting the use of hateful contents. Conclusion: Future researchers must pay more atten...
Uploads
Papers by Maw Maw