The Iraqi dataset was collected by administering a questionnaire in three Iraqi secondary schools, covering both the applied and biology branches of the final stage, during the second semester of 2018. Initially, the questionnaire contained 56 questions on three A4 sheets and was answered by 250 students (samples). Later, 130 samples were discarded for lack of information, since pre-processing was applied to retain only the most complete student records. After removing inconsistencies and incompleteness from the dataset, this study considers 120 sample instances with 55 features for experimental purposes. The features fall into five main categories: Demographic, Economic, Educational, Time, and Marks. Table (1) shows the dataset's attributes/features and their descriptions. As illustrated in this table, new features are introduced, such as holiday and worrying effects. The relationships between parents and schools and the use of books and references by the stu...
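The pre-processing step that reduced 250 questionnaires to 120 complete samples can be sketched with a toy table; the column names and values below are invented for illustration and are not the paper's actual attributes.

```python
import pandas as pd

# Toy frame standing in for the raw questionnaire responses
# (the real dataset has 250 rows and 56 questions).
raw = pd.DataFrame({
    "gender": ["M", "F", None, "M"],
    "branch": ["biology", "applied", "biology", None],
    "grade":  [78, 85, 90, 66],
})

# Drop any response with missing answers, mirroring the step that
# kept only students with complete information.
clean = raw.dropna().reset_index(drop=True)
print(len(clean))  # prints 2: only the fully answered rows survive
```

In the paper's setting the same idea scales up: incomplete or inconsistent questionnaires are dropped before feature analysis begins.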
Indonesian Journal of Electrical Engineering and Computer Science, 2020
Recently, decision trees have become one of the most widely used classification models. They owe their popularity to their efficiency in predictive analytics, ease of interpretation, and implicit feature selection. This last property is of essential significance in Educational Data Mining (EDM), where selecting the most relevant features has a major impact on classification accuracy. The main contribution is a new multi-objective decision tree that can be used for both feature selection and classification. The proposed Decisive Decision Tree (DDT) is constructed based on a decisive feature value, a feature weight related to the target class label. The traditional Iterative Dichotomizer 3 (ID3) algorithm and the proposed DDT are compared on three datasets with respect to known ID3 issues, including logarithmic calculation complexity and multi-value feature selection. The results indicate that the proposed DDT outperforms ID3 in development time. Classification accuracy improves under 10-fold cross-validation for all datasets, with the highest accuracy achieved by the proposed method being 92% for the student.por dataset, and under holdout validation for two datasets, i.e., Iraqi and Student-Math. The experiments also show that the proposed DDT tends to select attributes that are important rather than merely multi-valued.
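For context on the ID3 baseline being compared against: ID3 splits on the feature with the highest information gain, which involves the logarithmic entropy calculation the abstract refers to. The sketch below shows that standard criterion (not the DDT's decisive-feature weighting, which the paper defines); the toy attendance data is invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one feature (ID3's criterion)."""
    total, n = entropy(labels), len(rows)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(y)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return total - remainder

# Tiny toy dataset: does attendance predict pass/fail?
rows = [{"attend": "high"}, {"attend": "high"}, {"attend": "low"}, {"attend": "low"}]
labels = ["pass", "pass", "fail", "fail"]
print(information_gain(rows, labels, "attend"))  # 1.0: a perfect split
```

A known weakness of this criterion, which the DDT targets, is its bias toward attributes with many distinct values, since finer partitions mechanically lower the remainder term.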
Journal of King Saud University - Computer and Information Sciences, 2021
Data reduction has gained growing emphasis due to the rapid, unsystematic increase in digital data and has become a sensible approach in big data systems. Data deduplication is a technique for optimizing storage requirements and plays a vital role in eliminating redundancy in large-scale storage. Although it is robust in finding suitable chunk-level break-points for redundancy elimination, it faces three key problems: (1) low chunking performance, which makes the chunking stage a bottleneck; (2) large variation in chunk size, which reduces deduplication efficiency; and (3) hash computation overhead. To handle these challenges, this paper proposes a technique for finding proper cut-points among chunks using a set of commonly repeated patterns (CRP): it picks the most frequent sequences of adjacent bytes (i.e., contiguous segments of bytes) as breakpoints. In addition, a scalable lightweight triple-leveled hashing function (LT-LH) is proposed to mitigate hashing cost and storage overhead; three hash levels were used in the tests, the number depending on the size of the data to be deduplicated. To evaluate the performance of the proposed technique, a set of tests was conducted to analyze the dataset characteristics and choose a near-optimal length of the byte sequences used as divisors to produce chunks. The performance assessment also determined the system parameter values that lead to an enhanced deduplication ratio and reduce the system resources needed for deduplication. The results demonstrated the effectiveness of the CRP algorithm: it is 15 times faster than the basic sliding window (BSW) and about 10 times faster than two thresholds two divisors (TTTD). The proposed LT-LH is five times faster than Secure Hash Algorithm 1 (SHA1) and Message-Digest Algorithm 5 (MD5), with better storage saving.
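The pattern-based chunking idea can be illustrated with a minimal sketch: cut the stream wherever a designated byte pattern begins, subject to minimum and maximum chunk lengths. This is a simplified stand-in for the paper's CRP technique; the patterns, length bounds, and data below are invented, and SHA-1 stands in for the paper's LT-LH fingerprinting.

```python
import hashlib

def chunk_on_patterns(data: bytes, patterns, min_len=2, max_len=8):
    """Split `data` at occurrences of any breakpoint pattern, respecting
    minimum and maximum chunk lengths (illustrative, not the paper's CRP)."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start
        hit = any(data.startswith(p, i) for p in patterns)
        if length >= max_len or (length >= min_len and hit):
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # trailing chunk
    return chunks

chunks = chunk_on_patterns(b"aaaaXXbbbbXXcc", [b"XX"])
print(chunks)  # [b'aaaa', b'XXbbbb', b'XXcc']

# Identical chunks would collapse to a single fingerprint in the dedup index:
index = {hashlib.sha1(c).hexdigest(): len(c) for c in chunks}
```

Because breakpoints are anchored to content rather than fixed offsets, an insertion early in the stream shifts only nearby chunk boundaries, so later duplicate chunks still hash to the same index entries.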
Electronic (E-)Learning has attracted the attention of researchers in recent years for various reasons, such as easing access to study and guaranteeing education for busy people. Different methods and algorithms have been adopted in e-learning systems to offer more flexible services for students. In addition, recent smart systems employ prediction strategies to anticipate the likely outcomes of different categories in e-learning. Researchers go further with decision making for students, presented as a recommendation for each type of classification. Moreover, e-learning systems use classification and clustering methods to classify the investigated datasets. This paper presents a comprehensive study of recent e-learning decision making and prediction. It offers wide-ranging information on decision making and prediction in e-learning that can help improve them efficiently. Discussion and recommendations are also included.
International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 2019
Educational organizations maintain historical data for future analysis, aiming to predict and improve the benefits, profits, and development of the organization. Decisions about student performance can be made after mining historical student information, which yields useful insight. This paper proposes developing and implementing a Student Performance Prediction System using three data mining methods: Neural Network, Decision Tree, and K-Nearest Neighbor. Furthermore, a comparison of their results is provided on three educational datasets. The experimental results indicate that the Neural Network outperforms K-Nearest Neighbor and the Decision Tree on all datasets under holdout validation, obtaining 97%, 92%, and 83% accuracy for the Iraq, Math, and Por datasets, respectively.
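The three-model holdout comparison can be sketched with scikit-learn; the dataset, hyperparameters, and 70/30 split below are placeholders, not the paper's setup, and the scaler is added simply because neural networks train poorly on unscaled features.

```python
from sklearn.datasets import load_breast_cancer  # placeholder dataset
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Holdout validation: one train/test split rather than cross-validation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Neural Network": make_pipeline(StandardScaler(),
                                    MLPClassifier(max_iter=500, random_state=0)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(accuracy_score(y_te, model.predict(X_te)), 3))
```

Because all three models score on the same held-out split, their accuracies are directly comparable, which is the structure of the comparison the abstract reports.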
Journal of Theoretical and Applied Information Technology, 2019
Educational Data Mining (EDM) is concerned with discovering useful information in educational datasets. In recent years the data has been mounting rapidly, as easy access to e-learning websites draws extraordinary enthusiasm from colleges and educational institutions. High-dimensional, irrelevant, redundant, and noisy data can harm knowledge discovery during the training phase and degrade machine learning accuracy. These factors raise the demand for dataset preparation, analysis, and feature selection. The fundamental aim of this research is to enhance classification precision through data preprocessing and to remove inessential information, without discarding any vital data, by means of feature selection. This paper proposes EDM dataset preprocessing and a hybrid feature selection method combining filter and wrapper techniques. In the filter-based feature selection, the statistical analysis is based on Pearson correlation and information gain. In the wrapper method, the accuracy of each feature subset is tested using a neural network as the baseline algorithm. The obtained results show improved accuracy when selecting a minimal feature subset with high predictive power, compared with using all features.
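The filter-then-wrapper structure can be sketched as follows: rank features with a cheap filter score, then let a neural network (the wrapper) pick the subset size that scores best. Mutual information stands in here for the paper's Pearson-correlation/information-gain filter, and the dataset and all parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine  # placeholder dataset
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Filter step: order features by mutual information with the class label.
ranking = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

# Wrapper step: grow the subset feature-by-feature; keep the best-scoring size.
clf = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))
best_score, best_k = 0.0, 0
for k in range(1, len(ranking) + 1):
    score = cross_val_score(clf, X[:, ranking[:k]], y, cv=3).mean()
    if score > best_score:
        best_score, best_k = score, k

print(f"best subset: top {best_k} features, CV accuracy {best_score:.3f}")
```

The filter keeps the expensive wrapper search linear in the number of features (one candidate subset per rank prefix) instead of exponential over all subsets, which is the practical payoff of the hybrid design.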
TELKOMNIKA Telecommunication Computing Electronics and Control, 2020
The traditional K-nearest neighbor (KNN) algorithm exhaustively searches the complete training set to predict a single test sample. This procedure can slow the system down considerably for huge datasets. The selection of a class for a new sample depends on simple majority voting, which does not reflect the varying significance of different samples (i.e., it ignores the similarities among samples). It also leads to misclassification when a double majority class occurs. To address these issues, this work combines a moment descriptor with KNN to optimize sample selection, based on the fact that classifying the training samples before the search actually takes place can speed up and improve the predictive performance of the nearest neighbor. The proposed method is called fast KNN (FKNN). The experimental results show that the proposed FKNN method decreases the original KNN consuming time withi...
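A minimal KNN with simple majority voting makes the "double majority" tie problem from the abstract concrete. This is illustrative only; the paper's FKNN additionally pre-classifies training samples via moment descriptors, which is not shown here, and the toy points are invented.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Return the majority label among the k nearest training samples,
    or None when simple voting produces a tie (a 'double majority')."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    top = votes.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie: plain majority voting cannot decide
    return top[0][0]

train = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = ["A", "A", "B", "B"]
print(knn_predict(train, labels, (0, 0.5), k=3))    # A: two A votes vs one B
print(knn_predict(train, labels, (2.5, 3.0), k=4))  # None: a 2-2 tie
```

Note also the cost structure the abstract criticizes: every prediction distances the query against the entire training set, which is what pre-classifying samples before the search aims to avoid.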
Papers by Saja Taha