Kamal 2018
Abstract Educational data mining converts the raw data available in educational
settings into useful information. It helps build insights into research questions
that arise in educational settings, such as predicting students’ academic
performance, designing new courses, instructors’ feedback, and the method or mode
of teaching. This paper aims to answer questions that have been a major challenge
for researchers, i.e. the high dropout rate and the lower percentage scored by
first-year students. It highlights factors that affect the performance of
students. Many studies have been conducted on education in fields such as
psychology and statistics. This case study targeted students enrolled in Bachelor
of Computer Applications (BCA). The aim of our research work was to show the
impact of variables on the academic performance of students. The sample size of the
study is 480 BCA students. The questionnaire is based on factors categorized as
Demographic, Academic, Social and Behavioural. The results of the study revealed
that family income, parents’ qualification and interaction with teachers were among
the influential factors, along with previous year percentage, current year attendance
and class behaviour.
1 Introduction
The escalating competition amongst universities means that they increasingly need
the best students, who will pass with better results and bring laurels to their
university’s name. University management has to take important and quick decisions
about introducing new courses [1, 2]. This can be achieved by collecting student data
from different institutes, and this data needs to be valid and reliable. However, the
data available in educational settings is used only for producing simple queries and
traditional reports that are rarely accessible to the right people at the right time.
There is a need to introduce advanced information technologies to effectively
transform available data into information and to make use of this knowledge to
support decision making [3]. The performance of a student needs to be evaluated and
quantified so that the student’s academic performance can be predicted. Performance
measures [4] give us information about student performance [5] and also about the
ability of the student to reach predefined goals. In order to analyze the performance
of a student, data on the factors that affect it need to be collected, such as marks
from the previous class, attendance, socioeconomic factors [6], personal study habits
and the time devoted to studies.
2 Related Work
Romero et al. [7] investigated different data mining approaches to improve the
performance of first-year students on the basis of their participation in online
forums where they discuss their doubts with peer groups. Huang et al. [8] developed
and compared different types of mathematical models for predicting the academic
performance of students in engineering. Nasiri et al. [9] illustrated EDM applications
in a case study of e-learning and online learning; the data of first-year students
was collected from the LMS used by the online education centre of Iran University
of Science and Technology. Kumar et al. [10] used a previous student database
to predict the division of students by means of a classification technique. Shih
B. et al. conducted a study to create a model that distinguishes between good
and bad students who made use of the Bottom-Out-Hint [11]. Ramaswami M. et al.
discussed the type of classroom the student is attending, i.e. traditional classroom
or online education [12]. Vandamme J. P. et al. categorized students into
three categories: the ‘low-risk’ students, who have a high probability of succeeding;
the ‘medium-risk’ students, who may succeed; and the ‘high-risk’ students, who have
a high probability of failing (or dropping out) [13]. Superby et al. [14] classified
the students into different categories based on their personal history, behaviour and
perceptions. Tair, Mohammad M. Abu et al. worked on a case study of undergraduate
students to improve their performance in academics [15].
Hlosta et al. [16] focused on the students who are at risk of failure. Two methods,
GUHA and Markov chain-based graphical models, were explored. Both methods
provided useful insights into the students’ behaviour during their studies. Wolff et al.
carried out recent work at the Open University, where data from the VLE was combined
with demographic data for the prediction of student failure or dropout. They found that
the first assessment is a good predictor of the final performance of students.
They pointed out the right time for intervention in a student’s studies, so that
at-risk students can be identified and offered the assistance required to improve
their chances of success [17]. Kuzilek et al. [18] analyzed the objective of the Open
Academic Performance Prediction Using Data Mining … 837
University project, which was to predict ‘at-risk’ students in the early years of their
graduation. The approaches they used for prediction include a Bayesian classifier,
Classification and Regression Tree (CART) and k-Nearest Neighbours (k-NN). Huang
et al. [8] developed and compared different types of mathematical models for
predicting the performance of students enrolled in engineering courses. The models are
based on a high-enrolment, high-impact core course that many engineering
undergraduates are required to take. Kabakchieva [5] categorized students into three
classes: Class I, the students with the highest percentage; Class II, the students with
an average percentage who may improve with the help of extra efforts such as
assignments and class tests; and Class III, the students with a low percentage and a
probability of failure. Predictive models were implemented to maximize student
selection and retention. This exhaustive literature review pointed out a few important
factors and gave us the vision to lead towards our goal.
3 Methodology
The literature survey was carried out with the aim of extracting influencing factors
that have some impact on the performance of students. Our first step was to find the
set of influencing factors. The first set of factors was found to be constant, i.e.
demographic variables, which consist of age, gender, financial background of the
family and educational qualification of parents. The second set reflects performance
during the academic session: grades scored in the semester and assignment submission.
The third set sheds light on the social issues affecting performance: the positive or
negative impact of the friend circle and the location of residence. The last set of
factors was found to be the most influential: behavioural factors that play a vital
role in students’ performance, such as student–teacher interaction [19],
problem-solving ability, help-seeking behaviour and class attendance.
On the basis of the identified factors, a questionnaire was designed. The contents
of the questionnaire were validated by a five-member committee comprising
experts from the management, statistics and data mining fields. The questionnaire has
80 questions covering more than 306 variables. The respondents were students of
Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh. The initial sample
size was 480 students of the BCA stream. The next phase was data cleaning and
consolidation. After cleaning, the data was analyzed to extract useful information.
On the data collected from the questionnaire, we applied regression and a decision
tree to identify the most influential factors that affect students’ academic
performance [8]. Regression was implemented using SAS and the decision tree was
implemented using RapidMiner.
838 P. Kamal and S. Ahuja
4 Data
There were 80 questions in our questionnaire, and no open-ended questions were
included. All the questions had to be answered from a given set of
answers. Most of the questions had five response categories and a few were of
yes/no form. The complete questionnaire covered 306 different variables in 80 questions.
The questionnaire data was first analyzed by applying regression to the category-wise
data. Then a decision tree (ID3) was applied to the complete questionnaire data. Both
techniques identified the influencing variables affecting the academic performance of
the students in the course.
The data collected using the questionnaire is divided into four categories on
the basis of the variables, viz. Demographic, Academic, Behavioural and Social. The
following subsections present the regression analysis of the variables using SAS.
Demographic variables play a vital role in the life of a student. They include
information about the family’s financial condition, the educational background of
parents and the location of residence, i.e. rural or urban. Income of father was found
to be highly correlated (r = 0.339), followed by location of residence (r = 0.317).
These demographic variables have the highest impact on students’ academic performance.
Stream of secondary school education (r = 0.311), percentage of marks in secondary
school (r = 0.303) and percentage of marks scored in mathematics (r = 0.301) were also
correlated; however, variables like sex of the student, qualification of
parents, type of family and siblings were insignificant.
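The r values reported in this section can be illustrated with a minimal sketch, assuming that r denotes the Pearson correlation coefficient. The study itself used SAS; the function name pearson_r and the toy data below are hypothetical and are not taken from the questionnaire.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the study): an income band per student
# and the percentage scored by the same student.
income_band = [1, 2, 2, 3, 4, 5]
percentage = [52, 60, 55, 63, 70, 68]
r = pearson_r(income_band, percentage)
print(f"r = {r:.3f}")
```

A value of r close to +1 or -1 indicates a strong linear association, while values near 0, such as those listed for the insignificant variables, indicate a weak one.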
Easy availability of study material can be beneficial for students (r = 0.135);
it saves time in preparing for examinations, making notes and exploring new ideas
to score well. Interaction with teachers (r = 0.195), i.e. discussion of important and
difficult areas, was found to be correlated with students’ academic success.
Dedicating a number of hours to self-study (r = 0.215) is directly associated with
students’ academic success. Time spent on self-study helps students gain thorough
knowledge of their subject and is linked to their success. Attendance was also found
to be highly correlated: missing fewer classes (r = 0.197) gives brighter chances of
academic success. It is harmful to miss classes that are regularly attended (r = 0.176)
by fellow students. The self-opinion of students and their perception of academics is
subjective; students who are self-motivated (r = 0.275) and well aware of the pros
and cons of academic success and failure can perform better in their studies.
Variables like students’ co-curricular activities (r = 0.137), time spent with
friends (r = 0.178), distance between their college and home (r = 0.195) and time
spent on the internet (r = 0.196) were found to have an effect on their academic
success. The impact of these social variables also depends on students’ perception of
their studies, i.e. how seriously they devote time to their studies (r = 0.215). All
these factors result in good or bad performance of students (Table 1).
6 Decision Trees
Decision trees are popular tools for classification, prediction and
decision analysis. The structure of a decision tree is like a flowchart of nodes and
branches. The topmost node is referred to as the root node. To determine the root node,
we calculate which attribute most accurately classifies the objects (here students)
according to the values of the decision variable [15]. This is done on the basis of
various measures such as Gini index, accuracy, etc. The tree keeps growing
until it is no longer able to split the data. Nodes become terminal
and cannot be split further when all members of the sample belong to one class.
For example, Classification and Regression Trees, one of the most popular
algorithms, uses an index of diversity (the Gini index). ID3, in contrast, uses
entropy as the measure for assessing a potential split [20, 21].
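The two split measures mentioned above can be sketched in a few lines. This is a minimal illustration and not the implementation used in the study: the function names, the attribute names and the four toy rows are hypothetical. It shows how an ID3-style procedure would pick the root attribute as the one with the highest information gain (the reduction in entropy).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of diversity of a list of class labels (used by CART)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain (the ID3 choice for a node)."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records (NOT from the study).
rows = [
    {"interaction": "good", "attendance": "high", "result": "pass"},
    {"interaction": "good", "attendance": "low", "result": "pass"},
    {"interaction": "poor", "attendance": "high", "result": "fail"},
    {"interaction": "poor", "attendance": "low", "result": "fail"},
]
root = best_attribute(rows, ["interaction", "attendance"], "result")
print(root)
```

In this toy data the attribute "interaction" separates the two classes completely, so it has the larger gain and would be chosen as the root; a node whose labels all belong to one class has entropy 0 and becomes terminal.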
We used the SAS/Enterprise Miner software to build a decision tree using ID3. The
classification of students uses variables from all the categories, viz. demographic,
social, academic and behavioural. From most to least influential, these factors
are father’s income, attendance in class, previous year’s percentage and
teacher–student interaction in class. Some other factors, such as use of the internet
and outings with friends, had both negative and positive impact depending upon the
time spent on these activities. As shown in Table 2, the proportion of correct
predictions in the model validation phase was 78.65% for students in the high-risk
category and 60.34% for students at low risk of failure, whereas for the medium-risk
students this figure was only 48.46%. For the extreme classes, the decision tree managed
reasonably well, but the predictions for students at ‘medium risk’ were poor (Fig. 1).
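The per-category figures quoted above are the share of students in each actual risk category that the tree placed in the same category. A minimal sketch of that calculation follows; the function name and the ten labelled students are hypothetical and do not reproduce the study's data.

```python
def per_class_accuracy(actual, predicted):
    """Percentage of correctly classified cases within each actual class."""
    totals, correct = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        if a == p:
            correct[a] = correct.get(a, 0) + 1
    return {label: 100.0 * correct.get(label, 0) / totals[label] for label in totals}


# Hypothetical validation labels (NOT from the study).
actual = ["high", "high", "high", "high", "medium", "medium",
          "low", "low", "low", "low"]
predicted = ["high", "high", "high", "medium", "medium", "low",
             "low", "low", "high", "low"]
rates = per_class_accuracy(actual, predicted)
print(rates)
```

Reporting the rate per category, rather than a single overall accuracy, is what reveals that a model can do well on the extreme classes while performing poorly on the middle one.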
• If Father’s Income > 200,000 PA and Student Teacher Interaction is poor, then
performance of the student is good: NO
• If Father’s Income > 200,000 PA and Student Teacher Interaction is good, then
performance of the student is good: YES
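Rules of this form can be read directly as conditional logic. The sketch below encodes only the two rules listed above; the function name, the argument names and the None result for cases that neither rule covers are assumptions made for illustration.

```python
def predict_good_performance(father_income_pa, interaction):
    """Apply the two listed rules; return None when neither rule applies.

    father_income_pa: father's income per annum (same unit as in the rules).
    interaction: "good" or "poor" student-teacher interaction.
    """
    if father_income_pa > 200_000:
        if interaction == "good":
            return True   # rule 2: performance is good, YES
        if interaction == "poor":
            return False  # rule 1: performance is good, NO
    return None  # not covered by the two rules shown


print(predict_good_performance(250_000, "good"))
```

Both rules share the same income condition, so within that branch the outcome is decided by student–teacher interaction alone.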
education and student–teacher interaction are the most influential factors, which
otherwise could not have been included if regression had been applied alone. Our future
work will be based on these factors; our focus will be to categorize the students into
different categories based on their performance and
influential factors. We will apply more data mining methods to extract more useful
information.
References
1. Ahmed, A.B.E.D., Elaraby, I.S.: Data mining: a prediction for student’s performance using
classification method. World J. Comput. Appl. Technol. 2(2), 43–47 (2014)
2. Asif, R., Haider, N.G., Ali, S.A.: Prediction of undergraduate student’s performance using data
mining methods. Int. J. Comput. Sci. Inf. Security 14(5), 374 (2016)
3. Bharadwaj, B., Pal, S.: Mining educational data to analyze student’s performance. Int. J. Adv.
Comput. Sci. Appl. 2(6), 63–69 (2011)
4. Mishra, T., Kumar, D., Gupta, S.: Students’ performance and employability prediction through
data mining: a survey. Indian J. Sci. Technol. 10(24) (2017)
5. Kabakchieva, D.: Student performance prediction by using data mining classification
algorithms. Int. J. Comput. Sci. Manage. Res. 1(4), 686–690 (2012)
6. Atinaf, W., Petros, P.: Socio economic factors affecting female students academic performance
at higher education. Health Care: Curr. Rev. 1–3 (2016)
7. Romero, C., López, M.I., Luna, J.M., Ventura, S.: Predicting students’ final performance from
participation in on-line discussion forums. Comput. Educ. 68, 458–472 (2013)
8. Huang, S., Fang, N.: Predicting student academic performance in an engineering dynamics
course: a comparison of four types of predictive mathematical models. Comput. Educ. 61,
133–145 (2013)
9. Nasiri, M., Minaei, B.: Predicting GPA and academic dismissal in LMS using educational data
mining: A case mining. In: 6th National and 3rd International Conference of e-Learning and
e-Teaching, pp. 53–58. IEEE (2012)
10. Kumar, B., Baradwaj, S.P.: Mining educational data to analyze students performance. Int. J.
Adv. Comput. Sci. Appl. 2(6), 63–69 (2011)
11. Superby, J.F., Vandamme, J.P., Meskens, N.: Determination of factors influencing the
achievement of the first-year university students using data mining methods. In: Workshop on
Educational Data Mining, pp. 37–44 (2006)
12. Rakotomalala, R.: Graphes d’induction. Ph.D. thesis, Université Claude Bernard, Lyon (1997)
13. Wolff, A., Zdrahal, Z., Herrmannova, D., Kuzilek, J., Hlosta, M.: Developing predictive models
for early detection of at-risk students on distance learning modules. In: Machine Learning and
Learning Analytics Workshop at The 4th International Conference on Learning Analytics and
Knowledge (LAK14) (2014)
14. Tair, M.M.A., El-Halees, A.M.: Mining educational data to improve students performance:
a case study. Int. J. Inf. Commun. Technol. Res. 2(2) (2012)
15. Vandamme, J.P., Meskens, N., Superby, J.F.: Predicting academic performance by data mining
methods. Educ. Econ. 15(4), 405–419 (2007)
16. Hlosta, M., Herrmannova, D., Vachova, L., Kuzilek, J., Zdrahal, Z., Wolff, A.: Modelling
student online behaviour in a virtual learning environment. In: Machine Learning and Learning
Analytics Workshop at The 4th International Conference on Learning Analytics and Knowledge
(LAK14) (2014)
17. Xu, J., Han, Y., Marcu, D., van der Schaar, M.: Progressive prediction of student performance
in college programs. In: AAAI, pp. 1604–1610 (2017)
18. Kuzilek, J., Hlosta, M., Herrmannova, D., Zdrahal, Z., Wolff, A.: OU Analyse: analysing at-risk
students at The Open University. Learning Analytics Review, pp. 1–16 (2015)
19. Feng, M., Heffernan, N.T.: Informing teachers live about student learning: reporting in the
assistment system. Technol. Instruction Cognition Learning 3(1/2), 63 (2006)
20. Breiman, L., et al.: Classification and Regression Trees. Wadsworth International Group,
Belmont (1984)
21. Quinlan, J.R.: Discovering rules by induction from large collections of examples. In: Expert
Systems in the Micro Electronic Age (Ed.). Edinburgh University Press, Edinburgh (1979)
22. Marquez-Vera, C., Morales, C.R., Soto, S.V.: Predicting school failure and dropout by using
data mining techniques. IEEE Rev. Iberoamericana de Tecnologias del Aprendizaje 8(1), 7–14
(2013)
23. Nghe, N.T., Janecek, P., Haddawy, P.: A comparative analysis of techniques for predicting
academic performance. In: Frontiers in Education Conference-Global Engineering: Knowledge
Without Borders, Opportunities Without Passports, FIE’07, 37th Annual, pp. T2G-7. IEEE
(2007)
24. Patil, P.A., Mane, R.V.: Prediction of students performance using frequent pattern tree. In:
2014 International Conference on Computational Intelligence and Communication Networks
(CICN), pp. 1078–1082. IEEE (2014)
25. Romero, C., Ventura, S.: Educational data mining: a review of the state of the art. IEEE Trans.
Syst. Man Cybern. Part C (Appl. Rev.) 40(6), 601–618 (2010)
26. Ahuja, S.: Identification of factors influencing GER of female candidates in higher
technical education using decision trees. Res. Cell: An Int. J. Eng. Sci. 10, 49–55 (2014)
27. Shih, B., Koedinger, K.R., Scheines, R.: A response time model for bottom-out hints as worked
examples. In: Handbook of Educational Data Mining, pp. 201–212 (2011)
28. Wolff, A., Zdrahal, Z., Nikolov, A., Pantucek, M.: Improving retention: predicting at-risk
students by analysing clicking behaviour in a virtual learning environment. In: Proceedings of
the Third International Conference on Learning Analytics and Knowledge, pp. 145–149. ACM
(2013)