African Centre of Excellence in Data Science
College of Business and Economics-University of Rwanda
Predictive Model for Early Detection of Higher Education
Dropout Using Machine Learning Techniques
By NGENZI Mali Dioscord
Registration Number: 222022311
A dissertation submitted in partial fulfilment of the requirements
for the degree of Master of Data Science in Data Mining
Supervisor: Prof. Joseph Nzabanita
September, 2024
DECLARATION
I declare that this dissertation entitled “Predictive Model for Early Detection of Higher
Education Dropout Using Machine Learning Techniques” is the result of my own work and has not
been submitted for any other degree at the University of Rwanda or any other institution.
Names: NGENZI Mali Dioscord
Signature:
APPROVAL SHEET
This dissertation entitled “Predictive Model for Early Detection of Higher Education Dropout
Using Machine Learning Techniques” written and submitted by NGENZI Mali Dioscord in
partial fulfilment of the requirements for the degree of Master of Science in Data Science majoring
in Data Mining is hereby accepted and approved. The plagiarism rate measured using Turnitin is
15%, which is below the 20% threshold accepted by the African Centre of Excellence in Data Science (ACE-DS).
Prof. Joseph Nzabanita
Supervisor
Dr. Kabano H. Ignace
Head of Training
DEDICATION
I dedicate this work to my family, especially my mother and sisters, for their support, advice, and
encouragement which greatly contributed to the completion of my studies and research process.
ACKNOWLEDGMENTS
I sincerely express my deepest gratitude to the Holy and Almighty God for His constant presence
and abundant blessings in my life. His guidance, love, and grace have provided me with the
strength and ability to complete this journey successfully.
I would like to extend my heartfelt appreciation to my supervisor, Prof. Joseph Nzabanita, for
his invaluable guidance and unwavering support throughout this research. His expertise, kindness,
and encouragement have been instrumental in my growth and success. The knowledge and insights
gained under his supervision have greatly enriched my academic and professional journey.
I would like to offer my most sincere appreciation to the University of Rwanda, specifically to the
African Centre of Excellence in Data Science, for providing an exceptional learning environment
and invaluable support throughout my academic journey. The experiences and opportunities
provided have significantly contributed to my growth and development.
I also want to acknowledge the memory of my father, whose inspiring presence continues to guide
me, even though he passed away on September 15, 2011. His legacy of love and inspiration has
had a profound impact on my life and achievements.
Finally, I extend my heartfelt thanks to my family, including my mother, sisters, and extended
family, for their unwavering support and encouragement.
God bless you all!
ABSTRACT
Predicting student dropout is vital for improving retention policies in higher
education institutions. This study utilized different supervised machine learning techniques for
higher education dropout prediction and identified the key influential factors using secondary data
derived from the UCI Machine Learning Repository. The dataset comprises 4,424 records with 35
attributes representing various demographic, academic, and socio-economic factors related to
student retention. Moreover, the study identified several key determinants that influence dropout,
namely academic performance, tuition fee status, age, unemployment rate, and economic
indicators such as GDP and inflation rate. The study also applied and evaluated several machine
learning algorithms, including decision tree, random forest, logistic regression, gradient boosting
machine, XGBoost, and support vector machine with key performance metrics like AUC-ROC,
precision, accuracy, recall, and F1-score. The comparative analysis showed that XGBoost and
Random Forest outperformed the other models, each with an accuracy of 90% and AUC-ROC values
of 95.30% and 95.21%, respectively. Both also achieved a high precision of 92%, a recall of 86%, and an
F1-score of 89%, indicating that these models are exceptionally good at correctly identifying
students who will drop out. The next best-performing models were the Gradient Boosting
Machine and the Support Vector Machine, both with an accuracy of 88% and AUC-ROC values of 95% and
94%, respectively. The Logistic Regression and Decision Tree models also performed well, with
accuracies of 87% and 86% and AUC-ROC values of 94% and 92%, respectively, but were less effective
than ensemble methods such as Random Forest and XGBoost. The study highlights the
value of data-driven approaches in understanding dropout dynamics and underscores the
importance of targeted interventions based on identified key determinants. The research
recommends the implementation of the Random Forest and XGBoost models by educational
stakeholders for proactive intervention and informed decision-making. Future research can expand
this work by incorporating larger, more diverse datasets and exploring more advanced ensemble
techniques to further enhance predictive accuracy and robustness.
Keywords: Student dropout, Higher Education, Predictive model, Machine learning
TABLE OF CONTENTS
DECLARATION ....................................................................................................................................... i
APPROVAL SHEET ................................................................................................................................ ii
DEDICATION ......................................................................................................................................... iii
ACKNOWLEDGMENTS ....................................................................................................................... iv
ABSTRACT .............................................................................................................................................. v
LIST OF ABBREVIATIONS AND ACRONYMS .............................................................................. viii
LIST OF FIGURES ................................................................................................................................... x
LIST OF TABLES .................................................................................................................................... xi
CHAPTER ONE: INTRODUCTION ..................................................................................................... 1
1.1 The background of the study.................................................................................................... 1
1.2 Research problem ........................................................................................................................... 3
1.3 Research objectives ......................................................................................................................... 4
1.3.1 General objective...................................................................................................................... 4
1.3.2 Specific objectives .................................................................................................................... 4
1.4 Research questions .......................................................................................................................... 4
1.5 Scope of the study............................................................................................................................ 4
1.6 Significance of the study ................................................................................................................. 5
1.7 Research Structure ......................................................................................................................... 6
CHAPTER 2: LITERATURE REVIEW ................................................................................................ 7
2.1 Introduction..................................................................................................................................... 7
2.2 Definition of key terms ................................................................................................................... 7
2.2.1 Student dropout........................................................................................................................ 7
2.2.2 Predictive Model ...................................................................................................................... 7
2.2.3 Machine learning...................................................................................................................... 8
2.3 Factors Influencing Student Dropout ............................................................................................ 8
2.3.1 Academic factors ...................................................................................................................... 8
2.3.2 Socioeconomic factors .............................................................................................................. 9
2.3.3 Demographic factors .............................................................................................................. 10
2.4 Application of Machine Learning in Education.......................................................................... 10
2.5 Application of Machine learning for dropout prediction ........................................................... 11
2.6 Evaluating Predictive Models of Student Dropout ..................................................................... 13
CHAPTER THREE: METHODOLOGY OF RESEARCH ................................................................ 15
3.1 Introduction................................................................................................................................... 15
3.2 Study design .................................................................................................................................. 15
3.3 Source of data ................................................................................................................................ 15
3.4 Data Description............................................................................................................................ 16
3.5 Study variables .............................................................................................................................. 16
3.6 Data preprocessing ........................................................................................................................ 17
3.7 Handling Data Imbalance............................................................................................................. 18
3.8 Model Development ...................................................................................................................... 18
3.8.1 Logistic regression .................................................................................................................. 19
3.8.2 Decision Tree .......................................................................................................................... 19
3.8.3 Random Forest ....................................................................................................................... 20
3.8.4 Support Vector Machines ...................................................................................................... 21
3.8.5 XGBoost .................................................................................................................................. 22
3.8.6 Gradient Boosting Machine................................................................................................... 22
3.9 Model Evaluation .......................................................................................................................... 22
3.9.1 Confusion Matrix ................................................................................................................... 23
3.9.2 Accuracy ................................................................................................................................. 24
3.9.3 Precision .................................................................................................................................. 24
3.9.4 Recall ....................................................................................................................................... 24
3.9.5 F1 Score .................................................................................................................................. 24
3.9.6 Area Under Curve (AUC) ............................................................................................... 24
3.10 Feature Importance Analysis ................................................................................................. 25
CHAPTER 4: DATA ANALYSIS AND PRESENTATION OF RESULTS ....................................... 27
4.1 Introduction................................................................................................................................... 27
4.2 Data quality check......................................................................................................................... 27
4.3 Descriptive View of Dropout Rate for Demographic, Socioeconomic, and Academic Variables ...... 28
4.4 Identification of important predictors of Dropout risk in Higher Education ........................... 30
4.5 Machine learning model results ............................................................................................. 32
4.5.1 Logistic regression ............................................................................................... 32
4.5.2 Support Vector Machine (SVM) ........................................................................................... 33
4.5.3 Decision Tree .......................................................................................................................... 33
4.5.4 Random Forest ....................................................................................................................... 33
4.5.5 Gradient Boosting Machine................................................................................................... 33
4.5.6 XGBoost .................................................................................................................................. 33
4.5.7 Comparative analysis of Model performances ..................................................................... 34
CHAPTER 5. DISCUSSION OF THE RESULTS ............................................................................... 37
5.1 Introduction................................................................................................................................... 37
5.2 Key Findings Discussion ............................................................................................................... 37
CHAPTER 6. CONCLUSION AND RECOMMENDATIONS .......................................................... 40
6.1 Introduction................................................................................................................................... 40
6.2 Conclusion ..................................................................................................................................... 40
6.3 Recommendations ......................................................................................................................... 40
REFERENCES: ...................................................................................................................................... 42
LIST OF ABBREVIATIONS AND ACRONYMS
ACE-DS: African Center of Excellence-Data Science
ML: Machine Learning
SMOTE: Synthetic Minority Oversampling Technique
ROC: Receiver Operating Characteristic
AUC: Area Under Curve
CGPA: Cumulative Grade Point Average
XGBoost: Extreme Gradient Boosting
RF: Random Forest
UCI: University of California, Irvine
GDP: Gross Domestic Product
DT: Decision Tree
SVM: Support Vector Machine
TPR: True Positive Rate
FPR: False Positive Rate
TP: True Positive
TN: True Negative
SHAP: SHapley Additive exPlanations
TOT: Total
STD: Standard Deviation
LR: Logistic regression
LIST OF FIGURES
Figure 1. Decision Tree Illustration. ......................................................................................................... 20
Figure 2. Random Forest Algorithm (GeeksforGeeks, 2024) ..................................................................... 21
Figure 3. Confusion Matrix illustration ..................................................................................................... 23
Figure 4. Bar plot showing the imbalanced data and the data balance ...................................................... 27
Figure 5. Top 10 Feature Importance ........................................................................................................ 32
Figure 6. ROC Curves for six ML models ................................................................................................ 35
Figure 7. Confusion matrices for six employed ML classifiers ................................................................. 36
LIST OF TABLES
Table 1. Study variables ............................................................................................................................ 16
Table 2. Distribution of categorical variables ........................................................................................... 29
Table 3. Distribution of Numerical variables ............................................................................................ 29
Table 4. Feature importance results for all variables ................................................................................. 31
Table 5. Model evaluation metrics for six classifiers. .............................................................................. 34
CHAPTER ONE: INTRODUCTION
1.1 The background of the study
Student dropout is among the major challenges facing higher education institutions in most
countries (Nurmalitasari et al., 2023). This issue has attracted considerable interest from the
scholarly community, governments, and social stakeholders because of its diverse impacts on
students, their families, educational institutions, and the state (Guzmán et al.,
2021; Nurmalitasari et al., 2023). Dropout imposes financial burdens on students and their
families, reduces workforce productivity, deepens social inequality, and damages
institutional reputations (Villegas-Ch et al., 2023). Therefore, developing and subsequently
implementing dropout prevention strategies becomes very important for the success and
sustainability of higher education systems worldwide (Kim & Kim, 2018).
Dropout rates in higher education vary significantly across regions and countries. In the
United States, 33% of undergraduate students fail to finish
their degree program (Andrea, 2024). OECD (2009) data show that countries with strong
support structures and efficient policies, such as Belgium, Denmark, France, Germany, and Japan,
have generally kept their dropout rates low, typically less than 24%, compared to other countries.
While exact percentages may vary depending on the year and availability of more current statistics,
this problem is still a critical issue globally and continues to demand serious attention and remedy.
Furthermore, trends in higher education in South Africa show that 50% of students who enroll
in institutions of higher learning drop out within the first three years, while approximately 30%
drop out in their first year (Letseka & Breier, 2008). Consequently, high dropout rates in higher
education represent a great loss of resources, including taxpayer funds, and result in fewer
graduates, thereby reducing the availability of highly skilled labor. This is a critical situation in
many countries, with implications not only for financial efficiency but also for the quality and
evaluation of higher education institutions (Paura & Arhipova, 2014).
Several studies have highlighted that early identification of at-risk students is very important for
improving academic outcomes, student retention, and dropout reduction (Alyahyan & Düştegör,
2020; Nimy et al., 2023; Osborne & Lang, 2023). Recent developments in the field of data science
and ML provide several possible ways to address this challenge using predictive modeling
techniques (Baker & Siemens, 2014).
Numerous studies in developed countries have developed data-driven predictive methods
using machine learning techniques and demonstrated their accuracy in forecasting school dropout.
Machine learning therefore offers significant promise over traditional techniques. In addition,
the algorithms can analyze large datasets, uncovering complex relationships between student
characteristics, academic performance, and socioeconomic factors that might contribute to dropout
(Romero et al., 2010). Researchers have also explored several ML techniques to identify students
likely to drop out of school. Examples of such methods include decision tree algorithms, neural
networks, Naive Bayes classifiers, instance-based learning algorithms, Random Forest, Logistic
regression, and support vector machines (Kotsiantis et al., 2003).
The study conducted in Peru reinforces the significance of financial and contextual variables in
predicting university dropout rates in developing countries. Key factors identified include age,
term, and the student's financing method, which align with findings from other regions that
emphasize the role of socioeconomic status and financial support in educational persistence
(Jiménez et al., 2023). The impact of machine learning approaches on the prediction of student
dropout from any given higher education setting has been the subject of numerous studies. Three
elements identified as main drivers of the success achieved in such tasks are: identifying relevant
features that may influence dropout; choosing a proper algorithm for developing a prediction model;
and selecting the evaluation metrics used to estimate a model's performance. Addressing these
elements will be critical to improving dropout prediction accuracy and, consequently, developing
effective early intervention strategies (Oqaidi et al., 2022).
While most studies on dropout prevention have focused on developed countries, the findings and
methodologies, particularly those using machine learning techniques, can provide valuable
insights for developing countries. In fact, leveraging predictive models in developing nations can
offer early identification of at-risk students, allowing for more targeted and cost-effective
interventions. By adapting the variables and models used in developed countries, educational
institutions in developing contexts could mitigate dropout rates, even in the face of resource
limitations.
Moreover, the key determinants of dropout, such as socioeconomic status, academic performance,
and student engagement, are universal, although their impact may be more pronounced in
developing countries where inequalities are more prominent. Thus, the lessons learned from
studies in developed countries can be customized to the specific socio-economic conditions of
developing countries, making the implementation of data-driven dropout prevention strategies
more effective. This study contributes to the growing body of research by showing how machine
learning models, applied to datasets from developed countries, can be adjusted to improve dropout
predictions and interventions in developing countries.
1.2 Research problem
One of the biggest problems facing higher education institutions is student dropout. This impacts
the quality of education provided and the long-term sustainability of institutions (Tinto, 2012).
The ability to accurately predict which students are most likely to leave school would be very helpful
in the development of effective early interventions to ultimately improve overall educational
outcomes (Dake & Buabeng-Andoh, 2022). While some previous works have studied other
predictors of student dropout and adopted different statistical methods, there is a serious lack of
comparative studies on the application of machine learning algorithms for early dropout prediction,
particularly during the first year of study (Niyogisubizo et al., 2022).
To address the problem of dropout in the context of existing machine learning solutions, this study
focuses on the application of advanced algorithms such as Random Forest, Logistic Regression,
Gradient Boosting Machine, Decision Tree, XGBoost, and Support Vector Machines. These
methods have demonstrated effectiveness in analyzing large, diverse datasets to predict dropout
more accurately compared to traditional statistical approaches (Andrade-Girón et al., 2023). By
leveraging these machine learning techniques, this research seeks to provide more reliable early
identification of at-risk students, enabling institutions to implement timely and targeted
interventions.
Traditional methods of identifying at-risk students rely on manual work and limited analysis of the
available data, resulting in delayed and less effective interventions (Thayer, 2000). Such methods
typically base their identification on past data, like academic performance and students’
attendance, but fail to unearth other risk factors (Marbouti et al., 2016). By utilizing
machine learning techniques to analyze extensive datasets, we can uncover patterns that
conventional methods may have missed, leading to very promising solutions (Jordan & Mitchell,
2015; Lykourentzou et al., 2009). This study aims to enhance the early detection of at-risk students
in higher education by developing and comparing various machine learning models. By focusing
on the first year of study, the research seeks to facilitate timely and personalized interventions to
reduce dropout rates, ultimately improving educational outcomes and institutional sustainability.
1.3 Research objectives
1.3.1 General objective
The main objective of this study is to determine the most effective machine learning predictive
model for early dropout prediction in higher education.
1.3.2 Specific objectives
i. To identify the key determinants influencing higher education dropout.
ii. To determine the best machine learning model that accurately predicts higher education dropout using supervised machine learning techniques.
iii. To assess the performance of the selected machine learning model in predicting higher education dropout using key evaluation metrics, ensuring its effectiveness and reliability in practical applications.
1.4 Research questions
i. What are the key factors that significantly influence the likelihood of higher education dropout as identified by machine learning techniques?
ii. Which supervised machine learning model demonstrates the highest accuracy in predicting higher education dropout?
iii. How effective is the selected machine learning model in predicting higher education dropout based on key evaluation metrics?
1.5 Scope of the study
This study focuses on applying ML techniques to predict dropout of students in higher education
as well as providing insights and strategies that learning institutions can use to help students at risk
and implement timely interventions. The dataset used contains socioeconomic and academic
factors relevant to the prediction of dropout. The study scope concerns only higher education
institutions. Furthermore, the study limits the data to specific academic and socio-economic
variables within a specific time period. Finally, the scope narrows to predicting dropout only, not other
educational outcomes such as enrollment or graduation.
Furthermore, the study acknowledges potential limitations, namely the generalization of findings
across different educational contexts, and the representativeness of the synthetic samples generated
by SMOTE. All these aspects can affect the effectiveness of the predictive model. The study
addresses these constraints by carefully validating model performance and ensuring that
the synthetic data are representative of real-world scenarios. It also discusses how such limitations
affect the generalizability of findings to different higher education contexts.
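As an aside for the reader, the core idea behind the SMOTE technique mentioned above can be sketched in a few lines. This is only an illustrative re-implementation of SMOTE's interpolation step (the study's actual preprocessing is described in Section 3.7); the function name and parameters are chosen here for demonstration.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic minority-class samples by interpolating
    between a random minority point and one of its k nearest minority
    neighbours (the core idea behind SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from point i to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # k nearest, excluding i itself
        j = rng.choice(neighbours)
        lam = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a line segment between two real minority samples, the representativeness concern raised above amounts to asking whether such interpolated points resemble plausible real-world students.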
1.6 Significance of the study
Applying ML techniques to student dropout prediction gives higher education institutions a better way to
strengthen retention strategies. The research will help institutions identify at-risk students and offer them timely assistance, hence improving student retention rates.
Furthermore, it will assist the institution in effectively utilizing resources and designing focused
intervention programs that address specific needs. As a result, such predictions will help educators
understand some of the causes of student dropout. This will assist them in modifying their teaching
strategies and devising support systems that are most suitable for their students. Furthermore,
educators can use this information to design their engagement methods, spot students in need of
extra help, or adjust instructional strategies to best foster overall student performance and
satisfaction.
Moreover, economic and social advancements are among the many benefits of lowering the
national dropout rate. With more graduates, the nation will improve its workforce's skill level and
productivity. Additionally, this will reduce social inequality since more students from various
backgrounds can complete their education and pursue better career prospects. With evidence-based
strategies, the findings can help support national educational goals by improving student outcomes
and institutional effectiveness.
1.7 Research Structure
This study consists of six chapters. The first chapter is the introduction; it discusses the
background of the study, states the research problem, outlines the research objectives, introduces
the research questions, and justifies the importance of the study. The second chapter is a
literature review that focuses on previous research on predicting students at risk of dropping out;
it also includes a theoretical review, highlights a critical research gap, and presents a conceptual
framework. The third chapter describes the methods used in the study: sample selection, study
design, the source of data, data collection procedures, model specification, the variables selected
for analysis, the predictive models, and the evaluation of their performance. The fourth chapter
presents and analyses the results from the machine learning models, the fifth chapter discusses
the study findings, and the last chapter provides the conclusion of the study and some
recommendations.
CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
The early detection of Higher Education dropout is critical for implementing timely interventions.
This literature review explores the definitions of key terms, key factors influencing dropout rates,
applications of machine learning techniques for dropout prediction, and the evaluation of predictive
models for accuracy and reliability.
2.2 Definition of key terms
2.2.1 Student dropout
Student dropout refers to a student who terminates their education before the completion of the
academic program in which they are enrolled. In the university context, dropout refers to the
act of a student discontinuing their education while still officially enrolled in a higher education
institution (SYDLE, 2024). Dropout rates are a crucial measure of how well an educational system
is doing and have important consequences for students' future socioeconomic status, including
their job prospects, prospective income, and overall well-being. The causes of dropout are complex
and involve various aspects such as socioeconomic status, academic performance, and school
engagement (R. Rumberger, 2011). It also has financial and societal impacts on both the dropouts
themselves and the country in general (R. Rumberger, 2020).
2.2.2 Predictive Model
Traditionally, humans analyze data manually. However, human capacity is limited, making large-scale
data analysis difficult. As a result, automated systems have been developed that gain knowledge
from data and its fluctuations to adjust to the ever-changing data environment. Machine learning employs
statistical algorithms to acquire knowledge from data samples and involves the development of
statistical programs referred to as models (Lamba & Madhusudhan, 2022). Predictive modeling,
as defined by Gartner, is a widely employed statistical technique for predicting future behavior. Its
solutions employ data-mining technologies to examine both current and past data, empowering the
creation of a good model for future outcomes prediction (Mishra et al., 2023).
2.2.3 Machine learning
Machine Learning is defined as the part of artificial intelligence dealing with the creation of models
that can conduct a predefined task on data, making predictions or decisions by learning from data
with no explicit programming. In traditional programming, humans write rules and instructions,
but in ML, systems automatically identify and analyze patterns in data and make predictions. These
systems adjust their behavior depending on how accurate their predictions turn out to be. As they
process more data, they improve their results over time without human
involvement (Saeed et al., 2024). Machine Learning uses various algorithms to develop, describe,
and predict outcomes iteratively using data. As algorithms absorb training data, more accurate
models can be produced based on that data (Kirsch & Hurwitz, 2018).
2.3 Factors Influencing Student Dropout
Factors that affect higher education dropout need to be understood so that appropriate interventions
and policy formulations can be developed toward improving student retention. Dropout rates not
only reflect the problems of individual students; they also impact the institutional reputation and
national educational outcomes. This review aims to identify and discuss various factors
contributing to student dropout in higher education by going through existing literature and
research findings.
2.3.1 Academic factors
Student retention is highly correlated with academic performance. Students with poor academic
performance or grades are much more likely to drop out of school. Increased academic difficulties,
such as rigorous coursework and a high course load, can lead to higher dropout rates because they
push students beyond their limits (Stinebrickner & Stinebrickner, 2014). A study by Nurmalitasari
et al., (2023) identifies factors such as academic performance including CGPA, and interest in the
study program as crucial determinants of student success. In this context, academic ability as
measured through the CGPA is very important. Low academic ability typically leads to students
who frequently fail to follow lessons or are unable to complete their thesis. As a result, a low
level of academic performance and success is associated with a high probability of dropping out
(Araque et al., 2009).
2.3.2 Socioeconomic factors
Socioeconomic status is an important factor influencing students to drop out of higher education.
Inadequate finances to cover tuition fees and other related expenses are the primary reasons why
students drop out of school prematurely (Powdthavee & Vignoles, 2008). Furthermore, students
who work long hours to support themselves might find it difficult to balance work and study, which
would increase the dropout rate (Magolda & Astin, 1993). Similarly, Callender (1999) indicates
that part-time work taken to fund education mostly impacts academic performance negatively, since
students spend many hours working and have little time available for studying and other
academic activities. Other socio-economic disadvantages further increase stress and reduce the
time available for academic work; therefore, the risk of dropout increases.
The financial status of a family significantly influences school dropout rates. Students from low-income families often face challenges that impede their ability to continue their education, such as
the need to work to support their families or the inability to afford school-related expenses.
Financial problems increase the likelihood of students from low-income families dropping out (R.
W. Rumberger & Lim, 2008). Socioeconomic status, including income, education, and financial
security, significantly impacts life outcomes and academic achievement (Morgan et al., 2009).
Aina et al., (2022) show that students from families with low financial status are more likely to
leave school prematurely than those from the highest-income families, because they spend much of
their time contributing to the family income.
Family background and parental education have a significant impact on dropout rates. In most
cases, students from families with both parents alive have lower dropout rates and are more likely
to finish their studies than those without both parents. Other family issues, such as illness,
deaths, adults entering and leaving the household, and marital disruptions, contribute further to
the dropout rate (R. W. Rumberger & Lim, 2008).
The educational level of the parents also has a significant impact on the dropout rate.
Research shows that parents who have a higher educational level are more involved in their
children's educational achievements by spending time with them and supporting them. As a result,
their children are more likely to stay at school until they complete their educational level.
Additionally, parents significantly impact students' performance by instilling values, aspirations,
and the necessary motivation for success and continuous school attendance (Smelser & Baltes,
2001).
2.3.3 Demographic factors
Demographic variables, specifically age and gender, are significant contributors to dropouts in
higher education. Older students commonly have more difficulty balancing the academic workload
with their personal lives than their younger peers, since they are more likely to have full-time
jobs and families competing for their time and energy (Casanova et al., 2023). Moreover, older
students might find it harder to readjust to academic life after being away from formal education
for a period of their lives, which in turn leads them to drop out (Bean & Metzner, 1985).
A study conducted at the Polytechnic Institute of Portalegre in Portugal during 2018-2019 shows
that male students leave school prematurely more often than female students, whereas female
students, especially in STEM, Design, and Multimedia courses, exhibit lower dropout rates. In
addition, factors such as marital status and debt also correlate with increased dropout rates,
highlighting the complex interplay of socioeconomic and demographic factors in student retention
(Lugyi, 2024).
2.4 Application of Machine Learning in Education
The application of ML in education transforms traditional teaching methodologies by facilitating
personalized learning experiences, from real-time feedback on every student's behavior to
adaptation to individual factors. This technology also changes assessment for the better; not only
does it reduce biases, but it also uncovers hidden insights for better learning outcomes and more
effective teaching strategies (Jagwani & Aloysius, 2019). Machine learning also improves
educational institutions by transforming the learning process and providing tools to monitor
students' achievement and engagement, making education more equitable. For several years, machine
learning was a narrow area of artificial intelligence, explored mainly within academic and
research circles. However, it has developed into a powerful tool with wide-ranging
applications across diverse fields, including education (Jordan & Mitchell, 2015).
2.5 Application of Machine learning for dropout prediction
In education, machine learning has been used for dropout prediction and prevention for several
decades, driven by its potential to improve educational outcomes and bring effective support to
at-risk students (Larusson & White, 2014). Machine learning effectively predicts student learning
outcomes, enabling early intervention to reduce dropout rates. These systems apply different
algorithms to predict student performance early in the academic journey with high accuracy,
sensitivity, and specificity. Among the most important tasks are automatic exam score collection
and data analysis to identify unobservable variables such as prior knowledge, talent, and
diligence. These data are used to build predictive models of learning outcomes, making it easy to
spot any student who might be falling behind or otherwise at risk of dropping out, so that support
can be targeted specifically at this group (Asthana & Hazela, 2020).
Machine learning personalizes the learning experience by tailoring educational content to the
needs of each student. Such personalization ensures that every student faces the right amount of
challenge and support, keeping them on track and reducing the risk of dropping out
(Shaun et al., 2014). Additionally, machine learning-based early warning systems can detect
students who are at risk of dropping out of school based on variables such as attendance, grades,
and behavior and take appropriate measures (Bowers et al., 2013).
Large amounts of unprocessed raw data could be analyzed using advanced methods to derive
insightful information helpful in predicting student performance. The difficulties and learning
trends of a student could be identified with the aid of machine learning models, highlighting the
areas in need of improvement. This enables the elaboration of personalized strategies to improve
student results. Moreover, educators can use such models to understand the level of comprehension
achieved by students and then adjust their teaching to meet the needs of the different learners
(Kharb & Singh, 2021).
Several studies in educational data mining have utilized varied machine learning approaches in
predicting student dropout status. These include Support Vector Machine, Naive Bayes,
association rule mining, logistic regression, artificial neural networks, and decision tree (Kumar et
al., 2017). These algorithms have been successfully applied in different learning contexts to
optimize the process and outputs of learning.
Baker & Siemens, (2014) describe this field of study, in which algorithms like decision trees,
neural networks, and clustering methods are used to investigate the large amounts of data created in educational
settings. The results showed that the decision trees make the process of identifying student
performance and behavioral patterns easier by pointing out at-risk students. Neural networks
model complex patterns to predict student success and customize learning experiences. The
grouping that clustering methods provide supports the development of targeted interventions.
Generally, these techniques aim to deepen understanding of students' learning processes, thereby
enhancing their efficacy and ultimately supporting improved educational outcomes (Baker
& Inventado, 2014).
Romero et al., (2010) further demonstrate the influence of data mining and machine learning in the
educational context. Researchers have successfully used different methodologies to address the
problem of student dropout, including matrix factorization and deep neural networks for predicting
which students will drop out. Advanced statistical frameworks, probabilistic graphical models, and
survival analysis make it possible to deal with the complexity and variability of educational
data. Consequently, one gains a better understanding of the variables that predict attrition and,
as a result, can develop proactive strategies to support at-risk students. This predictive ability
is beneficial not only for timely interventions but also for personalizing learning to the needs
of each learner, providing a more supportive and effective learning environment (Romero et al.,
2010).
Various machine learning methodologies, including Cox Regression, Logistic Regression, and
Random Forest, were employed to detect students in the United States who are unlikely to complete
their education within the expected timeframe. The models were trained on data from a school
district, and feature relevance was measured in terms of information gain, Gini impurity, stepwise
regression, and single-feature performance. Results show that the implemented Random Forest model
surpasses the other ML approaches (Aguiar et al., 2015). A study by Sara et al., (2015) examined
the problem of student withdrawal in Danish high schools using high school datasets and built
models with Support Vector Machines, Random Forests, and Naive Bayes. Performance was measured by
accuracy and the Area Under the ROC Curve (AUC). The Random Forest model performed best, with the
highest accuracy. However, the authors did not address data imbalance, which is a crucial aspect
of improving model performance and predictability.
Moreover, research conducted by Kotsiantis et al., (2003) approached the prediction of school
dropout using machine learning models. Six distinct ML classifiers, namely Support Vector Machine,
Logistic Regression, Artificial Neural Networks, Naive Bayes, Decision Tree, and K-Nearest
Neighbor, were applied in the study, and their performances were evaluated using the accuracy and
F1-score metrics. The best performing model was Random Forest. Kabathova & Drlik, (2021) likewise
used machine learning methods to predict student dropout. Based on data collected over four
academic years, several machine learning algorithms, namely Logistic Regression, Random Forest,
Support Vector Machines, and Naive Bayes, were applied and compared. These algorithms search for
patterns in student data describing activities and accomplishments that distinguish course
completers from non-completers, enabling the model to predict whether a student is at risk. The
study found Random Forest to be the best of the models compared. Even though there are many ML
approaches for model development, there is no single best model for predicting dropouts; the
candidates range from logistic regression, decision trees, and naive Bayes to support vector
machines, random forests, and neural networks. Whether one algorithm outperforms another depends
greatly on the quality and features of the data as well as the context in which it is used (Romero
et al., 2010; Shaun et al., 2014). Furthermore, knowledge of the data is required to choose the
proper algorithm, because some techniques work well with small samples while others require large
amounts of data to provide good results.
2.6 Evaluating Predictive Models of Student Dropout
One of the major aspects of evaluating a predictive model is choosing relevant performance
metrics. Metrics including accuracy, precision, recall, F1 score, and AUC-ROC are frequently
utilized (Sokolova & Lapalme, 2009). Accuracy refers to the proportion of correctly classified
instances, whereas precision and recall measure model performance with respect to true positives
and false negatives, respectively. The F1 score is the harmonic mean of precision and recall,
which is important in highly imbalanced datasets where dropout cases may be infrequent (Manning et
al., 2008). In a related study, the authors also used accuracy, precision, recall, F1 score, and
the AUC-ROC curve to evaluate predictive models; they chose the F1 score and AUC specifically
because of the imbalanced class ratio in their experimental data (Lee et al., 2021). These
measures are valuable, but they can sometimes be misleading, mostly for datasets with imbalanced
classes. Therefore, a set of metrics should be used to evaluate model performance without
potential biases and present a realistic picture of predictive capabilities.
Furthermore, the selection of appropriate evaluation metrics should align with the specific goals
and context of the predictive model. For instance, in educational settings, the cost of
misclassification must be carefully considered. False negatives, where at-risk students are not
identified, can have more severe consequences compared to false positives, where students are
incorrectly flagged as at-risk (Kotsiantis et al., 2003). Therefore, precision and recall become
crucial metrics, particularly in imbalanced datasets where the dropout rate is low. Additionally,
metrics such as Cohen's Kappa and Matthews Correlation Coefficient (MCC) offer insights into
the model's performance beyond simple accuracy, accounting for the possibility of random chance
in predictions (Chicco & Jurman, 2020). These metrics provide a more nuanced evaluation,
helping researchers and practitioners to achieve a better understanding of the strengths and
limitations of their predictive models in the context of early dropout detection.
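The metrics discussed above can be computed side by side. The sketch below uses scikit-learn on
illustrative labels only (not the study data), with 1 denoting dropout:

```python
# Sketch: complementary evaluation metrics for a dropout classifier.
# y_true / y_pred are illustrative labels only, with 1 = dropout.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, matthews_corrcoef)

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall":    recall_score(y_true, y_pred),     # TP / (TP + FN)
    "f1":        f1_score(y_true, y_pred),
    "kappa":     cohen_kappa_score(y_true, y_pred),
    "mcc":       matthews_corrcoef(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data such as dropout labels, accuracy alone can look high while recall on the
minority class stays poor, which is why several metrics are reported together.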
CHAPTER THREE: METHODOLOGY OF RESEARCH
3.1 Introduction
The methodology chapter provides a clear description of the steps taken in conducting the study.
This chapter outlines the procedures involved in sample selection, study design, data sources, data
collection procedures, model specification, and variable selection for analysis. This section
provides a comprehensive summary of all the methods employed to effectively address both
general and specific objectives using well-structured and scientifically sound procedures.
3.2 Study design
This research utilizes a quantitative approach and machine learning methods to create a model that
predicts student dropout. The main goal is to detect students who are at risk of leaving their studies
early, which enables the institution to take proactive measures and offer timely interventions.
Quantitative research design is appropriate for this study as it allows for the statistical analysis of
numerical data and the development of predictive models based on historical data. Machine
learning techniques, such as random forest, decision tree, logistic regression, support vector
machine, XGBoost, and gradient boosting machines, are utilized to analyze the data and build
predictive models. These models help in understanding the patterns and factors contributing to
student dropout and academic success (Creswell & Creswell, 2017).
3.3 Source of data
The present work utilizes a dataset from a higher education institution in Portugal provided by
Martins et al. (2021). This dataset consists of 4424 records and 35 variables and was created
within the SATDAP program, financed through grant POCI-05-5762-FSE-000191, whose main objective
was to address and reduce academic failure and dropout rates in higher education.
dataset is characterized by a broad diversity of variables, including academic path, demographics,
and socioeconomic background available during the student enrollment period. It also includes
students' academic performances at the end of the first and second semesters. Additionally, the
dataset aligns with the standards and rigor expected from datasets housed in the UCI Machine
Learning Repository, a renowned collection used globally by students, educators, and researchers
for the empirical analysis of machine learning algorithms. This repository, established in 1987,
ensures data accuracy and reliability, further enhancing the credibility of our study. More details
about
the
dataset
can
be
found
(https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success).
15
at
3.4 Data Description
This study uses a dataset that includes variables that could be useful in predicting cases of
university dropout. The dataset is in CSV format and encompasses student-related variables
including marital status, nationality, prior qualifications, and information about the parents'
qualifications and occupations. It also contains variables of academic performance, such as the
number of curricular units credited, enrolled, evaluated, and approved, with grades in both the first
and second semesters. Other socio-economic variables included in the dataset are the
unemployment rate, inflation rate, and gross domestic product. This is a fully comprehensive
dataset with continuous variables such as grades, unemployment rate, and GDP, among others, as
well as categorical variables like marital status and nationality. More importantly, there is no
missing information for any variable of interest; hence, the data are complete and reliable for any
subsequent analysis.
3.5 Study variables
In this study, variables were selected based on an extensive literature review and a conceptual
framework that would allow for an understanding of the factors influencing dropouts among
students in higher education. Individual characteristics, socioeconomic background, demographic
factors, and academic performance are major contributors to dropout and will attract special
attention. The dependent variable is the dropout status of the students which is in two categories
or classes: 1 for dropouts and 0 for non-dropouts (which include students who either remain
enrolled or graduated). This binary classification will help explain clearly the factors that
differentiate those who drop out from those who continue their education.
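As a minimal sketch of this binary encoding, assuming the target column name and label strings
follow the UCI dataset's conventions (a toy frame stands in for the full CSV, which would be
loaded with pd.read_csv):

```python
# Sketch: collapsing the three-class target into the binary dropout label
# (1 = dropout, 0 = enrolled or graduated). The column name "Target" and the
# label strings follow the UCI dataset's conventions; a toy frame stands in
# for the full CSV.
import pandas as pd

df = pd.DataFrame({"Target": ["Dropout", "Graduate", "Enrolled", "Dropout"]})
df["dropout"] = (df["Target"] == "Dropout").astype(int)
print(df["dropout"].tolist())  # → [1, 0, 0, 1]
```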
Table 1. Study variables

Class of Variables    Variable Name                                 Variable Type
Demographic data      Marital_status                                Numerical/discrete
                      Nationality                                   Numerical/discrete
                      Displacement_status                           Numerical/binary
                      Gender                                        Numerical/binary
                      Enrollment_Age                                Numerical/discrete
                      International_status                          Numerical/binary
Socioeconomic data    Mother_Qualification                          Numerical/discrete
                      Father_Qualification                          Numerical/discrete
                      Mother_Occupation                             Numerical/discrete
                      Father_Occupation                             Numerical/discrete
                      Educational_special_needs                     Numerical/binary
                      Debtor_status                                 Numerical/binary
                      Tuition_fees_up_to_date                       Numerical/binary
                      Scholarship_holder                            Numerical/binary
                      Unemployment_rate                             Numerical/continuous
                      Inflation_rate                                Numerical/continuous
                      GDP                                           Numerical/continuous
Academic data         Application_mode                              Numerical/discrete
                      Application_order                             Numerical/ordinal
                      CourseName                                    Numerical/discrete
                      Day/evening_attendance                        Numerical/binary
                      Previous_Qualification                        Numerical/discrete
                      firstSemCurricularUnits_Credited              Numerical/discrete
                      firstSemCurricularUnits_enrolled              Numerical/discrete
                      firstSemCurricularUnits_evaluations           Numerical/discrete
                      firstSemCurricularUnits_approved              Numerical/discrete
                      firstSemCurricularUnits_grade                 Numerical/continuous
                      firstSemCurricularUnits_without_evaluations   Numerical/discrete
                      secondSemCurricularUnits_credited             Numerical/discrete
                      secondSemCurricularUnits_enrolled             Numerical/discrete
                      secondSemCurricularUnits_evaluations          Numerical/discrete
                      secondSemCurricularUnits_approved             Numerical/discrete
                      secondSemCurricularUnits_grade                Numerical/continuous
                      secondSemCurricularUnits_without_evaluations  Numerical/discrete
Target                Target                                        Categorical
3.6 Data preprocessing
The first step in the machine learning pipeline is data preprocessing, which ensures the data is
reliable and of high quality before model training and evaluation. This process includes
addressing data issues such as missing values and outliers and normalizing or standardizing
features. The dataset used in this research does not require additional preprocessing, as rigorous
preprocessing to handle anomalies, outliers, and missing values had already been completed,
leaving a clean dataset ready for analysis. Handling missing data is important because, if not
done properly, it leads to biased data and poor model performance (García et al., 2015). Proper
data preprocessing is fundamental in building robust and generalizable predictive models (Sammut &
Webb, 2011).
3.7 Handling Data Imbalance
Class balancing is an important step in the development of an effective predictive model in which
at-risk students are correctly identified. Class imbalance occurs when some classes are radically
underrepresented compared with others, which can lead to biased model performance and low
predictive accuracy on the minority class (Chawla et al., 2002; Mduma, 2023). To tackle the class
imbalance, we applied SMOTE (Synthetic Minority Over-sampling Technique), a widely recognized
method for balancing datasets. SMOTE works by generating synthetic examples for the minority
class, increasing the representation of that class and allowing the model to learn from a more
balanced dataset (Chawla et al., 2002). This technique is especially useful in classification
problems where the target variable has an imbalanced class distribution.
While SMOTE significantly improves the balance of the dataset, one should keep in mind that, if
handled incorrectly, synthetic data may introduce noise or cause overfitting. Making sure that the
synthetically generated samples are representative of real-world scenarios is critical to avoid
misleading model training. This can be achieved by choosing significant features that accurately
depict real-world scenarios. We also apply cross-validation and performance metrics to evaluate
the impact of SMOTE on the generalization ability of the model (Saad Hussein et al., 2019).
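The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. In practice the
imbalanced-learn implementation (imblearn.over_sampling.SMOTE) would be used; the points below are
toy minority-class samples, not study data:

```python
# Toy sketch of SMOTE's core idea: each synthetic minority sample is an
# interpolation between a minority point and one of its k nearest
# minority-class neighbours. Production code would use imblearn's SMOTE.
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=3):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all rows
        neighbours = np.argsort(dists)[1:k + 1]           # k nearest, excluding self
        j = rng.choice(neighbours)
        lam = rng.random()                                # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]])
X_new = smote_sketch(X_minority, n_new=4)
print(X_new.shape)  # (4, 2): four synthetic minority samples
```

Because each synthetic point lies on a segment between two real minority samples, the new points
stay inside the region the minority class already occupies.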
3.8 Model Development
A predictive model, whether it relies on statistical methods or machine learning, is crafted to
estimate future events or outcomes by examining past data. This model is developed using a dataset
that contains various input variables or features, which are utilized to forecast a specific target
variable. The main objective of such a model is to identify patterns and correlations in the data that
allow for precise predictions about upcoming events (Bishop & Nasrabadi, 2006).
In the context of this study on predicting higher education dropout among students, various
classification modeling techniques will be utilized. These techniques include decision tree,
random forest, support vector machine, gradient boosting machine, XGBoost, and logistic
regression. By employing these methods, a predictive model will be developed to estimate the
likelihood of student dropout in higher education settings. This approach will help in the
identification of dropout predictors, thereby facilitating timely interventions to reduce dropout
rates (Andrade-Girón et al., 2023).
The selection of these models is based on their effectiveness in handling complex data structures
and their capability to reveal significant patterns associated with dropout rates. Each technique
offers unique advantages: Decision Trees provide interpretability (Gilmore et al., 2021), Random
Forest enhances accuracy through ensemble methods (Breiman, 2001), Support Vector Machine
is adept in high-dimensional spaces (Cortes & Vapnik, 1995), and Gradient Boosting and XGBoost
enhance prediction by iteratively improving upon previous models' errors. XGBoost further
optimizes this process with advanced techniques like regularization and parallelization for better
accuracy and efficiency (Ramraj et al., 2016). Logistic Regression, meanwhile, offers a clear
probabilistic framework. Together, these models form a robust approach to accurately predicting
student dropout and informing targeted interventions.
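A comparison of this kind can be sketched as below, on synthetic data standing in for the student
records. XGBoost is omitted here so the snippet runs without the xgboost package; its XGBClassifier
would plug into the same loop:

```python
# Sketch: fitting and comparing candidate classifiers on synthetic,
# imbalanced data (a stand-in for the dropout dataset).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=600, n_features=10, weights=[0.7, 0.3],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "svm": SVC(),
}
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: F1 = {s:.3f}")
```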
3.8.1 Logistic regression
Logistic Regression is one of the most widely used classification techniques for predicting the
probability of a binary outcome, such as student dropout. In higher education dropout prediction,
Logistic Regression models a dependent variable, such as dropout status (yes/no), based on one or
more independent variables, for example academic performance, demographic factors, and
socioeconomic status (Hosmer Jr et al., 2013). It takes these features as input and feeds them
into a logistic function to predict the probability of a student's dropout.
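As a minimal sketch on synthetic data, the probability output can be thresholded to flag at-risk
students (the 0.5 threshold here is illustrative and would be tuned in practice):

```python
# Sketch: logistic regression returning a dropout probability rather than a
# hard label, so a risk threshold can be tuned. Data here are synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=6, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# probability of class 1 ("dropout") for the first five students
probs = clf.predict_proba(X[:5])[:, 1]
at_risk = probs >= 0.5  # flag students above the chosen threshold
print(probs.round(3), at_risk)
```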
3.8.2 Decision Tree
Decision tree is a Machine Learning model applied to both classification and regression problems.
The basic form of the data and decision process is represented in a tree-like structure. In the tree,
internal nodes express decisions about certain features. Their branches define the results of these
decisions. The leaf nodes represent the final stage of the prediction or classification process.
Decision trees are intuitive and straightforward to interpret. They work with both categorical and
numerical data by recursively dividing a dataset into subsets according to feature values,
attempting to create groups that are as homogeneous as possible for accurate prediction
(Mitchell & Mitchell, 1997).
Furthermore, the decision tree is one of the most used ML models for predicting school dropout. It
extracts patterns from large amounts of educational data using data mining techniques and works by
creating a hierarchical tree-like model in which the internal nodes represent decisions about
specific student attributes, the branches represent the outcomes, and the leaf nodes provide the
final prediction of the likelihood of a student persisting in or dropping out of education.
Designed to deal with both categorical and numerical data, decision trees are particularly well
suited to this task, and their intuitive structure facilitates educators' interpretation and
subsequent actions. Using decision trees, researchers can classify students based on their
attributes and forecast dropout risks with a view to intervening early to improve student
retention rates (Mariano et al., 2022).
Figure 1. Decision Tree Illustration.
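A small sketch on synthetic data illustrates this interpretability: export_text prints the learned
splits as readable rules (the features here are synthetic, not real student attributes):

```python
# Sketch: a shallow, readable decision tree; max_depth=2 keeps the rule set
# small enough to inspect. Synthetic data stand in for student records.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=7)
tree = DecisionTreeClassifier(max_depth=2, random_state=7).fit(X, y)
print(export_text(tree))  # one readable rule per internal node and leaf
```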
3.8.3 Random Forest
Random Forest is an ensemble learning technique that can be applied to both classification and
regression tasks. The model works on the principle of creating several decision trees during
training. For classification, it outputs the class most voted for by the trees, whereas for
regression it outputs the average of the trees' predictions. The algorithm employs bootstrap
aggregation, or bagging, in which each tree is trained on a random subset of the data sampled with
replacement, so the same data points can be selected repeatedly. To reduce variance and prevent
overfitting, a random subset of features is considered at each split. Random forests achieve good
accuracy, are resistant to overfitting, and can handle big datasets with high dimensionality
(Hastie et al., 2001). Their effectiveness stems from the collection of decorrelated trees, which
often yields superior prediction accuracy compared to an individual decision tree.
Breiman, (2001) revealed that the effectiveness of RF in handling complex datasets has led to its
application in dropout prediction tasks. RF's robustness to outliers and noise also fits well with
educational data, which often has a lot of variability. It often outperforms many other machine
learning models, such as XGBoost and SVM, providing high accuracy and reliability in prediction.
Gini impurity helps RF reduce overfitting and bias; as a result, the model becomes very accurate,
even on noisy data. To predict dropouts, we train RF on student data to identify the most
important features for prediction, which enables us to target interventions based on these
variables (Dass et al., 2021).
Figure 2. Random Forest Algorithm (GeeksforGeeks, 2024)
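A minimal sketch of this feature-ranking use of Random Forest, on synthetic stand-in data:

```python
# Sketch: random forest with feature importances, the mechanism described
# above for targeting interventions at the most predictive variables.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=3)
rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)

# importances sum to 1; sort indices from most to least important
ranked = np.argsort(rf.feature_importances_)[::-1]
print("features ranked by importance:", ranked)
```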
3.8.4 Support Vector Machines
The support vector machine (SVM) is a highly effective supervised learning algorithm, mostly used
for classification tasks but also applicable to regression. The essential idea behind SVM is to find
the optimal separation between data points of different classes with the maximum possible
margin. This hyperplane is defined by the support vectors, which are the data points
closest to the decision boundary. SVM handles problems in both linear and nonlinear classification
using different kernel functions. Among them are linear, polynomial, and radial basis function
(RBF) kernels, which project input data onto higher-dimensional space for a possible linear
separation (Cortes & Vapnik, 1995). One of the more important strengths of SVMs is that they
work very well in high-dimensional spaces and have a certain resistance to overfitting, especially
when the number of features exceeds the number of samples. As a result, SVMs have been very
promising for dropout prediction, especially because they can model complex decision
boundaries and are less sensitive to noisy data. SVMs identify the optimal separating hyperplane that
differentiates students at risk of dropout from others based on various academic and socioeconomic
features (Del Bonifro et al., 2020).
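As an illustration of the maximum-margin idea, the sketch below learns a separating line on hypothetical 2D data. A simple margin-perceptron update stands in for the quadratic programming of a real SVM, and only a linear kernel is shown; the feature names are assumptions for the example.

```python
# Illustrative sketch of the maximum-margin idea on toy 2D data: find a
# line w·x + b with functional margin y(w·x + b) >= 1 for every point.
# A margin-perceptron update replaces the quadratic programming used by
# a real SVM, and no kernel trick is applied.

def train(points, labels, epochs=500):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):      # y is +1 or -1
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:  # inside the margin
                w[0] += y * x1                       # push boundary away
                w[1] += y * x2
                b += y
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Hypothetical features (units approved, average grade);
# +1 = non-dropout, -1 = dropout.
pts = [(6, 14), (7, 13), (5, 12), (1, 7), (2, 6), (0, 5)]
ys = [1, 1, 1, -1, -1, -1]
w, b = train(pts, ys)
print([predict(w, b, p) for p in pts])  # → [1, 1, 1, -1, -1, -1]
```

Because the toy data are linearly separable, the updates stop once every point sits outside the margin; nonlinear boundaries would require the kernel functions discussed above.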
3.8.5 XGBoost
XGBoost (Extreme Gradient Boosting) is a high-performance implementation of the gradient
boosting algorithm, designed to be powerful, flexible, portable, and highly efficient. It can be
applied to both classification and regression problems and is particularly well suited to structured
or tabular data. It also introduces several improvements over standard gradient boosting, such as
regularization to prevent overfitting, better handling of sparse data and missing values, and a
weighted quantile sketch algorithm that makes it faster, more accurate, and more stable on large
datasets than conventional gradient-boosting methods (Chen & Guestrin, 2016).
3.8.6 Gradient Boosting Machine
The gradient boosting machine is a technique used in ensemble learning for both classification and
regression tasks. The model is constructed incrementally using a series of weak learners, usually
decision trees, in a step-by-step manner. During each stage of this procedure, a new tree is trained
to rectify faults made by the prior trees. This is done in order to minimize a certain loss function
via gradient descent. This iterative process thus focuses on decreasing bias and increasing the
accuracy of the models (Hastie et al., 2001).
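The stage-wise residual-fitting procedure described above can be sketched for regression with two-leaf stumps as the weak learners. The data are toy values and squared loss is assumed; real implementations grow deeper trees and support other losses.

```python
# Toy sketch of gradient boosting for regression: at each stage a two-leaf
# "stump" is fitted to the current residuals (for squared loss, the
# negative gradient) and added to the ensemble with a learning rate.

def fit_stump(xs, residuals):
    """Choose the threshold whose two leaf means best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_stages=200, lr=0.3):
    """Return the ensemble's predictions on the training points."""
    pred = [0.0] * len(ys)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + lr * stump(x) for x, p in zip(xs, pred)]
    return pred

# Hypothetical: a single feature (e.g. a semester grade) and a
# continuous target; each stage shrinks the remaining error.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 2.5, 3.0, 8.0, 8.5, 9.0]
pred = gradient_boost(xs, ys)
print([round(p, 1) for p in pred])  # close to ys after enough stages
```

The learning rate shrinks each stump's contribution, which is the bias-reduction mechanism the section describes: later stages keep correcting what earlier stages left unexplained.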
3.9 Model Evaluation
It is important to note that once the learning algorithm has been developed using the training set,
it is critical to evaluate how effective the model classifier is. In machine learning, the classifier is
evaluated on the test data using classification performance metrics. It is common practice to use
the confusion matrix for this evaluation. The confusion matrix is a cross-tabulation that
summarizes how well the classifier predicts samples of the target classes. Furthermore, from the
confusion matrix, a considerable number of classification metrics, namely sensitivity, accuracy,
specificity, recall, precision, F1 score, FPR, and TPR, are extracted and used to identify the best
predicting model. Another important and widely used metric is the Receiver Operating
Characteristic (ROC) curve, which graphically depicts the relationship between the true positive
rate and the false positive rate (TPR vs. FPR).
3.9.1 Confusion Matrix
A tabular representation known as a confusion matrix offers a succinct overview of a machine
learning model's performance on a particular testing dataset. It is a technique for graphically
displaying the proportion of accurate and inaccurate occurrences based on the model's predictions.
It is frequently used to assess how well classification models work, which aim to predict a category
label for every input event (GeeksforGeeks, 2024).
Figure 3. Confusion Matrix illustration
- A True Positive (TP) occurs when the model predicts a positive outcome and the actual
outcome is indeed positive.
- A True Negative (TN) occurs when the model predicts a negative outcome and the actual
outcome is indeed negative.
- A False Positive (FP) occurs when the model wrongly predicts a positive outcome when the
actual outcome was negative. It is also referred to as a Type I error.
- A False Negative (FN) occurs when the model wrongly predicts a negative outcome when the
actual outcome was positive. It is also referred to as a Type II error.
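These four cells can be counted directly from paired actual and predicted labels, as in this small sketch (toy labels, with 1 = dropout and 0 = non-dropout):

```python
# Count the four confusion-matrix cells from paired labels
# (1 = dropout, 0 = non-dropout; the labels below are illustrative).

def confusion_matrix(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(confusion_matrix(actual, predicted))  # → (3, 3, 1, 1)
```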
3.9.2 Accuracy
Accuracy quantifies the overall effectiveness of a model. It is the ratio of correctly predicted
instances to the total number of instances:

Accuracy = Number of correct predictions / Total number of input samples

3.9.3 Precision
Precision quantifies the accuracy of a model's positive predictions. It is the number of correct
positive predictions divided by the total number of positive predictions generated by the model:

Precision = True Positives / (True Positives + False Positives)

3.9.4 Recall
Recall quantifies the efficacy of a classification model in correctly recognizing all relevant
instances in a dataset. Also called the true positive rate, it is the proportion of correctly identified
positive cases (TP) out of the total number of positive occurrences (TP + FN):

Recall = True Positives / (True Positives + False Negatives)

3.9.5 F1 Score
The F1 score is used to assess the overall efficacy of a classification model; it is the harmonic
mean of precision and recall. A high F1 score indicates a low number of false positives and
false negatives:

F1 score = 2 / (1/Precision + 1/Recall)

3.9.6 Area Under Curve (AUC)
When assessing binary classification models in machine learning, the Area Under the Curve
(AUC) is a critical performance indicator. It is produced by plotting the True Positive Rate (TPR)
versus the False Positive Rate (FPR) across a range of categorization criteria on the Receiver
Operating Characteristic (ROC) curve. The AUC measures the model's overall capacity to
discriminate between positive and negative classes.
The AUC values range from 0 to 1:
- 0.5: No discriminative ability, equivalent to random guessing.
- 0.5 to 0.7: Poor performance, only slightly better than random guessing.
- 0.7 to 0.8: Acceptable performance, indicating the model has some ability to distinguish
between classes.
- 0.8 to 0.9: Excellent performance, reflecting a strong capability to separate the classes.
- 0.9 to 1.0: Outstanding performance, demonstrating near-perfect classification ability.
- 1: The model perfectly discriminates between positive and negative classes.
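As a worked sketch, the metrics of Sections 3.9.2 to 3.9.5 can be computed from hypothetical confusion-matrix counts, and a trapezoidal rule can approximate the AUC from a few (FPR, TPR) points; the numbers below are illustrative, not results from this study.

```python
# Metrics from hypothetical confusion-matrix cells, plus a
# trapezoidal-rule AUC over a few (FPR, TPR) points as a rough
# stand-in for the full ROC integral.

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def auc(roc_points):
    """Trapezoidal area under (FPR, TPR) points sorted by FPR."""
    pts = sorted(roc_points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

acc, prec, rec, f1 = metrics(tp=80, tn=90, fp=10, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# → 0.85 0.889 0.8 0.842

print(round(auc([(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)]), 3))  # → 0.8
```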
3.10 Feature Importance Analysis
To identify key factors influencing higher education dropout, a comprehensive feature importance
analysis was conducted using multiple methods: Random Forest, XGBoost, Logistic Regression,
SHAP values, and Permutation Importance. Each method evaluates the contribution of features to
model predictions, providing unique insights. By averaging the importance scores from these
diverse methods, we obtain a single, more reliable measure of feature significance and a
well-rounded view of the key factors influencing higher education dropout.
Random Forest evaluates feature importance by measuring how much each feature increases the
accuracy of the model when included. It does this by averaging over many decision trees, making
it robust against overfitting (Breiman, 2001).
XGBoost, a gradient boosting technique, measures feature importance by how often each feature
is used to split the data across all trees in the ensemble. It emphasizes error reduction through
weighted adjustments, leading to highly predictive models (Chen & Guestrin, 2016).
Logistic Regression determines the importance of features by assessing the weights assigned to
each predictor. The magnitude of these weights indicates the strength and direction of the influence
of each variable (Hosmer Jr et al., 2013).
SHAP Values provide an explanation of model predictions by assigning each feature an
importance score based on its contribution to the prediction outcome. This method considers
interactions among features and provides consistent explanations (Lundberg, 2017).
Permutation Importance works by shuffling feature values and measuring the decrease in model
performance, thus identifying features that significantly impact predictions. It is particularly useful
for understanding complex models (Fisher et al., 2019).
The steps involve calculating feature importance scores using each method, normalizing these
scores to a common scale, and averaging them to obtain a "Mean Importance" score for each
feature. This combined score integrates the strengths of all methods, mitigating individual biases
and offering a more reliable assessment of feature influence.
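The normalise-and-average step can be sketched as follows; the method names, feature names, and scores are illustrative placeholders, not the study's actual importance values.

```python
# Sketch of the combination step described above: min-max normalise each
# method's importance scores to [0, 1], then average across methods to
# obtain a "Mean Importance" per feature.

def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {f: (v - lo) / (hi - lo) for f, v in scores.items()}

def mean_importance(per_method):
    normalised = [min_max(scores) for scores in per_method]
    features = normalised[0].keys()
    return {f: sum(n[f] for n in normalised) / len(normalised)
            for f in features}

# Hypothetical raw scores from two of the methods.
rf = {"units_approved": 0.40, "tuition": 0.25, "age": 0.10}
xgb = {"units_approved": 0.70, "tuition": 0.60, "age": 0.05}
combined = mean_importance([rf, xgb])
print(max(combined, key=combined.get))  # → units_approved
```

Normalising first puts methods with very different score scales (e.g. tree split counts versus regression weights) on a common footing before averaging.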
The most critical features, ranked by their mean importance scores, will be used to refine the model
and guide the identification of significant predictors of dropout, ultimately supporting targeted
intervention strategies.
This study demonstrates originality through its comprehensive use of advanced feature importance
methods, integrating Random Forest, XGBoost, Logistic Regression, SHAP values, and
Permutation Importance. Unlike traditional approaches that often rely on a single technique, the
combination of these diverse methods provides a deeper, more reliable understanding of key
determinants of higher education dropout. SHAP values and Permutation Importance, in particular,
offer advanced interpretability by explaining feature contributions and interactions within the
model, a significant enhancement over conventional techniques (Lundberg, 2017; Fisher et al.,
2019). By combining the strengths of these methods, the study not only improves prediction
accuracy but also delivers actionable insights for targeted interventions, filling a notable gap in
existing dropout prediction research.
CHAPTER 4: DATA ANALYSIS AND PRESENTATION OF RESULTS
4.1 Introduction
This chapter presents the analysis of the data used for the study and discusses the results obtained.
The chapter provides a detailed examination of the descriptive statistics, exploring the distribution
of key variables, and then focuses on evaluating the performance of the predictive models used in the
study. Comparative results for various machine learning algorithms are presented, highlighting the
robustness and weaknesses of each model. Finally, the chapter concludes with an interpretation of
the findings with the research objectives, discussing the implications for early detection of student
dropout and suggesting potential interventions based on the results.
4.2 Data quality check
Inspecting the dataset, an imbalanced data situation was encountered, wherein dropout instances
constituted approximately 32% of the dataset, while the non-dropout classes of graduates and
enrolled students represented 50% and 18% respectively. Specifically, the dataset initially
comprised 1421 dropout instances, 2209 graduates, and 794 enrolled students. After combining
the graduate and enrolled classes, the resulting non-dropout class contains 3003 instances (68%).
This imbalance can lead to biased outcomes, with models predicting the majority class (non-dropouts)
while overlooking the minority class (dropouts), despite robust model diagnostics. To tackle this
issue, the SMOTE method was applied to balance the dataset. This resulted in an even representation
of dropout and non-dropout instances, with both categories consisting of 3003 samples each, as
shown in Figure 4. This balancing step is crucial for improving the model's ability to predict
dropout instances accurately.
Figure 4. Bar plot showing the imbalanced data and the data balance
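A simplified sketch of the SMOTE idea is shown below: synthetic minority samples are interpolated between existing dropout samples. The two-feature points are hypothetical, and a second minority point stands in for one of the k nearest neighbours that real SMOTE would use.

```python
# Simplified sketch of the SMOTE idea: create synthetic minority samples
# by interpolating between a minority point and another minority point
# (real SMOTE interpolates toward one of its k nearest minority
# neighbours). The dropout samples below are hypothetical.
import random

random.seed(0)

def smote_like(minority, n_synthetic):
    synthetic = []
    for _ in range(n_synthetic):
        a, b = random.sample(minority, 2)  # a point and a "neighbour"
        t = random.random()                # position along the segment
        synthetic.append(tuple(ai + t * (bi - ai)
                               for ai, bi in zip(a, b)))
    return synthetic

dropouts = [(2.0, 7.0), (3.0, 8.0), (1.0, 6.5), (2.5, 7.5)]
new = smote_like(dropouts, n_synthetic=6)
print(len(dropouts) + len(new))  # → 10 samples after oversampling
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority class already occupies rather than simply duplicating points.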
4.3 Descriptive View of Dropout Rate for Demographic, Socioeconomic, and Academic
Variables
This section presents dropout rates across key demographic, socioeconomic, and academic
variables. The analysis reveals significant variation in dropout rates among different groups. The
dropout rate of legally separated students is highest at 66.67%, followed by married students
(47.23%) and those in de facto unions (44%). Single students, the largest group, have a dropout
rate of 30.21%.
Course dropout rates highlight that Biofuel Production Technologies (66.67%) and Informatics
Engineering (54.12%) struggle with higher dropout rates, suggesting possible issues with
curriculum difficulty or student engagement. Conversely, fields like Nursing (15.4%) and Social
Service (18.31%) show lower dropout rates, potentially due to job stability or strong support
systems. Evening students are more likely to drop out (42.86%) compared to daytime students
(30.80%). Additionally, debtors have a higher dropout rate (62.03%) than non-debtors (28.28%),
and students with up-to-date tuition fees are less likely to drop out (24.74%) compared to those
with overdue fees (86.55%).
Gender differences are also notable, with a dropout rate of 25.10% for females versus 45.05% for
males, highlighting a significant gender disparity. Age at enrollment affects dropout rates as well,
with a mean age of 26.07 for dropouts compared to 21.94 for non-dropouts, indicating that older
students may face additional challenges.
In terms of academic performance, dropouts pass an average of 2.55 curricular units in the first
semester, compared to 5.73 for non-dropouts. Dropouts also have an average grade of 7.26 out of
20, while non-dropouts average 12.24. These indicators suggest that lower grades and fewer
passed units are associated with a higher risk of dropping out. Table 2
presents a summary of a few selected categorical variables that are key to understanding the
determinants of student dropout and Table 3 presents numerical variables. For a more
comprehensive breakdown of all categorical variables and additional details, see Appendix A.
Table 2. Distribution of categorical variables

Variable | Category | Count | % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
Marital status | Single | 3919 | 88.58 | 1184 | 30.21 | 2735 | 69.79
Marital status | Married | 379 | 8.57 | 179 | 47.23 | 200 | 52.77
Marital status | Divorced | 91 | 2.06 | 42 | 46.15 | 49 | 53.85
Marital status | Facto union | 25 | 0.57 | 11 | 44 | 14 | 56
Marital status | Legally separated | 6 | 0.14 | 4 | 66.67 | 2 | 33.33
Marital status | Widower | 4 | 0.09 | 1 | 25 | 3 | 75
Course | Nursing | 766 | 17.31 | 118 | 15.4 | 648 | 84.6
Course | Management | 380 | 8.59 | 134 | 35.26 | 246 | 64.74
Course | Social Service | 355 | 8.02 | 65 | 18.31 | 290 | 81.69
Course | Veterinary Nursing | 337 | 7.62 | 90 | 26.71 | 247 | 73.29
Course | Journalism and Communication | 331 | 7.48 | 101 | 30.51 | 230 | 69.49
Course | Advertising and Marketing Management | 268 | 6.06 | 95 | 35.45 | 173 | 64.55
Course | Management (evening attendance) | 268 | 6.06 | 136 | 50.75 | 132 | 49.25
Course | Tourism | 252 | 5.7 | 96 | 38.1 | 156 | 61.9
Course | Communication Design | 226 | 5.11 | 51 | 22.57 | 175 | 77.43
Course | Animation and Multimedia Design | 215 | 4.86 | 82 | 38.14 | 133 | 61.86
Course | Social Service (evening attendance) | 215 | 4.86 | 71 | 33.02 | 144 | 66.98
Course | Agronomy | 210 | 4.75 | 86 | 40.95 | 124 | 59.05
Course | Basic Education | 192 | 4.34 | 85 | 44.27 | 107 | 55.73
Course | Informatics Engineering | 170 | 3.84 | 92 | 54.12 | 78 | 45.88
Course | Equinculture | 141 | 3.19 | 78 | 55.32 | 63 | 44.68
Course | Oral Hygiene | 86 | 1.94 | 33 | 38.37 | 53 | 61.63
Course | Biofuel Production Technologies | 12 | 0.27 | 8 | 66.67 | 4 | 33.33
Daytime/evening attendance | Day | 3941 | 89.08 | 1214 | 30.8 | 2727 | 69.2
Daytime/evening attendance | Evening | 483 | 10.92 | 207 | 42.86 | 276 | 57.14
Debtor | No | 3921 | 88.63 | 1109 | 28.28 | 2812 | 71.72
Debtor | Yes | 503 | 11.37 | 312 | 62.03 | 191 | 37.97
Tuition fees up to date | Yes | 3896 | 88.07 | 964 | 24.74 | 2932 | 75.26
Tuition fees up to date | No | 528 | 11.93 | 457 | 86.55 | 71 | 13.45
Gender | Female | 2868 | 64.83 | 720 | 25.1 | 2148 | 74.9
Gender | Male | 1556 | 35.17 | 701 | 45.05 | 855 | 54.95
Scholarship holder | No | 3325 | 75.16 | 1287 | 38.71 | 2038 | 61.29
Scholarship holder | Yes | 1099 | 24.84 | 134 | 12.19 | 965 | 87.81
Table 3. Distribution of Numerical variables

Variable | Dropout N | Dropout Mean | Dropout STD | Non-Dropout N | Non-Dropout Mean | Non-Dropout STD
Age at enrollment | 1421 | 26.069 | 8.704 | 3003 | 21.938 | 6.596
Curricular units 1st sem (credited) | 1421 | 0.6094 | 2.105 | 3003 | 0.7576 | 2.471
Curricular units 1st sem (enrolled) | 1421 | 5.8213 | 2.326 | 3003 | 6.4832 | 2.522
Curricular units 1st sem (evaluations) | 1421 | 7.7516 | 4.922 | 3003 | 8.5581 | 3.75
Curricular units 1st sem (approved) | 1421 | 2.5517 | 2.858 | 3003 | 5.7263 | 2.647
Curricular units 1st sem (grade) | 1421 | 7.2567 | 6.031 | 3003 | 12.242 | 3.062
Curricular units 1st sem (without evaluations) | 1421 | 0.1921 | 0.795 | 3003 | 0.1119 | 0.634
Curricular units 2nd sem (credited) | 1421 | 0.4497 | 1.68 | 3003 | 0.5854 | 2.021
Curricular units 2nd sem (enrolled) | 1421 | 5.7804 | 2.108 | 3003 | 6.4459 | 2.205
Curricular units 2nd sem (evaluations) | 1421 | 7.1738 | 4.817 | 3003 | 8.4842 | 3.382
Curricular units 2nd sem (approved) | 1421 | 1.9402 | 2.574 | 3003 | 5.6167 | 2.432
Curricular units 2nd sem (grade) | 1421 | 5.8993 | 6.119 | 3003 | 12.28 | 3.036
Curricular units 2nd sem (without evaluations) | 1421 | 0.2379 | 0.994 | 3003 | 0.1089 | 0.604
Unemployment rate | 1421 | 11.616 | 2.768 | 3003 | 11.542 | 2.613
Inflation rate | 1421 | 1.284 | 1.405 | 3003 | 1.2016 | 1.371
GDP | 1421 | -0.1509 | 2.252 | 3003 | 0.0743 | 2.275
4.4 Identification of important predictors of Dropout risk in Higher Education
In this study, different methods have been used to identify key determinants of student dropout.
The evaluation of feature importance across various methods such as Random Forest (RF),
XGBoost (XGB), Logistic Regression (LR), SHAP, and Permutation reveals nuanced insights into
the factors influencing student retention. Notably, "Curricular units 2nd sem (approved)"
consistently stands out as a crucial predictor across all methods, with the highest mean importance
score of 0.923. This indicates that students' performance in their second-semester courses
significantly impacts dropout risk.
"Tuition fees up to date" is another key feature, especially highlighted by XGBoost and
Permutation importance measures, showing its substantial influence on dropout risk with a mean
importance of 0.618. Conversely, features such as "Age at enrollment" and "Unemployment rate"
show varying levels of importance across different methods, suggesting their role in dropout risk
is more context-dependent.
The SHAP values provide additional clarity by highlighting the nuanced contributions of each
feature to individual predictions. For instance, "Curricular units 1st sem (approved)" and
"Curricular units 2nd sem (grade)" show notable differences in their contributions depending on
the model, emphasizing the importance of academic performance in dropout prediction.
Therefore, the results underscore the complex interplay of academic performance, financial status,
and demographic factors in predicting dropout risk. These findings suggest that while some
features have consistently high importance, others may influence dropout risk differently
depending on the context and the model used. This comprehensive analysis informs targeted
interventions aimed at reducing dropout rates by addressing the most influential predictors
identified through various interpretative methods. Table 4 and Figure 5 below summarize the
importance values from each machine learning method used, together with the overall mean and
the top 10 features influencing dropout risk.
Table 4. Feature importance results for all variables

Feature | RF | XGB | LR | SHAP | Permutation | Mean Importance
secondSemCurricularUnits_approved | 1 | 1 | 0.614 | 1 | 1 | 0.923
Tuition fees up to date | 0.65 | 0.932 | 0 | 0.586 | 0.923 | 0.618
firstSemCurricularUnits_approved | 0.659 | 0.073 | 0.735 | 0.284 | 0.367 | 0.424
secondSemCurricularUnits_grade | 0.756 | 0.022 | 0.843 | 0.175 | 0.158 | 0.391
firstSemCurricularUnits_grade | 0.455 | 0.007 | 0.887 | 0.129 | 0.091 | 0.314
secondSemCurricularUnits_enrolled | 0.107 | 0.165 | 1 | 0.104 | 0.108 | 0.297
Age at enrollment | 0.173 | 0.029 | 0.885 | 0.186 | 0.108 | 0.276
Unemployment rate | 0.239 | 0.013 | 0.872 | 0.162 | 0.095 | 0.276
CourseName | 0.158 | 0.041 | 0.906 | 0.198 | 0.059 | 0.273
Application mode | 0.162 | 0.036 | 0.873 | 0.184 | 0.095 | 0.27
Inflation rate | 0.251 | 0.017 | 0.843 | 0.12 | 0.057 | 0.258
GDP | 0.105 | 0.066 | 0.893 | 0.072 | 0.132 | 0.254
secondSemCurricularUnits_enrolled | 0.142 | 0.018 | 0.858 | 0.129 | 0.107 | 0.251
secondSemCurricularUnits_evaluations | 0.147 | 0.048 | 0.858 | 0.146 | 0.053 | 0.251
secondSemCurricularUnits_credited | 0.155 | 0.021 | 0.838 | 0.133 | 0.061 | 0.242
Father's occupation | 0.166 | 0.015 | 0.846 | 0.108 | 0.073 | 0.241
Father's qualification | 0.118 | 0.015 | 0.865 | 0.107 | 0.083 | 0.238
Mother's qualification | 0.157 | 0.007 | 0.848 | 0.08 | 0.095 | 0.237
Mother's occupation | 0.142 | 0.023 | 0.855 | 0.09 | 0.067 | 0.236
secondSemCurricularUnits_evaluations | 0.028 | 0.054 | 0.982 | 0.049 | 0.014 | 0.226
Scholarship holder | 0.181 | 0.163 | 0.503 | 0.157 | 0.103 | 0.221
Displaced | 0.071 | 0.003 | 0.871 | 0.04 | 0.049 | 0.207
secondSemCurricularUnits_credited | 0.053 | 0.002 | 0.914 | 0.042 | 0.016 | 0.205
Daytime/evening attendance | 0.017 | 0.024 | 0.949 | 0.027 | 0 | 0.204
Debtor | 0.037 | 0.028 | 0.888 | 0.023 | 0.03 | 0.201
Application order | 0.032 | 0.038 | 0.877 | 0.035 | 0.008 | 0.198
Gender | 0.037 | 0 | 0.828 | 0.039 | 0.043 | 0.189
Nationality | 0.03 | 0.009 | 0.843 | 0.02 | 0.016 | 0.184
firstSemCurricularUnits_without_evaluations | 0.008 | 0.037 | 0.853 | 0.001 | 0.01 | 0.182
Previous qualification | 0.016 | 0.014 | 0.816 | 0.014 | 0.002 | 0.172
secondSemCurricularUnits_without_evaluations | 0.016 | 0.039 | 0.782 | 0.005 | 0.004 | 0.169
Educational special needs | 0.012 | 0.003 | 0.768 | 0.032 | 0.012 | 0.165
Marital status | 0 | 0.01 | 0.807 | 0 | 0.002 | 0.164
International | 0 | 0.026 | 0.739 | 0.008 | 0.012 | 0.157
Figure 5. Top 10 Feature Importance
4.5 Machine learning model results
This section evaluates six machine learning models applied to predict student dropout. The
performance of each model is assessed using key metrics such as F1 score, precision, accuracy,
recall and AUC score. The goal is to identify the best-performing model in predicting dropout,
providing a reliable tool for early intervention. By comparing these metrics, we determine which
model is the most reliable for predicting student attrition accurately.
4.5.1 Logistic regression
The results of logistic regression show that the model performed well, with an accuracy of 87%,
demonstrating its ability to classify dropout students. Its precision of 89% indicates a strong
capability in accurately predicting true positives, while a recall of 83% suggests it also captures
most of the actual positive cases. The F1-score of 86% balances precision and recall, and the
AUC-ROC score of 94% highlights the model's capability to discriminate between the dropout
and non-dropout classes.
4.5.2 Support Vector Machine (SVM)
The SVM model was also used to predict dropout cases and performed well across the
performance metrics used in this study. It achieves an accuracy of 88%, correctly classifying 88%
of cases. A high precision of 90% and a recall of 84% indicate that the model effectively identifies
true positives while keeping false negatives reasonably low. The F1-score of 87% reflects its
balanced performance, and the AUC-ROC of 94% confirms its solid discriminative power.
4.5.3 Decision Tree
The Decision Tree model predicted dropout cases with an accuracy of 86%, slightly lower than
the other models used in this study. Its precision of 86% and recall of 84% demonstrate a moderate
capability of correctly identifying positive and negative cases. The F1 score of 85% reflects a
balance between precision and recall, while the AUC-ROC score of 92% suggests good
discriminative effectiveness between classes.
4.5.4 Random Forest
The Random Forest model predicts dropout cases with the highest accuracy of 90%, showing a
strong ability to make correct dropout predictions. On the other metrics it also performed very
well, with a recall of 86%, suggesting that it captures most positive cases. The model's precision
of 92% and F1-score of 89% indicate a well-balanced performance. Its AUC-ROC score of
95.21% further confirms its excellent discriminatory power, making it one of the best-performing
models used in this study.
4.5.5 Gradient Boosting Machine
The Gradient Boosting model performed as well as SVM, with an accuracy of 88%. Its precision
of 90% and recall of 84% underline its good performance in correctly recognizing true positives.
With an F1-score of 87% it presents a balanced performance, while its AUC-ROC score of 94%
shows very good discrimination between the dropout and non-dropout classes, indicating robust
performance.
4.5.6 XGBoost
The accuracy rate of the XGBoost model was a notable 90%, showing that it reliably generates
correct predictions. It also has high precision, at 92%, with a recall of 86%, which means it is
good at picking out true positives and hence minimizing false negatives. The F1-score is 89%,
showing balanced performance, while an AUC-ROC score of 95.30% further confirms its
superiority in discriminating between the dropout and non-dropout classes.
Table 5. Model evaluation metrics for six classifiers.

Model | Accuracy | AUC Score | Precision | Recall | F1 Score
Logistic Regression | 0.87 | 0.94 | 0.89 | 0.83 | 0.86
Decision Tree | 0.86 | 0.92 | 0.86 | 0.84 | 0.85
Random Forest | 0.90 | 0.95 | 0.92 | 0.86 | 0.89
Gradient Boosting Machine | 0.88 | 0.95 | 0.89 | 0.84 | 0.87
Support Vector Machine | 0.88 | 0.94 | 0.90 | 0.84 | 0.87
XGBoost | 0.90 | 0.95 | 0.92 | 0.86 | 0.89
4.5.7 Comparative analysis of Model performances
The results of all six trained models presented above reveal that XGBoost and Random Forest
are the top-performing models, with the highest overall accuracy, AUC-ROC score, and balanced
F1-score. These results indicate that ensemble methods built from multiple decision trees provide
superior predictive accuracy compared to individual models such as Decision Tree or Logistic
Regression. These models consistently identified the key determinants of dropout more
accurately than the other methods, making them the preferred choices for early detection of
dropout.
Figure 6 shows the ROC curves of all six models and confirms XGBoost and Random Forest as
the two best models. They obtain AUCs of 95.30% and 95.21% respectively, indicating that these
models are highly capable of distinguishing between the positive and negative classes. Logistic
Regression achieved a ROC-AUC of 94%. The Gradient Boosting Machine also shows high
discriminative power, with a ROC-AUC of 95%. Support Vector Machine and Decision Tree
obtain high AUC values of 94% and 92% respectively, but the XGBoost and Random Forest
models outperform them.
Figure 6. ROC Curves for six ML models
Figure 7 outlines the confusion matrices of the six models that were trained. Logistic Regression
and SVM performed well, with relatively balanced classification, but still noticeable
misclassifications. Decision Tree was more problematic in differentiating between dropout and
non-dropout cases. Random Forest reveals the best results, with the fewest misclassifications,
demonstrating strong performance in separating classes accurately. Gradient Boosting and
XGBoost also performed well, with XGBoost closely matching Random Forest in minimizing
errors. Overall, ensemble models, particularly XGBoost and Random Forest, excelled in accurately
predicting dropouts. Therefore, the XGBoost model was selected as the best model for predicting
student dropout due to its superior performance across all metrics, particularly based on its AUC
value of 95.30% and its capability of handling imbalanced classes. Random Forest was a close
second, highlighting the effectiveness of ensemble techniques in identifying at-risk students.
Figure 7. Confusion matrices for six employed ML classifiers
CHAPTER 5. DISCUSSION OF THE RESULTS
5.1 Introduction
This chapter delves into the interpretation and implications of the key findings from the study on
predicting student dropout in higher education using machine learning models. By analyzing the
performance of various models, including Support Vector Machine, XGBoost, Logistic
Regression, Decision Tree, Random Forest, and Gradient Boosting Machine, the discussion
focuses on identifying the most effective approach for early detection of dropout. It also examines
the importance of various features influencing dropout rates, emphasizing both academic and
socio-economic determinants. The discussion aims to provide a comprehensive understanding of
the study’s results, their relevance in the existing body of literature, and their potential implications
for higher education institutions.
5.2 Key Findings Discussion
The focus of this study is building a predictive model for student dropout in higher education. Six
machine learning models, namely Support Vector Machine, XGBoost, Logistic Regression,
Decision Tree, Random Forest, and Gradient Boosting Machine were used and compared to
identify the most effective predictive model for early detection of student dropout in higher
education. Among these six models trained and tested, XGBoost outperformed other models with
an AUC-ROC score of 95.30%. The second-best performer is Random Forest, which closely
follows XGBoost with an AUC score of 95.21%. This is consistent with Bentéjac et al. (2019),
who underscore XGBoost's superior performance in classification tasks compared to other
models. This finding also aligns with a previous study by Park & Yoo (2021), which found
Random Forest to be the best-performing model, demonstrating its robustness and reliability for
predicting student dropout.
The results demonstrate the strong predictive power of ensemble learning techniques when
working with an imbalanced dataset, where dropout instances are fewer than those of the other
classes. A recent
study utilizing ensemble learning models, including a novel stacking ensemble, demonstrated high
performance in predicting student dropout, with testing accuracy reaching 92.18%. This aligns
with the strong results achieved by Random Forest and XGBoost in this research, reinforcing the
effectiveness of ensemble methods in dropout prediction (Niyogisubizo et al., 2022).
The SVM and logistic regression models also gave strong results, with accuracies of 88% and
87% respectively and the same AUC-ROC of 94%. While these models yielded a high degree of
precision and recall, their performance was slightly below that of the ensemble methods,
especially for the complex nonlinear relationships inherent in this data. The Decision Tree model
performed lowest; even after its parameters were tuned to address overfitting, it achieved an
accuracy of 86% and an AUC-ROC of 92%, and it remained less robust than the more
sophisticated ensemble methods, Random Forest and XGBoost, which delivered better accuracy
and AUC-ROC.
Feature importance was analyzed by several methods such as Random Forest, XGBoost, Logistic
Regression, SHAP values, and permutation importance to make sure that important determinants
were taken into account. The most influential feature throughout proved to be the number of
curricular units approved in the second semester, which was consistently ranked first across all
the methods with an average importance score of 0.923. This indicates that academic
performance, especially in the latter part of the year, is one of the strongest factors in whether
students remain or drop out, as highlighted by Nurmalitasari et al. (2023). Other studies likewise
identify student engagement and achievement as among the best predictors of student retention.
Financial stability also emerged as a significant determinant, with “tuition fees up to date” scoring
highly in importance (mean score of 0.618), emphasizing the impact of financial constraints on
student persistence. The findings of this study align with previous research, such as a Peruvian
study that also identified age, term, and financing method as critical dropout predictors, with
Random Forest showing superior performance (AUC 0.9623). This emphasizes the importance of
financial and contextual factors in both developed and developing countries, supporting the global
relevance of these determinants (Jiménez et al., 2023).
Tinto (2012) highlighted financial constraints as a key determinant of student attrition. Other
socioeconomic factors, including unemployment and inflation rates, were critical in showing the
wider context in which students' academic journeys unfold. The course of study also carried
moderate importance, indicating that different courses significantly influence dropout rates; course
difficulty, engagement, or relevance may affect retention, underscoring the need for course-specific
support and improvements. Parental qualifications and occupations also featured prominently,
showing that students from relatively less educated families face more challenges in their academic
paths. The multi-method strategy employed in this study substantiated the significance of these
determinants, as their relevance was validated across multiple models and feature-importance
metrics for dropout prediction.
CHAPTER 6. CONCLUSION AND RECOMMENDATIONS
6.1 Introduction
This concluding chapter summarizes the general contribution of the findings, provides actionable
recommendations for higher education policy and practice, and suggests some future research
avenues.
6.2 Conclusion
The findings of the present research emphasize that sophisticated machine learning techniques,
particularly ensemble methods such as XGBoost and Random Forest, can be applied effectively to
the reliable prediction of student dropout in higher education, as demonstrated by the high accuracy
and AUC-ROC values these methods achieved. These models therefore provide institutions with
an important tool for timely and effective intervention through early identification of dropout
predictors.
The research highlighted critical factors contributing to student dropout, revealing that academic
achievement, financial stability, and socioeconomic conditions are the most significant influences.
These results align with the current body of literature and emphasize the complex nature of student
dropout, which cannot be ascribed to a singular cause but rather to the interaction of academic,
financial, and contextual elements. Approved curricular units are highly relevant, and tuition-fee
status has deep implications, so universities should focus their efforts on academic support and
financial aid as part of their retention strategies.
Furthermore, the research contributes to the growing body of knowledge on dropout prediction by
providing a comparative evaluation of ML models, offering practical insights into their strengths.
Several limitations may affect the generalization of the findings: the study relied on a single
dataset, so the results may not transfer to other educational contexts. Moreover, this study did not
consider non-quantifiable variables, such as students' motivations, social influences, and emotional
well-being, which may also contribute to student dropout.
6.3 Recommendations
Despite the promising results achieved with ensemble methods like XGBoost and Random Forest
in predicting higher education dropout, further advancements are needed for practical application
and research. It is recommended that higher education institutions integrate these models into their
student management systems for early identification of at-risk students. Emphasis should be placed
on targeting key determinants such as academic support, financial aid, and student engagement
initiatives to effectively address dropout factors. Additionally, incorporating more diverse datasets
and exploring advanced techniques such as deep learning could further enhance the models'
predictive accuracy. Institutions should ensure the continuous evaluation and updating of these
models to adapt to changing student behaviors and institutional dynamics, thus maintaining their
effectiveness and relevance.
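As a rough sketch of how a trained model could feed the early-warning workflow recommended above: all names, fields, and the threshold below are assumptions for illustration, not part of the study.

```python
# Illustrative sketch only: flagging at-risk students from a trained model's
# dropout probabilities so they can be routed to support services.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

RISK_THRESHOLD = 0.6  # assumed cut-off; in practice tuned to advising capacity

def flag_at_risk(model, features, student_ids, threshold=RISK_THRESHOLD):
    """Return (student_id, probability) pairs at or above the risk threshold."""
    probs = model.predict_proba(features)[:, 1]
    return [(sid, p) for sid, p in zip(student_ids, probs) if p >= threshold]

ids = [f"S{i:04d}" for i in range(len(X))]  # hypothetical student IDs
at_risk = flag_at_risk(model, X, ids)
print(f"{len(at_risk)} of {len(ids)} students flagged for early intervention")
```

The threshold, not the model, encodes the institutional policy: lowering it catches more at-risk students at the cost of more false alarms, which is why continuous re-evaluation is recommended.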
In addition to higher learning institutions, other key stakeholders such as policymakers,
government bodies, and non-governmental organizations (NGOs) play a crucial role in addressing
student dropout. Policymakers can implement nationwide strategies to improve access to
education, while NGOs can provide financial and social support to at-risk students. Furthermore,
the role of families and communities should not be overlooked, as they are instrumental in
encouraging student retention.
REFERENCES:
Aguiar, E., Lakkaraju, H., Bhanpuri, N., Miller, D., Yuhas, B., & Addison, K. L. (2015). Who,
when, and why: A machine learning approach to prioritizing students at risk of not
graduating high school on time. Proceedings of the Fifth International Conference on
Learning Analytics And Knowledge, 93–102. https://doi.org/10.1145/2723576.2723619
Aina, C., Baici, E., Casalone, G., & Pastore, F. (2022). The determinants of university dropout:
A review of the socio-economic literature. Socio-Economic Planning Sciences, 79,
101102. https://doi.org/10.1016/j.seps.2021.101102
Alyahyan, E., & Düştegör, D. (2020). Predicting academic success in higher education:
Literature review and best practices. International Journal of Educational Technology in
Higher Education, 17(1), 3. https://doi.org/10.1186/s41239-020-0177-7
Andrade-Girón, D., Sandivar-Rosas, J., Marín-Rodriguez, W., Susanibar-Ramirez, E., Toro-
Dextre, E., Ausejo-Sanchez, J., Villarreal-Torres, H., & Angeles-Morales, J. (2023).
Predicting Student Dropout based on Machine Learning and Deep Learning: A
Systematic Review. ICST Transactions on Scalable Information Systems.
https://doi.org/10.4108/eetsis.3586
Andrea, M. (2024, March 12). College Dropout Rates in 2024: Higher Education Statistics.
https://www.skillademia.com/blog/college-dropout-rates/
Araque, F., Roldán, C., & Salguero, A. (2009). Factors influencing university drop out rates.
Computers & Education, 53(3), 563–574. https://doi.org/10.1016/j.compedu.2009.03.013
Asthana, P., & Hazela, B. (2020). Applications of Machine Learning in Improving Learning
Environment. In S. Tanwar, S. Tyagi, & N. Kumar (Eds.), Multimedia Big Data
Computing for IoT Applications (Vol. 163, pp. 417–433). Springer Singapore.
https://doi.org/10.1007/978-981-13-8759-3_16
Baker, R., & Inventado, P. (2014). Educational Data Mining and Learning Analytics (pp. 61–
75). https://doi.org/10.1007/978-1-4614-3305-7_4
Baker, R., & Siemens, G. (2014). Learning analytics and educational data mining. Cambridge
Handbook of the Learning Sciences, 253–272.
Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2019). A Comparative Analysis of XGBoost.
https://doi.org/10.48550/arXiv.1911.01914
Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol. 4).
Springer. https://link.springer.com/book/9780387310732
Bowers, A. J., Sprott, R., & Taff, S. A. (2013). Do We Know Who Will Drop Out?: A Review of
the Predictors of Dropping out of High School: Precision, Sensitivity, and Specificity.
The High School Journal, 96(2), 77–100.
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/A:1010933404324
Callender, C. (1999). The Hardship of Learning: Students’ income and expenditure and their
impact on participation in further education. Further Education Funding Council
Coventry. https://www.voced.edu.au/content/ngv:68297
Casanova, J. R., Assis Gomes, C., Almeida, L. S., Tuero, E., & Bernardo, A. B. (2023). "If I
were young…": Increased Dropout Risk of Older University Students. Revista
Electrónica de Investigación Educativa, 25.
https://doi.org/10.24320/redie.2023.25.e27.5671
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–
357.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 785–794. https://doi.org/10.1145/2939672.2939785
Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC)
over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 6.
https://doi.org/10.1186/s12864-019-6413-7
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
https://doi.org/10.1007/BF00994018
Creswell, J. W., & Creswell, J. D. (2017). Research design: Qualitative, quantitative, and mixed
methods approaches. Sage publications.
Dake, D., & Buabeng-Andoh, C. (2022). Using Machine Learning Techniques to Predict Learner
Drop-out Rate in Higher Educational Institutions. Mobile Information Systems, 2022, 1–
9. https://doi.org/10.1155/2022/2670562
Dass, S., Gary, K., & Cunningham, J. (2021). Predicting student dropout in self-paced MOOC
course using random forest model. Information, 12(11), 476.
Del Bonifro, F., Gabbrielli, M., Lisanti, G., & Zingaro, S. P. (2020). Student Dropout Prediction.
In I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, & E. Millán (Eds.), Artificial
Intelligence in Education (Vol. 12163, pp. 129–140). Springer International Publishing.
https://doi.org/10.1007/978-3-030-52237-7_11
Fisher, A., Rudin, C., & Dominici, F. (2019). All models are wrong, but many are useful:
Learning a variable’s importance by studying an entire class of prediction models
simultaneously. Journal of Machine Learning Research, 20(177), 1–81.
García, S., Luengo, J., & Herrera, F. (2015). Data Preprocessing in Data Mining (Vol. 72).
Springer International Publishing. https://doi.org/10.1007/978-3-319-10247-4
GeeksforGeeks. (2024, February 22). Random Forest Algorithm in Machine Learning.
GeeksforGeeks. https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/
Gilmore, E., Estivill-Castro, V., & Hexel, R. (2021). More Interpretable Decision Trees. In H.
Sanjurjo González, I. Pastor López, P. García Bringas, H. Quintián, & E. Corchado
(Eds.), Hybrid Artificial Intelligent Systems (pp. 280–292). Springer International
Publishing. https://doi.org/10.1007/978-3-030-86271-8_24
Guzmán, A., Barragán, S., & Cala Vitery, F. (2021). Dropout in Rural Higher Education: A
Systematic Review. Frontiers in Education, 6.
https://doi.org/10.3389/feduc.2021.727833
Hastie, T., Friedman, J., & Tibshirani, R. (2001). The Elements of Statistical Learning. Springer
New York. https://doi.org/10.1007/978-0-387-21606-5
Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. John
Wiley & Sons.
https://books.google.com/books?hl=en&lr=&id=bRoxQBIZRd4C&oi=fnd&pg=PR13&d
q=Hosmer,+Lemeshow+%26+Sturdivant.+(2013).+Applied+Logistic+Regression.+&ots
=kM2Nsu4Wde&sig=P5zmlP_6tVyNVe-F4xZGX91B_PE
Jagwani, A., & Aloysius, S. (2019). A review of machine learning in education.
Jiménez, O., Jesús, A., & Wong, L. (2023). Model for the Prediction of Dropout in Higher
Education in Peru applying Machine Learning Algorithms: Random Forest, Decision
Tree, Neural Network and Support Vector Machine. 2023 33rd Conference of Open
Innovations Association (FRUCT), 116–124.
https://doi.org/10.23919/FRUCT58615.2023.10143068
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects.
Science, 349(6245), 255–260. https://doi.org/10.1126/science.aaa8415
Kabathova, J., & Drlik, M. (2021). Towards predicting student’s dropout in university courses
using different machine learning techniques. Applied Sciences, 11(7), 3130.
Kharb, L., & Singh, P. (2021). Role of Machine Learning in Modern Education and Teaching. In
Impact of AI Technologies on Teaching, Learning, and Research in Higher Education
(pp. 99–123). IGI Global. https://doi.org/10.4018/978-1-7998-4763-2.ch006
Kim, D., & Kim, S. (2018). Sustainable Education: Analyzing the Determinants of University
Student Dropout by Nonlinear Panel Data Models. Sustainability, 10(4), Article 4.
https://doi.org/10.3390/su10040954
Kirsch, D., & Hurwitz, J. (2018). Machine Learning for dummies. Hoboken: IBM.
https://www.ibm.com/downloads/cas/GB8ZMQZ3
Kotsiantis, S. B., Pierrakeas, C. J., & Pintelas, P. E. (2003). Preventing Student Dropout in
Distance Learning Using Machine Learning Techniques. In V. Palade, R. J. Howlett, &
L. Jain (Eds.), Knowledge-Based Intelligent Information and Engineering Systems (Vol.
2774, pp. 267–274). Springer Berlin Heidelberg.
https://doi.org/10.1007/978-3-540-45226-3_37
Kumar, M., Singh, A. J., & Handa, D. (2017). Literature survey on educational dropout
prediction. International Journal of Education and Management Engineering, 7(2), 8.
Lamba, M., & Madhusudhan, M. (2022). Predictive Modeling. In M. Lamba & M. Madhusudhan
(Eds.), Text Mining for Information Professionals: An Uncharted Territory (pp. 213–
242). Springer International Publishing. https://doi.org/10.1007/978-3-030-85085-2_8
Larusson, J. A., & White, B. (Eds.). (2014). Learning Analytics: From Research to Practice.
Springer New York. https://doi.org/10.1007/978-1-4614-3305-7
Lee, J., Kim, M., Kim, D., & Gil, J.-M. (2021). Evaluation of Predictive Models for Early
Identification of Dropout Students. Journal of Information Processing Systems, 17(3).
https://s3.ap-northeast-2.amazonaws.com/journalhome/journal/jips/fullText/594/jips14.pdf
Letseka, M., & Breier, M. (2008). Student poverty in higher education: The impact of higher
education dropout on poverty. Education and Poverty Reduction Strategies: Issues of
Policy Coherence: Colloquium Proceedings, 83–101.
https://www.researchgate.net/profile/Ursula-Hoadley-2/publication/237260645_The_boundaries_of_care_Education_policy_interventions_for_vulnerable_children/links/544e2fc20cf26dda088e5e3a/The-boundaries-of-care-Education-policy-interventions-for-vulnerable-children.pdf#page=107
Lugyi, N. (2024). Gender Differences in Dropout Rates Across Course Types in Higher
Education. International Journal of Education and Research, 12(5), 89.
Lundberg, S. (2017). A unified approach to interpreting model predictions. arXiv Preprint
arXiv:1705.07874.
https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, G., & Loumos, V. (2009). Dropout
prediction in e-learning courses through the combination of machine learning techniques.
Computers & Education, 53(3), 950–965.
Magolda, M., & Astin, A. (1993). What Matters in College: Four Critical Years Revisited.
Educational Researcher, 22. https://doi.org/10.2307/1176821
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval.
Cambridge university press. https://dl.acm.org/doi/abs/10.5555/1394399
Marbouti, F., Diefes-Dux, H. A., & Madhavan, K. (2016). Models for early prediction of at-risk
students in a course using standards-based grading. Computers & Education, 103, 1–15.
https://doi.org/10.1016/j.compedu.2016.09.005
Mariano, A. M., Ferreira, A. B. de M. L., Santos, M. R., Castilho, M. L., & Bastos, A. C. F. L. C.
(2022). Decision trees for predicting dropout in Engineering Course students in Brazil.
Procedia Computer Science, 214, 1113–1120.
Martins, M. V., Tolledo, D., Machado, J., Baptista, L. M. T., & Realinho, V. (2021). Early
Prediction of student’s Performance in Higher Education: A Case Study. In Á. Rocha, H.
Adeli, G. Dzemyda, F. Moreira, & A. M. Ramalho Correia (Eds.), Trends and
Applications in Information Systems and Technologies (Vol. 1365, pp. 166–175).
Springer International Publishing. https://doi.org/10.1007/978-3-030-72657-7_16
Mduma, N. (2023). Data Balancing Techniques for Predicting Student Dropout Using Machine
Learning. Data, 8(3), Article 3. https://doi.org/10.3390/data8030049
Mishra, A., Gupta, D., & Chetty, G. (Eds.). (2023). Advances in IoT and Security with
Computational Intelligence: Proceedings of ICAISA 2023, Volume 2 (Vol. 756). Springer
Nature Singapore. https://doi.org/10.1007/978-981-99-5088-1
Mitchell, T. M. (1997). Machine learning (Vol. 1). McGraw-Hill New York.
http://www.pachecoj.com/courses/csc380_fall21/lectures/mlintro.pdf
Morgan, P. L., Farkas, G., Hillemeier, M. M., & Maczuga, S. (2009). Risk Factors for Learning-
Related Behavior Problems at 24 Months of Age: Population-Based Estimates. Journal of
Abnormal Child Psychology, 37(3), 401–413. https://doi.org/10.1007/s10802-008-9279-8
Nimy, E., Mosia, M., & Chibaya, C. (2023). Identifying At-Risk Students for Early
Intervention—A Probabilistic Machine Learning Approach. Applied Sciences, 13(6),
3869.
Niyogisubizo, J., Liao, L., Nziyumva, E., Murwanashyaka, E., & Nshimyumukiza, P. C. (2022).
Predicting student’s dropout in university classes using two-layer ensemble machine
learning approach: A novel stacked generalization. Computers and Education: Artificial
Intelligence, 3, 100066. https://doi.org/10.1016/j.caeai.2022.100066
Nurmalitasari, N., Awang Long, Z., & Mohd Noor, F. (2023). Factors Influencing Dropout
Students in Higher Education. Education Research International, 2023, 1–13.
https://doi.org/10.1155/2023/7704142
OECD. (2009). How many students drop out of tertiary education? In HIGHLIGHTS from
Education at a Glance, 2008 (pp. 24–26). OECD Publishing Paris.
Oqaidi, K., Aouhassi, S., & Mansouri, K. (2022). Towards a students’ dropout prediction model
in higher education institutions using machine learning algorithms. International Journal
of Emerging Technologies in Learning (iJET), 17(18), 103–117.
Osborne, J. B., & Lang, A. S. (2023). Predictive Identification of At-Risk Students: Using
Learning Management System Data. Journal of Postsecondary Student Success, 2(4),
108–126.
Park, H., & Yoo, S. (2021). Early Dropout Prediction in Online Learning of University using
Machine Learning. JOIV : International Journal on Informatics Visualization, 5, 347.
https://doi.org/10.30630/joiv.5.4.732
Paura, L., & Arhipova, I. (2014). Cause analysis of students’ dropout rate in higher education
study program. Procedia-Social and Behavioral Sciences, 109, 1282–1286.
Powdthavee, N., & Vignoles, A. (2008). The Socio-Economic Gap in University Drop Out.
Ramraj, S., Uzir, N., Sunil, R., & Banerjee, S. (2016). Experimenting XGBoost algorithm for
prediction and classification of different datasets. International Journal of Control
Theory and Applications, 9(40), 651–662.
Romero, C., Ventura, S., Pechenizkiy, M., & Baker, R. S. J. d. (2010). Handbook of Educational
Data Mining. CRC Press.
Rumberger, R. (2011). Dropping Out: Why Students Drop Out of High School and What Can Be
Done About It. https://doi.org/10.4159/harvard.9780674063167
Rumberger, R. (2020). The economics of high school dropouts (pp. 149–158).
https://doi.org/10.1016/B978-0-12-815391-8.00012-4
Rumberger, R. W., & Lim, S. A. (2008). Why students drop out of school: A review of 25 years
of research. https://www.issuelab.org/resources/11658/11658.pdf
Saad Hussein, A., Li, T., Yohannese, C. W., & Bashir, K. (2019). A-SMOTE: A New
Preprocessing Approach for Highly Imbalanced Datasets by Improving SMOTE:
International Journal of Computational Intelligence Systems, 12(2), 1412.
https://doi.org/10.2991/ijcis.d.191114.002
Saeed, S., Ahmed, S., & Joseph, S. (2024). Machine Learning in the Big Data Age:
Advancements, Challenges, and Future Prospects.
Sammut, C., & Webb, G. I. (2011). Encyclopedia of machine learning. Springer Science &
Business Media.
https://books.google.com/books?hl=en&lr=&id=i8hQhp1a62UC&oi=fnd&pg=PA3&dq=
Claude+Sammut+%26+Geoffrey+I.+Webb.+(2011).+Encyclopedia+of+Machine+Learni
ng.+Springer.&ots=92kazyjGaQ&sig=JuTFAewGWe60D0z-R1N760LkRs4
Sara, N.-B., Halland, R., Igel, C., & Alstrup, S. (2015). High-School Dropout Prediction Using
Machine Learning: A Danish Large-scale Study. ESANN 2015, 23rd European
Symposium on Artificial Neural Networks.
https://books.google.com/books?hl=en&lr=&id=USGLCgAAQBAJ&oi=fnd&pg=PA319&dq=Nicolae-Bogdan+Sara,+Rasmus+Halland,+Christian+Igel,+%26+Stephen+Alstrup.+(2015).+High-School+Dropout+Prediction+Using+Machine+Learning:+A+Danish+Large-scale+Study.+s,+European+Symposium+on+Artificial+Neural+Networks.&ots=FuebiuJZSN&sig=-HQgEmXHwGa8vY5MIV0YfjrE4Qo
Baker, R. S. J. d., & Inventado, P. S. (2014). Chapter 4: Educational Data Mining and Learning
Analytics. Springer.
Smelser, N. J., & Baltes, P. B. (2001). International encyclopedia of the social & behavioral
sciences (Vol. 11). Elsevier Amsterdam.
http://www.law.harvard.edu/faculty/shavell/pdf/12_Inter_Ency_Soc_8446.pdf
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for
classification tasks. Information Processing & Management, 45(4), 427–437.
Stinebrickner, R., & Stinebrickner, T. (2014). Academic Performance and College Dropout:
Using Longitudinal Expectations Data to Estimate a Learning Model. Journal of Labor
Economics, 32(3), 601–644. https://doi.org/10.1086/675308
SYDLE. (2024). University Dropout: Why Does It Happen And How Can You Prevent It? Blog
SYDLE. https://www.sydle.com/blog/university-dropout-639a22f22ff02745fa4eface
Thayer, P. B. (2000). Retention of Students from First Generation and Low Income
Backgrounds. Council for Opportunity in Education, 1025 Vermont Ave.
https://eric.ed.gov/?id=ED446633
Tinto, V. (2012). Leaving college: Rethinking the causes and cures of student attrition.
University of Chicago press.
https://books.google.com/books?hl=en&lr=&id=TlVhEAAAQBAJ&oi=fnd&pg=PR7&d
q=Tinto,+V.+(1993).+Leaving+College:+Rethinking+the+Causes+and+Cures+of+Stude
nt+Attrition.+&ots=yg1VO4MTs1&sig=mhuyHlM7eA7Ty3vLzdAfRpYIlxA
Villegas-Ch, W., Govea, J., & Revelo-Tapia, S. (2023). Improving Student Retention in
Institutions of Higher Education through Machine Learning: A Sustainable Approach.
Sustainability, 15(19), Article 19. https://doi.org/10.3390/su151914512
Appendix A: Distribution of all Categorical Variables

Columns: Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %

Marital status
Single | 3919 | 88.58 | 1184 | 30.21 | 2735 | 69.79
Married | 379 | 8.57 | 179 | 47.23 | 200 | 52.77
Divorced | 91 | 2.06 | 42 | 46.15 | 49 | 53.85
Facto union | 25 | 0.57 | 11 | 44 | 14 | 56
Legally separated | 6 | 0.14 | 4 | 66.67 | 2 | 33.33
Widower | 4 | 0.09 | 1 | 25 | 3 | 75

Application mode
1st phase—general contingent | 1708 | 38.61 | 345 | 20.2 | 1363 | 79.8
2nd phase—general contingent | 872 | 19.71 | 256 | 29.36 | 616 | 70.64
Over 23 years old | 785 | 17.74 | 435 | 55.41 | 350 | 44.59
Change in course | 312 | 7.05 | 115 | 36.86 | 197 | 63.14
Technological specialization diploma holders | 213 | 4.81 | 63 | 29.58 | 150 | 70.42
Holders of other higher courses | 139 | 3.14 | 85 | 61.15 | 54 | 38.85
3rd phase—general contingent | 124 | 2.8 | 45 | 36.29 | 79 | 63.71
Transfer | 77 | 1.74 | 34 | 44.16 | 43 | 55.84
Change in institution/course | 59 | 1.33 | 20 | 33.9 | 39 | 66.1
1st phase—special contingent (Madeira Island) | 38 | 0.86 | 5 | 13.16 | 33 | 86.84
Short cycle diploma holders | 35 | 0.79 | 4 | 11.43 | 31 | 88.57
International student (bachelor) | 30 | 0.68 | 5 | 16.67 | 25 | 83.33
1st phase—special contingent (Azores Island) | 16 | 0.36 | 2 | 12.5 | 14 | 87.5
Ordinance No. 854-B/99 | 10 | 0.23 | 3 | 30 | 7 | 70
Ordinance No. 612/93 | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
Ordinance No. 533-A/99, item b2) (Different Plan) | 1 | 0.02 | 1 | 100 | 0 | 0
Ordinance No. 533-A/99, item b3 (Other Institution) | 1 | 0.02 | 1 | 100 | 0 | 0
Change in institution/course (International) | 1 | 0.02 | 0 | 0 | 1 | 100

Application order
1 | 3026 | 68.4 | 1053 | 34.8 | 1973 | 65.2
2 | 547 | 12.36 | 150 | 27.42 | 397 | 72.58
3 | 309 | 6.98 | 76 | 24.6 | 233 | 75.4
4 | 249 | 5.63 | 58 | 23.29 | 191 | 76.71
5 | 154 | 3.48 | 53 | 34.42 | 101 | 65.58
6 | 137 | 3.1 | 31 | 22.63 | 106 | 77.37
9 | 1 | 0.02 | 0 | 0 | 1 | 100
0 | 1 | 0.02 | 0 | 0 | 1 | 100

Course
Nursing | 766 | 17.31 | 118 | 15.4 | 648 | 84.6
Management | 380 | 8.59 | 134 | 35.26 | 246 | 64.74
Social Service | 355 | 8.02 | 65 | 18.31 | 290 | 81.69
Veterinary Nursing | 337 | 7.62 | 90 | 26.71 | 247 | 73.29
Journalism and Communication | 331 | 7.48 | 101 | 30.51 | 230 | 69.49
Advertising and Marketing Management | 268 | 6.06 | 95 | 35.45 | 173 | 64.55
Management (evening attendance) | 268 | 6.06 | 136 | 50.75 | 132 | 49.25
Tourism | 252 | 5.7 | 96 | 38.1 | 156 | 61.9
Communication Design | 226 | 5.11 | 51 | 22.57 | 175 | 77.43
Animation and Multimedia Design | 215 | 4.86 | 82 | 38.14 | 133 | 61.86
Social Service (evening attendance) | 215 | 4.86 | 71 | 33.02 | 144 | 66.98
Agronomy | 210 | 4.75 | 86 | 40.95 | 124 | 59.05
Basic Education | 192 | 4.34 | 85 | 44.27 | 107 | 55.73
Informatics Engineering | 170 | 3.84 | 92 | 54.12 | 78 | 45.88
Equinculture | 141 | 3.19 | 78 | 55.32 | 63 | 44.68
Oral Hygiene | 86 | 1.94 | 33 | 38.37 | 53 | 61.63
Biofuel Production Technologies | 12 | 0.27 | 8 | 66.67 | 4 | 33.33

Daytime/evening attendance
Day | 3941 | 89.08 | 1214 | 30.8 | 2727 | 69.2
Evening | 483 | 10.92 | 207 | 42.86 | 276 | 57.14

Previous qualification
Secondary education | 3717 | 84.02 | 1078 | 29 | 2639 | 71
Higher education—bachelor's degree | 23 | 0.52 | 16 | 69.57 | 7 | 30.43
Higher education—degree | 126 | 2.85 | 75 | 59.52 | 51 | 40.48
Higher education—master's degree | 8 | 0.18 | 4 | 50 | 4 | 50
Higher education—doctorate | 1 | 0.02 | 1 | 100 | 0 | 0
Frequency of higher education | 16 | 0.36 | 7 | 43.75 | 9 | 56.25
12th year of schooling—not completed | 11 | 0.25 | 11 | 100 | 0 | 0
11th year of schooling—not completed | 4 | 0.09 | 3 | 75 | 1 | 25
Other—11th year of schooling | 45 | 1.02 | 26 | 57.78 | 19 | 42.22
10th year of schooling | 1 | 0.02 | 1 | 100 | 0 | 0
10th year of schooling—not completed | 2 | 0.05 | 1 | 50 | 1 | 50
Basic education 3rd cycle (9th/10th/11th year) or equivalent | 162 | 3.66 | 104 | 64.2 | 58 | 35.8
Basic education 2nd cycle (6th/7th/8th year) or equivalent | 7 | 0.16 | 3 | 42.86 | 4 | 57.14
Technological specialization course | 219 | 4.95 | 69 | 31.51 | 150 | 68.49
Higher education—degree (1st cycle) | 40 | 0.9 | 14 | 35 | 26 | 65
Professional higher technical course | 36 | 0.81 | 6 | 16.67 | 30 | 83.33
Higher education—master's degree (2nd cycle) | 6 | 0.14 | 2 | 33.33 | 4 | 66.67

Nationality
Portuguese | 4314 | 97.51 | 1389 | 32.2 | 2925 | 67.8
Brazilian | 38 | 0.86 | 14 | 36.84 | 24 | 63.16
Santomean | 14 | 0.32 | 1 | 7.14 | 13 | 92.86
Spanish | 13 | 0.29 | 4 | 30.77 | 9 | 69.23
Cape Verdean | 13 | 0.29 | 4 | 30.77 | 9 | 69.23
Guinean | 5 | 0.11 | 1 | 20 | 4 | 80
Italian | 3 | 0.07 | 0 | 0 | 3 | 100
Moldova (Republic of) | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
Ukrainian | 3 | 0.07 | 1 | 33.33 | 2 | 66.67
German | 2 | 0.05 | 0 | 0 | 2 | 100
Angolan | 2 | 0.05 | 1 | 50 | 1 | 50
Mozambican | 2 | 0.05 | 0 | 0 | 2 | 100
Romanian | 2 | 0.05 | 0 | 0 | 2 | 100
Mexican | 2 | 0.05 | 1 | 50 | 1 | 50
Russian | 2 | 0.05 | 1 | 50 | 1 | 50
Dutch | 1 | 0.02 | 0 | 0 | 1 | 100
English | 1 | 0.02 | 0 | 0 | 1 | 100
Lithuanian | 1 | 0.02 | 1 | 100 | 0 | 0
Turkish | 1 | 0.02 | 0 | 0 | 1 | 100
Cuban | 1 | 0.02 | 0 | 0 | 1 | 100
Colombian | 1 | 0.02 | 1 | 100 | 0 | 0

Mother's qualification
Secondary Education—12th Year of Schooling or Equivalent | 1069 | 24.16 | 300 | 28.06 | 769 | 71.94
General Course of Administration and Commerce | 1009 | 22.81 | 383 | 37.96 | 626 | 62.04
General commerce course | 953 | 21.54 | 271 | 28.44 | 682 | 71.56
Supplementary Accounting and Administration | 562 | 12.7 | 140 | 24.91 | 422 | 75.09
Higher Education—degree | 438 | 9.9 | 139 | 31.74 | 299 | 68.26
2nd cycle of the general high school course | 130 | 2.94 | 96 | 73.85 | 34 | 26.15
Higher Education—bachelor's degree | 83 | 1.88 | 20 | 24.1 | 63 | 75.9
Higher Education—master's degree | 49 | 1.11 | 8 | 16.33 | 41 | 83.67
Other—11th Year of Schooling | 42 | 0.95 | 22 | 52.38 | 20 | 47.62
Higher Education—doctorate | 21 | 0.47 | 8 | 38.1 | 13 | 61.9
Cannot read or write | 9 | 0.2 | 3 | 33.33 | 6 | 66.67
12th Year of Schooling—not completed | 8 | 0.18 | 5 | 62.5 | 3 | 37.5
Unknown | 8 | 0.18 | 4 | 50 | 4 | 50
Can read without having a 4th year of schooling | 6 | 0.14 | 2 | 33.33 | 4 | 66.67
Frequency of Higher Education | 4 | 0.09 | 3 | 75 | 1 | 25
Basic education 1st cycle (4th/5th year) or equivalent | 4 | 0.09 | 2 | 50 | 2 | 50
Basic Education 2nd Cycle (6th/7th/8th Year) or equivalent | 4 | 0.09 | 1 | 25 | 3 | 75
11th Year of Schooling—not completed | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
7th Year (Old) | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
Complementary High School Course—not concluded | 3 | 0.07 | 1 | 33.33 | 2 | 66.67
7th year of schooling | 3 | 0.07 | 1 | 33.33 | 2 | 66.67
9th Year of Schooling—not completed | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
8th year of schooling | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
2nd year complementary high school course | 2 | 0.05 | 1 | 50 | 1 | 50
10th Year of Schooling | 1 | 0.02 | 1 | 100 | 0 | 0
Basic Education 3rd Cycle (9th/10th/11th Year) or Equivalent | 1 | 0.02 | 0 | 0 | 1 | 100
Complementary High School Course | 1 | 0.02 | 0 | 0 | 1 | 100
Technical-professional course | 1 | 0.02 | 1 | 100 | 0 | 0
Technological specialization course | 1 | 0.02 | 1 | 100 | 0 | 0

Father's qualification
Basic education 1st cycle (4th/5th year) or equivalent | 1209 | 27.33 | 432 | 35.73 | 777 | 64.27
Basic Education 3rd Cycle (9th/10th/11th Year) or Equivalent | 968 | 21.88 | 264 | 27.27 | 704 | 72.73
Secondary Education—12th Year of Schooling or Equivalent | 904 | 20.43 | 281 | 31.08 | 623 | 68.92
Basic Education 2nd Cycle (6th/7th/8th Year) or equivalent | 702 | 15.87 | 167 | 23.79 | 535 | 76.21
Higher Education—degree | 282 | 6.37 | 90 | 31.91 | 192 | 68.09
Unknown | 112 | 2.53 | 81 | 72.32 | 31 | 27.68
Higher Education—bachelor's degree | 68 | 1.54 | 22 | 32.35 | 46 | 67.65
Higher Education—master's degree | 39 | 0.88 | 14 | 35.9 | 25 | 64.1
Other—11th Year of Schooling | 38 | 0.86 | 14 | 36.84 | 24 | 63.16
Technological specialization course | 20 | 0.45 | 8 | 40 | 12 | 60
Higher Education—doctorate | 18 | 0.41 | 10 | 55.56 | 8 | 44.44
7th Year (Old) | 10 | 0.23 | 4 | 40 | 6 | 60
Can read without having a 4th year of schooling | 8 | 0.18 | 5 | 62.5 | 3 | 37.5
12th Year of Schooling—not completed | 5 | 0.11 | 1 | 20 | 4 | 80
Higher education—degree (1st cycle) | 5 | 0.11 | 3 | 60 | 2 | 40
10th Year of Schooling | 4 | 0.09 | 1 | 25 | 3 | 75
Technical-professional course | 4 | 0.09 | 4 | 100 | 0 | 0
8th year of schooling | 4 | 0.09 | 1 | 25 | 3 | 75
9th Year of Schooling—not completed | 3 | 0.07 | 3 | 100 | 0 | 0
Frequency of Higher Education | 2 | 0.05 | 2 | 100 | 0 | 0
11th Year of Schooling—not completed | 2 | 0.05 | 2 | 100 | 0 | 0
7th year of schooling | 2 | 0.05 | 1 | 50 | 1 | 50
Cannot read or write | 2 | 0.05 | 2 | 100 | 0 | 0
Specialized higher studies course | 2 | 0.05 | 1 | 50 | 1 | 50
Higher Education—master's degree (2nd cycle) | 2 | 0.05 | 0 | 0 | 2 | 100
2nd year complementary high school course | 1 | 0.02 | 1 | 100 | 0 | 0
General commerce course | 1 | 0.02 | 1 | 100 | 0 | 0
Complementary High School Course | 1 | 0.02 | 1 | 100 | 0 | 0
Complementary High School Course—not concluded | 1 | 0.02 | 1 | 100 | 0 | 0
2nd cycle of the general high school course | 1 | 0.02 | 1 | 100 | 0 | 0
General Course of Administration and Commerce | 1 | 0.02 | 1 | 100 | 0 | 0
Supplementary Accounting and Administration | 1 | 0.02 | 1 | 100 | 0 | 0
Professional higher technical course | 1 | 0.02 | 0 | 0 | 1 | 100
Higher Education—doctorate (3rd cycle) | 1 | 0.02 | 1 | 100 | 0 | 0

Mother's occupation
Unskilled Workers | 1577 | 35.65 | 490 | 31.07 | 1087 | 68.93
Administrative staff | 817 | 18.47 | 248 | 30.35 | 569 | 69.65
Personal Services, Security and Safety Workers, and Sellers | 530 | 11.98 | 156 | 29.43 | 374 | 70.57
Intermediate Level Technicians and Professions | 351 | 7.93 | 95 | 27.07 | 256 | 72.93
Specialists in Intellectual and Scientific Activities | 318 | 7.19 | 102 | 32.08 | 216 | 67.92
Skilled Workers in Industry, Construction, and Craftsmen | 272 | 6.15 | 80 | 29.41 | 192 | 70.59
Student | 144 | 3.25 | 99 | 68.75 | 45 | 31.25
Representatives of the Legislative Power and Executive Bodies, Directors, and Executive Managers | 102 | 2.31 | 39 | 38.24 | 63 | 61.76
Farmers and Skilled Workers in Agriculture, Fisheries, and Forestry | 91 | 2.06 | 26 | 28.57 | 65 | 71.43
Other Situation | 70 | 1.58 | 51 | 72.86 | 19 | 27.14
Installation and Machine Operators and Assembly Workers | 36 | 0.81 | 15 | 41.67 | 21 | 58.33
Other administrative support staff | 26 | 0.59 | 0 | 0 | 26 | 100
(blank) | 17 | 0.38 | 13 | 76.47 | 4 | 23.53
Personal care workers and the like | 11 | 0.25 | 1 | 9.09 | 10 | 90.91
Health professionals | 8 | 0.18 | 0 | 0 | 8 | 100
Armed Forces Sergeants | 7 | 0.16 | 2 | 28.57 | 5 | 71.43
Specialists in finance, accounting, administrative organization, and public and commercial relations | 6 | 0.14 | 0 | 0 | 6 | 100
Data, accounting, statistical, financial services, and registry-related operators | 5 | 0.11 | 1 | 20 | 4 | 80
Personal service workers | 5 | 0.11 | 0 | 0 | 5 | 100
Armed Forces Professions | 4 | 0.09 | 1 | 25 | 3 | 75
Specialists in the physical sciences, mathematics, engineering, and related techniques | 4 | 0.09 | 1 | 25 | 3 | 75
Sellers | 4 | 0.09 | 1 | 25 | 3 | 75
Hotel, catering, trade, and other services directors | 3 | 0.07 | 0 | 0 | 3 | 100
Teachers | 3 | 0.07 | 0 | 0 | 3 | 100
Intermediate level science and engineering technicians and professions | 3 | 0.07 | 0 | 0 | 3 | 100
Armed Forces Officers | 2 | 0.05 | 0 | 0 | 2 | 100
Technicians and professionals of intermediate level of health | 2 | 0.05 | 0 | 0 | 2 | 100
Intermediate level technicians from legal, social, sports, cultural, and similar services | 2 | 0.05 | 0 | 0 | 2 | 100
Other Armed Forces personnel | 1 | 0.02 | 0 | 0 | 1 | 100
Directors of administrative and commercial services | 1 | 0.02 | 0 | 0 | 1 | 100
Information and communication technology technicians | 1 | 0.02 | 0 | 0 | 1 | 100
Office workers, secretaries in general, and data processing operators | 1 | 0.02 | 0 | 0 | 1 | 100

Father's occupation
Unskilled Workers | 1010 | 22.83 | 323 | 31.98 | 687 | 68.02
Skilled Workers in Industry, Construction, and Craftsmen | 666 | 15.05 | 184 | 27.63 | 482 | 72.37
Personal Services, Security and Safety Workers, and Sellers | 516 | 11.66 | 148 | 28.68 | 368 | 71.32
Administrative staff | 386 | 8.73 | 139 | 36.01 | 247 | 63.99
Intermediate Level Technicians and Professions | 384 | 8.68 | 114 | 29.69 | 270 | 70.31
Installation and Machine Operators and Assembly Workers | 318 | 7.19 | 94 | 29.56 | 224 | 70.44
Armed Forces Professions | 266 | 6.01 | 85 | 31.95 | 181 | 68.05
Farmers and Skilled Workers in Agriculture, Fisheries, and Forestry | 242 | 5.47 | 69 | 28.51 | 173 | 71.49
Specialists in Intellectual and Scientific Activities | 197 | 4.45 | 70 | 35.53 | 127 | 64.47
Representatives of the Legislative Power and Executive Bodies, Directors, and Executive Managers | 134 | 3.03 | 48 | 35.82 | 86 | 64.18
Student | 128 | 2.89 | 82 | 64.06 | 46 | 35.94
Other Situation | 65 | 1.47 | 46 | 70.77 | 19 | 29.23
(blank) | 19 | 0.43 | 13 | 68.42 | 6 | 31.58
Unskilled workers in extractive industry, construction, manufacturing, and transport | 15 | 0.34 | 2 | 13.33 | 13 | 86.67
Other administrative support staff | 8 | 0.18 | 0 | 0 | 8 | 100
Skilled construction workers and the like, except electricians | 8 | 0.18 | 0 | 0 | 8 | 100
Unskilled workers in agriculture, animal production, and fisheries and forestry | 6 | 0.14 | 0 | 0 | 6 | 100
Farmers, livestock keepers, fishermen, hunters and gatherers, and subsistence | 5 | 0.11 | 0 | 0 | 5 | 100
Other Armed Forces personnel | 4 | 0.09 | 1 | 25 | 3 | 75
Workers in food processing, woodworking, and clothing and other industries and crafts | 4 | 0.09 | 0 | 0 | 4 | 100
Teachers | 3 | 0.07 | 0 | 0 | 3 | 100
Information and communication technology technicians | 3 | 0.07 | 0 | 0 | 3 | 100
Sellers | 3 | 0.07 | 1 | 33.33 | 2 | 66.67
Fixed plant and machine operators | 3 | 0.07 | 0 | 0 | 3 | 100
Vehicle drivers and mobile equipment operators | 3 | 0.07 | 0 | 0 | 3 | 100
Armed Forces Sergeants | 2 | 0.05 | 0 | 0 | 2 | 100
Directors of administrative and commercial services | 2 | 0.05 | 1 | 50 | 1 | 50
Health professionals | 2 | 0.05 | 0 | 0 | 2 | 100
Personal service workers | 2 | 0.05 | 0 | 0 | 2 | 100
Skilled workers in metallurgy, metalworking, and similar | 2 | 0.05 | 0 | 0 | 2 | 100
Assembly workers | 2 | 0.05 | 0 | 0 | 2 | 100
Meal preparation assistants | 2 | 0.05 | 1 | 50 | 1 | 50
Armed Forces Officers | 1 | 0.02 | 0 | 0 | 1 | 100
Hotel, catering, trade, and other services directors | 1 | 0.02 | 0 | 0 | 1 | 100
Specialists in the physical sciences, mathematics, engineering, and related techniques | 1 | 0.02 | 0 | 0 | 1 | 100
58
Specialists in finance, accounting, administrative
organization,and public and commercial
relations
Intermediate level science and engineering
technicians and professions
Technicians and professionals of intermediate
level of health
Intermediate level technicians from legal, social,
sports, cultural,and similar services
Office workers, secretaries in general,and data
processing operators
Data, accounting, statistical, financial services,
andregistry-related operators
Personal care workers and the like
Distribution of students by demographic and financial indicators (N = 4,424):

Variable                    Category    N     %      Dropout n (%)    Graduate n (%)
Displaced                   Yes        2426   54.84    669 (27.58)    1757 (72.42)
Displaced                   No         1998   45.16    752 (37.64)    1246 (62.36)
Educational special needs   No         4373   98.85   1404 (32.11)    2969 (67.89)
Educational special needs   Yes          51    1.15     17 (33.33)      34 (66.67)
Debtor                      No         3921   88.63   1109 (28.28)    2812 (71.72)
Debtor                      Yes         503   11.37    312 (62.03)     191 (37.97)
Tuition fees up to date     Yes        3896   88.07    964 (24.74)    2932 (75.26)
Tuition fees up to date     No          528   11.93    457 (86.55)      71 (13.45)
Gender                      Female     2868   64.83    720 (25.10)    2148 (74.90)
Gender                      Male       1556   35.17    701 (45.05)     855 (54.95)
Scholarship holder          No         3325   75.16   1287 (38.71)    2038 (61.29)
Scholarship holder          Yes        1099   24.84    134 (12.19)     965 (87.81)
International               No         4314   97.51   1389 (32.20)    2925 (67.80)
International               Yes         110    2.49     32 (29.09)      78 (70.91)
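The percentages in a descriptive table of this kind follow directly from the category counts: each category's share of the sample, and within each category the share of dropouts and graduates. As a minimal sketch (the `records` data and the `category_table` helper below are hypothetical illustrations, not the thesis data), the computation needs only the standard library:

```python
from collections import Counter

# Hypothetical (category, outcome) records; the thesis uses the
# institutional dataset, so these values are illustrative only.
records = [
    ("Yes", "Graduate"), ("Yes", "Graduate"), ("Yes", "Dropout"),
    ("No", "Dropout"), ("No", "Graduate"), ("No", "Dropout"),
]

def category_table(rows):
    """Return {category: (n, pct_of_total, dropout_pct, graduate_pct)}."""
    total = len(rows)
    by_cat = Counter(cat for cat, _ in rows)
    table = {}
    for cat, n in by_cat.items():
        # Count dropouts in this category; graduates are the remainder.
        drop = sum(1 for c, outcome in rows if c == cat and outcome == "Dropout")
        table[cat] = (
            n,
            round(100 * n / total, 2),        # share of the whole sample
            round(100 * drop / n, 2),         # dropout rate within category
            round(100 * (n - drop) / n, 2),   # graduation rate within category
        )
    return table

print(category_table(records))
# {'Yes': (3, 50.0, 33.33, 66.67), 'No': (3, 50.0, 66.67, 33.33)}
```

Applied to a real column such as "Scholarship holder", the same computation yields the count, sample percentage, and dropout/graduate percentages reported in the table.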
THESIS
ORIGINALITY REPORT

SIMILARITY INDEX: 15%
INTERNET SOURCES: 12%
PUBLICATIONS: 9%
STUDENT PAPERS: 8%

PRIMARY SOURCES

1. www.mdpi.com (Internet Source): 1%
2. Submitted to University of Rwanda (Student Paper): <1%
3. wiredspace.wits.ac.za (Internet Source): <1%
4. dspace.nm-aist.ac.tz (Internet Source): <1%
5. www.ijisae.org (Internet Source): <1%
6. Submitted to The African Institute for Mathematical Sciences (Student Paper): <1%
7. Submitted to Coventry University (Student Paper): <1%
8. dr.ur.ac.rw (Internet Source): <1%
9. Submitted to Aston University (Student Paper): <1%
10. ebin.pub (Internet Source): <1%
11. Submitted to Asia Pacific University College of Technology and Innovation (UCTI) (Student Paper): <1%
12. Hemant Kumar Soni, Sanjiv Sharma, G. R. Sinha. "Text and Social Media Analytics for Fake News and Hate Speech Detection", CRC Press, 2024 (Publication): <1%
13. Submitted to University of Strathclyde (Student Paper): <1%
14. Submitted to Addis Ababa University (Student Paper): <1%
15. academic-accelerator.com (Internet Source): <1%
16. www.frontiersin.org (Internet Source): <1%
17. dione.lib.unipi.gr (Internet Source): <1%
18. etd.repository.ugm.ac.id (Internet Source): <1%
19. Submitted to University of Northampton (Student Paper): <1%
20. "Deep Learning and Visual Artificial Intelligence", Springer Science and Business Media LLC, 2024 (Publication): <1%
21. afribary.com (Internet Source): <1%
22. "Proceedings of Eighth International Congress on Information and Communication Technology", Springer Science and Business Media LLC, 2024 (Publication): <1%
23. pdfs.semanticscholar.org (Internet Source): <1%
24. Submitted to Jawaharlal Nehru Technological University (Student Paper): <1%
25. etd.hu.edu.et (Internet Source): <1%
26. "Recent Advances on Soft Computing and Data Mining", Springer Science and Business Media LLC, 2024 (Publication): <1%
27. irbackend.kiu.ac.ug (Internet Source): <1%
28. Submitted to Uganda Christian University (Student Paper): <1%
29. Submitted to UCL (Student Paper): <1%
30. dspace.cbe.ac.tz:8080 (Internet Source): <1%
31. Submitted to Queen Margaret University College, Edinburgh (Student Paper): <1%
32. Qurban A. Memon, Shakeel Ahmed Khoja. "Data Science - Theory, Analysis, and Applications", CRC Press, 2019 (Publication): <1%
33. ijern.com (Internet Source): <1%
34. xml.jips-k.org (Internet Source): <1%
35. Submitted to University of Technology, Sydney (Student Paper): <1%
36. listens.online (Internet Source): <1%
37. scholar.mzumbe.ac.tz (Internet Source): <1%
38. www.scielo.br (Internet Source): <1%
39. Badrul H. Khan, Joseph Rene Corbeil, Maria Elena Corbeil. "Responsible Analytics and Data Mining in Education - Global Perspectives on Quality, Support, and Decision Making", Routledge, 2018 (Publication): <1%
40. dokumen.pub (Internet Source): <1%
41. dspace.lib.uom.gr (Internet Source): <1%
42. erepository.uonbi.ac.ke:8080 (Internet Source): <1%
43. github.com (Internet Source): <1%
44. journaleet.in (Internet Source): <1%
45. ulspace.ul.ac.za (Internet Source): <1%
46. "Breaking Barriers with Generative Intelligence. Using GI to Improve Human Education and Well-Being", Springer Science and Business Media LLC, 2024 (Publication): <1%
47. "Recent Trends in Image Processing and Pattern Recognition", Springer Science and Business Media LLC, 2024 (Publication): <1%
48. Safira Begum, M. V. Ashok. "A novel approach to mitigate academic underachievement in higher education: Feature selection, classifier performance, and interpretability in predicting student performance", International Journal of Advanced and Applied Sciences, 2024 (Publication): <1%
49. Sheikh Wakie Masood, Munmi Gogoi, Shahin Ara Begum. "Optimised SMOTE-based Imbalanced Learning for Student Dropout Prediction", Arabian Journal for Science and Engineering, 2024 (Publication): <1%
50. management.uta.edu (Internet Source): <1%
51. "Artificial Intelligence and Knowledge Processing", Springer Science and Business Media LLC, 2024 (Publication): <1%
52. phd-dissertations.unizik.edu.ng (Internet Source): <1%
53. Catherine Régis, Jean-Louis Denis, Maria Luciana Axente, Atsuo Kishimoto. "Human-Centered AI - A Multidisciplinary Perspective for Policy-Makers, Auditors, and Users", CRC Press, 2024 (Publication): <1%
54. Submitted to Glyndwr University (Student Paper): <1%
55. Matti Vaarma, Hongxiu Li. "Predicting student dropouts with machine learning: An empirical study in Finnish higher education", Technology in Society, 2024 (Publication): <1%
56. Submitted to University of Wales Institute, Cardiff (Student Paper): <1%
57. Submitted to University of Witwatersrand (Student Paper): <1%
58. hdl.handle.net (Internet Source): <1%
59. Submitted to msu (Student Paper): <1%
60. www.aimspress.com (Internet Source): <1%
61. www.ijeast.com (Internet Source): <1%
62. Submitted to Intercollege (Student Paper): <1%
63. ir-library.ku.ac.ke (Internet Source): <1%
64. ojs.ais.cn (Internet Source): <1%
65. rc.library.uta.edu (Internet Source): <1%
66. uir.unisa.ac.za (Internet Source): <1%
67. www.grossarchive.com (Internet Source): <1%
68. Submitted to Grenoble Ecole Management (Student Paper): <1%
69. Mukhtar Abdi Hassan, Abdisalam Hassan Muse, Saralees Nadarajah. "Predicting Student Dropout Rates Using Supervised Machine Learning: Insights from the 2022 National Education Accessibility Survey in Somaliland", Applied Sciences, 2024 (Publication): <1%
70. Poonam Tanwar, Tapas Kumar, K. Kalaiselvi, Haider Raza, Seema Rawat. "Predictive Data Modelling for Biomedical Data and", River Publishers, 2024 (Publication): <1%
71. Submitted to Technological University Dublin (Student Paper): <1%
72. Submitted to Trine University (Student Paper): <1%
73. Submitted to University of Reading (Student Paper): <1%
74. repository.out.ac.tz (Internet Source): <1%
75. www.escholar.manchester.ac.uk (Internet Source): <1%
76. www.researchgate.net (Internet Source): <1%
77. www.tara.tcd.ie (Internet Source): <1%
78. www2.mdpi.com (Internet Source): <1%
79. Adwitiya Sinha, Megha Rathi. "Smart Healthcare Systems", CRC Press, 2019 (Publication): <1%
80. Submitted to University of Birmingham (Student Paper): <1%
81. etd.uwc.ac.za (Internet Source): <1%
82. export.arxiv.org (Internet Source): <1%
83. fastercapital.com (Internet Source): <1%
84. iieta.org (Internet Source): <1%
85. www.ijraset.com (Internet Source): <1%
86. www.springerprofessional.de (Internet Source): <1%
87. Kok-Kwang Phoon, Takayuki Shuku, Jianye Ching. "Uncertainty, Modeling, and Decision Making in Geotechnics", CRC Press, 2023 (Publication): <1%
88. Thomas Mgonja, Francisco Robles. "Identifying Critical Factors When Predicting Remedial Mathematics Completion Rates", Journal of College Student Retention: Research, Theory & Practice, 2022 (Publication): <1%
89. Ton Duc Thang University (Publication): <1%
90. arxiv.org (Internet Source): <1%
91. crm-en.ics.org.ru (Internet Source): <1%
92. dissertations.mak.ac.ug (Internet Source): <1%
93. edoc.ub.uni-muenchen.de (Internet Source): <1%
94. ir.knust.edu.gh (Internet Source): <1%
95. www.conftool.com (Internet Source): <1%

Exclude quotes: On
Exclude bibliography: On
Exclude matches: < 10 words