


African Centre of Excellence in Data Science
College of Business and Economics, University of Rwanda

Predictive Model for Early Detection of Higher Education Dropout Using Machine Learning Techniques

By NGENZI Mali Dioscord
Registration Number: 222022311

A dissertation submitted in partial fulfilment of the requirements for the degree of Master of Science in Data Science majoring in Data Mining at the African Centre of Excellence in Data Science, College of Business and Economics, University of Rwanda.

Supervisor: Prof. Joseph Nzabanita
September, 2024

DECLARATION

I declare that this dissertation entitled "Predictive Model for Early Detection of Higher Education Dropout Using Machine Learning Techniques" is the result of my own work and has not been submitted for any other degree at the University of Rwanda or any other institution.

Names: NGENZI Mali Dioscord
Signature:

APPROVAL SHEET

This dissertation entitled "Predictive Model for Early Detection of Higher Education Dropout Using Machine Learning Techniques", written and submitted by NGENZI Mali Dioscord in partial fulfilment of the requirements for the degree of Master of Science in Data Science majoring in Data Mining, is hereby accepted and approved. The rate of plagiarism tested using Turnitin is 15%, which is below the 20% threshold accepted by the African Centre of Excellence in Data Science (ACEDS).

Prof. Joseph Nzabanita, Supervisor
Dr. Kabano H. Ignace, Head of Training

DEDICATION

I dedicate this work to my family, especially my mother and sisters, for their support, advice, and encouragement, which greatly contributed to the completion of my studies and research process.

ACKNOWLEDGMENTS

I sincerely express my deepest gratitude to the Holy and Almighty God for His constant presence and abundant blessings in my life. His guidance, love, and grace have provided me with the strength and ability to complete this journey successfully.
I would like to express my heartfelt appreciation to my supervisor, Prof. Joseph Nzabanita, for his invaluable guidance and unwavering support throughout this research. His expertise, kindness, and encouragement have been instrumental in my growth and success. The knowledge and insights gained under his supervision have greatly enriched my academic and professional journey. I would like to offer my most sincere appreciation to the University of Rwanda, specifically to the African Centre of Excellence in Data Science, for providing an exceptional learning environment and invaluable support throughout my academic journey. The experiences and opportunities provided have significantly contributed to my growth and development. I also want to acknowledge the memory of my father, whose inspiring presence continues to guide me, even though he passed away on September 15, 2011. His legacy of love and inspiration has had a profound impact on my life and achievements. Finally, I extend my heartfelt thanks to my family, including my mother, sisters, and extended family, for their unwavering support and encouragement. God bless you all!

ABSTRACT

Predicting student dropout is vital for improving retention policies in any higher education institution. This study applied different supervised machine learning techniques to higher education dropout prediction and identified the key influential factors using secondary data derived from the UCI Machine Learning Repository. The dataset comprises 4,424 records with 35 attributes representing various demographic, academic, and socio-economic factors related to student retention. The study identified several key determinants that influence dropout, namely academic performance, tuition fee status, age, unemployment rate, and economic indicators such as GDP and inflation rate.
The study also applied and evaluated several machine learning algorithms, including decision tree, random forest, logistic regression, gradient boosting machine, XGBoost, and support vector machine, using key performance metrics such as AUC-ROC, precision, accuracy, recall, and F1-score. The comparative analysis showed that XGBoost and Random Forest outperformed the other models, both achieving an accuracy of 90%, with AUC-ROC values of 95.30% and 95.21% respectively. They also achieved a very high precision of 92%, recall of 86%, and F1-score of 89%, confirming that these models are exceptionally good at correctly identifying students who will drop out. The next best-performing models were the Gradient Boosting Machine and the Support Vector Machine, both with an accuracy of 88% and AUC-ROC values of 95% and 94% respectively. The Logistic Regression and Decision Tree models also performed well, with accuracies of 87% and 86% and AUC-ROC values of 94% and 92% respectively, but were less effective than ensemble methods such as Random Forest and XGBoost. The study highlights the value of data-driven approaches in understanding dropout dynamics and underscores the importance of targeted interventions based on identified key determinants. The research recommends the implementation of the Random Forest and XGBoost models by educational stakeholders for proactive intervention and informed decision-making. Future research can expand this work by incorporating larger, more diverse datasets and exploring more advanced ensemble techniques to further enhance predictive accuracy and robustness.

Keywords: Student dropout, Higher Education, Predictive model, Machine learning

TABLE OF CONTENTS

DECLARATION
APPROVAL SHEET
DEDICATION
ACKNOWLEDGMENTS
ABSTRACT
LIST OF ABBREVIATIONS AND ACRONYMS
LISTS OF FIGURES
CHAPTER ONE: INTRODUCTION
1.1 The background of the study
1.2 Research problem
1.3 Research objectives
1.3.1 General objective
1.3.2 Specific objectives
1.4 Research questions
1.5 Scope of the study
1.6 Significance of the study
1.7 Research Structure
CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
2.2 Definition of key terms
2.2.1 Student dropout
2.2.2 Predictive Model
2.2.3 Machine learning
2.3 Factors Influencing Student Dropout
2.3.1 Academic factors
2.3.2 Socioeconomic factors
2.3.3 Demographic factors
2.4 Application of Machine Learning in Education
2.5 Application of Machine learning for dropout prediction
2.6 Evaluating Predictive Models of Student Dropout
CHAPTER THREE: METHODOLOGY OF RESEARCH
3.1 Introduction
3.2 Study design
3.3 Source of data
3.4 Data Description
3.5 Study variables
3.6 Data preprocessing
3.7 Handling Data Imbalance
3.8 Model Development
3.8.1 Logistic regression
3.8.2 Decision Tree
3.8.3 Random Forest
3.8.4 Support Vector Machines
3.8.5 XGBoost
3.8.6 Gradient Boosting Machine
3.9 Model Evaluation
3.9.1 Confusion Matrix
3.9.2 Accuracy
3.9.3 Precision
3.9.4 Recall
3.9.5 F1 Score
3.9.6 Area Under Curve (AUC)
3.10 Feature Importance Analysis
CHAPTER 4: DATA ANALYSIS AND PRESENTATION OF RESULTS
4.1 Introduction
4.2 Data quality check
4.3 Descriptive View of Dropout Rate for Demographic, Socioeconomic, and Academic Variables
4.4 Identification of important predictors of Dropout risk in Higher Education
4.5 Machine learning model results
4.5.1 Logistic regression
4.5.2 Support Vector Machine (SVM)
4.5.3 Decision Tree
4.5.4 Random Forest
4.5.5 Gradient Boosting Machine
4.5.6 XGBoost
4.5.7 Comparative analysis of Model performances
CHAPTER 5. DISCUSSION OF THE RESULTS
5.1 Introduction
5.2 Key Findings Discussion
CHAPTER 6. CONCLUSION AND RECOMMENDATIONS
6.1 Introduction
6.2 Conclusion
6.3 Recommendations
REFERENCES

LIST OF ABBREVIATIONS AND ACRONYMS

ACE-DS: African Centre of Excellence in Data Science
ML: Machine Learning
SMOTE: Synthetic Minority Oversampling Technique
ROC: Receiver Operating Characteristic
AUC: Area Under Curve
CGPA: Cumulative Grade Point Average
XGBoost: Extreme Gradient Boosting
RF: Random Forest
UCI: University of California, Irvine
GDP: Gross Domestic Product
DT: Decision Tree
SVM: Support Vector Machine
TPR: True Positive Rate
FPR: False Positive Rate
TP: True Positive
TN: True Negative
SHAP: SHapley Additive exPlanations
TOT: Total
STD: Standard Deviation
LR: Logistic Regression

LISTS OF FIGURES

Figure 1. Decision Tree Illustration
Figure 2. Random Forest Algorithm (GeeksforGeeks, 2024)
Figure 3. Confusion Matrix illustration
Figure 4. Bar plot showing the imbalanced data and the data balance
Figure 5. Top 10 Feature Importance
Figure 6. ROC Curves for six ML models
Figure 7. Confusion matrices for six employed ML classifiers

LIST OF TABLES

Table 1. Study variables
Table 2. Distribution of categorical variables
Table 3. Distribution of Numerical variables
Table 4. Feature importance results for all variables
Table 5. Model evaluation metrics for six classifiers

CHAPTER ONE: INTRODUCTION

1.1 The background of the study

Student dropout is among the major challenges facing higher education institutions in most countries (Nurmalitasari et al., 2023). This issue has attracted considerable interest from the scholarly community, the government, and social stakeholders due to the diverse impacts it has on students, their families, the educational institution itself, and the state (Guzmán et al., 2021; Nurmalitasari et al., 2023). Dropout places financial burdens on students and their families, reduces workforce productivity, deepens social inequality, and damages institutional reputation (Villegas-Ch et al., 2023). Developing and implementing dropout prevention strategies is therefore very important for the success and sustainability of higher education systems worldwide (Kim & Kim, 2018). Dropout rates in higher education vary significantly across regions and countries.
In the United States, 33% of undergraduate students fail to finish their degree program (Andrea, 2024). OECD (2009) data show that countries with strong support structures and efficient policies, such as Belgium, Denmark, France, Germany, and Japan, have generally kept their dropout rates low, typically below 24%, compared to other countries. While exact percentages may vary depending on the year and the availability of more current statistics, this problem remains a critical issue globally and continues to demand serious attention and remedy. Furthermore, trends in higher education in South Africa show that 50% of students who enroll in institutions of higher learning drop out within the first three years, while approximately 30% drop out in their first year (Letseka & Breier, 2008). Consequently, high dropout rates in higher education mean a great loss of resources, including taxpayer money, resulting in fewer graduates and thereby reducing the availability of highly skilled labor. This is a critical situation in many countries, with implications not only for financial efficiency but also for the quality and evaluation of higher education institutions (Paura & Arhipova, 2014). Several studies have highlighted that early identification of at-risk students is very important for improving academic outcomes, student retention, and dropout reduction (Alyahyan & Düştegör, 2020; Nimy et al., 2023; Osborne & Lang, 2023). Recent developments in the field of data science and ML provide several possible ways to address this challenge using predictive modeling techniques (Baker & Siemens, 2014). Numerous studies in developed countries have produced data-driven predictive methods using machine learning techniques and demonstrated accuracy in forecasting school dropout. Unlike traditional techniques, machine learning therefore offers significant promise.
In addition, these algorithms can analyze large datasets, uncovering complex relationships between student characteristics, academic performance, and socioeconomic factors that might contribute to dropout (Romero et al., 2010). Researchers have also explored several ML techniques to identify students likely to drop out of school, including decision tree algorithms, neural networks, Naive Bayes classifiers, instance-based learning algorithms, Random Forest, Logistic Regression, and support vector machines (Kotsiantis et al., 2003). A study conducted in Peru reinforces the significance of financial and contextual variables in predicting university dropout rates in developing countries. Key factors identified include age, term, and the student's financing method, which align with findings from other regions that emphasize the role of socioeconomic status and financial support in educational persistence (Jiménez et al., 2023). The impact of machine learning approaches on predicting student dropout in higher education settings has been the subject of numerous studies. Three elements have been identified as the main drivers of success in such tasks: identifying relevant features that may influence dropout; choosing a proper algorithm for developing a prediction model; and selecting the evaluation metrics used to estimate a model's performance. Addressing these elements is critical to improving dropout prediction accuracy and, consequently, to developing effective early intervention strategies (Oqaidi et al., 2022). While most studies on dropout prevention have focused on developed countries, the findings and methodologies, particularly those using machine learning techniques, can provide valuable insights for developing countries. Leveraging predictive models in developing nations can enable early identification of at-risk students, allowing for more targeted and cost-effective interventions.
By adapting the variables and models used in developed countries, educational institutions in developing contexts could mitigate dropout rates, even in the face of resource limitations. Moreover, the key determinants of dropout, such as socioeconomic status, academic performance, and student engagement, are universal, although their impact may be more pronounced in developing countries where inequalities are more prominent. Thus, the lessons learned from studies in developed countries can be customized to the specific socio-economic conditions of developing countries, making the implementation of data-driven dropout prevention strategies more effective. This study contributes to the growing body of research by showing how machine learning models, applied to datasets from developed countries, can be adjusted to improve dropout predictions and interventions in developing countries.

1.2 Research problem

One of the biggest problems facing higher education institutions is student dropout. It affects the quality of education provided and the long-term sustainability of institutions (Tinto, 2012). Being able to accurately predict which students are more likely to leave school would be very helpful in developing effective early interventions to ultimately improve overall educational outcomes (Dake & Buabeng-Andoh, 2022). While some previous works have studied various predictors of student dropout and adopted different statistical methods, there is a serious lack of comparative studies on the application of machine learning algorithms for early dropout prediction, particularly during the first year of study (Niyogisubizo et al., 2022). To address the problem of dropout in the context of existing machine learning solutions, this study focuses on the application of advanced algorithms such as Random Forest, Logistic Regression, Gradient Boosting Machine, Decision Tree, XGBoost, and Support Vector Machines.
These methods have demonstrated effectiveness in analyzing large, diverse datasets and predicting dropout more accurately than traditional statistical approaches (Andrade-Girón et al., 2023). By leveraging these machine learning techniques, this research seeks to provide more reliable early identification of at-risk students, enabling institutions to implement timely and targeted interventions. Traditional methods of identifying at-risk students rely on manual work and limited analysis of the available data, resulting in delayed and less effective interventions (Thayer, 2000). Such methods typically base their identification on past data, like academic performance and students' attendance, but fail to unearth other risk factors (Marbouti et al., 2016). In addition, by utilizing machine learning techniques to analyze extensive datasets, we can uncover patterns that conventional methods may have missed, leading to very promising solutions (Jordan & Mitchell, 2015; Lykourentzou et al., 2009). This study aims to enhance the early detection of at-risk students in higher education by developing and comparing various machine learning models. By focusing on the first year of study, the research seeks to facilitate timely and personalized interventions to reduce dropout rates, ultimately improving educational outcomes and institutional sustainability.

1.3 Research objectives

1.3.1 General objective

The main objective of this study is to determine the most effective machine learning predictive model for early dropout prediction in higher education.

1.3.2 Specific objectives

i. To identify the key determinants influencing higher education dropout.
ii. To determine the best machine learning model that accurately predicts higher education dropout using supervised machine learning techniques.
iii. To assess the performance of the selected machine learning model in predicting higher education dropout using key evaluation metrics, ensuring its effectiveness and reliability in practical applications.

1.4 Research questions

i. What are the key factors that significantly influence the likelihood of higher education dropout as identified by machine learning techniques?
ii. Which supervised machine learning model demonstrates the highest accuracy in predicting higher education dropout?
iii. How effective is the selected machine learning model in predicting higher education dropout based on key evaluation metrics?

1.5 Scope of the study

This study focuses on applying ML techniques to predict student dropout in higher education and on providing insights and strategies that learning institutions can use to help students at risk and implement timely interventions. The dataset used contains socioeconomic and academic factors relevant to dropout prediction. The study concerns only higher education institutions. Furthermore, it limits the data to specific academic and socio-economic variables over a specific time period, and it narrows its focus to predicting dropout only, not other educational outcomes such as enrollment or graduation. The study acknowledges potential limitations, namely the generalization of findings across different educational contexts and the representativeness of the synthetic samples generated by SMOTE, both of which can affect the effectiveness of the predictive model. The study addresses these constraints by being extremely careful about model performance and by ensuring that the synthetic data are representative of real-world scenarios. It also discusses how such limitations affect the generalizability of findings to different higher education contexts.
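The SMOTE oversampling mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of the core idea (creating synthetic minority-class records by interpolating between a record and one of its nearest neighbours), not the implementation used in this study; the toy feature vectors below are invented purely for demonstration.

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=42):
    """Generate synthetic minority-class points by interpolating between
    a randomly chosen record and one of its k nearest neighbours.
    A bare-bones sketch of the SMOTE idea, for illustration only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

# Hypothetical 2-D feature vectors for the minority ("dropout") class
minority = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.2), (1.5, 2.5)]
new_points = smote_sample(minority)
print(len(new_points), "synthetic records generated")
```

Because each synthetic point lies on the line segment between two real minority records, the new records stay inside the region already occupied by the minority class, which is why SMOTE is preferred over simply duplicating records.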
1.6 Significance of the study

Applying ML techniques to student dropout prediction gives higher education institutions a better way to strengthen retention strategies. The research will help institutions identify at-risk students and offer them assistance on time, hence improving student retention rates. Furthermore, it will assist institutions in effectively utilizing resources and designing focused intervention programs that address specific needs. Such predictions will also help educators understand some of the causes of student dropout, assisting them in modifying their teaching strategies and devising support systems that are most suitable for their students. Educators can use this information to design their engagement methods, spot students in need of extra help, or adjust instructional strategies to best foster overall student performance and satisfaction. Moreover, economic and social advancements are among the many benefits of lowering the national dropout rate. With more graduates, the nation will improve its workforce's skill level and productivity. Additionally, this will reduce social inequality, since more students from various backgrounds can complete their education and seek better career prospects. With evidence-based strategies, the findings can help support national educational goals by improving student outcomes and institutional effectiveness.

1.7 Research Structure

This study consists of six chapters. The first chapter is the introduction: it discusses background information about the study, states the problem, outlines the research objectives, introduces the research questions, and justifies the importance of the study. The second chapter encompasses a literature review that focuses on previous research on predicting students at risk of dropping out.
It also includes a theoretical review, highlights a critical research gap, and presents a conceptual framework. The third chapter describes the methods used in the study: sample selection, study design, the source of data, data collection procedures, model specification, the variables selected for analysis to achieve the objectives of the research, the predictive models, and the evaluation of their performance. The fourth chapter presents and analyses the results of the machine learning models, the fifth chapter discusses the study findings, and the last chapter provides the conclusion of the study and some recommendations.

CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
The early detection of higher education dropout is critical for implementing timely interventions. This literature review explores the definition of key terms, the key factors influencing dropout rates, the application of machine learning techniques for dropout prediction, and the evaluation of predictive models for accuracy and reliability.

2.2 Definition of key terms
2.2.1 Student dropout
Student dropout refers to a student who terminates their education before completing the academic program in which they are enrolled. In the university context, dropout refers to the act of a student discontinuing their education while still officially enrolled in a higher education institution (SYDLE, 2024). Dropout rates are a crucial measure of how well an educational system is doing and have important consequences for students' future socioeconomic status, including their job prospects, prospective income, and overall well-being. The causes of dropout are complex and involve various aspects such as socioeconomic status, academic performance, and school engagement (R. Rumberger, 2011). Dropout also has financial and societal impacts on both the dropouts themselves and the country in general (R. Rumberger, 2020).
2.2.2 Predictive Model
Traditionally, data have been analyzed by humans; however, human capacity is limited, which makes large-scale data analysis difficult. As a result, automated systems have been developed that gain knowledge from data and its fluctuations and adjust to an ever-changing data environment. Machine learning employs statistical algorithms to acquire knowledge from data samples and involves the development of statistical programs referred to as models (Lamba & Madhusudhan, 2022). Predictive modeling, as defined by Gartner, is a widely employed statistical technique for predicting future behavior. Its solutions employ data-mining technologies to examine both current and past data, enabling the creation of a model that predicts future outcomes (Mishra et al., 2023).

2.2.3 Machine learning
Machine learning is the part of artificial intelligence dealing with the creation of models that can conduct a predefined task on data, making predictions or decisions by learning from data without explicit programming. In traditional programming, humans write rules and instructions; in ML, systems automatically identify and analyze patterns in data and make predictions. These systems adjust their behavior after making predictions, depending on how accurate those predictions turn out to be. As they process more data, they improve their results over time without human involvement (Saeed et al., 2024). Machine learning uses various algorithms to develop, describe, and predict outcomes iteratively from data. As algorithms absorb training data, more accurate models can be produced (Kirsch & Hurwitz, 2018).

2.3 Factors Influencing Student Dropout
Factors that affect higher education dropout need to be understood so that appropriate interventions and policies can be developed to improve student retention.
Dropout rates not only reflect the problems of individual students; they also affect institutional reputation and national educational outcomes. This review identifies and discusses the various factors contributing to student dropout in higher education on the basis of existing literature and research findings.

2.3.1 Academic factors
Student retention is highly correlated with academic performance. Students with poor academic performance or grades are much more likely to drop out of school. Increased academic difficulties, such as rigorous coursework and a high course load, can lead to higher dropout rates because they push students beyond their limits (Stinebrickner & Stinebrickner, 2014). A study by Nurmalitasari et al. (2023) identifies factors such as academic performance, including CGPA, and interest in the study program as crucial determinants of student success. In this context, academic ability as measured by CGPA is very important: students with low academic ability frequently fail to follow lessons or are unable to complete their thesis. As a result, a low level of academic performance and success is associated with a high probability of abandoning studies (Araque et al., 2009).

2.3.2 Socioeconomic factors
Socioeconomic status is an important factor influencing students to drop out of higher education. Inadequate finances to cover tuition fees and other related expenses are the primary reasons why students drop out of school prematurely (Powdthavee & Vignoles, 2008). Furthermore, students who work long hours to support themselves might find it difficult to balance work and study, which increases the dropout rate (Magolda & Astin, 1993). Similarly, Callender (1999) indicates that part-time work undertaken to fund education mostly impacts academic performance negatively, since students spend many hours working and have little time available for studying and other academic activities.
Other socio-economic disadvantages further increase stress and reduce the time available for academic activities, thereby increasing the risk of dropout. The financial status of a family significantly influences school dropout rates. Students from low-income families often face challenges that impede their ability to continue their education, such as the need to work to support their families or the inability to afford school-related expenses. Financial problems increase the likelihood of students from low-income families dropping out (R. W. Rumberger & Lim, 2008). Socioeconomic status, including income, education, and financial security, significantly impacts life outcomes and academic achievement (Morgan et al., 2009). Aina et al. (2022) show that students whose families have low financial status are more likely to leave school prematurely than those with the highest family income, because they spend much of their time contributing to the family income. Family background and parental education also have a significant impact on dropout rates. In most cases, students from families with both parents alive have lower dropout rates and are more likely to finish their studies than those without both parents. Other family issues, such as illness, deaths, adults entering and leaving the household, and marital disruptions, contribute further to the dropout rate (R. W. Rumberger & Lim, 2008). The educational level of the parents also has a significant impact on dropout prediction. Research shows that parents with a higher educational level are more involved in their children's educational achievements, spending time with them and supporting them; as a result, their children are more likely to stay in school until they complete their studies. Additionally, parents significantly influence students' performance by instilling values, aspirations, and the motivation necessary for success and continuous school attendance (Smelser & Baltes, 2001).
2.3.3 Demographic factors
Demographic variables, specifically age and gender, are significant contributors to dropout in higher education. Older students commonly have more difficulty balancing academic workload and personal life than their younger peers, since they are more likely to work full-time and to have families competing for their time and energy (Casanova et al., 2023). Moreover, older students might find it harder to re-enter academic life after a period away from formal education, which in turn leads them to drop out (Bean & Metzner, 1985). A study conducted at the Polytechnic Institute of Portalegre in Portugal during the 2018-2019 academic year shows that male students leave school prematurely more often than female students. Female students, especially in STEM, Design, and Multimedia courses, exhibit lower dropout rates. In addition, factors such as marital status and debt also correlate with increased dropout rates, highlighting the complex interplay of socioeconomic and demographic factors in student retention (Lugyi, 2024).

2.4 Application of Machine Learning in Education
The application of ML in education transforms traditional teaching methodologies by facilitating personalized learning experiences, from real-time feedback on every student's behavior to individual factors. This technology also changes assessment for the better: it not only eliminates biases but also uncovers hidden insights that enable better learning outcomes and more effective teaching strategies (Jagwani & Aloysius, 2019). Machine learning also improves educational institutions by transforming the learning process and providing a tool to monitor students' achievement and engagement, making education more equitable. For many years, machine learning was a specialized area of artificial intelligence, explored mainly within academic and research circles.
However, it has since developed into a powerful tool with wide-ranging applications across diverse fields, including education (Jordan & Mitchell, 2015).

2.5 Application of Machine learning for dropout prediction
In education, machine learning has been used in dropout prediction and prevention for several decades, driven by its potential to improve educational outcomes and bring effective support to at-risk students (Larusson & White, 2014). Machine learning effectively predicts student learning outcomes, enabling early intervention to reduce dropout rates. Such systems use different algorithms, assessed in terms of accuracy, sensitivity, and specificity, to predict student performance early in the academic journey. Their most important tasks include automatic exam score collection and data analysis to identify unobservable variables such as prior knowledge, talent, and diligence. These data are used to build predictive models of learning outcomes, making it easy to spot any student who might be falling behind or otherwise at risk of dropping out, so that support can be targeted specifically at this group (Asthana & Hazela, 2020). Machine learning personalizes the learning experience by adapting educational content to meet the needs of each student. Such personalization ensures that every student faces the right amount of challenge and support, therefore staying on the right path and reducing the risk of dropping out (Shaun et al., 2014). Additionally, machine learning-based early warning systems can detect students who are at risk of dropping out based on variables such as attendance, grades, and behavior, and can prompt appropriate measures (Bowers et al., 2013). Large amounts of unprocessed raw data can be analyzed using advanced methods to derive insights helpful in predicting student performance.
The difficulties and learning trends of a student can be identified with the aid of machine learning models, highlighting the areas in need of improvement. This enables the elaboration of personalized strategies to improve student results. Moreover, educators can use such models to understand the level of comprehension achieved by students and then adjust their teaching to meet the needs of different learners (Kharb & Singh, 2021). Several studies in educational data mining have utilized varied machine learning approaches to predict student dropout status, including Support Vector Machines, Naive Bayes, association rule mining, logistic regression, artificial neural networks, and decision trees (Kumar et al., 2017). These algorithms have been successfully applied in different learning contexts to optimize the process and outputs of learning. Baker & Siemens (2014) describe these fields of study, which use algorithms such as decision trees, neural networks, and clustering methods to investigate the large amounts of data created in educational settings. The results showed that decision trees make the process of identifying student performance and behavioral patterns easier by pointing out at-risk students. Neural networks model complex patterns to predict student success and customize learning experiences. The grouping that clustering methods provide supports the development of targeted interventions. Generally, these techniques aim to deepen understanding of the learning process, thereby enhancing its efficacy and ultimately supporting improved educational outcomes (Baker & Inventado, 2014). Romero et al. (2010) further describe the influence of data mining and machine learning in the educational context. These researchers have successfully used different methodologies, including matrix factorization and deep neural networks, to predict which students will drop out.
Advanced statistical frameworks, probabilistic graphical models, and survival analysis make it possible to deal with the complexity and variability of educational data. Consequently, one gains a better understanding of the variables that predict attrition and can develop proactive strategies to support at-risk students. This predictive ability is beneficial not only for timely interventions but also for personalizing learning according to the needs of individual learners, providing a more supportive and effective learning environment (Romero et al., 2010). Various machine learning methodologies, including Cox regression, logistic regression, and Random Forest, were employed to detect students in the United States who were unlikely to complete their education within the expected timeframe. The model was optimized with data from a school district, and performance was measured in terms of information gain, Gini impurity, stepwise regression, and single-feature performance. Results show that the implemented Random Forest model surpasses the other ML approaches (Aguiar et al., 2015). A study by Sara et al. (2015) examined the problem of student withdrawal in Danish high schools using high school datasets and built models with Support Vector Machines, Random Forests, and Naive Bayes. Their measures of performance were accuracy and the Area Under the ROC Curve (AUC). The Random Forest model performed well, with the highest accuracy. However, they did not address data imbalance, which is a crucial aspect of improving model performance and predictability. Moreover, research conducted by Kotsiantis et al. (2003) shows how the prediction of school dropout was approached using machine learning models.
Six distinct ML classifiers, namely Support Vector Machine, Logistic Regression, Artificial Neural Networks, Naive Bayes, Decision Tree, and K-Nearest Neighbors, were applied in the study, and their performance was evaluated using the accuracy and F1-score metrics. The best-performing model was Random Forest. Kabathova & Drlik (2021) also used machine learning methods to predict student dropout. Based on data collected over four academic years, several machine learning algorithms, namely Logistic Regression, Random Forest, Support Vector Machines, and Naive Bayes, were applied and compared. These algorithms search for patterns in student data describing activities and accomplishments that distinguish course completers from non-completers; in this manner, the models can predict whether a student is at risk. The study found Random Forest to be the best model among those considered. Even though there are different ML approaches for model development, there is no single best model for predicting dropout: the variety of models ranges from logistic regression, decision trees, and naïve Bayes to support vector machines, random forests, and neural networks. Whether one algorithm outperforms another depends greatly on the quality and features of the data as well as the context in which it is used (Romero et al., 2010; Shaun et al., 2014). Furthermore, knowledge of the data is required to choose the proper algorithm, because some techniques work well with small samples while others require a huge amount of data to provide good results.

2.6 Evaluating Predictive Models of Student Dropout
One of the major aspects of evaluating a predictive model is choosing relevant performance metrics. Metrics including accuracy, precision, recall, F1 score, and AUC-ROC are frequently utilized (Sokolova & Lapalme, 2009).
Accuracy refers to the proportion of correctly classified instances, whereas precision measures the proportion of predicted positives that are truly positive and recall measures the proportion of actual positives that are correctly identified. The F1 score is the harmonic mean of precision and recall, which is important in highly imbalanced datasets where dropout cases may be infrequent (Manning et al., 2008). In a related study, the authors also used accuracy, precision, recall, F1 score, and the AUC-ROC curve to evaluate predictive models; they chose the F1 score and AUC specifically because of the imbalanced class ratio in their experimental data (Lee et al., 2021). These measures are valuable, but they can sometimes be misleading, mostly for datasets with imbalanced classes. Therefore, a set of metrics should be used to evaluate model performance without potential biases and to present a realistic picture of predictive capabilities. Furthermore, the selection of appropriate evaluation metrics should align with the specific goals and context of the predictive model. For instance, in educational settings, the cost of misclassification must be carefully considered. False negatives, where at-risk students are not identified, can have more severe consequences than false positives, where students are incorrectly flagged as at-risk (Kotsiantis et al., 2003). Therefore, precision and recall become crucial metrics, particularly in imbalanced datasets where the dropout rate is low. Additionally, metrics such as Cohen's Kappa and the Matthews Correlation Coefficient (MCC) offer insights into model performance beyond simple accuracy, accounting for the possibility of random chance in predictions (Chicco & Jurman, 2020). These metrics provide a more nuanced evaluation, helping researchers and practitioners achieve a better understanding of the strengths and limitations of their predictive models in the context of early dropout detection.
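As a concrete illustration, the metrics discussed above can be computed with scikit-learn on a small set of toy predictions. The labels and scores below are invented for demonstration only; they do not come from the study's data.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Toy outputs of a dropout classifier (1 = dropout, 0 = non-dropout).
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred  = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.3, 0.2, 0.45, 0.05]  # predicted P(dropout)

accuracy  = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 0.7
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1        = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc       = roc_auc_score(y_true, y_score)    # ranking quality across all thresholds
```

Note that accuracy alone looks reasonable (0.7) even though half of the true dropouts are missed (recall = 0.5), which is exactly why the imbalance-aware metrics matter here.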
CHAPTER THREE: METHODOLOGY OF RESEARCH
3.1 Introduction
The methodology chapter provides a clear description of the steps taken in conducting the study. It outlines the procedures involved in sample selection, study design, data sources, data collection procedures, model specification, and variable selection for analysis. This chapter provides a comprehensive summary of all the methods employed to effectively address both the general and specific objectives using well-structured and scientifically sound procedures.

3.2 Study design
This research utilizes a quantitative approach and machine learning methods to create a model that predicts student dropout. The main goal is to detect students who are at risk of leaving their studies early, which enables institutions to take proactive measures and offer timely interventions. A quantitative research design is appropriate for this study as it allows for the statistical analysis of numerical data and the development of predictive models based on historical data. Machine learning techniques, such as random forest, decision tree, logistic regression, support vector machine, XGBoost, and gradient boosting machines, are utilized to analyze the data and build predictive models. These models help in understanding the patterns and factors contributing to student dropout and academic success (Creswell & Creswell, 2017).

3.3 Source of data
The present work utilizes a dataset from a higher education institution in Portugal provided by Martins et al. (2021). This dataset consists of 4424 tuples and 35 variables and was created within the SATDAP program, financed through grant POCI-05-5762-FSE-000191, whose main objective was to address and reduce academic failure and dropout rates in higher education. The dataset is characterized by a broad diversity of variables, including academic path, demographics, and socioeconomic background available at the time of student enrollment.
It also includes students' academic performance at the end of the first and second semesters. Additionally, the dataset aligns with the standards and rigor expected of datasets housed in the UCI Machine Learning Repository, a renowned collection used globally by students, educators, and researchers for the empirical analysis of machine learning algorithms. This repository, established in 1987, ensures data accuracy and reliability, further enhancing the credibility of our study. More details about the dataset can be found at https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success.

3.4 Data Description
This study uses a dataset that includes variables that could be useful in predicting cases of university dropout. The dataset is in CSV format and encompasses student-related variables including marital status, nationality, prior qualifications, and information about the parents' qualifications and occupations. It also contains academic performance variables, such as the number of curricular units credited, enrolled, evaluated, and approved, with grades for both the first and second semesters. Other socio-economic variables included in the dataset are the unemployment rate, inflation rate, and gross domestic product. This is a fully comprehensive dataset with continuous variables such as grades, unemployment rate, and GDP, as well as categorical variables like marital status and nationality. More importantly, there is no missing information for any variable of interest; hence, the data are complete and reliable for subsequent analysis.

3.5 Study variables
In this study, variables were selected based on an extensive literature review and a conceptual framework that allow for an understanding of the factors influencing dropout among students in higher education.
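To illustrate the structure described above, the sketch below builds a tiny pandas frame containing a handful of the dataset's variables and derives the binary dropout label used in this study. The rows are invented for demonstration; they are not taken from the real UCI data.

```python
import pandas as pd

# Toy rows mirroring a few of the study variables (names follow Table 1;
# the values are illustrative, not real records).
df = pd.DataFrame({
    "Marital_status":     [1, 1, 2, 1],
    "Enrollment_Age":     [19, 23, 35, 20],
    "Scholarship_holder": [0, 1, 0, 0],
    "Unemployment_rate":  [10.8, 13.9, 10.8, 9.4],
    "GDP":                [1.74, -0.92, 1.74, -3.12],
    "Target":             ["Dropout", "Graduate", "Dropout", "Enrolled"],
})

# Encode the three-class target into the study's binary outcome:
# 1 = dropout, 0 = non-dropout (still enrolled or graduated).
df["dropout"] = (df["Target"] == "Dropout").astype(int)

# Completeness check: the source dataset reports no missing values.
n_missing = df.isna().sum().sum()
```

With the real CSV, the same recoding of the `Target` column yields the dependent variable described in Section 3.5.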
Individual characteristics, socioeconomic background, demographic factors, and academic performance are major contributors to dropout and receive special attention. The dependent variable is the dropout status of the students, which has two categories or classes: 1 for dropouts and 0 for non-dropouts (students who either remain enrolled or graduated). This binary classification helps explain clearly the factors that differentiate those who drop out from those who continue their education.

Table 1. Study variables

Demographic data:
- Marital_status (numerical/discrete)
- Nationality (numerical/discrete)
- Displacement_status (numerical/binary)
- Gender (numerical/binary)
- Enrollment_Age (numerical/discrete)
- International_status (numerical/binary)

Socioeconomic data:
- Mother_Qualification (numerical/discrete)
- Father_Qualification (numerical/discrete)
- Mother_Occupation (numerical/discrete)
- Father_Occupation (numerical/discrete)
- Educational_special_needs (numerical/binary)
- Debtor_status (numerical/binary)
- Tuition fees_up_to_date (numerical/binary)
- Scholarship_holder (numerical/binary)
- Unemployment_rate (numerical/continuous)
- Inflation_rate (numerical/continuous)
- GDP (numerical/continuous)

Academic data:
- Application_mode (numerical/discrete)
- Application_order (numerical/ordinal)
- CourseName (numerical/discrete)
- Day/evening attendance (numerical/binary)
- Previous Qualification (numerical/discrete)
- firstSemCurricularUnits_Credited (numerical/discrete)
- firstSemCurricularUnits_enrolled (numerical/discrete)
- firstSemCurricularUnits_evaluations (numerical/discrete)
- firstSemCurricularUnits_approved (numerical/discrete)
- firstSemCurricularUnits_grade (numerical/continuous)
- firstSemCurricularUnitswithout_evaluations (numerical/discrete)
- secondSemCurricularUnits_credited (numerical/discrete)
- secondSemCurricularUnits_enrolled (numerical/discrete)
- secondSemCurricularUnits_evaluations (numerical/discrete)
- secondSemCurricularUnits_approved (numerical/discrete)
- secondSemCurricularUnits_grade (numerical/continuous)
- secondSemCurricularUnitswithout_evaluations (numerical/discrete)

Target:
- Target (categorical)

3.6 Data preprocessing
The primary step in the machine learning pipeline is data preprocessing, which ensures the data are reliable and of high quality before model training and evaluation. This process includes solving data issues such as missing values and outliers and normalizing or standardizing variables. The dataset used in this research does not require additional preprocessing, as rigorous preprocessing to handle anomalies, outliers, and missing values has already been completed, yielding a clean dataset ready for analysis. Handling missing data properly is important because, if neglected, it can lead to biased data and poor model performance (García et al., 2015). Proper data preprocessing is fundamental to building robust and generalizable predictive models (Sammut & Webb, 2011).

3.7 Handling Data Imbalance
Class balancing is an important step in the development of an effective predictive model in which at-risk students must be identified accurately. Class imbalance occurs when some classes are radically underrepresented compared with others, which can lead to biased model performance and low predictive accuracy on the minority class (Chawla et al., 2002; Mduma, 2023). To tackle the class imbalance, we applied SMOTE, a widely recognized method for balancing datasets. SMOTE works by generating synthetic examples for the minority class, increasing the representation of that class and allowing the model to learn from a more balanced dataset (Chawla et al., 2002). This technique is especially useful in classification problems where the target variable has an imbalanced class distribution.
While SMOTE significantly improves the balance of the dataset, one should keep in mind that, if handled incorrectly, synthetic data may add noise or cause overfitting. Ensuring that the synthetically generated samples are representative of real-world scenarios is critical to avoid misleading the model during training. We can achieve this by choosing significant features that accurately depict real-world scenarios. We also apply cross-validation and performance metrics to evaluate the impact of SMOTE on the generalization ability of the model (Saad Hussein et al., 2019).

3.8 Model Development
A predictive model, whether it relies on statistical methods or machine learning, is crafted to estimate future events or outcomes by examining past data. Such a model is developed using a dataset that contains various input variables or features, which are utilized to forecast a specific target variable. The main objective is to identify patterns and correlations in the data that allow for precise predictions about upcoming events (Bishop & Nasrabadi, 2006). In the context of this study on predicting higher education dropout, various classification modeling techniques will be utilized: decision tree, random forest, support vector machine, gradient boosting machine, XGBoost, and logistic regression. By employing these methods, a predictive model will be developed to estimate the likelihood of student dropout in higher education settings. This approach will help identify dropout predictors, thereby facilitating timely interventions to reduce dropout rates (Andrade-Girón et al., 2023). The selection of these models is based on their effectiveness in handling complex data structures and their capability to reveal significant patterns associated with dropout rates.
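To make the resampling step concrete, the core interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is an illustrative re-implementation under simplifying assumptions, not the library code a study like this would typically use (e.g. imbalanced-learn's SMOTE); the function name `smote_oversample` and the toy data are invented for demonstration.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between each
    chosen sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class (self excluded).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(n)                    # random minority sample
        nb = X_min[rng.choice(neighbours[j])]  # one of its k neighbours
        gap = rng.random()                     # interpolation factor in [0, 1]
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

# Toy imbalanced data: 20 majority (class 0) vs 5 minority (class 1).
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(20, 2))
X_min = rng.normal(3.0, 1.0, size=(5, 2))
X_new = smote_oversample(X_min, n_synthetic=15)  # 5 + 15 = 20, now balanced
X_bal = np.vstack([X_maj, X_min, X_new])
y_bal = np.array([0] * 20 + [1] * 20)
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the minority region rather than duplicating existing records, which is the property the caution above about representativeness refers to. In practice, resampling should be applied only to the training folds during cross-validation, never to the test data.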
Each technique offers unique advantages: Decision Trees provide interpretability (Gilmore et al., 2021), Random Forest enhances accuracy through ensemble methods (Breiman, 2001), Support Vector Machine is adept in high-dimensional spaces (Cortes & Vapnik, 1995), and Gradient Boosting and XGBoost enhance prediction by iteratively improving on previous models' errors, with XGBoost further optimizing this process through advanced techniques such as regularization and parallelization for better accuracy and efficiency (Ramraj et al., 2016). Logistic Regression, meanwhile, offers a clear probabilistic framework. Together, these models form a robust approach to accurately predict student dropout and inform targeted interventions.

3.8.1 Logistic regression
Logistic Regression is a widely used classification technique for predicting the probability of a binary outcome, such as student dropout. In higher education dropout prediction, Logistic Regression models a dependent variable, for example dropout status (yes/no), based on one or more independent variables such as academic performance, demographic factors, and socioeconomic status (Hosmer Jr et al., 2013). It takes these features as input and feeds them into a logistic function to produce a predicted probability of a student's dropout.

3.8.2 Decision Tree
Decision tree is a machine learning model applied to both classification and regression problems. The data and decision process are represented in a tree-like structure: internal nodes express decisions about certain features, their branches define the results of those decisions, and the leaf nodes represent the final stage of the prediction or classification process. Decision trees are very intuitive and straightforward to interpret.
It works with categorical and numerical data, recursively dividing a dataset into subsets according to given feature values and attempting to create groups that are as homogeneous as possible to support accurate prediction (Mitchell & Mitchell, 1997). Furthermore, decision trees are among the most widely used ML models for predicting school dropout. They extract patterns from large amounts of educational data using data mining techniques and function by creating a hierarchical tree-like model in which the internal nodes represent decisions about specific student attributes, the branches represent the outcomes, and the leaf nodes provide the final prediction of a student's likelihood of persisting in or dropping out of education. Designed to deal with both categorical and numerical data, decision trees are particularly well suited to this task, and their intuitive structure facilitates educators' interpretation and subsequent actions. Using decision trees, researchers can classify students based on their responses and forecast dropout risks with a view to intervening early to improve student retention rates (Mariano et al., 2022).

Figure 1. Decision Tree Illustration.

3.8.3 Random Forest
Random Forest is an ensemble learning technique that can be applied to both classification and regression tasks. The model works by creating several decision trees during training. In classification it outputs the class most voted for by the trees, whereas in regression it provides the average of the trees' predictions. The algorithm employs bootstrap aggregation, or bagging, in which each tree is trained on a random subset of the data sampled with replacement, so the same data points can be selected repeatedly. To reduce variance and prevent overfitting, a random subset of features is considered at each split.
Random forests achieve good accuracy, resist overfitting, and can handle large datasets with high dimensionality (Hastie et al., 2001). Their effectiveness stems from the ensemble of decorrelated trees, which often yields better predictive accuracy than an individual decision tree. Breiman (2001) showed that RF's effectiveness on complex datasets has led to its application in dropout prediction tasks. RF's robustness to outliers and noise also suits educational data, which often exhibits considerable variability. It frequently outperforms other machine learning models, such as XGBoost and SVM, providing high accuracy and reliable predictions. The Gini impurity criterion helps RF reduce overfitting and bias, keeping the model accurate even on noisy data. To predict dropout, RF is trained on student data to identify the most important features for prediction, which enables interventions targeted at those variables (Dass et al., 2021).

Figure 2. Random Forest Algorithm (GeeksforGeeks, 2024)

3.8.4 Support Vector Machines

Support vector machines are a highly effective supervised learning algorithm, used mostly for classification but also applicable to regression. The essential idea behind SVM is to find the hyperplane that separates data points of different classes with the maximum possible margin. This hyperplane is defined by the support vectors, the data points closest to the decision boundary. SVM handles both linear and nonlinear classification through different kernel functions, including the linear, polynomial, and radial basis function (RBF) kernels, which project the input data into a higher-dimensional space where a linear separation may be possible (Cortes & Vapnik, 1995).
An important strength of SVMs is that they work well in high-dimensional spaces and show some resistance to overfitting, especially when the number of features exceeds the number of samples. SVMs have therefore been very promising for dropout prediction, as they can model complex decision boundaries and are relatively robust to noisy data. They identify the optimal separating hyperplane that distinguishes students at risk of dropout from others based on academic and socioeconomic features (Del Bonifro et al., 2020).

3.8.5 XGBoost

XGBoost (Extreme Gradient Boosting) is a high-performance implementation of the gradient boosting algorithm, designed to be powerful, flexible, portable, and highly efficient. It can be applied to classification and regression problems and is particularly well suited to structured or tabular data. It introduces several improvements over conventional gradient boosting, including regularization to prevent overfitting, better handling of sparse data and missing values, and a weighted quantile sketch algorithm that makes it faster, more accurate, and more stable on large datasets (Chen & Guestrin, 2016).

3.8.6 Gradient Boosting Machine

The gradient boosting machine is an ensemble learning technique for both classification and regression tasks. The model is constructed incrementally from a series of weak learners, usually decision trees, in a stage-wise manner. At each stage, a new tree is trained to correct the errors made by the previous trees, minimizing a chosen loss function via gradient descent. This iterative process reduces bias and increases model accuracy (Hastie et al., 2001).
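As a minimal sketch (on synthetic data, not the study's dataset), the six classifier families described above can be fitted with scikit-learn; XGBoost lives in the separate `xgboost` package (`xgboost.XGBClassifier`) and is omitted here so the sketch runs with scikit-learn alone:

```python
# Illustrative sketch: fitting the classifier families described above on a
# synthetic binary-classification problem standing in for the student dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 1,000 students, 20 features, binary dropout label.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM (RBF kernel)": SVC(kernel="rbf", probability=True, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")
```

The stratified split keeps the class proportions identical in the training and test sets, which matters when, as here, one class (dropout) is the minority.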
3.9 Model Evaluation

Once a learning algorithm has been developed on the training set, it is critical to evaluate how effective the resulting classifier is. In machine learning, the classifier is evaluated on the test data using classification performance metrics. It is common practice to base this evaluation on the confusion matrix, a cross-tabulation that summarizes how well the classifier predicts the target classes. From the confusion matrix, a number of classification metrics, namely sensitivity, accuracy, specificity, recall, precision, F1 score, FPR, and TPR, are derived and used to identify the best-performing model. Another important and widely used tool is the Receiver Operating Characteristic (ROC) curve, which graphically depicts the relationship between the true positive rate and the false positive rate (TPR vs. FPR).

3.9.1 Confusion Matrix

A confusion matrix is a tabular representation that offers a succinct overview of a machine learning model's performance on a given test dataset. It displays the number of correct and incorrect predictions made by the model and is frequently used to assess classification models, which aim to predict a category label for every input instance (GeeksforGeeks, 2024).

Figure 3. Confusion Matrix illustration

- A true positive (TP) occurs when the model predicts a positive outcome and the actual outcome is indeed positive.
- A true negative (TN) occurs when the model predicts a negative outcome and the actual outcome is indeed negative.
- A false positive (FP) occurs when the model wrongly predicts a positive outcome when the actual outcome was negative. Also referred to as a Type I error.
- A false negative (FN) occurs when the model wrongly predicts a negative outcome when the actual outcome was positive. Also referred to as a Type II error.

3.9.2 Accuracy

Accuracy quantifies the overall effectiveness of a model as the ratio of correctly classified instances to the total number of instances:

Accuracy = Number of correct predictions / Total number of input samples

3.9.3 Precision

Precision quantifies the accuracy of a model's positive predictions: the number of correct positive predictions divided by the total number of positive predictions generated by the model:

Precision = TP / (TP + FP)

3.9.4 Recall

Recall quantifies how well a classification model identifies all relevant instances in a dataset. Also called the true positive rate, it is the proportion of correctly identified positive cases (TP) out of all actual positive cases (TP + FN):

Recall = TP / (TP + FN)

3.9.5 F1 Score

The F1 score assesses the overall efficacy of a classification model as the harmonic mean of precision and recall. A high F1 score indicates a low number of both false positives and false negatives:

F1 score = 2 / (1/Precision + 1/Recall) = 2 × (Precision × Recall) / (Precision + Recall)

3.9.6 Area Under Curve (AUC)

When assessing binary classification models in machine learning, the Area Under the Curve (AUC) is a critical performance indicator. It is obtained from the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) across a range of classification thresholds. The AUC measures the model's overall capacity to discriminate between the positive and negative classes. AUC values range from 0 to 1:

- 0.5: no discriminative ability, equivalent to random guessing.
- 0.7 to 0.8: acceptable performance, indicating some ability to distinguish between classes.
- 0.8 to 0.9: excellent performance, reflecting a strong capability to separate the classes.
- 0.9 to 1.0: outstanding performance, demonstrating near-perfect classification ability.
- 1: the model perfectly discriminates between the positive and negative classes.

3.10 Feature Importance Analysis

To identify key factors influencing higher education dropout, a comprehensive feature importance analysis was conducted using multiple methods: Random Forest, XGBoost, Logistic Regression, SHAP values, and Permutation Importance. Each method evaluates the contribution of features to model predictions from a different angle. By averaging the importance scores from these diverse methods, we obtain a comprehensive and reliable measure of feature significance, ensuring a well-rounded understanding of the key factors influencing higher education dropout.

Random Forest evaluates feature importance by measuring how much each feature improves the accuracy of the model when included; because this is averaged over many decision trees, the estimate is robust against overfitting (Breiman, 2001). XGBoost, a gradient boosting technique, measures feature importance by how often each feature is used to create splits across all trees in the ensemble. It emphasizes error reduction through weighted adjustments, leading to highly predictive models (Chen & Guestrin, 2016). Logistic Regression determines feature importance from the weights assigned to each predictor; the magnitude of a weight indicates the strength, and its sign the direction, of that variable's influence (Hosmer Jr et al., 2013). SHAP values explain model predictions by assigning each feature an importance score based on its contribution to the prediction outcome; this method accounts for interactions among features and provides consistent explanations (Lundberg, 2017).
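As a minimal sketch (on synthetic data, using scikit-learn only; the study's full pipeline also draws on XGBoost and SHAP, which come from separate packages), the idea of rescaling per-method importance scores to a common range and averaging them can be illustrated with two of these methods, Random Forest impurity importance and Logistic Regression coefficient magnitudes:

```python
# Illustrative sketch: normalize importance scores from two methods to a
# common [0, 1] scale, then average them into one combined score per feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in data: 500 students, 8 features, binary dropout label.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

def minmax(scores):
    """Rescale a score vector to [0, 1] so different methods are comparable."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

per_method = np.vstack([
    minmax(rf.feature_importances_),   # impurity-based importance
    minmax(np.abs(lr.coef_[0])),       # weight magnitude = influence strength
])
mean_importance = per_method.mean(axis=0)  # one combined score per feature
ranking = np.argsort(mean_importance)[::-1]
print("Features ranked by mean importance:", ranking.tolist())
```

Averaging normalized scores prevents a method whose raw scores happen to be on a larger scale from dominating the combined ranking.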
Permutation Importance works by shuffling the values of a feature and measuring the resulting decrease in model performance, thus identifying the features that most affect predictions. It is particularly useful for understanding complex models (Fisher et al., 2019). The procedure involves calculating feature importance scores with each method, normalizing these scores to a common scale, and averaging them to obtain a "Mean Importance" score for each feature. This combined score integrates the strengths of all methods, mitigating individual biases and offering a more reliable assessment of feature influence. The most critical features, ranked by their mean importance scores, are used to refine the model and to identify significant predictors of dropout, ultimately supporting targeted intervention strategies.

This study demonstrates originality through its comprehensive use of advanced feature importance methods, integrating Random Forest, XGBoost, Logistic Regression, SHAP values, and Permutation Importance. Unlike traditional approaches that often rely on a single technique, combining these diverse methods provides a deeper, more reliable understanding of the key determinants of higher education dropout. SHAP values and Permutation Importance, in particular, offer advanced interpretability by explaining feature contributions and interactions within the model, a significant enhancement over conventional techniques (Lundberg, 2017; Fisher et al., 2019). By combining the strengths of these methods, the study not only improves prediction accuracy but also delivers actionable insights for targeted interventions, filling a notable gap in existing dropout prediction research.

CHAPTER 4: DATA ANALYSIS AND PRESENTATION OF RESULTS

4.1 Introduction

This chapter presents the analysis of the data used for the study and discusses the results obtained.
The chapter provides a detailed examination of the descriptive statistics, exploring the distribution of key variables, and then focuses on evaluating the performance of the predictive models used in the study. Comparative results for the various machine learning algorithms are presented, highlighting the strengths and weaknesses of each model. Finally, the chapter concludes with an interpretation of the findings in light of the research objectives, discussing the implications for early detection of student dropout and suggesting potential interventions based on the results.

4.2 Data quality check

Inspection of the dataset revealed a class imbalance: dropout instances constituted approximately 32% of the dataset, while the non-dropout classes (graduate and enrolled) represented 50% and 18% respectively. Specifically, the dataset initially comprised 1421 dropout instances, 2209 graduates, and 794 enrolled students. After combining the graduate and enrolled classes, the resulting non-dropout class contains 3003 instances (68%). This imbalance can bias the outcome, with models predicting the majority class (non-dropouts) while overlooking the minority class (dropouts), despite otherwise robust model diagnostics. To tackle this issue, the SMOTE method was applied to balance the dataset. This resulted in an even representation of dropout and non-dropout instances, with both categories containing 3003 samples each, as shown in Figure 4. This balancing step is crucial for improving the model's ability to predict dropout instances accurately.

Figure 4. Bar plot showing the data before and after balancing

4.3 Descriptive View of Dropout Rate for Demographic, Socioeconomic, and Academic Variables

This section presents dropout rates across key demographic and academic variables, with student dropout as the outcome. The analysis reveals significant variations in dropout rates among different groups.
The dropout rate of legally separated students is highest at 66.67%, followed by married students (47.23%) and those in de facto unions (44%). Single students, the largest group, have a dropout rate of 30.21%. Course-level dropout rates show that Biofuel Production Technologies (66.67%) and Informatics Engineering (54.12%) suffer the highest dropout rates, suggesting possible issues with curriculum difficulty or student engagement. Conversely, fields such as Nursing (15.4%) and Social Service (18.31%) show lower dropout rates, potentially owing to job stability or strong support systems. Evening students are more likely to drop out (42.86%) than daytime students (30.80%). Additionally, debtors have a higher dropout rate (62.03%) than non-debtors (28.28%), and students with up-to-date tuition fees are far less likely to drop out (24.74%) than those with overdue fees (86.55%). Gender differences are also notable, with a dropout rate of 25.10% for females versus 45.05% for males, highlighting a significant gender disparity. Age at enrollment affects dropout rates as well: dropouts have a mean age of 26.07 compared with 21.94 for non-dropouts, indicating that older students may face additional challenges. In terms of academic performance, dropouts pass an average of 2.55 curricular units in the first semester, compared with 5.73 for non-dropouts, and dropouts average a grade of 7.26 out of 20, versus 12.24 for non-dropouts. These indicators suggest that lower grades and fewer passed units are associated with a higher risk of dropping out. Table 2 presents a summary of selected categorical variables that are key to understanding the determinants of student dropout, and Table 3 presents the numerical variables. For a more comprehensive breakdown of all categorical variables and additional details, see Appendix A.

Table 2.
Distribution of categorical variables

Variable / Category                       Count    %      Dropout n   Dropout %   Non-Dropout n   Non-Dropout %
Marital status
  Single                                  3919    88.58   1184        30.21       2735            69.79
  Married                                 379     8.57    179         47.23       200             52.77
  Divorced                                91      2.06    42          46.15       49              53.85
  Facto union                             25      0.57    11          44.00       14              56.00
  Legally separated                       6       0.14    4           66.67       2               33.33
  Widower                                 4       0.09    1           25.00       3               75.00
Course
  Nursing                                 766     17.31   118         15.40       648             84.60
  Management                              380     8.59    134         35.26       246             64.74
  Social Service                          355     8.02    65          18.31       290             81.69
  Veterinary Nursing                      337     7.62    90          26.71       247             73.29
  Journalism and Communication            331     7.48    101         30.51       230             69.49
  Advertising and Marketing Management    268     6.06    95          35.45       173             64.55
  Management (evening attendance)         268     6.06    136         50.75       132             49.25
  Tourism                                 252     5.70    96          38.10       156             61.90
  Communication Design                    226     5.11    51          22.57       175             77.43
  Animation and Multimedia Design         215     4.86    82          38.14       133             61.86
  Social Service (evening attendance)     215     4.86    71          33.02       144             66.98
  Agronomy                                210     4.75    86          40.95       124             59.05
  Basic Education                         192     4.34    85          44.27       107             55.73
  Informatics Engineering                 170     3.84    92          54.12       78              45.88
  Equinculture                            141     3.19    78          55.32       63              44.68
  Oral Hygiene                            86      1.94    33          38.37       53              61.63
  Biofuel Production Technologies         12      0.27    8           66.67       4               33.33
Daytime/evening attendance
  Daytime                                 3941    89.08   1214        30.80       2727            69.20
  Evening                                 483     10.92   207         42.86       276             57.14
Debtor
  No                                      3921    88.63   1109        28.28       2812            71.72
  Yes                                     503     11.37   312         62.03       191             37.97
Tuition fees up to date
  Yes                                     3896    88.07   964         24.74       2932            75.26
  No                                      528     11.93   457         86.55       71              13.45
Gender
  Female                                  2868    64.83   720         25.10       2148            74.90
  Male                                    1556    35.17   701         45.05       855             54.95
Scholarship holder
  No                                      3325    75.16   1287        38.71       2038            61.29
  Yes                                     1099    24.84   134         12.19       965             87.81

Table 3.
Distribution of Numerical variables

Variable                                         Dropout (N = 1421)     Non-Dropout (N = 3003)
                                                 Mean      SD           Mean      SD
Age at enrollment                                26.069    8.704        21.938    6.596
Curricular units 1st sem (credited)              0.6094    2.105        0.7576    2.471
Curricular units 1st sem (enrolled)              5.8213    2.326        6.4832    2.522
Curricular units 1st sem (evaluations)           7.7516    4.922        8.5581    3.750
Curricular units 1st sem (approved)              2.5517    2.858        5.7263    2.647
Curricular units 1st sem (grade)                 7.2567    6.031        12.242    3.062
Curricular units 1st sem (without evaluations)   0.1921    0.795        0.1119    0.634
Curricular units 2nd sem (credited)              0.4497    1.680        0.5854    2.021
Curricular units 2nd sem (enrolled)              5.7804    2.108        6.4459    2.205
Curricular units 2nd sem (evaluations)           7.1738    4.817        8.4842    3.382
Curricular units 2nd sem (approved)              1.9402    2.574        5.6167    2.432
Curricular units 2nd sem (grade)                 5.8993    6.119        12.280    3.036
Curricular units 2nd sem (without evaluations)   0.2379    0.994        0.1089    0.604
Unemployment rate                                11.616    2.768        11.542    2.613
Inflation rate                                   1.2840    1.405        1.2016    1.371
GDP                                              -0.1509   2.252        0.0743    2.275

4.4 Identification of important predictors of Dropout risk in Higher Education

In this study, several methods were used to identify key determinants of student dropout. The evaluation of feature importance across methods such as Random Forest (RF), XGBoost (XGB), Logistic Regression (LR), SHAP, and Permutation Importance reveals nuanced insights into the factors influencing student retention. Notably, "Curricular units 2nd sem (approved)" consistently stands out as a crucial predictor across all methods, with the highest mean importance score of 0.923. This indicates that students' performance in their second-semester courses significantly impacts dropout risk.
"Tuition fees up to date" is another key feature, especially highlighted by XGBoost and Permutation importance measures, showing its substantial influence on dropout risk with a mean importance of 0.618. Conversely, features such as "Age at enrollment" and "Unemployment rate" show varying levels of importance across different methods, suggesting their role in dropout risk is more context-dependent. The SHAP values provide additional clarity by highlighting the nuanced contributions of each feature to individual predictions. For instance, "Curricular units1st sem (approved)" and "Curricular units 2nd sem (grade)" show notable differences in their contributions based on the model, emphasizing the importance of academic performance in dropout prediction. 30 Therefore, the results underscore the complex interplay of academic performance, financial status, and demographic factors in predicting dropout risk. These findings suggest that while some features have consistently high importance, others may influence dropout risk differently depending on the context and the model used. This comprehensive analysis informs targeted interventions aimed at reducing dropout rates by addressing the most influential predictors identified through various interpretative methods. Table 4 and Figure 5 below provide the summary of importance values of each machine learning methods used with the overall mean and the top 10 features influencing dropout risk. Table 4. 
Feature importance results for all variables

Feature                                         RF      XGB     LR      SHAP    Permutation   Mean Importance
secondSemCurricularUnits_Approved               1       1       0.614   1       1             0.923
Tuition fees up to date                         0.65    0.932   0       0.586   0.923         0.618
firstSemCurricularUnits_Approved                0.659   0.073   0.735   0.284   0.367         0.424
secondSemCurricularUnits_grade                  0.756   0.022   0.843   0.175   0.158         0.391
firstSemCurricularUnits_grade                   0.455   0.007   0.887   0.129   0.091         0.314
secondSemCurricularUnits_enrolled               0.107   0.165   1       0.104   0.108         0.297
Age at enrollment                               0.173   0.029   0.885   0.186   0.108         0.276
Unemployment rate                               0.239   0.013   0.872   0.162   0.095         0.276
CourseName                                      0.158   0.041   0.906   0.198   0.059         0.273
Application mode                                0.162   0.036   0.873   0.184   0.095         0.270
Inflation rate                                  0.251   0.017   0.843   0.120   0.057         0.258
GDP                                             0.105   0.066   0.893   0.072   0.132         0.254
secondSemCurricularUnits_enrolled               0.142   0.018   0.858   0.129   0.107         0.251
secondSemCurricularUnits_evaluations            0.147   0.048   0.858   0.146   0.053         0.251
secondSemCurricularUnits_credited               0.155   0.021   0.838   0.133   0.061         0.242
Father's occupation                             0.166   0.015   0.846   0.108   0.073         0.241
Father's qualification                          0.118   0.015   0.865   0.107   0.083         0.238
Mother's qualification                          0.157   0.007   0.848   0.080   0.095         0.237
Mother's occupation                             0.142   0.023   0.855   0.090   0.067         0.236
secondSemCurricularUnits_evaluations            0.028   0.054   0.982   0.049   0.014         0.226
Scholarship holder                              0.181   0.163   0.503   0.157   0.103         0.221
Displaced                                       0.071   0.003   0.871   0.040   0.049         0.207
secondSemCurricularUnits_credited               0.053   0.002   0.914   0.042   0.016         0.205
Daytime/evening attendance                      0.017   0.024   0.949   0.027   0             0.204
Debtor                                          0.037   0.028   0.888   0.023   0.030         0.201
Application order                               0.032   0.038   0.877   0.035   0.008         0.198
Gender                                          0.037   0       0.828   0.039   0.043         0.189
Nationality                                     0.030   0.009   0.843   0.020   0.016         0.184
firstSemCurricularUnits_without evaluations     0.008   0.037   0.853   0.001   0.010         0.182
Previous qualification                          0.016   0.014   0.816   0.014   0.002         0.172
secondSemCurricularUnits_without evaluations    0.016   0.039   0.782   0.005   0.004         0.169
Educational special needs                       0.012   0.003   0.768   0.032   0.012         0.165
Marital status                                  0       0.010   0.807   0       0.002         0.164
International                                   0       0.026   0.739   0.008   0.012         0.157

Figure 5. Top 10 Feature Importance

4.5 Machine learning model results

This section evaluates six machine learning models applied to predict student dropout. The performance of each model is assessed using key metrics: F1 score, precision, accuracy, recall, and AUC score. The goal is to identify the best-performing model for predicting dropout, providing a reliable tool for early intervention. By comparing these metrics, we determine which model predicts student attrition most reliably.

4.5.1 Logistic regression

The Logistic Regression model performed well, with an accuracy of 87%, demonstrating its ability to classify dropout students. Its precision of 89% indicates a strong capability to predict true positives accurately, while a recall of 83% shows that it also captures most of the actual positive cases. The F1 score of 86% balances precision and recall, and the AUC-ROC score of 94% highlights the model's capability to discriminate between the dropout and non-dropout classes.

4.5.2 Support Vector Machine (SVM)

The SVM model was also used to predict dropout cases and performed well on the metrics used in this study. With an accuracy of 88%, it identifies 88% of cases correctly. A high precision of 90% and a recall of 84% indicate that the model effectively identifies true positives while maintaining good sensitivity towards false negatives. The F1 score of 87% reflects its balanced performance, and the AUC-ROC of 94% confirms its solid discriminative power.

4.5.3 Decision Tree

The Decision Tree model predicted dropout cases with an accuracy of 86%, slightly lower than the other models in this study.
Its precision and recall of 86% and 84% respectively demonstrate a moderate capability of correctly identifying positive and negative cases. The F1 score of 85% reflects a balance between precision and recall, while the AUC-ROC score of 92% indicates high discriminative effectiveness between classes.

4.5.4 Random Forest

The Random Forest model predicts dropout cases with the highest accuracy of 90%, showing a strong ability to make correct dropout predictions. On the other metrics it also performed exceptionally well, with a recall of 86%, suggesting that it captures most positive cases. The model's precision of 92% and F1 score of 89% indicate a well-balanced performance. Its AUC-ROC score of 95.21% further confirms its excellent discriminatory power, making it one of the best-performing models in this study.

4.5.5 Gradient Boosting Machine

The Gradient Boosting model performed as well as the SVM, with an accuracy of 88%. It has a precision of 90% and a recall of 84%, underlining its good performance in recognizing true positives. With an F1 score of 87% it presents a balanced performance, while the AUC-ROC score of 94% shows very good discrimination between the dropout and non-dropout classes, indicating robust performance.

4.5.6 XGBoost

The XGBoost model achieved a notable accuracy of 90%, confirming that it reliably generates correct predictions. Its precision is also high at 92%, with a recall of 86%, meaning it is good at picking out true positives and hence minimizes false negatives. The F1 score of 89% shows its balanced performance, while an AUC-ROC score of 95.30% further confirms its superiority in discriminating between the dropout and non-dropout classes.

Table 5. Model evaluation metrics for six classifiers.
Model                       Accuracy   AUC Score   Precision   Recall   F1 Score
Logistic Regression         0.87       0.94        0.89        0.83     0.86
Decision Tree               0.86       0.92        0.86        0.84     0.85
Random Forest               0.90       0.95        0.92        0.86     0.89
Gradient Boosting Machine   0.88       0.95        0.89        0.84     0.87
Support Vector Machine      0.88       0.94        0.90        0.84     0.87
XGBoost                     0.90       0.95        0.92        0.86     0.89

4.5.7 Comparative analysis of model performances

The results of all six trained models show that XGBoost and Random Forest are the top-performing models, with the highest overall accuracy, AUC-ROC score, and balanced F1 score. These results indicate that ensemble methods built from multiple decision trees provide superior predictive accuracy compared to individual models such as the Decision Tree or Logistic Regression. These models consistently identified the key determinants of dropout more accurately than the other methods, making them the preferred choices for early detection of dropout. Figure 6 presents the ROC curves of all six models and shows XGBoost and Random Forest as the best among them, with AUCs of 95.30% and 95.21% respectively, indicating that these models are highly capable of distinguishing between the positive and negative classes. Logistic Regression achieved a ROC-AUC of 94%. The Gradient Boosting Machine also shows high discriminative power, with a ROC-AUC of 95%. The Support Vector Machine and Decision Tree obtain high AUC values of 94% and 92% respectively, but XGBoost and Random Forest outperform them.

Figure 6. ROC Curves for six ML models

Figure 7 shows the confusion matrices of the six trained models. Logistic Regression and SVM performed well, with relatively balanced classification but still noticeable misclassifications. The Decision Tree had more difficulty differentiating between dropout and non-dropout cases.
Random Forest shows the best results, with the fewest misclassifications, demonstrating strong performance in separating the classes accurately. Gradient Boosting and XGBoost also performed well, with XGBoost closely matching Random Forest in minimizing errors. Overall, the ensemble models, particularly XGBoost and Random Forest, excelled at accurately predicting dropouts. The XGBoost model was therefore selected as the best model for predicting student dropout because of its superior performance across all metrics, particularly its AUC value of 95.30% and its ability to handle imbalanced classes. Random Forest was a close second, highlighting the effectiveness of ensemble techniques in identifying at-risk students.

Figure 7. Confusion matrices for six employed ML classifiers

CHAPTER 5. DISCUSSION OF THE RESULTS

5.1 Introduction

This chapter delves into the interpretation and implications of the key findings from the study on predicting student dropout in higher education using machine learning models. By analyzing the performance of the various models, including Support Vector Machine, XGBoost, Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting Machine, the discussion focuses on identifying the most effective approach for early detection of dropout. It also examines the importance of the features influencing dropout rates, emphasizing both academic and socio-economic determinants. The discussion aims to provide a comprehensive understanding of the study's results, their place in the existing literature, and their potential implications for higher education institutions.

5.2 Key Findings Discussion

The focus of this study is building a predictive model for student dropout in higher education.
Six machine learning models, namely Support Vector Machine, XGBoost, Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting Machine, were trained and compared to identify the most effective predictive model for early detection of student dropout in higher education. Among these six models, XGBoost outperformed the others with an AUC-ROC score of 95.30%. The second-best performer was Random Forest, which closely followed XGBoost with an AUC score of 95.21%. This agrees with Bentéjac et al. (2019), who underscore XGBoost's superior performance in classification tasks compared to other models. The finding is also consistent with a previous study by Park & Yoo (2021), which identified Random Forest as the best-performing model, demonstrating its robustness and reliability for predicting student dropout. The results demonstrate the predictive power of ensemble learning techniques on an imbalanced dataset, where dropout instances are fewer than those of the other classes. A recent study using ensemble learning models, including a novel stacking ensemble, reported high performance in predicting student dropout, with testing accuracy reaching 92.18%. This aligns with the strong results achieved by Random Forest and XGBoost in this research, reinforcing the effectiveness of ensemble methods in dropout prediction (Niyogisubizo et al., 2022). The SVM and Logistic Regression also gave very strong results, with accuracies of 88% and 87% respectively and the same AUC-ROC of 94%. While these models yielded high precision and recall, their performance was slightly below that of the ensemble methods, especially for the complex nonlinear relationships inherent in the data. The Decision Tree model performed lowest; even after overfitting was addressed, it achieved an accuracy of 86% and an AUC-ROC of 92%.
Even with tuned parameters, it remained prone to overfitting and was therefore less robust than the more sophisticated ensemble methods, Random Forest and XGBoost, which achieved better accuracy and AUC-ROC. Feature importance was analyzed by several methods, including Random Forest, XGBoost, Logistic Regression, SHAP values, and permutation importance, to ensure that the important determinants were captured. The most influential feature proved to be the number of curricular units approved in the second semester, which was consistently ranked first across all methods with an average importance score of 0.923. This shows that academic performance, especially in the latter part of the year, is one of the strongest factors in whether students remain or drop out, as highlighted by Nurmalitasari et al. (2023). Other studies similarly identify student engagement and achievement as among the best predictors of student retention. Financial stability also emerged as a significant determinant, with "tuition fees up to date" scoring highly in importance (mean score of 0.618), emphasizing the impact of financial constraints on student persistence. These findings align with previous research, such as a Peruvian study that also identified age, term, and financing method as critical dropout predictors, with Random Forest showing superior performance (AUC 0.9623). This emphasizes the importance of financial and contextual factors in both developed and developing countries, supporting the global relevance of these determinants (Jiménez et al., 2023). Tinto (2012) highlighted financial constraints as a key determinant of student attrition. Other socioeconomic factors, including unemployment and inflation rates, were important in capturing the wider context in which students' academic journeys unfold.
The course variable also shows moderate importance, indicating that different courses significantly influence dropout rates. This suggests that course difficulty, engagement, or relevance may affect student retention, highlighting the need for course-specific support and improvements. Parental qualifications and occupations also feature prominently, showing that students from relatively less educated families face more challenges in their academic paths. The multi-method strategy employed in this study substantiated the significance of these essential determinants, as their relevance for dropout prediction was validated across multiple models and feature-importance metrics.

CHAPTER 6. CONCLUSION AND RECOMMENDATIONS

6.1 Introduction

This concluding chapter summarizes the general contribution of the findings, provides actionable recommendations for higher education policy and practice, and suggests future research avenues.

6.2 Conclusion

The findings of the present research emphasize that sophisticated machine learning techniques, particularly ensemble methods such as XGBoost and Random Forest, can be applied effectively to the reliable prediction of student dropout in higher education, as evidenced by the high accuracy and AUC-ROC values these methods achieved. These models therefore provide institutions with an important tool for timely and effective intervention through early identification of dropout predictors. The research highlighted critical factors contributing to student dropout, revealing that academic achievement, financial stability, and socioeconomic conditions are the most significant influences. These results align with the current body of literature and emphasize the complex nature of student dropout, which cannot be ascribed to a single cause but rather to the interaction of academic, financial, and contextual elements.
Approved curricular units are highly relevant and tuition-fee status has deep implications, so universities should focus their efforts on academic support and financial aid as part of a retention strategy. Furthermore, the research contributes to the growing body of knowledge on dropout prediction by providing a comparative evaluation of ML models and offering practical insights into their strengths. Because the study used a single dataset, the generalization of the results to other educational contexts is limited. Moreover, this study did not consider non-quantifiable variables, such as students' motivations, social influences, and emotional comfort, which may also contribute to student dropout.

6.3 Recommendations

Despite the promising results achieved with ensemble methods like XGBoost and Random Forest in predicting higher education dropout, further advancements are needed for practical application and research. It is recommended that higher education institutions integrate these models into their student management systems for early identification of at-risk students. Emphasis should be placed on key determinants, namely academic support, financial aid, and student engagement initiatives, to address dropout factors effectively. Additionally, incorporating more diverse datasets and exploring advanced techniques such as deep learning could further enhance the models' predictive accuracy. Institutions should ensure the continuous evaluation and updating of these models to adapt to changing student behaviors and institutional dynamics, thus maintaining their effectiveness and relevance. In addition to higher learning institutions, other key stakeholders such as policymakers, government bodies, and non-governmental organizations (NGOs) play a crucial role in addressing student dropout.
Policymakers can implement nationwide strategies to improve access to education, while NGOs can provide financial and social support to at-risk students. Furthermore, the role of families and communities should not be overlooked, as they are instrumental in encouraging student retention.

REFERENCES

Aguiar, E., Lakkaraju, H., Bhanpuri, N., Miller, D., Yuhas, B., & Addison, K. L. (2015). Who, when, and why: A machine learning approach to prioritizing students at risk of not graduating high school on time. Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, 93–102. https://doi.org/10.1145/2723576.2723619
Aina, C., Baici, E., Casalone, G., & Pastore, F. (2022). The determinants of university dropout: A review of the socio-economic literature. Socio-Economic Planning Sciences, 79, 101102. https://doi.org/10.1016/j.seps.2021.101102
Alyahyan, E., & Düştegör, D. (2020). Predicting academic success in higher education: Literature review and best practices. International Journal of Educational Technology in Higher Education, 17(1), 3. https://doi.org/10.1186/s41239-020-0177-7
Andrade-Girón, D., Sandivar-Rosas, J., Marín-Rodriguez, W., Susanibar-Ramirez, E., Toro-Dextre, E., Ausejo-Sanchez, J., Villarreal-Torres, H., & Angeles-Morales, J. (2023). Predicting Student Dropout based on Machine Learning and Deep Learning: A Systematic Review. ICST Transactions on Scalable Information Systems. https://doi.org/10.4108/eetsis.3586
Andrea, M. (2024, March 12). College Dropout Rates in 2024: Higher Education Statistics. https://www.skillademia.com/blog/college-dropout-rates/
Araque, F., Roldán, C., & Salguero, A. (2009). Factors influencing university drop out rates. Computers & Education, 53(3), 563–574. https://doi.org/10.1016/j.compedu.2009.03.013
Asthana, P., & Hazela, B. (2020). Applications of Machine Learning in Improving Learning Environment. In S. Tanwar, S. Tyagi, & N. Kumar (Eds.), Multimedia Big Data Computing for IoT Applications (Vol.
163, pp. 417–433). Springer Singapore. https://doi.org/10.1007/978-981-13-8759-3_16
Baker, R., & Inventado, P. (2014). Educational Data Mining and Learning Analytics (pp. 61–75). https://doi.org/10.1007/978-1-4614-3305-7_4
Baker, R., & Siemens, G. (2014). Learning analytics and educational data mining. Cambridge Handbook of the Learning Sciences, 253–272.
Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2019). A Comparative Analysis of XGBoost. https://doi.org/10.48550/arXiv.1911.01914
Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol. 4). Springer. https://link.springer.com/book/9780387310732
Bowers, A. J., Sprott, R., & Taff, S. A. (2013). Do We Know Who Will Drop Out?: A Review of the Predictors of Dropping out of High School: Precision, Sensitivity, and Specificity. The High School Journal, 96(2), 77–100.
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Callender, C. (1999). The Hardship of Learning: Students' income and expenditure and their impact on participation in further education. Further Education Funding Council Coventry. https://www.voced.edu.au/content/ngv:68297
Casanova, J. R., Assis Gomes, C., Almeida, L. S., Tuero, E., & Bernardo, A. B. (2023). "If I were young…": Increased Dropout Risk of Older University Students. Revista Electrónica de Investigación Educativa, 25. https://doi.org/10.24320/redie.2023.25.e27.5671
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785
Chicco, D., & Jurman, G. (2020).
The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 6. https://doi.org/10.1186/s12864-019-6413-7
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018
Creswell, J. W., & Creswell, J. D. (2017). Research design: Qualitative, quantitative, and mixed methods approaches. Sage Publications.
Dake, D., & Buabeng-Andoh, C. (2022). Using Machine Learning Techniques to Predict Learner Drop-out Rate in Higher Educational Institutions. Mobile Information Systems, 2022, 1–9. https://doi.org/10.1155/2022/2670562
Dass, S., Gary, K., & Cunningham, J. (2021). Predicting student dropout in self-paced MOOC course using random forest model. Information, 12(11), 476.
Del Bonifro, F., Gabbrielli, M., Lisanti, G., & Zingaro, S. P. (2020). Student Dropout Prediction. In I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, & E. Millán (Eds.), Artificial Intelligence in Education (Vol. 12163, pp. 129–140). Springer International Publishing. https://doi.org/10.1007/978-3-030-52237-7_11
Fisher, A., Rudin, C., & Dominici, F. (2019). All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177), 1–81.
García, S., Luengo, J., & Herrera, F. (2015). Data Preprocessing in Data Mining (Vol. 72). Springer International Publishing. https://doi.org/10.1007/978-3-319-10247-4
GeeksforGeeks. (2024, February 22). Random Forest Algorithm in Machine Learning. GeeksforGeeks. https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/
Gilmore, E., Estivill-Castro, V., & Hexel, R. (2021). More Interpretable Decision Trees. In H. Sanjurjo González, I. Pastor López, P. García Bringas, H. Quintián, & E. Corchado (Eds.), Hybrid Artificial Intelligent Systems (pp. 280–292). Springer International Publishing.
https://doi.org/10.1007/978-3-030-86271-8_24
Guzmán, A., Barragán, S., & Cala Vitery, F. (2021). Dropout in Rural Higher Education: A Systematic Review. Frontiers in Education, 6. https://doi.org/10.3389/feduc.2021.727833
Hastie, T., Friedman, J., & Tibshirani, R. (2001). The Elements of Statistical Learning. Springer New York. https://doi.org/10.1007/978-0-387-21606-5
Hosmer, D. W., Jr., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. John Wiley & Sons. https://books.google.com/books?hl=en&lr=&id=bRoxQBIZRd4C&oi=fnd&pg=PR13&dq=Hosmer,+Lemeshow+%26+Sturdivant.+(2013).+Applied+Logistic+Regression.+&ots=kM2Nsu4Wde&sig=P5zmlP_6tVyNVe-F4xZGX91B_PE
Jagwani, A., & Aloysius, S. (2019). A review of machine learning in education.
Jiménez, O., Jesús, A., & Wong, L. (2023). Model for the Prediction of Dropout in Higher Education in Peru applying Machine Learning Algorithms: Random Forest, Decision Tree, Neural Network and Support Vector Machine. 2023 33rd Conference of Open Innovations Association (FRUCT), 116–124. https://doi.org/10.23919/FRUCT58615.2023.10143068
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260. https://doi.org/10.1126/science.aaa8415
Kabathova, J., & Drlik, M. (2021). Towards predicting student's dropout in university courses using different machine learning techniques. Applied Sciences, 11(7), 3130.
Kharb, L., & Singh, P. (2021). Role of Machine Learning in Modern Education and Teaching. In Impact of AI Technologies on Teaching, Learning, and Research in Higher Education (pp. 99–123). IGI Global. https://doi.org/10.4018/978-1-7998-4763-2.ch006
Kim, D., & Kim, S. (2018). Sustainable Education: Analyzing the Determinants of University Student Dropout by Nonlinear Panel Data Models. Sustainability, 10(4), Article 4. https://doi.org/10.3390/su10040954
Kirsch, D., & Hurwitz, J. (2018). Machine Learning For Dummies. Hoboken: IBM.
https://www.ibm.com/downloads/cas/GB8ZMQZ3
Kotsiantis, S. B., Pierrakeas, C. J., & Pintelas, P. E. (2003). Preventing Student Dropout in Distance Learning Using Machine Learning Techniques. In V. Palade, R. J. Howlett, & L. Jain (Eds.), Knowledge-Based Intelligent Information and Engineering Systems (Vol. 2774, pp. 267–274). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-45226-3_37
Kumar, M., Singh, A. J., & Handa, D. (2017). Literature survey on educational dropout prediction. International Journal of Education and Management Engineering, 7(2), 8.
Lamba, M., & Madhusudhan, M. (2022). Predictive Modeling. In M. Lamba & M. Madhusudhan (Eds.), Text Mining for Information Professionals: An Uncharted Territory (pp. 213–242). Springer International Publishing. https://doi.org/10.1007/978-3-030-85085-2_8
Larusson, J. A., & White, B. (Eds.). (2014). Learning Analytics: From Research to Practice. Springer New York. https://doi.org/10.1007/978-1-4614-3305-7
Lee, J., Kim, M., Kim, D., & Gil, J.-M. (2021). Evaluation of Predictive Models for Early Identification of Dropout Students. Journal of Information Processing Systems, 17(3). https://s3.ap-northeast-2.amazonaws.com/journalhome/journal/jips/fullText/594/jips14.pdf
Letseka, M., & Breier, M. (2008). Student poverty in higher education: The impact of higher education dropout on poverty. Education and Poverty Reduction Strategies: Issues of Policy Coherence: Colloquium Proceedings, 83–101. https://www.researchgate.net/profile/Ursula-Hoadley2/publication/237260645_The_boundaries_of_care_Education_policy_interventions_for_vulnerable_children/links/544e2fc20cf26dda088e5e3a/The-boundaries-of-care-Education-policy-interventions-for-vulnerable-children.pdf#page=107
Lugyi, N. (2024). Gender Differences in Dropout Rates Across Course Types in Higher Education. International Journal of Education and Research, 12(5), 89.
Lundberg, S. (2017). A unified approach to interpreting model predictions.
arXiv Preprint arXiv:1705.07874. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, G., & Loumos, V. (2009). Dropout prediction in e-learning courses through the combination of machine learning techniques. Computers & Education, 53(3), 950–965.
Magolda, M., & Astin, A. (1993). What Matters in College: Four Critical Years Revisited. Educational Researcher, 22. https://doi.org/10.2307/1176821
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://dl.acm.org/doi/abs/10.5555/1394399
Marbouti, F., Diefes-Dux, H. A., & Madhavan, K. (2016). Models for early prediction of at-risk students in a course using standards-based grading. Computers & Education, 103, 1–15. https://doi.org/10.1016/j.compedu.2016.09.005
Mariano, A. M., Ferreira, A. B. de M. L., Santos, M. R., Castilho, M. L., & Bastos, A. C. F. L. C. (2022). Decision trees for predicting dropout in Engineering Course students in Brazil. Procedia Computer Science, 214, 1113–1120.
Martins, M. V., Tolledo, D., Machado, J., Baptista, L. M. T., & Realinho, V. (2021). Early Prediction of student's Performance in Higher Education: A Case Study. In Á. Rocha, H. Adeli, G. Dzemyda, F. Moreira, & A. M. Ramalho Correia (Eds.), Trends and Applications in Information Systems and Technologies (Vol. 1365, pp. 166–175). Springer International Publishing. https://doi.org/10.1007/978-3-030-72657-7_16
Mduma, N. (2023). Data Balancing Techniques for Predicting Student Dropout Using Machine Learning. Data, 8(3), Article 3. https://doi.org/10.3390/data8030049
Mishra, A., Gupta, D., & Chetty, G. (Eds.). (2023). Advances in IoT and Security with Computational Intelligence: Proceedings of ICAISA 2023, Volume 2 (Vol. 756). Springer Nature Singapore. https://doi.org/10.1007/978-981-99-5088-1
Mitchell, T. M. (1997). Machine learning (Vol. 1).
McGraw-Hill New York. http://www.pachecoj.com/courses/csc380_fall21/lectures/mlintro.pdf
Morgan, P. L., Farkas, G., Hillemeier, M. M., & Maczuga, S. (2009). Risk Factors for Learning-Related Behavior Problems at 24 Months of Age: Population-Based Estimates. Journal of Abnormal Child Psychology, 37(3), 401–413. https://doi.org/10.1007/s10802-008-9279-8
Nimy, E., Mosia, M., & Chibaya, C. (2023). Identifying At-Risk Students for Early Intervention—A Probabilistic Machine Learning Approach. Applied Sciences, 13(6), 3869.
Niyogisubizo, J., Liao, L., Nziyumva, E., Murwanashyaka, E., & Nshimyumukiza, P. C. (2022). Predicting student's dropout in university classes using two-layer ensemble machine learning approach: A novel stacked generalization. Computers and Education: Artificial Intelligence, 3, 100066. https://doi.org/10.1016/j.caeai.2022.100066
Nurmalitasari, N., Awang Long, Z., & Mohd Noor, F. (2023). Factors Influencing Dropout Students in Higher Education. Education Research International, 2023, 1–13. https://doi.org/10.1155/2023/7704142
OECD. (2009). How many students drop out of tertiary education? In Highlights from Education at a Glance 2008 (pp. 24–26). OECD Publishing Paris.
Oqaidi, K., Aouhassi, S., & Mansouri, K. (2022). Towards a students' dropout prediction model in higher education institutions using machine learning algorithms. International Journal of Emerging Technologies in Learning (iJET), 17(18), 103–117.
Osborne, J. B., & Lang, A. S. (2023). Predictive Identification of At-Risk Students: Using Learning Management System Data. Journal of Postsecondary Student Success, 2(4), 108–126.
Park, H., & Yoo, S. (2021). Early Dropout Prediction in Online Learning of University using Machine Learning. JOIV: International Journal on Informatics Visualization, 5(4), 347. https://doi.org/10.30630/joiv.5.4.732
Paura, L., & Arhipova, I. (2014). Cause analysis of students' dropout rate in higher education study program.
Procedia-Social and Behavioral Sciences, 109, 1282–1286.
Powdthavee, N., & Vignoles, A. (2008). The Socio-Economic Gap in University Drop Out.
Ramraj, S., Uzir, N., Sunil, R., & Banerjee, S. (2016). Experimenting XGBoost algorithm for prediction and classification of different datasets. International Journal of Control Theory and Applications, 9(40), 651–662.
Romero, C., Ventura, S., Pechenizkiy, M., & Baker, R. S. J. d. (2010). Handbook of Educational Data Mining. CRC Press.
Rumberger, R. (2011). Dropping Out: Why Students Drop Out of High School and What Can Be Done About It. https://doi.org/10.4159/harvard.9780674063167
Rumberger, R. (2020). The economics of high school dropouts (pp. 149–158). https://doi.org/10.1016/B978-0-12-815391-8.00012-4
Rumberger, R. W., & Lim, S. A. (2008). Why students drop out of school: A review of 25 years of research. https://www.issuelab.org/resources/11658/11658.pdf
Saad Hussein, A., Li, T., Yohannese, C. W., & Bashir, K. (2019). A-SMOTE: A New Preprocessing Approach for Highly Imbalanced Datasets by Improving SMOTE. International Journal of Computational Intelligence Systems, 12(2), 1412. https://doi.org/10.2991/ijcis.d.191114.002
Saeed, S., Ahmed, S., & Joseph, S. (2024). Machine Learning in the Big Data Age: Advancements, Challenges, and Future Prospects.
Sammut, C., & Webb, G. I. (2011). Encyclopedia of machine learning. Springer Science & Business Media. https://books.google.com/books?hl=en&lr=&id=i8hQhp1a62UC&oi=fnd&pg=PA3&dq=Claude+Sammut+%26+Geoffrey+I.+Webb.+(2011).+Encyclopedia+of+Machine+Learning.+Springer.&ots=92kazyjGaQ&sig=JuTFAewGWe60D0z-R1N760LkRs4
Sara, N.-B., Halland, R., Igel, C., & Alstrup, S. (2015). High-School Dropout Prediction Using Machine Learning: A Danish Large-scale Study. Proceedings of ESANN 2015.
https://books.google.com/books?hl=en&lr=&id=USGLCgAAQBAJ&oi=fnd&pg=PA319&dq=NicolaeBogdan+Sara,+Rasmus+Halland,+Christian+Igel,+%26+Stephen+Alstrup.+(2015).+High-School+Dropout+Prediction+Using+Machine+Learning:+A+Danish+Large-scale+Study.+s,+European+Symposium+on+Artificial+Neural+Networks.&ots=FuebiuJZSN&sig=-HQgEmXHwGa8vY5MIV0YfjrE4Qo
Shaun, R., De Baker, J., & Inventado, P. S. (2014). Chapter 4: Educational Data Mining and Learning Analytics. Springer.
Smelser, N. J., & Baltes, P. B. (2001). International encyclopedia of the social & behavioral sciences (Vol. 11). Elsevier Amsterdam. http://www.law.harvard.edu/faculty/shavell/pdf/12_Inter_Ency_Soc_8446.pdf
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437.
Stinebrickner, R., & Stinebrickner, T. (2014). Academic Performance and College Dropout: Using Longitudinal Expectations Data to Estimate a Learning Model. Journal of Labor Economics, 32(3), 601–644. https://doi.org/10.1086/675308
SYDLE. (2024). University Dropout: Why Does It Happen And How Can You Prevent It? Blog SYDLE. https://www.sydle.com/blog/university-dropout-639a22f22ff02745fa4eface
Thayer, P. B. (2000). Retention of Students from First Generation and Low Income Backgrounds. Council for Opportunity in Education. https://eric.ed.gov/?id=ED446633
Tinto, V. (2012). Leaving college: Rethinking the causes and cures of student attrition. University of Chicago Press. https://books.google.com/books?hl=en&lr=&id=TlVhEAAAQBAJ&oi=fnd&pg=PR7&dq=Tinto,+V.+(1993).+Leaving+College:+Rethinking+the+Causes+and+Cures+of+Student+Attrition.+&ots=yg1VO4MTs1&sig=mhuyHlM7eA7Ty3vLzdAfRpYIlxA
Villegas-Ch, W., Govea, J., & Revelo-Tapia, S. (2023). Improving Student Retention in Institutions of Higher Education through Machine Learning: A Sustainable Approach. Sustainability, 15(19), Article 19.
https://doi.org/10.3390/su151914512

Appendix A: Distribution of all Categorical Variables

Each table reports, for every category of a variable, its overall count and percentage, followed by the dropout and non-dropout counts and percentages within that category.

Marital status
Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
Single | 3919 | 88.58 | 1184 | 30.21 | 2735 | 69.79
Married | 379 | 8.57 | 179 | 47.23 | 200 | 52.77
Divorced | 91 | 2.06 | 42 | 46.15 | 49 | 53.85
Facto union | 25 | 0.57 | 11 | 44 | 14 | 56
Legally separated | 6 | 0.14 | 4 | 66.67 | 2 | 33.33
Widower | 4 | 0.09 | 1 | 25 | 3 | 75

Application mode
Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
1st phase—general contingent | 1708 | 38.61 | 345 | 20.2 | 1363 | 79.8
2nd phase—general contingent | 872 | 19.71 | 256 | 29.36 | 616 | 70.64
Over 23 years old | 785 | 17.74 | 435 | 55.41 | 350 | 44.59
Change in course | 312 | 7.05 | 115 | 36.86 | 197 | 63.14
Technological specialization diploma holders | 213 | 4.81 | 63 | 29.58 | 150 | 70.42
Holders of other higher courses | 139 | 3.14 | 85 | 61.15 | 54 | 38.85
3rd phase—general contingent | 124 | 2.8 | 45 | 36.29 | 79 | 63.71
Transfer | 77 | 1.74 | 34 | 44.16 | 43 | 55.84
Change in institution/course | 59 | 1.33 | 20 | 33.9 | 39 | 66.1
1st phase—special contingent (Madeira Island) | 38 | 0.86 | 5 | 13.16 | 33 | 86.84
Short cycle diploma holders | 35 | 0.79 | 4 | 11.43 | 31 | 88.57
International student (bachelor) | 30 | 0.68 | 5 | 16.67 | 25 | 83.33
1st phase—special contingent (Azores Island) | 16 | 0.36 | 2 | 12.5 | 14 | 87.5
Ordinance No. 854-B/99 | 10 | 0.23 | 3 | 30 | 7 | 70
Ordinance No. 612/93 | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
Ordinance No. 533-A/99, item b2) (Different Plan) | 1 | 0.02 | 1 | 100 | 0 | 0
Ordinance No. 533-A/99, item b3 (Other Institution) | 1 | 0.02 | 1 | 100 | 0 | 0
Change in institution/course (International) | 1 | 0.02 | 0 | 0 | 1 | 100

Application order
Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
1 | 3026 | 68.4 | 1053 | 34.8 | 1973 | 65.2
2 | 547 | 12.36 | 150 | 27.42 | 397 | 72.58
3 | 309 | 6.98 | 76 | 24.6 | 233 | 75.4
4 | 249 | 5.63 | 58 | 23.29 | 191 | 76.71
5 | 154 | 3.48 | 53 | 34.42 | 101 | 65.58
6 | 137 | 3.1 | 31 | 22.63 | 106 | 77.37
9 | 1 | 0.02 | 0 | 0 | 1 | 100
0 | 1 | 0.02 | 0 | 0 | 1 | 100

Course
Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
Nursing | 766 | 17.31 | 118 | 15.4 | 648 | 84.6
Management | 380 | 8.59 | 134 | 35.26 | 246 | 64.74
Social Service | 355 | 8.02 | 65 | 18.31 | 290 | 81.69
Veterinary Nursing | 337 | 7.62 | 90 | 26.71 | 247 | 73.29
Journalism and Communication | 331 | 7.48 | 101 | 30.51 | 230 | 69.49
Advertising and Marketing Management | 268 | 6.06 | 95 | 35.45 | 173 | 64.55
Management (evening attendance) | 268 | 6.06 | 136 | 50.75 | 132 | 49.25
Tourism | 252 | 5.7 | 96 | 38.1 | 156 | 61.9
Communication Design | 226 | 5.11 | 51 | 22.57 | 175 | 77.43
Animation and Multimedia Design | 215 | 4.86 | 82 | 38.14 | 133 | 61.86
Social Service (evening attendance) | 215 | 4.86 | 71 | 33.02 | 144 | 66.98
Agronomy | 210 | 4.75 | 86 | 40.95 | 124 | 59.05
Basic Education | 192 | 4.34 | 85 | 44.27 | 107 | 55.73
Informatics Engineering | 170 | 3.84 | 92 | 54.12 | 78 | 45.88
Equinculture | 141 | 3.19 | 78 | 55.32 | 63 | 44.68
Oral Hygiene | 86 | 1.94 | 33 | 38.37 | 53 | 61.63
Biofuel Production Technologies | 12 | 0.27 | 8 | 66.67 | 4 | 33.33

Daytime/evening attendance
Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
Day | 3941 | 89.08 | 1214 | 30.8 | 2727 | 69.2
Evening | 483 | 10.92 | 207 | 42.86 | 276 | 57.14

Previous qualification
Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
Secondary education | 3717 | 84.02 | 1078 | 29 | 2639 | 71
Technological specialization course | 219 | 4.95 | 69 | 31.51 | 150 | 68.49
Basic education 3rd cycle (9th/10th/11th year) or equivalent | 162 | 3.66 | 104 | 64.2 | 58 | 35.8
Higher education—degree | 126 | 2.85 | 75 | 59.52 | 51 | 40.48
Other—11th year of schooling | 45 | 1.02 | 26 | 57.78 | 19 | 42.22
Higher education—degree (1st cycle) | 40 | 0.9 | 14 | 35 | 26 | 65
Professional higher technical course | 36 | 0.81 | 6 | 16.67 | 30 | 83.33
Higher education—bachelor's degree | 23 | 0.52 | 16 | 69.57 | 7 | 30.43
Frequency of higher education | 16 | 0.36 | 7 | 43.75 | 9 | 56.25
12th year of schooling—not completed | 11 | 0.25 | 11 | 100 | 0 | 0
Higher education—master's degree | 8 | 0.18 | 4 | 50 | 4 | 50
Basic education 2nd cycle (6th/7th/8th year) or equivalent | 7 | 0.16 | 3 | 42.86 | 4 | 57.14
Higher education—master's degree (2nd cycle) | 6 | 0.14 | 2 | 33.33 | 4 | 66.67
11th year of schooling—not completed | 4 | 0.09 | 3 | 75 | 1 | 25
10th year of schooling—not completed | 2 | 0.05 | 1 | 50 | 1 | 50
10th year of schooling | 1 | 0.02 | 1 | 100 | 0 | 0
Higher education—doctorate | 1 | 0.02 | 1 | 100 | 0 | 0

Nationality
Category | Overall Count | Overall % | Dropout Count | Dropout % | Non-Dropout Count | Non-Dropout %
Portuguese | 4314 | 97.51 | 1389 | 32.2 | 2925 | 67.8
Brazilian | 38 | 0.86 | 14 | 36.84 | 24 | 63.16
Santomean | 14 | 0.32 | 1 | 7.14 | 13 | 92.86
Spanish | 13 | 0.29 | 4 | 30.77 | 9 | 69.23
Cape Verdean | 13 | 0.29 | 4 | 30.77 | 9 | 69.23
Guinean | 5 | 0.11 | 1 | 20 | 4 | 80
Italian | 3 | 0.07 | 0 | 0 | 3 | 100
Moldova (Republic of) | 3 | 0.07 | 2 | 66.67 | 1 | 33.33
Ukrainian | 3 | 0.07 | 1 | 33.33 | 2 | 66.67
German | 2 | 0.05 | 0 | 0 | 2 | 100
Angolan | 2 | 0.05 | 1 | 50 | 1 | 50
Mozambican | 2 | 0.05 | 0 | 0 | 2 | 100
Romanian | 2 | 0.05 | 0 | 0 | 2 | 100
Mexican | 2 | 0.05 | 1 | 50 | 1 | 50
Russian | 2 | 0.05 | 1 | 50 | 1 | 50
Dutch | 1 | 0.02 | 0 | 0 | 1 | 100
English | 1 | 0.02 | 0 | 0 | 1 | 100
Lithuanian | 1 | 0.02 | 1 | 100 | 0 | 0
Turkish | 1 | 0.02 | 0 | 0 | 1 | 100
Cuban | 1 | 0.02 | 0 | 0 | 1 | 100
Colombian | 1 | 0.02 | 1 | 100 | 0 | 0

[The remaining tables of the original appendix (Mother's qualification, Father's qualification, Mother's occupation, Father's occupation, Displaced, Educational special needs, Debtor, Tuition fees up to date, Gender, Scholarship holder, and International) follow the same format.]
(except food) and street service providers Yes 1 0.02 0 0 1 100 1 1 0.02 0.02 0 0 0 0 1 1 100 100 1 2426 0.02 54.84 0 669 0 27.58 1 1757 100 72.42 No No 1998 4373 45.16 98.85 752 1404 37.64 32.11 1246 2969 62.36 67.89 Yes No Yes 51 3921 503 1.15 88.63 11.37 17 1109 312 33.33 28.28 62.03 34 2812 191 66.67 71.72 37.97 Yes 3896 88.07 964 24.74 2932 75.26 No Yes Yes No 528 2868 1556 3325 11.93 64.83 35.17 75.16 457 720 701 1287 86.55 25.1 45.05 38.71 71 2148 855 2038 13.45 74.9 54.95 61.29 Yes 1099 24.84 134 12.19 965 87.81 No Yes 4314 110 97.51 2.49 1389 32 32.2 29.09 2925 78 67.8 70.91 59 THESIS 15 ORIGINALITY REPORT % SIMILARITY INDEX 12% INTERNET SOURCES 9% PUBLICATIONS PRIMARY SOURCES 1 www.mdpi.com 2 Submitted to University of Rwanda 3 wiredspace.wits.ac.za 4 dspace.nm-aist.ac.tz 5 www.ijisae.org Internet Source Student Paper Internet Source Internet Source Internet Source 8% STUDENT PAPERS 1% <1 % <1 % <1 % <1 % Submitted to The African Institute for Mathematical Sciences <1 % 7 Submitted to Coventry University <1 % 8 dr.ur.ac.rw 9 Submitted to Aston University 6 Student Paper Student Paper Internet Source Student Paper <1 % <1 % 10 11 ebin.pub Internet Source Submitted to Asia Pacific University College of Technology and Innovation (UCTI) Student Paper 12 Hemant Kumar Soni, Sanjiv Sharma, G. R. Sinha. 
"Text and Social Media Analytics for Fake News and Hate Speech Detection", CRC Press, 2024 Publication 13 Submitted to University of Strathclyde 14 Submitted to Addis Ababa University 15 academic-accelerator.com 16 www.frontiersin.org 17 dione.lib.unipi.gr 18 etd.repository.ugm.ac.id 19 Submitted to University of Northampton Student Paper Student Paper Internet Source Internet Source Internet Source Internet Source Student Paper <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % 20 "Deep Learning and Visual Artificial Intelligence", Springer Science and Business Media LLC, 2024 Publication 21 22 afribary.com Internet Source "Proceedings of Eighth International Congress on Information and Communication Technology", Springer Science and Business Media LLC, 2024 Publication 23 24 pdfs.semanticscholar.org Internet Source 26 <1 % <1 % <1 % Submitted to Jawaharlal Nehru Technological University <1 % etd.hu.edu.et <1 % Student Paper 25 <1 % Internet Source "Recent Advances on Soft Computing and Data Mining", Springer Science and Business Media LLC, 2024 Publication 27 irbackend.kiu.ac.ug 28 Submitted to Uganda Christian University Internet Source Student Paper <1 % <1 % <1 % 29 Submitted to UCL 30 dspace.cbe.ac.tz:8080 31 Student Paper Internet Source Submitted to Queen Margaret University College, Edinburgh Student Paper 32 Qurban A. Memon, Shakeel Ahmed Khoja. "Data Science - Theory, Analysis, and Applications", CRC Press, 2019 Publication 33 ijern.com 34 xml.jips-k.org Internet Source Internet Source <1 % <1 % <1 % <1 % <1 % <1 % Submitted to University of Technology, Sydney <1 % 36 listens.online <1 % 37 scholar.mzumbe.ac.tz 38 www.scielo.br 35 Student Paper Internet Source Internet Source Internet Source <1 % <1 % 39 Badrul H. Khan, Joseph Rene Corbeil, Maria Elena Corbeil. 
"Responsible Analytics and Data Mining in Education - Global Perspectives on Quality, Support, and Decision Making", Routledge, 2018 Publication 40 dokumen.pub 41 dspace.lib.uom.gr 42 erepository.uonbi.ac.ke:8080 43 github.com 44 journaleet.in 45 ulspace.ul.ac.za 46 Internet Source Internet Source Internet Source Internet Source Internet Source Internet Source "Breaking Barriers with Generative Intelligence. Using GI to Improve Human Education and Well-Being", Springer Science and Business Media LLC, 2024 Publication 47 "Recent Trends in Image Processing and Pattern Recognition", Springer Science and Business Media LLC, 2024 <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % Publication 48 Safira Begum, M. V. Ashok. "A novel approach to mitigate academic underachievement in higher education: Feature selection, classifier performance, and interpretability in predicting student performance", International Journal of ADVANCED AND APPLIED SCIENCES, 2024 Publication 49 Sheikh Wakie Masood, Munmi Gogoi, Shahin Ara Begum. "Optimised SMOTE-based Imbalanced Learning for Student Dropout Prediction", Arabian Journal for Science and Engineering, 2024 Publication 50 51 management.uta.edu Internet Source "Artificial Intelligence and Knowledge Processing", Springer Science and Business Media LLC, 2024 Publication 52 53 phd-dissertations.unizik.edu.ng Internet Source Catherine Régis, Jean-Louis Denis, Maria Luciana Axente, Atsuo Kishimoto. "HumanCentered AI - A Multidisciplinary Perspective <1 % <1 % <1 % <1 % <1 % <1 % for Policy-Makers, Auditors, and Users", CRC Press, 2024 Publication 54 55 Submitted to Glyndwr University Student Paper Matti Vaarma, Hongxiu Li. 
"Predicting student dropouts with machine learning: An empirical study in Finnish higher education", Technology in Society, 2024 Publication <1 % <1 % Submitted to University of Wales Institute, Cardiff <1 % 57 Submitted to University of Witwatersrand <1 % 58 hdl.handle.net 59 Submitted to msu 60 www.aimspress.com 61 www.ijeast.com 62 Submitted to Intercollege 56 Student Paper Student Paper Internet Source Student Paper Internet Source Internet Source Student Paper <1 % <1 % <1 % <1 % <1 % 63 ir-library.ku.ac.ke 64 ojs.ais.cn 65 rc.library.uta.edu 66 uir.unisa.ac.za 67 www.grossarchive.com 68 Submitted to Grenoble Ecole Management 69 Internet Source Internet Source Internet Source Internet Source Internet Source Student Paper Mukhtar Abdi Hassan, Abdisalam Hassan Muse, Saralees Nadarajah. "Predicting Student Dropout Rates Using Supervised Machine Learning: Insights from the 2022 National Education Accessibility Survey in Somaliland", Applied Sciences, 2024 Publication 70 Poonam Tanwar, Tapas Kumar, K. Kalaiselvi, Haider Raza, Seema Rawat. "Predictive Data Modelling for Biomedical Data and", River Publishers, 2024 Publication 71 Submitted to Technological University Dublin Student Paper <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % 72 Submitted to Trine University 73 Submitted to University of Reading 74 repository.out.ac.tz 75 www.escholar.manchester.ac.uk 76 www.researchgate.net 77 www.tara.tcd.ie 78 www2.mdpi.com Student Paper Student Paper Internet Source Internet Source Internet Source Internet Source Internet Source <1 % <1 % <1 % <1 % <1 % <1 % <1 % Adwitiya Sinha, Megha Rathi. 
"Smart Healthcare Systems", CRC Press, 2019 <1 % 80 Submitted to University of Birmingham <1 % 81 etd.uwc.ac.za 82 export.arxiv.org 83 fastercapital.com 79 Publication Student Paper Internet Source Internet Source <1 % <1 % Internet Source 84 iieta.org 85 www.ijraset.com 86 www.springerprofessional.de 87 Internet Source Internet Source Internet Source Kok-Kwang Phoon, Takayuki Shuku, Jianye Ching. "Uncertainty, Modeling, and Decision Making in Geotechnics", CRC Press, 2023 Publication 88 Thomas Mgonja, Francisco Robles. "Identifying Critical Factors When Predicting Remedial Mathematics Completion Rates", Journal of College Student Retention: Research, Theory & Practice, 2022 Publication 89 Ton Duc Thang University 90 arxiv.org 91 crm-en.ics.org.ru 92 dissertations.mak.ac.ug Publication Internet Source Internet Source Internet Source <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % <1 % 93 edoc.ub.uni-muenchen.de 94 ir.knust.edu.gh 95 www.conftool.com <1 % Internet Source <1 % Internet Source <1 % Internet Source Exclude quotes On Exclude bibliography On Exclude matches < 10 words
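Each row of the binary-attribute table above reports, for one category, the number of students, their share of the sample, and the percentage split between dropouts and non-dropouts. As an illustration of the underlying arithmetic, such a breakdown can be produced with a pandas cross-tabulation. This is a minimal sketch on toy data only; the column names scholarship_holder and target are illustrative assumptions, not taken from the thesis's actual code or dataset files.

```python
import pandas as pd

# Toy data mimicking the dataset's structure: one row per student,
# a binary attribute and the enrolment outcome.
df = pd.DataFrame({
    "scholarship_holder": ["Yes", "No", "No", "Yes", "No", "No"],
    "target": ["Graduate", "Dropout", "Dropout",
               "Graduate", "Graduate", "Dropout"],
})

is_dropout = df["target"] == "Dropout"

# Absolute counts of dropouts vs. non-dropouts within each category.
counts = pd.crosstab(df["scholarship_holder"], is_dropout)

# Row percentages: the dropout / non-dropout share within each category,
# matching the "Dropout n (%)" columns of the table above.
row_pct = pd.crosstab(df["scholarship_holder"], is_dropout,
                      normalize="index") * 100

print(counts)
print(row_pct.round(2))
```

On the full dataset the same two calls yield, for every binary attribute, exactly the count and percentage columns shown in the table; normalize="index" is what turns the raw cell counts into within-category percentages.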