Absenteeism
All content following this page was uploaded by Sifat Momen on 06 April 2023.
Abstract—High absenteeism among employees can be detrimental to an organization as it can result in productivity and economic loss. This paper looks into a case of absenteeism in a courier company in Brazil. Machine learning techniques have been employed to understand and predict absenteeism. Understanding this would provide human resource managers with an excellent decision aid for creating policies that aim to reduce absenteeism. Data has been preprocessed, and several machine learning classification algorithms (such as zeroR, tree-based J48, naive Bayes, and KNN) have been applied. The paper reports models that can predict absenteeism with an accuracy of over 92%. Furthermore, from an initial set of 20 attributes, disciplinary failure turns out to be a very prominent feature in predicting absenteeism.

Index Terms—absenteeism, prediction, data mining, classification

978-0-7381-4403-0/20/$31.00 ©2023 IEEE

I. INTRODUCTION

In Human Resources and Management (HRM), employees are regarded as valuable assets to their organization. Employee productivity directly contributes to the work efficiency and the success of a company. The economic sustainability of a company lies in its revenue, which is generated only with sufficient support from the employees. If employees underperform, the organization incurs high costs. Consequently, in good organizations, HRM professionals have to work prudently to ensure that employee motivation stays high enough to yield target productivity [1].

One key indicator of low motivation in an organization is high employee absenteeism. Absence from work can arise due to an array of factors including age, health condition, bad work-life balance, low motivation, organizational culture, policies, pay scale, poor recognition in the organization, and many more. Understanding why employees stay absent and recognizing the patterns in absenteeism behavior is an invaluable tool for human resource managers. This understanding, in turn, would help organizations shape their policies and organizational culture to become more people-driven, which is expected to result in higher productivity [2].

In this paper, we present a prediction of absenteeism at work using data mining techniques in a courier company in Brazil. The dataset [3] is publicly available in Kaggle and comprises a total of three years of data (from July 2007 to July 2010) containing details of absences in the company. Whenever employees were absent, the details of their absence were recorded. The novelty of the paper lies in the use of data mining techniques (particularly classification) to predict absenteeism. Such a methodology can be applied to understand absenteeism behavior in other companies and to develop a decision-aid system that can help managers in controlling it.

The rest of the paper is organized as follows: Section 2 presents a critical literature review. This is followed by our research methodology in Section 3. Experimental results are presented in Section 4. Finally, the paper is concluded in Section 5 with the highlights of the results obtained.

II. BACKGROUND

Absenteeism is defined as the temporary withdrawal from work due to personal reasons, including illness and demise of close relatives [4]. Cucchiella and colleagues, on the other hand, describe absenteeism as an employee's habitual absence from work [5]. Despite the differences in opinion, high absenteeism is always found to be correlated with high productivity loss and economic loss. Research into why employees stay absent is a very active area, partly because absenteeism often has serious economic consequences. Different factors have been found to contribute towards absenteeism. Arai and Thoursie [6], for instance, found that incentives have a crucial role to play in absenteeism. Using industry-region panel data, they were able to establish a negative correlation between sick rates and shares of temporary contracts. Arai and Thoursie argue that this is because workers on temporary contracts have lower job security and hence a higher possibility of being laid off compared to the employees on time-unlimited contracts. Hence, workers on temporary contracts tend to be more loyal in the hope of a renewal of contracts. Hesselius [7] found that absenteeism is negatively correlated with unemployment. This is because when unemployment is high, the probability of an individual finding a new job is low. This acts as an
incentive for the individual to be absent from work less often and thus reduce their probability of being laid off. Winkelmann [8] found that absenteeism is dependent on factors such as wages and even the firm size. Organizational policies also impact the level of absenteeism [9]. For example, Halpern and colleagues [10] found that the smoking policy in the workplace affects absenteeism and productivity. Their research concludes that current smokers tend to have significantly higher absenteeism than former smokers and non-smokers. Absenteeism is repeatedly reported to be strongly correlated with employees' health status. For instance, Tunceli and colleagues [11] found that for employees with diabetes, the absolute probability of working is 7.1 and 4.4 percentage points lower for male and female employees, respectively, than for individuals without diabetes. Gates and colleagues [12] found that moderately or extremely obese workers experience a 4.2% loss in productivity, which is tantamount to 1.18% more than all other employees. Shah and colleagues [13] used deep neural networks to predict absenteeism before employees are actually hired. Research into absenteeism behavior reveals that it depends on various factors that act as incentives (directly or indirectly), including organizational culture, policies (both national and organizational), size of the organization, and many more.

III. RESEARCH WORK

This paper looks into the factors affecting absenteeism in a courier company in Brazil. Whenever an employee is absent, he/she needs to fill out a form detailing the reasons for absence. This, in conjunction with other personal details, has been recorded in a dataset. The dataset was first published by Martiniano and colleagues [3] and is now publicly available in Kaggle.

A. Dataset

The dataset comprises a total of 740 instances recorded over 21 attributes per instance. The dataset is a collection of 3 years of data from July 2007 to July 2010. Information pertaining to health, work, workload, habit, traveling, and details of absence has been incorporated into the dataset. Table I shows the details of the attributes recorded in the dataset. One particular attribute to note in the dataset is the reason for absence, which is an important attribute to investigate. Table II describes the various reasons for absence, along with the percentage of occurrence of each. The dataset is small, so deep learning techniques are unlikely to be effective, whereas classical data mining techniques are able to predict with high accuracy. Hence, we refrained from using any deep learning techniques. We split the dataset into training and test sets in an 80-20 ratio with a stratified class attribute distribution.

TABLE I
ATTRIBUTES AND THEIR DESCRIPTION

B. Research Methodology

Our prime objective is to develop a model that can predict absenteeism (with high accuracy) in the courier company to aid managers in decision-making. Since there exist different reasons for absence among employees and many factors contribute towards absenteeism, a deductive learning approach is an infeasible option. We instead use the inductive learning approach, where we use data mining techniques to predict absenteeism. Scikit-learn [14], a Python library for data science, has been used to carry out the data mining tasks. The research methodology embraced in this work is outlined in Figure 1.

The raw data (the Absenteeism dataset) is first preprocessed to a form that is suitable for applying machine learning algorithms. The preprocessing phase includes data cleaning (attribute removal and marking of missing labels), data discretization (i.e., converting continuous data into discrete categories), and data transformation (converting data from a numeric form to categorical). After the preprocessing step,
TABLE II
PREPROCESSING FOR THE ATTRIBUTE "REASON OF ABSENCE"
TABLE IV
FEATURE SCORES BY RELIEF ATTRIBUTE EVALUATOR
TABLE VI
TYPES OF EXPERIMENTS
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{3}$$
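As a small illustration of Eq. (3), the sketch below computes this sum-of-absolute-differences (Manhattan) distance next to the Chebyshev distance that the results mention, and then selects each one as the metric of a scikit-learn KNN classifier. The toy points, the helper function names, and `n_neighbors=3` are assumptions made for this example; the surviving text does not state the settings actually used.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def manhattan_distance(x, y):
    """Eq. (3): sum of absolute coordinate differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

def chebyshev_distance(x, y):
    """Largest absolute coordinate difference."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.max(np.abs(x - y)))

x = [1.0, 5.0, 2.0]
y = [4.0, 1.0, 2.0]
d_manhattan = manhattan_distance(x, y)  # |1-4| + |5-1| + |2-2| = 7
d_chebyshev = chebyshev_distance(x, y)  # max(3, 4, 0) = 4

# Toy two-cluster data (hypothetical), used only to show how the metric is chosen.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.5, 0.5], [5.5, 5.5]])

predictions = {}
for metric in ("manhattan", "chebyshev"):
    # n_neighbors=3 is an illustrative choice, not a value from the paper.
    model = KNeighborsClassifier(n_neighbors=3, metric=metric).fit(X_toy, y_toy)
    predictions[metric] = model.predict(queries).tolist()
```

On well-separated toy clusters both metrics give the same predictions; on real data the choice of metric can change which neighbours are nearest and therefore the predicted class.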
for experiment A. The KNN, naive Bayes, and J48 classifiers yield the highest accuracy in experiment C if applied to the dataset without the SMOTE filter. Applying the SMOTE filter to the data and conducting experiment C results in a significant reduction of performance. For experiment B, the performance of the naive Bayes classifier falls with or without SMOTE. However, the J48 classifier's performance stays more or less the same regardless of the sampling strategy. The highest measure of accuracy overall is 92.3%, achieved by the KNN classifier with the Chebyshev distance metric in experiment B after applying the SMOTE filter to the training set. For this model, the accuracy for class A is 67%, for class B it is 92.1%, and for class C it is 8.3%. On another note, the experiment indicates that disciplinary failure is a very influential attribute for determining absenteeism, as all classifiers other than the baseline achieve over 90% accuracy with it when SMOTE is not applied. Experiment B, comprising all attributes, is a good indicator of the performance of the classifiers and reflects a more realistic view, because the class imbalance problem otherwise causes overestimations in performance. An imbalanced dataset can sometimes lead to high bias. When the bias is removed owing to SMOTE, the performance naturally decreases.

V. CONCLUSION

This paper looks into the absenteeism dataset, a dataset detailing information about absence records. There were a total of 740 instances and 21 initial attributes. After careful preprocessing and feature selection, machine learning algorithms were applied. Three kinds of experiments with different subsets of features were devised. It has been found that the model can predict absenteeism with over 92% accuracy.

REFERENCES

[1] C. Navarro and C. Bass, "The cost of employee absenteeism," Compensation & Benefits Review, vol. 38, no. 6, pp. 26–30, 2006.
[2] M. Mayfield, J. Mayfield, and K. Q. Ma, "Innovation matters: creative environment, absenteeism, and job satisfaction," Journal of Organizational Change Management, 2020.
[3] A. Martiniano, R. Ferreira, R. Sassi, and C. Affonso, "Application of a neuro fuzzy network in prediction of absenteeism at work," in 7th Iberian Conference on Information Systems and Technologies (CISTI 2012). IEEE, 2012, pp. 1–4.
[4] R. L. Mathis and J. H. Jackson, Human resource management: Essential perspectives. Cengage Learning, 2011.
[5] F. Cucchiella, M. Gastaldi, and L. Ranieri, "Managing absenteeism in the workplace: the case of an italian multiutility company," Procedia-Social and Behavioral Sciences, vol. 150, pp. 1157–1166, 2014.
[6] M. Arai and P. S. Thoursie, "Incentives and selection in cyclical absenteeism," Labour Economics, vol. 12, no. 2, pp. 269–280, 2005.
[7] P. Hesselius, "Does sickness absence increase the risk of unemployment?" The Journal of Socio-Economics, vol. 36, no. 2, pp. 288–310, 2007.
[8] R. Winkelmann, "Wages, firm size and absenteeism," Applied Economics Letters, vol. 6, no. 6, pp. 337–341, 1999.
[9] S. A. Ruhle and S. Süß, "Presenteeism and absenteeism at work—an analysis of archetypes of sickness attendance cultures," Journal of Business and Psychology, vol. 35, no. 2, pp. 241–255, 2020.
[10] M. T. Halpern, R. Shikiar, A. M. Rentz, and Z. M. Khan, "Impact of smoking status on workplace absenteeism and productivity," Tobacco Control, vol. 10, no. 3, pp. 233–238, 2001.
[11] K. Tunceli, C. J. Bradley, D. Nerenz, L. K. Williams, M. Pladevall, and J. E. Lafata, "The impact of diabetes on employment and work productivity," Diabetes Care, vol. 28, no. 11, pp. 2662–2667, 2005.
[12] D. M. Gates, P. Succop, B. J. Brehm, G. L. Gillespie, and B. D. Sommers, "Obesity and presenteeism: the impact of body mass index on workplace productivity," Journal of Occupational and Environmental Medicine, vol. 50, no. 1, pp. 39–45, 2008.
[13] S. A. Ali Shah, I. Uddin, F. Aziz, S. Ahmad, M. A. Al-Khasawneh, and M. Sharaf, "An enhanced deep neural network for predicting workplace absenteeism," Complexity, vol. 2020, 2020.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[15] M. A. Hall, "Correlation-based feature selection for machine learning," 1999.
[16] K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," in AAAI, vol. 2, 1992, pp. 129–134.
[17] Z. Wahid, A. Z. Satter, A. Al Imran, and T. Bhuiyan, "Predicting absenteeism at work using tree-based learners," in Proceedings of the 3rd International Conference on Machine Learning and Soft Computing, 2019, pp. 7–11.