Effective Injury Prediction in Professional Soccer
Effective injury prediction in professional soccer with GPS data and machine
learning
6 authors, including:
All content following this page was uploaded by Luca Pappalardo on 01 June 2017.
1 Introduction
[1, 2]. Furthermore, the costs associated with the complex process of recovery
and rehabilitation for the player are often considerable, both in terms of medical
care and missed earnings deriving from the popularity of the player himself
[3]. Recent research demonstrates that injuries in Spain cause an average of
16% of season absence by players, corresponding to a total cost estimation of
188 million euros in just one season [4]. It is hence not surprising that injury
prediction is attracting a growing interest from researchers, managers, and
coaches, who are interested in intervening with appropriate actions to reduce
the likelihood of injuries of their players.
Historically, academic work on injury prediction has been deterred for
decades by the limited availability of data describing the physical activity of
players during the season. Nowadays, the data revolution and the Internet of
Things have the potential to rapidly change this scenario thanks to Electronic
Performance and Tracking Systems (EPTS) [5, 6], new tracking technologies
that provide high-fidelity data streams, based on video recordings by different
cameras or observations by various kinds of fixed and mobile sensors [5, 7, 8, 9,
10]. Professional soccer clubs are starting to use these new technologies to
collect data from official games and training sessions, to ensure they can remain
in control of their players’ performance as much as possible. These soccer data
depict in detail the movements of players on the field [5, 6] and have been used
for many purposes, such as understanding game performance [11], identifying
training patterns [12], or performing automatic tactical analysis [5].
Despite this wealth of data, little effort has so far been put into investigating
injury prediction in professional soccer [13, 14, 15]. Existing studies in the
literature provide just a preliminary understanding of which variables most
affect injury risk [13, 14, 15], while an evaluation of the potential of statistical
models in forecasting injuries is still missing. A major limitation of existing
studies is that they follow a monodimensional approach: since they use just one
variable at a time to estimate injury risk, they do not fully exploit the complex
patterns underlying the available data. Professional soccer clubs are interested
in practical, usable and interpretable models as support in decision making to
coaches and athletic trainers [16]. The construction of such injury prediction
models poses many challenges. On the one hand, an injury prediction model must
be highly accurate. Predictors that rarely forecast injuries are useless for
coaches, as are predictors that frequently produce “false alarms”, i.e.,
misclassify healthy players as risky ones. On the other hand, a “black box”
approach is less desirable for practical use since it does not provide insights
into the reasons behind injuries. It is fundamental to understand the complex
relationships between the performance of players and their injury risk through
simple, interpretable, and easy-to-use tools. An interpretable model reveals
the influence of variables on injuries and the reasons behind them, allowing
soccer practitioners to react in time by properly modifying training demands.
Therefore, predictive models for injury prediction must achieve a good tradeoff
between accuracy and interpretability.
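The two failure modes described above, predictors that rarely forecast injuries and predictors that raise frequent false alarms, correspond to low recall and low precision respectively. The sketch below illustrates this with the standard definitions of the two metrics. The function name and all counts are hypothetical and chosen only for illustration (the total of 21 injuries mirrors the number observed in this study); they are not results from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for the injury class.

    precision: fraction of predicted injuries that were real injuries
    recall:    fraction of real injuries that were predicted
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical "cautious" predictor: flags only 2 sessions, both real injuries,
# but misses the other 19 injuries (high precision, low recall).
cautious = precision_recall(true_positives=2, false_positives=0, false_negatives=19)

# Hypothetical "alarmist" predictor: flags 200 sessions, catching 20 injuries
# but raising 180 false alarms (high recall, low precision).
alarmist = precision_recall(true_positives=20, false_positives=180, false_negatives=1)

print("cautious:", cautious)
print("alarmist:", alarmist)
```

Neither hypothetical predictor would be useful to a coach: the first misses most injuries, while the second would keep many healthy players out of training.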
In this paper, we propose a data-driven approach to predict the injuries
in a professional soccer club and demonstrate that it is accurate and easy-to-
2 Related Work
Position of our work. In the literature, the only studies addressing the problem of
injury prediction in soccer are the ones by Brink et al. [13], Ehrmann et al. [14]
and Venturelli et al. [15]. However, these studies suffer from a major limitation: while
they observe the existence of a correlation between specific aspects of training
workload and the chance of injury, they do not construct any predictor as a
practical tool for coaches and athletic trainers to prevent injuries. Therefore,
to the best of our knowledge, there is no quantification of the potential of
predictive analytics in preventing non-traumatic injuries in professional soccer.
In this paper, we fill this gap by proposing a machine learning approach to
injury prevention and show that it outperforms existing injury risk estimation
methods for professional soccer players.
The club’s medical staff recorded all the non-contact injuries that occurred
over a period of 23 weeks. According to UEFA regulations [17], a non-contact
injury is defined as any tissue damage sustained by a player that causes absence
from subsequent football activities for at least the day after the day of onset.
We observed 21 non-contact injuries during the period of observation, 19 of them
associated with players who had been injured at least once in the past. The
distribution of injuries per player is provided in Figure 1(b). We observe that
half of the players were never injured during the period of observation, while
the others were injured once (seven players), twice (five players) or four times
(one player). For every player, we also collected information about his age,
weight, height and role on the field. Moreover, for every training session of a
player, we collected information about the play time in the official game before
the training session and the number of official games played before the training
session.
From the players’ GPS data of every training session we select 12 features
describing different aspects of workload [34]. Two features, Total Distance
(dTOT) and High Speed Running Distance (dHSR), are kinematic, i.e., they
quantify the overall movement of a player during a training session. Three
features, Metabolic Distance (dMET), High Metabolic Load Distance (dHML) and
High Metabolic Load Distance per minute (dHML/m), are metabolic, i.e., they
quantify the energy expenditure of the overall movement of a player during
a training session. The remaining seven features are mechanical, i.e., they
describe the overall musculoskeletal load of a player during a training session:
Explosive Distance (dEXP), number of accelerations above 2 m/s² (Acc2), number
of accelerations above 3 m/s² (Acc3), number of decelerations above 2 m/s²
(Dec2), number of decelerations above 3 m/s² (Dec3), Dynamic Stress Load (DSL)
and Fatigue Index (FI). Table 1 describes the workload features; Appendix A
provides descriptive statistics of the variables.
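To make the kinematic features and the acceleration counts more concrete, the following is a minimal sketch of how they could be derived from a series of speed samples. It is not the procedure used by the GPS vendor or by the authors: the function names, the fixed sampling interval `dt`, the high-speed threshold `hsr_threshold` (5.5 m/s here) and the rule that one "event" is a run of consecutive samples above a threshold are all assumptions made for illustration.

```python
def count_events(values, threshold):
    """Count runs of consecutive samples strictly above `threshold`.

    Treating each run as one event is an assumption of this sketch.
    """
    events = 0
    above = False
    for v in values:
        if v > threshold and not above:
            events += 1
        above = v > threshold
    return events


def workload_features(speeds, dt=1.0, hsr_threshold=5.5):
    """Compute a few workload features from speed samples in m/s.

    speeds:        speed samples taken every `dt` seconds (assumed regular)
    hsr_threshold: hypothetical speed above which running counts as high speed
    """
    # Finite-difference acceleration between consecutive samples (m/s^2).
    accel = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    # Decelerations are negative accelerations; flip the sign to reuse the counter.
    decel = [-a for a in accel]
    return {
        "d_TOT": sum(v * dt for v in speeds),
        "d_HSR": sum(v * dt for v in speeds if v > hsr_threshold),
        "Acc2": count_events(accel, 2.0),
        "Acc3": count_events(accel, 3.0),
        "Dec2": count_events(decel, 2.0),
        "Dec3": count_events(decel, 3.0),
    }


# Made-up speed trace, one sample per second.
features = workload_features([0.0, 1.0, 4.0, 6.0, 6.0, 2.0, 2.0])
print(features)
```

The metabolic features, DSL and FI depend on vendor-specific models and are deliberately left out of this sketch.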
[Figure 1: panels (a), (b) and (c). Panels (b) and (c) plot frequency against
injuries per player and trainings per player, respectively.]
Fig. 1 (a) A visualization of the GPS trace of a player during a training session in our
dataset. The color of the lines indicates, in a gradient from white to red, the speed of the
player. (b) Distribution of the number of injuries per player. (c) Distribution of the number
of training sessions per player.
Many existing works on injury risk estimation rely on the so-called Acute
Chronic Workload Ratio (ACWR) [14, 20, 22, 28, 29, 35], i.e., the ratio between
the acute workload and the chronic workload of a player. Although its validity
has been questioned by some authors [36, 37], ACWR is still one of the most
used techniques in professional soccer clubs [38]. As proposed by Murray et
al. [29], the acute workload of a player can be estimated as the exponentially
weighted moving average of the workload in the previous 7 days; the chronic
workload of a player can be estimated as the exponentially weighted moving
average of the workload in the previous 28 days. ACWR can be calculated on any
8 Alessio Rossi et al.
Table 1 Training workload features used in our study. Description of the training
workload features extracted from GPS data and the players’ personal features collected
during the study. We defined four categories of features: kinematic features (blue), metabolic
features (red), mechanical features (green) and personal features (white).
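As an illustration of the ACWR computation described before Table 1, the sketch below estimates acute and chronic workload with exponentially weighted moving averages over 7-day and 28-day windows and takes their ratio. The smoothing factor 2 / (N + 1), the initialization with the first day's load and the function names are assumptions of this sketch, following a common convention for such averages; the workload values are made up.

```python
def ewma(loads, span):
    """Exponentially weighted moving average of a daily workload series.

    Uses smoothing factor alpha = 2 / (span + 1) (assumed convention) and
    starts from the first day's load (assumed initialization).
    """
    alpha = 2.0 / (span + 1)
    values = []
    current = None
    for load in loads:
        if current is None:
            current = load
        else:
            current = alpha * load + (1 - alpha) * current
        values.append(current)
    return values


def acwr(loads, acute_span=7, chronic_span=28):
    """Daily ratio between acute (7-day) and chronic (28-day) workload."""
    acute = ewma(loads, acute_span)
    chronic = ewma(loads, chronic_span)
    return [a / c if c > 0 else float("nan") for a, c in zip(acute, chronic)]


# Made-up series: four weeks of constant load followed by one day at double load.
daily_load = [5.0] * 28 + [10.0]
ratios = acwr(daily_load)
print(round(ratios[-1], 3))
```

With a constant workload the ratio stays at 1; a sudden increase raises the acute average faster than the chronic one, pushing the ratio above 1. The same function can be applied to any single workload feature, one daily series at a time.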