Forecasting and Anomaly Detection Approaches Using LSTM and LSTM Autoencoder Techniques With The Applications in Supply Chain Management
c CEO Driven, 54 rue Norbert Segard, 59510, France.
Abstract
∗ Corresponding author
Email address: kim-phuc.tran@ensait.fr (K. P. Tran*)
in Schölkopf et al. (2001). However, this method may not always be effective for multivariate time series data, as only a single value of the characteristic of interest is output from the network. From these points of view, the goal
of this paper is (1) to provide an LSTM based method for forecasting multi-
variate time series data and (2) to present an effective method for detecting anomalies in multivariate time series data without making any assumptions about the distribution of prediction errors. In particular, we suggest using a one-class support vector machine (OCSVM) algorithm to separate anomalies from the data output by the LSTM Autoencoder network. In order to assess
the suitability of our proposed method, a real case study based on the fashion
retailing supply chain is considered. Fashion retailing, and especially the downstream supply chain, is a very challenging domain that requires advanced intelligent techniques. The considered scenario is described in more detail in
the next section.
2. Related Works
Forecasting and anomaly detection are fundamental problems in many domains such as finance, banking, insurance, industrial manufacturing, etc. As a result, references devoted to them are abundant in the literature.
For the anomaly detection problem, Zhao et al. (2013) improved the quick
outlier detection (QOD) algorithm by clustering based on data streams applied
to cold chain logistics. Roesch and Van Deusen (2010) suggested a quality
control approach for detecting anomalies in the analysis of annual inventory
data. Two anomaly detection techniques, including a statistical-based approach and a clustering-based approach, were used to detect outliers in sensor data for
real-time monitoring systems of the perishable supply chain in (Alfian et al.,
2017). A number of studies focusing on abnormal event detection in the supply chain based on radio frequency identification (RFID) technology can be found in Sharma and Singh (2013) and Huang and Wang (2014). Habeeb et al. (2019)
provided a comprehensive survey on real-time big data processing for anomaly
detection. The authors also proposed a taxonomy to classify the existing literature into a set of categories of anomaly detection techniques, and then analyzed existing solutions based on the proposed taxonomy. A comprehensive survey on deep learning approaches for anomaly detection is conducted in Chalapathy and Chawla (2019). A large number of references have been studied to provide an expansive overview of the problem. Deep learning-based anomaly detection models are divided into four types: unsupervised, semi-supervised, hybrid, and one-class neural networks. The idea of deep hybrid models is to use deep neural networks, mainly autoencoders, as feature extractors. After being learned within the hidden representations of the autoencoders, these features are fed
to traditional anomaly detection algorithms such as OCSVM and SVDD (sup-
port vector data description) to detect anomalies. This type of deep learning
model has been applied in several situations with great success. However, the
structure of these deep hybrid models for anomaly detection is just a combination of separate deep networks, like a CNN (convolutional neural network) or an LSTM, with the OCSVM or SVDD. Also, this type of model has not yet been
applied to multivariate time series.
For the forecasting problem, the auto-regressive integrated moving average
(ARIMA) model is commonly used as a methodology for linear time series data,
however, it is not suitable for analyzing non-linear data (Zhang, 2003). Machine learning models such as support vector regression and the random forest regressor have therefore been developed to deal with non-linear data (Carbonneau et al., 2008; Maqsood et al., 2020; Yang et al., 2020). By using nonlinear activation functions, recurrent neural networks (RNNs) are essentially nonlinear time series models, where the non-linearity is learned from the data. A comparison of ARIMA and long short term memory (LSTM) networks in forecasting time series conducted in Siami-Namini et al. (2018) showed that the LSTM model outperforms the ARIMA model, as the average reduction in error rates obtained by LSTM was about 80% when compared to ARIMA. The time series forecasting
methods with deep learning are reviewed broadly in Lim and Zohren (2020). Complex structures formed from combinations of deep learning networks, like CNN-FNN, LSTM-FNN, CNN-BLSTM, and RBM-LSTM-FNN, have also been introduced to deal with multivariate time series for forecasting (Xia et al., 2020; Deng et al., 2020; Ellefsen et al., 2019), where FNN stands for feed-forward neural network, BLSTM stands for bi-directional long short-term memory, and RBM stands for restricted Boltzmann machine. It seems that one has to use ever more complex structures for deep learning models to get higher performance, and the use of simpler deep learning networks for solving the forecasting problem no longer receives much attention. The objective of this study is to address the shortcomings in the literature discussed above.
3. Scenarios
- The supply chain of fashion products is very complex and particularly long compared to the short lifespan of the products.
[Figure 1: The two-part fashion supply chain. The upstream supply chain covers textile manufacturing (spinning, weaving, knitting, dyeing, finishing) and garment manufacturing (cutting, sewing, packaging). The downstream supply chain (the scope of this study) runs from the retailer warehouse to stores 1 to n serving consumers, with replenishment orders computed by the information system from POS data.]
To deal with these specificities, fashion retailers have developed a two-part sup-
ply chain management (Thomassey, 2010) as illustrated in figure 1, including
(1) upstream from suppliers to warehouse, a cost-oriented supply chain with
bulk procurement based on long-term forecasts, and (2) downstream from the
warehouse to local stores, a responsive supply chain with frequent replenishment
of stores mainly based on short-term Point Of Sales (POS) data.
In this study, we focus on the downstream supply chain of fashion retailers.
As mentioned earlier, consumer demand fluctuates strongly. When the product variety is high, inventory allocation becomes very challenging for an extensive store network. Thus, companies rely on efficient and reactive information systems to monitor POS data and compute the replenishment of each store for the next day or the next two days. Combined with efficient transportation and distribution logistics, this process enables companies to drive their local inventories in most situations. However, the high sensitivity of the demand to pricing effects and weather conditions frequently involves sharp and immediate fluctuations which cannot be predicted by the POS data-based replenishment system. Taking into account constraints such as small store surfaces and staff numbers limited relative to product reception, shelving, and the sales force, these high fluctuations generate significant profit loss. Therefore, a short-term
sales forecasting system should be developed to cope with this problem. Different models have been proposed in the literature for this task (Sirovich et al., 2018). However, the product variety and the extensive store network generate a huge number of situations, which are as many sources of forecast errors. To deal with these issues, we propose an approach that combines new advances in forecasting with the LSTM network, the LSTM Autoencoder network, and the OCSVM algorithm. In this context, the aim of our method is not only to predict the exact sales by stock-keeping unit (SKU) and store but also to detect and anticipate exceptional sales in order to enable practitioners to make suitable decisions and adjust their replenishment for the highlighted SKUs/stores accordingly.
LSTM is a type of Recurrent Neural Network (RNN) that allows the network to retain long-term dependencies between the data at a given time and the data from many timesteps before. It takes the form of a chain of repeated modules of neural networks, where each module includes three control gates, i.e. the forget gate, the input gate, and the output gate. Each gate is composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layers output numbers in the interval [0, 1], representing the portion of the input information that should be let through. When used as an RNN for time series data, the LSTM reads a sequence of input vectors $x = \{x_1, x_2, \ldots, x_t, \ldots\}$, where $x_t \in \mathbb{R}^m$ represents an m-dimensional vector of readings for m variables at time instance t. We consider the scenario where multiple such time series can be obtained by taking a window over a larger time series. Although the LSTM can work with any time series data, one should bear in mind that its performance is not always the same: it can vary depending on the input.
Given the new information $x_t$ in state t, the LSTM module works as follows. Firstly, it decides what old information should be forgotten by outputting a number within [0, 1], say $f_t$, with

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f), \qquad (1)$$

where $h_{t-1}$ is the output in state t − 1, and $W_f$ and $b_f$ are the weight matrix and the bias of the forget gate. Then, $x_t$ is processed before being stored in the cell state. The value $i_t$ is determined in the input gate along with a vector of candidate values $\tilde{C}_t$ generated by a tanh layer at the same time, to update the new cell state $C_t$, in which

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad (2)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C), \qquad (3)$$

and

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t, \qquad (4)$$

where $(W_i, b_i)$ and $(W_C, b_C)$ are the weight matrices and the biases of the input gate and the memory cell state, respectively. Finally, the output gate, defined by

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad (5)$$

where $W_o$ and $b_o$ are the weight matrix and the bias of the output gate, determines the part of the cell state being output:

$$h_t = o_t * \tanh(C_t). \qquad (6)$$

Figure 2, which has been reproduced from figure 1 in (Tran et al., 2019) with modifications, presents an illustration of the structure and the operational principle of a typical LSTM module. In this figure, the cell state runs straight down the entire chain, maintaining the sequential information in an inner state and allowing the LSTM to persist knowledge across time steps.
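To make the gate equations concrete, the following minimal NumPy sketch implements a single LSTM step according to (1)-(6). The weight shapes, the concatenation $[h_{t-1}, x_t]$, and all names are illustrative assumptions, not the paper's code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM step: returns the new hidden state h_t and cell state C_t."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate, eq. (1)
    i_t = sigmoid(W_i @ z + b_i)             # input gate, eq. (2)
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate values, eq. (3)
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state, eq. (4)
    o_t = sigmoid(W_o @ z + b_o)             # output gate, eq. (5)
    h_t = o_t * np.tanh(C_t)                 # new hidden state, eq. (6)
    return h_t, C_t

# Example usage with random weights (m inputs, d hidden units).
m, d = 4, 8
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(d, d + m)) * 0.1
h, C = lstm_step(rng.normal(size=m), np.zeros(d), np.zeros(d),
                 W(), np.zeros(d), W(), np.zeros(d),
                 W(), np.zeros(d), W(), np.zeros(d))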
There are also various variants of LSTM suggested by different authors. A direct comparison of popular variants of LSTM made by Greff et al. (2016) showed that these variants perform almost the same; a few of them are more efficient than others, but only on some specific problems.
[Figure 2: The structure of a typical LSTM module: the cell state $C_{t-1} \to C_t$, the forget gate $f_t$, the input gate $i_t$ with candidate values $\tilde{C}_t$, the output gate $o_t$, the sigmoid layers $\sigma_1, \sigma_2, \sigma_3$ and tanh layers, and the hidden state $H_{t-1} \to H_t$ for the input $X_t$.]
The autoencoder is trained by minimizing the reconstruction loss

$$L = \frac{1}{2} \sum_{x} \|x - \hat{x}\|^2. \qquad (7)$$
The main purpose of the autoencoder is not simply to copy the input to the output. By constraining the latent space to have a smaller dimension than the input, i.e. n < m, the autoencoder is forced to learn the most salient features of the training data. In other words, an important feature in the design of the autoencoder is that it reduces the data dimension while keeping the major information of the data structure.
Several types of autoencoders have been proposed in the literature, such as the vanilla autoencoder, the convolutional autoencoder, the regularized autoencoder, and the LSTM autoencoder.
[Figure 3: An autoencoder in which both the encoder and the decoder are LSTM neural networks: the input x is mapped to the encoded features z = e(x) and reconstructed as x̂ = d(z).]
Among these types, the LSTM autoencoder refers to an autoencoder in which both the encoder and the decoder are LSTM networks. The ability of the LSTM to learn patterns in data over long sequences makes it suitable for time series forecasting and anomaly detection. That is, the LSTM cell is used to capture temporal dependencies in multivariate data. It is shown in (Malhotra et al., 2016) that an encoder-decoder model learned using only the normal sequences can be used for detecting anomalies in multivariate time series. The encoder-decoder has only seen normal instances during training and has learned to reconstruct them. When it is fed with an anomalous sequence, it may not be reconstructed well, leading to higher errors. This has a practical meaning, since anomalous data are not always available and it is impossible to cover all types of anomalies. Many advantages of using the autoencoder approach have been discussed in (Provotar et al., 2019). The use of the LSTM autoencoder for anomaly detection on multivariate time series data can be seen in several studies, for example, Pereira and Silveira (2018) and Principi et al. (2019).
Figure 4 provides an illustration of an LSTM autoencoder network.
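As an informal illustration, an LSTM autoencoder of this kind could be defined in Keras as sketched below. The window length and feature count are placeholders, and the layer sizes merely echo the optimized model reported in Appendix E; none of this should be read as the study's exact implementation.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

window, n_features = 10, 1                      # assumed window and input size

model = Sequential([
    # Encoder: compress the window into a fixed-size latent vector
    LSTM(256, return_sequences=True, input_shape=(window, n_features)),
    LSTM(64),                                   # latent representation z = e(x)
    # Decoder: reconstruct the window from the latent vector
    RepeatVector(window),
    LSTM(64, return_sequences=True),
    LSTM(256, return_sequences=True),
    TimeDistributed(Dense(n_features)),         # reconstruction x_hat = d(z)
])
model.compile(optimizer="adam", loss="mse")     # reconstruction loss, cf. eq. (7)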
The idea of the OCSVM is to find a hyperplane defined in a high-dimensional Hilbert feature space F with maximum margin separation from the origin. The data are mapped to the space F through a nonlinear transformation Φ(·). Then, the problem of separating the data set from the origin is equivalent to solving the following quadratic program (Schölkopf et al., 2001):

$$\min_{w, \boldsymbol{\xi}, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N} \xi_i - \rho \qquad (8)$$

$$\text{subject to} \quad (w \cdot \Phi(y_i)) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N, \qquad (9)$$

where $\nu \in (0, 1]$ controls the trade-off between the margin and the fraction of training errors. The resulting decision function,

$$f(y) = \operatorname{sgn}\big((w \cdot \Phi(y)) - \rho\big), \qquad (10)$$

is positive for normal samples and negative for anomalies. A common choice for the kernel is the Gaussian kernel,

$$k(y_i, y_j) = \Phi(y_i) \cdot \Phi(y_j) = \exp\left(-\frac{\|y_i - y_j\|^2}{2\sigma^2}\right), \qquad (11)$$

where σ > 0 stands for the kernel width parameter. In the feature space, the distance between two mapped samples $y_i$ and $y_j$ is

$$\|\Phi(y_i) - \Phi(y_j)\|^2 = k(y_i, y_i) + k(y_j, y_j) - 2k(y_i, y_j) = 2\left[1 - \exp\left(-\frac{\|y_i - y_j\|^2}{2\sigma^2}\right)\right]. \qquad (12)$$

Equation (12) shows a positively proportional relation between $\|\Phi(y_i) - \Phi(y_j)\|$ and $\|y_i - y_j\|$. That is to say, the ranking order of the distances between samples in the input and feature spaces is preserved by using the Gaussian kernel.
By using the Lagrangian method and the kernel function, Schölkopf et al. (2001) showed that the problem of solving the quadratic program (8) can be transferred to the following dual optimization:

$$\alpha^\star = \operatorname*{Argmin}_{\alpha} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(y_i, y_j) \qquad (13)$$

$$\text{subject to} \quad \sum_{i=1}^{N} \alpha_i = 1, \quad 0 \leq \alpha_i \leq \frac{1}{\nu N}, \quad \forall i = 1, \ldots, N. \qquad (14)$$

Samples $y_i$ that correspond to $0 < \alpha_i^\star < \frac{1}{\nu N}$ are called support vectors. Letting $N_{SV}$ stand for the number of support vectors, the discriminant function reduces to

$$f(y) = \operatorname{sgn}\left(\sum_{i=1}^{N_{SV}} \alpha_i^\star \, k(y_i, y) - \rho\right). \qquad (15)$$
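For illustration, the following scikit-learn sketch fits an OCSVM of this form on synthetic placeholder data; the parameter nu corresponds to ν in (8) and (14), and gamma = 1/(2σ²) matches the Gaussian kernel (11). The data and parameter values are assumptions, not the paper's settings.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
Y_train = rng.normal(size=(500, 4))                    # normal samples only
Y_test = np.vstack([rng.normal(size=(50, 4)),
                    rng.normal(loc=5.0, size=(5, 4))]) # append a few anomalies

sigma = 1.0
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=1.0 / (2 * sigma**2))
ocsvm.fit(Y_train)

labels = ocsvm.predict(Y_test)                         # +1 = normal, -1 = anomaly
print("support vectors:", ocsvm.support_vectors_.shape[0])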
5. Proposed approaches
Multivariate time series refers to a time series that has more than one time-dependent variable. That means each variable depends not only on its past values but also on the other variables. This dependency makes multivariate time series convenient for modeling interesting interdependencies and forecasting future values. However, because of this nature, it can be difficult to build accurate models for multivariate time series forecasting, an important task in many practical applications. In the literature, several multivariate time series predictive models have been proposed, such as the vector auto-regressive (VAR) model and the Bayesian VAR model. A summary of advanced multivariate time series forecasting approaches based on statistical models can be seen in (Wang, 2018). Recently, the rapid development of artificial neural networks has provided a powerful tool to handle a wide variety of problems that were either out of scope for or difficult to handle with classical time series predictive approaches. For example, a multivariate time series forecasting method using LSTM has been suggested for forecasting air quality in (Freeman et al., 2018). The method is explained in detail below and applied to our situation.
Let $x_t = \{x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(k)}\}$, $t = 1, 2, \ldots$ denote a multivariate time series at time t, where k is the number of variables. In a supply chain, $x_t$ could be the value of some specific features such as sales, temperature, humidity, and product price. The LSTM network is trained based on a sequence of observed data $\{x_1, x_2, \ldots, x_N\}$, where N is the number of samples, as follows. Firstly, individual observations are scaled using the MinMaxScaler function by the formula

$$x_{scaled}^{(i)} = \frac{x^{(i)} - x_{min}^{(i)}}{x_{max}^{(i)} - x_{min}^{(i)}}, \quad i = 1, \ldots, k, \qquad (16)$$

where $x_{max}^{(i)}$ and $x_{min}^{(i)}$ are the maximum and minimum values of $x^{(i)}$ in the data set, respectively. To keep the notation simple, we write $x^{(i)}$ for $x_{scaled}^{(i)}$ and understand that this is scaled data. Then, in the training process, we set up a sliding window of size m, m < N. That is to say, m consecutive multivariate observations are fed to the LSTM at the same time. We use these m · k inputs to predict the next value of the characteristic of interest, say $x^{(1)}$. For example, at the first window, the sequence $\{x_1, x_2, \ldots, x_m\}$ in the training data set is taken to feed the LSTM, and the network predicts the value $\hat{x}_{m+1}^{(1)}$. For the second window, based on the sequence $\{x_2, x_3, \ldots, x_{m+1}\}$, the LSTM predicts the value $\hat{x}_{m+2}^{(1)}$. This process continues until the window slides to the end of the training data set. The weights of the LSTM network are trained to minimize the loss function of the prediction errors:

$$L = \sum_{i=m+1}^{N} e_i, \qquad (17)$$

where $e_i = \|\hat{x}_i^{(1)} - x_i^{(1)}\|$. The performance of the LSTM network is evaluated using the root mean square error (RMSE):

$$RMSE = \sqrt{\frac{1}{N - m - 1} \sum_{i=m+1}^{N} \left(\hat{x}_i^{(1)} - x_i^{(1)}\right)^2}. \qquad (18)$$
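As a concrete illustration of the scaling step (16) and the sliding-window construction just described, consider the following minimal Python sketch. The series is synthetic, and the helper name make_windows is ours, not from the paper's code.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def make_windows(X, m):
    """X: array of shape (N, k). Returns inputs (N-m, m, k) and targets (N-m,)."""
    inputs = np.stack([X[i:i + m] for i in range(len(X) - m)])
    targets = X[m:, 0]                  # next value of the first variable x^(1)
    return inputs, targets

N, k, m = 200, 5, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(N, k)).cumsum(axis=0)   # placeholder multivariate series

X_scaled = MinMaxScaler().fit_transform(X)   # equation (16), applied per variable
X_win, y = make_windows(X_scaled, m)
print(X_win.shape, y.shape)                  # (190, 10, 5) (190,)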
After training, the network is used for forecasting. In particular, the value $\hat{x}_{N+1}^{(1)}$ can be predicted by the LSTM based on the input $\{x_{N-m+1}, x_{N-m+2}, \ldots, x_N\}$. In practice, some of the parameters of the model need to be optimized based on the input data to achieve the best performance. In our study, the learning rate, the number of cells, and the dropout are optimized. The choice of the sliding window size is also a question in some situations. However, one should consider the ability of the LSTM to learn long temporal dependence. This ability means the LSTM does not need a pre-determined time window: it can find the optimal look-back number on its own. That is to say, we can try some specific values for the size of the sliding window and let the LSTM learn from the data. If one wants to try another value for the sliding window size, the other parameters need to be re-optimized, which can take more time. In this study, we assign a particular value for the sliding window size based on our knowledge of the data. Appendix A provides pseudocode for the proposed method.
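As an informal complement to the pseudocode in Appendix A, the Keras sketch below shows how such a sliding-window LSTM forecaster might be assembled and trained. The layer size, dropout rate, learning rate, and data are placeholders standing in for the values the paper optimizes; they are not the study's final settings.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

m, k = 10, 5                                    # window size and variable count
model = Sequential([
    LSTM(64, input_shape=(m, k)),               # number of cells: to be optimized
    Dropout(0.2),                               # dropout: to be optimized
    Dense(1),                                   # predicted x_hat^(1)
])
model.compile(optimizer=Adam(learning_rate=1e-3), loss="mse")

rng = np.random.default_rng(0)
X_win = rng.random((190, m, k))                 # windows, e.g. from make_windows
y = rng.random(190)
model.fit(X_win, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
rmse = float(np.sqrt(model.evaluate(X_win, y, verbose=0)))  # cf. equation (18)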
Such an approach does not require any specific assumption about the data. Among the machine learning algorithms, OCSVM is a very effective algorithm that can be used to detect anomalies. Since the dependency in the multivariate time series is eliminated by using the LSTM autoencoder, the error vectors $e_i$, $i = m + 1, \ldots, N$ can be considered as independent. From these vectors, the OCSVM can define a hyperplane to separate the abnormal observations from the normal samples. Another possible method to avoid the Gaussian distribution assumption is to use the kernel quantile estimation (KQE) method, as applied in (Tran et al., 2019). Compared to the anomaly detection method suggested in (Tran et al., 2019), the method proposed in this study has several advantages. The LSTM autoencoder used in this study allows extracting important features from the multivariate time series more efficiently. Moreover, by outputting a vector rather than a single component of the vector, the dependence between the components of the predicted vector is preserved. As a result, it makes the machine learning algorithms for classification or anomaly detection more efficient. Similar to the previous section, the learning rate and the number of cells are optimized based on the input data rather than being pre-determined, in order to achieve better performance of the model. Pseudocode for the proposed method can be seen in Appendix B.
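A minimal sketch of this hybrid pipeline is given below, assuming a trained Keras LSTM autoencoder like the one sketched in Section 4. The layer index used to slice out the encoder half and the helper names are our assumptions for illustration, not the paper's implementation.

import numpy as np
from tensorflow.keras.models import Model
from sklearn.svm import OneClassSVM

def fit_detector(autoencoder, X_normal, nu=0.05):
    """Fit an OCSVM on the encoder outputs of normal windows."""
    # Slice the encoder half: here layer 1 is assumed to be the latent LSTM.
    encoder = Model(inputs=autoencoder.input,
                    outputs=autoencoder.layers[1].output)
    Z = encoder.predict(X_normal, verbose=0)    # extracted features
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(Z)
    return encoder, ocsvm

def detect(encoder, ocsvm, X):
    """Label each window: +1 = normal, -1 = anomaly."""
    return ocsvm.predict(encoder.predict(X, verbose=0))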
Figure 4: An illustration of the operation of the LSTM autoencoder network for a sliding window of size 2: input sequences $(x_1, x_2), (x_2, x_3), \ldots, (x_{N-1}, x_N)$ over time are mapped to the encoded features.
We evaluate the method based on the first dataset of C-MAPSS, i.e. the FD001 dataset. The C-MAPSS FD001 dataset is split into a training set and a test set of multiple multivariate time series. The training set contains the run-to-failure condition monitoring data stream for 100 engines of the same type, while the test set contains the same type of data for engines whose records end some time before failure occurs. The length of the condition monitoring data is inconsistent from one engine to another, and it is contaminated with sensor noise, making it a challenging task to predict the remaining useful lifetime (RUL) (Xia et al., 2020). Table 1 presents more details of this dataset.

The objective is to predict the true RUL of each engine in the test set by using the data from the training set. That is, the data from the training set are fed to train the model, and the trained model is used to predict the RUL of the test engines. In the training process, the number of cells, the dropout, and the learning rate of the model are optimized. The optimized model for this dataset is presented in Appendix D. Our computation is performed on a platform with a 2.6 GHz Intel(R) Core(TM) i7 and 32 GB of RAM.
FD001                     Training set    Test set
Number of engines         100             100
Number of data            20631           13096
Minimum running cycle     128             31
Maximum running cycle     362             303
Mean running cycle        206.31          130.96

Table 1: Details of the C-MAPSS FD001 dataset.
It took about 5 hours to train the parameters of the model during the training process. After being optimized, these parameters were used to re-train the model and to predict the RUL on the test set; this stage took only a few minutes. It should be noted that one could obtain a higher performance of the model by finding optimized parameters for different structures of the LSTM; however, it might take more time for training. Figures 5-6 sketch the difference between the predicted RUL and the true RUL on the test set and the corresponding line plots of the train loss and the validation loss using our proposed method. The obtained result shows that the predicted values are very close to the true ones. Also, after a few epochs, the errors on the training set and the validation set decrease remarkably. That is to say, the proposed LSTM based method for forecasting multivariate time series from the C-MAPSS FD001 dataset is effective.
In the literature, the C-MAPSS FD001 dataset has been extensively studied
for verifying an RUL prognostic model and many related studies have been
published. Table 2 compares the prognostic performance of our proposed model with some other recent models based on the RMSE metric. It can be seen from
Table 2 that although the structure of our LSTM based model is simpler than
other ensemble or hybrid models, it still leads to the smallest RMSE. That is,
we can say that the proposed method has a superior performance in forecasting
multivariate time series data.
Figure 5: The true RUL and the predicted RUL using the LSTM autoencoder for the FD001 dataset.
Figure 6: Line plot of train and validation loss from the proposed model during training on
the FD001 dataset.
Method & Refs.                              RMSE
MTW-BLSTM ensemble (Xia et al., 2020)       12.61
LSTM-FW-CatBoost (Deng et al., 2020)        15.8
RBM-LSTM-FNN (Ellefsen et al., 2019)        12.56
Proposed method                             9.71

Table 2: RMSE comparison with the literature on the C-MAPSS FD001 dataset.
samples representing the normal sales. The function for generating the data is shown in Appendix C. The optimized LSTM autoencoder model from the training process based on this simulated data is displayed in Appendix E.
After training, we compare the performance of the LSTM-KQE based method applied in (Tran et al., 2019), the LSTM Autoencoder-KQE based method, and the LSTM Autoencoder-OCSVM based method proposed in this study on 2989 normal and abnormal samples of a simulated testing data set. In this testing data set, we simulate a small shift from the 999th sample to the 1499th sample. Figure 7 displays a graph of the generated data for the testing phase.
The comparison is made by using the following measures:
Method                            DR (Recall)    Precision    Accuracy    F-score
Method in Tran et al. (2019)      0.9815         0.9465       0.9384      0.8805
LSTM Autoencoder with KQE         0.9807         0.9583       0.9484      0.9029
LSTM Autoencoder with OCSVM       0.9959         0.9845       0.9836      0.9698

Table 3: Comparison of the performance of our proposed method and the method suggested in Tran et al. (2019).
• Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F-score = 2 × (Precision × Recall) / (Precision + Recall)
where TP (True Positive) stands for the number of anomalies correctly diag-
nosed as anomalies, TN (True Negative) stands for the number of normal events
correctly diagnosed as normal, FP (False Positive) stands for the number of nor-
mal events incorrectly diagnosed as anomalies, and FN (False Negative) stands
for the number of anomalies incorrectly diagnosed as normal events. By definition, Precision is used to evaluate how accurate the result is, and Recall is used to evaluate how complete the result is. Also, the F-score is used to seek a balance between Precision and Recall.
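For completeness, a minimal Python sketch computing these four measures from ±1 labels is given below, with anomalies mapped to the positive class; the labels are synthetic placeholders.

import numpy as np

y_true = np.array([-1, -1, 1, 1, -1, 1])    # +1 = anomaly, -1 = normal
y_pred = np.array([-1, 1, 1, 1, -1, -1])

tp = np.sum((y_pred == 1) & (y_true == 1))     # anomalies caught
tn = np.sum((y_pred == -1) & (y_true == -1))   # normal kept normal
fp = np.sum((y_pred == 1) & (y_true == -1))    # false alarms
fn = np.sum((y_pred == -1) & (y_true == 1))    # missed anomalies

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_score)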
The obtained results are given in Table 3. As can be seen from this table, the LSTM Autoencoder based methods lead to better performance compared to the LSTM based method in Tran et al. (2019). In particular, the Accuracy, Precision, and F-score corresponding to the LSTM Autoencoder (in the second and third rows) are significantly larger than the ones corresponding to the LSTM (in the first row). In addition, the use of the OCSVM algorithm for classification brings the best results, with an Accuracy of 98.36%, a Precision of 98.45%, an F-score of 96.98%, and a Recall of 99.59%. That is to say, our proposed method of using the LSTM Autoencoder combined with the OCSVM outperforms the other methods, ensuring more accurate detection of anomalies in sales. Therefore, this method is applied in the next section for anomaly detection in a real fashion retail data set.
6.2. Real fashion retail data
The data are collected from a store in the center of a city in France from 01/01/2015 to 18/11/2019. They are considered as a multivariate time series with five variables: sales quantity (of the T-shirts), price discount, temperature, rain (precipitation in mm), and initial price (without discount). Figure 8 presents the historical data collected for each variable.

[Figure 8: The collected historical data over time: (a) daily sales, (b) discount rates, (c) original (average) price, (d) temperature, (e) rainfall.]

The daily sales (figure 8a) demonstrate different seasonal effects.
An overall decreasing trend can also be detected, since the amount of sales seems to decline every year. These features are typically well handled by time series models. However, peaks and sharp surges often occur in the sales. These variations are produced by different factors. Sales of fashion products are generally considered as very sensitive to price discounts and weather data (Thomassey, 2014, 2010). The impacts of these explanatory variables on sales are often complex, nonlinear, period-dependent, and inter-correlated. Consequently, the analysis of these impacts requires a multivariate time series model. Figures 8b and 8c show the discount rates and the original (average) price of the T-shirts. Increases in sales can be identified during discount periods. However, it appears that similar discount rates have very different impacts on sales. Thus, the original price is also considered to complete the information on the discount rate. The weather data, temperature (figure 8d) and rainfall (figure 8e), give further information to explain the peaks in the sales. It is difficult to visually measure the impacts of these variables since they are generally very brief. The purpose of the proposed forecasting model is to deal with the combinations of all these factors (sales features, discounts, weather data) to provide a forecast as accurate as possible. Nevertheless, unexpected variations, visible in figure 8a and identified more specifically in figure 12, cannot be taken into account by the forecasting model. For this reason, the proposed anomaly detection model aims
to detect these variations to enable decision-makers to modify and adapt the replenishment strategy accordingly.

The total data of 1441 days are divided into three parts: 56% of the data (807 days of sales) for training, 14% (202 days of sales) for validation, and the remaining 30% (432 days of sales) for testing. For the choice of the test base, we took the sales of a fiscal year (01/04/2018 - 31/03/2019) as the test set, and we took one week for each sales period (8 weeks). Moreover, we tried different ratios which are quite popular in the literature and picked this ratio since it gave the best performance. However, one should bear in mind that the obtained result also depends on the dataset, and the performance of the proposed method would vary based on the proportions used for training.
where $x_i$ and $\hat{x}_i$ represent the real sales and the predicted sales at time i. The obtained result is 12.47%. These results show that the LSTM based forecasting
Figure 9: Line plot of train and validation loss from the multivariate LSTM during training
model leads to a good prediction and can be applied to predict the sales in practice. For further research, more features/variables that may affect the sales, such as the color of the products, the size, and the store, could be included in the multivariate time series to improve the performance of the model.
Figure 10: A comparison between real sales and predicted sales using LSTM
The detected anomalies correspond to periods of abnormal behavior. For example, we can see unusually high values in September 2018, and there are times when the sales are abnormally low, as in September 2019. The company should find out the factors that lead to these anomalies. They could be new sales policies, new sales staff, or new style products that lead to a higher sales quantity; or they could be factors that lead to a lower sales quantity. Pointing out these anomalous sales may be very useful for companies in making better decisions for future management.
7. Discussion
The theoretical contribution of this study consists of two parts. Firstly, we de-
velop a multivariate time series forecasting model based on LSTM with the ap-
plication in sales forecasting. In order to verify the performance of the proposed
Figure 11: Illustration of the learned representation of the LSTM Autoencoder from the original multivariate time series using the PCA method.
Figure 12: The anomaly detection for real data based on the LSTM Autoencoder network and the OCSVM algorithm.
forecasting model, we utilized a well-known dataset (i.e. the C-MAPSS FD001 dataset provided by NASA) with a large number of samples, including 20631 samples for training and 13096 samples for testing. In the literature, this dataset has been widely used to evaluate the effectiveness of many complex deep learning models like CNN-FNN, LSTM-FNN, CNN-BLSTM, and RBM-LSTM-FNN. Our proposed model, which is simply an LSTM based model, is obviously simpler than these models. However, by optimizing the parameters (i.e. the learning rate, the number of cells, and the dropout) rather than choosing pre-determined values, it achieves a significantly higher performance than theirs. This finding could be very useful for other authors in designing their deep learning models for a specific forecasting purpose, dealing not only with multivariate time series but also with other kinds of data. That is, they can consider a simpler structure and optimize its parameters instead of choosing more complex combinations. In applying our proposed model to a real situation in SCM, we have suggested integrating weather variables (i.e. temperature and rain - precipitation in mm) into the model along with traditional variables like the initial price and the price discount to predict the sales. This could be the first time a sales forecasting model in SCM considering these weather attributes has been suggested. The use of these variables helps to improve the performance of the forecasting model. Secondly, we have developed a novel deep hybrid model for anomaly detection. The LSTM autoencoder is used as a feature extractor to extract important representations of the multivariate time series input, and these features are then fed to the OCSVM for detecting anomalies. This model results in better performance compared to the performance reported in several previous studies. We also consider optimizing the hyperparameters of the LSTM autoencoder. To the best of our knowledge, the idea of using an LSTM autoencoder with optimized hyperparameters and the OCSVM for anomaly detection has not been suggested in the literature. The proposed model has been applied to detect anomalies in the sales from a real dataset of a fashion company in France.
Accurately forecasting the sales in the near future can help managers to have a good plan for stocking, enhancing economic efficiency, and optimizing the business of the company. In this study, only five variables are considered. However, more factors that may have a significant effect on the sales can be involved in practice. The accuracy of the proposed method could be improved remarkably once these factors are included in the input. One should consult experts or experienced staff to identify them. Meanwhile, accurately detecting anomalies in sales enables the company to gain insight into its operating and marketing strategies. A negative anomaly in sales may correspond to poor marketing strategies, leading to a decrease in sales; the strategies then need to be reviewed and adjusted. By contrast, once a positive anomaly is detected by the model, it could be useful to investigate and explain the reason, thereby increasing sales and forming appropriate strategies for the future. In addition, one should note that the application of our proposed models is not limited to SCM. In fact, they can be applied to any scenario involving multivariate time series. For example, the LSTM based forecasting model can be used for stock forecasting, power consumption forecasting, air pollution forecasting, RUL forecasting, etc. The anomaly detection model can be used for fraud detection, cyber-intrusion detection, medical anomaly detection, industrial damage detection, and so on.
This could be very useful for companies in forming an effective and early strategy. Finally, we plan to improve the performance of the proposed anomaly detection model by using another variant of the autoencoder, such as the variational autoencoder (Kingma and Welling, 2019). An advantage of the variational autoencoder is that it can avoid overfitting and ensure that the latent space has good properties enabling the generative process.
8. Conclusions
Acknowledgements
The authors would like to thank editors and the anonymous referees for their
insightful and valuable suggestions which helped to improve the quality of the
final manuscript.
References
Acharya, A., Singh, S. K., Pereira, V., and Singh, P. (2018). Big data, knowledge
co-creation and decision making in fashion industry. International Journal
of Information Management, 42:90–101.
Alfian, G., Syafrudin, M., and Rhee, J. (2017). Real-time monitoring system using smartphone-based sensors and NoSQL database for perishable supply chain. Sustainability, 9(11):2073.
Bontemps, L., McDermott, J., and Le-Khac, N.-A. (2016). Collective anomaly
detection based on long short-term memory recurrent neural networks. In
International Conference on Future Data and Security Engineering, pages
141–152. Springer.
Chen, H. Y., Das, A., and Ivanov, D. (2019). Building resilience and managing
post-disruption supply chain recovery: Lessons from the information and
communication technology industry. International Journal of Information
Management, 49:330–342.
Deng, K., Zhang, X., Cheng, Y., Zheng, Z., Jiang, F., Liu, W., and Peng, J.
(2020). A remaining useful life prediction method with long-short term fea-
ture processing for aircraft engines. Applied Soft Computing, page 106344.
30
Dolgui, A., Ivanov, D., Potryasaev, S., Sokolov, B., Ivanova, M., and Werner,
F. (2020). Blockchain-oriented dynamic modelling of smart contract design
and execution in the supply chain. International Journal of Production
Research, 58(7):2184–2199.
Ellefsen, A. L., Bjørlykhaug, E., Æsøy, V., Ushakov, S., and Zhang, H. (2019).
Remaining useful life predictions for turbofan engine degradation using
semi-supervised deep architecture. Reliability Engineering & System Safety,
183:240–251.
Freeman, B. S., Taylor, G., Gharabaghi, B., and Thé, J. (2018). Forecasting
air quality time series using deep learning. Journal of the Air & Waste
Management Association, 68(8):866–886.
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2016). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232.
Habeeb, R. A. A., Nasaruddin, F., Gani, A., Hashem, I. A. T., Ahmed, E., and
Imran, M. (2019). Real-time big data processing for anomaly detection: A
survey. International Journal of Information Management, 45:289–307.
Hosseini, S., Ivanov, D., and Dolgui, A. (2019). Review of quantitative meth-
ods for supply chain resilience analysis. Transportation Research Part E:
Logistics and Transportation Review, 125:285–307.
Ivanov, D. and Dolgui, A. (2020). A digital supply chain twin for managing
the disruption risks and resilience in the era of industry 4.0. Production
Planning & Control, pages 1–14.
Ivanov, D., Dolgui, A., and Sokolov, B. (2019). The impact of digital technol-
ogy and industry 4.0 on the ripple effect and supply chain risk analytics.
International Journal of Production Research, 57(3):829–846.
31
Kingma, D. P. and Welling, M. (2019). An introduction to variational autoen-
coders. arXiv preprint arXiv:1906.02691.
Lim, B. and Zohren, S. (2020). Time series forecasting with deep learning: A
survey. arXiv preprint arXiv:2004.13408.
Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., and Shroff, G. (2016). LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148.
Malhotra, P., Vig, L., Shroff, G., and Agarwal, P. (2015). Long short term
memory networks for anomaly detection in time series. In Proceedings,
volume 89, pages 89–94. Presses universitaires de Louvain.
Maqsood, H., Mehmood, I., Maqsood, M., Yasir, M., Afzal, S., Aadil, F., Selim,
M. M., and Muhammad, K. (2020). A local and global event sentiment
based efficient stock exchange forecasting using deep learning. International
Journal of Information Management, 50:432–451.
Principi, E., Rossetti, D., Squartini, S., and Piazza, F. (2019). Unsupervised
electric motor fault detection by using deep autoencoders. IEEE/CAA
Journal of Automatica Sinica, 6(2):441–451.
Saxena, A. and Goebel, K. (2008). C-MAPSS data set. NASA Ames Prognostics Data Repository.
32
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson,
R. C. (2001). Estimating the support of a high-dimensional distribution.
Neural computation, 13(7):1443–1471.
Xia, T., Song, Y., Zheng, Y., Pan, E., and Xi, L. (2020). An ensemble framework
based on convolutional bi-directional lstm with multiple time windows for
remaining useful life estimation. Computers in Industry, 115:103182.
33
Yang, R., Yu, L., Zhao, Y., Yu, H., Xu, G., Wu, Y., and Liu, Z. (2020). Big data
analytics for financial market volatility forecast based on support vector
machine. International Journal of Information Management, 50:452–462.
Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50:159–175.
Zhao, W., Dai, W., and Zhou, S. (2013). Outlier detection in cold-chain logistics
temperature monitoring. Elektronika ir Elektrotechnika, 19(3):65–68.
Appendix
B. Multivariate time series anomaly detection using LSTM autoencoder and
OCSVM
Algorithm 2: Multivariate time series anomaly detection using LSTM autoencoder and OCSVM
for i ∈ {1, . . . , B} do
end for
C. Generating time series data

Algorithm 3: Generating time series data
Generate a synthetic wave by adding up a few sine waves and some noise
Output: the final wave
t ← an initial sequence of size n
wave1 ← sin(2 ∗ 2 ∗ π ∗ t)
noise ← a random normal sample of size n
wave1 ← wave1 + noise
wave2 ← sin(2 ∗ π ∗ t)
t.rider ← an initial sequence of size m, m ≪ n
wave3 ← −2 ∗ sin(10 ∗ π ∗ t.rider)
insert ← an integer value less than n − m
wave1[insert:insert + m] ← wave1[insert:insert + m] + wave3
return: wave1 − 2 ∗ wave2
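A runnable NumPy rendering of Algorithm 3 might look as follows. The sequence lengths, noise scale, and random seed are illustrative choices, not values from the paper.

import numpy as np

def generate_wave(n=2000, m=100, seed=0):
    """Synthetic series: sum of sine waves plus noise, with a short anomaly."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 10, n)                  # initial sequence of size n
    wave1 = np.sin(2 * 2 * np.pi * t)
    wave1 += rng.normal(scale=0.2, size=n)     # add noise
    wave2 = np.sin(2 * np.pi * t)
    t_rider = np.linspace(0, 1, m)             # m much smaller than n
    wave3 = -2 * np.sin(10 * np.pi * t_rider)  # the anomalous segment
    insert = rng.integers(0, n - m)            # integer value less than n - m
    wave1[insert:insert + m] += wave3
    return wave1 - 2 * wave2                   # the final wave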
E. Optimized LSTM autoencoder model for anomaly detection based on the generated dataset

# Best parameters:
num_cells = 256/64/64/256
lr = 0.01

Layer (type)              Output Shape        Param #
lstm_1 (LSTM)             (None, 10, 256)     264192
lstm_2 (LSTM)             (None, 64)          82176
repeat_vector_1           (None, 10, 64)      0
lstm_3 (LSTM)             (None, 10, 64)      33024
lstm_4 (LSTM)             (None, 10, 256)     328704
time_distributed_1        (None, 10, 1)       257

Total params: 708353
Trainable params: 708353
Non-trainable params: 0