
Multimedia Tools and Applications

https://doi.org/10.1007/s11042-020-09277-8

Abnormal activity detection using shear transformed spatio-temporal regions at the surveillance network edge

Michael George1 · Babita Roslind Jose1 · Jimson Mathew2

Received: 12 September 2019 / Revised: 10 May 2020 / Accepted: 26 June 2020

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
This paper presents a method of detecting abnormal activity in crowd videos while considering the direction of the dominant crowd motion. One main goal of our approach is to be able to run at the edge of the surveillance network, close to the surveillance cameras, so as to reduce network congestion and decision latency. To capture motion features while considering the direction of the dominant crowd motion, we propose a generalised shear transform based spatio-temporal region. To detect abnormal activity, an autoencoder based method is adopted, considering the requirement of running the method at the network edge. During training, the autoencoder learns motion features for each spatio-temporal region from video frames containing normal activity. During testing, motion features from a spatio-temporal region that cannot be reconstructed satisfactorily by the autoencoder indicate abnormal activity. This approach allows coarse localisation as well as detection of abnormal activity. The approach demonstrated O(n) behaviour, with the ability to work at higher frame rates by trading off accuracy. The approach has been verified against recent works on standard abnormal activity datasets: the UCSD dataset and the Subway dataset.

Keywords Events · Actions and activity recognition · Internet of things · Video surveillance architectures · Image/video surveillance and analytics

Michael George
[email protected]

Babita Roslind Jose
[email protected]

Jimson Mathew
[email protected]

1 School of Engineering, Cochin University of Science and Technology, Kochi, Kerala 682022, India
2 Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna,
Bihar 801103, India

1 Introduction

Recent years have seen a boom in the number of surveillance cameras being installed for purposes like security [45], crowd control [57], and patient care [36]. This has consequently resulted in an explosion in the amount of video surveillance data that needs to be analysed, particularly in a smart city environment [46]. Two paradigms could be adopted for such a large scale video surveillance network, as seen in Fig. 1. In cloud computing the video is processed at a centralised location, while in edge computing [10, 40] the video is processed using models at the edge of the surveillance network. Cisco [11] predicts that by 2021 half of all workloads will need to be run outside datacenters, and suggests an edge computing based approach as a solution. Cisco also predicts a sevenfold increase [12] in video surveillance traffic on the internet between 2017 and 2022. To quantify the amount of data that would need to be sent to the cloud in a purely cloud based surveillance system, assume that the size of a single frame is 25 kB and the frame rate is 30 fps. Under this assumption, a single surveillance camera would need to send 64.8 GB per day to the cloud. Such a large data transfer requirement would add load to the data transfer network and incur a high data transfer cost. The concept of computation at the edge provides a solution to this problem.
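The 64.8 GB figure can be reproduced with a few lines of arithmetic; the minimal sketch below assumes only the 25 kB frame size and 30 fps rate stated above.

```python
# Daily upload from one camera: frame size x frame rate x seconds per day.
frame_kb = 25                              # assumed size of a single frame in kB
fps = 30                                   # assumed frame rate
kb_per_day = frame_kb * fps * 60 * 60 * 24
print(f"{kb_per_day / 1e6:.1f} GB/day")    # prints 64.8 GB/day
```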
Edge computing has recently been used to tackle surveillance problems like detecting
humans using a lightweight Convolutional Neural Network (CNN) [34, 35], lightweight
object tracking [51], drone-based tracking [8], communication [55], and crowd reidentification [33].
Fig. 1 Analysis of surveillance feeds under the cloud and edge computing paradigms. Cloud computing specific and edge computing specific steps are shown in red and green respectively

This work proposes to tackle the problem of abnormal activity detection using an edge computing technique. To the best of the authors' knowledge, this is the first such attempt that works without specialised hardware. The proposed technique is an improvement of our previous work [17]; it tackles some deficiencies of the earlier work and performs optimisations that improve run time on resource constrained edge devices.
In our previous work, an autoencoder based abnormal activity detection method that runs on a high performance system was implemented. The motion features proposed by Colque et al. [13], named Histograms of Optical Flow Orientation and Magnitude (HOFM), were captured from a novel parallelepiped spatio-temporal region based on an x-axis directed shear transform of the cell structure of Leyva et al. [26]. The decision on abnormal activity was made based on the reconstruction error of the autoencoder.
This work proposes an abnormal activity detection method that is suitable to be run on a processing system with low computing capability, as typically found at the edge of the surveillance network. The following are the improvements over our previous work:
1. The parallelepiped spatio-temporal region has been generalised into a shear transform based spatio-temporal region that considers shear along both the x-axis and the y-axis. The disadvantage that the border of the scene could not be captured by the earlier parallelepiped spatio-temporal region is mitigated here.
2. The computation of the spatio-temporal region has been optimised to lower the computational load.
3. The autoencoder model has been adapted to run on a computing system at the edge. Additionally, the adapted method is able to coarsely localise the part of the scene containing the abnormality.
4. A lightweight version of an existing deep learning based model has been implemented for comparison with the proposed approach. A detailed performance analysis of this and other related works against our proposed approach has been conducted using the UCSD [32] and Subway [1] datasets.
The rest of the paper is organised as follows: recent works related to edge computing and abnormal activity detection are presented in Section 2, and the proposed framework is described in Section 3. The experimental analysis and results are presented in Section 4 and Section 5 respectively. The conclusions drawn from the work are presented in Section 6.

2 Related works

Shi et al. [40] present an excellent overview of computing at the edge of the IoT network. This work cites the advantages of edge computing as faster response times, savings on network bandwidth, and data security and privacy.
The field of abnormal activity detection has recently witnessed a large number of works. We do not aim to present an exhaustive survey of such methods here; such a survey can be found in the work by Afiq et al. [2]. Only very recent and relevant works are presented. The methods of abnormal activity detection can broadly be classified into those using handcrafted features [31] and those using deep learned features [6].

2.1 Handcrafted feature based

The handcrafted feature based methods can further be subdivided into low-level and high-level methods. The low-level methods use basic image qualities like appearance and motion. The high-level methods use semantically richer features like trajectories.

The low-level feature based methods are more amenable to use in crowded environments. The optical flow based method introduced by Colque et al. [13, 14] created features based on the binning of multiple optical flow measurements like magnitude, orientation, and entropy. Other low-level feature based methods use texture [27, 47], 3-D gradients [23], optical flow [7, 28], and gray level co-occurrence [29]. There are also works that combine low-level features like optical flow and gradient [53, 54]. Hu et al. [21] proposed a variant of the texture feature LBP (Local Binary Pattern) that is captured from a spatio-temporal region in the shape of a squirrel cage, with an LDA (Latent Dirichlet Allocation) based method used for abnormality detection.
The trajectory based works usually struggle in crowded environments, where clutter hampers trajectory construction. Some recent methods have adopted techniques to overcome this deficiency. Biswas and Babu [3] use nearby trajectories to curtail the influence of noise in trajectories. Rabiee et al. [37] combine interest point based trajectories with optical flow based features.

2.2 Deep learning feature based

Recently there has been a wave of deep learning based works that promise excellent accuracy. Fan et al. [16] applied a Gaussian mixture based autoencoder to separately model appearance and motion. The first autoencoder uses convolutional features captured from RGB data to model appearance, while the other uses convolutional features captured from dynamic flow [48] to model motion. The final decision on abnormality and localisation is made by fusing the scores from the two streams. Yang et al. [52] used a combination of an autoencoder and multiple convolutional LSTMs (Long Short Term Memory) to model foreground extracted with a robust PCA (Principal Component Analysis) based method. To reduce the effect of the background, a Euclidean loss function is also adopted. Sun et al. [42] constructed a dictionary based method for general abnormal activity detection using features learned with a VAE (Variational Auto-Encoder). They tested their abnormal activity detection method on network data, images, and videos. Ravanbakhsh et al. [38] used temporal analysis of binary patterns captured from deep features to detect abnormality. These binary patterns were combined with optical flow features to improve performance.
Hasan et al. [18] propose dual autoencoders: one is fully convolutional while the other is trained on handcrafted shape and motion features. The reconstruction error of the autoencoder is used as the basis for detecting abnormality. The work by Chong and Tay [9] used a convolutional LSTM based autoencoder model to detect abnormality. The spatio-temporal autoencoder model was trained with normal video volumes, and abnormality was detected based on the inability of this autoencoder to reconstruct test video volumes. Wang et al. [49] and Zhou et al. [56] improved upon this model [9] by using shortcut connections to increase performance. Lei et al. [25] propose a framework consisting of three networks that respectively predict optical flow, use spatial information, and use temporal information. Bouindour et al. [5] use a one-class neural network to detect abnormal video events. The network consists of a convolutional autoencoder learned using two loss functions: one loss function is used to obtain robust features while the other ensures compactness of the representation.

2.3 Others

There are some methods that cannot be strictly classified into either of the above two classes; a few such works are briefly explained next. Lu et al. [30] proposed a sparse learning based method that works at high speeds approaching 1000 frames per second. They were able to achieve high speeds by learning possible combinations of vectors from the dictionary of features beforehand. They tested using both handcrafted and deep features. Khan et al. [22] presented an FPGA based hardware implementation of abnormal activity detection suitable for the IoT domain. A superpixel based feature is combined with Gaussian discriminant analysis in their work. Colque et al. [15] learned behaviour patterns from the interaction between humans and objects. Their high-level semantic learning permits learning in one scene to be used in another scene with a similar context, but their method is not suitable for crowded scenes captured by the low resolution cameras typically found in surveillance settings.
The abnormal activity detection methods based on deep learning present impressive accuracy results. But such methods are computationally intensive, and face issues related to low memory and excessive computation time when tested on edge devices. In an IoT edge setting, a compromise between accuracy and resource utilisation needs to be struck. In this work, an improved version of our work [17] suitable for an IoT edge environment is presented.

3 Proposed framework

The proposed framework for performing abnormal activity detection is depicted in Fig. 2. The details of the steps involved in the proposed framework are as follows:
1. The video frames captured by the surveillance camera are transmitted up to the centralised location for model creation. The centralised location has the advantage of the higher computing resources that might be required by the training process needed for model creation.
2. The trained model is then transmitted down to the edge computing node. This model is then used to perform the analysis on the video frames at the edge device itself.

3. The analysis results are continually transmitted up to the centralised location for monitoring and final decision making.

Fig. 2 The framework used in the proposed edge-based abnormal activity detection
The abnormal activity detection method used is an autoencoder based model that learns the HOFM feature [13] from a shear transformed spatio-temporal region. The autoencoder is trained with features captured from training dataset frames that contain no abnormal activity. The design of the feature used, the design of the region from which the feature is captured, and the method to detect an abnormality are described next.

3.1 Feature design and spatio-temporal region design

The HOFM feature is obtained from optical flow vectors [4]. The optical flow vectors are extracted from pixel locations where the difference between two frames is greater than a threshold (set empirically as 30). The optical flow vectors from a spatio-temporal region are put into histogram bins based on both their magnitude and direction, as shown in Fig. 3. As seen in the figure, the direction is quantised into 4 directions labeled Q1 through Q4. The magnitude is quantised into 4 intervals, with three intervals being equal and the fourth interval extending to infinity to account for outlier magnitudes. The design of the HOFM feature involves the selection of the number of bins used to quantise the magnitude and direction, and of the magnitude that determines the final interval.
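As a concrete illustration of this binning, the following minimal sketch builds the 16-d HOFM histogram for one region, assuming 4 orientation bins (Q1-Q4) and magnitude interval edges matching Fig. 3; the helper name `hofm` and the sample vectors are purely illustrative.

```python
import numpy as np

def hofm(flow_vectors, mag_edges=(1.0, 1.5, 2.0)):
    """Bin optical flow vectors by direction (Q1-Q4) and magnitude.
    The three inner magnitude intervals are equal; the fourth bin
    extends to infinity, matching Fig. 3."""
    hist = np.zeros((4, 4))                        # orientation x magnitude
    for dx, dy in flow_vectors:
        mag = np.hypot(dx, dy)
        ang = np.degrees(np.arctan2(dy, dx)) % 360
        q = int(ang // 90)                         # quadrant: 0..3 = Q1..Q4
        m = int(np.searchsorted(mag_edges, mag))   # index 3 = open-ended bin
        hist[q, m] += 1
    return hist.flatten()                          # 16-d HOFM vector

print(hofm([(1.2, 0.3), (-0.6, 0.9), (0.1, -2.5)]))
```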
In typical surveillance settings, the camera would be situated at a height to prevent tampering and theft and would be looking down on the scene. Consequently, there will be a lack of spatial uniformity in the captured features across the frame. That is, objects nearer to the camera will provide more information than objects farther away.

Fig. 3 The HOFM feature construction using optical flow vectors captured from the sheared spatio-temporal region

A novel non-uniform cell structure to handle this was proposed by Leyva et al. [26]. This non-uniform cell
structure was adapted in our previous work [17], in which we proposed a parallelepiped
shaped spatio-temporal region that was obtained as a shear transform along the x-axis of a
cuboidal spatio-temporal region, with the factor of shear decided by the dominant direction
of motion. The superiority of the parallelepiped spatio-temporal region was experimentally
demonstrated.
In the present work, a more generalised approach is followed. Shear along both the x-axis and the y-axis is used, eliminating the effects of excessive shear along a single direction, which would result in extremely narrow spatio-temporal regions. The shear transform of the cuboidal spatio-temporal regions can be applied along the x-axis or the y-axis. This operation is represented in Fig. 4. The choice of x-axis or y-axis shear depends on which is better able to capture optical flow vectors in a particular direction, as seen in Fig. 4c. The type of shear used is selected depending on the direction of the dominant motion, as shown in Fig. 4d.

Fig. 4 The shear transform applied along the x-axis and along the y-axis in parts (a) and (b). In part (c) the same optical flow vectors (marked in red) are captured better by the y-shear spatio-temporal region than by the x-shear spatio-temporal region. Part (d) represents the selection of the shear type and the determination of the shear factor, sf

The shear factor along one axis is limited to 45° to
prevent excessive shear resulting in very narrow spatio-temporal regions. The shear factor is represented by the symbol sf. The type of shear and the shear coefficient m associated with the values of sf are tabulated in Table 1. Additionally, over our previous work, the spatio-temporal regions at the border of the frame are included, resulting in spatio-temporal regions at the border that are not parallelepiped shaped. The complete algorithm for the creation of the spatio-temporal cells is given in the next subsection. To save computational time, the pixel locations that comprise each spatio-temporal region are saved in memory beforehand.

3.2 Creation of shear transformed spatial cell

The method used for creating the shear transformed spatial cell is a generalised version of the one we presented previously [17], with shear along both the x-axis and the y-axis considered and with the added ability to include more of the region towards the edge of the frame. The steps for shear transformed spatial cell generation are as follows:
1. For a frame of size w×h as seen in Fig. 5a, for x-axis shear let X = 2w and Y = h, and for y-axis shear let X = w and Y = 2h. The non-uniform cells are created as follows (a code sketch of this computation appears after this list):
(a) Let y_0 be the height of the smallest cell and a be the factor by which the cells grow.
(b) The number of cells that can fit vertically is computed as,

k = log_a((Y/y_0) × (a − 1) + 1) − 1    (1)

Based on this, the height of the smallest cell is recomputed as,

y_0 = Y × (a − 1) / (a^(k+1) − 1)    (2)

(c) The heights of the cells as they grow can be computed by the recursion y_{l+1} = a × y_l. The number of cells and the cell heights are increased until they cover the vertical height Y.
(d) The cells are filled along the width dimension on the right side, starting at the centre of the frame and continuing to X/2.
(e) If the cells don't cover the entire width of X/2 or height of Y, the cell dimensions are iteratively increased until the condition is met.
(f) The cell dimensions determined for the right side are mirrored on the left side of the frame.
2. To the cells created above, apply an x-axis shear transform or a y-axis shear transform
(see Fig. 4) depending on the direction of the dominant motion.

Table 1 Shear factor

Shear factor, sf     Type of shear            Shear coefficient, m
0 < sf ≤ 1           Positive y-axis shear    0 < m ≤ 1
1 < sf < ∞           Positive x-axis shear    1 > m > 0
−1 ≤ sf < 0          Negative y-axis shear    −1 ≤ m < 0
−∞ < sf < −1         Negative x-axis shear    0 > m > −1
sf = 0, ∞, −∞        No shear                 m = 0
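The mapping from sf to the shear coefficient is not spelled out beyond the ranges above; a minimal sketch consistent with Table 1, assuming m = sf for y-axis shear and m = 1/sf for x-axis shear, would be:

```python
import math

def shear_params(sf):
    """Map the shear factor sf to (shear type, shear coefficient m).
    Assumption: m = sf for y-axis shear and m = 1/sf for x-axis shear,
    which reproduces the ranges listed in Table 1."""
    if sf == 0 or math.isinf(sf):
        return "none", 0.0
    if abs(sf) <= 1:
        return "y-axis", float(sf)     # 0 < |sf| <= 1  ->  0 < |m| <= 1
    return "x-axis", 1.0 / sf          # |sf| > 1       ->  0 < |m| < 1

print(shear_params(3))    # ('x-axis', 0.33...), the sf used for UCSD Ped1
print(shear_params(-3))   # ('x-axis', -0.33...), the sf used for Subway Entrance
```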

Fig. 5 (a) The dimensions of the non-uniform cells that form the basis for the proposed shear transformed region. (b) The regions after the proposed x-axis shear. (c) The regions after the proposed y-axis shear. Cells are indicated by varying shades of grey; the white region is not covered by any cell

3. Ensure the cells lie within the boundary of the frame, with the cells right at the edges
having a set minimum area within the frame.
The results of applying the proposed x-axis and y-axis shear are shown in Fig. 5b and c respectively.
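To make steps 1(a)-(c) concrete, a minimal sketch of the cell-height computation of (1) and (2) follows; the function name and the example values (those reported for UCSD Ped2 in Section 5) are illustrative.

```python
import math

def cell_heights(Y, y0, a):
    """Heights of the non-uniform cells per Eqs. (1)-(2): heights grow
    geometrically (y_{l+1} = a * y_l) and together tile the extent Y."""
    # Eq. (1), truncated so that k + 1 whole cells fit vertically.
    k = int(math.log((Y / y0) * (a - 1) + 1, a)) - 1
    # Eq. (2): recompute the smallest height so the cells cover Y exactly.
    y0 = Y * (a - 1) / (a ** (k + 1) - 1)
    return [y0 * a ** l for l in range(k + 1)]

heights = cell_heights(Y=480, y0=90, a=1.25)
print([round(h, 1) for h in heights], round(sum(heights), 1))  # sums to 480.0
```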

3.3 Anomaly detection and localisation

The HOFM vectors (HV#1, HV#2, ..., HV#N_r) captured from the N_r spatio-temporal regions extracted from a set of frames (sets of 5 frames each in this work) are concatenated into HV and given as input to the autoencoder. The autoencoder consists of an encoder (φ) and a decoder (ψ). During training, the encoder and decoder are learned so that the output vector HV_R closely matches the input HV. This can be represented as,

φ, ψ = argmin_{φ,ψ} ||HV − HV_R||²    (3)

Since HV_R can be expressed as,

HV_R = ψ(φ(HV))    (4)

the learning represented in (3) can be rewritten as,

φ, ψ = argmin_{φ,ψ} ||HV − ψ(φ(HV))||²    (5)

The ability of the autoencoder to reconstruct the input vector at the output with minimum error is used for abnormal activity detection and localisation. During testing, if an input vector HV that is significantly different from the vectors used for training is given as input to the autoencoder, it will struggle to reconstruct this vector. This special property of the autoencoder is used for abnormal activity detection and coarse localisation. The steps for abnormal activity detection and coarse localisation performed in the edge device (unless otherwise stated as being performed in the cloud) are detailed in Algorithm 1. A coarse localisation of abnormality at the spatio-temporal region level is performed based on thresholds for each spatio-temporal region.
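A minimal sketch of this detector follows, assuming the autoencoder design reported in Section 5 (128-d hidden layer, ReLU encoder, sigmoid decoder, Adadelta with learning rate 0.01); the threshold handling is simplified to a single scalar rather than the per-region thresholds of Algorithm 1.

```python
import numpy as np
import tensorflow as tf

# Dimensions follow Section 5: e.g. for UCSD Ped1, 15 shear transformed
# regions x 16-d HOFM features = 240-d concatenated input vector.
n_regions, feat_dim = 15, 16
input_dim = n_regions * feat_dim

autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),            # encoder phi
    tf.keras.layers.Dense(input_dim, activation="sigmoid"),   # decoder psi
])
autoencoder.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.01),
                    loss="mse")
# Training uses HOFM vectors captured from normal frames only, e.g.:
# autoencoder.fit(HV_train, HV_train, epochs=2000)

def is_abnormal(hv, threshold):
    """Flag a test vector whose reconstruction error exceeds a threshold."""
    hv = hv.reshape(1, -1)
    err = float(np.mean((hv - autoencoder.predict(hv, verbose=0)) ** 2))
    return err > threshold
```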

3.4 Lightweight deep network for comparison

To compare the performance of the proposed method against deep learning methods, a lightweight version of the model proposed by Chong and Tay [9], named Chong-Lite, is implemented. It utilises a spatio-temporal deep autoencoder based framework. The spatial variations in video volumes are captured using convolution based filters while the temporal variations are captured using a convolutional LSTM. The model shown in Fig. 6 takes as input a video volume of size 10×224×224. This volume is made up of ten consecutive video frames that have been resized to 224×224. The video volume is then passed through an autoencoder architecture consisting of convolutional layers, deconvolutional layers and convolutional LSTM layers. The model is trained to reconstruct the input video volume from video frames containing no abnormality. The decision on abnormality is based on the regularity score [9]. The important layers that make up this deep model are discussed next.

Fig. 6 The proposed lightweight spatio-temporal autoencoder
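Since the exact layer sizes of Chong-Lite are not listed here, the sketch below only illustrates the overall pattern (convolutions for space, convolutional LSTMs for time, deconvolutions to reconstruct); all filter counts and kernel sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Spatio-temporal autoencoder over a 10 x 224 x 224 grayscale video volume.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 224, 224, 1)),
    # Spatial encoder: per-frame convolutions shrink 224 -> 56 -> 28.
    layers.TimeDistributed(layers.Conv2D(64, 11, strides=4, padding="same",
                                         activation="relu")),
    layers.TimeDistributed(layers.Conv2D(32, 5, strides=2, padding="same",
                                         activation="relu")),
    # Temporal bottleneck: convolutional LSTMs model frame-to-frame variation.
    layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True),
    layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True),
    layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True),
    # Spatial decoder: deconvolutions restore 28 -> 56 -> 224.
    layers.TimeDistributed(layers.Conv2DTranspose(64, 5, strides=2,
                                                  padding="same",
                                                  activation="relu")),
    layers.TimeDistributed(layers.Conv2DTranspose(1, 11, strides=4,
                                                  padding="same",
                                                  activation="sigmoid")),
])
model.compile(optimizer="adam", loss="mse")  # trained to reconstruct its input
```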

3.4.1 Convolutional LSTM layer

The LSTM layer [43] was proposed for the task of sequence learning. This layer overcame the vanishing/exploding gradient problem encountered in a typical Recurrent Neural Network (RNN). To accomplish this, the LSTM incorporates additional gates, called the output gate, input gate, and forget gate, to control the flow of gradients. This layer was extended to deal with multi-dimensional sequences by the Convolutional LSTM [41]. It can be represented using the following equations,

i_t = σ(W_i ∗ h_{t−1} + U_i ∗ x_t + V_i ⊙ s_{t−1} + b_i)    (6)

f_t = σ(W_f ∗ h_{t−1} + U_f ∗ x_t + V_f ⊙ s_{t−1} + b_f)    (7)

o_t = σ(W_o ∗ h_{t−1} + U_o ∗ x_t + V_o ⊙ s_{t−1} + b_o)    (8)

s_t = f_t ⊙ s_{t−1} + i_t ⊙ tanh(W_s ∗ h_{t−1} + U_s ∗ x_t + b_s)    (9)

h_t = o_t ⊙ tanh(s_t)    (10)

where i, f, o, s and h are the input gate, forget gate, output gate, state, and hidden state respectively. The symbol ∗ represents convolution, while ⊙ represents a Hadamard product. The parameters of the LSTM layer, namely the various U, V, W and b, are learned using the Back Propagation Through Time (BPTT) technique [50].

3.4.2 Convolutional layer

The convolution layer applies numerous filters of fixed size to extract feature maps. These filters are learned during the training process through the Back Propagation technique [39] to minimise a loss function. The filters are moved over the input tensor or input feature map (by an amount determined by the stride) to compute the cross-correlation based values that constitute the output feature map.

3.4.3 Deconvolution layer

The deconvolution layer performs the inverse of the convolution operation. This operation is implemented as a combination of upsampling and convolution. The input is upsampled, based on the stride factor, by the insertion of zero values. The upsampled tensor is then given as input to a convolution layer with multiple filters.
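A quick illustration of this upsampling behaviour, as a minimal sketch: with stride 2, a transposed convolution doubles the spatial size of its input.

```python
import tensorflow as tf

# A 4x4 feature map becomes 8x8 after a stride-2 transposed convolution,
# i.e. zero-insertion upsampling followed by an ordinary convolution.
x = tf.random.normal([1, 4, 4, 3])            # (batch, height, width, channels)
deconv = tf.keras.layers.Conv2DTranspose(filters=8, kernel_size=3,
                                         strides=2, padding="same")
print(deconv(x).shape)                        # (1, 8, 8, 8)
```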

4 Experimental analysis

The proposed method is tested on standard datasets used to test abnormal activity detection, i.e., UCSD [32] and Subway [1]. The train/test frame split, frame size, and location of each dataset are shown in Table 2. The abnormalities in the UCSD dataset include a minitruck, bicycles, and skateboarders moving on a pedestrian walkway. In the case of the Subway dataset, the abnormalities include a moving train and people running or moving opposite to the subway entrance/exit. The performance of the proposed method is quantified using standard metrics, namely the Area Under the Curve (AUC) and the Equal Error Rate (EER), determined from the Receiver Operating Characteristic (ROC) curve. The ROC curve is a plot of the True Positive Rate (TPR) versus the False Positive Rate (FPR).
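For reference, a minimal sketch of how AUC and EER can be read off the ROC curve; the frame-level labels and scores here are random stand-ins.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 500)   # 1 = abnormal frame (stand-in data)
scores = rng.random(500)           # higher score = more abnormal

fpr, tpr, _ = roc_curve(labels, scores)
print("AUC:", auc(fpr, tpr))
# EER: the operating point where the false positive rate equals the miss rate.
eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]
print("EER (%):", 100 * eer)
```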
The hardware used to simulate the processing at the edge was a Raspberry Pi 3 Model B. A system with an Intel Core i7 processor coupled with a GTX 1050 Ti GPU acted as the central processing node, which learns the autoencoder model from the HOFM features captured from the spatio-temporal regions of the training frames. The autoencoder model was created using Keras with a TensorFlow backend and then converted to a TensorFlow model. This TensorFlow model is run on the Raspberry Pi using the model inference support provided by the DNN (Deep Neural Network) module of the OpenCV library. The HOFM features from the testing frames are captured on the Raspberry Pi using the optical flow extraction function of the OpenCV library and provided to the autoencoder model to make a decision on abnormality.
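The edge-side inference path might look like the following sketch; the frozen-graph filename and the 240-d input are assumptions for illustration (OpenCV's DNN module loads frozen TensorFlow graphs via readNetFromTensorflow).

```python
import cv2
import numpy as np

# Load the frozen TensorFlow autoencoder on the Raspberry Pi.
net = cv2.dnn.readNetFromTensorflow("autoencoder.pb")  # hypothetical filename

hv = np.random.rand(1, 240).astype(np.float32)  # stand-in concatenated HOFM vector
net.setInput(hv)
hv_r = net.forward()                            # reconstruction from the autoencoder
error = float(np.mean((hv - hv_r) ** 2))        # reconstruction error drives the decision
```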

5 Results

The results of the experiments that were conducted are presented in this section. The design parameters (the design of the HOFM feature and the type and amount of shear used) and the results are presented next in subsections for each of the datasets. In the design of the HOFM feature, the number of bins for orientation and magnitude is decided empirically, with values similar to those used by George et al. [17]. This is done so as to attain good performance while keeping the feature vector size small. The autoencoder is designed to have a hidden layer of 128-d, with the input layer (and output layer) dimension set as the product of the number of shear transformed regions (covering a set of five consecutive frames) and the dimension of the HOFM feature extracted from each shear transformed region. The activation function used in the nodes of the encoder section of the autoencoder is ReLU, and for the nodes of the decoder section it is sigmoid. The optimiser used is Adadelta with a learning rate of 0.01, and the model is trained for 2000 epochs.

Table 2 Dataset details

Dataset name    Train/test frames    Frame size / frame rate    Location
UCSD Ped1       6800/7200            476 by 316 / 10 fps        Pedestrian walkway
UCSD Ped2       2550/2010            720 by 480 / 10 fps        Pedestrian walkway
Subway Ent.     30000/113540         512 by 384 / 25 fps        Subway entrance
Subway Ext.     7500/57400           512 by 384 / 25 fps        Subway exit

5.1 UCSD

The spatio-temporal region was constructed using five frames. The number of orientation
and magnitude bins used for creating the HOFM feature is four each, resulting in a 16-d
vector. The results obtained are tabulated in Table 3.

5.1.1 UCSD Ped1

The best performance is obtained for a shear factor (sf) of 3 and a smallest cell size (y_0) of 50. The spatio-temporal regions for this shear value are a result of x-axis shear towards the positive axis. The factor by which the cells grow, a, is set as 1.1. The number of shear transformed regions from which the feature is captured is 15. The heatmap of Fig. 7 demonstrates the influence of the cell size and the shear factor on abnormality detection performance. This dataset has pedestrians moving at an angle with respect to the horizontal, and hence the performance is greatest at the particular value of sf for which the shear shaped spatio-temporal region best captures the optical flow features.

Fig. 7 Heatmap of the AUC and EER metrics of the UCSD Ped1 dataset on varying the shear factor (sf) and the minimum cell size (y_0)

5.1.2 UCSD Ped2

The best performance is obtained for a shear factor (sf) of 0 and a smallest cell size (y_0) of 90. The spatio-temporal regions for this shear value are a result of no shear. The factor by which the cells grow, a, is set as 1.25. The number of shear transformed regions from which the feature is captured is 6. The heatmap of Fig. 8 illustrates the influence of the cell size (by varying y_0 and a) on abnormal activity detection performance as tested on the UCSD Ped2 dataset. This dataset has pedestrians moving along the horizontal, and hence the sf value is set as 0, i.e. features are captured from spatio-temporal regions with no shear. It was experimentally observed that using spatio-temporal regions with a non-zero shear factor resulted in decreased performance, as the UCSD Ped2 dataset has dominant pedestrian motion along the horizontal.

Fig. 8 Heatmap of the AUC and EER metrics of the UCSD Ped2 dataset on varying the cell growth factor (a) and the minimum cell size (y_0)

Table 3 UCSD dataset results comparison. Best among non-deep learning and deep learning approaches in bold and italics respectively

                                   Ped1                  Ped2
Approach (Non deep learning)    AUC      EER (%)      AUC      EER (%)
Chaudhry [7, 14]                0.69     36.4         0.82     25.9
Colque [13, 14]                 0.727    33.3         0.87     20.7
Colque [14]                     0.727    33.1         0.875    20.0
George [17]                     0.78     29.49        0.91     15.78
Khan [22]                       0.811    23.7         0.938    9.80

Approach (Deep learning)        AUC      EER (%)      AUC      EER (%)
Hasan [18]                      0.81     27.9         0.90     21.7
Chong [9]                       0.899    12.5         0.874    12.0
Zhou [49, 56]                   0.905    13.5         0.889    11.5
Lei [25]                        0.844    –            0.95     –
Chong-Lite                      0.71     32.88        0.897    17.75

Proposed approach               0.8475   22.98        0.954    10.7

5.2 Subway

The spatio-temporal region was constructed using five frames. The number of orientation and magnitude bins used for creating the HOFM feature is four and six respectively, resulting in a 24-d vector. The results obtained are tabulated in Table 4.

5.2.1 Subway entrance

The best performance is obtained for a shear factor (sf) of −3 and a smallest cell size (y_0) of 60. The spatio-temporal regions for this shear value are a result of x-axis shear towards the negative axis. The factor by which the cells grow, a, is set as 1.29. The number of shear transformed regions from which the feature is captured is 8.

5.2.2 Subway exit

The best performance is obtained for a shear factor (sf) of 1.01 and a smallest cell size (y_0) of 50. The spatio-temporal regions for this shear value are a result of x-axis shear towards the positive axis. The factor by which the cells grow, a, is set as 1.2. The number of shear transformed regions from which the feature is captured is 15.

Table 4 Subway dataset results comparison. Best among non-deep learning and deep learning approaches in bold and italics respectively

                                   Entrance              Exit
Approach (Non deep learning)    AUC      EER (%)      AUC      EER (%)
Chaudhry [7, 14]                0.774    24.4         0.8      25.1
Colque [13, 14]                 0.815    23.5         0.845    18.8
Colque [14]                     0.816    22.8         0.849    17.8
George [17]                     0.84     22.68        0.86     19.58

Approach (Deep learning)        AUC      EER (%)      AUC      EER (%)
Hasan [18]                      0.943    26.0         0.807    9.9
Chong [9]                       0.847    23.7         0.94     9.5
Zhou [49, 56]                   0.916    16.6         0.984    4.9
Chong-Lite                      0.706    35.7         0.768    30.8

Proposed approach               0.849    23.4         0.835    19.9

5.3 Timing results

Our timing results, reported in Table 5, include the time taken for optical flow extraction. The optical flow extraction accounts for between 60% (Subway Exit dataset) and 90% (UCSD Ped2) of the time taken for feature capture. In pursuit of quicker run times we have implemented an optimisation (saving the pixel locations of each spatio-temporal region in memory beforehand) that results in time savings of 60 ms per frame on average compared to our previous work [17]. The frame processing rate obtained by the proposed method is between 3.05 fps and 23.12 fps, though there is the possibility of a trade-off between speed and abnormal activity detection accuracy, as will be seen below.

Table 5 The timing results for the proposed approach, with the average per frame time taken for different components. The average frame processing rate per second is also reported

Dataset        Feature capture    Forward pass    Abnormality detection    Avg. fps
UCSD Ped1      121.10 ms          67.32 μs        9.53 μs                  8.25 fps
UCSD Ped2      327.90 ms          87.59 μs        5.81 μs                  3.05 fps
Subway Ent.    46.06 ms           64.23 μs        5.67 μs                  21.67 fps
Subway Ext.    43.15 ms           91.67 μs        8.56 μs                  23.12 fps

The frame size of the frames used for processing can affect the speed and performance metrics. The effect of frame size on speed and performance is depicted in Fig. 9.

Fig. 9 Variation of the AUC, frame processing time (in ms), and frame rate (in fps) metrics on varying the frame size (written next to each data point) of the frames of the UCSD dataset

It can be seen that there is roughly O(n) time-complexity behaviour, observed from the frame processing time versus pixels-per-frame curves, which demonstrate a roughly linear relationship. Also, an increase in frame size results in a decrease in frame rate while the abnormal activity detection performance metrics improve. Both observations can be explained by the fact that an increase in frame size requires optical flow vectors to be extracted from more points (hence decreasing the speed), but results in better feature capture and increased abnormal activity detection performance.
Thus real-time performance is possible with a slight decrease in abnormal activity detection performance. Consider for example the UCSD Ped2 dataset, which has a frame rate of 10 fps. The proposed approach attains frame processing rates close to the required 10 fps when the frame size is 480 by 320, with only a small decrease in abnormal activity detection performance.

5.4 Comparison

The proposed approach displays better performance in comparison to the previous method [17] (see Tables 3 and 4) for all datasets except Subway Exit. Both methods use autoencoder reconstruction error for abnormal activity detection. The reconstruction error should be close to zero for normal frames and greater for abnormal frames. The comparison of the autoencoder reconstruction error of the proposed approach against the previous method [17] is shown in Fig. 10. In most cases the reconstruction error for normal frames is lower for the proposed approach, implying that it is better able to learn regularity. In Fig. 10a the abnormality is a skateboarder in frames 40 to 135. As can be seen, the reconstruction errors in the case of [17] for frames 136 to 200 (normal frames) are high, while they are low for the proposed approach. The abnormality in Fig. 10b is a minitruck in frames 31 to 180. The reconstruction error for the proposed approach is higher than that in [17] for frames 60 to 150, while this quantity is close to zero for normal frames. In Fig. 10c the abnormality is a man moving in the opposite direction to the subway entrance in frames 39237 to 39339. The reconstruction error for [17] is mostly higher for the normal frames and lower for the abnormal frames when compared to the proposed approach. For Fig. 10d, all the frames are normal, with a man moving close to the camera. The reconstruction error in the proposed approach is higher, and remains higher for longer, than that in [17]. This is due to the spatio-temporal region at the border capturing the man's motion in the proposed approach, while in [17] this motion is not covered by any spatio-temporal region. The high reconstruction cost can also be explained by the paucity of similar motion in the training data. The above explains the lower performance of the proposed approach on the Subway Exit dataset as compared to [17].

Fig. 10 Proposed approach compared to previous method [17]. The abnormal frames are indicated in red

Table 6 Frame rate of the proposed approach compared against deep learning models when run on an edge device

Proposed approach    Chong [9]    Zhou [49, 56]    Chong-Lite
3.05–23.12 fps       0.156 fps    0.128 fps        0.413 fps
Our proposed approach displays competitive performance on UCSD Ped1 and UCSD Ped2 when compared to the method of [22] (see Table 3), which is the method closest in application to our approach. Their method requires additional custom hardware in the form of an FPGA, while our method uses only a general purpose edge processor as found in a Raspberry Pi. Their work reports a frame processing rate of approximately 400 fps because of the dedicated hardware.
On comparing the performance of the proposed approach against deep learning methods (see Tables 3 and 4), it can be seen that it gives performance comparable to most deep learning methods. Although the deep learning methods give better performance, they are unable to run at real time speeds on edge devices, as seen in Table 6.

5.5 Future work

A direction for future research that is obvious from the results obtained in this work is the development of lightweight deep learning models for abnormal activity detection. Such a model needs to perform at real time speeds on edge devices while still performing acceptably. Of particular interest in this pursuit could be the use of Inception [44] layers, which promise reduced model size. Inspiration may also be drawn from the MobileNet [20] model, which has performed well for the object detection task at the edge. Another promising line of research could be the use of Temporal Convolutional Networks (TCN) [19, 24] for video abnormal activity detection.

6 Conclusion

The paper motivated the need to process frames at the edge of a large video surveillance network from the viewpoint of latency and bandwidth usage. A generalised shear transform based approach suitable for edge-based abnormal activity detection was proposed. This approach was tested on an edge device using standard abnormal activity detection datasets. The AUC/EER (in %) metrics on UCSD Ped1, UCSD Ped2, Subway Entrance, and Subway Exit are 0.8475/22.98, 0.954/10.7, 0.849/23.4, and 0.835/19.9 respectively. The frame processing rates obtained on the edge device were between 3.05 and 23.12 fps. It was also observed that frame rates could be improved, with a trade-off in accuracy, by lowering the frame size.

Acknowledgments This work was supported by Technical Education Quality Improvement Programme
(TEQIP) Research Seed Money Project (No. TEQIP/PTRA/2017); APJ Abdul Kalam Technological Univer-
sity - Center for Engineering Research & Development (APJAKTU-CERD) Research Seed Money Project
(No. KTU/RESEARCH 2/4068/2019).

References

1. Adam A, Rivlin E, Shimshoni I, Reinitz D (2008) Robust real-time unusual event detection using
multiple fixed-location monitors. IEEE Trans Pattern Anal Mach Intell 30(3):555–560
2. Afiq AA et al (2019) A review on classifying abnormal behavior in crowd scene. J Vis Commun Image
Represent 58:285–303
3. Biswas S, Babu RV (2017) Anomaly detection via short local trajectories. Neurocomputing 242:
63–72
4. Bouguet JY (2000) Pyramidal implementation of the Lucas Kanade feature tracker description of the
algorithm. OpenCV Document, Intel, Microprocessor Research Labs
5. Bouindour S, Hu R, Snoussi H (2019) Enhanced convolutional neural network for abnormal event
detection in video streams. IEEE Int Conf Artif Intell Knowl Eng (AIKE) 172–178
6. Chalapathy R, Chawla S (2019) Deep learning for anomaly detection: a survey. arXiv:1901.03407
7. Chaudhry R, Ravichandran A, Hager G, Vidal R (2009) Histograms of oriented optical flow and Binet-
Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. IEEE Conf
Comput Vis Pattern Recognit 1932–1939
8. Chen N, Chen Y, Blasch E, Ling H, You Y, Ye X (2017) Enabling smart urban surveillance at the edge.
IEEE Int Conf Smart Cloud 109–119. https://doi.org/10.1109/SmartCloud.2017.24
9. Chong YS, Tay YH (2017) Abnormal event detection in videos using spatiotemporal Autoencoder. Int
Symp Neural Netw. 189–196
10. Cicirelli F et al (2018) Edge computing and social internet of things for Large-Scale smart environments
development. IEEE Internet Things J 5(4):2557–2571. https://doi.org/10.1109/JIOT.2017.2775739
11. Cisco Annual Internet Report, 2018–2023 - Whitepaper (2020)
12. Cisco Visual Networking Index: Complete Forecast Update, 2017–2022 - White Paper (2018)
13. Colque RVHM, Junior CAC, Schwartz WR (2015) Histograms of optical flow orientation and magnitude
to detect anomalous events in videos. SIBGRAPI Conf Graph Patterns Images 126–133
14. Colque RVHM, Caetano C, de Andrade MTL, Schwartz WR (2017) Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos. IEEE Trans Circ Sys Video Tech 27(3):673–682
15. Colque RM et al (2018) Novel anomalous event detection based on human-object interactions. Int Conf
Comput Vis Theory Appl 293–300
16. Fan Y, Wen G, Li D, Qiu S, Levine MD (2018) Video anomaly detection and localization via gaussian
mixture fully convolutional variational autoencoder. arXiv:1805.11223
17. George M, Jose BR, Mathew J, Kokare P (2019) Autoencoder-based abnormal activity detection using parallelepiped spatiotemporal region. IET Comput Vis 13(1):23–30. https://doi.org/10.1049/iet-cvi.2018.5240
18. Hasan M, Choi J, Neumann J, Roy-Chowdhury AK, Davis L (2016) Learning temporal regularity in video sequences. IEEE Conf Comput Vis Pattern Recognit 733–742. https://doi.org/10.1109/CVPR.2016.86
19. He Y, Zhao J (2019) Temporal convolutional networks for anomaly detection in time series. J Phys Conf
Ser 1213
20. Howard AG et al (2017) Mobilenets: efficient convolutional neural networks for mobile vision
applications. arXiv:1704.04861
21. Hu X, Huang Y, Gao X, Luo L, Duan Q (2019) Squirrel-cage local binary pattern and its application in
video anomaly detection. IEEE Trans Inf Forensics Secur 14(4):1007–1022
22. Khan MUK, Park H, Kyung C (2019) Rejecting motion outliers for efficient crowd anomaly detection.
IEEE Trans Inf Forensics Secur 14(2):541–556. https://doi.org/10.1109/TIFS.2018.2856189
23. Klaser A, Marszalek M, Schmid C (2008) A spatio-temporal descriptor based on 3D-gradients. Br Mach
Vis Conf 275:1–10
24. Lea C, Flynn M, Vidal R, Reiter A, Hager G (2017) Temporal convolutional networks for
action segmentation and detection. IEEE Conf Comput Vis Pattern Recognit (CVPR) 1003–1012.
https://doi.org/10.1109/CVPR.2017.113
25. Lei Z, Deng F, Yang X (2019) Spatial temporal balanced generative adversarial autoencoder for anomaly
detection. Int Conf Image Video Signal Process 1–7
26. Leyva R, Sanchez V, Li CT (2017) Video anomaly detection with compact feature sets for online
performance. IEEE Trans Image Proc 26(7):3463–3478
27. Li W, Mahadevan V, Vasconcelos N (2014) Anomaly detection and localization in crowded scenes.
IEEE Trans Pattern Anal Mach Intell 36(1):18–32

28. Liu P, Tao Y, Zhao W, Tang X (2017) Abnormal crowd motion detection using double sparse
representation. Neurocomputing 269:3–12
29. Lloyd K, Marshall D, Moore SC, Rosin PL (2017) Detecting violent and abnormal crowd activity using
temporal analysis of grey level co-occurrence matrix (GLCM) based texture measures. Mach Vis Appl
28:361–371
30. Lu C, Shi J, Wang W, Jia J (2018) Fast abnormal event detection. Int J Comput Vis 1–18
31. Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance
systems: a review. Expert Syst Appl 91:480–491
32. Mahadevan V, Li W, Bhalodia V, Vasconcelos N (2010) Anomaly detection in crowded scenes. Proc
IEEE Conf Comput Vis Pattern Recognit (CVPR) 1975–1981
33. Miraftabzadeh SA, Rad P, Choo KR, Jamshidi M (2018) A privacy-aware architecture at the edge
for autonomous real-time identity reidentification in crowds. IEEE Internet Things J 5(4):2936–2946.
https://doi.org/10.1109/JIOT.2017.2761801
34. Nikouei SY, Chen Y, Song S, Xu R, Choi BY, Faughnan T (2018) Real-time human detection as an edge
service enabled by a lightweight CNN. IEEE Int Conf Edge Comput 125–129
35. Nikouei SY, Chen Y, Song S, Xu R, Choi BY, Faughnan T (2018) Intelligent surveillance as an edge
network service: from harr-cascade, SVM to a lightweight CNN. arXiv:1805.00331
36. Quigley PA et al (2019) Outcomes of patient-engaged video surveillance on falls and other adverse
events. Clin Geriatr Med 35(2):253–263
37. Rabiee H, Mousavi H, Nabi M, Ravanbakhsh M (2017) Detection and localization of crowd behavior
using a novel tracklet-based model. Int J Mach Learn Cybern 1–12
38. Ravanbakhsh M, Nabi M, Mousavi H, Sangineto E, Sebe N (2018) Plug-and-play CNN for crowd motion
analysis: an application in abnormal event detection. IEEE Winter Conf Appl Comput Vis (WACV)
1689–1698
39. Rumelhart D, Hinton G, Williams R (1986) Learning representations by back-propagating errors. Nature
323:533–536. https://doi.org/10.1038/323533a0
40. Shi W, Cao J, Zhang Q, Li Y, Xu L (2016) Edge computing: vision and challenges. IEEE Internet Things
J 3(5):637–646
41. Shi X, Chen Z, Wang H, Yeung DY, Wong W, Woo W (2015) Convolutional LSTM network: a machine
learning approach for precipitation nowcasting. Int Conf Neural Inf Process Syst 802–810
42. Sun J, Wang X, Xiong N, Shao J (2018) Learning sparse representation with variational Auto-Encoder for anomaly detection. IEEE Access 6:33353–33361. https://doi.org/10.1109/ACCESS.2018.2848210
43. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. Int Conf
Neural Inf Process Syst 3104–3112
44. Szegedy C et al (2015) Going deeper with convolutions. IEEE Conf Comput Vis Pattern Recognit
(CVPR) 1–9
45. Tsakanikas V, Dagiuklas T (2017) Video surveillance systems-current status and future trends. Comput
Electr Eng. https://doi.org/10.1016/j.compeleceng.2017.11.011
46. Usman M, Jan MA, He X, Chen J (2019) A survey on big multimedia data processing and management
in smart cities. ACM Comput Surv 52(3):54:1–54:29
47. Wang J, Xu Z (2016) Spatio-temporal texture modelling for real-time crowd anomaly detection. Comput
Vis Image Underst 144:177–187
48. Wang J, Cherian A, Porikli F (2017) Ordered pooling of optical flow sequences for action recognition.
IEEE Winter Conf Appl Comput Vis (WACV) 168–176. https://doi.org/10.1109/WACV.2017.26
49. Wang L, Zhou F, Li Z, Zuo W, Tan H (2018) Abnormal event detection in videos using hybrid spatio-
temporal autoencoder. IEEE Int Conf Image Process 2276–2280
50. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–
1560
51. Xu R et al (2018) Real-time human objects tracking for smart surveillance at the edge. IEEE Int Conf
Commun 1–6. https://doi.org/10.1109/ICC.2018.8422970
52. Yang B, Cao J, Ni R, Zou L (2018) Anomaly detection in moving crowds through spatiotemporal
autoencoding and additional attention. Adv Multimed 2018:1–8
53. Yuan Y, Feng Y, Lu X (2017) Statistical hypothesis detector for abnormal event detection in crowded
scenes. IEEE Trans Cybern 47(11):3597–3608
54. Yuan Y, Feng Y, Lu X (2018) Structured dictionary learning for abnormal event detection in crowded
scenes. Pattern Recognit 73:99–110
55. Zhang T et al (2015) The design and implementation of a wireless video surveillance system. Annu Int
Conf Mob Comput Netw 426–438

56. Zhou F, Wang L, Li Z, Zuo W, Tan H (2019) Unsupervised learning approach for
abnormal event detection in surveillance video by hybrid autoencoder. Neural Process Lett.
https://doi.org/10.1007/s11063-019-10113-w
57. Zitouni MS, Sluzek A, Bhaskar H (2019) Visual analysis of socio-cognitive crowd behaviors for
surveillance: a survey and categorization of trends and methods. Eng Appl Artif Intell 82:294–312

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Michael George is currently a part-time research scholar undertaking research at the Division of Electronics, School of Engineering, CUSAT, India. He is working as an Assistant Professor in the Department of Electronics and Communication Engineering, Government Engineering College, Kottayam (RIT Kottayam), India. He received his bachelor's degree from the National Institute of Technology Calicut, India in 2010. He also holds a master's degree from Cochin University of Science and Technology, India, which he received in 2013. He has previously worked as a research fellow at the Naval Physical and Oceanographic Lab, India and as an Assistant Professor at the Albertian Institute of Science and Technology, India. His research interests are focused on image/video analysis for surveillance and security.

Babita Roslind Jose received her B.Tech degree in Electronics and Communication Engineering from Mahatma Gandhi University, Kerala, India in 1997 and her Master's degree in Digital Electronics from Karnataka University, India in 1999. She also holds an M.S. degree in System on Chip design from the Royal Institute of Technology (KTH), Stockholm, Sweden. She obtained her Ph.D. degree in the area of Wireless Communication from CUSAT in the year 2010. She is a recipient of the UK-India Staff Exchange programme Fellowship (UKIERI Award) in the year 2011. She has many funded research projects and has more than 80 publications in refereed international journals and conferences. Currently, she is serving as Associate Professor, Division of Electronics, School of Engineering, CUSAT. Her research interests are focused on the development of System on Chip architectures, multi-standard wireless transceivers, low-power design of sigma-delta modulators, image/video analysis, and machine learning architectures.

Jimson Mathew is currently an associate professor and head of the Computer Science and Engineering Department, Indian Institute of Technology (IIT) Patna, India. He is also an honorary visiting fellow at the Department of Computer Science and Engineering, University of Bristol, UK. He received his Master's in Computer Engineering from Nanyang Technological University, Singapore and his Ph.D. degree in computer engineering from the University of Bristol, Bristol, UK. Prior to this, he has worked with the Centre for Wireless Communications, National University of Singapore; Bell Laboratories Research, Lucent Technologies, North Ryde, Australia; the Royal Institute of Technology KTH, Stockholm, Sweden; and the Department of Computer Science, University of Bristol, UK. His research interests include fault-tolerant computing, computer arithmetic, hardware security, very large scale integration design and design automation, and the design of Network on Chip architectures. He has authored multiple books.
