ICCBI 2019 Paper 103
Sunita Suralkar¹, Smit Gangurde¹, Sanjeevkumar Chintakindi¹, and Haresh Chawla¹

¹ VES Institute of Technology, Chembur, Mumbai
[email protected], [email protected], [email protected], [email protected]
Abstract. The purpose of this system is to provide a powerful and intelligent surveillance tool to the police force, so as to reduce crime and help maintain peace in our country. Law enforcement agencies have been motivated to use video surveillance systems to monitor and curb threats, but manual monitoring is a tedious task, prone to human error. The core module of this system estimates the poses of the humans present in the video, while a backend understands the context of the scene as a whole. Many AI-powered surveillance systems are good at recognising violent or malicious activity but fail to understand the context as a whole. We aim to track the gradual change in human behaviour in a given scenario, estimate the confidence level of each recognised expression, and infer whether the scenario is truly violent or malicious. The ornithopter can follow a suspect, with direction offsets supplied by the server. The system differs from state-of-the-art surveillance systems in that it provides aerial surveillance covering a larger area, and since the drone is bird-shaped, it can navigate the area without being easily detected. As mentioned, recognition of truly violent or malicious activity is context-based.
Keywords: Ornithopter, deep learning, artificial intelligence, video analytics, human activity prediction.
1 Introduction
The rate of criminal activities by individuals and threats by terrorist groups has been on the rise in recent years.
The law enforcement agencies have been motivated to use video surveillance systems to monitor and curb these
threats. But this becomes a tedious task. Many automated video surveillance systems have been developed in the
past to monitor abandoned objects (bags), theft, violent activities, etc. Governments have recently deployed
drones in war zones to monitor hostile activities and smuggling, conduct border control operations, and detect
criminal activity in urban and rural areas. Most of these drones are piloted by one or more soldiers for long
durations, which makes them prone to mistakes due to human fatigue. We propose an autonomous bird
drone surveillance system capable of detecting individuals engaged in violent activities, criminal activities on
city streets, etc. This system uses the deformable parts model to estimate human poses which are then used to
identify suspicious individuals. This is an extremely challenging task as the images or videos recorded by the
drone can suffer from illumination changes, shadows, poor resolution, and blurring. Also, humans can appear at
different locations, orientations, and scales. In this project, we aim at making an ornithopter, a bird-shaped
drone, which will patrol a specified area and send video footage to our cloud. An artificial intelligence system
based on deep learning will analyze this footage to detect any suspicious activities.
Manually monitoring city streets through cameras is a tedious and monotonous task. There is a growing need
for automated video surveillance techniques, both for the army for border patrol and for the police for
patrolling the streets. Our aim is to produce an unmanned surveillance drone in the shape of a bird (an
ornithopter) that helps the military detect intruders near the border and helps the police detect crime. It
should not be easily detectable. The bird drone should be able to follow an individual and report its location
to the authorities. The bird drone streams video to a server running an AI system that detects individuals and
their violent activities.
2 Literature Survey
UAV by Microdrones: This system uses remote-controlled drones to surveil an area. The operator behind the
remote control analyses and infers the ongoing scenario from the live footage.
Paulo Vinicius Koerich Borges et al. [11] present different methods for understanding human behaviour from
video data and effectively summarize the pros and cons of the different models that can be used for this
purpose. The paper helps in defining human actions and interactions, understanding the environment, and
explaining the working of the Hidden Markov Model (HMM): how an HMM can be used for human behaviour
understanding, along with its advantages and limitations. It also discusses increasing efficiency by using
HMMs and SVMs together for accurate human behaviour understanding.
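The HMM approach surveyed above can be sketched with a toy example. The code below is an illustrative assumption, not a model from [11]: two hand-made discrete HMMs ("calm" and "agitated", emitting 0 = still, 1 = moving) score an observation sequence with the forward algorithm, and the higher-likelihood model labels the behaviour.

```python
# Hypothetical sketch of HMM-based behaviour classification.
# All probabilities below are made-up illustrations, not values from the paper.

def forward_prob(obs, start_p, trans_p, emit_p):
    """P(obs) under a discrete HMM, computed with the forward algorithm."""
    n_states = len(start_p)
    # alpha[s] = P(obs[0..t], state_t = s)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[sp] * trans_p[sp][s] for sp in range(n_states)) * emit_p[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)

# Two toy behaviour models; states are latent, observations are 0 = still, 1 = moving.
calm = dict(start_p=[0.9, 0.1],
            trans_p=[[0.9, 0.1], [0.5, 0.5]],
            emit_p=[[0.8, 0.2], [0.4, 0.6]])
agitated = dict(start_p=[0.2, 0.8],
                trans_p=[[0.5, 0.5], [0.1, 0.9]],
                emit_p=[[0.3, 0.7], [0.1, 0.9]])

seq = [1, 1, 0, 1, 1]  # a mostly-moving observation sequence
models = {"calm": calm, "agitated": agitated}
label = max(models, key=lambda name: forward_prob(seq, **models[name]))
```

Classification here picks the model under which the sequence is most likely, the usual generative-HMM recipe the survey describes.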
Cameras installed on traffic signal poles can detect accidents, but these systems have a large lag, are
generally inaccurate, and can be hacked by other entities. Many surveillance cameras require manual work to
monitor the incoming video. Existing video surveillance systems are not context-based. Governments are trying
to build automated systems that can detect threats and violent activity by themselves, without manual work.
Existing systems use conventional drones, which can easily be spotted from below by people, including the
culprit. The proposed system is based on a bird-shaped drone that can patrol an area without drawing
attention. From above, it can detect many humans and follow a culprit from behind. Existing systems do not
perform context-based human activity recognition; they only detect whether or not there is a threat. The
proposed system is fully context-based, so it can distinguish, for example, friends playfully roughhousing
from a genuine threat.
The proposed project aims to assist the police forces with powerful surveillance technology. We plan to assist
police forces against criminal activities such as assault and robbery, and to help society with rescue
operations in case of accidents. Manually analyzing CCTV camera footage is a tedious and monotonous task;
hence, automated video surveillance is an active field of research, and systems are being developed to detect
criminal activities using deep learning. Meanwhile, in the robotics field, animal behaviour and bodily
functions are constantly being mimicked, and ornithopters are continually being improved in size, weight, and
fidelity to real-life birds.
We can further leverage this technology in rescue operations by patrolling the affected area and designing a
rescue plan accordingly.
3 Methodology Used
The system consists of two entities: an ornithopter and a server running a deep learning based system.
3.1 Ornithopter
The ornithopter carries a Raspberry Pi, a camera, an accelerometer/gyroscope module, ultrasonic sensors, and
motors. The sensor data is used to adjust the flight of the ornithopter and for maneuvering. The video footage
is sent to the server, which uses deep learning on the cloud to analyze the video and understand the context
of the scene. A report is generated if any suspicious activity is encountered.
The ornithopter starts by getting the patrol area from the authorities. It then patrols the area and
continuously sends surveillance footage to the server. The deep learning system on the cloud analyses the
video and checks for any suspicious activity in the scene. The ornithopter will follow the activity if it
receives the corresponding command from the server.
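The behaviour just described can be sketched as a simple control loop. All names and the command format below are hypothetical; hardware and networking calls are passed in as stubs rather than real Raspberry Pi or server APIs.

```python
# Hypothetical mission loop: patrol waypoints, stream frames to the server,
# and switch to "follow" mode when the server commands it.

def run_mission(waypoints, get_frame, send_to_server, get_command):
    """Visit each patrol waypoint; switch to follow mode on a server command."""
    mode = "patrol"
    log = []
    for wp in waypoints:
        frame = get_frame()
        send_to_server(frame)      # continuous surveillance stream
        cmd = get_command()        # e.g. {"action": "follow", "offset": (dx, dy)}
        if cmd and cmd.get("action") == "follow":
            mode = "follow"
            log.append(("follow", cmd["offset"]))
        else:
            log.append(("patrol", wp))
    return mode, log
```

In a real deployment the stubs would wrap the camera capture, the video uplink, and the server's command channel.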
Steps:
1 Requirement Gathering:
This step involves listing all the required components on the Ornithopter and their functions.
2 Parameter Calculations:
Analyzing the total weight of the Ornithopter and calculating appropriate measurements for its body structure.
This includes calculating the wingspan, body length, speed, and wing-flapping frequency.
3 Designing of the Ornithopter structure:
This involves determining the placement of various components on the body of the Ornithopter. In this
step we also determine the gear mechanism for the flapping of wings.
4 Making the skeletal structure of the Ornithopter and basic programming:
In this step we prepare a basic skeletal structure of the ornithopter based on the previously deduced
parameters and place the components in this structure. Along with this, we also program and test basic
functionalities such as motor control, taking inputs from the accelerometer and gyroscope, and capturing
video footage from the camera.
5 Programming and testing:
After preparing the skeletal structure and testing the components, we program the desired functionalities.
This includes securely transferring the video footage to the server, automatic stabilization of the
ornithopter based on accelerometer and gyroscope readings, obstacle avoidance, and control of the motors for
given paths.
6 Integration and testing:
The final step involves integrating all the components and making the complete structure of the
Ornithopter. In the final structure, the drone will be covered and painted to look like a bird. This step also
involves testing the Ornithopter and correcting any errors.
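Step 5's accelerometer/gyroscope-based stabilization is typically built on sensor fusion. The block below is a hedged sketch (a generic complementary filter, not the authors' implementation): the gyroscope rate is integrated for fast response while the accelerometer angle corrects long-term drift.

```python
# Generic complementary filter for attitude estimation (illustrative only).
# gyro_rates: angular rate samples in deg/s; accel_angles: angle samples in
# degrees derived from the accelerometer; dt: sample period; alpha: how much
# to trust the integrated gyro versus the accelerometer.

def complementary_filter(gyro_rates, accel_angles, dt=0.01, alpha=0.98):
    """Fuse gyro rate (fast but drifting) with accel angle (noisy, drift-free)."""
    angle = accel_angles[0]
    history = []
    for rate, acc in zip(gyro_rates, accel_angles):
        angle = alpha * (angle + rate * dt) + (1 - alpha) * acc
        history.append(angle)
    return history
```

With a biased gyro and a trustworthy accelerometer, the estimate stays bounded instead of drifting, which is exactly the property a flapping-wing platform needs before driving its motors.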
Server:
The backbone of the server is the deep learning neural network. This network consists of an FPN
layer for human detection and ScatterNet hybrid layers for pose detection. An LSTM layer understands
the context of the scene and determines whether the scene as a whole is suspicious or not.
Steps:
Implementing the AI-based video surveillance system
A. Requirement Gathering :
Finalizing the libraries and frameworks that will help us implement our system.
B. Datasets :
Collecting, and if necessary making, our own datasets depicting different human activities for better
understanding by our system. The UCF-Crime dataset can be used for this purpose (URL:
http://crcv.ucf.edu/cchen/).
C. Cloud System :
Selecting a cloud system optimal for our solution, considering usage, cost, storage, processor, etc.
D. Implement AI system :
i. Human Detection :
Using deep convolutional neural networks (DCNNs), we try to identify humans, animals, and objects.
We feed the network different images containing humans, animals, and objects, and train it. This step involves
tuning the weights and biases of the network, choosing an optimal learning rate, dividing the datasets into
training and testing sets, and designing filters for the DCNNs.
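As a minimal illustration of the DCNN filters mentioned above (a generic example, not the paper's network), the block below applies a hand-made vertical-edge kernel with a plain "valid" 2-D cross-correlation, the operation each convolutional layer repeats with learned kernels.

```python
# "Valid" 2-D cross-correlation of an image (list of rows) with a kernel.
# In a trained DCNN the kernel weights are learned; here we hand-pick an
# edge detector purely for illustration.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(image[i + u][j + v] * kernel[u][v]
                            for u in range(kh) for v in range(kw))
    return out

edge_filter = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]   # responds strongly to vertical intensity edges
```

Stacking many such filters, with nonlinearities between layers, is what lets the network build up from edges to body parts to whole humans.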
ii. Single Human Pose Estimation :
Now, using Learning Feature Pyramid Networks as a filter in deep convolutional neural networks, we estimate
the human pose using keypoint-based non-maximum suppression (NMS), where a keypoint is a part of a person's
pose that is estimated, such as the nose, right ear, left knee, right foot, etc. We also tune and test the
network. Training data will be augmented by scaling, rotation, flipping, and adding color noise.
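Keypoint-based NMS can be sketched as follows. This is a simplified, hypothetical version of the idea above: among candidate detections of the same keypoint type, keep the highest-scoring one and suppress nearby duplicates within a fixed pixel radius.

```python
# Simplified keypoint NMS (illustrative; real systems often use pose-aware
# distances rather than a fixed pixel radius).

def keypoint_nms(candidates, radius=10.0):
    """candidates: list of (x, y, score); returns the kept candidates."""
    kept = []
    for x, y, s in sorted(candidates, key=lambda c: -c[2]):
        # keep this candidate only if no already-kept one lies within `radius`
        if all((x - kx) ** 2 + (y - ky) ** 2 > radius ** 2 for kx, ky, _ in kept):
            kept.append((x, y, s))
    return kept
```

Greedy suppression by score is the standard NMS pattern; the radius trades off merging duplicate detections against erasing genuinely close keypoints.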
iii. Multi Person Pose Estimation :
The disadvantage of single-person pose estimation is that if there are multiple people in an image,
keypoints from different people will likely be estimated as part of the same single pose, meaning,
for example, that person #1's left arm and person #2's right knee might be merged by the algorithm
as belonging to the same pose. If there is any likelihood that the input images will contain multiple
people, the multi-pose estimation algorithm should be used.
iv. Context based Human Activity Understanding :
To understand the true nature of a scenario, it is important to add context to the end result so as to avoid
false positives. For this we implement an LSTM-ScatterNet hybrid neural network. At the frontend we have our
ScatterNet hybrid network, implemented using PyTorch, which consists of a ScatterNet using dual-tree wavelet
transforms as filters and a regression network to precisely classify different actions; the result is then
forwarded to a Long Short-Term Memory (LSTM) network that predicts whether the scenario is truly violent or
not.
As mentioned earlier, this step also involves tuning the weights and biases of the network, choosing an
optimal learning rate, and dividing the datasets for training and testing. With the LSTM-ScatterNet hybrid
network, our model is capable of recognizing actions and deriving context from them. We generate optical flow
images, which are fed to a single-frame representation model that generates representations. Finally, our
backend LSTM network predicts the activities based on the generated representations.
E. Training :
Datasets will be divided in a 70:30 ratio for training and testing. Our system first uses the FPN network to
detect the humans [6], then the SHDL network for human pose estimation; the orientations of the limbs of the
estimated pose are then used to identify violent individuals.
Initially, random weights are assigned to these layers and a learning rate is given to the network.
The FPN network will be pre-trained on the different categories of the datasets.
The image regions detected by the FPN network are resized and normalized by subtracting the image region's
mean and dividing by its standard deviation. ScatterNet: the resultant image region is given as input to the
ScatterNet (SHDL frontend), which extracts invariant edge representations at L0, L1, and L2 using DTCWT
filters, where Li is the location of interest [1].
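The mean/standard-deviation normalization described above, sketched on a toy flattened "image region" (a generic implementation, not the paper's code):

```python
# Standardize a region's pixel values: subtract the region mean, divide by
# the region's (population) standard deviation.
import statistics

def normalize_region(pixels):
    mu = statistics.fmean(pixels)
    sigma = statistics.pstdev(pixels)
    return [(p - mu) / sigma for p in pixels]
```

After this step every region has zero mean and unit variance, so the downstream ScatterNet sees inputs on a comparable scale regardless of lighting.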
The regression network (SHDL backend), with convolutional layers, is trained on the concatenated ScatterNet
features (L0, L1, and L2) extracted from the image regions detected by the FPN network.
Then our backend LSTM network is trained on the input given by the SHDL network, which generates our human
pose representation, to predict activities based on previous activity (context), i.e., Step 4.
F. Testing
We generate test cases based on the test requirements; the remaining 30% of the dataset is used to test our
network. We plot our objective as a function of the epochs for training and testing data and look for any
underfitting or overfitting based on the training and testing error rates. We compare the results and look for
optimized hyperparameters of the network. We also test our neural network against artificial and trivial
activities to understand its decision boundaries and verify that the network predicts them accurately.
Test case | Input | Expected output | Actual output | Result
TC_1 | Violent video | Flag the video as a violent activity | Flagged some true violent videos as violent (67% accurate) | PASS
TC_2 | Non-violent video | Flag the video as a nonviolent activity | Flagged some true violent videos as violent (61% accurate) | PASS
TC_3 | Images containing humans, given to the human following module | Coordinates to move the drone | Left/right commands to move the drone | PASS
TC_4 | Brushless DC motor with Raspberry Pi 3 and LiPo battery | Continuous rotation of the motor with sufficient throttle | The throttle cannot be mapped yet | PARTIALLY PASS
TC_5 | Two servo motors for tail movement | Movement of the tail in the 3D plane | The tail is able to move up/down and rotate to the left and right | PASS
The training accuracy of our model has been raised to 100%, whereas the testing accuracy is 67%. The human
following module accurately gives commands to move left or right when following a human.
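The 70:30 dataset split used for training and testing above can be sketched as follows; the deterministic seed is an arbitrary choice for reproducibility, not a value from the paper.

```python
# Shuffle the samples with a fixed seed, then cut at 70% for training.
import random

def split_70_30(samples, seed=0):
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.7)
    return items[:cut], items[cut:]
```

Shuffling before cutting avoids ordering bias (e.g. all violent clips landing in one partition), while the fixed seed keeps every training run comparable.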
4 Implementation Details
This project consists of two separate entities working together, an ornithopter and a deep learning-based system
on the cloud. The ornithopter is a bird-shaped drone, which flies by wing flapping mechanism. It is equipped
with a camera. Real-time video footage from the camera is sent to the server. The deep learning-based system
on the server analyses this video footage to detect any suspicious activities. Once any such activity is detected,
concerned authorities are informed about it. While similar automated video surveillance drone projects have
been developed, our project aims at making a bird-shaped drone so that it is harder for an enemy or an
individual to detect.
Difficulties may include understanding the context of a given scenario, the autonomous nature of the ornithopter
and flight stabilization.
The basic structure of the ornithopter is first designed using AutoCAD [8]. The various components of the
ornithopter are selected and, after assessing their dimensions, placed on the structure once the laser-cut
structure is available. Basic programming will be done to stabilize the bird during flight. Object detection
is performed at a later stage to avoid obstacles. If the server detects a violent or suspicious activity, it
will send commands to the ornithopter to follow the individuals.
The server, powered by artificial intelligence, detects violent individuals and sends commands to follow the
culprits. The server is a complete neural network pipeline with a frontend consisting of a feature pyramid
network for object detection (humans, animals, weapons, cars, etc.). The segmented image is forwarded to the
ScatterNet hybrid neural network to extract the human pose estimate, which is then forwarded to the regression
network to classify the type of action. The backend consists of a bidirectional Long Short-Term Memory (LSTM)
network that takes certain scenes under guidance; it is an unsupervised LSTM model that tries to understand
the scenario based on its training. It is used to derive the context of the ongoing scenario and determine
whether the scenario is actually violent or not. After understanding the context, the server sends live
commands to the drone to follow the suspects, giving the x, y, and z offsets. A report is generated containing
the type of activity, the location, and the number of individuals. The report is sent to the concerned
authority; when the authority shows up at the location, we have to recognise that the authority has arrived
based on its location, and then the tracking ends. The authority's location is transferred to the server,
which checks whether the authority has reached the location and sends the command to stop the drone from
tracking.
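The x/y offsets the server sends the drone can be sketched as the displacement of the tracked person's bounding-box centre from the frame centre, normalized to [-1, 1] for the flight controller. This is a hypothetical formulation; the exact offset computation is not specified here.

```python
# Illustrative follow-mode offsets: how far the target's bounding-box centre
# sits from the frame centre, as a fraction of the half-frame size.

def follow_offsets(bbox, frame_w, frame_h):
    """bbox = (x1, y1, x2, y2) in pixels; returns (dx, dy) in [-1, 1]."""
    cx = (bbox[0] + bbox[2]) / 2.0
    cy = (bbox[1] + bbox[3]) / 2.0
    dx = (cx - frame_w / 2.0) / (frame_w / 2.0)
    dy = (cy - frame_h / 2.0) / (frame_h / 2.0)
    return dx, dy
```

A positive dx would steer the drone right, a positive dy down (in image coordinates), keeping the suspect centred in the camera view.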
Example Scenarios:
Police Use:
We can have our bird drone surveil an individual and track them, surveil a region for any malicious incidents
or road accidents and report them to hospitals or the police force, track the culprits, transfer video data as
evidence, etc. It will be a proper video surveillance system that detects these activities, providing great
help to the police force.
Future Scope:
Future scope may include analyzing disaster areas and designing rescue paths accordingly.
5 Conclusion
This project aims at producing an ornithopter, a bird-shaped drone, and a server system based on deep learning
to provide automated video surveillance. The ornithopter will autonomously patrol the civilian areas, while the
server system will detect any suspicious activities by analyzing the video footage from the ornithopter. The deep
learning system will be implemented on the cloud. A crucial aspect of this project is to develop a context-
sensitive video surveillance system. This can reduce the false positives that existing systems are prone to. A
bird-shaped drone is chosen so as to remain hidden from, or undetected by, criminals. The project is chosen
with the intention of reducing manual work and human error for police forces, preventing surprise attacks in
the city, and reducing the crime rate in cities.
References
1 Amarjot Singh, Devendra Patil, SN Omkar, “Eye in the Sky: Real-time Drone Surveillance System (DSS) for Violent
Individuals Identification using ScatterNet Hybrid Deep Learning Network”, To Appear in the IEEE Computer Vision
and Pattern Recognition (CVPR) Workshops, 2018
2 Joon Hyuk Park, Kwang-Joon Yoon, “Designing a Biomimetic Ornithopter Capable of Sustained and Controlled
Flight”, Journal of Bionic Engineering 5 (2008) 39−47
3 Deepika Singh, Erinc Merdivan, Ismini Psychoula, Johannes Kropf, Sten Hanke, Matthieu Geist, Andreas Holzinger,
“Human Activity Recognition using Recurrent Neural Networks”, In International Cross-Domain Conference for
Machine Learning and Knowledge Extraction: CD-MAKE, 2017
4 Xin Li, Mooi Choo Chuah, “ReHAR: Robust and Efficient Human Activity Recognition”, arXiv:1802.09745
5 M. Shamim Hossain, Ghulam Muhammad, Wadood Abdul, Biao Song, B.B.Gupta, “Cloud-assisted secure video
transmission and sharing framework for smart cities”, In Future Generation Computer Systems, Volume 83 Issue C,
June 2018
6 Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, Xiaogang Wang, “Learning Feature Pyramids for Human Pose
Estimation”, In IEEE International Conference on Computer Vision, 2017
7 Yilun Chen, Zhicheng Wang, Yuxiang Peng, Gang Yu, Jian Sun, “Cascaded Pyramid Network for Multi-Person Pose
Estimation”, arXiv:1711.07319v1, 20 Nov 2017
8 Zachary John Jackowski, “Design and Construction of Autonomous ornithopter”, Massachusetts Institute of
Technology, June 2009
9 L. Sifre and S. Mallat., “Rotation, scaling and deformation invariant scattering for texture discrimination”, In
Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference
10 Amin Ullah, Jamil Ahmad, Khan Muhammad, Muhammad Sajjad, Sung Wook Baik, “Action Recognition in Video
Sequences using Deep Bi-directional LSTM with CNN Features”, Intelligent Media Laboratory, College of Software
and Convergence Technology, Sejong University, Seoul, Republic of Korea.
11 Paulo Vinicius Koerich Borges, Nicola Conci, Andrea Cavallaro, “Video-Based Human Behavior Understanding: A
Survey”, In IEEE Transactions on Circuits and Systems for Video Technology, 2013.
12 Neil Robertson, Ian Reid, “Behaviour understanding in video: a combined method”, In IEEE International
Conference on Computer Vision (ICCV), 2005.
13 Fan, Yin, et al., “Video-based emotion recognition using CNN-RNN and C3D hybrid networks”, In Proceedings
of the 18th ACM International Conference on Multimodal Interaction, ACM, 2016.