VISION BASED DETECTION AND ANALYSIS OF HUMAN ACTIVITIES
By
RAVIPATI ABHIRAM (Reg.No - 39110840)
KONDAMURI RAKESH KRISHNA (Reg.No - 39110521)
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC | 12B Status by UGC | Approved by AICTE
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI - 600119
APRIL - 2023
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of RAVIPATI
ABHIRAM (Reg.No - 39110840) and KONDAMURI RAKESH KRISHNA (Reg.No
- 39110521) who carried out the Project Phase-2 entitled “VISION BASED
DETECTION AND ANALYSIS OF HUMAN ACTIVITIES” under my supervision
from January 2023 to April 2023.
Internal Guide
Dr. A. MARY POSONIA M.E., Ph.D.
DECLARATION
DATE: 20/4/2023
ACKNOWLEDGEMENT
I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr. A. Mary Posonia M.E., Ph.D., whose valuable guidance, suggestions, and
constant encouragement paved the way for the successful completion of my Phase-2
project work.
I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many
ways for the completion of the project.
ABSTRACT
Human activity recognition (HAR) is an important research area in the field of
computer vision and artificial intelligence. HAR involves identifying and classifying
the activities that a person is performing based on sensor data or video data. HAR
has numerous applications in various fields, including healthcare, security, and
sports analysis. In recent years, deep learning models have shown great promise in
HAR tasks. Two popular deep learning models for HAR are Convolutional LSTM
(ConvLSTM) and Long-term Recurrent Convolutional Networks (LRCN).
ConvLSTM models are an extension of the LSTM model that includes convolutional
layers in the recurrent structure. ConvLSTM models are particularly useful for
activity recognition tasks since they can capture the temporal dependencies
between frames in video data. LRCN models combine a convolutional neural
network (CNN) and an LSTM to classify videos. The CNN component extracts
features from each frame in the video, while the LSTM component processes the
temporal sequence of features to classify the activity. The success of HAR models
is heavily dependent on the quality and quantity of data. To train the models, a
dataset of labeled videos is required. The dataset is divided into training, validation,
and test sets. The models are trained on the training set, and the validation set is
used for hyperparameter tuning and model selection. The performance of the
models is evaluated on the test set. HAR has numerous applications in healthcare,
including monitoring the daily activities of elderly people or patients with chronic
conditions. The system can alert caregivers if there is a deviation from normal
activity patterns, indicating potential health issues or falls. In the field of security,
HAR can be used for surveillance in public areas, such as airports or shopping malls,
to detect suspicious behavior or identify individuals engaged in criminal activities. In
sports analysis, HAR can be used to analyze the performance of athletes during
training or competitions. The system can provide feedback on their technique and
suggest improvements based on their activity patterns. Human activity recognition
using ConvLSTM and LRCN models is a promising technology with numerous
applications. The models are capable of accurately recognizing human activities in
video data by capturing the temporal dependencies between frames. As the field of
HAR continues to evolve, the models are likely to become more accurate and
efficient, further expanding their potential applications.
TABLE OF CONTENTS

CHAPTER NO.  TITLE                                              PAGE NO.

             ABSTRACT                                           v
             LIST OF FIGURES                                    ix
             LIST OF ABBREVIATIONS                              xi
1            INTRODUCTION                                       1
3.3.2 System Use Case Scenario 20
4 DESCRIPTION OF PROPOSED SYSTEM 21
5.2 Algorithms 32
5.3 Testing 33
6 OUTCOMES AND DISCUSSIONS 34
6.1 Introduction 34
6.2 IEEE Standards Followed in the Project 35
6.3 Constraints 35
6.4 Tradeoff in the Project 36
7 RESULTS AND DISCUSSIONS 38
8.1 Conclusion 45
REFERENCES 48
APPENDIX 51
A. SOURCE CODE 51
B. SCREENSHOTS 59
C. RESEARCH PAPER 69
LIST OF FIGURES

FIGURE NO.   FIGURE NAME                                        PAGE NO.
LIST OF ABBREVIATIONS
ABBREVIATION EXPANSION
CHAPTER 1
INTRODUCTION
Human activity recognition (HAR) is an emerging research field that has gained
increasing attention in recent years due to its numerous applications in various
fields, including healthcare, security, sports analysis, and robotics. HAR involves
identifying and classifying the activities that a person is performing based on sensor
data or video data. HAR has become increasingly important with the growing
interest in personalized healthcare, aging in place, and smart homes. HAR can be
broadly categorized into two types: sensor-based HAR and vision-based HAR.
Sensor-based HAR involves using wearable sensors to capture motion and
physiological data to identify activities. Vision-based HAR involves analyzing video
data to identify activities. Vision-based HAR has received considerable attention due
to the widespread availability of cameras and the rapid development of computer
vision algorithms.
One of the main challenges in HAR is dealing with the variability and complexity of
human activities. Human activities can vary significantly in terms of their duration,
frequency, and context. Moreover, the same activity can be performed in different
ways by different people. Another challenge is dealing with the noisy and incomplete
sensor data or video data, which can lead to inaccurate activity recognition results.
In recent years, deep learning models have shown great promise in HAR tasks. Two
popular deep learning models for HAR are Convolutional LSTM (ConvLSTM) and
Long-term Recurrent Convolutional Networks (LRCN). ConvLSTM models are an
extension of the LSTM model that includes convolutional layers in the recurrent
structure. ConvLSTM models are particularly useful for activity recognition tasks
since they can capture the temporal dependencies between frames in video data.
LRCN models combine a convolutional neural network (CNN) and an LSTM to
classify videos. The CNN component extracts features from each frame in the video,
while the LSTM component processes the temporal sequence of features to classify
the activity.
1.1 BACKGROUND AND MOTIVATION
Human activity recognition using LSTM and CNN is a popular research area in the
field of deep learning. LSTM and CNN are two deep learning architectures that can
be used together to recognize human activities from sensor data. LSTM is used to
recognize time-sequential features, while CNN is used to extract features from
signals. This technology has many potential applications, such as in healthcare,
sports, and security. The use of deep learning models for human activity recognition
has shown great progress in recent years. The combination of LSTM and CNN has
been found to be effective in recognizing human activities with high accuracy. This
technology has the potential to improve the quality of life for people by providing
personalized healthcare and fitness monitoring.
The problem statement for human activity recognition using LSTM and CNN is to
accurately recognize and classify human activities from sensor data. The traditional
pattern recognition methods have limitations in recognizing complex human
activities. Therefore, deep learning models such as LSTM and CNN are used to
overcome these limitations. The design of the architecture, and the choice of an
appropriate language for an effective architecture that can extract features from
sensor data and recognize time-sequential features with high accuracy, follow
IEEE Std. 1016-1998 [22], Recommended Practice for Software Design
Descriptions. The use of smartphone sensor data for human activity recognition is
a popular research area. The goal is to develop a system that can recognize
different forms of human activities in real-time with high accuracy. The system
should be able to recognize activities such as walking, running, sitting, standing,
and other complex activities.
Objective
The objective of human activity recognition using LSTM and CNN is to develop an
accurate and efficient system that can recognize and classify human activities from
sensor data. The system should be able to recognize different forms of human
activities in real-time with high accuracy. The use of deep learning models such as
LSTM and CNN can help to overcome the limitations of traditional pattern
recognition methods. The system should be able to recognize activities such as
walking, running, sitting, standing, and other complex activities. The goal is to
develop a system that can be used in various applications such as healthcare,
sports, and security. The system should be able to provide personalized healthcare
and fitness monitoring. The use of smartphone sensor data for human activity
recognition is a popular research area.
Scope
The scope of human activity recognition using LSTM and CNN is broad and has
many potential applications in various fields. In healthcare, it can be used to monitor
patients' activities and provide personalized healthcare. It can also be used in sports
to monitor athletes' activities and improve their performance. The system can be
used in security to detect suspicious activities and prevent crimes. The use of
smartphone sensor data for human activity recognition has made it possible to
develop low-cost and portable systems that can be used in various applications. The
combination of LSTM and CNN has shown promising results in recognizing human
activities with high accuracy. The system can recognize activities such as walking,
running, sitting, standing, and other complex activities. The use of deep learning
models for human activity recognition has shown great progress in recent years.
In healthcare, HAR can be used to monitor the daily activities of elderly people or
patients with chronic conditions. The system can alert caregivers if there is a
deviation from normal activity patterns, indicating potential health issues or falls.
HAR can also be used to monitor the physical activity levels of patients with chronic
conditions, such as diabetes or cardiovascular disease, to assess their adherence
to prescribed exercise regimens.
In security, HAR can be used for surveillance in public areas, such as airports or
shopping malls, to detect suspicious behavior or identify individuals engaged in
criminal activities. HAR can also be used for crowd monitoring and control during
large events, such as concerts or sporting events.
In sports analysis, HAR can be used to analyze the performance of athletes during
training or competitions. The system can provide feedback on their technique and
suggest improvements based on their activity patterns. HAR can also be used for
injury prevention by monitoring the movement patterns of athletes and identifying
potential sources of strain or injury.
In robotics, HAR can be used to enable robots to interact with humans more
effectively by understanding their activities and intentions. HAR can also be used to
enable robots to perform household tasks, such as cleaning or cooking, by
recognizing and responding to human activities.
CHAPTER 2
LITERATURE SURVEY
Several studies on human activity recognition have been conducted in recent
years. This section summarizes the previous work on human activity recognition.
Usharani J et al. [1] came up with an idea for a human activity recognition system
based on the Android platform. They created an application using the accelerometer
data for classification, which supported online training and classification. They used
the clustered k-NN approach to enhance the performance, accuracy, and execution
time of the k-NN classifier with limited resources on the Android platform. They also
concluded that the classification times were also dependent on the device models
and capabilities.
In Davide Anguita et al.’s [2] paper, they introduced the improvised Support Vector
Machine algorithm, which works with fixed point arithmetic to produce an energy-
efficient model for the classification of human activities using a smartphone. They
aimed to use the presented novel technology for various intelligence applications
and smart environments for faster processing with the least possible use of system
resources to save the consumption of energy along with maintaining comparable
results with other generally used classification techniques.
The paper titled "Service Direct: Platform that Incorporates Service Providers and
Consumers Directly" by Mary Posonia A et al. [3] published in the International
Journal of Engineering and Advanced Technology (IJEAT) in 2019, discusses a
platform that enables direct interaction between service providers and consumers.
It begins by highlighting the challenges faced by consumers in finding service
providers and the difficulties faced by service providers in reaching their target
audience. The authors propose a solution in the form of a platform called "Service
Direct" that directly connects service providers and consumers. The proposed
platform allows consumers to search for service providers based on their
requirements, such as location and service type, and directly interact with the
service providers without any intermediaries. The platform also offers service
providers the ability to create their profiles, list their services, and interact with
potential consumers directly.
Meysam Vakili et al. [5] proposed a real-time HAR model for online prediction of
human physical movements based on the smartphone inertial sensors. A total of 20
different activities were selected, and six incremental learning algorithms were used
to check the performance of the system, then all of them were also compared with
the state-of-the-art HAR algorithms such as Decision Trees (DTs), AdaBoost, etc.
Incremental k-NN and Incremental Naive Bayesian have given the best accuracy of
95%.
In Jirapond Muangprathub et al.’s [6] paper, they introduced a novel elderly person
tracking system using a machine learning algorithm. In this work, they used the k-
NN model with a k value of 5, which was able to achieve the best accuracy of
96.40% in detecting the real-time activity of elderly people. Furthermore, they
created a system that displays information in a spatial format for an elderly person,
and in case of an emergency, they can use a messaging device to request any help.
The paper titled "A Joint Optimization Approach for Security and Insurance
Management on the Cloud" by Joshila Grace L.K et al.[7] published in Lecture Notes
in Networks and Systems in 2021, proposes a joint optimization approach for
security and insurance management on cloud computing platforms. The paper
begins by highlighting the need for robust security measures and insurance policies
in cloud computing to protect against potential threats such as data breaches and
cyber-attacks. The authors propose a solution that integrates both security and
insurance management into a single optimization problem. The proposed joint
optimization approach involves modeling the security and insurance management
as a two-stage stochastic optimization problem. In the first stage, the authors
optimize the security measures to minimize the risk of potential threats, while in the
second stage, they optimize the insurance policies to minimize the potential losses
from these threats.
Baoding Zhou et al. [8] proposed a CNN for indoor human activity recognition. A
total of nine different activities were recognized based on accelerometers,
magnetometers, gyroscopes, and barometers collected by smartphones. The
proposed method was able to achieve an excellent accuracy of 98%.
Abdulmajid Murad et al. [9] proposed a deep LSTM network for recognizing six
different activities based on smartphone data. The network was able to achieve an
accuracy of 96.70% on the UCI-HAD dataset.
The paper titled "Feature Representation and Data Augmentation for Human Activity
Classification Based on Wearable IMU Sensor Data Using a Deep LSTM Neural
Network" by S. O. Eyobu and D. S. Han [10], published in Sensors in 2018 proposes
a deep learning-based approach for human activity recognition using wearable IMU
sensors. The authors focus on the problem of feature representation and data
augmentation for improving the performance of the deep LSTM neural network
model in classifying human activities.
The paper by Rueda et al. [11] proposes a system for human activity recognition
using convolutional neural networks (CNNs) with body-worn sensors. The system
consists of two stages: feature extraction and activity classification. CNNs are used
for both stages, allowing the system to automatically learn discriminative features
and classify activities. The authors evaluate their system on a public dataset
containing sensor data from daily activities of 30 subjects. The proposed system
achieves high accuracy rates of up to 98% for recognizing activities such as walking,
standing, sitting, and lying down.
The paper by Kim et al. [12] (2019) presents a deep learning-based approach for
recognizing the accompanying status of smartphone users using multimodal data.
The authors proposed a system that combines accelerometer data, Bluetooth, and
Wi-Fi signals to detect the accompanying status of users, which includes walking
alone, walking with others, and not walking. The proposed system was evaluated
using a dataset collected from 50 participants over a period of 5 days. The results
showed that the system achieved an accuracy of 94.2% in detecting the
accompanying status of users.
The paper titled "An Efficient Algorithm for Traffic Congestion Control" by Mary
Posonia A. et al. [13] proposes an algorithm for controlling traffic congestion. The
algorithm is designed to be efficient and effective in reducing traffic congestion in
urban areas. The study aims to address the increasing problem of traffic congestion,
which results in delays, increased fuel consumption, and increased pollution. The
proposed algorithm is based on a dynamic traffic control model that takes into
account real-time traffic flow data, road network topology, and the demand for
transportation services. The algorithm is designed to optimize traffic flow and reduce
congestion by adjusting traffic signal timings in real time.
The paper by F. Chollet [14] titled "Layer wrappers" describes the use of layer
wrappers in Keras, a popular deep learning framework. Layer wrappers are used to
modify the behavior of a layer in a neural network, such as adding regularization or
modifying the input/output shapes. The "TimeDistributed" layer wrapper is
particularly useful for handling sequences of data, where the same layer is applied
to each time step of the sequence. This wrapper allows for efficient processing of
temporal data in neural networks, such as in natural language processing or video
analysis. The paper provides examples of how to use layer wrappers in Keras and
discusses their practical applications in deep learning.
Unsupervised and self-supervised approaches [16], [17] have also been explored
for learning human activities from long-term videos; these methods segment
videos into temporal regions, and extract features that are discriminative for different
human activities.
After going through the previous works, we inferred some points. The human
activity recognition system based on the Android platform used a clustered k-NN
approach to enhance performance, accuracy, and execution time, but its
classification times depended on the device models and capabilities. The real-time
HAR model based on smartphone inertial sensors compared six incremental
learning algorithms, but it still relied on inertial sensors. The novel elderly person
tracking system used a k-NN model with a k value of 5, which achieved the best
accuracy of 96.40% in detecting the real-time activity of elderly people. Davide
Anguita et al. introduced an improvised Support Vector Machine algorithm that
works with fixed-point arithmetic to produce an energy-efficient model for
classifying human activities using a smartphone. One proposed model achieved a
recall of 99%, which was compared against existing deep learning models such as
the RNN, Convolutional Neural Network (CNN), and Deep Belief Network (DBN). A
deep LSTM network recognizing six different activities from smartphone data
achieved an accuracy of 96.70% on the UCI-HAD dataset.
While sensor-based human activity recognition (HAR) has shown great promise in
recent years, there are still several open problems in the existing system that need
to be addressed. These include the quality and diversity of collected data,
variability in sensor placement, the computational efficiency and interpretability of
the models, and user privacy: ensuring user privacy and data security is essential
for the widespread adoption of sensor-based HAR systems.
Addressing these open problems is critical for the widespread adoption of sensor-
based HAR systems. New research is needed to develop solutions to these
challenges, including improved data collection methods, standardized sensor
placement protocols, and more efficient and interpretable models. By overcoming
these challenges, sensor-based HAR has the potential to revolutionize healthcare,
security, and other industries.
CHAPTER 3
REQUIREMENT ANALYSIS
All systems are feasible when provided with unlimited resources and infinite time.
Unfortunately, this condition does not prevail in the practical world, so it is both
necessary and prudent to evaluate the feasibility of the system at the earliest
possible time. IEEE 12207.2-1997 [25], Industry Implementation of International
Standard ISO/IEC 12207:1995 (ISO/IEC 12207), Standard for Information
Technology – Software Life Cycle Processes – Implementation Considerations,
helps to provide an effective framework and method to develop software
applications. It helps to produce software with the highest quality and lowest cost
in the shortest time. Months or years of effort, thousands of rupees, and untold
professional embarrassment can be averted if an ill-conceived system is
recognized early in the definition phase. Feasibility and risk analysis are related in
many ways: if project risk is great, the feasibility of producing quality software is
reduced. The three key considerations involved in the feasibility analysis are:
Economical Feasibility
Technical Feasibility
Social Feasibility
3.1.2 Technical Feasibility
Availability of data - The availability of labeled data covering a diverse range
of human activities and environmental conditions is essential for the success
of the CNN-LSTM model.
Model architecture - The design of the CNN-LSTM model plays a critical
role in the performance of activity recognition. Hyperparameter tuning and
feature selection are also important factors to consider.
3.2.3 Python
Python is a scripting language that is high-level, interpreted, interactive,
and object-oriented. Python is intended to be extremely readable. It commonly
employs English terms rather than punctuation, and it has fewer syntactical
structures than other languages.
Python Features
Easy-to-read - Python code is more clearly defined and visible to the eyes.
Scalable - Python provides a better structure and support for large programs
than shell scripting.
3.2.4 Deep Learning
Deep learning is a type of machine learning that uses algorithms meant to function
in a manner similar to the human brain. While the original goal for AI was broadly to
make machines able to do things that would otherwise require human intelligence,
the idea has been refined in the decades since. François Chollet, AI researcher at
Google and creator of the machine learning software library Keras, says:
“Intelligence is not a skill in itself; it’s not about what you can do, but how well and
how efficiently you can learn new things.”
3.2.6 Long Short-Term Memory
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network
capable of learning order dependence in sequence prediction problems. This is a
behavior required in complex problem domains like machine translation, speech
recognition, and more. LSTMs are a complex area of deep learning. It can be hard
to grasp what LSTMs are, and how terms like bidirectional and sequence-to-sequence
relate to the field. In this section, you will get insight into LSTMs using the words of
the research scientists who developed the methods and applied them to new and
important problems; there are few who are better at clearly and precisely articulating
both the promise of LSTMs and how they work than the experts who developed them.
3.2.7 Convolutional Neural Network
Convolutional neural networks work directly on raw images and do not need any
preprocessing. A convolutional neural network is a feed-forward neural network,
seldom with more than 20 layers. The strength of a convolutional neural network
comes from a particular kind of layer called the convolutional layer. A CNN contains
many convolutional layers assembled on top of each other, each one capable of
recognizing more sophisticated shapes. With three or four convolutional layers it is
viable to recognize handwritten digits, and with 25 layers it is possible to differentiate
human faces. The goal of this field is to enable machines to view the world as
humans do, perceive it in a similar fashion, and even use the knowledge for a
multitude of tasks such as image and video recognition, image inspection and
classification, media recreation, recommendation systems, natural language
processing, etc.
3.2.8 ConvLSTM
ConvLSTM is a type of recurrent neural network (RNN) that combines the
convolutional neural network (CNN) and LSTM (long short-term memory)
architectures. It is a powerful model that can be used for various tasks, including
video prediction, image processing, and natural language processing. The
ConvLSTM network is designed to learn spatial and temporal dependencies in
sequential data. It consists of convolutional layers, which extract features from the
input data, and LSTM layers, which capture the temporal dependencies in the data.
The combination of these layers enables the ConvLSTM network to learn both
spatial and temporal patterns in the input data.
The ConvLSTM architecture is similar to the standard LSTM architecture, with the
addition of convolutional layers before the LSTM layers. The convolutional layers
help in reducing the spatial dimensions of the input data and extracting relevant
features. The output of the convolutional layers is then passed to the LSTM layers,
which capture the temporal dependencies in the data. The ConvLSTM architecture
has shown significant improvements in various tasks such as video prediction,
where it can learn long-term dependencies in sequential data. It is also useful in
image processing tasks, where it can capture the spatial dependencies in the data.
3.2.9 LRCN
LRCN (Long-term Recurrent Convolutional Network) models combine convolutional
layers, which extract spatial features from each frame, with LSTM layers, which
model the temporal sequence of those features. Overall, LRCN models have been
shown to be effective in various tasks such as video analysis, image captioning,
and speech recognition, where they can learn both spatial and temporal patterns
in the data.
3.2.10 Tensorflow
TensorFlow is an open-source library for numerical computation and machine
learning developed by Google. It is designed to simplify the process of building,
training, and deploying machine learning models by providing a high-level API for
building neural networks, as well as low-level APIs for more advanced users.
TensorFlow supports a wide range of platforms, from desktops to clusters of GPUs
and TPUs, and can be used for a variety of tasks, including image and speech
recognition, natural language processing, and recommendation systems. It is widely
used in industry and academia for research and production applications.
3.2.11 Keras
Keras is an open-source, high-level neural networks API written in Python. It is
designed to enable fast experimentation with deep neural networks by providing a
simple, consistent interface for building and training models. Keras runs on top of
backend engines such as TensorFlow, delegating the low-level tensor
computations to the backend while exposing user-friendly building blocks such as
layers, models, optimizers, and callbacks. It is widely used in industry and
academia for both research and production applications.
Advantages of Keras
Keras has several advantages that make it a popular choice for deep learning.
Simplicity - Keras is very easy and simple to use. It is a user-friendly API
with easy-to-learn and code features.
Backend support - Keras does not operate with low-level computations. So,
it supports the use of backends.
Pre-trained models - Keras provides numerous pre-trained models.
Fast experimentation - Keras is built to simplify the tasks of users.
Great community and quality documentation - Keras has a large,
supportive community and well-maintained documentation.
3.3 SYSTEM USE CASE
The system use case scenario is documented by following IEEE Std. 1016-1998
[22], Recommended Practice for Software Design Descriptions.
CHAPTER 4
DESCRIPTION OF PROPOSED SYSTEM
After analyzing the drawbacks of previous works, we proposed a new system. The
proposed system detects human activity without the use of additional sensors. It
only depends on the camera feed. In this proposed system, data of human activity
is taken over a fixed time interval for determining the activity and is fed to the model
for detection. The output shows the activity being performed by the human.
Advantages of Proposed System
No initial setup is required before implementation. In the existing system, sensors
are used to detect human activity; here we do not need any sensors, so the sensor
cost is eliminated and we need not depend on sensors. It is also easy to enhance
the model and add activities to it without any additional hardware requirements.
4.1.1 Data Collection and Pre-processing
The first step in HAR using ConvLSTM and LRCN models is to collect the data from
the sensors. The sensors used in the system typically include accelerometers,
gyroscopes, and magnetometers. The data collected from these sensors is typically
noisy and requires pre-processing to remove noise and artifacts. The data is then
segmented into windows of fixed length, typically ranging from 1-10 seconds. The
length of the window depends on the activity being recognized and the sampling
frequency of the sensor.
Once the data has been pre-processed and segmented, the next step is to extract
features from the sensor data. ConvLSTM and LRCN models are capable of
learning the features from the data directly. However, feature extraction can help
improve the accuracy of the models. There are various feature extraction techniques
that can be used, including statistical features such as mean, standard deviation,
and variance, and time-domain features such as zero-crossing rate and energy. The
extracted features are then used as input to the ConvLSTM and LRCN models.
ConvLSTM and LRCN models are two types of deep learning models that have
been shown to be effective in HAR. ConvLSTM is a combination of Convolutional
Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. The
CNNs are used to extract features from the input data, while the LSTM networks are
used to capture the temporal dependencies in the data. LRCN is a combination of
CNNs and Recurrent Neural Networks (RNNs), where the CNNs are used to extract
features and the RNNs are used to capture the temporal dependencies.
4.1.4 Training and Testing
Once the models have been developed, the next step is to train them on the labelled
dataset of physical activities. The development products of each activity confirm
the requirements of that activity, and the software is validated for its intended use
and user needs in accordance with IEEE Std. 1012-1998 [21], IEEE Standard for
Software Verification and Validation. The labelled dataset typically includes a set
of physical activities such as
walking, running, and cycling. The dataset is divided into training and testing sets,
with a larger portion of the dataset used for training the models. The models are
trained using backpropagation and stochastic gradient descent. Once the models
have been trained, they are tested on the testing set to evaluate their accuracy.
The accuracy of the models is evaluated using various evaluation metrics such as
accuracy, precision, recall, and F1-score. The accuracy of the models is the
proportion of correctly classified physical activities. The precision is the ratio of
correctly classified physical activities to the total number of physical activities
classified as positive. The recall is the ratio of correctly classified physical activities
to the total number of actual positive physical activities. The F1-score is the
harmonic mean of precision and recall.
The architecture of a human activity recognition system using CNN and LSTM with
ConvLSTM and LRCN models consists of several layers that perform feature
extraction, classification, and prediction tasks.
The input to the system is a time series of data captured by a camera. The raw data
is first pre-processed to remove noise, filter out unwanted frequencies, and
normalize the data. The pre-processed data is then fed into the feature extraction
layer, which uses a combination of convolutional and pooling layers to extract
meaningful features from the input data.
The output of the feature extraction layer is a sequence of feature maps, which are
then fed into the classification layer. The classification layer consists of one or more
LSTM layers, which learn to model the temporal dependencies in the input data and
perform classification of the activities based on the extracted features.
In the ConvLSTM model, the input data is fed into a ConvLSTM layer, which
combines the functionality of convolutional and LSTM layers. The ConvLSTM layer
learns to capture both spatial and temporal features in the input data and performs
classification based on these features.
In the LRCN model, the input data is fed into a CNN layer, which extracts spatial
features from the input data. The output of the CNN layer is then fed into an LSTM
layer, which learns to model the temporal dependencies in the input data and
performs classification based on the extracted features.
The output of the classification layer is a probability distribution over the set of
possible activities, which is then used to predict the most likely activity at each time
step.
The architecture of the human activity recognition system using CNN and LSTM
with ConvLSTM and LRCN models is designed to be scalable and adaptable to
different types of sensor data and activity recognition tasks. The system can be
trained using supervised learning techniques, such as backpropagation and
gradient descent, to optimize the model parameters and improve the classification
performance.
Fig. 4.3: Flow Diagram
The implementation and testing approach takes its reference from IEEE Std.
1012-1998 [21], IEEE Standard for Software Verification and Validation.
The software for implementation and testing of the proposed human activity
recognition system using CNN and LSTM algorithms can include several
components, such as:
Data collection and pre-processing - This component involves collecting and
pre-processing the video data; we use the UCF50 dataset, which contains 50
different categories of actions. This component can be implemented using Python
libraries, such as Pandas, NumPy, and Scikit-Learn.
Model development and training - This component involves developing and
training the CNN and LSTM models to recognize human activities based on the
preprocessed data. This component can be implemented using deep learning
frameworks, such as TensorFlow or PyTorch.
Model evaluation and validation - This component involves evaluating the
performance of the model using validation and testing datasets. This component
can be implemented using Python libraries, such as Matplotlib and Scikit-Learn.
Deployment - This component involves deploying the trained model in a real-
world setting, such as a mobile application or a web-based interface. This
component can be implemented using software development frameworks, such
as Flask or Django.
Maintenance and updates - This component involves monitoring the
performance of the deployed model and implementing updates as necessary.
This component can be implemented using software development tools, such as
Git or Jira.
The testing plan for the proposed human activity recognition system can include
several stages, such as:
Unit testing - This involves testing the individual components of the system,
such as the data collection and pre-processing, model development and
training, and model evaluation and validation, to ensure that they are working
as expected.
Acceptance testing - This involves testing the system in a real-world setting
with end-users to ensure that it is meeting their needs and requirements.
Overall, the software for implementation and testing plan of the proposed human
activity recognition system using CNN and LSTM algorithms should be designed to
ensure that the system is accurate, efficient, and reliable, and can be deployed and
maintained effectively.
The testing plan is designed by referring to IEEE Std. 1008-1997 [20], IEEE
Standard for Software Unit Testing. A sample of data is used to measure the
developed software against the requirements.
The project plan follows IEEE Std. 1058-1998 [23], IEEE Standard for Software
Project Management Plans, and IEEE Std. 1540-2001 [24], IEEE Standard for
Software Life Cycle Processes – Risk Management, which together incorporate
and subsume the software development plans organized into the following modules:
Project Planning and Management Module - This module will include the
overall planning and management of the project, including setting project
goals and timelines, creating and assigning tasks, tracking progress, and
communicating with team members.
Data Collection and Pre-processing Module - As specified in IEEE Std.
1058-1998 [23] and IEEE Std. 1540-2001 [24], dataset collection for this
project is implemented, and the sensor data from various sources, such as
wearable devices, smartphones, or cameras, is preprocessed. The
preprocessing step will include filtering, normalization, and feature extraction.
ConvLSTM Model Module - This module will implement the ConvLSTM
model for human activity recognition. It will include designing the model
architecture, training the model on the preprocessed data, and evaluating the
performance of the model on a validation dataset.
LRCN Model Module - This module will implement the LRCN model for
human activity recognition. It will include designing the model architecture,
training the model on the preprocessed data, and evaluating the performance
of the model on a validation dataset.
Training and Testing Phase - This phase includes testing the models on a
testing dataset and evaluating their accuracy using various evaluation
metrics. The testing phase is expected to take two weeks.
The project management plan with modules for human activity recognition will allow
for a well-organized and structured approach to developing the system, ensuring
that each component is carefully designed, implemented, and tested before
integrating them into the final product. It will also ensure that the system meets the
requirements of the end users and is deployed in a secure and reliable manner.
Deep learning techniques such as LSTM and CNN have greatly improved the
accuracy of HAR systems. However, developing a HAR system using LSTM and
CNN can be a complex and resource-intensive process that requires careful
planning and budgeting. In this financial report, we provide an estimate of the
cost of developing a HAR system using LSTM and CNN.
Cost Estimate - Hardware and Software Costs: The hardware and software
cost for developing a HAR system using LSTM and CNN can be significant.
This includes the cost of purchasing and maintaining high-performance
GPUs, cloud computing services, and specialized software such as
TensorFlow or Keras.
Data Acquisition and Pre-processing Costs - Collecting, annotating, and
preprocessing the data required for training and testing the LSTM and CNN
models can be a significant expense. This may involve hiring a team of data
annotators, acquiring datasets from external sources, or using crowdsourcing
platforms. Additionally, the cost of storing and managing large datasets can
be significant.
Model Development Costs - Developing and optimizing the LSTM and CNN
models can be a time-consuming and resource-intensive process. This may
involve hiring machine learning experts or outsourcing the development work
to a third-party provider. Additionally, the cost of testing and validating the
models can be significant, as this requires access to large datasets and
specialized software tools.
Deployment and Maintenance Costs - Once the models have been
developed, they need to be deployed and integrated into the target system.
This may involve additional costs for server infrastructure, API development,
and ongoing maintenance and support. The cost of ongoing maintenance
and support can be significant, especially if the system is deployed in a
complex environment with multiple users or if the models need to be updated
frequently.
Developing a HAR system using LSTM and CNN can be a complex and resource-
intensive process that requires careful planning and budgeting. The cost of
developing a HAR system can vary greatly depending on the size and complexity of
the system, the data sources and size, the programming languages and tools used,
and the expertise of the developers. Therefore, it is important to carefully assess the
requirements of the project and work with experienced developers and data
scientists to ensure the success of the project.
The maintenance and support activities below are described using IEEE Std.
1016-1998 [22], Recommended Practice for Software Design Descriptions.
Maintenance and Support - Regular maintenance and support should be
provided to ensure the system remains functional and effective. This should
include performing regular backups, applying updates and patches, and
providing support to users.
Security - The HAR system should be regularly audited for security
vulnerabilities and measures should be taken to address any issues that are
identified. This can include implementing firewalls, encryption, and access
controls.
Upgrades and Enhancements - The HAR system should be periodically
upgraded and enhanced to ensure that it remains up-to-date with the latest
technology and requirements. This can include adding new features,
improving performance, and enhancing user experience.
Developing HAR system is a complex process that requires careful planning and
execution. Once the system is developed and tested, it needs to be transitioned to
operations to ensure its long-term sustainability and effectiveness. This requires a
well-planned project transition / software-to-operations plan. The plan should include
establishing a project transition team, defining transition requirements, developing
a deployment plan, and developing a maintenance and support plan.
CHAPTER 5
IMPLEMENTATION DETAILS
Overall, the development and deployment setup for human activity recognition using
CNN-LSTM involves several steps, including data collection and preprocessing,
model architecture design, model training, model evaluation, deployment, and
continuous improvement. With proper planning and execution, a robust and
accurate human activity recognition system can be developed and deployed in a
real-world environment.
5.2 ALGORITHMS
Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are two
types of neural network architectures used in machine learning for various
applications, including image recognition, natural language processing, and time
series analysis.
CNN is a deep learning architecture primarily used for image and video recognition.
CNNs use a series of convolutional and pooling layers to extract and learn features
from input images, followed by fully connected layers for classification. The
convolutional layer applies a filter or kernel to the input image, sliding it across the
image and computing a dot product at each position. The pooling layer then reduces
the spatial size of the image by aggregating the output of the previous layer. This
process of convolution and pooling is repeated multiple times to extract higher-level
features from the input image. CNNs are highly effective for image recognition tasks
and have achieved state-of-the-art performance on various benchmark datasets.
LSTM is a recurrent neural network architecture designed to learn order
dependence in sequential data; its gating mechanism allows for the modelling of
long-term dependencies in the input data, and it has been widely used for various
applications, including speech recognition, language translation, and time series
analysis.
Overall, CNN and LSTM are two powerful neural network architectures that have
been used in various machine learning applications. While CNN is mainly used for
image and video recognition, LSTM is used for modelling sequences of data. Both
architectures have achieved state-of-the-art performance on various benchmark
datasets and continue to be widely used and researched in the machine
learning community.
5.3 TESTING
By using IEEE 829 [18], IEEE 1008 [20] and IEEE 1012 [21], the testing plan is
formulated. When using CNN and LSTM algorithms for human activity recognition,
the following testing techniques can be used to evaluate their performance:
Hold-Out Testing - In this technique, the dataset is divided into two parts: a
training set and a testing set. The model is trained on the training set and
then evaluated on the testing set to determine its accuracy.
Cross-Validation - As described earlier, this technique can also be used for
CNN and LSTM algorithms to evaluate their performance.
Confusion Matrix - The confusion matrix is a useful tool to evaluate the
classification performance of the model. It shows the number of correct and
incorrect predictions made by the model for each activity class.
F1 Score - The F1 score is a measure of the model's accuracy, calculated as
the harmonic mean of precision and recall. It is a useful metric for evaluating
the overall performance of the model.
CHAPTER 6
OUTCOMES AND DISCUSSIONS
6.1 INTRODUCTION
Human Activity Recognition (HAR) using Convolutional Neural Networks (CNN) and
Long Short-Term Memory (LSTM) models has shown promising results in various
research studies. HAR aims to identify the activities performed by humans based
on sensor data collected from wearable devices or smartphones.
CNNs are well-known for their ability to extract spatial features from image data,
while LSTMs are known for their ability to capture temporal dependencies in
sequential data. When combined, CNN-LSTM models can effectively process both
spatial and temporal features from sensor data.
The outcomes of HAR using CNN-LSTM models can vary depending on various
factors such as the dataset used, the architecture of the model, and the
preprocessing techniques applied to the data. However, the outcomes reported so
far demonstrate the potential of these models in various domains, including
healthcare, sports, and security.
6.2 IEEE STANDARDS FOLLOWED IN THE PROJECT
IEEE 1855-2016 - This standard provides a framework for the design and
implementation of machine learning algorithms, including deep learning
algorithms such as CNN and LSTM models. It can be useful in guiding the
development and evaluation of HAR systems that use these models.
IEEE 754-2019 - This standard specifies formats and methods for performing
floating-point arithmetic in computer systems, which is relevant for the
numerical computations involved in training and deploying CNN and LSTM
models.
IEEE 29148-2018 - This standard provides guidelines for the software and
system requirements engineering process, which is important for ensuring
the quality and reliability of the HAR system.
IEEE 1063-2015 - This standard provides guidelines for the software life
cycle processes, which can be useful in guiding the development, testing,
and maintenance of the HAR system.
IEEE 1878-2018 - This standard provides guidelines for the testing and
evaluation of machine learning algorithms, including deep learning models. It
can be useful in ensuring the accuracy and reliability of the HAR system.
6.3 CONSTRAINTS
Human Activity Recognition (HAR) using CNN and LSTM models without sensors
can face several constraints that may affect the performance and accuracy of the
system. Some of these constraints include:
Computational Resource - CNN and LSTM models require significant
computational resources for training and inference. Without access to
powerful computing resources, training and inference can be slow or
infeasible, making it challenging to develop an accurate HAR system.
Model Complexity - CNN and LSTM models can be complex, with many
layers and parameters. As the complexity of the model increases, so does
the risk of overfitting to the training data, which can reduce the generalizability
of the model to new, unseen data.
Interpretability - CNN and LSTM models can be challenging to interpret,
which can limit the ability of researchers and practitioners to understand the
factors that contribute to human activities and improve the accuracy of the
system.
Ethics and Privacy - HAR systems can raise ethical and privacy concerns,
particularly if they are used to collect data from individuals without their
consent or knowledge. Ensuring that the HAR system is developed and
deployed in an ethical and privacy-respecting manner is important for
protecting the rights and well-being of individuals.
These constraints can pose significant challenges for developing accurate and
effective HAR systems using CNN and LSTM models without sensors. Addressing
these challenges requires careful consideration of the available resources, the
complexity of the model, and the ethical and privacy implications of the system.
6.4 TRADEOFF IN THE PROJECT
Using IEEE Std. 1540-2001 [24], IEEE Standard for Software Life Cycle Processes
– Risk Management, the tradeoffs below are formulated. In the project of Human
Activity Recognition (HAR) using LSTM and CNN, there are several tradeoffs that
need to be considered; in particular, deeper and larger models tend to achieve
higher accuracy at the cost of greater computational complexity. Thus, there may
be a tradeoff between accuracy and computational complexity.
CHAPTER 7
RESULTS AND DISCUSSIONS
Fig. 7.2.2: Creation of ConvLSTM Model
Here we have our ConvLSTM Model’s Loss and Accuracy Curves from our Training
and Testing Steps.
Fig. 7.2.4: Total Accuracy vs Total Validation Accuracy of ConvLSTM Model
We achieved an accuracy of 77% for the ConvLSTM Model after Training and
Testing.
Fig. 7.2.6: Creation of LRCN Model
Here we have our LRCN Model’s Loss and Accuracy Curves from our Training and
Testing Steps.
Fig. 7.2.9: Accuracy of LRCN Model
After training and testing the LRCN Model we got an accuracy of 87%.
The above picture shows that the predicted human action is Javelin Throw.
Fig. 7.2.11: Picture showing the predicted action as Diving
The above picture shows that the predicted human action is Playing Tabla.
CHAPTER 8
CONCLUSION
8.1 CONCLUSION
In conclusion, human activity recognition using CNN and LSTM algorithms has
shown promising results in recent years. CNNs are effective in extracting spatial
features from images, while LSTMs can capture the temporal dynamics of
sequential data. By combining these two architectures, we can achieve a more
accurate and robust model for human activity recognition.
The success of human activity recognition using CNN and LSTM algorithms has led
to numerous applications in healthcare, sports, and security, among others. It has
the potential to enhance the quality of life for individuals and contribute to the
development of more efficient and intelligent systems in various domains.
There are several potential areas of future work for human activity recognition using
CNN and LSTM algorithms. Some of these include:
Transfer learning - Leveraging models pre-trained on large, related datasets
can reduce the need for large labelled datasets and improve the
generalization performance of the model.
Real-time recognition - Many applications of human activity recognition
require real-time processing of data. Future work can focus on developing
models that can perform real-time recognition of human activities with low
latency and high accuracy.
Privacy and security - As human activity recognition systems become more
prevalent, there is a need to address privacy and security concerns. Future
work can explore methods to ensure that these systems protect user privacy
and prevent malicious attacks.
Overall, there are several exciting directions for future work in human activity
recognition using CNN and LSTM algorithms, which can further improve the
accuracy, efficiency, and applicability of these systems.
There are several research issues that need to be addressed in human activity
recognition using CNN and LSTM algorithms. Some of these include:
One of the major challenges in human activity recognition is the lack of large,
diverse, and annotated datasets. This can limit the accuracy and
generalization performance of the model. Future work can focus on
developing new datasets or leveraging transfer learning techniques to
address this issue.
While CNN and LSTM algorithms are effective in recognizing human
activities, they often lack interpretability. It can be challenging to understand
why a particular activity was recognized or to identify the key features that
contribute to the classification decision. Future work can focus on developing
methods to improve the interpretability of the model.
Human activity recognition systems must be designed to work in real-world
settings, where there may be variations in lighting, background noise, and
other factors. Future work can investigate methods to improve the robustness
of the system and ensure that it can operate effectively in such settings.
As human activity recognition systems become more prevalent, there is a
need to address ethical considerations, such as privacy, bias, and fairness.
Future work can explore methods to mitigate these concerns and ensure that
the systems are developed and deployed responsibly.
Overall, addressing these research issues can improve the accuracy, efficiency, and
applicability of human activity recognition using CNN and LSTM algorithms, and
ensure that these systems can be deployed effectively and responsibly.
Overall, addressing these implementation issues can help to ensure that the human
activity recognition system using CNN and LSTM algorithms is developed and
deployed effectively and can provide accurate and reliable results.
REFERENCES
[1] Usharani, J.; Saktivel, U. Human Activity Recognition using Android Smartphone.
In Proceedings of the International Conference on Innovations in Computing &
Networking ICICN16, Bengaluru, Karnataka, 2016.
[2] Anguita, D.; Ghio, A.; Oneto, L.; Parra-Llanas, X.; Reyes-Ortiz, J. Energy Efficient
Smartphone-Based Activity Recognition using Fixed-Point Arithmetic. J. Univers.
Comput. Sci. 2013, 19, 1295–1314.
[3] Mary Posonia, A. et al. Service Direct: Platform that Incorporates Service
Providers and Consumers Directly. Int. J. Eng. Adv. Technol. (IJEAT) 2019.
[4] Uddin, Z.; Soylu, A. Human activity recognition using wearable sensors,
discriminant analysis, and long short-term memory-based neural structured
learning. Sci. Rep. 2021, 11, 16455.
[5] Vakili, M.; Rezaei, M. Incremental Learning Techniques for Online Human Activity
Recognition. arXiv 2021, arXiv:2109.09435.
[6] Muangprathub, J.; Sriwichian, A.; Wanichsombat, A.; Kajornkasirat, S.; Nillaor,
P.; Boonjing, V. A Novel Elderly Tracking System Using Machine Learning to Classify
Signals from Mobile and Wearable Sensors. Int. J. Environ. Res. Public Health 2021,
18, 12652.
[7] Joshila Grace, L.K. et al. A Joint Optimization Approach for Security and
Insurance Management on the Cloud. Lecture Notes in Networks and Systems, 2021.
[8] Zhou, B.; Yang, J.; Li, Q. Smartphone-Based Activity Recognition for Indoor
Localization Using a Convolutional Neural Network. Sensors 2019, 19, 621.
[9] Murad, A.; Pyun, J.-Y. Deep Recurrent Neural Networks for Human Activity
Recognition. Sensors 2017, 17, 2556.
[10] Eyobu, S.O.; Han, D.S. Feature Representation and Data Augmentation for
Human Activity Classification Based on Wearable IMU Sensor Data Using a
Deep LSTM Neural Network. Sensors 2018, 18, 2892.
[11] Moya Rueda, F.; Grzeszick, R.; Fink, G.A.; Feldhorst, S.; ten Hompel, M.
Convolutional Neural Networks for Human Activity Recognition Using Body-Worn
Sensors. Informatics 2018, 5, 26.
[12] Kim, K.; Choi, S.; Chae, M.; Park, H.; Lee, J.; Park, J. A Deep Learning Based
Approach to Recognizing Accompanying Status of Smartphone Users Using
Multimodal Data. Journal of Intelligence and Information Systems 2019, 25,
163–177.
[13] Mary Posonia, A. et al. An Efficient Algorithm for Traffic Congestion Control.
[14] Chollet, F. Layer wrappers. Keras Documentation.
[16] Wang, J.; Zhang, X.; Wu, Y.; Wang, Y. Unsupervised Learning of Human
Activities from Long-Term Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022.
doi: 10.1109/TPAMI.2022.3182538.
[17] Yao, J.; Zhang, L.; Lu, J.; Xu, Y. Self-Supervised Learning of Human Activities
from Temporal Segments in Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2021.
doi: 10.1109/TPAMI.2021.3111372.
[18] IEEE Std. 829-1998 IEEE Standard for Software Test Documentation
[19] IEEE Std. 830-1998 IEEE Recommended Practice for Software Requirements
[20] IEEE Std. 1008-1997 IEEE Standard for Software Unit Testing
[21] IEEE Std. 1012-1998 IEEE Standard for Software Verification and Validation
49
[22] IEEE Std. 1016-1998 IEEE Recommended Practice for Software Design
Descriptions
[23] IEEE Std. 1058-1998 IEEE Standard for Software Project Management Plans
[24] IEEE Std. 1540-2001 IEEE Standard for Software Life Cycle Processes – Risk
Management
50
APPENDIX
A. SOURCE CODE
import os
import cv2
import pafy
import math
import random
import numpy as np
import datetime as dt
import tensorflow as tf
from collections import deque
import matplotlib.pyplot as plt
from moviepy.editor import *
%matplotlib inline
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from keras.utils.vis_utils import plot_model
seed_constant = 27
np.random.seed(seed_constant)
random.seed(seed_constant)
tf.random.set_seed(seed_constant)
plt.figure(figsize = (25, 25))

# Pick 25 random classes from the UCF50 dataset and show the first frame of a
# random video from each class in a 5x5 grid.
all_classes_names = os.listdir('UCF50')
random_range = random.sample(range(len(all_classes_names)), 25)

for counter, random_index in enumerate(random_range, 1):
    selected_class_name = all_classes_names[random_index]
    video_files_names_list = os.listdir(f'UCF50/{selected_class_name}')
    selected_video_file_name = random.choice(video_files_names_list)
    video_reader = cv2.VideoCapture(f'UCF50/{selected_class_name}/{selected_video_file_name}')
    _, bgr_frame = video_reader.read()
    video_reader.release()
    rgb_frame = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR
    plt.subplot(5, 5, counter)
    plt.imshow(rgb_frame)
    plt.axis('off')
IMAGE_HEIGHT , IMAGE_WIDTH = 64, 64
SEQUENCE_LENGTH = 20
DATASET_DIR = "UCF50"
CLASSES_LIST = ["WalkingWithDog", "TaiChi", "Swing", "HorseRace"]
def frames_extraction(video_path):
    '''Read a video and return SEQUENCE_LENGTH evenly spaced, resized,
    normalised frames (fewer if the video ends early).'''
    frames_list = []
    video_reader = cv2.VideoCapture(video_path)
    video_frames_count = int(video_reader.get(cv2.CAP_PROP_FRAME_COUNT))
    # Interval between the frames sampled from the video.
    skip_frames_window = max(int(video_frames_count / SEQUENCE_LENGTH), 1)
    for frame_counter in range(SEQUENCE_LENGTH):
        video_reader.set(cv2.CAP_PROP_POS_FRAMES, frame_counter * skip_frames_window)
        success, frame = video_reader.read()
        if not success:
            break
        resized_frame = cv2.resize(frame, (IMAGE_HEIGHT, IMAGE_WIDTH))
        normalized_frame = resized_frame / 255   # scale pixels to [0, 1]
        frames_list.append(normalized_frame)
    video_reader.release()
    return frames_list
def create_dataset():
    '''Extract frame sequences and labels for every video of every class.'''
    features = []
    labels = []
    video_files_paths = []
    for class_index, class_name in enumerate(CLASSES_LIST):
        print(f'Extracting Data of Class: {class_name}')
        files_list = os.listdir(os.path.join(DATASET_DIR, class_name))
        for file_name in files_list:
            video_file_path = os.path.join(DATASET_DIR, class_name, file_name)
            frames = frames_extraction(video_file_path)
            # Keep only videos that yielded a full sequence of frames.
            if len(frames) == SEQUENCE_LENGTH:
                features.append(frames)
                labels.append(class_index)
                video_files_paths.append(video_file_path)
    features = np.asarray(features)
    labels = np.array(labels)
    return features, labels, video_files_paths
features, labels, video_files_paths = create_dataset()
one_hot_encoded_labels = to_categorical(labels)
features_train, features_test, labels_train, labels_test = train_test_split(
    features, one_hot_encoded_labels, test_size = 0.25, shuffle = True,
    random_state = seed_constant)
def create_convlstm_model():
    model = Sequential()
    model.add(ConvLSTM2D(filters = 4, kernel_size = (3, 3), activation = 'tanh',
                         data_format = 'channels_last', recurrent_dropout = 0.2,
                         return_sequences = True,
                         input_shape = (SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)))
    model.add(MaxPooling3D(pool_size = (1, 2, 2), padding = 'same',
                           data_format = 'channels_last'))
    model.add(TimeDistributed(Dropout(0.2)))
    model.add(ConvLSTM2D(filters = 8, kernel_size = (3, 3), activation = 'tanh',
                         data_format = 'channels_last', recurrent_dropout = 0.2,
                         return_sequences = True))
    model.add(MaxPooling3D(pool_size = (1, 2, 2), padding = 'same',
                           data_format = 'channels_last'))
    model.add(TimeDistributed(Dropout(0.2)))
    model.add(ConvLSTM2D(filters = 14, kernel_size = (3, 3), activation = 'tanh',
                         data_format = 'channels_last', recurrent_dropout = 0.2,
                         return_sequences = True))
    model.add(Flatten())
    # The printed listing ends the network at Flatten(); a softmax classification
    # head is restored here (an assumption) so that training against the one-hot
    # labels below can work.
    model.add(Dense(len(CLASSES_LIST), activation = 'softmax'))
    model.summary()
    return model

convlstm_model = create_convlstm_model()

# The listing uses early_stopping_callback without defining it and never calls
# compile(); both are reconstructed here with conventional settings (assumed).
early_stopping_callback = EarlyStopping(monitor = 'val_loss', patience = 10,
                                        mode = 'min', restore_best_weights = True)
convlstm_model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam',
                       metrics = ['accuracy'])

convlstm_model_training_history = convlstm_model.fit(
    x = features_train, y = labels_train, epochs = 50, batch_size = 4,
    shuffle = True, validation_split = 0.2, callbacks = [early_stopping_callback])

model_evaluation_history = convlstm_model.evaluate(features_test, labels_test)
model_evaluation_loss, model_evaluation_accuracy = model_evaluation_history
# Save the trained model with a timestamp and its test metrics in the file name.
date_time_format = '%Y_%m_%d__%H_%M_%S'
current_date_time_dt = dt.datetime.now()
current_date_time_string = dt.datetime.strftime(current_date_time_dt, date_time_format)
model_file_name = (f'convlstm_model___Date_Time_{current_date_time_string}___'
                   f'Loss_{model_evaluation_loss}___Accuracy_{model_evaluation_accuracy}.h5')
convlstm_model.save(model_file_name)
def plot_metric(model_training_history, metric_name_1, metric_name_2, x_label, y_label):
    '''Plot two training-history metrics (e.g. loss vs. validation loss).'''
    metric_value_1 = model_training_history.history[metric_name_1]
    metric_value_2 = model_training_history.history[metric_name_2]
    epochs = range(len(metric_value_1))
    plt.plot(epochs, metric_value_1, 'green', label = metric_name_1)
    plt.plot(epochs, metric_value_2, 'orangered', label = metric_name_2)
    plt.xlabel(str(x_label))
    plt.ylabel(str(y_label))
    plt.legend()

plot_metric(convlstm_model_training_history, 'loss', 'val_loss',
            'No. of epochs', 'Losses')
plot_metric(convlstm_model_training_history, 'accuracy', 'val_accuracy',
            'No. of epochs', 'Accuracies')
def create_LRCN_model():
    model = Sequential()
    # TimeDistributed applies the same 2D CNN to every frame of the sequence.
    model.add(TimeDistributed(Conv2D(16, (3, 3), padding = 'same', activation = 'relu'),
                              input_shape = (SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)))
    model.add(TimeDistributed(MaxPooling2D((4, 4))))
    model.add(TimeDistributed(Dropout(0.25)))
    # The printed listing is truncated here; the remaining layers are a plausible
    # reconstruction (assumed): a deeper conv block, per-frame flattening, an
    # LSTM over the frame features, and a softmax classifier.
    model.add(TimeDistributed(Conv2D(32, (3, 3), padding = 'same', activation = 'relu')))
    model.add(TimeDistributed(MaxPooling2D((4, 4))))
    model.add(TimeDistributed(Flatten()))
    model.add(LSTM(32))
    model.add(Dense(len(CLASSES_LIST), activation = 'softmax'))
    model.summary()
    return model

LRCN_model = create_LRCN_model()
LRCN_model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam',
                   metrics = ['accuracy'])

LRCN_model_training_history = LRCN_model.fit(
    x = features_train, y = labels_train, epochs = 70, batch_size = 4,
    shuffle = True, validation_split = 0.2, callbacks = [early_stopping_callback])

model_evaluation_history = LRCN_model.evaluate(features_test, labels_test)
model_evaluation_loss, model_evaluation_accuracy = model_evaluation_history
date_time_format = '%Y_%m_%d__%H_%M_%S'
current_date_time_dt = dt.datetime.now()
current_date_time_string = dt.datetime.strftime(current_date_time_dt, date_time_format)
model_file_name = (f'LRCN_model___Date_Time_{current_date_time_string}___'
                   f'Loss_{model_evaluation_loss}___Accuracy_{model_evaluation_accuracy}.h5')
LRCN_model.save(model_file_name)

# The original calls passed plot titles in place of the axis labels; the
# arguments are aligned with plot_metric's signature here.
plot_metric(LRCN_model_training_history, 'loss', 'val_loss',
            'No. of epochs', 'Losses')
plot_metric(LRCN_model_training_history, 'accuracy', 'val_accuracy',
            'No. of epochs', 'Accuracies')
def download_youtube_videos(youtube_video_url, output_directory):
    '''Download the best available stream of a YouTube video via pafy.
    NOTE: pafy requires a working youtube-dl backend.'''
    video = pafy.new(youtube_video_url)
    title = video.title
    video_best = video.getbest()
    output_file_path = f'{output_directory}/{title}.mp4'
    video_best.download(filepath = output_file_path, quiet = True)
    return title

test_videos_directory = 'test_videos'
os.makedirs(test_videos_directory, exist_ok = True)

video_title = download_youtube_videos('https://www.youtube.com/watch?v=8u0qjmHIOcE',
                                      test_videos_directory)
input_video_file_path = f'{test_videos_directory}/{video_title}.mp4'
def predict_on_video(video_file_path, output_file_path, SEQUENCE_LENGTH):
    '''Run the LRCN model over a video with a sliding window of frames and
    write an annotated copy with the predicted class drawn on each frame.'''
    video_reader = cv2.VideoCapture(video_file_path)
    original_video_width = int(video_reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    original_video_height = int(video_reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
    video_writer = cv2.VideoWriter(output_file_path,
                                   cv2.VideoWriter_fourcc('M', 'P', '4', 'V'),
                                   video_reader.get(cv2.CAP_PROP_FPS),
                                   (original_video_width, original_video_height))
    frames_queue = deque(maxlen = SEQUENCE_LENGTH)
    predicted_class_name = ''
    while video_reader.isOpened():
        ok, frame = video_reader.read()
        if not ok:
            break
        resized_frame = cv2.resize(frame, (IMAGE_HEIGHT, IMAGE_WIDTH))
        normalized_frame = resized_frame / 255
        frames_queue.append(normalized_frame)
        # Predict only once the rolling window holds a full sequence.
        if len(frames_queue) == SEQUENCE_LENGTH:
            predicted_labels_probabilities = LRCN_model.predict(
                np.expand_dims(frames_queue, axis = 0))[0]
            predicted_label = np.argmax(predicted_labels_probabilities)
            predicted_class_name = CLASSES_LIST[predicted_label]
        cv2.putText(frame, predicted_class_name, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        video_writer.write(frame)
    video_reader.release()
    video_writer.release()

output_video_file_path = (f'{test_videos_directory}/{video_title}-Output-'
                          f'SeqLen{SEQUENCE_LENGTH}.mp4')
predict_on_video(input_video_file_path, output_video_file_path, SEQUENCE_LENGTH)
VideoFileClip(output_video_file_path, audio = False,
              target_resolution = (300, None)).ipython_display()
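Note: pafy depends on an external youtube-dl backend, which can break as YouTube changes; the prediction pipeline itself does not depend on it, so predict_on_video can equally be pointed at any clip already on disk. The file name below is hypothetical:

# Hypothetical local file; any video on disk works in place of the download.
predict_on_video('test_videos/sample_clip.mp4',
                 'test_videos/sample_clip-Output.mp4',
                 SEQUENCE_LENGTH)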
B. SCREENSHOTS
Fig. B.3: Extracting the Frames
Fig. B.5: Creating ConvLSTM Model
Fig. B.7: Plotting the Model
Fig. B.9: Plotting Loss and Accuracy Curves of ConvLSTM Model
Fig. B.11: LRCN Model Created Successfully
Fig. B.13: Evaluating the Model
Fig. B.15: Function to download video from YouTube
Fig. B.19: Testing on Videos
C. RESEARCH PAPER
Abstract— These days, it is not uncommon to see video cameras installed to keep an eye on pedestrians and motorists alike in a variety of public spaces. The proliferation of camera footage necessitates the creation of some means of deducing the nature of the activity captured on film. Due to the widespread availability of acquisition devices like handsets and camcorders, HAR may be used in a wide variety of contexts. The proliferation of electronic gadgets and software has been matched by a revolution in data extraction made possible by breakthroughs in AI. Difficulties such as background clutter, partial occlusion, and variations in size, perspective, illumination, and appearance make it difficult to recognise human actions in video frames or still photographs. Multimodal recognition is necessary in several fields, such as CCTV, human-machine interaction, and robotics, for characterising human behaviour. Here, we survey the most current and cutting-edge findings from the study of how to categorise human actions. In this paper, we propose a taxonomy for studying human activity and explore the benefits and drawbacks of this diverse set of approaches. Disabled people's daily routines, including resting, moving, travelling down and up the stairs, speaking, and lying down, have been frequently tracked using cellphones. Human motion detection often makes use of well-established ML and DL techniques, including CNN and LSTM networks.

Keywords— Machine Learning, Human-Machine, Convolution Neural Network, Long Short-Term Memory, Human Activity Recognition.

I. INTRODUCTION

Accurately identifying and classifying human actions from wearable sensors is the goal of HAR, a topic of research. This is often accomplished by analysing information gleaned from portable tech like wearables, cellphones, or other monitors that record a user's movement, posture, or gestures. To improve areas as diverse as monitoring devices, sports evaluation metrics, and more, HAR seeks to create technologies that can accurately identify and comprehend a wide variety of human motion and activity in real time.

HAR is a method that may be used with either security cameras or regular cameras to identify a wide range of human activities. Researchers in the fields of healthcare, elderly services, and life welfare services have become more interested in HAR in recent years because of its potential to enable the automated monitoring and comprehension of patients' or residents' behaviours in "smart" settings like hospitals and residences. A HAR setup in a smart device, for illustration, may keep track of what its inhabitants get up to on a weekly, quarterly, and annual basis just by seeing them. Physicians may look into the residents' routines and habits and provide recommendations based on that information. A HAR system is able to detect falls and other abnormalities in older people's activity levels with relative ease. Activity recognition's foundational approach is made up of activity feature extraction, modelling, and recognition methods. The complexity of video-based HAR stems from the fact that, unlike arm movements or sign communications, motion detection of the complete body must be taken into account. Therefore, it is crucial to have a whole-body model in order to accurately depict human motion. Despite the fact that video-based HAR technologies have attracted the attention of many academics owing to
their potential use, reliable detection of human actions remains a significant difficulty.

Recognizing human actions is an intricate time-series classification problem. To accurately generate attributes from the original data to suit an ML model, prior approaches relied on in-depth domain understanding and noise-removal methodologies. By learning features directly from the raw data, DL techniques such as CNNs and RNNs have recently proven competent and even produce state-of-the-art outcomes. This article introduces the human activity identification problem and the state-of-the-art DL approaches that have been developed to solve it.

Vision-based approaches towards person identification involve using image or video data to identify individuals. These methods typically involve using computer vision techniques to extract features from images or video frames that can be used to distinguish between individuals. These techniques have a wide range of applications in security, access control, and forensic investigations.

Some of the characteristics of human behavior in this classification are:

Motion patterns: Motion patterns refer to the way humans move their bodies when performing different activities. Different activities have unique motion patterns that can be captured using sensors. For example, walking has a distinct motion pattern compared to running or cycling.

Posture and orientation: Posture and orientation refer to the position and orientation of different body parts during an activity. For example, standing has a different posture and orientation compared to sitting or lying down.

In essence, the task of categorising activities is a time-series issue. One application of supervised ML is time-series categorization. It may be used for predicting and offloading sensor data, as well as predicting future values based on historical data using statistical methods. Neural networks are the current gold standard for wearable sensors; the most popular methods for this job are a subset of neural network models called CNNs and a subset of recurrent models called LSTMs.

A subset of ML and computer vision, HAR seeks to autonomously identify and categorise human activities. With several potential uses in areas including health, sports, entertainment, and defense, it is quickly rising in importance as an R&D focus. Inertial sensors, magnetometers, and gyros are just some of the sensors that are often used in HAR systems to track and record user motion and behaviour. Many handsets, watches, and fitness bands include built-in sensors that allow for continuous monitoring of user activities and instantaneous data collection.

In the medical profession, for example, HAR may be used to track patients' activity levels and identify when they need assistance getting about. It has applications in both games and sport, where it can be employed to monitor and enhance player efficiency, and where it can be harnessed to allow players to direct virtual avatars with their own bodies. In the field of security, it may be used to monitor for and react to suspicious activity.

As a whole, HAR is making great strides forward, and the pace of technological progress suggests that even more advanced and precise algorithms will be produced in the not-too-distant future.

II. LITERATURE REVIEW

Over the last several years, researchers have examined a variety of approaches to the problem of identifying human actions. The literature on activity recognition is reviewed in this section.

Using the Android operating system, Usharani J. [1] proposed a HAR system. By analysing data from accelerometers, they were able to develop a programme that allowed learning and categorization to take place in real time through the cloud. To improve the k-NN classifier's efficiency, reliability, and overall processing time on the Android version while using fewer resources, they adopted the clustering k-NN strategy. They also determined that different device types and capabilities resulted in different categorization times.

In their article [2], Davide Anguita et al. developed an adapted SVM technique, which uses fixed-point computing to generate an energy-efficient framework for the categorization of human actions using a phone. The novel technique presented was developed with the intention of being used in a wide range of intelligent applications and advanced surroundings to expedite computation while minimising the use of system resources, thereby reducing energy, while still achieving results competitive with those of other commonly employed classification methods.

Md Zia Uddin [3] suggested a body-sensor-based behavior detection method using deep NSL based on LSTM to comprehend people's behaviour in various settings. KDA was used to improve feature aggregation across all tasks; this method increases inter-class scatter while decreasing intra-class scatter in the data. The suggested model outperformed other DL models such as the RNN, CNN, and
Bayesian Belief Networks in terms of recall, with an impressive 98%.

For continuous estimation of human biological motions using cellphone motion detectors, Meysam Vakili [4] presented a real-time HAR framework. Six incremental learning methods were employed to evaluate the system's efficacy over a set of twenty tasks; these results were then compared to those obtained using state-of-the-art HAR techniques such as decision trees and AdaBoost. The highest achieved accuracy was 95%, obtained with sequential k-NN and additive naïve Bayes classifiers.

An innovative elderly-surveillance technology based on an ML algorithm was presented in a publication by Jirapond Muangprathub [5]. The authors found that the k-NN algorithm with a k value of five provided the highest accuracy (96.45%) when attempting to identify the real-time activities of the aged. Moreover, they developed a system that provides geospatial data presentation for the elderly, and in emergency situations the person may use a messaging instrument to ask for assistance.

For the purpose of identifying human activities within buildings, Baoding Zhou [6] presented a convolutional neural network. Based on data from cellphones' motion sensors, gyros, accelerometers, and leading indicators, nine distinct activities were identified. Reliability of 98% was achieved using the suggested strategy.

Six separate behaviors may be recognised using data from smartphones, and Abdulmajid Murad [7] suggested using a deep LSTM to do so. When tested on the UCI datasets, the system performed at an impressive 96.70% correctness.

For activity recognition, it is necessary to provide people with sensors, and the biological sensing technology available today is imprecise and unreliable. The sensors must be installed in advance of HAR. Due to the high price of sensors, and the even higher price we face in the event of a damaged sensor, using sensors is inefficient.

There are a few disadvantages too. Accuracy: current systems have low accuracy and may generate false positives and false negatives despite technological and ML algorithmic advancements; this is especially true of elaborate actions that defy easy description. The accuracy of HAR systems is very dependent on the quality of the sensor used, and for this reason not all gadgets or wearables will include high-quality sensors. The battery life of wearables and cellphones used for HAR is often short, which might restrict their use and efficacy; since this might compromise the accuracy of the data, it is a big issue for prolonged usage. Concerns have been raised concerning the gathering and handling of personal information for HAR, which raises issues of security and confidentiality of data; protecting sensitive information requires stringent security and safety safeguards. Due to their high price tag, many people are priced out of using current HAR systems. In addition, some consumers may find it prohibitive to invest in such systems due to the high upfront cost of wearables, monitors, and other gear. Finally, there is an inability to easily compare and integrate data from various sources due to a general lack of standards and compatibility across HAR systems.

III. PROPOSED SYSTEM

After examining the shortcomings of prior efforts, we developed a novel method. The suggested system can identify human behavior with no extra sensors; only the live footage from the cameras matters. For the purpose of detecting human activity, the suggested system collects data over a predetermined time period, and the output reveals what a person is doing.

There is no preliminary configuration needed prior to rollout. Sensors are employed to monitor human activities in the current system; in this case, sensor equipment is unnecessary. Because of this, there is no longer a need to spend money on sensors. In other words, we can get by without using sensors. Adding new features and capabilities to the existing model is simple and requires no more hardware.

CNN –

CNN (Convolutional Neural Network) models are often used in human activity recognition, including in LRCN (Long-term Recurrent Convolutional Networks) models, for several reasons:

1) Spatial feature extraction: CNN layers capture spatial cues such as posture and orientation, that is, the position and orientation of different body parts during an activity. For example, standing has a different posture and orientation compared to sitting or lying down.

2) Robustness to variations: CNN models are designed to be robust to variations in input data, such as changes in lighting or orientation. In human activity recognition, variations in video data due to changes in lighting or position can affect the accuracy of the model. By using a CNN model to extract spatial features, the LRCN model can be more robust to these variations.

3) End-to-end training: LRCN models are trained end-to-end, meaning that both the CNN and LSTM (Long Short-Term Memory) layers are trained simultaneously. This allows the model to learn features that are optimized for the task of human activity recognition. By using a CNN model as part of the LRCN model, the model can
learn spatial features that are optimized for the task, leading to better performance.

The use of a CNN model in human activity recognition is important because it allows for the extraction of spatial features that are important for identifying different activities. The CNN model also helps to make the LRCN model more robust to variations in input data and allows for end-to-end training, leading to better performance.

LSTM –

LSTM (Long Short-Term Memory) models are often used in LRCN (Long-term Recurrent Convolutional Networks) models for human activity recognition for several reasons:

1) Temporal sequence modeling: LSTM models are designed to model sequences of data, which makes them well-suited for human activity recognition tasks where the temporal aspect of the data is important. By using an LSTM layer in the LRCN model, the model can learn to recognize patterns in the sequences of spatial features extracted by the CNN layer.

2) Memory retention: LSTMs are designed to remember past inputs, which is important for human activity recognition, where the context of the activity matters. By retaining memory of previous inputs, the LSTM layer can better understand the context of the current input and improve the accuracy of activity recognition.

LSTM allows for the modeling of temporal sequences of spatial features, which are important for identifying different activities. LSTMs are also robust to variations in input data and can retain memory of past inputs, leading to better performance in human activity recognition tasks.

Data Pre-processing –

We proceed to preprocess the dataset. To simplify computations, we first read the video files from the dataset and resize the video frames to a fixed width and height. We next normalise the data to the range [0, 1] by dividing the pixel values by 255, which speeds up convergence while training the network. We implement a function that takes a video's path as an argument and outputs a list of the video's resized and normalised frames. Not every frame is added to the list, because we only require a sequence of evenly spaced frames; the function reads the video frame by frame.

Feature Extraction –

In Human Activity Recognition (HAR) using LRCN, feature extraction is a critical process to capture relevant information from raw input data, such as video frames. The LRCN model combines Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers to extract spatial and temporal features from the input data, respectively.

The feature extraction process can be divided into two parts: spatial feature extraction using CNN and temporal sequence modeling using LSTM. In the first part, the input video frames are passed through multiple convolutional layers that extract relevant spatial features from the frames. The output of the convolutional layers is a 3D tensor that contains information about the spatial features of the frames.

In the second part, the 3D tensor output from the CNN layers is fed to the LSTM layers for temporal sequence modeling. The LSTM layers can capture the temporal dependencies between the frames in the video and learn to represent the sequential patterns in the data. The LSTM layers are designed to process sequential data, such as videos, and are able to remember long-term dependencies in the input sequence.

The output of the LSTM layers is a fixed-length vector that represents the temporal features of the input video sequence. This vector can be used as input to a classifier that predicts the activity label. The classifier can be trained using supervised learning with labeled training data, where the input features are the fixed-length vector obtained from the LSTM layers and the output is the activity label.

Overall, the feature extraction process in HAR using LRCN involves using CNN layers to extract spatial features from the input video frames and LSTM layers to capture the temporal dependencies between the frames and extract temporal features from the video sequence. The combination of these two types of features allows the LRCN model to effectively recognize human activities from video data.

Implementation –

To analyse and forecast on pictures, a CNN or ConvNet is a form of deep neural network developed particularly for this task. It uses kernels (called filters) to analyse a picture and produce image features (which depict whether a particular feature is visible at a spatial position or not), with the number of feature maps increasing and their spatial sizes decreasing as we progress deeper into the network via pooling operations.
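As a concrete illustration of this idea (a sketch only, not the exact architecture used in this work), a minimal per-frame CNN could look as follows, with the number of feature maps growing while pooling shrinks the spatial grid:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

# Illustrative per-frame feature extractor for 64x64 RGB frames.
frame_cnn = Sequential([
    Conv2D(16, (3, 3), activation='relu', padding='same', input_shape=(64, 64, 3)),
    MaxPooling2D((2, 2)),           # 64x64 -> 32x32, 16 feature maps
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),           # 32x32 -> 16x16, 32 feature maps
    Flatten(),
    Dense(64, activation='relu'),   # compact per-frame feature vector
])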
An LSTM takes into account all of the preceding inputs before producing an output, making it ideal for use with a data series. RNNs, of which LSTMs are a subtype, are notoriously ineffective at handling long-term dependencies in an input signal due to a phenomenon known as the vanishing-gradients problem; the LSTM's gating mechanism was designed to mitigate exactly this.
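Putting the two components together, the following compact sketch shows the LRCN pattern described above. It is a simplified variant of the code in Appendix A, with layer sizes chosen for illustration, not the report's exact model:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, Dense, Flatten, LSTM,
                                     MaxPooling2D, TimeDistributed)

# 20 frames of 64x64 RGB per clip; 4 activity classes, as in Appendix A.
lrcn = Sequential([
    # The same CNN runs on every frame (TimeDistributed) ...
    TimeDistributed(Conv2D(16, (3, 3), activation='relu', padding='same'),
                    input_shape=(20, 64, 64, 3)),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Flatten()),
    # ... and the LSTM integrates the per-frame features over time.
    LSTM(32),
    Dense(4, activation='softmax'),  # one probability per activity class
])
lrcn.compile(optimizer='adam', loss='categorical_crossentropy',
             metrics=['accuracy'])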
…other DL strategies that use raw sensor readings as input revealed that the latter is superior on both fronts. We tested our model on the UCF dataset and found that it performed well. We used the LRCN model since it is robust and provides high accuracy; it achieved an accuracy of 85.25%. The time required to execute the various systems was another parameter that was not assessed in this article but was evident in the trials: when contrasted with our method, the other algorithms took much longer to complete.
REFERENCES
[7] Joshila Grace L.K.; Vigneshwari S.; SathyaBama Krishna R.; Ankayarkanni B.; Mary Posonia A. A Joint Optimization Approach for Security and Insurance Management on the Cloud. Lecture Notes in Networks and Systems, Vol. 430, pp. 405–413.
[17] Yao, J.; Zhang, L.; Lu, J.; Xu, Y. Self-Supervised Learning of Human Activities from Temporal Segments in Videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. doi: 10.1109/TPAMI.2021.3111372.