Predicting Pedestrian Intention to Cross the Road
ABSTRACT The goal of this research is the development of a driver assistance feature that can warn the driver when a pedestrian is at potential risk due to a sudden intention to cross the road. The crossing process is defined as a change of the pedestrian's orientation on the curb toward the road. We built a Convolutional Neural Network (CNN) model combined with a depth-sensing camera to estimate the pedestrian's orientation and distance from the vehicle. The model detects the upper-body keypoints in 2D space, while the depth information makes it possible to translate the points into 3D space. This information is tracked per pedestrian, and any change in the pedestrian's movement pattern toward the road is translated into a warning for the driver. The CNN model is trained end-to-end using different datasets presenting pedestrians in different configurations and scenes.
I. INTRODUCTION
One of the main tasks for assistive and autonomous driving systems is to assure traffic safety for drivers and pedestrians by reducing the human errors that lead to crashes with other vehicles, road infrastructure and pedestrians. Pedestrian injuries in traffic accidents have high lethality due to the vulnerability of pedestrians. According to the Governors Highway Safety Association (GHSA) preliminary report for 2019 [1], 6,590 pedestrians were killed in motor vehicle accidents, an increase of almost 300 deaths over the number reported for 2018.

Governments have spared no effort to make roads safer, by crafting better road regulations and constructing better road infrastructure. On the other hand, technology companies and researchers are working hard to make vehicles safer for both pedestrians and drivers by using advanced technologies that help the driver avoid crashes and, when a crash does happen, reduce its impact.

Autonomous cars at various levels, including fully automated ones and those equipped with advanced driver assist systems (ADAS), should have robust and efficient algorithms to avoid vehicle-pedestrian crashes as much as possible, either by initiating the required driving actions or by giving the drivers extra information to be aware of their surroundings. Both autonomous and connected vehicle technologies should be able to determine whether a pedestrian is crossing the road in the path of the vehicle, in order to have enough response time to issue the required alerts for the driver or to trigger a safety braking action.

The communication channels in connected vehicle technology, either between the vehicle and its surrounding infrastructure or with other nearby vehicles, can enhance the process of sensing pedestrians. This can be achieved by providing the sensing information as a service to other vehicles [2]. Pedestrian data can be detected by a leading vehicle and shared with the vehicles behind it, which reduces the processing time in those vehicles and therefore leaves more time for reaction.

The field of vision-based pedestrian detection has been very active and is rich with methods and algorithms [3]. During the last decade, deep learning techniques achieved breakthroughs in applications and performance. Graphics Processing Units (GPUs) played a significant role in this breakthrough by enabling fast processing of big data and the training of large CNN models. Deep learning based pedestrian detectors provide accurate detection even with the large variation in human appearance caused by clothes and body shapes.

Even with such advances in pedestrian detectors, avoiding vehicle-to-pedestrian crashes is still a challenging task, such as in cases where a pedestrian decides to cross the road suddenly. In such cases the human driver and the autonomous driver have a shorter time to initiate the required response.

Pedestrian detection is a critical step in any pedestrian-safety algorithm, but it is only the first step toward safer pedestrian-vehicle interaction. Vehicles should have the ability to analyze and track the activities of pedestrians along video frames in order to determine the required actions
to reduce the risk of crashing. Providing the driver or the auto-driver with information related to pedestrian behavior on the road can significantly increase pedestrian safety. Signals like the intention to cross or the detected pedestrian awareness can be part of the decision-making inputs used to perform a smooth maneuver that prevents an accident or reduces its impact. Predicting a pedestrian's road crossing one second before the actual action provides extra distance for the vehicle's automatic response or the driver's response. A couple of seconds of prediction of a pedestrian's intention could be critical to avoiding crashes or reducing the chance of injury requiring hospitalization.

Interpretation of pedestrian actions and movements on the curb could reveal whether or not the pedestrian intends to cross the street. Actions like bending the upper part of the body, heading toward the street or making eye contact with the driver give a stronger indication of the pedestrian's intention to cross the road. All these signs can be essential inputs when designing assistive and autonomous driving systems that are suitable for urban environments. A proper estimation of the pedestrian's path, based on the pedestrian's pose and speed, provides the vehicle with an accurate estimate of the probability of a crash with the pedestrian. Another significant source of information for the prediction process is the environment around the vehicle, such as the distance between the pedestrian and the vehicle and the presence of crossing signs and pedestrian crosswalks.

The proposed approach in this work builds on the ideas of a previous work [4], presenting an enhanced CNN model for body landmark detection in addition to a detector of pedestrian intention to cross the street; the detector is based on detecting sudden changes of the pedestrian orientation toward the street. The novel contributions of this research work can be summarized as follows:
• Developing a CNN model for detecting human body landmarks with a higher accuracy than our previous work.
• Increasing the dataset size of pedestrians labeled with the shoulder, neck and nose landmarks proposed in our previous work.
• Developing a street crossing intention detector based on detecting a sudden pedestrian orientation change toward the road. The orientation detection is based on our previous depth module, which translates the detected landmarks into 3D space where the body orientation is estimated.

The rest of the paper is organized as follows. Section II presents the related work, then the system overview is described in detail in Section III. Section IV describes the obtained results with some discussion and analysis. Finally, Sections V and VI conclude with the ongoing research and future plans.

II. RELATED WORK
Pedestrian behavior analysis includes detecting one or more signs like the pedestrian body orientation, head orientation and pedestrian focus. Body orientation is taken with respect to a certain reference, mostly the camera. In addition to the pedestrian safety functionality of autonomous vehicles, social robots are one of the main applications requiring this kind of information in order to build advanced path planning algorithms; other fields that can make use of such information include surveillance, for behavior and interaction analysis.

Many techniques have been widely used for understanding pedestrian behaviors on the road, either by understanding the pedestrian motion or by analyzing the pedestrian behaviors and intentions. Using on-body sensors is one method to capture the pedestrian orientation: Peng and Qian [5] used motion capture devices to estimate the human body orientation, and the work in [6] used external magnetic sensors to estimate the orientation. Such methods work in controlled environments but are not suitable for on-road pedestrians.

The head pose provides good clues about the pedestrian focus and can be used in overall body orientation estimation. Chen et al. [7] proposed an approach that jointly estimates body pose and head pose from surveillance video, taking advantage of the soft couplings between body position (movement direction), body pose, and head pose. The authors in [8] focused on estimating the human head orientation from extremely low-resolution RGB images using non-linear regression with Support Vector Regression (SVR).

Deep learning methods have also been utilized to estimate the body orientation. Choi [9] used a convolutional neural network for estimating human body orientation; the model classifies the input image into one of eight classes covering the 360 degrees. In a previous work [10] we used a combination of the OpenPose implementation [11] and the Lifting from the Deep implementation [12] to estimate the human body orientation. OpenPose was used to detect the 17 human body landmarks defined in the COCO dataset [13]; these points were then passed to the other algorithm to produce the translation into 3D space. We were able to estimate the body orientation from these points by building one vector from the shoulder points and another vector from the hip points. In another work [4], a CNN model was developed to detect only the body landmarks of the shoulders, neck and face; this time the points were translated into 3D space using a depth camera, and the same concept of using vectors to compute the orientation was applied successfully.

In the area of understanding pedestrian behaviors and intention, a pedestrian's intention can be analyzed by tracking their current and previous status. The status might include walking direction, motion speed, position, head orientation and awareness; awareness is highly related to head orientation, eye direction and, for example, being busy using a mobile phone. The head orientation is a very important indication of pedestrian behavior: [14] utilizes human body language to predict behaviors based on head orientation, where stereo camera vision was used for human detection and head pose estimation using a Latent-Dynamic Conditional Random Field model. More research examples of pedestrian intention based on head orientation estimation can be found in [15], [16]; these approaches present methods based on monocular and stereo cameras.

FIGURE 1. System pipeline overview, showing the different modules in the system in addition to the flow of the process and the tracker updating phases.
In [17], the authors also used body language in 3D to perform pedestrian activity and path prediction based on pose estimation. The system uses a LIDAR and a stereo vision camera mounted on a moving vehicle. Kataoka et al. [18] used body pose and gait analysis to recognize pedestrian activities. The authors localized pedestrians using extended CoHOG + AdaBoost, while dense trajectories are used for activity analysis. The classification has four classes: crossing, walking, standing and riding a bicycle. In [19], Kooij et al. used a stereo vision system to extract context information such as the head orientation, the vehicle-pedestrian distance and the spatial layout given by the distance of the pedestrian to the curbside, on top of a Switching Linear Dynamical System, to predict a more accurate path and action for a horizon of one second.

In [20], pedestrian intent prediction is used for risk estimation using clues from pedestrian dynamics and map information based on GPS location. The system is monocular and uses the vision information for trajectory tracking and near-future prediction to issue risk alerts. The pedestrian annotation is done manually, as the detection process is out of the scope of that work.

A considerable amount of effort has been placed on the tasks of pedestrian detection and behavior estimation. Detection models vary between hand-crafted features and deep-learned ones, using different datasets for the different tasks. Building a system that estimates the risk to pedestrians based on their behavior on the road requires combining the tasks of human (pedestrian) detection and walking orientation estimation while keeping real-time performance.

III. METHODOLOGY AND SYSTEM OVERVIEW
The main concept in this approach is to construct a 3D visualization of the human body that gives a clear clue to the body orientation. To achieve that, this method focuses on important landmarks on the pedestrian: the shoulders, the neck and the face. These body landmarks are chosen because they are highly related to the body orientation. The relation between these landmarks and the orientation becomes more obvious when considering how far each point is from the observer plane, the camera in this case. Imagining a line connecting the two shoulder points and centered on the neck point gives a better picture of the concept: finding the normal vector of this line gives the body orientation.

The methodology described so far estimates the pedestrian orientation; detecting the intention to cross the street additionally requires a tracker that keeps tracking the orientation of each detected pedestrian. Having this tracker makes it possible to detect changes of the orientation toward the road for each pedestrian. This change can be understood as an intention to cross the road, and based on this intention, initial actions like slowing down can be taken by the driver or the auto-driver. Slowing down will provide a longer reaction time in case the pedestrian continues crossing the road. A minimal sketch of such a decision rule is given below.
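The excerpt does not state the exact rule or thresholds, so the following is only a sketch under stated assumptions: per-frame yaw angles (in degrees) come from the orientation module, road_bearing_deg is the known bearing of the road relative to the camera, and thresh_deg is an illustrative parameter, not the authors' value.

```python
# Hedged sketch of a crossing-intention rule (assumed, not the authors'
# exact rule): flag a pedestrian whose tracked body orientation turns
# measurably closer to facing the road between consecutive frames.
def angle_to_road(yaw_deg, road_bearing_deg):
    """Absolute angular difference in [0, 180] between yaw and road bearing."""
    return abs((yaw_deg - road_bearing_deg + 180.0) % 360.0 - 180.0)

def crossing_intention(yaw_history, road_bearing_deg, thresh_deg=30.0):
    """yaw_history: per-frame yaw angles (degrees) for one tracked pedestrian."""
    if len(yaw_history) < 2:
        return False
    prev = angle_to_road(yaw_history[-2], road_bearing_deg)
    curr = angle_to_road(yaw_history[-1], road_bearing_deg)
    return prev - curr > thresh_deg  # sudden turn toward the road
```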
The designed system that implements this methodology consists of different modules performing different tasks in order to detect the pedestrian intention to cross the road; Fig. 1 shows the system pipeline. The modules of the system perform the tasks of pedestrian detection, body landmark detection, depth sensing and orientation estimation. The system also has a pedestrian tracker that keeps track of all the information gathered for each pedestrian from the other modules. The following subsections explain each module in more detail.

A. BODY LANDMARKS ESTIMATION
This module consists of two sub-modules: the first is a pedestrian detector, while the second is our trained CNN body landmarks estimator module. The two modules are described next.
1) PEDESTRIANS DETECTOR
Pedestrian detection is the first essential task in the system pipeline. Given an input frame from the camera, this module's task is to detect and localize every pedestrian in the scene, so that each pedestrian is bounded by a bounding box that will be registered or updated in the pedestrian tracking module explained in the following section. Any pedestrian detector could be used here, but considering detection accuracy and robustness, YOLO [21] detectors are used. YOLOv3 [22] and TinyYOLOv3 were tested for resource usage and processing speed; TinyYOLOv3 is a tiny version of YOLOv3 that is much faster but less accurate.

FIGURE 2. Example of the CNN model output; the body landmarks are detected if visible.

YOLO (You Only Look Once) is a single-stage neural network for object detection: bounding boxes and class predictions are generated as the output of processing the input image. Previous methods for object detection, like R-CNN [23] and its variants, perform object detection in multiple steps, extracting around 2000 regions from the image in a process called region proposals and then classifying these regions. This can be slow to run and also hard to optimize, because each individual component must be trained separately. YOLO, on the other hand, performs the detection with a single neural network. Single-stage methods like YOLO [21], [22], [24] and SSD [25] achieve high processing speed, but YOLO outperforms the others as shown in Table 1.
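For illustration, a YOLOv3 detector of this kind can be run through the OpenCV [33] DNN module. The paper does not state its inference setup, so the file names, the 416 × 416 input size and the confidence/NMS thresholds below are assumptions, not the authors' settings.

```python
# Hedged sketch: run a pre-trained YOLOv3 model with OpenCV's DNN module
# and keep only "person" detections (class id 0 in the standard COCO list).
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # placeholder files
layer_names = net.getUnconnectedOutLayersNames()

def detect_pedestrians(frame, conf_thresh=0.5, nms_thresh=0.4):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for output in net.forward(layer_names):
        for det in output:               # det = [cx, cy, bw, bh, objectness, class scores...]
            class_scores = det[5:]
            if np.argmax(class_scores) != 0:   # keep the "person" class only
                continue
            conf = float(det[4] * class_scores[0])
            if conf < conf_thresh:
                continue
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]
```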
FIGURE 3. The relation between the selected body landmarks and the body orientations; an observer can easily conclude the body orientation given a top view of the detected landmarks.

TABLE 1. Performance comparison of neural network algorithms, done by [24].

2) BODY LANDMARKS ESTIMATION MODEL
The proposed neural network uses a CNN model that performs human landmark estimation for the upper body part; this module works only inside the regions of pedestrians detected by the pedestrian detection module. As mentioned before, the points of interest in this work are the two shoulders, the neck and the face. In our context the face keypoint is the same as the nose point in the COCO [13] and MPII [26] part mappings. The human body orientation is highly related to the positions of these points; see Figs. 2 and 3.

All the images of the dataset are resized to match the CNN model input size, which is 75 × 75 pixels, then provided as labeled examples with their respective keypoints for the model to perform the training. This type of training is called supervised learning. The CNN model outputs a vector containing eight values representing the x and y coordinates of the four landmarks. The resulting model output is validated on a separate test dataset to evaluate the training process; this set is called the validation or testing data.

The CNN architecture consists of a sequential structure of different types of layers. Neural network layers learn to extract features by activating certain nodes when a desired feature is found in the layer input; this is achieved by adjusting the layer parameters in the training process using the labeled examples. The model has an input layer, hidden layers and an output layer. The input layer is directly connected to the input image, while each hidden layer's input comes from the input layer or another hidden layer's output, until the output layer is reached. As the name implies, the main layers used in the building blocks of the model are of the convolutional type for feature extraction, followed by max pooling to downsample the feature maps for faster performance and to keep only the dominant features by filtering out the weak ones.

Dropout layers were also implemented to remove redundant nodes. The final output is flattened, and fully connected layers are used to extract the final eight values. The activation function used in the layers is the rectified linear unit (ReLU), which is faster than the sigmoid in the training process. The model consists of six building blocks: five convolutional layers, each followed by a max pooling layer, then a fully connected layer followed by dropout and finally the output layer. The convolutional layers use the same filter size of (3 × 3) but with different counts: 32, 32, 64, 128 and 256, while the fully connected layer has 512 nodes. The total number of trainable parameters in the network is 925,992. Fig. 4 shows the full architecture.
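The layer listing above can be written down in Keras [31] as a sketch. The dropout rate, optimizer and loss below are assumptions; however, with a single-channel 75 × 75 input and 'same' convolution padding, the trainable parameter count of this sketch works out to exactly the reported 925,992, which suggests (but does not confirm) those choices.

```python
# Hedged Keras sketch of the described architecture: five 3x3 convolution
# blocks (32, 32, 64, 128, 256 filters), each followed by 2x2 max pooling,
# then Flatten -> Dense(512) -> Dropout -> Dense(8) for the four (x, y)
# landmark coordinates.
from tensorflow.keras import layers, models

def build_landmark_model(input_shape=(75, 75, 1)):   # grayscale input assumed
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))                   # assumed rate
    model.add(layers.Dense(8))                       # 4 landmarks x (x, y)
    return model

model = build_landmark_model()
model.compile(optimizer="adam", loss="mse")          # assumed optimizer/loss
model.summary()                                      # 925,992 trainable parameters
```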
Horizontal flipping interacts with the left/right labeling of the landmarks: for pedestrians facing toward the camera or facing the other direction, the pedestrian's left and right sides should stay the same from the camera view, while pedestrians showing one side of their body, as when walking from side to side in front of the camera, will have inverted left and right sides. To overcome this issue, scripts that perform image resizing and vertical and horizontal shifting were implemented to create more image variations, in addition to manual labeling of horizontally flipped images. Rotational and automatic horizontal flip augmentation methods were not needed in this work, as they do not cover real examples. The total number of variations made for each image is almost 100, generating 600,000 total samples that were divided into training (80%) and validation (20%) sets. Fig. 5 illustrates some of the applied variations.
The KCF tracker [32] is a variant of correlation filters. Correlation-based filters consider two samples a match if they have a high correlation value, and KCF uses this idea for object tracking: it finds the correlation between the tracked object in the current frame and other patches in the next frame, and the highest correlation value indicates in which direction the tracked object has moved. The KCF tracker is not robust to significant changes in object appearance. The OpenCV [33] implementation of the KCF tracker is used.
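A minimal usage sketch of the OpenCV KCF tracker follows (available through the opencv-contrib-python package); the frame source and the single-pedestrian wiring are assumptions about how the module could be hooked up, not the authors' code.

```python
# Hedged sketch: one OpenCV KCF tracker following a single pedestrian,
# initialized from a detector bounding box on the first frame.
import cv2

def track_pedestrian(video_path, bbox):
    """bbox: (x, y, w, h) tuple from the pedestrian detector on frame 1."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return
    tracker = cv2.TrackerKCF_create()
    tracker.init(frame, bbox)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, bbox = tracker.update(frame)   # found is False once the track is lost
        yield found, bbox
    cap.release()
```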
To keep correct labeling and pedestrian information assignments, a centroid tracker is implemented. The centroid tracker's inputs are the tracked pedestrian objects from the KCF tracker, where the bounding box of each detected pedestrian is updated in every frame. As mentioned before, each detected pedestrian is registered as a trackable object and given a unique ID, which is maintained by the centroid tracker together with all the other gathered information. The centroid tracking approach, as shown in Fig. 6, uses the Euclidean distance between the centroids of the already registered tracked objects and the centroids of new objects in a subsequent frame of the video.

FIGURE 6. Blue points represent the object centroids in the previous frame, while the red points represent the centroids of the detected objects in the current frame. The Euclidean distance is measured for each centroid, and the closest centroid in the new frame is given the same ID as the object in the previous frame.
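A minimal sketch of the matching step in Fig. 6 follows, greedily assigning each registered centroid to the nearest new centroid; the greedy order and the data layout are assumptions.

```python
# Hedged sketch of centroid matching: every previously registered centroid
# keeps its ID by claiming the closest remaining new centroid.
import numpy as np

def match_centroids(prev, new):
    """prev: dict id -> (x, y); new: list of (x, y). Returns dict id -> new index."""
    ids = list(prev.keys())
    d = np.linalg.norm(np.array([prev[i] for i in ids])[:, None, :]
                       - np.array(new)[None, :, :], axis=2)  # pairwise distances
    assignment = {}
    for _ in range(min(len(ids), len(new))):
        r, c = np.unravel_index(np.argmin(d), d.shape)  # closest remaining pair
        assignment[ids[r]] = c
        d[r, :] = np.inf                                # consume this row
        d[:, c] = np.inf                                # and this column
    return assignment
```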
The detected landmarks, translated into 3D space, are the input to the orientation computation. The vector creation depends on point visibility; we assume that the neck is always visible while one of the other points might not be. The vector is always constructed from the left shoulder to the right shoulder through the neck point, and the normal of this vector in the direction of the face point is assumed to be the human body orientation.
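A minimal sketch of that computation follows, assuming 3D landmark points from the depth module with x pointing right and z pointing away from the camera; the axis convention and the yaw reference are assumptions.

```python
# Hedged sketch: build the left-shoulder -> right-shoulder vector projected
# onto the ground (x-z) plane and take the normal that points toward the
# face/nose point as the body orientation.
import numpy as np

def body_orientation(l_shoulder, r_shoulder, nose, neck):
    """Inputs are 3D points (x, y, z) in meters; returns yaw in degrees."""
    p_l, p_r = np.asarray(l_shoulder, float), np.asarray(r_shoulder, float)
    v = (p_r - p_l)[[0, 2]]                    # shoulder vector in the x-z plane
    n = np.array([-v[1], v[0]])                # one of the two plane normals
    to_face = (np.asarray(nose, float) - np.asarray(neck, float))[[0, 2]]
    if np.dot(n, to_face) < 0:                 # pick the normal toward the face
        n = -n
    return np.degrees(np.arctan2(n[1], n[0]))  # yaw relative to the camera x axis
```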
A stereo vision setup can estimate the distance of a certain point using the two images taken by the two cameras. The cameras are separated by a known distance (b) called the baseline. The difference in the viewpoints of the same scene between the two cameras provides extra information enabling the generation of a depth map. The depth map is usually in grayscale format and shows the distance between the camera and the objects in the scene. The extra information is what is called the disparity: the horizontal shift that can be observed between the left camera image and the right camera image, which can be found at the pixel level; see Fig. 7. For the disparity to contain only a horizontal shift, a perfect alignment of the cameras is assumed, so that each pixel row matches in both images; this alignment is guaranteed by the mounting of the separate cameras or by the packaging of the stereo camera manufacturer, otherwise an alignment pre-processing step is required.
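The excerpt does not write out the triangulation step, but for a rectified stereo pair the standard relation between disparity and depth is:

```latex
% Depth from disparity for a rectified stereo pair:
% f = focal length in pixels, b = baseline, d = disparity in pixels.
Z = \frac{f\,b}{d}
```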
IV. RESULTS
The following section discusses the output results of each module and then analyzes the overall system performance.
FIGURE 14. Examples of tracking a pedestrian walking parallel to the car path and then crossing the road in front of a vehicle; a green bounding box indicates a safe case, while red indicates that driver attention or action is required.
TABLE 3. Orientation estimation classes precision.

TABLE 4. Pedestrian actions confusion matrix.

… CNN model with single pedestrian images as input, and 70.6% compared to [38]. Table 3 shows the precision for each class, with an average of 81.76%. Fig. 13 shows examples of the module output.

D. OVERALL PERFORMANCE
The purpose of this system is to detect pedestrian intention to cross the road in front of a vehicle; Fig. 14 illustrates an example. The system evaluation is done on 20 video sequences. The videos were filmed with the ZED camera, outdoors in sunny and cloudy weather and indoors with proper lighting. The evaluated events were manually extracted into testing sequences of 5-10 seconds that include a pedestrian walking on the road side (40 events), a pedestrian crossing the road (25 events) and pedestrians walking then crossing the road (20 events). Table 4 shows the confusion matrix for the classification, with an average accuracy of 87% and an average precision of 86.74% over the classes. Different approaches in the literature adopt different classification and result verification methods: the work in [39], which classifies crossing vs. not-crossing pedestrian actions, reached 70% accuracy using CNN-extracted features, and this accuracy was increased to 88% using OpenPose-extracted features and an SVM/Random Forest classifier. In [40], the classified actions describing the pedestrian are standing, starting, stopping and walking; that work achieved an overall accuracy of 85%.
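For illustration, the quoted averages can be computed from a 3-class confusion matrix as below. The row totals follow the stated event counts (40, 25, 20), but the individual cell values are placeholders, since Table 4 itself is not reproduced in this excerpt.

```python
# Sketch of accuracy/precision over a 3-class confusion matrix (rows = true
# class, columns = predicted class). Cell values are placeholders, NOT the
# paper's Table 4; only the row totals match the stated event counts.
import numpy as np

def overall_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    accuracy = cm.diagonal().sum() / cm.sum()    # fraction of correctly classified events
    precision = cm.diagonal() / cm.sum(axis=0)   # per predicted class
    return accuracy, precision.mean()

cm = [[36, 3, 1],    # walking on the road side (40 events, placeholder split)
      [1, 22, 2],    # crossing the road (25 events, placeholder split)
      [1, 2, 17]]    # walking then crossing (20 events, placeholder split)
accuracy, avg_precision = overall_metrics(cm)
```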
V. CONCLUSION
Despite the huge efforts made by governments and vehicle manufacturers to increase vehicle safety, U.S. pedestrian fatalities have increased in the last few years. Vehicles are equipped with ever more advanced safety modules, crash avoidance technologies and pedestrian detection systems, and even with systems to minimize the effect of crashes and reduce injuries, such as active hoods and windshield airbags. Many experts are optimistic about the advances in the world of autonomous vehicles; they count on them to reduce pedestrian fatalities by eliminating human driver errors.

Even with the advances in computer vision algorithms, especially the great performance brought by the deep learning breakthrough and the impressive results of pedestrian detectors, pedestrian detection is still a challenging task, and an advanced detector might fail to detect pedestrians in some situations. Pedestrians have a variety of physical shapes, heights, widths and clothes, and they appear in different environments, backgrounds and weather conditions. This makes predicting human behavior, such as the intention to cross the road, an even more complex and challenging task, but at the same time a very promising technique for avoiding crashes and reducing pedestrian fatalities.

Pedestrians who intend to cross the road and get into the path of the vehicle are more critical to the driver than those who walk on the curb without intending to cross. A one-second prediction of a pedestrian crossing the road ahead of a car driving at a typical urban speed of 50 km/h can provide a distance of 13.8 meters for a vehicle's automatic response or a driver's response; this margin could be even longer if a slowing-down action is taken before the pedestrian starts crossing the road. A couple of seconds of prediction of a pedestrian's intention could be critical to avoiding crashes or reducing the chance of injury requiring hospitalization. Pedestrians base their decision to cross the road on how fast and how far away the coming vehicles are, but these decisions might be wrong due to misestimation, and here the driver and auto-driver roles come in.
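The quoted 13.8 m follows directly from the speed conversion (50/3.6 ≈ 13.89 m/s, truncated to one decimal in the text):

```latex
% One second of warning at a typical urban speed:
v = 50~\mathrm{km/h} = \frac{50\,000~\mathrm{m}}{3600~\mathrm{s}} \approx 13.9~\mathrm{m/s},
\qquad d = v\,t \approx 13.9~\mathrm{m} \quad \text{for } t = 1~\mathrm{s}.
```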
Recognizing pedestrian behavior from the driver's view relies on different actions and signs from the pedestrian. Such actions and signs can be related to head movement when looking at the road sides as a sign of waiting for the right moment to cross; other signs are related to leg movements and body bending toward the street, which are a clear indication of starting a walking action. Further signs, like low traffic density on the opposite lane, can encourage a pedestrian to cross the road. Looking at these signs, not all of them are easily implemented in computerized algorithms; this work focuses on implementing the behavior of bending toward the road as a computer vision technique.

This paper has described a vision-based approach for detecting pedestrians' intention to cross the road. The approach uses a combination of deep learning techniques and depth sensing to build a 3D understanding of the pedestrian orientation in reference to the camera view. A very important assumption held here is that the walking orientation is the same as the body orientation.

Deep learning models were used to extract important body landmarks that are highly related to the body orientation. The method is based on two deep learning components: a publicly available CNN pedestrian detector model, YOLO, and another CNN model developed and trained by the authors. The training process included dataset collection, dataset labeling and image augmentation to increase the number of labeled training examples.

The CNN model achieved a high validation accuracy of 94% in estimating the body landmarks for the pedestrians detected by YOLO. Moving to the orientation performance, the model achieves high accuracy for the main orientations (0, 90, 180 and 270 degrees) but lower accuracy for other orientations. The lower accuracy and mispredictions of
the orientation might come from different sources, such as the model itself or the ZED camera. The model accuracy can be enhanced using a richer pedestrian dataset to achieve a more generalized model, while the depth sensing might be enhanced using a higher resolution video format, more processing power, or even by replacing the sensor with a better depth sensor. Other solutions might include LIDARs and sensor fusion techniques.

Checking the final system output, which classifies pedestrians into walking, crossing and intention to cross: the system achieves high accuracy in detecting crossing pedestrians, since this class directly depends on the pedestrian detector and the predefined region of interest. The walking pedestrian class also achieves high accuracy, but errors occur when the orientation estimation module produces false positive classifications into the third class of crossing intention. The overall system performance might also be affected by the accuracy of the depth information provided by the ZED camera.

In summary, this system addresses traffic safety, and pedestrian safety in particular, in the following manner:
• Providing the vehicle driver with extra awareness of the pedestrian behavior in front of the vehicle.
• Pedestrians who suddenly appear in front of the vehicle are harder to avoid due to the short reaction time window; this system can help the driver and the auto-driver react faster by slowing down once crossing intention is detected.
• The system also performs the task of pedestrian detection in the vehicle path, allowing a higher chance of crash avoidance.

VI. FUTURE WORK
This research work opens the door to many ideas and enhancements, such as enhancing the system by including an advanced depth sensor. Using an advanced depth sensor with higher resolution cameras will require more processing power, which can be resolved by adding multiple GPUs to distribute the heavy computations required by the CNN models and the stereo vision algorithm. Another idea worth investigating is using the same pipeline with a LIDAR depth map instead of a stereo vision camera, to avoid the stereo vision computations. The issue here is the resolution provided by the LIDAR, which is lower than that of the ZED camera.

REFERENCES
[1] R. Retting, Pedestrian Traffic Fatalities by State: 2019 Preliminary Data. Washington, DC, USA: Governors Highway Safety Association, 2020.
[2] A. A. Alkheir, M. Aloqaily, and H. T. Mouftah, "Connected and autonomous electric vehicles (CAEVs)," IT Prof., vol. 20, no. 6, pp. 54-61, Nov. 2018.
[3] D. Gerónimo and A. M. López, Vision-Based Pedestrian Protection Systems for Intelligent Vehicles. Springer, 2014.
[4] K. Abughalieh and S. Alawneh, "Pedestrian orientation estimation using CNN and depth camera," in Proc. SAE Tech. Paper Ser., Apr. 2020, pp. 1-9.
[5] B. Peng and G. Qian, "Binocular dance pose recognition and body orientation estimation via multilinear analysis," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2008, pp. 1-8.
[6] A. M. Sabatini, "Estimating three-dimensional orientation of human body parts by inertial/magnetic sensing," Sensors, vol. 11, no. 2, pp. 1489-1525, 2011.
[7] C. Chen, A. Heili, and J.-M. Odobez, "A joint estimation of head and body orientation cues in surveillance video," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCV Workshops), Nov. 2011, pp. 860-867.
[8] J. Chen, J. Wu, K. Richter, J. Konrad, and P. Ishwar, "Estimating head pose orientation using extremely low resolution images," in Proc. IEEE Southwest Symp. Image Anal. Interpretation (SSIAI), Mar. 2016, pp. 65-68.
[9] J. Choi, B.-J. Lee, and B.-T. Zhang, "Human body orientation estimation using convolutional neural network," 2016, arXiv:1609.01984. [Online]. Available: http://arxiv.org/abs/1609.01984
[10] K. Abughalieh and S. Alawneh, "Real time 2D pose estimation for pedestrian path estimation using GPU computing," in Proc. SAE Tech. Paper Ser., Apr. 2019, pp. 1-5.
[11] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," 2018, arXiv:1812.08008. [Online]. Available: http://arxiv.org/abs/1812.08008
[12] D. Tome, C. Russell, and L. Agapito, "Lifting from the deep: Convolutional 3D pose estimation from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2500-2509.
[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV). Zürich, Switzerland: Springer, 2014, pp. 740-755.
[14] A. T. Schulz and R. Stiefelhagen, "Pedestrian intention recognition using latent-dynamic conditional random fields," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2015, pp. 622-627.
[15] F. Flohr, M. Dumitru-Guzu, J. F. P. Kooij, and D. M. Gavrila, "A probabilistic framework for joint pedestrian head and body orientation estimation," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 4, pp. 1872-1882, Aug. 2015.
[16] E. Rehder, H. Kloeden, and C. Stiller, "Head detection and orientation estimation for pedestrian safety," in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2014, pp. 2292-2297.
[17] R. Quintero, I. Parra, D. F. Llorca, and M. A. Sotelo, "Pedestrian path prediction based on body language and action classification," in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2014, pp. 679-684.
[18] H. Kataoka, Y. Aoki, Y. Satoh, S. Oikawa, and Y. Matsui, "Fine-grained walking activity recognition via driving recorder dataset," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 620-625.
[19] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, "Context-based pedestrian path prediction," in Proc. Eur. Conf. Comput. Vis. (ECCV). Zürich, Switzerland: Springer, 2014, pp. 618-633.
[20] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, "Trajectory analysis and prediction for improved pedestrian safety: Integrated framework and evaluations," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2015, pp. 330-335.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779-788.
[22] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263-7271.
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV). Amsterdam, The Netherlands: Springer, 2016, pp. 21-37.
[26] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3202-3212.
[27] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at far distance," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 789-792.
[28] A. Dominguez-Sanchez, M. Cazorla, and S. Orts-Escolano, "Pedestrian movement direction recognition using convolutional neural networks," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 12, pp. 3540-3548, Dec. 2017.
[29] Stereolabs Inc. (Mar. 2020). Stereolabs ZED Camera. [Online]. Available: https://www.stereolabs.com/zed/
[30] (Mar. 2020). Labelbox. [Online]. Available: https://labelbox.com
[31] F. Chollet. (Mar. 2020). Keras. [Online]. Available: https://keras.io
[32] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583-596, Mar. 2015.
[33] G. Bradski, "The OpenCV library," Dr. Dobb's Journal of Software Tools, 2000. [Online]. Available: https://www.drdobbs.com/open-source/the-opencv-library/184404319
[34] L. Matthies, "Dynamic stereo vision," Ph.D. dissertation, Dept. Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, 1989.
[35] Nvidia, "Nvidia CUDA C programming guide," Nvidia Corp., vol. 120, no. 18, p. 8, 2011.
[36] L. E. Ortiz, V. E. Cabrera, and L. M. G. Goncalves, "Depth data error modeling of the ZED 3D vision sensor from Stereolabs," ELCVIA Electron. Lett. Comput. Vis. Image Anal., vol. 17, no. 1, p. 1, 2018.
[37] K. Kumamoto and K. Yamada, "CNN-based pedestrian orientation estimation from a single image," in Proc. 4th IAPR Asian Conf. Pattern Recognit. (ACPR), Nov. 2017, pp. 13-18.
[38] K. Hara, R. Vemulapalli, and R. Chellappa, "Designing deep convolutional neural networks for continuous object orientation estimation," 2017, arXiv:1702.01499. [Online]. Available: http://arxiv.org/abs/1702.01499
[39] Z. Fang and A. M. Lopez, "Is the pedestrian going to cross? Answering by 2D pose estimation," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 1271-1276.
[40] R. Quintero, I. Parra, D. F. Llorca, and M. A. Sotelo, "Pedestrian intention and pose prediction through dynamical models and behaviour classification," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 83-88.

SHADI G. ALAWNEH (Senior Member, IEEE) received the B.Eng. degree in computer engineering from the Jordan University of Science and Technology, Irbid, Jordan, in 2008, and the M.Eng. and Ph.D. degrees in computer engineering from the Memorial University of Newfoundland, St. John's, NL, Canada, in 2010 and 2014, respectively. Then, he was a Staff Software Developer with the Hardware Acceleration Lab, IBM, Canada, from May 2014 to August 2014. After that, he was a Research Engineer with C-CORE, from 2014 to 2016, and became an Adjunct Professor with the Department of Electrical and Computer Engineering, Memorial University of Newfoundland, in 2016. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, Oakland University. He has authored or coauthored scientific publications, including international peer-reviewed journals and conference papers. His research interests include parallel and distributed computing, general purpose GPU computing, parallel processing architecture and its applications, autonomous driving, numerical simulation and modeling, and software design and optimization. He is a Senior Member of the IEEE Computer Society.