
Received March 29, 2020, accepted April 10, 2020, date of publication April 13, 2020, date of current version April 30, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.2987777

Predicting Pedestrian Intention to Cross the Road


KARAM M. ABUGHALIEH AND SHADI G. ALAWNEH, (Senior Member, IEEE)
Department of Electrical and Computer Engineering, Oakland University, Rochester, MI 48309, USA
Corresponding author: Shadi G. Alawneh ([email protected])

ABSTRACT The goal of this research is the development of a driver-assistance feature that can warn the driver when a pedestrian is at potential risk due to a sudden intention to cross the road. The crossing process is defined as a change of the pedestrian's orientation on the curb toward the road. We built a Convolutional Neural Network (CNN) model combined with a depth-sensing camera to estimate the pedestrian's orientation and distance from the vehicle. The model detects the upper-body keypoints in 2D space, while the depth information makes it possible to translate the points into 3D space. This information is tracked per pedestrian, and any change in the pedestrian's movement pattern toward the road is translated into a warning for the driver. The CNN model is trained end-to-end using different datasets presenting pedestrians in different configurations and scenes.

INDEX TERMS ADAS, GPU, CNN.

The associate editor coordinating the review of this manuscript and approving it for publication was Moayad Aloqaily.

I. INTRODUCTION
One of the main tasks for assistive and autonomous driving systems is to assure traffic safety for drivers and pedestrians by reducing the human error that leads to crashes with other vehicles, road infrastructure and pedestrians. Pedestrian injuries in traffic accidents have high lethality due to the vulnerability of the pedestrians. According to the Governors Highway Safety Association (GHSA) preliminary report for 2019 [1], 6590 pedestrians were killed in motor vehicle accidents, an increase of almost 300 deaths over the number reported in 2018.

Governments have spared no effort to make roads a safer place to use, by crafting better road regulations and constructing road infrastructure. On the other hand, tech companies and researchers are trying hard to make vehicles safer for both pedestrians and drivers by using advanced technologies that help the driver avoid crashes and, if a crash does happen, reduce its impact.

Autonomous cars at various levels, including fully automated ones and those equipped with advanced driver assist systems (ADAS), should have robust and efficient algorithms to avoid vehicle-pedestrian crashes as much as possible, either by initiating the required driving actions or by giving drivers extra information to be aware of their surroundings. Both autonomous and connected vehicle technologies should have the ability to determine whether a pedestrian is crossing the road in the path of the vehicle, in order to have enough response time to issue the required alerts to the driver or trigger a safety braking action.

The presence of communication channels in connected vehicle technology, either between the vehicle and its surrounding infrastructure or between nearby vehicles, can enhance the process of sensing pedestrians. This can be achieved by providing the sensing information as a service to other vehicles [2]: pedestrian data can be detected by a leading vehicle and shared with the vehicle behind it, which reduces the processing time in those vehicles and therefore leaves more time for reaction.

The field of vision-based pedestrian detection has been very active and is rich with methods and algorithms [3]. During the last decade, deep learning techniques produced breakthroughs in applications and performance. Graphics Processing Units (GPUs) played a significant role in this breakthrough by enabling fast processing of big data and training of large CNN models. Deep learning based pedestrian detectors provide accurate detection even with the large variation in human appearance caused by clothes and body shapes.

Even with such advances in pedestrian detectors, avoiding vehicle-to-pedestrian crashes is still a challenging task, for example in cases where a pedestrian decides to cross the road suddenly. In such cases the human driver and the autonomous driver have a shorter time to initiate the required response.

Pedestrian detection is a critical step in any pedestrian-safety algorithm, but it is only the first step toward a safer pedestrian-vehicle interaction. The vehicle should have the ability to analyze and track the activities of pedestrians across video frames in order to determine the actions required to reduce the risk of crashing.


Providing the driver or the auto-driver with information related to pedestrian behavior on the road can significantly increase pedestrian safety. Activities like intention to cross, or detected pedestrian awareness, can be part of the decision-making inputs used to perform a smooth maneuver that prevents an accident or reduces its impact. Predicting that a pedestrian will cross the road one second prior to the actual action can provide extra distance for a vehicle's automatic response or a driver's response. A couple of seconds of prediction for a pedestrian's intention could be critical to avoiding crashes or reducing the chance of an injury requiring hospitalization.

Interpretation of pedestrian actions and movements on the curb can reveal whether the pedestrian intends to cross the street. Actions like bending the upper part of the body, heading toward the street, or making eye contact with the driver give a strong indication of the pedestrian's intention to cross the road. All these signs can be essential parts of designing assistive and autonomous driving systems that are more suitable for urban environments. A proper estimation of the pedestrian path, based on the pedestrian's pose and speed, provides the vehicle with an accurate estimate of the probability of a crash with the pedestrian. Another significant source of information for the prediction process is the environment around the vehicle, such as the distance between the pedestrian and the vehicle and the presence of a crossing sign or a pedestrian crosswalk.

The proposed approach in this work builds on the ideas from a previous work [4], presenting an enhanced CNN model for body landmark detection in addition to a detector of the pedestrian's intention to cross the street; the detector is based on detecting sudden changes of pedestrian orientation toward the street. The novel contributions of this research work can be summarized as follows:
• Developing a CNN model for detecting human body landmarks with a higher accuracy than our previous work.
• Increasing the dataset size of pedestrians labeled with the shoulder, neck and nose landmarks proposed in our previous work.
• Developing a street-crossing intention detector based on detecting sudden pedestrian orientation change toward the road. The orientation detection builds on our previous depth module, which translates the detected landmarks into 3D space where the body orientation is estimated.
The rest of the paper is organized as follows. Section II presents the related work, then the system overview is described in detail in Section III. Section IV describes the obtained results with some discussion and analysis. Finally, Sections V and VI conclude with the ongoing research and future plans.

II. RELATED WORK
Pedestrian behavior analysis includes detecting one or more signs like pedestrian body orientation, head orientation, and pedestrian focus. Body orientation is taken with respect to a certain reference, mostly the camera. Besides the pedestrian-safety functionality of autonomous vehicles, social robots are one of the main applications requiring this kind of information in order to build advanced path planning algorithms; another field that can make use of such information is surveillance, for behavior and interaction analysis.

Many techniques have been widely used for understanding pedestrian behavior on the road, either by understanding the pedestrian motion or by analyzing the pedestrian's behaviors and intentions. Using on-body sensors is one method of capturing pedestrian orientation: Peng and Qian [5] used motion capture devices to estimate the human body orientation, and the work in [6] used external magnetic sensors to estimate the orientation. Such methods work in controlled environments but are not suitable for on-road pedestrians.

The head pose provides good clues about the pedestrian's focus and can be used in overall body orientation estimation. Chen et al. [7] proposed an approach that jointly estimates body pose and head pose from surveillance video, taking advantage of the soft couplings between body position (movement direction), body pose, and head pose. The authors in [8] focused on estimating the human head orientation from extremely low-resolution RGB images using non-linear regression, namely Support Vector Regression (SVR).

Utilizing deep learning methods to estimate the body orientation, Choi [9] used a convolutional neural network for estimating human body orientation; the model classifies the input image into one of eight classes covering the full 360 degrees. In a previous work [10] we used a combination of the OpenPose implementation [11] and the Lifting from the Deep implementation [12] to estimate human body orientation. OpenPose was used to detect the 17 human body landmarks defined in the COCO dataset [13]; these points were then passed to the second algorithm to produce the translation into 3D space. We were able to estimate the body orientation from these points by building one vector from the shoulder points and another vector from the hip points. In another work [4], a CNN model was developed to detect the body landmarks of the shoulders, neck and face only; this time the points were translated into 3D space using a depth camera, and the same concept of using vectors to compute the orientation was applied successfully.

In the area of understanding pedestrian behaviors and intention, a pedestrian's intention can be analyzed by tracking their current and previous status; the status might include walking direction, motion speed, position, head orientation and awareness. Awareness is highly related to head orientation, eye direction and, for example, being busy using a mobile phone. The head orientation is a very important indication of pedestrian behavior: [14] utilizes human body language to predict behaviors based on head orientation, where a stereo vision camera was used for human detection and head pose estimation using a Latent-Dynamic Conditional Random Field model. More research examples of pedestrian intention estimation based on head orientation can be found


FIGURE 1. System pipeline overview, showing the different modules in the system in addition to the flow of the process and tracker updating phases.

in [15], [16]; these approaches present methods based on monocular and stereo cameras.

In [17], the authors also used body language in 3D to perform pedestrian activity and path prediction based on pose estimation. The system uses a LIDAR and a stereo vision camera mounted on a moving vehicle. Kataoka et al. [18] used body pose and gait analysis to recognize pedestrian activities; the authors localized pedestrians using extended CoHOG + AdaBoost, while dense trajectories are used for activity analysis. The classification has four classes: crossing, walking, standing and riding a bicycle. In [19], Kooij et al. used a stereo vision system to extract context information such as the head orientation, the vehicle-pedestrian distance and the spatial layout (the pedestrian's distance to the curbside), on top of a Switching Linear Dynamical System, to predict more accurate paths and actions over a one-second horizon. In [20], pedestrian intent prediction is used for risk estimation using clues from pedestrian dynamics and map information based on GPS location. The system is monocular and uses the vision information for trajectory tracking and near-future prediction to issue risk alerts; the pedestrian annotation is done manually, as the detection process is out of the scope of that work.

A decent amount of effort has been placed on the task of pedestrian detection and behavior estimation. Detection models vary between hand-crafted features and deep-learned ones, using different datasets for the different tasks. Building a system that estimates the risk to pedestrians based on their behavior on the road requires combining the tasks of human (pedestrian) detection and walking orientation estimation while still performing in real time.

III. METHODOLOGY AND SYSTEM OVERVIEW
The main concept in this approach is to construct a 3D visualization of the human body that gives a clear clue to the body orientation. To achieve that, this method focuses on important landmarks on the pedestrian: the shoulders, the neck and the face. These body landmarks are chosen because they are highly related to the body orientation. The relation between these landmarks and the orientation becomes more obvious when considering how far each point is from the observer plane (the camera in this case). Imagining a line connecting the two shoulder points and centered on the neck point gives a better picture of the concept: finding the normal vector of this line gives the body orientation.
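As a concrete illustration of this shoulder-line idea, the minimal Python sketch below computes a facing direction from two shoulder points; the coordinate convention (x to the image right, z as depth, anatomical left/right shoulders) is an assumption for illustration, not the exact implementation used in this work.

```python
import numpy as np

# Minimal sketch of the shoulder-line concept: drop the height axis, keep
# the (x, z) ground-plane coordinates, then rotate the left-to-right
# shoulder line by 90 degrees to obtain its normal, i.e. the facing
# direction. The coordinate convention (x to the image right, z = depth)
# is an assumption for illustration.
def facing_direction(left_shoulder, right_shoulder):
    ls = np.asarray(left_shoulder, dtype=float)    # (x, y, z)
    rs = np.asarray(right_shoulder, dtype=float)
    vx, vz = rs[0] - ls[0], rs[2] - ls[2]          # left -> right shoulder, top view
    normal = np.array([-vz, vx])                   # 90-degree rotation of the line
    return normal / np.linalg.norm(normal)         # unit facing vector in (x, z)
```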
The methodology described so far estimates the pedestrian orientation; detecting the intention to cross the street also requires a tracker that keeps tracking the detected orientation of each pedestrian. Having this tracker makes it possible to detect changes of orientation toward the road for each pedestrian. Such a change can be understood as an intention to cross the road, and based on this intention, initial actions like slowing down can be taken by the driver or the auto-driver. Slowing down provides a longer reaction time in case the pedestrian continues crossing the road.

The designed system that implements this methodology consists of different modules performing different tasks in order to detect the pedestrian's intention to cross the road; Fig. 1 shows the system pipeline. The modules perform the tasks of pedestrian detection, body landmark detection, depth sensing and orientation estimation. The system also has a pedestrian tracker that keeps track of all the information gathered for each pedestrian from the other modules. The following subsections explain each module in more detail.

A. BODY LANDMARKS ESTIMATION
This module consists of two sub-modules: the first is a pedestrian detector, while the second is our trained CNN body landmarks estimator.

The two modules work together in sequence to extract the pedestrians' body landmarks; the pedestrian regions extracted from the input image by the first module are passed to the landmarks estimator, which works only inside the pedestrian detection regions.

1) PEDESTRIANS DETECTOR
Pedestrian detection is the first essential task in the system pipeline. Given an input frame from the camera, this module's task is to detect and localize every pedestrian in the scene, so that each pedestrian is bounded by a boundary box that will be registered or updated in the pedestrian tracking module, explained in the following section. Any pedestrian detector could be used here, but considering detection accuracy and robustness, YOLO [21] detectors are used. YOLOv3 [22] and TinyYOLOv3 were tested for resource usage and processing speed; TinyYOLOv3 is a smaller version of YOLOv3 that is much faster but less accurate.

YOLO (You Only Look Once) is a single-stage neural network for object detection: boundary boxes and class predictions are generated as the output of processing the input image. Previous methods for object detection, like R-CNN [23] and its variations, perform object detection in multiple steps: extract about 2000 regions from the image in a process called region proposals, then classify these regions. This can be slow to run and also hard to optimize, because each individual component must be trained separately. On the other side, YOLO performs the detection with a single neural network. Single-stage methods like YOLO [21], [22], [24] and SSD [25] achieve high processing speed, but YOLO outperforms the others, as shown in Table 1.

TABLE 1. Performance comparison of neural network algorithms, as reported by [24].
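For illustration, a detector of this kind can be run through OpenCV's DNN module. The sketch below assumes the publicly released YOLOv3 configuration and weight files (hypothetical local file names) and keeps only the COCO "person" class; it is a sketch of the approach, not the authors' exact setup.

```python
import cv2
import numpy as np

# File names are placeholders for the official YOLOv3 release files.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_pedestrians(frame, conf_thresh=0.5, nms_thresh=0.4):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for output in net.forward(layer_names):
        for det in output:                   # det = [cx, cy, bw, bh, obj, class scores...]
            class_scores = det[5:]
            if np.argmax(class_scores) != 0:  # class 0 = "person" in COCO
                continue
            conf = float(class_scores[0])
            if conf < conf_thresh:
                continue
            cx, cy = det[0] * w, det[1] * h   # YOLO outputs normalized centers/sizes
            bw, bh = det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]
```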
2) BODY LANDMARKS ESTIMATION MODEL
The proposed neural network uses a CNN model that performs human landmark estimation for the upper body part; this module works only inside the regions of pedestrians detected by the pedestrian detection module. As mentioned before, the points of interest in this work are the two shoulders, the neck and the face. In our context, the face keypoint is the same as the nose point in the COCO [13] and MPII [26] part mappings. The human body orientation is highly related to the positions of these points; see Figs. 2 and 3.

FIGURE 2. Example of the CNN model output; the body landmarks are detected if visible.

FIGURE 3. The relation between the selected body landmarks and the body orientations; an observer can easily conclude the body orientation given a top view of the detected landmarks.

All the images of the dataset are resized to match the CNN model input size, which is 75 × 75 pixels, then provided as labeled examples with their respective keypoints for the model training. This type of training is called supervised learning. The CNN model outputs a vector containing eight values representing the x and y coordinates of the four landmarks. The resulting model is validated on a held-out dataset to evaluate the training process; this set is called the validation or testing data.

The CNN architecture consists of a sequential structure of different types of layers. Neural network layers learn to extract features by activating certain nodes when a desired feature is found in the layer input; this is achieved by adjusting the layer parameters during training using the labeled examples. The model has an input layer, hidden layers and an output layer: the input layer is directly connected to the input image, while each hidden layer's input comes from the input layer or another hidden layer's output, until the output layer is reached. As the name implies, the main layers used in the building blocks of the model are of the convolutional type for feature extraction, followed by max pooling to downsample the feature maps for faster performance and to keep only the dominant features by filtering out the weak ones.

Dropout layers were also implemented to remove redundant nodes. The final output is flattened, and fully connected layers are used to extract the final eight values. The activation function used in the layers is the rectified linear unit (ReLU), which is faster than the sigmoid in the training process.


The model consists of six building blocks: five convolutional layers followed by max pooling layers, then a fully connected layer followed by dropout, and finally the output layer. The convolution layers all use the same filter size of (3 × 3) but with different counts: 32, 32, 64, 128 and 256, while the fully connected layer has 512 nodes. The total number of trainable parameters in the network is 925,992. Fig. 4 shows the full architecture.

FIGURE 4. The architecture of the CNN model.
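A Keras sketch of this architecture follows. The padding mode and dropout rate are not specified in the text and are assumed here, so the trainable-parameter count of this sketch will not reproduce the 925,992 figure exactly.

```python
from tensorflow.keras import layers, models

# Sketch of the described architecture: five 3x3 convolution blocks
# (32, 32, 64, 128, 256 filters), each followed by 2x2 max pooling, then a
# 512-node fully connected layer, dropout, and an 8-value regression output.
def build_landmark_model(input_shape=(75, 75, 3)):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                activation="relu"))   # padding is assumed
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))                     # rate is assumed
    model.add(layers.Dense(8))                         # (x, y) for RS, LS, NK, NS
    return model
```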
An important activity in building the body landmarks estimation model is preparing the training and validation dataset. The collected dataset is a combination of different datasets publicly available online for different tasks, like pedestrian detection images [27] and walking direction detection [28], in addition to self-collected pedestrian images captured with the ZED camera [29]. The images contain the pedestrian only, with no other objects and as minimal a background as possible. For the images self-collected with the ZED camera, a Python script was written to automate the process of extracting pedestrian images and saving them as separate files. The process includes pedestrian detection using YOLOv3, cropping each detection and saving it as a new image file.
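A minimal sketch of such an extraction script is shown below, reusing the detect_pedestrians() helper sketched in the pedestrians detector subsection; the video path and file-naming scheme are illustrative assumptions.

```python
import cv2

# Sketch of the dataset-extraction script: detect pedestrians per frame,
# crop each detection, and save it as a separate image file.
cap = cv2.VideoCapture("zed_left_channel.avi")   # placeholder path
count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for (x, y, w, h) in detect_pedestrians(frame):
        crop = frame[max(y, 0):y + h, max(x, 0):x + w]
        if crop.size:
            cv2.imwrite(f"pedestrian_{count:06d}.png", crop)
            count += 1
cap.release()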


In order to get the dataset ready for the training process, labeling information is required. The dataset was labeled manually using the Labelbox online tool [30]. The labeling objects are defined as the four desired points: Right Shoulder (RS), Left Shoulder (LS), Neck (NK) and Nose (NS). The tool records the x and y coordinates of each selected point. The total number of collected sample images is 20000, while the total number of labeled samples is 6000 images. In deep learning, a larger dataset is always better for achieving higher accuracy. To achieve that with the small labeled dataset, a well-known practice called image augmentation was used. Image augmentation increases the number of training and validation samples by generating new samples, applying different operations like horizontal and vertical flips, rotations, brightness changes, added noise, and horizontal and vertical shifts. Tools like Keras [31] can do image augmentation automatically.

In this work, the automated tools are not suitable for image augmentation, since extra caution is required to fix the labels when a flip is performed. Consider flipping the images around the vertical axis: for images of a pedestrian facing the camera or facing the other direction, the pedestrian's left and right sides should stay the same from the camera view, while pedestrians showing one side of their bodies, as in walking from side to side in front of the camera, will have inverted left and right sides. To overcome this issue, a script that performs image resizing and vertical and horizontal shifting was implemented to create more image variations, in addition to manual labeling of horizontally flipped images. Rotational and horizontal-flip augmentation methods were otherwise not used in this work, as they do not cover real examples. The total number of variations made for each image is almost 100, generating 600000 total samples that are divided into training (80%) and validation (20%) sets. Fig. 5 illustrates some of the applied variations.

FIGURE 5. Sample of image variations applied on the base image in the top left corner.
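The label-aware part of such a script can be sketched as follows; the shift range is an assumption for illustration, since the text only states that resizing and vertical/horizontal shifts are applied and that roughly 100 variations per image are generated.

```python
import cv2
import numpy as np

# Sketch of label-aware shift augmentation: the 75x75 crop and its four
# (x, y) landmarks must be translated by the same offset.
def shift_sample(image, keypoints, dx, dy):
    h, w = image.shape[:2]
    M = np.float32([[1, 0, dx], [0, 1, dy]])      # translation matrix
    shifted = cv2.warpAffine(image, M, (w, h))    # border filled with zeros
    return shifted, keypoints + np.array([dx, dy], dtype=np.float32)

def make_variations(image, keypoints, n=100, rng=np.random.default_rng(0)):
    for _ in range(n):                            # ~100 variations per image
        dx, dy = rng.integers(-8, 9, size=2)      # assumed shift range
        yield shift_sample(image, keypoints, int(dx), int(dy))
```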
B. PEDESTRIANS TRACKING MODULE
The tracker module works alongside all other modules. Each pedestrian is registered as a trackable object, and the tracker keeps all the information obtained for a pedestrian as long as the pedestrian appears in the scene for a certain number of frames. The tracker does not perform the task of object detection but assumes that the tracked object is provided manually or, as in this work, automatically by a TinyYOLOv3 object detector. The tracker starts working when it receives the detected pedestrians from the pedestrian detector module. After that, the YOLO detector is not activated to update the detections for 25 frames, to avoid unnecessary computations and achieve faster performance. At this stage the tracker keeps tracking the pedestrians in the upcoming frames and updates the boundary-box coordinates that were initially provided by YOLO for each pedestrian.

In order to build a tracker that assigns the right label and information to the same pedestrian in every frame, the designed tracker consists of two components: a centroid tracker and a Kernelized Correlation Filters (KCF) tracker [32].


The KCF is a variant of correlation filters. Correlation-based filters consider two samples a match if they have a high correlation value, and KCF uses this idea for object tracking: it finds the correlation between the tracked object in the current frame and candidate patches in the next frame, and the highest correlation value indicates in which direction the tracked object has moved. The KCF tracker is not robust enough to significant changes in object appearance. The OpenCV [33] implementation is used for the KCF tracker.
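The detect-then-track cycle described above can be sketched as follows. cv2.TrackerKCF_create requires an opencv-contrib build (in recent versions it lives under cv2.legacy), and detect_pedestrians() is the hypothetical YOLO helper sketched earlier.

```python
import cv2

# Sketch of the detect-then-track loop: run the detector once, track with
# KCF for 25 frames, then re-detect to refresh the boxes.
REDETECT_EVERY = 25
cap = cv2.VideoCapture(0)
trackers, frame_idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % REDETECT_EVERY == 0:
        trackers = []
        for box in detect_pedestrians(frame):
            t = cv2.TrackerKCF_create()
            t.init(frame, tuple(box))            # box = (x, y, w, h)
            trackers.append(t)
    boxes = [t.update(frame) for t in trackers]  # [(ok, (x, y, w, h)), ...]
    frame_idx += 1
```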
To keep the labeling and pedestrian information assignments correct, the centroid tracker is implemented. The centroid tracker's inputs are the tracked pedestrian objects from the KCF tracker, where the boundary box of each detected pedestrian is updated in every frame. As mentioned before, each detected pedestrian is registered as a trackable object and given a unique ID, which is maintained with all other information by the centroid tracker. The centroid tracking approach, as shown in Fig. 6, uses the Euclidean distance between the centroids of already registered tracked objects and the centroids of new objects in a subsequent video frame.

FIGURE 6. The blue point represents the object centroid in the previous frame, while the red points represent the centroids of the detected objects in the current frame. The Euclidean distance is measured for each centroid, and the closest centroid in the new frame is given the same ID as the object in the previous frame.

The Euclidean distance is computed in every frame between the previously registered objects and the newly updated centroid locations. Based on this distance analysis, the object IDs are updated by either assigning the same ID to the nearest centroid, giving a new ID if a new object appeared that was not registered previously, or dropping the tracked object's ID if the object is absent from the scene for a certain number of frames.
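A minimal sketch of this matching step is given below; the greedy nearest-neighbor strategy is one reasonable reading of the description, not necessarily the exact implementation.

```python
import numpy as np

# Sketch of centroid matching: each new centroid inherits the ID of the
# nearest previously registered centroid; unmatched detections get new IDs.
def match_ids(registered, detections):
    """registered: dict id -> (cx, cy); detections: list of (cx, cy)."""
    assignments, used = {}, set()
    ids = list(registered)
    for j, det in enumerate(detections):
        dists = [np.hypot(det[0] - registered[i][0], det[1] - registered[i][1])
                 for i in ids]
        order = np.argsort(dists)
        match = next((ids[k] for k in order if ids[k] not in used), None)
        assignments[j] = match          # None means: register a brand-new ID
        if match is not None:
            used.add(match)
    return assignments
```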
C. DEPTH SENSING MODULE
At this point of the system pipeline, the pedestrians are detected with their body landmarks. One more thing to obtain in this module is how far these body landmarks are from the camera plane; this distance is referred to as the depth. To obtain the depth measurements, a stereo vision [34] system is utilized. The depth information is measured for each pedestrian's detected body landmarks to construct a 3D-space representation of the points. These points in the 3D world are used to construct the vectors required for the orientation computation. The creation of the vectors depends on point visibility; we assume that the neck is always visible while one of the other points might not be. The vectors are always constructed from the left shoulder to the right shoulder through the neck point, and the normal of this vector, in the direction of the face point, is assumed to be the human body orientation.

A stereo vision setup can estimate the distance of a certain point using the two images taken by the two cameras. The cameras are separated by a known distance (b) called the baseline. The difference between the viewpoints of the same scene from the two cameras provides extra information enabling the generation of a depth map. The depth map is usually in a grayscale format and shows the distance between the camera and the objects in the scene. The extra information is what is called the disparity: the horizontal shift that can be observed between the left camera image and the right camera image, found at the pixel level; see Fig. 7. For the shift to be purely horizontal, a perfect alignment of the cameras is assumed, so that each pixel row matches in both images; this alignment is guaranteed by the mounting of the separate cameras or by the stereo camera manufacturer's packaging, otherwise an alignment pre-processing step is required.

FIGURE 7. Disparity in stereo images.

The following equations describe the math behind the stereo vision model in Fig. 8. Given the disparity (d = x1 − x2), the focal length of the two identical cameras (f) and the baseline distance separating the cameras (b), the depth (z) can be defined as follows, based on the simple pinhole camera model; note that (z) is a plane-to-plane distance:

    (x1 − x2) / b = f / Z                    (1)

    Z = b · f / (x1 − x2)                    (2)

FIGURE 8. Stereo vision model.
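Equation (2) translates directly into code; the unit choices below (pixels for disparity and focal length, meters for the baseline) are assumptions for illustration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    # Eq. (2): Z = b * f / (x1 - x2). Disparity and focal length in pixels,
    # baseline in meters, so the returned depth Z is in meters.
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / disparity_px
```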
Before obtaining the disparity, stereo matching should be achieved first.


Assuming aligned images, where each row in the left image is aligned to the corresponding row in the right image, stereo matching is the process of finding corresponding pixels in a stereo pair of images, as shown in Fig. 9. After that, the displacement of the pixels with reference to, for example, the left image is found, and the disparity map is obtained; those values can then be used to compute the depth as shown in Fig. 8. The matching process is done row by row; as a starting reference, the same pixel column can be the starting point of the search. The matching is based on a similarity measure that defines the closest candidate to the target pixel.

FIGURE 9. Pixel matching in stereo image views.
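The ZED SDK performs this matching internally on the GPU; purely for illustration, OpenCV's block matcher implements the same row-wise similarity search over a rectified pair (file names are placeholders):

```python
import cv2

# Block matching over a rectified stereo pair; the result is a disparity
# map that eq. (2) converts to depth. StereoBM returns fixed-point values
# with 4 fractional bits, hence the division by 16.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype("float32") / 16.0
```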
For this work the ZED camera by Stereolabs is used. The ZED camera is a stereo vision system that can be used to provide a 3D perception of the world, with a long-range depth perception of up to 20 m [29]. It is suitable for many applications, such as robot navigation, virtual reality, tracking and motion tracking. The depth computation through stereo vision in the ZED camera is accelerated using CUDA [35] GPU computations.

D. ORIENTATION ESTIMATION MODULE
At this stage, the depth information collected by the preceding depth sensing module is used to estimate the orientation of each pedestrian and update the tracker. One way to compute the orientation is by estimating it in the 3D space using 3D vectors; another smarter and much simpler approach is to convert the 3D space into a 2D space by eliminating the height component of the pedestrian, in other words projecting the 3D points onto the floor plane. The resulting 2D plane contains the depth information on the y-axis and the location of the point on the x-axis, the same concept illustrated in Fig. 3 by observing the 3D space from a top view; the concept is illustrated with a two-pedestrian example in Fig. 10. This top view provides the algorithm with a clear conclusion about the pedestrian orientation.

FIGURE 10. Illustration of the idea of converting the 3D-space information into a 2D space; the top-view result is enough to get the pedestrian orientation.

The orientation is computed based on the 2D vectors constructed by connecting the available detected points, in the following fallback order: LS-RS, LS-NK-RS, LS-NK, RS-NK, LS-NS, RS-NS. As noted, the direction is always from the left side to the right side. Then, using the inverse tangent function, the orientation angle is obtained; the final angle is adjusted to be in reference to the camera view, so a pedestrian facing the camera has a 0-degree orientation while a pedestrian oriented toward the right has a 90-degree orientation. To overcome the limitation of the inverse tangent function in dealing with the whole range of angles, atan2 is used in this work, which is defined as follows:

θ = atan2(y, x) =
    arctan(y / x),          x > 0
    π/2 − arctan(x / y),    y > 0
    −π/2 − arctan(x / y),   y < 0
    arctan(y / x) ± π,      x < 0
    undefined,              x = 0 and y = 0        (3)
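A sketch of the module's angle computation is given below. It assumes top-view points as (x, z) pairs with x to the image right and z as depth, anatomical left/right labels, and the fallback pair order quoted above; under those assumptions the 0-degrees-facing-camera convention comes directly out of atan2.

```python
import math

# Sketch of the top-view orientation estimate. pts maps landmark names to
# (x, z) ground-plane coordinates (x to the image right, z = depth); these
# conventions are assumptions for illustration.
def body_orientation_deg(pts):
    # Fallback order from the text; LS-NK-RS reduces to LS-RS when both exist.
    for a, b in (("LS", "RS"), ("LS", "NK"), ("NK", "RS"),
                 ("LS", "NS"), ("NS", "RS")):
        if a in pts and b in pts:
            vx, vz = pts[b][0] - pts[a][0], pts[b][1] - pts[a][1]
            nx, nz = -vz, vx                    # facing normal of the left->right line
            angle = math.degrees(math.atan2(nx, -nz))
            return angle % 360.0                # 0 = facing camera, 90 = facing right
    return None
```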
E. CROSSING INTENTION DETECTION MODULE
This module builds on all the information gathered by the previous modules for each detected pedestrian. It utilizes the collected data to detect a pedestrian's intention to cross the road, in order to improve situation awareness for the driver or the auto-driver in urban environments.


The driver's view is categorized into two regions: the car-path region, and the curb region (everywhere else). Based on these regions, the detected pedestrian cases are categorized into three levels defining the required driver awareness:
• Safe: A pedestrian walking on the curb, with an orientation parallel to the car path and no signs of crossing intention.
• Watch: A pedestrian walking on the curb who has changed their walking orientation toward the car path, a sign of crossing intention.
• Risk: A pedestrian detected in the car-path region; the pedestrian is already crossing.
This task is actually part of the tracker module, which has an overview and a short history of information for each active trackable pedestrian object. Since the tracker keeps a record of each pedestrian's orientation, it is possible to detect any changes in the orientation pattern of those walking on the curb. Once a change in orientation is detected, the tracker decides whether this change is in the direction of the car path or not, based on the pedestrian's detected location relative to the car path. This sudden change of orientation labels the pedestrian for extra attention from the driver.
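The following sketch illustrates this three-level logic; the inputs (a car-path flag, the orientation history, and a flag for whether the turn points toward the road) stand in for the tracker's real bookkeeping, and the 30-degree threshold is an assumption.

```python
SAFE, WATCH, RISK = "Safe", "Watch", "Risk"

# Sketch of the three-level categorization used per tracked pedestrian.
def classify(in_car_path, prev_orientations, current_orientation,
             turned_toward_road, thresh_deg=30.0):
    if in_car_path:
        return RISK                                   # already crossing
    if prev_orientations:
        mean_prev = sum(prev_orientations) / len(prev_orientations)
        change = abs(current_orientation - mean_prev) % 360.0
        change = min(change, 360.0 - change)          # wrap-around distance
        if change > thresh_deg and turned_toward_road:
            return WATCH                              # sudden turn toward the road
    return SAFE
```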

IV. RESULTS
The following section discusses the output results for each module and then analyzes the overall system performance.

A. BODY LANDMARKS ESTIMATION PERFORMANCE
The CNN model is trained using 80% of the examples in the labeled dataset previously prepared for this task; the other 20% is left for validation. In this work the Keras library was used to implement the model and perform the training. The training was conducted on an Intel(R) Core(TM) i7-7700HQ CPU at 2.80 GHz with 16 GB of RAM, equipped with an NVIDIA GeForce GTX 1060. The loss function used is Mean Square Error (MSE), with the ADAM optimizer and a 0.0001 learning rate. The training is performed for 25 epochs with a batch size of 128. The model reached an accuracy of 94% on validation. Fig. 11 shows the training accuracy.

FIGURE 11. Model validation accuracy.
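Put together with the earlier architecture sketch, the reported configuration corresponds to a training call like the one below; build_landmark_model() is the sketch from Section III-A.2, and x_train/y_train/x_val/y_val are hypothetical names for the augmented 75 × 75 crops and their eight-value landmark targets.

```python
from tensorflow.keras.optimizers import Adam

# Reported training configuration: MSE loss, Adam at a 1e-4 learning rate,
# 25 epochs, batch size 128. An accuracy metric is assumed, since the paper
# reports a validation accuracy figure.
model = build_landmark_model()
model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse",
              metrics=["accuracy"])
history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=25, batch_size=128)
```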
An important factor in obtaining high-accuracy body landmark detection is to provide the model with cropped pedestrian detections close to the pattern provided in training, with no large background areas. This issue has been taken care of in the detector module, but it still arises if the pedestrian is spreading their arms, which is not common but worth mentioning; in this case the boundary box becomes larger, includes a lot of background, and results in wrong landmark estimates.

Examples of model output are shown in Fig. 12; more training examples are required to obtain more generalized predictions and avoid mispredictions like the one shown in Fig. 12b.

FIGURE 12. Examples of model output.

FIGURE 13. Examples of orientation module output; the green box is the YOLOv3 pedestrian detection output, the points represent the body landmarks, and the red arrow points in the estimated orientation.


FIGURE 14. Examples of tracking a pedestrian walking parallel to the car path and then crossing the road in front of a vehicle; a green boundary box indicates a safe case, while red indicates that driver attention or action is required.

B. DEPTH SENSING PERFORMANCE
As previously mentioned, the ZED camera computes the depth information using re-projection from the model shown in Fig. 8. The ZED camera SDK provides a depth map and a point cloud for four resolution setups: VGA, HD720, HD1080 and HD2K. The point cloud is computationally more expensive than the depth map, which is why this work depends only on the depth map; on average, processing the depth map is 20%-30% faster than computing the point cloud. For example, the time required to compute the point cloud for a frame at HD720 is 1.9 ms, while 1.7 ms is required to compute the depth map.

Testing the depth measuring accuracy shows that a higher resolution provides higher depth accuracy but requires more computation. The best accuracy is for the HD2K resolution, with an error in measured distance of 20 cm, while for VGA resolution the error might reach 75 cm. As a trade-off between accuracy and computation time, the HD720 resolution is used, which can have errors of up to 35 cm. Regarding the proposed method for orientation estimation, the measurement error will not critically affect the orientation estimation unless the measured regions have different error values. The authors in [36] made a more detailed study on modeling the ZED camera error on a TK1 Nvidia development board; they mainly attributed the error to the hardware and the algorithm.

TABLE 2. Orientation estimation confusion matrix.

C. ORIENTATION ESTIMATION PERFORMANCE
To evaluate the orientation module, the problem is converted into a classification problem by categorizing the angle into 8 categories covering 0-360 degrees. The evaluation is made by testing 50 examples for each category and monitoring the model output. Angles are categorized in intervals of 45 degrees to reduce the size of the confusion matrix and the evaluation effort. Table 2 shows the obtained matrix, with an average accuracy of 81.75%, compared to 82.5% in [37], which used a CNN model with single-pedestrian image input, and 70.6% in [38].
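The 45-degree binning used for this evaluation can be expressed as a one-line mapping; centering the bins on the main directions is an assumption about how the categories were defined.

```python
def angle_to_class(angle_deg):
    # Map an orientation angle to one of eight 45-degree bins, each centered
    # on a main direction (0, 45, ..., 315 degrees).
    return int(((angle_deg % 360.0) + 22.5) // 45.0) % 8
```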


TABLE 3. Orientation estimation classes precision.

Table 3 shows the precision for each class, with an average of 81.76%. Fig. 13 shows examples of the module output.

D. OVERALL PERFORMANCE
The purpose of this system is to detect pedestrian intention to cross the road in front of a vehicle; Fig. 14 illustrates an example. The system evaluation is done on 20 video sequences. The videos were filmed by the ZED camera outdoors in sunny and cloudy weather and indoors with proper lighting. The evaluated events were manually extracted into testing sequences of 5-10 seconds, including pedestrians walking on the road side (40), pedestrians crossing the road (25) and pedestrians walking then crossing the road (20). Table 4 shows the confusion matrix for the classification, with an average accuracy of 87% and an average precision of 86.74% over the classes. Different approaches in the literature adopt different classification and result-verification methods: the work in [39], classifying crossing vs. not-crossing pedestrian actions, reached 70% accuracy using CNN-extracted features, but this accuracy was increased to 88% using OpenPose-extracted features and an SVM/Random Forest classifier. In [40], the classified actions describing the pedestrian are standing, starting, stopping and walking; that work achieved an overall accuracy of 85%.

TABLE 4. Pedestrian actions confusion matrix.

V. CONCLUSION
Despite the huge efforts made by governments and vehicle manufacturers to increase vehicle safety, U.S. pedestrian fatalities have increased in the last few years. Vehicles are equipped with more advanced safety modules, crash avoidance technologies, pedestrian detection systems, and even systems to minimize the effect of crashes and reduce injuries, such as active hoods and windshield airbags. Many experts are optimistic about the advances in the world of autonomous vehicles; they count on them to reduce pedestrian fatalities by eliminating human driver errors.

Even with the advances in computer vision algorithms, especially the great performance of the deep learning breakthrough and the impressive results of pedestrian detectors, pedestrian detection is still a challenging task, and an advanced detector might fail to detect pedestrians in some situations. Pedestrians have a variety of physical shapes, heights, widths and clothes, and they appear in different environments, backgrounds and weather conditions. This makes predicting human behavior, such as the intention to cross the road, an even more complex and challenging task, but at the same time a very promising technique for avoiding crashes and reducing pedestrian fatalities.

Pedestrians who intend to cross the road and get into the path of the vehicle are more critical to the driver than those who walk on the curb without intending to cross. A one-second prediction for a pedestrian crossing the road ahead of a car driving at a typical urban speed of 50 km/h can provide a distance of 13.8 meters for a vehicle's automatic response or a driver's response, and this margin could be even longer if a slowing-down action is taken before the pedestrian starts crossing the road. A couple of seconds of prediction of a pedestrian's intention could be critical to avoiding crashes or reducing the chance of an injury requiring hospitalization. Pedestrians base their decision to cross the road on how fast and how far away the coming vehicles are, but these decisions might be wrong due to misestimation, and here come the driver and auto-driver roles.

Recognizing pedestrian behavior from a driver's view can be done through different actions and signs from the pedestrian. Such actions and signs can be related to head movement when looking at the road sides, as a sign of waiting for the right moment to cross; other signs are related to leg movements and body bending toward the street, which are a clear indication of starting a walking action. Other factors, like low traffic density on the opposite lane, can encourage a pedestrian to cross the road. Looking at these signs, not all of them are easily implemented in computerized algorithms. This work focuses on implementing the behavior of bending toward the road as a computer vision technique.

This paper has described a vision-based approach for detecting pedestrians' intention to cross the road. The approach uses a combination of deep learning techniques and depth sensing to build a 3D understanding of the pedestrian orientation in reference to the camera view. A very important assumption held here is that the walking orientation is the same as the body orientation.

Deep learning models were used to extract important body landmarks that are highly related to the body orientation. The method is based on two deep learning components: a publicly available CNN pedestrian detector model, YOLO, and another CNN model developed and trained by the authors. The training process included dataset collection, dataset labeling and image augmentation to increase the number of labeled training examples.

The CNN model achieved a high validation accuracy of 94% in estimating the body landmarks for the pedestrians detected by YOLO. Moving to the orientation performance, the model achieves high accuracy for the main orientations (0, 90, 180 and 270 degrees) but lower accuracy for other orientations.


The lower accuracy and mispredictions of the orientation might come from different sources, like the model itself or the ZED camera. The model accuracy can be enhanced using a richer pedestrian dataset to achieve a more generalized model, while the depth sensing might be enhanced using a higher-resolution video format, more processing power, or even replacing the sensor with a better depth sensor. Other solutions might include LIDARs and sensor fusion techniques.

Checking the final system output, which classifies pedestrians into walking, crossing and intending to cross: the system achieves high accuracy in detecting crossing pedestrians, since this depends directly on the pedestrian detector and the predefined region of interest. The walking-pedestrian class also achieves high accuracy, but errors coming from the orientation estimation module might produce false-positive classifications into the third class, crossing intention. The overall system performance might also be affected by the accuracy of the depth information provided by the ZED camera.

In summary, this system addresses traffic safety, and pedestrian safety in particular, in the following manner:
• Providing the vehicle driver with extra awareness of pedestrian behavior in front of the vehicle.
• Pedestrians who suddenly appear in front of the vehicle are harder to avoid due to the short reaction-time window; this system can help the driver and the auto-driver react faster by slowing down once a crossing intention is detected.
• The system also performs the task of pedestrian detection in the vehicle path, allowing a higher chance of crash avoidance.

VI. FUTURE WORK
This research work opens the door to many ideas and enhancements, such as enhancing the system with an advanced depth sensor. Using an advanced depth sensor with higher-resolution cameras will require more processing power, which can be provided by adding multiple GPUs to distribute the heavy computations required by the CNN models and the stereo vision algorithm. Another idea worth investigating is using the same pipeline with a LIDAR depth map instead of a stereo vision camera, to avoid the stereo vision computations. The issue here is that the resolution provided by the LIDAR is lower than that of the ZED camera.

REFERENCES
[1] R. Retting, Pedestrian Traffic Fatalities by State: 2019 Preliminary Data. Washington, DC, USA: Governors Highway Safety Association, 2020.
[2] A. A. Alkheir, M. Aloqaily, and H. T. Mouftah, "Connected and autonomous electric vehicles (CAEVs)," IT Prof., vol. 20, no. 6, pp. 54–61, Nov. 2018.
[3] D. Gerónimo and A. M. López, Vision-Based Pedestrian Protection Systems for Intelligent Vehicles. Springer, 2014.
[4] K. Abughalieh and S. Alawneh, "Pedestrian orientation estimation using CNN and depth camera," in Proc. SAE Tech. Paper Ser., Apr. 2020, pp. 1–9.
[5] B. Peng and G. Qian, "Binocular dance pose recognition and body orientation estimation via multilinear analysis," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2008, pp. 1–8.
[6] A. M. Sabatini, "Estimating three-dimensional orientation of human body parts by inertial/magnetic sensing," Sensors, vol. 11, no. 2, pp. 1489–1525, 2011.
[7] C. Chen, A. Heili, and J.-M. Odobez, "A joint estimation of head and body orientation cues in surveillance video," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCV Workshops), Nov. 2011, pp. 860–867.
[8] J. Chen, J. Wu, K. Richter, J. Konrad, and P. Ishwar, "Estimating head pose orientation using extremely low resolution images," in Proc. IEEE Southwest Symp. Image Anal. Interpretation (SSIAI), Mar. 2016, pp. 65–68.
[9] J. Choi, B.-J. Lee, and B.-T. Zhang, "Human body orientation estimation using convolutional neural network," 2016, arXiv:1609.01984. [Online]. Available: http://arxiv.org/abs/1609.01984
[10] K. Abughalieh and S. Alawneh, "Real time 2D pose estimation for pedestrian path estimation using GPU computing," in Proc. SAE Tech. Paper Ser., Apr. 2019, pp. 1–5.
[11] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," 2018, arXiv:1812.08008. [Online]. Available: http://arxiv.org/abs/1812.08008
[12] D. Tome, C. Russell, and L. Agapito, "Lifting from the deep: Convolutional 3D pose estimation from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2500–2509.
[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV). Zürich, Switzerland: Springer, 2014, pp. 740–755.
[14] A. T. Schulz and R. Stiefelhagen, "Pedestrian intention recognition using latent-dynamic conditional random fields," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2015, pp. 622–627.
[15] F. Flohr, M. Dumitru-Guzu, J. F. P. Kooij, and D. M. Gavrila, "A probabilistic framework for joint pedestrian head and body orientation estimation," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 4, pp. 1872–1882, Aug. 2015.
[16] E. Rehder, H. Kloeden, and C. Stiller, "Head detection and orientation estimation for pedestrian safety," in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2014, pp. 2292–2297.
[17] R. Quintero, I. Parra, D. F. Llorca, and M. A. Sotelo, "Pedestrian path prediction based on body language and action classification," in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2014, pp. 679–684.
[18] H. Kataoka, Y. Aoki, Y. Satoh, S. Oikawa, and Y. Matsui, "Fine-grained walking activity recognition via driving recorder dataset," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 620–625.
[19] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, "Context-based pedestrian path prediction," in Proc. Eur. Conf. Comput. Vis. (ECCV). Zürich, Switzerland: Springer, 2014, pp. 618–633.
[20] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, "Trajectory analysis and prediction for improved pedestrian safety: Integrated framework and evaluations," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2015, pp. 330–335.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[22] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV). Amsterdam, The Netherlands: Springer, 2016, pp. 21–37.
[26] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3202–3212.
[27] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at far distance," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 789–792.
[28] A. Dominguez-Sanchez, M. Cazorla, and S. Orts-Escolano, "Pedestrian movement direction recognition using convolutional neural networks," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 12, pp. 3540–3548, Dec. 2017.
[29] Stereolabs Inc. (Mar. 2020). Stereolabs ZED Camera. [Online]. Available: https://www.stereolabs.com/zed/


[30] (Mar. 2020). Labelbox. [Online]. Available: https://labelbox.com
[31] F. Chollet. (Mar. 2020). Keras. [Online]. Available: https://keras.io
[32] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
[33] G. Bradski, "The OpenCV library," Dr. Dobb's Journal of Software Tools, 2000. [Online]. Available: https://www.drdobbs.com/open-source/the-opencv-library/184404319
[34] L. Matthies, "Dynamic stereo vision," Ph.D. dissertation, Dept. Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, 1989.
[35] C. Nvidia, "NVIDIA CUDA C programming guide," Nvidia Corp., vol. 120, no. 18, p. 8, 2011.
[36] L. E. Ortiz, V. E. Cabrera, and L. M. G. Goncalves, "Depth data error modeling of the ZED 3D vision sensor from Stereolabs," ELCVIA Electron. Lett. Comput. Vis. Image Anal., vol. 17, no. 1, p. 1, 2018.
[37] K. Kumamoto and K. Yamada, "CNN-based pedestrian orientation estimation from a single image," in Proc. 4th IAPR Asian Conf. Pattern Recognit. (ACPR), Nov. 2017, pp. 13–18.
[38] K. Hara, R. Vemulapalli, and R. Chellappa, "Designing deep convolutional neural networks for continuous object orientation estimation," 2017, arXiv:1702.01499. [Online]. Available: http://arxiv.org/abs/1702.01499
[39] Z. Fang and A. M. Lopez, "Is the pedestrian going to cross? Answering by 2D pose estimation," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 1271–1276.
[40] R. Quintero, I. Parra, D. F. Llorca, and M. A. Sotelo, "Pedestrian intention and pose prediction through dynamical models and behaviour classification," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 83–88.

KARAM M. ABUGHALIEH received the M.Sc. degree in electrical engineering from Princess Sumaya University for Technology (PSUT), Amman, Jordan, in February 2011. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Oakland University, where he also works as a Teaching and Research Assistant. In his master's thesis, he worked on object detection and tracking for UAV applications. He has gained extensive experience in embedded systems design.

SHADI G. ALAWNEH (Senior Member, IEEE) received the B.Eng. degree in computer engineering from the Jordan University of Science and Technology, Irbid, Jordan, in 2008, and the M.Eng. and Ph.D. degrees in computer engineering from the Memorial University of Newfoundland, St. John's, NL, Canada, in 2010 and 2014, respectively. He was a Staff Software Developer with the Hardware Acceleration Lab, IBM, Canada, from May 2014 to August 2014. After that, he was a Research Engineer with C-CORE, from 2014 to 2016, and became an Adjunct Professor with the Department of Electrical and Computer Engineering, Memorial University of Newfoundland, in 2016. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, Oakland University. He has authored or coauthored scientific publications, including international peer-reviewed journals and conference papers. His research interests include parallel and distributed computing, general-purpose GPU computing, parallel processing architecture and its applications, autonomous driving, numerical simulation and modeling, and software design and optimization. He is a Senior Member of the IEEE Computer Society.
