Two-Stream Multi-Channel Convolutional Neural Network (TM-CNN) For Multi-Lane Traffic Speed Prediction Considering Traffic Volume Impact
Abstract: Traffic speed prediction is a critically important component of intelligent transportation systems (ITS). Recently, with the rapid development of deep learning and transportation data science, a growing body of new traffic speed prediction models has been developed, achieving high accuracy and large-scale prediction. However, existing studies have two major limitations. First, they predict aggregated traffic speed rather than lane-level traffic speed; second, most studies ignore the impact of other traffic flow parameters on speed prediction. To address these issues, we propose a two-stream multi-channel convolutional neural network (TM-CNN) model for multi-lane traffic speed prediction that considers the impact of traffic volume. In this model, we first introduce a new data conversion method that converts raw traffic speed data and volume data into spatial-temporal multi-channel matrices. We then carefully design a two-stream deep neural network to effectively learn the features of individual lanes, the correlations between lanes across the spatial-temporal dimensions, and the correlations between speed and volume. Accordingly, a new loss function that considers the volume impact in speed prediction is developed. A case study using one-year data validates the TM-CNN model and demonstrates its superiority. This paper contributes to two research areas: (1) traffic speed prediction, and (2) multi-lane traffic flow study.
difference between different lanes. In some studies, this is due to the unavailability of lane-level traffic data; in others, where lane-level data are available, the speeds are still often aggregated to reduce model complexity. However, research has long revealed that traffic flows on different lanes show different yet correlated patterns [31–39]. For instance, Daganzo et al. studied a “reverse lambda” pattern in their work [37]. This pattern shows as consistently high flows on freeway median lanes, but it has not been reported for the shoulder lanes. It has also been observed that, for both two-lane and three-lane freeway segments, there are certain volume-density distributions for individual lanes [34]. With the increasing need for lane-based traffic operations such as carpool lane tolling and reversible lane control in modern transportation systems, this issue can no longer be ignored.

The second limitation is that most existing studies ignore other traffic flow parameters in speed prediction tasks. In traffic flow theory, there are correlations among traffic flow speed, volume, and occupancy [40]. Without the integration of volume or occupancy into speed prediction, the hidden traffic flow patterns may not be fully captured and learned, which can lead to reduced prediction accuracy [41]. An intuitive example: in free-flow conditions, a larger-volume traffic stream tends to be more sensitive to perturbations than a smaller-volume traffic stream, so the speed of the larger-volume stream is more likely to decrease in a future time interval. However, without volume or occupancy data, it is hard to model such hidden traffic flow patterns.

To address these challenges, we propose a two-stream multi-channel convolutional neural network (TM-CNN) for multi-lane traffic speed prediction that considers the impact of traffic volume. In the proposed model, we develop a data conversion method to convert both the multi-lane speed data and the multi-lane volume data into multi-channel spatial-temporal matrices. We design a CNN architecture with two streams, where one takes the multi-channel speed matrix as input and the other takes the multi-channel volume matrix as input. A fusion method is further implemented for the two streams. Specifically, convolutional layers learn from the two matrices to capture traffic flow features in three dimensions: the spatial dimension, the temporal dimension, and the lane dimension. Then, the output tensors of the two streams are flattened and concatenated into one speed-volume vector, which is learned by the fully connected (FC) layers. Accordingly, a new loss function is devised that considers the volume impact in the speed prediction task.

The proposed TM-CNN model is validated using one-year loop detector data from a major freeway in the Seattle area. The comprehensive comparisons and analyses demonstrate the strength and effectiveness of our model. This paper contributes to two transportation research areas. First, it contributes to the traffic speed prediction area by adding a new deep neural network model to the existing literature. Second, it pushes the boundary of knowledge in the multi-lane traffic flow study area by developing a method for learning and predicting the speeds of multi-lane traffic. In summary, the contribution of this paper is fourfold:

(1) We introduce a new data conversion method to convert the multi-lane traffic speed data and volume data into spatial-temporal multi-channel matrices. The converted data matrices are organized as the inputs to the deep neural network.
(2) We design a two-stream CNN architecture for multi-lane traffic speed prediction. The convolutional layers extract the correlations between lanes and the spatial-temporal features in the multi-channel data matrices. The architecture also concatenates the outputs of the two convolutional-layer streams and learns a speed-volume feature vector.
(3) We propose a new loss function for the deep learning model. It is the sum of a speed term and a weighted volume term. By appropriately setting the weight, the volume term improves the learning ability of the model and helps prevent overfitting.
(4) Traditional studies on multi-lane traffic flow mostly focus on the mathematical modeling and behavioral description of multi-lane traffic. This study is among the first efforts to apply deep learning methods to multi-lane traffic pattern mining and prediction.

2. Methodology

2.1 Modeling multi-lane traffic as multi-channel matrices

The first step of our methodology is modeling the multi-lane traffic flow as multi-channel matrices. We propose a data conversion method to convert the raw data into spatial-temporal multi-channel matrices, in which traffic on every individual lane is added to the matrices as a separate channel. This modeling idea comes from the CNN's strength in capturing features in multi-channel RGB images. In RGB images, each color channel has correlations with, yet differences from, the other two. This is similar to traffic flows on different lanes, where correlations and differences both exist [32, 37]. Thus, averaging traffic flow parameters at a certain milepost and timestamp is like taking a weighted average of the RGB values to obtain a grayscale value. In this sense, previous methods for traffic speed prediction are designed for “grayscale images” (spatial-temporal prediction of averaged speed) or even just a single image column (speed prediction for an individual location). In this study, the proposed model handles lane-level traffic information by formulating the data inputs as “RGB images.”

In this paper, loop detector data are used because loop detectors collect different types of traffic flow data on individual lanes. That being said, although the loop detector is a relatively traditional traffic sensor, it provides lane-level traffic speed, volume, and occupancy data, which many other detectors do not [42–44]. For example, probe vehicle data are widely used nowadays, but beyond a small sample of traffic speeds and trajectories, most probe sources cannot provide lane-level data or volume data.

The data conversion method is illustrated in Figure 1. Loop detectors are installed at k different mileposts along the study segment, and the past n time steps are considered in the prediction task. We denote the number of lanes as c. Without loss of generality, the number of lanes is assumed to be three in Figure 1 for the sake of illustration. Single-lane traffic would be represented by two $k \times n$ spatial-temporal 2D matrices, one for speed and one for volume. We denote them as $I_u$ for speed and $I_q$ for volume.
We define the speed value and volume value to be $u_{ilt}$ and $q_{ilt}$, respectively, for a detector at milepost $i$ ($i = 1, 2, \dots, k$) and lane $l$ ($l = 1, 2, \dots, c$) at time $t$ ($t = 1, 2, \dots, n$). Note that each $u_{ilt}$ or $q_{ilt}$ is normalized to between 0 and 1 using min-max normalization, since speed and volume have different value ranges. Hence, for the speed and volume matrices of size $k \times n \times c$, we construct the matrix entries using Eq. (1) and Eq. (2), given milepost $i$ and time $t$:

$$I_u(i, t) = (u_{i1t}, u_{i2t}, \dots, u_{ict}) \qquad (1)$$

$$I_q(i, t) = (q_{i1t}, q_{i2t}, \dots, q_{ict}) \qquad (2)$$

In the three-lane example in Figure 1, the spatial-temporal matrices have three channels. Mathematically, the spatial-temporal multi-channel matrices for traffic speed ($X_u$) and volume ($X_q$) can be denoted as

$$X_u = \begin{bmatrix} I_u(1,1) & I_u(1,2) & \cdots & I_u(1,n) \\ I_u(2,1) & I_u(2,2) & \cdots & I_u(2,n) \\ \vdots & \vdots & & \vdots \\ I_u(k,1) & I_u(k,2) & \cdots & I_u(k,n) \end{bmatrix} \qquad (3)$$

$$X_q = \begin{bmatrix} I_q(1,1) & I_q(1,2) & \cdots & I_q(1,n) \\ I_q(2,1) & I_q(2,2) & \cdots & I_q(2,n) \\ \vdots & \vdots & & \vdots \\ I_q(k,1) & I_q(k,2) & \cdots & I_q(k,n) \end{bmatrix} \qquad (4)$$
Fig. 1 The data input modeling process of converting the multi-lane traffic flow raw data into the multi-channel spatial-temporal matrices
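To make the conversion concrete, the following NumPy sketch builds the multi-channel matrices $X_u$ and $X_q$ from raw lane-level readings; the function name, the assumed input arrangement (mileposts × lanes × time steps), and the normalization bounds (speed limit and capacity) are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def build_multichannel_matrices(u_raw, q_raw, u_max, q_max, u_min=0.0, q_min=0.0):
    """Convert raw lane-level speed/volume readings into the multi-channel
    spatial-temporal matrices X_u and X_q of Eqs. (1)-(4).

    u_raw, q_raw : arrays of shape (k, c, n) -- mileposts x lanes x time steps
    u_max, q_max : normalization bounds (e.g., speed limit and capacity)
    Returns arrays of shape (k, n, c), i.e., one channel per lane.
    """
    # Min-max normalization so speed and volume share the [0, 1] range.
    u = (u_raw - u_min) / (u_max - u_min)
    q = (q_raw - q_min) / (q_max - q_min)

    # Reorder axes so each lane becomes a channel: (k, c, n) -> (k, n, c).
    X_u = np.transpose(u, (0, 2, 1))
    X_q = np.transpose(q, (0, 2, 1))
    return X_u, X_q

# Example: k = 10 mileposts, c = 3 lanes, n = 24 five-minute time steps.
u_raw = np.random.uniform(0, 65, size=(10, 3, 24))   # mph
q_raw = np.random.uniform(0, 200, size=(10, 3, 24))  # vehicles per 5 min
X_u, X_q = build_multichannel_matrices(u_raw, q_raw, u_max=65.0, q_max=200.0)
print(X_u.shape, X_q.shape)  # (10, 24, 3) each
```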
2.2 Convolution for feature extraction

The CNN has demonstrated promising performance in image classification and many other applications due to its locally connected layers and its better ability, compared with other neural networks, to capture local features. In transportation, the traffic stream, as well as disturbances to the traffic stream, moves along both the spatial axis and the temporal axis. Thus, applying a CNN to the spatial-temporal traffic image makes it possible to capture local features in both the spatial and temporal dimensions. The fundamental operation in the feature extraction process of a CNN is convolution. With the re-organized input as a multi-channel matrix $X$ ($X$ could be $X_u$ or $X_q$), the basic unit of a convolution operation is shown in Figure 2. On the leftmost side of the figure is the input spatial-temporal matrix, or image, $X$. Every channel of the input matrix is a 2D spatial-temporal matrix representing the traffic flow pattern on the corresponding lane. At the top of the leftmost column, channel #1 displays the traffic pattern of lane #1, and at the bottom, the pattern of lane #c is presented. The symbol “*” denotes the convolution operation in Figure 2. Since our input is a multi-channel image, the convolution filters are also multi-channel. In the figure, a $3 \times 3 \times c$ filter is drawn, while the size of the filter can be changed in practice. The values inside the cells of a filter are weights of the CNN, which are automatically adjusted during the training process. The final weights are able to extract the most salient features in the multi-channel image. The convolution operation outputs a feature map for each channel, and these are summed up to form the extracted feature map of this convolution filter in the current convolutional layer. With multiple filters operating on the same input image, a multi-channel feature map is constructed, which serves as the input to the next layer.
Fig. 2 The convolution operation to extract features from the multi-channel spatial-temporal traffic flow matrices
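The summation-over-channels behavior described above is what a standard multi-channel 2D convolution performs. The sketch below illustrates it with PyTorch's nn.Conv2d, treating lanes as input channels; the framework choice, the 16 filters, the ReLU activation, and the padding are illustrative assumptions, not details specified by the paper.

```python
import torch
import torch.nn as nn

# A minimal sketch of the convolution step in Section 2.2 / Figure 2, assuming
# PyTorch's channel-first layout: (batch, channels = lanes, mileposts, time steps).
k, n, c = 10, 24, 3                      # mileposts, time steps, lanes
X_u = torch.rand(1, c, k, n)             # one multi-channel speed "image"

# Each 3x3xc filter slides over the spatial-temporal plane; the per-channel
# responses are summed into one feature map, and 16 filters give a 16-channel
# output that feeds the next layer. Padding keeps the k x n size unchanged.
conv = nn.Conv2d(in_channels=c, out_channels=16, kernel_size=3, padding=1)
feature_maps = torch.relu(conv(X_u))
print(feature_maps.shape)                # torch.Size([1, 16, 10, 24])
```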
2.3 The TM-CNN for speed prediction

In order to learn the multi-lane traffic flow patterns and predict traffic speeds, a CNN structure is designed (see Figure 3). Compared to a standard CNN, the proposed architecture is modified in the following aspects. (1) The network inputs are different: the input image is a spatial-temporal image built from traffic sensor data, and it has multiple channels which represent the lanes of a corridor. Moreover, the range of the pixel values differs from that of a normal image. For a normal image, it is 0 to 255; here it ranges from 0 to either the highest speed (often the speed limit) or the highest volume (often the capacity). (2) The neural network has two streams of convolutional layers, which process the speeds and the volumes respectively, whereas most CNNs have only one stream of convolutional layers. The purpose of having two streams is to integrate both speed information and volume information into the model so that the network can learn the traffic patterns better than by learning speed alone. To combine the two streams, a fusion operation that flattens and concatenates the outputs of the two streams is implemented between the convolutional layers and the FC layers. The fusion operation is chosen to be concatenation instead of addition or multiplication because concatenation gives more flexibility to modify each stream's structure. In other words, the concatenation fusion method allows the two streams of convolutional layers to have different structures. (3) The extracted features have unique meanings and differ from those in image classification or most other tasks. The extracted features here are relations among road segments, time series, and adjacent lanes, and between traffic flow speeds and volumes. (4) The output is different, i.e., our output is a vector of traffic speeds at multiple locations at a future time rather than a single category label or bounding-box coordinates. The output itself can be part of the input for a subsequent prediction, which is not the case for most other CNNs. (5) Different from most CNNs, our CNN does not have a pooling layer. The main reason for not inserting pooling layers between convolutional layers is that our input images are much smaller than regular images for image classification or object detection [45, 46]. Regular input images to a CNN usually have hundreds of columns and rows, while the spatial-temporal images for roadway traffic are not that large. In this research and many existing traffic prediction studies, the time resolution of the data is five minutes, which means that even using two-hour data for prediction there are only 24 time steps. Thus, we do not risk losing information by pooling. (6) The loss function is devised to contain both speed and volume information. For traditional image classification CNNs, the loss function is the cross-entropy loss, and for traffic speed prediction tasks, the loss function is commonly the mean squared error (MSE) of the speed values only. However, in this research, we add a new term to the loss function to incorporate the volume information. We denote the ground-truth speed vector and volume vector as $Y_u$ and $Y_q$, and the predicted speed vector and volume vector as $\hat{Y}_u$ and $\hat{Y}_q$. Note that $Y_u$, $Y_q$, $\hat{Y}_u$, and $\hat{Y}_q$ are all normalized between 0 and 1. The loss function $L$ is defined in Eq. (5) by summing up the MSEs of speed and volume. The volume term $\lambda \|\hat{Y}_q - Y_q\|_2^2$ is added to the loss function to reduce the probability of overfitting by helping the model better understand the essential traffic patterns. This design improves the speed prediction accuracy on the test dataset with proper settings of $\lambda$. Our suggested value of $\lambda$ is between 0 and 1, considering that the volume term, which deals with overfitting, should still have a lower impact than the speed term on speed prediction problems.

$$L = \|\hat{Y}_u - Y_u\|_2^2 + \lambda \|\hat{Y}_q - Y_q\|_2^2 \qquad (5)$$
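Eq. (5) can be implemented directly as a weighted sum of two MSE terms. The following PyTorch sketch is one way to do so; the function name and the default value λ = 0.5 are illustrative assumptions (the paper only suggests 0 < λ < 1).

```python
import torch
import torch.nn.functional as F

def tm_cnn_loss(y_u_hat, y_u, y_q_hat, y_q, lam=0.5):
    """Eq. (5): MSE of speed plus a lambda-weighted MSE of volume.
    All tensors are assumed to be min-max normalized to [0, 1].
    lam = 0.5 is only a placeholder; the paper suggests 0 < lambda < 1."""
    speed_term = F.mse_loss(y_u_hat, y_u)
    volume_term = F.mse_loss(y_q_hat, y_q)
    return speed_term + lam * volume_term
```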
Fig. 3 The proposed two-stream multi-channel convolutional neural network (TM-CNN) architecture
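A minimal PyTorch sketch of the two-stream architecture described in this section is given below. The number of convolutional layers, the filter counts, the FC width, the output dimension, and the auxiliary volume output head (implied by the $\hat{Y}_q$ term in Eq. (5)) are illustrative assumptions; the configuration actually used in the paper is the one shown in Figure 3.

```python
import torch
import torch.nn as nn

class TMCNN(nn.Module):
    """A minimal sketch of the TM-CNN: two convolutional streams (speed and
    volume), concatenation fusion, and FC layers that output both a speed
    vector and a volume vector. Layer counts and widths are illustrative."""

    def __init__(self, k, n, c, out_dim):
        super().__init__()
        def stream():
            # No pooling layers, per aspect (5) in Section 2.3.
            return nn.Sequential(
                nn.Conv2d(c, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
        self.speed_stream = stream()
        self.volume_stream = stream()
        fused = 2 * 32 * k * n                      # flattened and concatenated
        self.fc = nn.Sequential(nn.Linear(fused, 256), nn.ReLU())
        self.speed_head = nn.Linear(256, out_dim)   # predicted Y_u_hat
        self.volume_head = nn.Linear(256, out_dim)  # predicted Y_q_hat (for Eq. 5)

    def forward(self, x_u, x_q):
        h_u = self.speed_stream(x_u).flatten(1)
        h_q = self.volume_stream(x_q).flatten(1)
        h = self.fc(torch.cat([h_u, h_q], dim=1))   # concatenation fusion
        return self.speed_head(h), self.volume_head(h)

# Usage with the illustrative shapes from the earlier sketches
# (k = 10 mileposts, n = 24 time steps, c = 3 lanes, lane-level speed outputs):
model = TMCNN(k=10, n=24, c=3, out_dim=10 * 3)
y_u_hat, y_q_hat = model(torch.rand(8, 3, 10, 24), torch.rand(8, 3, 10, 24))
print(y_u_hat.shape)                                # torch.Size([8, 30])
```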
Fig. 7 The predicted speeds and ground truths at milepost 166.4 for all lanes in 24 hours