Abstract—In this paper, we propose a reinforcement learning based traffic signal controller. We use the n-step SARSA algorithm to design a predictive approach to traffic signal control. An optimal policy that effectively minimizes the probability of congestion in the network is arrived at. Our results show that our algorithm outperforms conventional approaches such as the longest queue first technique. The overall approach is designed to be implementable in low-cost real-time systems.

Index Terms—Reinforcement Learning, Simulation of Urban Mobility, SUMO, Artificial Intelligence, Smart city.

I. INTRODUCTION

In the modern world, urbanization has led to high traffic density in cities. This is exacerbated by the poor road infrastructure in developing countries. Fixed signal timings are the norm at most traffic junctions around the world. This technique, based on the round-robin method, cycles the signal configurations in a periodic manner. Though easy to implement, this strategy may lead to increased congestion, as it does not take real-time traffic conditions into account. For example, such a system might give a green signal to an intersection even when no vehicles are waiting. We define the traffic signal control problem as follows: assuming that we have data about the state of traffic at all the intersections in the network, what is the optimal phasing sequence of the signals that achieves the lowest rate and intensity of vehicular congestion?

Typical control-system based traffic controllers try to predict the future traffic pattern and set a traffic signal plan accordingly. However, the caveat to this approach is that the traffic network and the flow have to be modeled. As pointed out in [1] and [2], this is not an easy task to accomplish for a dynamic traffic flow.

The need of the hour is self-learning, adaptive traffic management systems. Simple techniques such as induction loop sensors have been in existence since the 1960s [4]. With modern advancements in Artificial Intelligence, the implementation of highly efficient traffic controllers is now possible. Reinforcement Learning (RL), being a hybrid of supervised and unsupervised learning, is well suited to finding optimal solutions for high-dimensional control problems. The traffic flow in a network depends on multiple parameters such as vehicular density, traffic flow rate, and existing congestion conditions. Hence, there are multiple variables to consider before deciding which traffic signal to activate. The Reinforcement Learning method looks for an optimal solution in a self-explorative manner. This relieves us from the burden of writing elaborate hard-coded algorithms, which by their very nature are not universal across systems. Reinforcement Learning techniques, on the other hand, can adapt to any system.

The Reinforcement Learning research on traffic signal control is replete with Neural Network (NN) based algorithms. While efficient, they are also highly complex and compute-intensive, and implementing them on low-cost real-time embedded systems is difficult. This makes them an unviable option for developing countries. We instead use the function approximation technique, which is less compute-intensive. This aligns with our aim of designing a simple, low-cost traffic controller.

In this paper, we propose, for the first time, an adaptive Reinforcement Learning based traffic signal controller that uses the versatile n-step SARSA algorithm with linear function approximation to estimate Q-values. The pertinent issue with using Reinforcement Learning methods for large-scale traffic networks is that the size of the state-action space increases exponentially with the number of signalized intersections in the network. To counteract this issue, we employ a novel centralized agent that actuates the traffic signals such that the size of the state-action space increases linearly with the number of intersections, rather than exponentially as in conventional methods. This makes our algorithm well suited even for large-scale networks. The centralized traffic control agent used in our algorithm is designed to reduce the queue lengths at the intersections and the vehicle waiting times. Our algorithm is simulated using the multi-modal traffic simulator SUMO (Simulation of Urban MObility) on a real-world road network from Texas, USA. We show that our algorithm performs better when compared with contemporary techniques such as Static Signaling (SS) and Longest Queue First (LQF).

The succeeding sections are organized as follows: Section II describes the research conducted in the field of traffic signal control and reinforcement learning. Section III explains the road network and the simulation setup used in our experiment, and also defines the state space, action space, and reward structure. Section IV details the algorithmic approach used in our work. Section V discusses the results and the performance of the agent, and Section VI summarizes the research conducted.
II. RELATED WORKS

In this section, we delve into the major research works carried out in the field of RL for improving the efficiency of traffic systems.

Representation of the state information plays a major role in the outcome of any RL algorithm. Several representations are found in the literature. For example, a feature vector consisting of thresholded vehicle queue lengths, elapsed waiting time, and the current signal phase is used in [5]. We found during our work that queue length is a poor measure of congestion at intersections. Our conclusion was that the large bumper-to-bumper separation distances of large vehicles, together with the varying lengths of the vehicles, were the reason for this. Instead, vehicle density proved to be a better indicator of crowding at intersections. References [6] and [7] use a state representation called Discrete Traffic State Encoding (DTSE). In this approach, the roads are sectioned into grids, and a particular cell of the grid is marked as ‘1’ if a vehicle is present in that cell and ‘0’ otherwise. A speed matrix of the same dimensions as the grid is regularly updated with the speed of the vehicle in each cell (0 if no vehicle is present). Better performance is achieved by decreasing the size of the cells, albeit at the cost of higher computational power requirements. The downside of this approach is that more complex data acquisition techniques are required, as the speeds of the vehicles have to be regularly updated in addition to the binary matrix denoting whether a vehicle is present or absent. Reference [8] proposes extracting vehicle position information from the image representation by feeding it into a Convolutional Neural Network (CNN). Along with this, the queue length, updated waiting time, phase, and number of vehicles are fed into another NN to estimate the Q-values. Keeping in mind our objective of designing a simple low-cost system, we shy away from the above techniques.

There are several things that can be controlled using traffic signals to influence the traffic flow in the network. Some examples include the phase (the side to open for traffic), the duration of the phase (its timing), and the sequence of the phases. Works such as [9] propose controlling both the phase and its duration, with the phase duration adjustable to a resolution of 1 s. The authors of this paper are of the opinion that though such systems may show promising results on paper, they fail to take into account the behavioral aspects of drivers. For example, although the phase durations are assigned in proportion to the traffic demand from each side, drivers may perceive unfairness when a very long period is assigned to sides other than theirs. This is especially true when designing systems for countries with a considerable number of unsupervised intersections. To counteract this feeling of unfairness, works such as [5] and [8] propose adding vehicle waiting times to the feature vector.

One main advantage of centralized agents is that they inherently take into account the overall traffic condition in the network. This may prove beneficial in the case where multiple intersections are already congested and one leads into the other. In such a situation, opening the wrong intersection may lead to a snowballing effect and even more congestion. Multi-agent Reinforcement Learning (MARL) systems have independent agents working at all intersections in the network, acting in a coordinated fashion to achieve better overall system efficiency. References [5], [10], [11], and [12] describe the use of such MARL-based traffic control. A hybrid of centralized and independent MARL systems is proposed in [13].

The reward structure determines the objective of the RL-based solution. The reward should be formulated in such a way that when the agent takes an action that optimizes an objective metric such as queue length or waiting time, it is given a higher reward. One or more traffic parameters can be used to calculate the reward, and uneven weights can be assigned to the parameters to emphasize the effect of one over the others. During the training phase, the agent learns which actions are appropriate in a given situation. Hence, when it encounters similar situations in the live run, it knows which actions are best. One obvious way to improve a particular traffic parameter (or a weighted sum of them) is to use its difference between consecutive time steps as the reward. This results in a positive reward when there is an improvement in the parameter, and a negative one otherwise. Using cumulative waiting time is common in the literature [6], [7]. Also frequently used is a combination of two parameters, such as waiting time and queue length [5]. Works such as [8] also include the number of passing vehicles and their travel times in the calculation. It should be kept in mind that the reward structure works hand-in-hand with the algorithm, and hence, the choice of one should be tailored to the overall approach.

The RL algorithm has primarily two components: the method used to estimate the Q-values, and the actual algorithm used to arrive at the optimal policy. The most popular method to estimate the Q-values is to use a Neural Network, which can model non-linear effects through appropriate activation functions. Extensive research has been conducted in this domain, with many works applied to the traffic control problem [7], [8], [9]. Works such as [5] use linear function approximation in combination with Q-learning to attain the optimal policy. Works such as [14] use the vanilla table-based Q-learning method; however, a finite state-action space is a prerequisite for applying this method. Most previous works available in the literature use a variant of the Q-learning algorithm [6], [9]. Reference [15] presents a unique Dynamic Programming (DP) based method to find the optimal control plan. This DP-based algorithm uses the values of the neighboring intersections to update the value of a given one, and this inherent property of DP ensures cooperation with the neighbors.

III. EXPERIMENTAL SETTING

Our experiments are conducted on a simulated 4-junction road network from Texas, USA. We obtain the road network using OpenStreetMap and perform our simulations on it using the open-source traffic simulator Simulation of Urban Mobility (SUMO).
Controlled traffic patterns are generated using SUMO's inbuilt traffic generator to test our algorithm.

The full road network used in our method is shown in Fig. 2. The blue-colored intersections are the ones of concern to us; the traffic signals at these intersections are shown in Fig. 3. The 15 signals, numbered N1 to N15, are the signals present across these intersections. Lane area detectors (called E2 detectors in SUMO) are used to sense the vehicle occupancy at the intersections. The percentage occupancy outputted by these detectors comprises the state feature vector of the algorithm. These detectors sense vehicles up to 70 m from the intersections.
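For illustration, this occupancy-based state vector can be read from the E2 detectors through SUMO's TraCI interface. The snippet below is a minimal sketch; the detector IDs are placeholders that depend on how the detectors are declared in the simulation's additional file.

import traci

# Placeholder E2 (lane-area) detector IDs, one per signal N1..N15; the actual IDs
# depend on how the detectors are defined in the SUMO additional file.
DETECTOR_IDS = [f"e2_N{i}" for i in range(1, 16)]

def get_state():
    """Return the 15-element state vector of percentage occupancies (0-100)."""
    return [traci.lanearea.getLastStepOccupancy(det_id) for det_id in DETECTOR_IDS]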
The network consists of roads of one to three lanes, where each lane is a standard 12 ft wide. The main arterial roads are 1.5 km to 2 km in length. The traffic consists of four types of vehicles, with the distribution modeled to represent urban traffic: Bikes (41%), Cars (37.5%), Trucks (12.5%), and Buses (8.3%). These vehicles have different speed and acceleration parameters, as shown in Table I [3].

TABLE I
VEHICLE CHARACTERISTICS

Vehicle Type   Length (m)   Width (m)   Max Speed (m/s)   Max Acceleration (m/s²)
Bike           2.2          0.9         25                6.0
Car            4.3          1.8         27.78             2.9
Truck          7.1          2.4         22.22             1.3
Bus            12.0         2.5         19.44             1.2

Fig. 3 shows a four-way intersection used in our setup, with the traffic signal phasing indicated in green and red. In our phasing scheme, when a particular phase is activated, vehicles on that side can move to any of the other three sides. The green lines in Fig. 1 depict the same. The duration of each phase is fixed at 64 s.

The traffic pattern is generated by SUMO's inbuilt Python-based script, which allows the traffic flow parameters to be configured as required. The generated traffic pattern is set to be similar to a real-world one in both composition and other attributes. For this, we choose a binomial distribution with n = 5 and p = 0.15, with more traffic flowing on the wider roads.
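To give a sense of what these parameters imply, the short sketch below draws per-time-step vehicle insertions from a Binomial(5, 0.15) distribution. It is only an illustration of the assumed arrival process (the 1 s step and the one-hour horizon are choices made here for illustration), not SUMO's generator script itself.

import numpy as np

rng = np.random.default_rng(0)

# Vehicles inserted in each 1 s simulation step, drawn from Binomial(n=5, p=0.15);
# the expected insertion rate is n * p = 0.75 vehicles per second.
insertions_per_step = rng.binomial(n=5, p=0.15, size=3600)  # one hour of traffic
print(insertions_per_step.sum(), "vehicles inserted at a mean rate of",
      round(insertions_per_step.mean(), 2), "veh/s")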
A centralized agent like ours has to actuate all the intersections in the network by itself. As such, if the intersections were operated simultaneously, the action space of the model would consist of all possible signal configurations of all the intersections in the network. For example, for our 4-intersection network, the number of possible actions would be 4 × 4 × 4 × 3 = 192, i.e., the number of actions grows as 4^n (assuming all are 4-way intersections). It is easy to imagine how things could get out of hand for a city-level traffic network. Instead of this naïve approach, we use a cyclic scheme wherein the intersections are operated cyclically (INT1 → INT2 → INT3 → INT4 → INT1) with a 16 s time gap. Hence, at any one point in time, we are actuating only a single intersection. This reduces the action space to just 4 + 4 + 4 + 3 = 15, a linear 4n dependency. This reduces the size of the feature vector, which in turn reduces the memory requirements of the algorithm. It also reduces the training time required, since fewer weights are involved in the Q-value calculation. Furthermore, it makes our algorithm highly scalable and thus suitable for large-scale networks [3].
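A minimal sketch of this cyclic actuation scheme is given below. The traffic-light IDs and per-intersection phase counts are placeholders (the real IDs come from the SUMO network file), and the flattened 15-entry action list is just one possible encoding of the (intersection, phase) pairs.

import traci

# Placeholder traffic-light IDs and per-intersection phase counts
# (three 4-way intersections and one 3-way junction).
INTERSECTIONS = ["INT1", "INT2", "INT3", "INT4"]
NUM_PHASES = {"INT1": 4, "INT2": 4, "INT3": 4, "INT4": 3}
GAP_S = 16  # the next intersection in the cycle is actuated every 16 s

# Flattened action space: one entry per (intersection, phase) pair, 4 + 4 + 4 + 3 = 15.
ACTIONS = [(tls, phase) for tls in INTERSECTIONS for phase in range(NUM_PHASES[tls])]

def scheduled_intersection(sim_time_s):
    """Intersection whose signal the agent is allowed to actuate at this time."""
    return INTERSECTIONS[int(sim_time_s // GAP_S) % len(INTERSECTIONS)]

def apply_action(action_index):
    """Apply one of the 15 actions by setting the encoded phase on its intersection."""
    tls_id, phase = ACTIONS[action_index]
    traci.trafficlight.setPhase(tls_id, phase)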
The reward structure is primarily designed to reduce the queue lengths of the vehicles waiting at the intersections. The vehicle occupancy percentages provided by the lane area detectors are used to calculate the reward. For this, we use a simple scheme where the reward is set to the negative of the cumulative queue lengths (in meters) of the signals of the intersection currently being actuated. Thus, when any of the queue lengths at that intersection increases, the reward becomes more negative. The agent then corrects itself by taking alternative actions to achieve reduced overall queue lengths.
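A sketch of this reward is shown below. It uses TraCI's jam-length query on the same E2 detectors as a stand-in for the queue length in meters (our scheme derives it from the occupancy percentages, so this is an approximation), and the detector-to-intersection mapping is a placeholder.

import traci

# Placeholder mapping from each intersection to the E2 detectors covering its signals;
# the real detector IDs depend on the additional-file setup.
DETECTORS_BY_INTERSECTION = {
    "INT1": ["e2_N1", "e2_N2", "e2_N3", "e2_N4"],
    # ... and likewise for INT2, INT3, and INT4
}

def reward(tls_id):
    """Negative cumulative queue length (in meters) at the actuated intersection."""
    return -sum(traci.lanearea.getJamLengthMeters(det_id)
                for det_id in DETECTORS_BY_INTERSECTION[tls_id])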
The RL paradigm is based on the Markov Decision Process (MDP) framework. Here, we formally present the traffic control problem as an MDP. An MDP consists of an agent, a state space, an action space, and a reward structure. At any time step, the agent in state s chooses an action a that is available in that state. This action takes the agent to a new random state s′, while rewarding the agent with a reward R. In our problem, the state S of the environment consists of the occupancy levels of signals N1 to N15. The occupancy levels can take any value from 0% to 100%; thus, there are infinitely many states in our system. This renders table-based Q-value estimators unusable [3].

S = {n1, n2, n3, ..., n14, n15}    (1)

where ni is the occupancy level of signal i, and 0% ≤ ni ≤ 100%.

Fig. 1. The road network.
where R̄t is an estimate of the average reward.
IV. DESCRIPTION OF THE RL APPROACH USED
We wish to find the optimal signal control scheme to
achieve better traffic flow efficiency. This control problem of
reinforcement learning, in our case, reduces to maximizing the
differential n-step return defined in (4). We do this by taking
optimal actions. The optimal action at a state s is defined as
the one maximizing the Q-value, Q(s, a).
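For concreteness, the following is a minimal sketch of a standard textbook form of n-step SARSA with per-action linear function approximation in the average-reward (differential) setting; the hyperparameters (n, step sizes, ε) and the feature handling are placeholders rather than the exact values used in our experiments.

import numpy as np

class NStepSarsaAgent:
    """Differential semi-gradient n-step SARSA with per-action linear Q-values
    (a sketch of the textbook algorithm, not the exact implementation)."""

    def __init__(self, num_features, num_actions, n=4,
                 alpha=0.01, beta=0.01, epsilon=0.1):
        self.n, self.alpha, self.beta, self.epsilon = n, alpha, beta, epsilon
        self.num_actions = num_actions
        self.w = np.zeros((num_actions, num_features))  # one weight vector per action
        self.avg_reward = 0.0  # running estimate of the average reward

    def q(self, x, a):
        """Linear Q-value estimate Q(s, a) = w_a . x(s)."""
        return self.w[a] @ x

    def select_action(self, x):
        """Epsilon-greedy action selection over the linear Q-values."""
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions)
        return int(np.argmax([self.q(x, a) for a in range(self.num_actions)]))

    def update(self, states, actions, rewards):
        """One n-step update; `states` and `actions` hold steps t..t+n (n+1 items)
        and `rewards` holds the n rewards observed in between."""
        # n-step TD error: differential n-step return minus the current estimate.
        delta = sum(r - self.avg_reward for r in rewards)
        delta += self.q(states[-1], actions[-1]) - self.q(states[0], actions[0])
        self.avg_reward += self.beta * delta
        # The gradient of a linear Q-value is simply the feature vector.
        self.w[actions[0]] += self.alpha * delta * states[0]

In use, the agent would call select_action on the current occupancy feature vector at each decision point, apply the chosen signal phase, collect the reward, and invoke update once n further transitions have accumulated.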