
Reinforcement Learning based Intelligent Traffic Signal Control using n-step SARSA


Akshay Kekuda
Dept. of Computer Science and Engineering
Ohio State University
Columbus, Ohio
[email protected]

R. Anirudh
Packet Hardware Group
Centre for Development of Telematics
Bangalore, India
[email protected]

Mithun Krishnan
Packet Hardware Group
Centre for Development of Telematics
Bangalore, India
[email protected]

Abstract: In this paper, we propose a reinforcement learning based traffic signal controller. We use the n-step SARSA algorithm to design a predictive approach to traffic signal control. We arrive at an optimal policy that effectively minimizes the probability of congestion in the network. Our results show that our algorithm outperforms conventional approaches such as the longest queue first technique. The overall approach is designed to be implementable in low-cost real-time systems.

Index Terms: Reinforcement Learning, Simulation of Urban Mobility, SUMO, Artificial Intelligence, Smart city.

I. INTRODUCTION

In the modern world, urbanization has led to high traffic density in cities. This is exacerbated by the poor road infrastructure in developing countries. Fixed signal timings are the norm at most traffic junctions around the world. This technique, based on the round-robin method, cycles the signal configurations in a periodic manner. This strategy, though easy to implement, may lead to increased congestion as it does not take the real-time traffic conditions into account. For example, this system might give a green signal to an intersection even when no vehicles are waiting. We define the traffic signal control problem as follows: Assuming that we have data about the state of traffic at all the intersections in the network, what is the optimal phasing sequence of the signals that achieves the lowest rate and intensity of vehicular congestion?

Typical control system based traffic controllers try to predict the future traffic pattern and set a traffic signal plan. However, the caveat to this approach is that the traffic network and the flow have to be modeled. As pointed out in [1] and [2], this is not an easy task for a dynamic traffic flow.

The need of the hour is self-learning and adaptive traffic management systems. Simple techniques such as induction loop sensors have been in existence since the 1960s [4]. With the modern advancements in Artificial Intelligence, the implementation of highly efficient traffic controllers is possible. Reinforcement Learning (RL), being a hybrid of supervised and unsupervised learning, is well suited to find optimal solutions for high-dimensional control problems. The traffic flow in a network is dependent on multiple parameters such as vehicular density, traffic flow rate, and existing congestion conditions. Hence, there are multiple variables to consider before deciding which traffic signal to activate. The Reinforcement Learning method looks for an optimal solution in a self-explorative manner. This relieves us from the burden of writing elaborate hard-coded algorithms that, by their very nature, are not universal to all systems. Reinforcement Learning techniques, on the other hand, can adapt to any system.

The Reinforcement Learning research on traffic signal control is replete with Neural Network (NN) based algorithms. While efficient, they are also highly complex and compute-intensive. Implementing them on low-cost real-time embedded systems is difficult. This makes them an unviable option for developing countries. We instead use linear function approximation, which is less compute-intensive. This aligns with our aim to design a simple low-cost traffic controller.

In this paper, we propose, for the first time, an adaptive Reinforcement Learning based traffic signal controller that uses the versatile n-step SARSA algorithm with linear function approximation to estimate Q-values. The pertinent issue with using Reinforcement Learning methods for large-scale traffic networks is that the size of the state-action space increases exponentially with the number of signal intersections in the network. To counteract this issue, we employ a novel centralized agent that actuates traffic signals such that the size of the state-action space increases linearly with the number of intersections, rather than exponentially as in conventional methods. This makes our algorithm well-suited for even large-scale networks. The centralized traffic control agent used in our algorithm is designed to reduce the queue lengths at the intersections and the vehicle waiting times. Our algorithm is simulated using the multi-modal traffic simulator SUMO (Simulation of Urban MObility) on a real-world road network from Texas, USA. We show that our algorithm performs better when compared with contemporary techniques such as Static Signaling (SS) and Longest Queue First (LQF).

The succeeding sections are organized as follows: Section II describes the research conducted in the field of traffic signal control and reinforcement learning. Section III explains the road network and the simulation setup used in our experiment, and also defines the state space, action space, and the reward structure. Section IV details the algorithmic approach used in our work. Section V discusses the results and performance of the agent, and Section VI summarizes the research conducted.

II. RELATED WORKS

In this section, we delve into the major research works carried out in the field of RL for improving the efficiency of traffic systems.

Representation of the state information plays a major role in the outcome of any RL algorithm. Several representations are found in the literature. For example, a feature vector consisting of thresholded vehicle queue lengths, elapsed waiting time, and the current signal phase is used in [5]. We found during our work that queue length is a poor measure of congestion at intersections. We attribute this to the large bumper-to-bumper separation distance for large vehicles and the varying lengths of the vehicles. Instead, vehicle density proved to be a better indicator of crowding at intersections. References [6] and [7] use a state representation called Discrete Traffic State Encoding (DTSE). In this approach, the roads are sectioned into grids, and a particular cell of the grid is marked as '1' if a vehicle is present in that cell and '0' otherwise. A speed matrix of the same dimension as the grid is updated regularly with the speed of the vehicle in that cell (0 if no vehicle is present). Better performance is achieved by decreasing the size of the cells, albeit at the cost of higher computational power requirements. The downside of this approach is that more complex data acquisition techniques are required, as the speeds of the vehicles must be regularly updated in addition to the binary matrix denoting whether a vehicle is present or absent. Reference [8] proposes extracting vehicle position information from the image representation by feeding it into a Convolutional Neural Network (CNN). Along with this, queue length, updated waiting time, phase, and the number of vehicles are input into another NN to estimate the Q-values. Keeping in mind our objective to design a simple low-cost system, we shy away from the above techniques.

There are different things that can be controlled using traffic signals to influence the traffic flow in the network. Some examples include the phase (side) to open for traffic, the duration of the phase (timing), and the sequence of the phases. Works such as [9] propose controlling both the phase and duration. The phase duration is adjustable to a resolution of 1s. We are of the opinion that, though such systems may show promising results on paper, they fail to take into account the behavioral aspects of drivers. For example, although the phase durations are assigned proportional to traffic demand from that side, the drivers may have a feeling of unfairness when a very long period is assigned to sides other than theirs. This is especially true when designing systems for countries with a considerable number of unsupervised intersections. To counteract this feeling of unfairness, works such as [5] and [8] propose adding vehicle waiting times to the feature vector.

One main advantage of centralized agents is that they inherently take into account the overall traffic condition in a network. This may prove beneficial in the case where multiple intersections are already congested and one leads to the other. In such a situation, opening the wrong intersection may lead to a snowballing effect leading to more congestion. Multi-agent Reinforcement Learning (MARL) systems have independent agents working at all intersections in the network, and they act in a coordinated fashion to achieve better overall system efficiency. References [5], [10], [11], and [12] describe the use of such MARL based traffic control. A hybrid of centralized and independent MARL systems is proposed in [13].

The reward structure determines the objective of an RL based solution. The reward should be formulated in such a way that when the agent takes an action that optimizes an objective metric such as queue lengths or waiting times, it is given a higher reward. One or more traffic parameters can be used to calculate the reward, where uneven weights can be assigned to the parameters to emphasize the effect of one over the others. During the training phase, the agent learns which actions are appropriate in a given situation. Hence, when it encounters similar situations in the live run, it knows which actions are best. One obvious way to improve a particular traffic parameter (or a weighted sum of them) is to use the difference between consecutive time steps as the reward. This results in a positive reward when there is an improvement in the parameter, and a negative reward otherwise. Using cumulative waiting time is common in the literature [6], [7]. Also frequently used is a combination of two parameters such as waiting time and queue length [5]. Works such as [8] also include the number of passing vehicles and their travel times in the calculation. It should be kept in mind that the reward structure works hand-in-hand with the algorithm, and hence the choice of one should be tailored to the overall approach.

The RL algorithm has primarily two components: the method used to estimate the Q-values, and the actual algorithm used to achieve the optimal policy. The most popular method to estimate the Q-values is to use a Neural Network. It can model non-linear effects by using appropriate activation functions. Extensive research has been conducted in this domain, with many works applied to the traffic control problem [7], [8], [9]. Works such as [5] use linear function approximation in combination with Q-learning to attain the optimal policy. Works such as [14] use the vanilla table-based Q-learning method. However, a finite state-action space is a prerequisite to apply this method. Most previous works available in the literature use a variant of the Q-learning algorithm [6], [9]. Reference [15] presents a unique Dynamic Programming (DP) based method to find the optimal control plan. This DP-based algorithm uses the values of the neighboring intersections to update the value of a given one. This inherent property of DP ensures cooperation with the neighbors.

III. EXPERIMENTAL SETTING

Our experiments are conducted on a simulated 4-junction road network in Texas. We obtain the road network using OpenStreetMap. We perform our simulations on this network using the open-source traffic simulator Simulation of Urban Mobility (SUMO). Controlled traffic patterns are generated using SUMO's inbuilt traffic generator to test our algorithm.

The full road network used in our method is shown in Fig. 1. The blue-colored intersections are of concern to us. The traffic signals in these intersections are shown in Fig. 2. The 15 signals numbered N1 to N15 are the signals present across these intersections. Lane area detectors (called E2 detectors in SUMO) are used to sense the vehicle occupancy at the intersections. The percentage occupancies output by these detectors comprise the state feature vector of the algorithm. These detectors sense vehicles up to 70m from the intersections.

The network consists of roads of one to three lanes, where each lane is a standard 12ft wide. The main arterial roads are 1.5km to 2km in length. The traffic consists of 4 types of vehicles. The traffic distribution is modeled to represent urban traffic: Bikes (41%), Cars (37.5%), Trucks (12.5%), Buses (8.3%). These vehicles have different speed and acceleration parameters as shown in Table I [3].

Fig. 3 shows a four-way intersection used in our setup. It shows the traffic signal phasing in green and red colors. In our phasing scheme, when a particular phase is activated, vehicles on that side can move to any of the other three sides. The green lines in Fig. 3 depict the same. The duration of each phase is fixed at 64s.

The traffic pattern is generated by SUMO's inbuilt Python-based script. This script allows the traffic flow parameters to be configured according to the requirement. The generated traffic pattern is set to be similar to a real-world one, in both the composition and other attributes. For this, we choose a Binomial distribution with n = 5 and p = 0.15, with larger traffic flowing in wider roads.

A centralized agent like ours has to actuate all the intersections in the network by itself. As such, when the intersections are operated simultaneously, the action space of the model consists of all possible signal configurations of all the intersections in the network. For example, for our 4-intersection network, the number of possible actions is 4*4*4*3 = 192, i.e., for n intersections the number of actions grows as $4^n$ (assuming all are 4-way intersections). It is easy to imagine how things could get out of hand for a city-level traffic network. Instead of this naïve approach, we use a cyclic scheme, wherein the intersections are operated cyclically (INT1 → INT2 → INT3 → INT4 → INT1) with a 16s time gap. Hence, at any one point in time, we are actuating only a single intersection. This reduces the action space to just 4+4+4+3 = 15, which is a linear $4n$ dependency. This reduces the size of the feature vector, which in turn reduces the memory requirements of the algorithm. It also reduces the training time required since fewer weights are involved in the Q-value calculation. Furthermore, it makes our algorithm highly scalable, making it suitable for large scale networks [3].

The reward structure is primarily designed to reduce the queue length of the vehicles waiting at the intersections. The vehicle occupancy percentages provided by the lane area detectors are used to calculate the reward. For this, we use a simple scheme where the reward is set as the negative of the cumulative queue lengths (in meters) of the signals of the particular intersection being currently actuated. Thus, when any of the queue lengths in that intersection increases, the reward becomes more negative. The agent then corrects itself by taking alternate actions to achieve reduced overall queue lengths.

The RL paradigm is based on the Markov Decision Process (MDP) framework. Here, we formally present the traffic control problem as an MDP. An MDP consists of an agent, a state space, an action space, and a reward structure. At any time step, the agent in state s chooses an action a that is available in that state. This action takes the agent to a new random state $s'$, while rewarding the agent with a reward R.

In our problem, the state S of the environment consists of the occupancy levels of signals N1 to N15. The occupancy levels can take any value from 0% to 100%. Thus, there are infinitely many states in our system. This renders table-based Q-value estimators unusable [3].

$$S = \{n_1, n_2, n_3, \ldots, n_{14}, n_{15}\} \tag{1}$$

where $n_i$ is the occupancy level of signal i, and $0\% \le n_i \le 100\%$.
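The action-space arithmetic of Section III can be checked with a short sketch. The function names are illustrative and not taken from the paper. The per-intersection signal counts (4, 4, 4, 3) are those quoted in the text, while the larger networks in the loop are hypothetical.

```python
from math import prod


def joint_action_count(signals_per_intersection):
    """Actions when all intersections are set simultaneously (product of counts)."""
    return prod(signals_per_intersection)


def cyclic_action_count(signals_per_intersection):
    """Actions when one intersection is actuated at a time (sum of counts)."""
    return sum(signals_per_intersection)


if __name__ == "__main__":
    signals = [4, 4, 4, 3]  # signals per intersection in the 4-junction network
    print(joint_action_count(signals))   # 192
    print(cyclic_action_count(signals))  # 15
    # Hypothetical larger networks of n four-way intersections: 4**n versus 4*n.
    for n in (4, 10, 20):
        print(n, joint_action_count([4] * n), cyclic_action_count([4] * n))
```

For ten four-way intersections, the joint scheme already needs 1,048,576 actions while the cyclic scheme needs 40.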

TABLE I
VEHICLE CHARACTERISTICS

Vehicle Type   Length (m)   Width (m)   Max Speed (m/s)   Max Acceleration (m/s^2)
Bike           2.2          0.9         25                6.0
Car            4.3          1.8         27.78             2.9
Truck          7.1          2.4         22.22             1.3
Bus            12.0         2.5         19.44             1.2

Fig. 1. The road network.
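The cyclic actuation and the queue-length reward of Section III can be sketched as follows. This is a minimal illustration under stated assumptions. The paper does not say which of the signals N1 to N15 belong to which intersection, so the contiguous grouping below is hypothetical, as are the function and constant names.

```python
from itertools import cycle

# Hypothetical grouping of signals N1..N15 into the four intersections
# (4, 4, 4 and 3 signals); the actual assignment is not given in the paper.
INTERSECTION_SIGNALS = {
    "INT1": [1, 2, 3, 4],
    "INT2": [5, 6, 7, 8],
    "INT3": [9, 10, 11, 12],
    "INT4": [13, 14, 15],
}
STEP_SECONDS = 16  # time gap between consecutive intersections


def reward(queue_lengths_m, intersection):
    """Negative cumulative queue length (meters) over one intersection's signals."""
    return -sum(queue_lengths_m[s] for s in INTERSECTION_SIGNALS[intersection])


def schedule(n_steps):
    """Yield (time in seconds, intersection, available actions) in cyclic order."""
    order = cycle(INTERSECTION_SIGNALS)  # INT1 -> INT2 -> INT3 -> INT4 -> INT1 ...
    for step in range(n_steps):
        name = next(order)
        yield step * STEP_SECONDS, name, INTERSECTION_SIGNALS[name]


if __name__ == "__main__":
    queues = {signal: 10.0 for signal in range(1, 16)}  # made-up queue lengths
    for time_s, name, actions in schedule(5):
        print(time_s, name, actions, reward(queues, name))
```

With a 16s step and four intersections, each intersection is revisited every 64s, which matches the fixed phase duration stated in the text.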
Fig. 2. Traffic signals in our network.

Fig. 3. Signal phasing in our setup.

The action set A consists of 15 actions, each denoting the opening of one of the signals N1 to N15. As mentioned earlier, the intersections are operated cyclically, and hence the available actions at any time t depend upon the intersection that is being operated [3].

$$A = \{a_1, a_2, a_3, \ldots, a_{14}, a_{15}\} \tag{2}$$

where $a_i$ denotes giving green to signal i.

The reward R is assigned as the negative of the cumulative queue lengths (in meters) of the signals of the currently controlled intersection.

$$R = -\sum_{i \,\in\, \text{intersection currently controlled}} q_i \tag{3}$$

where $q_i$ is the queue length at signal i.

Each time step t is equal to 16s. The problem is formulated as a continuing task. We use linear function approximation to determine Q-values. However, function approximation techniques are not known to work well with the discounted reward setting [16]. Hence, we use the average reward setting instead. We define the differential n-step return as:

$$G_{t:t+n} = R_{t+1} - \bar{R}_{t+n-1} + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, w_{t+n-1}) \tag{4}$$

where $\bar{R}_t$ is an estimate of the average reward.

IV. DESCRIPTION OF THE RL APPROACH USED

We wish to find the optimal signal control scheme to achieve better traffic flow efficiency. This control problem of reinforcement learning, in our case, reduces to maximizing the differential n-step return defined in (4). We do this by taking optimal actions. The optimal action at a state s is defined as the one maximizing the Q-value, Q(s, a).

$$a^* = \arg\max_a Q(s, a) \tag{5}$$

We use linear function approximation to estimate the Q-values. This technique takes as input the state-action feature vector, $\psi(s, a)$, and outputs the scalar Q-value by multiplying it with the weight vector, w. In our approach, we use the vehicle occupancy percentages at the 15 signals, $c_i$, to form the state feature vector $\phi(s)$:

$$\phi(s) = [c_1, c_2, \ldots, c_{15}]^T \tag{6}$$

The state-action feature vector, $\psi(s, a)$, is then constructed by inserting the 15-element state feature vector into a zero column vector of length 15*15+1 = 226. The position where it is inserted depends on the action a. For example, if the action is $a_1$, the first 15 elements of the 226-element zero vector are set equal to $\phi(s)$. The last element of the state-action feature vector is set to 1.

$$\psi(s, a_i)_{\text{index}} = \begin{cases} c_j, & \text{if } 15(i-1)+1 \le \text{index} \le 15(i-1)+15,\ c_j \in \phi(s),\ j = \text{index} - 15(i-1) \\ 1, & \text{if index} = 226 \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

The weight vector is also a 226-element vector. The last element, $w_{226}$, is the bias weight. It multiplies with the corresponding unity element in the state-action feature vector to add a bias to the Q-value. This bias accounts for any state effects not accounted for by the occupancy percentages.

$$w = [w_1, w_2, \ldots, w_{226}]^T \tag{8}$$

Finally, the Q-value estimate, $\hat{q}(s, a, w)$, is obtained by the linear operation:

$$\hat{q}(s, a, w) = w^T \psi(s, a) \tag{9}$$

The estimate (9) is used in (4) as well as the rest of the Differential Semi-gradient n-step SARSA algorithm [16], the pseudocode of which is presented in Algorithm 1. The random initial state $S_0$ is achieved by letting the traffic flow in the network for some random amount of time.

The actions are taken in an ε-greedy manner, where ε is decayed exponentially every training run starting from an initial value of 1. α and β are step-size parameters, and are decayed with the index of the training run, run.

$$\alpha = 0.1/\text{run} \tag{10}$$
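The feature construction in (7), the linear estimate in (9), and the updates of δ, R̄ and w from Algorithm 1 can be sketched in plain Python as below. This is an illustrative sketch and not the authors' implementation. The function names, the 1-based action index, and the example numbers are assumptions.

```python
N_SIGNALS = 15                           # signals N1..N15, one action per signal
N_FEATURES = N_SIGNALS * N_SIGNALS + 1   # 226 elements, as stated in the text


def state_action_features(occupancy, action):
    """Build psi(s, a_i): phi(s) copied into block i, bias 1 in the last slot."""
    if len(occupancy) != N_SIGNALS:
        raise ValueError("expected 15 occupancy values")
    if not 1 <= action <= N_SIGNALS:
        raise ValueError("action index must be in 1..15")
    psi = [0.0] * N_FEATURES
    start = N_SIGNALS * (action - 1)     # 0-based offset of block i
    psi[start:start + N_SIGNALS] = occupancy
    psi[-1] = 1.0                        # bias element (index 226 in the paper)
    return psi


def q_hat(weights, psi):
    """Linear Q-value estimate w^T psi(s, a), as in (9)."""
    return sum(w * x for w, x in zip(weights, psi))


def sarsa_update(weights, avg_reward, rewards, psi_tau, psi_tau_n, alpha, beta):
    """One differential semi-gradient n-step SARSA update.

    rewards holds R_{tau+1}..R_{tau+n}; psi_tau and psi_tau_n are the feature
    vectors of (S_tau, A_tau) and (S_{tau+n}, A_{tau+n}).
    """
    delta = (sum(r - avg_reward for r in rewards)
             + q_hat(weights, psi_tau_n) - q_hat(weights, psi_tau))
    new_avg_reward = avg_reward + beta * delta
    new_weights = [w + alpha * delta * x for w, x in zip(weights, psi_tau)]
    return new_weights, new_avg_reward, delta


if __name__ == "__main__":
    occupancy = [50.0] * N_SIGNALS       # made-up occupancy percentages
    psi_tau = state_action_features(occupancy, 1)
    psi_tau_n = state_action_features(occupancy, 2)
    weights = [0.0] * N_FEATURES
    alpha = 0.1                          # alpha = 0.1/run with run = 1
    beta = 2 * alpha                     # beta = 2 * alpha
    weights, avg_reward, delta = sarsa_update(
        weights, 0.0, [-10.0, -20.0, -30.0], psi_tau, psi_tau_n, alpha, beta)
    print(delta, avg_reward, weights[0], weights[-1])
```

Because only one 15-element block and the bias are non-zero for a given action, an update touches just 16 of the 226 weights.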
Algorithm 1 Pseudo Code
Initialize weights $w = 0$
Initialize average reward $\bar{R} = 0$
for each training run do
    Start with a random $S_0$ and $A_0$.
    for t = 0, 1, 2, ... do
        Take action $A_t$; observe $R_{t+1}$ and $S_{t+1}$.
        Choose action $A_{t+1}$ ε-greedy w.r.t. $\hat{q}(S_{t+1}, \cdot, w)$.
        $\tau = t - n + 1$
        if $\tau \ge 0$ then
            $\delta = \sum_{i=\tau+1}^{\tau+n} (R_i - \bar{R}) + \hat{q}(S_{\tau+n}, A_{\tau+n}, w) - \hat{q}(S_\tau, A_\tau, w)$
            $\bar{R} = \bar{R} + \beta\delta$
            $w = w + \alpha\delta\,\psi(S_\tau, A_\tau)$
        end if
    end for
end for

The step size β, used to update the average reward estimate, is tied to α:

$$\beta = 2\alpha \tag{11}$$

The n-step SARSA algorithm attempts to look n time steps into the future. Increasing n lets the agent foresee situations farther into the future. However, the convergence of the algorithm becomes harder and may result in an unstable, suboptimal policy.

We choose the SARSA n value as 3, which we found yields stable and consistent results. We obtain good results with just 20 runs of training. Each training run is equivalent to 4.44hrs of real-world traffic flow, amounting to a total training time equivalent to 3.7 days.

V. SIMULATION RESULTS

We compare the performance of our n-step SARSA algorithm against Static Signaling (SS) and Longest Queue First (LQF) using the following metrics [3]:

1) Percentage Queue occupancy: This metric is the average percentage lane occupancy of the 15 signals in our 4-junction network.
2) Waiting Time: This metric denotes the fraction of the journey time that the vehicle spends waiting at signals.
3) Time Loss: This metric denotes the time lost by a vehicle due to it driving below the ideal speed.
4) Dispersion Time: This metric denotes the time needed to disperse the traffic present in the road network. To measure this metric, traffic is allowed to flow for about 3.3hrs, after which the time needed to empty the network is recorded.

The performance plots of n-step SARSA against SS and LQF are shown in Figs. 4 to 7. We can see that the SS approach, which is the most prevalent method in use today, performs very badly in comparison to the other methods.

With respect to Queue length (Fig. 4), the network takes about 150mins to reach steady state traffic flow. Once a steady state is achieved, n-step SARSA starts outperforming both SS and LQF. It shows a 5.5% improvement as compared to LQF and a 52.94% improvement as compared to SS in the steady state. During the dispersion phase (200mins onwards), n-step SARSA is able to disperse the traffic 31.5% faster than LQF and 275% faster than SS.

Fig. 4. Queue Length comparison of the traffic control algorithms.

With respect to Waiting Time (Fig. 5), the network takes about 100mins to reach steady state traffic flow. Once steady state is achieved, n-step SARSA starts outperforming both LQF and SS. It shows a 14.2% improvement as compared to LQF and a 133% improvement as compared to SS in the steady state and dispersion phases. This shows that the vehicles reach their destinations faster in the case of n-step SARSA.

Fig. 5. Waiting Time comparison of the traffic control algorithms.

With respect to Time Loss (Fig. 6), the network takes about 100mins to reach steady state traffic flow. Once steady state is achieved, n-step SARSA starts outperforming both SS and LQF. It shows a 10.2% improvement as compared to LQF and an 81.8% improvement as compared to SS in the steady state and dispersion phases.

Fig. 6. Time Loss comparison of the traffic control algorithms.

The Dispersion Rate of traffic during the dispersion phase is very high for n-step SARSA, as can be seen from the steep descents of the n-step SARSA graphs of all the metrics.

Finally, with respect to Dispersion time (Fig. 7), we observe that LQF has multiple spikes in its dispersion times and SS has a dispersion time of around 200mins. These are called deadlock situations, wherein the traffic comes to a complete halt due to congestion. As can be seen from the plots, n-step SARSA never enters (0%) a deadlock state due to its ability to predict and avoid such situations. Thus, n-step SARSA has the lowest dispersion time on all trials.

Fig. 7. Dispersion Time comparison of the traffic control algorithms.

VI. CONCLUSION

In this paper, we use the Differential Semi-gradient n-step SARSA RL algorithm to improve the efficiency of a road network. Simulations were performed and the results show that our approach consistently outperforms the traffic control techniques prevalent today. Such Reinforcement Learning based approaches have significant potential to reduce congestion in modern traffic networks.

Our future work would involve a risk-sensitive approach to traffic control, where the agent effectively predicts the risk involved in a particular action at a given state. We aim to use this ability to enable conscious decision making to preemptively avoid deadlock situations.

REFERENCES

[1] Li L, Wen D, Yao D Y. A survey of traffic control with vehicular communications. IEEE Transactions on Intelligent Transportation Systems, 2014, 15(1): 425-432.
[2] Bellemans T, De Schutter B, De Moor B. Model predictive control for ramp metering of motorway traffic: a case study. Control Engineering Practice, 2006, 14(7): 757-767.
[3] R. Anirudh, Mithun Krishnan, Akshay Kekuda. Intelligent Traffic Control System using Deep Reinforcement Learning. Unpublished.
[4] Federal Highway Administration 2006, Chapter 2, Traffic detector handbook, 3rd edn, vol. 1, U.S. Department of Transportation, 2016.
[5] Prashanth L. A. and Shalabh Bhatnagar. Reinforcement Learning With Function Approximation for Traffic Signal Control. IEEE Transactions On Intelligent Transportation Systems, vol. 12, no. 2, 2011.
[6] W. Genders and S. Razavi. Using a deep reinforcement learning agent for traffic signal control. arXiv:1611.01142, 2016.
[7] Andrea Vidali, Luca Crociani et al. A Deep Reinforcement Learning Approach to Adaptive Traffic Lights Management. Workshop "From Objects to Agents", 2019.
[8] Hua Wei, Guanjie Zheng et al. IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control. The 24th ACM SIGKDD International Conference, 2018.
[9] L. Li, Y. Lv, and F. Wang. Traffic signal timing via deep reinforcement learning. The IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, 2016.
[10] Prabuchandran K.J., Hemanth Kumar A.N, and S. Bhatnagar. Multi-agent reinforcement learning for traffic signal control. 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Qingdao, 2014.
[11] Bakker B., Whiteson S., Kester L., Groen F.C.A. Traffic Light Control by Multiagent Reinforcement Learning Systems. Interactive Collaborative Information Systems, Springer Berlin Heidelberg, 2010.
[12] Marco Wiering. Multi-Agent Reinforcement Learning for Traffic Light Control. Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00), 2000. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1151-1158.
[13] Zhi Zhang, Jiachen Yan, and Hongyuan Zha. Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization. arXiv:1909.10651, 2019.
[14] H. Joo and Y. Lim. Reinforcement Learning for Traffic Signal Timing Optimization, 2020 International Conference on Information Networking (ICOIN), Barcelona, Spain, 2020, pp. 738-742, doi: 10.1109/ICOIN48656.2020.9016568.
[15] T. Li, D. Zhao, and J. Yi. Adaptive Dynamic Programming for Multi-intersections Traffic Signal Intelligent Control, 2008 11th International IEEE Conference on Intelligent Transportation Systems, Beijing, 2008, pp. 286-291, doi: 10.1109/ITSC.2008.4732718.
[16] R. S. Sutton, A. G. Barto et al. Reinforcement learning: An introduction, second edition. MIT press Cambridge, 2018.

Akshay Kekuda received the B.Tech. degree in Electronics and Communication Engineering from National Institute of Technology, Surathkal in 2017. He is currently pursuing a Masters in Computer Science from Ohio State University. His research interests include Deep Reinforcement Learning, Applied Machine Learning, and Biomedical Informatics.

R. Anirudh received the B.Tech. degree in Electronics and Communication Engineering from National Institute of Technology, Tiruchirappalli in 2017. He is currently with the Centre for Development of Telematics, Government of India. His research interests include Deep Reinforcement Learning, Intelligent Traffic Systems, and Edge AI.

Mithun Krishnan received the B.Tech. degree in Electronics and Communication Engineering from National Institute of Technology, Calicut in 2017. He is currently with the Centre for Development of Telematics, Government of India. His research interests include Reinforcement Learning, Dynamic Traffic Modelling, and High-Performance Computing.
