Transportation Research Part C: Yiming Bie, Yuting Ji, Dongfang Ma

Transportation Research Part C 164 (2024) 104663
Contents lists available at ScienceDirect
Transportation Research Part C

journal homepage: www.elsevier.com/locate/trc
Multi-agent Deep Reinforcement Learning collaborative Traffic

Signal Control method considering intersection heterogeneity✩
Yiming Bie a , Yuting Ji a , Dongfang Ma b,c ,∗
a School of Transportation, Jilin University, Changchun 130022, China
b Institute of Marine Sensing and Networking, Zhejiang University, 310058, Hangzhou, China
c
Research Center of Artificial Intelligence, Pengcheng Lab., 518055, Shenzhen, China
ARTICLE INFO ABSTRACT
Keywords: Traffic Signal Control (TSC) plays a crucial role in mitigating congestion. The extensive
Traffic Signal Control integration of Deep Reinforcement Learning (DRL) into TSC, fueled by the advancements in
Multiagent reinforcement learning artificial intelligence, has yielded remarkable efficacy. Nevertheless, prevailing DRL models are
Graph attention networks
typically developed within homogeneous road networks, characterized by identical intersection
Intersection heterogeneity
types and relatively balanced traffic flow patterns. This predominant approach often results in
Spatiotemporal modeling
an oversight of the impact of intersection heterogeneity, encompassing variations in intersection
geometric structure, traffic demands, signal timing design, and other pertinent factors, on action
selections by researchers. This paper addresses the aforementioned gap by introducing a value-
decomposition-based spatiotemporal graph attention multi-agent DRL model (MARL_SGAT) that
expressly accommodates intersection heterogeneity. Specifically, a heterogeneous correlation
index is designed using road network structural parameters and serves to formulate a het-
erogeneous reward function, enabling the quantification of action execution improvements in
diverse road network environments. To address the intricate spatiotemporal features of traffic
flows within the heterogeneous road network, the model strategically selects and incorporates
these features during the formulation by leveraging a spatiotemporal graph attention network.
On these bases, double dual networks are introduced to convert the individual-global-max
constraint of action selection into a value range constraint of the action advantage function,
thereby facilitating model learning. Simulation results showcase the superior performance of the
MARL_SGAT algorithm when compared to seven baseline algorithms. Specifically, the algorithm
manifests in reduced average vehicle delays, decreased number of stops, and elevated travel
speeds within the controlled road network. These advantages are particularly conspicuous in
heterogeneous road network environments.
1. Introduction
Traffic signal control (TSC) is pivotal for enhancing intersection traffic efficiency, mitigating safety risks, and reducing vehicle
travel costs. As a frequently employed method, collaborative control views adjacent intersections holistically, formulating an optimal
signal control plan through systematic adjustments to signal cycle length, green time, and other parameters. Traditional signal control
approaches fall into three categories based on control principles: fixed-time control, actuated control, and adaptive control. Among
✩ This article belongs to the Virtual Special Issue on "Multi-agent Deep Reinforcement Learning collaborative Traffic Signal Control method considering
intersection heterogeneity".
∗ Corresponding author at: Institute of Marine Sensing and Networking, Zhejiang University, 310058, Hangzhou, China.
E-mail addresses: [email protected] (Y. Bie), [email protected] (Y. Ji), [email protected] (D. Ma).
https://doi.org/10.1016/j.trc.2024.104663
Received 16 January 2024; Received in revised form 24 April 2024; Accepted 12 May 2024
Available online 23 May 2024
0968-090X/© 2024 Published by Elsevier Ltd.
Y. Bie et al. Transportation Research Part C 164 (2024) 104663
these, adaptive control stands out as it can achieve and maintain a global performance level for the control system by accommodating
fluctuations in traffic parameters over time. It serves as a prevalent approach to address signal control challenges.
In recent years, the integration of machine learning methods into TSC has become increasingly prevalent with the advent of
artificial intelligence (Qin et al., 2022; Qu et al., 2023). Reinforcement learning (RL), a prominent machine learning algorithm,
finds frequent application in adaptive traffic signal control due to its merits in organizational optimization and real-time decision-
making (Joo et al., 2020; Acar and Sterling, 2023). In RL, the agent continually updates the mapping relationship, i.e., the signal
control policy, between actions and environmental states through interactions with the environment. It then adaptively adjusts
the signal control scheme based on the long-term cumulative reward value (Q value) derived from the adopted policy. However,
challenges arise in processing high-dimensional continuous state information, rendering the algorithm less effective in complex road
network environments. The emergence of deep reinforcement learning (DRL) offers a solution to these issues. Leveraging the potent
perception processing capabilities of deep learning (DL) (Nigam and Srivastava, 2023), the agent autonomously observes and learns
the state representation of the road network environment. It undergoes decision-making training based on RL, optimizing problem-
solving strategies and methods, ultimately achieving end-to-end (from input to output) control and perception. Consequently, traffic
signal control methods grounded in DRL algorithms have garnered increasing attention among scholars (Liang et al., 2020; He et al.,
2024).
In the early stages, DRL applied to TSC mainly focused on isolated intersections. Subsequently, methodologies such as Q
learning (Genders and Razavi, 2019), SARSA (Kekuda et al., 2021), or a hybrid approach combining fuzzy control and RL (Kumar
et al., 2020), were progressively expanded to encompass the signal control of multiple signalized intersections. During this phase,
agent actions in the algorithm were primarily influenced by their local environmental states, with limited information exchange
between agents. Consequently, achieving the collaborative control objective proved to be challenging. Addressing the challenge
of collaborative control for multiple signalized intersections necessitates understanding the interactive relationships between
intersections within the road network, which is crucial for formulating optimal control policy (Li et al., 2021; Tang and Zeng,
2022). To tackle this, certain scholars have employed multi-agent reinforcement learning (MARL) algorithms (Liu et al., 2023; Su
et al., 2023; Fei et al., 2024). By facilitating agent communication (Song et al., 2024) or employing value decomposition (Wang
et al., 2021a) to share the environment, these approaches adjust the signal control policy of each intersection from a network-
wide perspective. Wang et al. (2021b) leveraged the cooperative vehicle infrastructure system to construct a MARL framework
predicated on collaborative groups. This framework enhanced the communication among signal light agents through a k-nearest
neighbor state representation and a spatially influenced discounted joint reward model, thereby fostering cooperation. In 2023, Yu
et al. developed an RL-based decentralized controller that integrated bus priority with TSC to optimize the efficiency of multimodal
transportation networks. This method achieved communication and collaboration among multiple agents by sharing the action
information of the agents. Yazdani et al. (2023) proposed an adaptive traffic signal model based on DRL focusing on optimizing
the management of pedestrian and vehicle flows. By designing an extended reward function, the model accurately captures various
interactive delays between vehicles and pedestrians, significantly improving the efficiency and effectiveness of traffic control. In the
same year, Su et al. developed a multi-agent reinforcement learning (MARL) framework that incorporated sharing strategies and
spatial discount factors, effectively decoupling the complex relationship between emergency vehicle navigation and traffic signal
control. Chu et al. (2019) added the environment state of adjacent agents to the state expression of the local agent, so that the
local agent had more information about regional traffic distribution and cooperation strategies, and improved the fitting ability of
the model. Chen et al. (2021) proposed a decentralized MARL method based on participant evaluation to control traffic signals.
They employed a differential reward method to assess each agent’s contribution within the cooperative game, which expedited the
convergence speed of the algorithm. Considering the importance of information sharing in large-scale road networks, Jiang et al.
(2022) designed a general communication form UniComm, which embedded a large number of observations collected on an agent
into crucial predictions of the impact on its adjacent agents, improving network communication efficiency.
Moreover, the exploration and incorporation of spatiotemporal correlations among traffic flows have become crucial in the
development of machine learning-based signal control models. The traffic flow system is a highly complex and dynamically evolving
entity. Variations in traffic flow patterns within individual road sections are not only influenced by their unique historical trends but
are also intricately connected to the traffic dynamics in other road sections. This implies the presence of factors that impact traffic
flow changes across both temporal and spatial dimensions. Consequently, an effective traffic control model must possess an input
structure that accommodates spatiotemporal characteristics simultaneously (Dong et al., 2021; Wang et al., 2022). Notably, Huang
et al. (2021) introduced a MARL model, leveraging a spatial–temporal attentive neural network, to optimize traffic signal timing
in a comprehensive road network. Zhao et al. (2019) used the graph convolutional networks (GCN) to learn complex topological
structures for capturing spatial dependencies. The gated recurrent unit was employed to discern dynamic changes in traffic data,
thereby capturing temporal dependencies. Wang et al. (2020b) realized the coordinated control of multi-intersection in a distributed
manner through deep q-network (DQN) (Mnih et al., 2013). In this model, the traffic light adjacency graph was constructed based
on the spatial structure among traffic lights, and the historical traffic records would be integrated with the current traffic state via a
recurrent neural network structure. Wei et al. (2019) proposed a model, CoLight, to enable cooperation of traffic signals, which used
graph attention networks (GAT) (Veličković et al., 2017) to facilitate communication. For a target intersection in a network, CoLight
could not only incorporate the temporal and spatial influences of neighboring intersections to the target intersection but also build
up an index-free model of neighboring intersections. Wang et al. (2024) used GAT to capture geometric relationships between lanes
and utilized an enhanced GraphSAGE technique to explore the spatial characteristics of the traffic network. This method improved
the model’s adaptability to various traffic conditions and road networks, allowing for flexible traffic signal control adapted to the
2
geometric shapes of different intersections.Wang et al. (2022) designed a graph neural framework with GAT and long short-term
memory (LSTM) networks to obtain spatial and temporal information for coordinated control of traffic signals.
In conclusion, existing research approaches the problem from multiple perspectives such as feature extraction, reward reconstruc-
tion, and model structure optimization. These approaches employ a range of DL methods to uncover the spatiotemporal correlation
between traffic flows and implement collaborative control of multiple signalized intersections based on RL methods. However,
practical scenarios often involve a common urban road network structure that includes sub-arterial or branch roads connecting
two arterial roads. This leads to significant variations in the geometric structure, traffic demand, and signal timing design among
adjacent intersections, collectively termed ‘‘intersection heterogeneity’’ for this discussion. In a heterogeneous road network, traffic
conditions at each intersection markedly differ from those in a homogeneous environment, characterized by similar intersection
types and relatively balanced traffic flow. Two notable scenarios arise: (i) the traffic stream from a larger upstream intersection to a
smaller downstream intersection is divided by red lights, causing vehicles at the end of the traffic stream to wait at the intersection
stop line until the next green phase, significantly increasing travel time; (ii) the traffic stream from a smaller intersection to a
larger one intermittently experiences a ‘‘stop-and-go’’ condition during the green phase. Frequent stops and starts reduce average
vehicle speed, leading to delays and increased energy consumption (Ji et al., 2024; Ma et al., 2023). The existence of heterogeneous
intersections amplifies the uncertainty and nonlinearity of traffic flow in the road network, complicating spatiotemporal correlations
among traffic flows. Despite the inherent complexity, most models still construct the reward function and verify the model based
on homogeneous road networks, overlooking the influence of intersection heterogeneity on action selection. Consequently, these
models are unsuitable for heterogeneous road network environments with significant differences in intersection types.
To address the aforementioned issues, this study uniquely accounts for intersection heterogeneity in the collaborative control
problem, by formulating a value decomposition-based spatiotemporal graph attention multi-agent deep reinforcement learning
model, termed MARL_SGAT. The objective of this model is to maximize the long-term cumulative reward across the entire road
network. This model also involves treating the adjustment of green time at individual intersections as actions. The primary
contributions of this work can be summarized in three key aspects:
• We introduce a MARL framework for collaborative traffic signal control where each intersection is managed by an individual
agent. This framework accommodates intersection heterogeneity depicted by relevant spatiotemporal features, leveraging
the spatiotemporal GAT for extraction. Additionally, the framework employs the joint action-value function decomposition,
strategically reducing state–action space and encouraging collaborative interactions among multiple agents for mutual benefit.
Consequently, our framework exhibits the capability to adapt the signal control policy of individual intersections, offering a
holistic network viewpoint.
• Considering the prevalent distinctions among adjacent intersections in urban road networks, we formulate a heterogeneous
correlation index to quantify variations in intersection topology, providing a robust foundation for our framework. Employing
this index, a novel reward function is devised to offer a more precise evaluation method for the agent’s action selection within
a heterogeneous road network environment.
• The double dual networks that we introduced decompose the action-value function for both the global road network and
individual intersections into the sum of the state value function and the advantage function. This alteration detaches the
state value function from complete dependency on the action, thereby expediting the convergence speed of the network.
Simultaneously, it transforms the full IGM constraints into value range constraints for the advantage function, offering a more
straightforward implementation. This transformation simplifies the learning process for the optimal action-value function.
The subsequent sections of this paper are structured as follows. In Section 2, we present the fundamental concepts of environment,
action, state, and reward within the MARL model established in this study. Section 3 provides a detailed description of the model
framework and structure. In Section 4, a simulation-based case study is presented, and the proposed method is compared with other
baseline methods. The concluding section, Section 5, summarizes the entire paper and draws conclusions.
2. Agent formulation
2.1. Environment and action
In the model presented in this paper, each intersection is managed by an individual agent. Intersections and the corresponding
agents are denoted by the same set of indices 𝑖 (𝑖 = [1, 2, … , 𝐼]). 𝐼 is the number of intersections in the road network. Intersection 𝑖
is characterized as a 𝐽𝑖 -way intersection, with each entrance comprising 𝑍⃗ 𝑗 lanes and each exit having ⃖⃖⃖ 𝑍 𝑗 lanes. The entrances and
exits refer to the passages in the intersection that are specifically used for vehicles to enter or leave the road. For ease of reference,
each entrance and exit are sequentially numbered in a clockwise direction. The lengths of the entrance and exit are denoted as 𝑙⃗𝑖,𝑗
and ⃖𝑙𝑖,𝑗 (𝑗 = [1, 2, … , 𝐽𝑖 ]), respectively. Intersection 𝑖 has 𝑃𝑖 phases, each having a green time of 𝑔𝑖,𝑝 (𝑝 = [1, 2, … , 𝑃𝑖 ]), and the phase
sequence at each intersection remains constant. Fig. 1 shows the structure of a 5 × 5 traffic network (composed of 25 homogeneous
intersections), that is, 𝐼 = 25. For the convenience of description, we have numbered the intersections. Take Intersection 13 as an
example. This intersection is a four-leg intersection, 𝐽13 = 4. Its entrance road and exit road correspond to the blue and yellow
box lines in the figure respectively, and both have three lanes. The signal timing scheme of this intersection is shown in Fig. 1(c).
Specifically, the intersection has four signal phases, 𝑃13 = 4, and the green time of each phase is 𝑔13,1 , 𝑔13,2 , 𝑔13,3 , 𝑔13,4 .
The action undertaken by the agent is defined as 𝑎𝑡𝑖 . In the context of the TSC task, actions refer to the adjustments made by
the agent to the signal lights. Action settings usually contain three approaches: phase selection (He et al., 2021; Wang et al., 2022;
3
Fig. 1. Example of the 5 × 5 multi-intersection scenario.
Kong et al., 2022; Bokade et al., 2023), phase shift (Kumar et al., 2020; Liu et al., 2023; Wei et al., 2018), and phase duration
adjustment (Li et al., 2021, 2020; Yoon et al., 2021; Luo et al., 2024). Phase selection involves choosing the appropriate phase from
a predefined set, offering flexibility in the combination of phases at the cost of relatively frequent changes in signal light colors. On
the other hand, phase shift is to determine whether to maintain the current phase, and phase duration adjustment involves adjusting
the green time based on specific rules, which are similar. In contrast to phase selection, the sequence of phase changes under the two
latter methods is relatively predetermined, aligning more closely with driver expectations (Mao et al., 2022). Therefore, we select
phase duration adjustment as the action-setting method concerning driver experience and preferences. Specifically, the agent has
three candidate operations: (i) maintaining the green time of the current phase; (ii) extending the green time of the current phase
by 𝛥̃𝑡 seconds; (iii) reducing the green time of the current green phase by 𝛥̃𝑡 seconds. On this basis, we formulate the following
rules by setting the minimum green time 𝑔𝑖,𝑚𝑖𝑛 and the maximum green time 𝑔𝑖,𝑚𝑎𝑥 (Zhang and Wang, 2010): the agent executes
the recommended action if the current phase’s green time is greater than 𝑔𝑖,𝑚𝑖𝑛 and the duration will not exceed 𝑔𝑖,𝑚𝑎𝑥 after this
action. Otherwise, the current recommended action is ignored, and the agent awaits the next moment (𝑡 + 𝛥𝑡) for reassessment and
adjustment.
2.2. State
In the multi-intersection cooperative signal control model based on MARL, the state includes two parts: the individual state
of the intersection and the global state of the road network. For individual states, the traffic condition of each lane within an
intersection is typically defined by a combination of available parameters such as the count of vehicles, the number of queued
vehicles, and phase information (Wei et al., 2021; Wang et al., 2021c). However, owing to spatial structure constraints in the road
network, an identical count of vehicles and queued vehicles on different lanes within the same intersection may correspond to
distinct traffic conditions. Therefore, road spatial structure should be captured as well to depict various traffic conditions on each
lane at intersections. Here we utilize road network structure parameters to reflect intersection state characteristics. Specifically, the
instantaneous state parameter 𝑠𝑡𝑖 of the intersection at time 𝑡 consists of the entrance state collection 𝑠⃗𝑡𝑖 , the exit state collection 𝑠⃖⃖𝑡𝑖 ,
and signal states 𝑝𝑡𝑖 (𝑝𝑡𝑖 = 1, 2, ... , 𝑃𝑖 ), as shown in Eq. (1):
{ }
𝑠𝑡𝑖 = 𝑠⃗𝑡𝑖 , 𝑠⃖⃖𝑡𝑖 , 𝑝𝑡𝑖 (1)
⃗𝑡
Among 𝑠⃗𝑡𝑖 , the traffic state of lane 𝑧 at entrance 𝑗 comprises the lane’s traffic density 𝑘 ⃗𝑡
𝑖,𝑗,𝑧 and queuing density 𝐾𝑖,𝑗,𝑧 (𝑧 =
[1, 2, … , 𝑍⃗ 𝑖 ]), which are calculated as shown in the Eqs.:
𝑛⃗𝑡𝑖,𝑗,𝑧
⃗𝑡 =
𝑘 (2)
𝑖,𝑗,𝑧
𝑙⃗𝑖,𝑗
⃗𝑡
𝑁
𝑡 𝑖,𝑗,𝑧
𝐾⃗𝑖,𝑗,𝑧 = (3)
𝑙⃗𝑖,𝑗
⃗ 𝑡 are respectively the counts of vehicles and queued vehicles on lane 𝑧 in intersection 𝑖’s entrance 𝑗 at time 𝑡,
where 𝑛⃗𝑡𝑖,𝑗,𝑧 and 𝑁 𝑖,𝑗,𝑧
pcu.
4
Since vehicles usually pass at free-flow speeds through exit lanes, whether those lanes contain sufficient space deserves more
attention than the traffic density and queuing density there. Hence we use the remaining space of exit lane 𝑧, denoted as 𝑐⃖⃖𝑡𝑖,𝑗,𝑧 to
express the exit’s state of lane 𝑧 at entrance 𝑗, which is also a fundamental element in 𝑠⃖⃖𝑡𝑖 . The 𝑐⃖⃖𝑡𝑖,𝑗,𝑧 is represented by the ratio of
the count of vehicles on lane 𝑧 in intersection 𝑖’s entrance 𝑗 at time 𝑡, 𝑛⃖⃖ 𝑡𝑖,𝑗,𝑧 , to the maximum number of vehicles that the lane can
accommodate, ⃖⃖𝑛max 𝑧𝑖 ]. The value of ⃖⃖𝑛max
𝑖,𝑗,𝑧 (𝑧 = [1, 2, … , ⃖⃖
⃖
𝑖,𝑗,𝑧 is equal to the ratio of the entrance road length 𝑙𝑖,𝑗 to the standard car length
𝑙v ).
𝑡
⃖⃖𝑛𝑖,𝑗,𝑧
𝑐⃖⃖𝑡𝑖,𝑗,𝑧 = 1 − max (4)
⃖⃖𝑛𝑖,𝑗,𝑧
Regarding road network traffic state 𝑆 𝑡 , some studies directly concatenate all the state information from each intersection to
indicate the global information of the road network (Liu et al., 2022; Zhang et al., 2019; Ma and Wu, 2022). While this approach
accurately reflects the road network traffic state, it also results in an exponential increase in the dimensions of the traffic state
features as the number of intersections rises. This dilemma poses a considerable challenge to network training. Nevertheless,
effectively conveying the overall traffic state information of large-scale road networks with fewer variables becomes a crucial task
in the collaborative control of heterogeneous intersections. In this study, we select representative feature parameters, including the
average values of traffic density and queuing density of each entrance at the intersection and the phase number (𝑝) to simplify the
representation of individual intersection traffic states, and the number of vehicles entering and exiting the road network during
period [𝑡 − 𝛥𝑡, 𝑡] (𝑛𝑡in and 𝑛𝑡out ) as the global state expression. Incorporating these features, as shown in Eq. (5), not only efficiently
reflects the traffic state of individual intersections but also signifies the traffic characteristics of the entire road network.
{ }
𝑆 𝑡 = 𝑠𝑡1 , 𝑠𝑡2 , … , 𝑠𝑡𝐼 , 𝑛𝑡𝑖𝑛 , 𝑛𝑡𝑜𝑢𝑡 (5)
2.3. Heterogeneous correlation index and reward
In MARL algorithms, rewards should possess spatial decomposability and be easily measurable following an agent’s action (Ma
et al., 2021). As stated in the Introduction, the incorporation of intersection structure information into the reward function calcula-
tion process is necessary, with the heterogeneity between intersections primarily indicated by variations in lane channelization and
signal phase design. Accordingly, we utilize the structural information of the target intersection 𝑖 and its upstream and downstream
intersections to formulate an intersection heterogeneous correlation index. The heterogeneous correlation between intersection 𝑖
and the upstream intersection 𝑖+ of the critical lane in phase 𝑝 and the heterogeneous correlation between the intersection 𝑖 and
𝑝 𝑝
the downstream intersection 𝑖+ of the critical lane in phase 𝑝, 𝑐𝑜𝑟
⃖⃖⃖⃖⃖⃖𝑖,𝑖+ and 𝑐𝑜𝑟
⃖⃖⃖⃖⃖⃗𝑖,𝑖− , are calculated as follows:
𝑝
𝑚
⃖⃖⃖𝑖+ ,𝑖
𝑝
⃖⃖⃖⃖⃖⃖𝑖,𝑖+ =
𝑐𝑜𝑟 √ (6)
𝑚𝑝𝑖 𝑞 𝑥𝑝𝑖+ ,𝑖
𝑝
𝑚⃗ 𝑝𝑖,𝑖−
⃖⃖⃖⃖⃖⃗𝑖,𝑖− =
𝑐𝑜𝑟 √ (7)
𝑚𝑝𝑖 𝑞 𝑥𝑝𝑖,𝑖−
where 𝑚𝑝𝑖 is the lane count corresponding to the critical traffic flow 𝑓𝑖𝑝 in phase 𝑝 of intersection 𝑖; 𝑚 𝑝
⃖⃖⃖𝑖+ ,𝑖 is the lane count
𝑝 + 𝑝
corresponding to the traffic flow merging into 𝑓𝑖 at the upstream intersection 𝑖 ; 𝑚 ⃗ 𝑖,𝑖− is the lane count of exits on the downstream
intersection 𝑖− that 𝑓𝑖𝑝 heads towards; 𝑥𝑝𝑖+ ,𝑖 is the distance between intersection 𝑖 and the stop line of the upstream intersection 𝑖+ ,
m; 𝑥𝑝𝑖,𝑖− is the distance between intersection 𝑖 and the stop line of downstream intersection 𝑖− , m; 𝑞 is the scaling factor used for
unifying variables’ numerical range.
Furthermore, the control objective of traffic signals varies with alterations in traffic supply and demand and is reflected by the
shift of different reward functions. During off-peak periods, when traffic supply adequately meets demand, traffic signal control aims
at reducing traffic delays and enhancing traffic efficiency. Hence, the average vehicle delay at individual intersections is selected
as the feature parameter to measure the action’s impact. Conversely, during peak periods with heightened traffic demand, certain
intersections may experience oversaturated traffic conditions. During such times, the primary objective of signal control is to increase
the throughput of the intersection. Therefore, the throughput of the intersection and the average vehicle delay serve together as the
selected feature parameters for assessing the action’s impact during peak hours. On this basis, considering the prevalent imbalanced
traffic demand of urban road networks, we utilize the ratio of the hourly volume 𝑂𝑖𝑢,𝑝 of the critical lane in phase 𝑝 of intersection
𝑖 to the hourly traffic volume 𝑂𝑖𝑢 of the intersection under varying traffic demands as the weighting factor (𝑢 = 0, 1, corresponds to
off-peak and peak hours respectively), as expressed as Eq. (8). This factor is employed to calculate the reward 𝑟𝑡𝑖 that agent 𝑖 can
attain by executing action 𝑎𝑡𝑖 at time 𝑡.
⎧∑ 𝑃 𝑖
𝑢,𝑝 𝑝 𝑝
⃖⃖⃖⃖ 𝑖,𝑖+ 𝑐𝑜𝑟
𝑂𝑖 𝑐𝑜𝑟 ⃖⃖⃖⃗𝑖,𝑖−
𝑑𝑖𝑡,𝑢,𝑝 , 𝑖𝑓 𝑢 = 0
𝑡 ⎪ 𝑝=1 𝑂𝑖𝑢
𝑟𝑖 = ⎨ ( ) (8)
⎪∑𝑃𝑖
𝑢,𝑝 𝑝 𝑝
𝑂𝑖 ⃖⃖⃖⃖ ⃖⃖⃖⃗𝑖,𝑖−
𝑐𝑜𝑟𝑖,𝑖+ 𝑐𝑜𝑟
𝑤 𝑡 𝑑 𝑡,𝑢,𝑝 + 𝑤𝑡 𝑜𝑡,𝑝 , 𝑖𝑓 𝑢 = 1
⎩ 𝑝=1 𝑂 𝑢
𝑖
𝑖,D 𝑖 𝑖,O 𝑖
where 𝑑𝑖𝑡,𝑢,𝑝
is the average vehicle delay of 𝑓𝑖𝑝 during the unit time step 𝛥𝑡 starting from time 𝑡 in period 𝑢. Its value is equal to the
quotient of the total stop time and the number of stopped vehicles of the traffic flow 𝑓𝑖𝑝 , s; 𝑜𝑡,𝑝
𝑖 represents the throughput, namely
5
the number of vehicles leaving the stop line, of 𝑓𝑖𝑝 during the unit time step 𝛥𝑡 starting from time 𝑡 in peak hours, pcu; 𝑤𝑡𝑖, D and
𝑤𝑡𝑖, O are respectively the weight values corresponding to the average vehicle delay and traffic volume, where 𝑤𝑡𝑖, D is negative and
𝑤𝑡𝑖, O is positive.
Global reward 𝑅𝑡 is the aggregate of rewards across all intersections as follows:
∑
𝐼
𝑅𝑡 = 𝑟𝑡𝑖 (9)
𝑖=1
The long-term cumulative reward 𝑄𝑡𝑖 of agent 𝑖 in state 𝑠𝑡𝑖 is expressed as Eq. (10). 𝑄𝑡𝑖 is the action-value function that requires
deep network fitting in deep reinforcement learning.
𝑄𝑡𝑖 = 𝐸[𝑟𝑡𝑖 + 𝛾𝑟𝑡+1

𝑖 + ⋯ + 𝛾 𝑐 𝑟𝑡+𝑐
𝑖 ] (10)
where 𝛾 is the discount factor to reduce the importance of future rewards and balance current rewards with future rewards,
𝛾 = [0, 1). 𝛾 approaching 0 indicates a prioritization of short-term rewards in action selection, while 𝛾 approaching 1 signifies
that the agent places greater emphasis on the acquisition of long-term rewards through action execution; 𝑐 represents the maximum
period considered for reward calculations.
Eq. (10) is usually simplified as follows:
𝑄𝑡𝑖 = 𝐸[𝑟𝑡𝑖 + 𝛾 max 𝑄𝑡+1

𝑖 ] (11)
3. Methodology
3.1. Overall framework
This paper introduces a collaborative control model for heterogeneous intersections based on a MARL algorithm. The comprehen-
sive network framework is depicted in Fig. 2. The network takes as input the local environmental state 𝑠𝑡 of each agent at time 𝑡 and
the overall traffic state 𝑆 𝑡 of the road network. The output is the joint action-value 𝑄𝑡𝑡𝑜𝑡 encompassing all agents. The calculation of
this value is executed through three modules: spatiotemporal feature extraction, individual value function fitting, and global value
function calculation.
The spatiotemporal feature extraction module incorporates GAT and gated recurrent unit (GRU) (Cho et al., 2014) as the primary
network components. This module takes the local environmental state of each agent as input and generates an output feature matrix
containing spatiotemporal information. Subsequently, this matrix serves as one of the input parameters in the subsequent process
and is fed into the individual value function fitting module. The blue block diagram of Fig. 2 shows in detail the information flow
( ) ( )
of agents 1 and 𝐼 in the spatiotemporal feature extraction module, where 𝑠𝑡1 , 𝑠𝑡1∗ and 𝑠𝑡𝐼 , 𝑠𝑡𝐼 ∗ represent local states of agents 1
and 𝐼, and local states of agents with direct connectedness with 1 and 𝐼. 𝐻1𝑡−1 and 𝐻𝐼𝑡−1 are hidden states of agents 1 and 𝐼 at 𝑡 − 1.
The individual value function fitting module, denoted by the yellow block in the figure, receives input information comprising the
spatiotemporal feature matrix and the global state information 𝑆 𝑡 of the road network. The module outputs the state function value
𝑉̃ 𝑡 and advantage function value 𝐴̃ 𝑡 for each agent in the global environment. Within the individual value function fitting module,
the spatiotemporal feature matrix is employed to fit the agent’s state value function and action advantage function. Simultaneously,
the global state information 𝑆 𝑡 utilized to generate the weight 𝑤𝑡TR and bias 𝑏𝑡TR . The action is determined based on the advantage
function value for each agent, obtained through a greedy policy exploration. The diagram in the purple block illustrates the global
value function calculation module, primarily responsible for calculating the global (joint) action function value 𝑄𝑡𝑡𝑜𝑡 . The input
information for this module comprises the action of each intersection, the global state information 𝑆 𝑡 , and the output results of
the individual value function fitting module. The first two serve as input information for the hybrid network within the module,
generating the weight of the advantage function value. This weight signifies the importance of each agent in the global network.
Details about the component modules are elucidated in subsequent sections.
3.2. Spatiotemporal feature extraction module
Intersection actions are determined by the action function value, and the computation of this value relies on the traffic states of
each intersection and the road network. Hence, precise state representation is pivotal for effective action selection. In Section 2.2, we
defined the traffic state of the intersection at time 𝑡, utilizing traffic density, queuing density, and signal phase as feature parameters
to depict the traffic state. While this expression conveys the environmental state of individual intersections, it lacks the correlation
of traffic flows between adjacent intersections. To enhance the fitting of the action function value for each intersection, we
employ a combined network utilizing the GAT and GRU to further investigate the spatiotemporal correlation between intersections.
Specifically, GAT facilitates the selective acquisition of information from adjacent intersections. Taking intersections as graph nodes
and road segments between intersections as edges, we simplify the road network into a graph structure composed only of points
and lines. The set of adjacent nodes of intersection 𝑖 is denoted as 𝐼𝑖∗ , with intersection 𝑖 having 𝐽𝑖 adjacent nodes. The projection
matrix 𝑤𝑡𝑖 is applied here to linearize the state information of node 𝑖 and its adjacent node 𝑖∗ , thereby altering the dimensions of
the state features. Building upon this, the node features are concatenated in sequence, fed into the attention network denoted as 𝜗,
6
Fig. 2. Model framework.
and projected to a scalar value. Ultimately, by subjecting this scalar value to a nonlinear transformation through the activation unit
LeakyRelu (⋅), we can derive the correlation coefficient 𝜙𝑡𝑖,𝑖∗ between the target intersection and its adjacent intersections at time 𝑡:
( ( ))
𝜙𝑡𝑖,𝑖∙ = LeakyRelu 𝜗 𝑤𝑡𝑖 𝑠𝑡𝑖 ∥ 𝑤𝑡𝑖 𝑠𝑡𝑖∗ (12)
Concatenating operation ∥ leads to 𝜙𝑡𝑖,𝑖∗

≠ 𝜙𝑡𝑖∗ ,𝑖 ,
meaning that attention value is non-symmetrical, which fits the actual situation.
In urban road networks, imbalanced traffic demand results in distinct variations in the impact relationship between adjacent
intersections and the target intersection across different scenarios. For instance, if there is an increased traffic volume from
intersection 𝑖 to intersection 𝑖∗ , the influence of intersection 𝑖 on intersection 𝑖∗ will be more pronounced, 𝜙𝑡𝑖,𝑖∗ > 𝜙𝑡𝑖∗ ,𝑖 , in contrast,
𝜙𝑡𝑖,𝑖∗ has a smaller value, 𝜙𝑡𝑖,𝑖∗ < 𝜙𝑡𝑖∗ ,𝑖 . Subsequently, the sof tmax (⋅) function is employed to normalize the attention coefficients
between node 𝑖 and its adjacent nodes (as in Eq. (13)).
( )
( ) exp 𝜙𝑡𝑖,𝑖∙
𝑡 𝑡
𝛿𝑖,𝑖∙ = sof tmax 𝜙𝑖,𝑖∙ = ∑ ( ) (13)
𝑡
𝑖∗ ∈𝐼 ∗ exp 𝜙𝑖,𝑖∙
𝑖
Ultimately, the state information 𝜍𝑖𝑡 of intersection 𝑖 is obtained by combining the state information from each adjacent node
and the activation function 𝜎 (⋅):
⎛∑ ( )⎞
𝜍𝑖𝑡 = 𝜎 ⎜ 𝑡
𝛿𝑖,𝑖 𝑡 𝑡 ⎟
∙ 𝑤𝑖 𝑠𝑖∙ (14)
⎜∗ ∗ ⎟
⎝𝑖 ∈𝐼𝑖 ⎠
Utilizing the network structure delineated in Eqs. (12)–(14), we selectively incorporate the state information of adjacent
intersections at the same moment into the state representation of the target intersection. This approach yields a more comprehensive
7
state information set under conditions of incomplete observation. Subsequently, we introduce the GRU network to capture the
temporal relationships within the intersection state space. This enables the integration of variations and underlying patterns inherent
in historical traffic states into the state expression process. Preceding this, a single-layer fully connected neural network is employed
to uniformly convert the state information 𝜍𝑖𝑡 of intersection 𝑖 into a high-dimensional vector space (as in Eq. (15)). This facilitates
more effective learning and extraction of correlation features between variables.
𝑡
𝜍𝑖1 = 𝑤𝑡𝑖,𝜍1 𝜍𝑖𝑡 (15)
where 𝑤𝑡𝑖,𝜍1is the learnable weight factor of a fully connected neural network.
GRU usually consists of two components: the reset gate and the update gate. Notably, these gating mechanisms retain information
in long-term sequences without discarding it over time, irrespective of its relevance solely to predictions at the current moment.
More precisely, the process involves utilizing the hidden state 𝐻𝑖𝑡−1 inherited from the previous moment, representing the feature
set containing historical traffic information (with the initial moment 𝐻𝑖𝑡−1 set to 0). Additionally, the state information 𝜍𝑖𝑡 of the
target node, integrated with information from adjacent nodes, serves as input to obtain the reset gate signal 𝜏𝑖𝑡 and update gate
signal 𝜂𝑖𝑡 , as shown in Eqs. (16) and (17).
( )
𝜏𝑖𝑡 = 𝜎 𝜍𝑖1 𝑡
𝑤𝑡𝑖,𝜍𝜏 + 𝐻𝑖𝑡−1 𝑤𝑡𝑖,h𝜏 + 𝑏𝑡𝑖,𝜏 (16)
( )
𝜂𝑖𝑡 = 𝜎 𝜍𝑖1
𝑡
𝑤𝑡𝑖,𝜍𝜂 + 𝐻𝑖𝑡−1 𝑤𝑡𝑖,h𝜂 + 𝑏𝑡𝑖,𝜂 (17)
where 𝑤𝑡𝑖,𝜍𝜏 , 𝑤𝑡𝑖,h𝜏 , 𝑤𝑡𝑖,𝜍𝜂 and 𝑤𝑡𝑖,h𝜂 are learnable weight factors; 𝑏𝑡𝑖,𝜏 and 𝑏𝑡𝑖,𝜂 are bias terms; 𝜎 (⋅) stands for a nonlinear activation
function, being able to transform the data to values ranging from 0 to 1, thereby acting as a gating signal.
When 𝜏𝑖𝑡 is approaching 0, the model will discard the hidden information in the historical traffic state, leaving only the current
input information. Conversely, when 𝜏𝑖𝑡 is approaching 1, the model will assume that all historical information is valid, and add it to
the state information of the intersection at the current moment. Leveraging the reset gate gating signal, we can compel the hidden
state to discard any information from the future deemed irrelevant to predictions as outlined in Eqs. (18) and (19).
𝐻̃ 𝑖𝑡 = tanh(𝜍𝑖1
𝑡
𝑤𝑡𝑖,𝜍𝐻 + 𝜏𝑖𝑡 ⊙ 𝐻𝑖𝑡−1 𝑤𝑡𝑖,hH + 𝑏𝑡𝑖,H ) (18)
𝐻𝑖𝑡 = (1 − 𝜂𝑖𝑡 ) ⊙ 𝐻𝑖𝑡−1 + 𝜂𝑖𝑡 ⊙ 𝐻̃ 𝐼𝑡 (19)
For the convenience of description, we use the variable 𝑤 to uniformly represent the weight coefficients appearing in the paper and
use the variable b to represent the bias term, with different upper and lower subscripts to distinguish them. Building on this, the
update gating signal regulates the degree to which the state information from the previous moment influences the current state,
thereby updating the hidden state 𝐻𝑖𝑡 . Subsequently, a fully connected layer is once again employed to alter the vector dimension,
⃗ 𝑡:
ultimately yielding the final output result 𝐻 𝑖
⃗ 𝑡 = 𝑤𝑡 𝐻 𝑡
𝐻 (20)
𝑖 𝑖,H1 𝑖
where tanh (⋅) represents hyperbolic tangent activation function, for limiting data to the range [−1, 1]; (1 − 𝜂𝑖𝑡 ) ⊙ ℎ𝑡−1
𝑖 stands for the
selective forgetting of historical traffic hidden states; 𝜂𝑖𝑡 ⊙ 𝐻̃ 𝑖𝑡 is the selective memory of the candidate hidden states of the current
node.
3.3. Calculation of individual value
The comprehensive individual value function network is depicted in Fig. 3. The individual value function is synonymous with
the individual action-value function, with its value representing the anticipated return that the agent 𝑖 can achieve by executing
action 𝑎𝑡𝑖 in state 𝑠𝑡𝑖 , denoted as 𝑄𝑡𝑖 . A higher action value for any intersection in the road network is usually obtained from the
intersection being in a more favorable traffic condition or through the effectiveness of executing correct actions, i.e., a judicious
adjustment of the green time. Therefore, relying solely on individual function values hinders accurate differentiation of the causes
( )
behind traffic efficiency improvements. To address this, we employ a dual network to decompose the action-value function 𝑄𝑡𝑖 𝑠𝑡𝑖 , 𝑎𝑡𝑖
( ) ( )
of the intersection into the sum of the state value function 𝑉𝑖𝑡 𝑠𝑡𝑖 and the advantage function 𝐴𝑡𝑖 𝑠𝑡𝑖 , 𝑎𝑡𝑖 . The state function value
represents the expected return agent 𝑖 can achieve by following a policy in a given state. Specifically, the state function value
equals the average of all action-value functions in a specific state concerning the action probability. The advantage function value
represents the value of the current action relative to the average value: if the advantage is greater than zero, it indicates that the
action is effective compared to the average action; if the advantage is negative, it implies that the current action is not as effective
as the average action in optimizing the environment. The fitting network corresponding to the state value function and advantage
function is expressed as Eqs. (21)–(24). The networks are similar in structure, and both take the output of the spatiotemporal graph
attention network as input and consist of three fully connected layers. The output yields the state value 𝑉𝑖𝑡 of the intersection at
time 𝑡 and the advantage value 𝐴𝑡𝑖 for each action in this state.
(( ) )
𝑡 ⃗ 𝑡 ∥ 𝑎𝑡−1 𝑤𝑡 + 𝑏𝑡
𝛼𝑖1,V = Relu 𝐻 𝑖 𝑖 𝑖,V1 𝑖,V1
(21)
8
Fig. 3. Structure of individual value function network. The function of the red network in the figure is to generate the weight and deviation values corresponding
to the state value function and the advantage value function. The red icons of the network are the absolute value operation and the Relu activation function
from top to bottom.
( )
𝑡 𝑡
𝛼𝑖2,V = Relu 𝛼𝑖1,V 𝑤𝑡𝑖,V2 + 𝑏𝑡𝑖,V2 (22)
𝑉𝑖𝑡 = 𝛼𝑖2,V
𝑡
𝑤𝑡𝑖,V3 + 𝑏𝑡𝑖,V3 (23)
𝐴𝑡𝑖 = 𝛼𝑖2,V
𝑡
𝑤𝑡𝑖,A3 + 𝑏𝑡𝑖,A3 (24)
Through the aforementioned procedure, we acquire the state function value and advantage function value for the individual
intersections. Despite utilizing the GAT network to derive state parameters for adjacent intersections, the information available to
the agent is still local and constrained. The traffic environments associated with different agents may exhibit significant differences
at the same time, thereby augmenting the complexity of calculating the joint action-value function. Therefore, we compute the
individual state function value 𝑉̃𝑖𝑡 and advantage function value 𝐴̃ 𝑡𝑖 for each agent in the global environment:
( )
𝑡 𝐭 𝑡 𝑡
𝛼𝑖1, ̃
V
= Relu 𝐒 𝑤 ̃
𝑖,V1
+ 𝑏 ̃
𝑖,V1
(25)
| 𝑡 |
𝑤𝑡𝑖,TR = |𝛼𝑖1, 𝑤𝑡 + 𝑏𝑡𝑖,𝑉̃ 2 | (26)
| Ṽ 𝑖,𝑉̃ 2 |
( )
𝑡 𝐭 𝑡 𝑡
𝛼𝑖1,Ã = Relu 𝐒 𝑤𝑖,A1̃ + 𝑏𝑖,A1̃ (27)
𝑏𝑡𝑖,TR = 𝛼𝑖1,
𝑡 𝑡 𝑡
̃ + 𝑏𝑖,𝐴2
̃ 𝑤𝑖,𝐴2
A ̃ (28)
𝑉̃𝑖𝑡 = 𝑤𝑡𝑖,TR 𝑉𝑖𝑡 + 𝑏𝑡𝑖,TR (29)
𝐴̃ 𝑡𝑖 = 𝑤𝑡𝑖,TR 𝐴⃗𝑡𝑖 + 𝑏𝑡𝑖,TR (30)

𝑡
where the value of 𝑊𝑖,TR is positive to ensure the monotonicity of the local function and the global function; 𝐴⃗𝑡𝑖 is the advantage
value of the best action selected based on the greedy policy, that is, the maximum value in 𝐴𝑡𝑖 .
3.4. Calculation of global value
As shown in Fig. 4, the joint action value 𝑄𝑡𝑡𝑜𝑡 is computed based on the output from the individual value function network,
combined with the road network’s traffic state 𝑆 𝑡 . The magnitude of the function value indicates the impact of the joint action 𝑎𝑡𝑡𝑜𝑡
on enhancing the overall state of the road network environment in the current condition. However, in the individual value function
network, the selection of agent actions represents the optimal outcome only for individual intersections, which may diverge from
the optimal choice in global conditions. To ensure consistency between the two, the model construction must adhere to the IGM
constraint (Wang et al., 2020a) as illustrated in Eq. (31).
( ) [ ( ) ( )]
arg max 𝐴𝑡𝑡𝑜𝑡 𝑆 𝑡 , 𝑎𝑡𝑡𝑜𝑡 = arg max 𝐴𝑡1 𝑠21 , 𝑎𝑡1 , … , arg max 𝐴𝑡𝐼 𝑠2𝐼 , 𝑎𝑡𝐼 (31)
( )
To this end, we once again employ a dual network (Wang et al., 2016) to decompose the joint action-value function 𝑄𝑡𝑡𝑜𝑡 𝑆 𝑡 , 𝑎𝑡𝑡𝑜𝑡
𝑡
( 𝑡) 𝑡
( 𝑡 𝑡 )
into the sum of the state value function 𝑉𝑡𝑜𝑡 𝑆 and the advantage function 𝐴𝑡𝑜𝑡 𝑆 , 𝑎𝑡𝑜𝑡 . This approach aligns with the principles of
9
Fig. 4. Mixing network structure. The x-shaped purple icon in the figure represents multiplication, and the meanings of the other icons are consistent with that
of Fig. 3.
RL algorithms, where the action-value function and the state-value function become numerically equivalent when the agent executes
𝑡
the optimal action 𝑎 ⃖⃗𝑖 . Under such circumstances, the individual and global action advantage functions satisfy the relationship
illustrated in Eq. (31). Therefore, as Eq. (32), absent other constraints, optimizing the global action advantage function and all
individual action advantage functions to attain their maximum values will lead to the same outcome, thereby establishing that the
individual optimum aligns with the overall optimum. The consistency constraint of action selection between the entire road network
and individual intersections is thus transformed into a value range constraint for the advantage function.
( 𝑡 ) ( 𝑡) ( ) ( )
𝐴𝑡𝑡𝑜𝑡 𝑆 𝑡 , 𝑎
⃖⃗𝑡𝑜𝑡 = 𝐴𝑡𝑖 𝑠𝑡𝑖 , 𝑎
⃖⃗𝑖 = 0 𝑎𝑛𝑑 𝐴𝑡𝑡𝑜𝑡 𝑆 𝑡 , 𝑎𝑡𝑡𝑜𝑡 < 0, 𝐴𝑡𝑖 𝑠𝑡𝑖 , 𝑎𝑡𝑖 ≤ 0 (32)
𝑡
where 𝑎 ⃖⃗𝑡𝑜𝑡 is the joint optimal action.
At the same moment, for all agents, the overall state of the road network remains constant, and the historical joint state is
a predetermined value. Once the policy is established, the value of the intersection state remains unchanged with the actions
undertaken by the agent. Accordingly, the sum of the state function values of each agent at time 𝑡 is directly employed as the
state function value 𝑉𝑡𝑜𝑡 𝑡 in the joint state:
∑
𝐼
𝑡
𝑉𝑡𝑜𝑡 = 𝑉̃𝑖𝑡 (33)
𝑖=1
The advantage function provides a numerical representation of the value advantage associated with each action within the action
set relative to the average level. Its numerical computation relies on the agent’s state and the set of available actions in that state.
Despite the uniform state of the road network and predetermined optional actions at a given moment, the impact of executing the
same action on the road network state at different intersections varies. Consequently, directly computing the cumulative action
advantage function value is inappropriate. In this study, leveraging the overall state information 𝑆 𝑡 of the road network at time 𝑡
and the collective action 𝑎𝑡𝑡𝑜𝑡 taken by each intersection, we calculate the significance 𝜆𝑡𝑖 of each intersection’s action adjustment
concerning the overall state change of the road network. This is achieved through the hybrid network (Wang et al., 2020a) as follow:
[( ) ]
𝑡
𝛼𝑖1,𝜑 = Relu 𝑆 𝑡 ∥ 𝑎𝑡𝑡𝑜𝑡 𝑤𝑡𝑖,𝜑1 + 𝑏𝑡𝑖,𝜑1 (34)
( )
𝜑𝑡𝑖 = 𝜎 𝛼𝑖1,𝜑
𝑡
𝑤𝑡𝑖,𝜑2 + 𝑏𝑡𝑖,𝜑2 (35)
( )
𝑡
𝛼𝑖1,M = Relu 𝑆 𝑡 𝑤𝑡𝑖,M1 + 𝑏𝑡𝑖,M1 (36)
| 𝑡 |
𝑤𝑡𝑖,M = |𝛼𝑖1,M 𝑤𝑡𝑖,M2 + 𝑏𝑡𝑖,M2 | (37)
| |
𝜆𝑡𝑖 = 𝑤𝑡𝑖,M 𝜑𝑡𝑖 (38)
Hence, the advantage function value 𝐴𝑡𝑡𝑜𝑡 in the joint state is:
∑
𝐼
𝐴𝑡𝑡𝑜𝑡 = 𝜆𝑡𝑖 𝐴̃ 𝑡𝑖 (39)
𝑖=1
10
In the given equation, the activation function 𝜎 confines 𝜑𝑡𝑖 within the interval of 0–1, and taking the absolute value ensures that
𝑤𝑡𝑖,M > 0. Therefore, the weight value 𝜆𝑡𝑖 is a positive value. A positive correlation exists between the advantage function value of an
individual intersection and the overall advantage function value of the road network. This further guarantees consistency between
action selection and policy.
During the optimization process of the model, if the network update rate is excessively rapid, the agent may struggle to obtain a
stable and specific target value, complicating the formulation of a loss function for optimizing the network structure. Therefore, in
this study, we establish the loss function based on two network structures – the prediction network and the target network – having
the same configuration but different update speeds. Specifically, the prediction network updates parameters in each iteration, while
the target network synchronizes the latest weights from the evaluation network every generation to update the network. To make
better use of simulation data, we define 𝑠𝑡𝑖 , 𝑆 𝑡 , 𝑎𝑡𝑖 , 𝑠𝑡+1 𝑡 𝑡 𝑡
𝑖 , 𝑆 , 𝑅 and the 0–1 variable 𝜌 to determine whether the new state is the
terminal state as an experience. 𝜌𝑡 = 1 signifies that time 𝑡 + 𝛥𝑡 is the terminal state. The experience pool serves as a dynamic
repository comprised of experiences. Over time, as the agent iteratively updates its state and receives rewards through ongoing
interactions with the environment, the experience pool undergoes continuous updates. Upon reaching a predefined threshold for
the volume of data in the experience pool, the agent engages in a playback process, utilizing the records within the pool for dynamic
experience. This involves fitting the target label 𝑄̂ 𝑡𝑡𝑜𝑡 (as in Eq. (40)), calculating the loss function 𝐿𝑡𝑡𝑜𝑡 (𝜃) (as in Eq. (41)), and updating
the network weights using the gradient descent method.
{ 𝑡
𝑅 if 𝜌𝑡 = 1
𝑄̂ 𝑡𝑡𝑜𝑡 = ̂ 𝑡+1 (40)
𝑅 + 𝛾 max 𝑄𝑡𝑜𝑡 if 𝜌𝑡 = 0
𝑡
∑
𝜓̃
( 𝑡 )2
𝐿𝑡𝑡𝑜𝑡 (𝜃) = 𝑄̂ 𝑡𝑜𝑡 − 𝑄𝑡𝑡𝑜𝑡 (41)
𝜓=1
where 𝑄̂ 𝑡+1
𝑡𝑜𝑡 is the joint action-value of the target network at time 𝑡 + 1; 𝜓
̃ is the number of data entries randomly selected from the
experience pool; 𝜃 is the network weight of the prediction network.
Below is the pseudocode of the MARL_SGAT Algorithm. At each time step, the agent obtains a representation of the intersection
state 𝑠𝑡𝑖 and feature set of road network state 𝑆 𝑡 from the environment and relies on the individual value function network to fit
the value functions of all optional actions at the intersection in the current state. Subsequently, each agent selects and executes the
action 𝑎𝑡𝑖 based on the greedy strategy, and the environment then moves to the next state 𝑠𝑡+1 𝑡
𝑖 , and returns the reward 𝑟𝑖 . After all
𝑡
agents have completed their actions, the model will obtain the overall traffic status 𝑆 of the road network and calculate the overall
reward function 𝑅𝑡 under the joint action.
4. Experiments
4.1. Experimental setting
In this section, we utilize the traffic simulation software simulation of urban mobility (SUMO) to conduct simulation experiments
using both synthetic and real datasets, to evaluate the proposed multi-intersection signal cooperative control model. By invoking
the library function in SUMO, we directly retrieve intersection state information, including details such as vehicle count, location,
vehicle state, signal phase, and other parameters. The execution of actions and the acquisition of evaluation indicators during model
analysis are implemented using the corresponding functions of SUMO. In synthetic and real datasets, each vehicle is denoted as
(ID, arr_t,arr_p,lea_t,lea_p), where ID serves as the unique identifier. The remaining variables include the time and location of the
vehicle’s arrival on the road network and the time and location of its departure. The simulation duration for each round is set to
3600 s for both synthetic and real datasets. Each agent undergoes training for 200 rounds, employing a training data batch size of
32, a discount factor of 0.9, and an experience replay area size of 1000. In addition, according to the hyperparameter tuning results,
the time interval 𝛥𝑡 between two consecutive actions is set to 12 s; the green light adjustment time 𝛥𝑡̃ is set to 4 s; the value of the
discount coefficient 𝑞 is 3. Other parameters involved in the model are shown in Table 1. For ease of description, we have simplified
the expression of the three modules of the spatiotemporal feature extraction module, individual value function fitting module, and
global value function calculation module as SFE module, IVFF module, and GVFC module.
Synthetic Dataset: The corresponding road network environment for this dataset encompasses a total of 9 intersections, each
configured as a four-way intersection with six lanes. For any given entrance, the lanes contain left turn lanes, straight lanes, and right
turn lanes in sequence from the inner to outer direction. The probability of a vehicle taking different directions at the intersection
is distributed as follows: 15% for left turns, 75% for straight travel, and 10% for right turns. Each intersection follows a four-
phase signal control scheme. The distance between two adjacent intersections is set at 500 m, with a road speed limit of 16.7 m/s.
Considering actual traffic conditions, we categorize the daily traffic flow into two periods: off-peak and peak. In these periods,
the vehicle arrival rates in each inbound direction (left, straight, and right lanes) are 0.73 pcu/s and 0.97 pcu/s, respectively. In
addition, in order to ensure that the data changes in the synthetic data set are more consistent with the real situation, we adjust
the vehicle arrival rate to a certain extent during the training process of the model. Specifically, the agent will determine the data
type (peak hour data/off-peak hour data) in each round of the simulation experiment based on a Bernoulli random variable (with a
value of 0 or 1). On this basis, every ten generations, the agent will generate a random variable between [-0.05, 0.05], and adjust
the vehicle arrival rate of the road network based on this variable until the end of training. An example of adjusting the vehicle
arrival rate is as follows: if the random number generated at the beginning of the 10th training is −0.03, then during the 10–20
11
Algorithm 1 MARL_SGAT Algorithm

Input: Episode length 𝑀, Time length 𝑇 , Discount factor 𝛾, Renewal frequency 𝑚.
Output: Parameter 𝜃.
1: Initialize predict 𝑄 network with random weights 𝜃.
2: Initialize target 𝑄 network with random weights 𝜃. ̂
3: for 𝑒𝑝𝑖𝑠𝑜𝑑𝑒 = 1 → 𝑀 do
4: for 𝑡 = 1 → 𝑇 do
5: Get global state 𝑆 𝑡 .
6: for 𝑖 = 1 → 𝐼 do
7: Get individual state 𝑠𝑡𝑖 .
8: if 𝑟𝑎𝑛𝑑𝑜𝑚 < 𝜀 then
9: With probability 𝜀 select action 𝑎𝑡𝑖 .
10: else if 𝑟𝑎𝑛𝑑𝑜𝑚 >= 𝜀 then
11: Sample action from policy 𝜋𝑖 .
12: end if
13: Execute action 𝑎𝑡𝑖 .
14: get reward 𝑟𝑡𝑖 and next state 𝑠𝑡+1 𝑖 .
( )
15: Add 𝑠𝑡𝑖 , 𝑎𝑡𝑖 , 𝑠𝑡+1
𝑖 to experience 𝑒.
16: 𝑠𝑡𝑖 ← 𝑠𝑡+1
𝑖
17: end for
18: Calculate global reward 𝑅𝑡 .
19: Get next global state 𝑆 𝑡+1 .
( )
20: Add 𝑆 𝑡 , 𝑅𝑡 , 𝑆 𝑡+1 , 𝜌𝑡 to experience 𝑒.
21: Store 𝑒 into the experience pool 𝐸.
22: 𝑆𝑖𝑡 ← 𝑆𝑖𝑡+1
23: 𝑡←𝑡+1
24: end for
25: Sample random mini-batch of transition from 𝐸.
26: Calculate loss 𝐿𝑡𝑡𝑜𝑡 (𝜃) and update parameter 𝜃.
27: if 𝑒𝑝𝑖𝑠𝑜𝑑𝑒∕𝑚 = 0 then
28: 𝜃̂ ← 𝜃
29: end if
30: end for
Table 1
Parameters of MARL_SGAT.
Module Network Input size Output size Hidden size
GAT 9 × 16 9 × 32 –
FC1 1 × 288 1 × 576 –
SFE module GRU 9 × 64, 9 × 64 9 × 64 9 × 64
FC2 1 × 576 1 × 288 –
Individual state value 9 × 32, 9 × 1 9 × 1 9 × 128, 9 × 32
Individual advantage value 9 × 32, 9 × 1 9 × 3 9 × 128, 9 × 32
IVFF module Global-individual
9 × 38, 9 × 1 9 × 1 1 × 64, 1 × 64
State value
Global-individual advantage value 9 × 38, 9 × 1 9 × 1 1 × 64, 1 × 64
Global state value 9 × 1 1 –
GVFC module Global advantage value 1 × 38, 1 × 9, 9 × 64 1 1 × 64, 1 × 64
iterations, the vehicle arrival rate during peak hours of the road network will become (1–0.03)*0.73 = 0.71 pcu/s; the vehicle arrival
rate during off-peak hours will become (1–0.03)*0.97 = 0.94 pcu/s. The above data adjustment methods help simulate situations
under different traffic flow conditions so that the model can better adapt to different traffic scenarios and enhance the training
performance of the model.
Real Dataset: Satellite images and simulation images depicting actual traffic scenarios were utilized (Fig. 5). Leveraging
monitoring equipment in this region, we conduct traffic flow counts at the specified intersections throughout both peak (6:30–8:30)
and off-peak periods (14:00–16:00) over one month. The gathered data is then compiled into a real dataset. Table 2 shows the basic
environmental information of each intersection, such as the number of signal phases, the number of lanes, the length of the signal
cycle, and the overall vehicle arrival status of the road network. The distribution of vehicles has been added to the supplementary
file in the form of a table (letters a to i are entrance numbers in the road network).
12
Table 2
Real environment settings.
Parameters Description Value
I Number of intersections in the road network 9
𝑤𝑡𝑖, D Weight value corresponding to the average vehicle delay −0.1
𝑤𝑡𝑖, O Weight value corresponding to the traffic volume 1.0
𝑃 Set of intersection signal phases 3, 3, 4, 4, 4, 4, 3, 4, 4
𝑀 Set of intersection lanes 10, 10, 18, 18, 18, 24, 15, 16, 22
𝐶𝑜 Set of the initial cycle lengths of intersections during off-peak hours (s) 70, 83, 122, 112, 114, 131, 82, 95, 134
𝐶𝑝 Set of initial cycle lengths of intersections during peak hours (s) 75, 87, 123, 99, 116, 129, 98, 110, 148
Arrival rate (Mean) Average vehicle arrival rate of road network (pcu/300 s) 103.55, 177.20
Arrival rate (SD) Standard deviation of vehicle arrival rate in road network (pcu/300 s) 15.15, 48.92
Fig. 5. Real traffic scene map.
4.2. Comparison scheme and evaluation metric
To demonstrate the effectiveness of the proposed algorithm, we compare MARL_SGAT with the following baseline algorithms,
including traditional algorithms and RL algorithms.
FixedTime: The traffic signal controller operates continuously based on predetermined traffic signal schemes, offering advantages
such as cost-effectiveness and easy maintenance. Presently, it is a widely employed method of intersection signal control. In this
study, the timing schemes for each intersection within both the synthetic and real road networks during various periods were
computed using the Webster algorithm.
IQL: Each intersection is autonomously governed by an individual intelligent agent, and there exists no communication
mechanism among these agents. Actions are solely chosen based on the holistic environmental conditions to realize cooperative
control of multiple signalized intersections.
DDQN-PER: The agent controls the traffic signals of the regional road network through the DQN network with an experienced
playback mechanism, and defines the intersection environmental state and reward function simply and consistently during the model
construction process (Bouktif et al., 2023).
MGMQ: The multi-layer graph computing method is introduced into DDQN, and GAT and an improved GraphSAGE are used
to extract the geometric and spatial characteristics of intersection individuals and intersection road networks; each intersection is
controlled by a separate agent, and all agents share a set of parameters (Wang et al., 2024).
CoLight: The intelligent agent realizes collaborative control of regional road network traffic signals based on the combined
network of DQN and GAT. On this basis, the attention mechanism is introduced into the network structure to promote agent
communication (Wei et al., 2019).
MN_Light: The agent mines the temporal characteristics of historical traffic flow status and action information by adding
bidirectional LSTM to DQN to improve the control performance of the model (Kong et al., 2022).
QMIX: Considering the collaborative and competitive relationships among intersections, individual intelligent agents are assigned
to control each intersection. Leveraging a blending network, the joint action function value is assessed as an intricate nonlinear
combination of each agent’s action function values, with structural constraints imposed to maintain consistency between global and
individual policies (Rashid et al., 2020).
13
Fig. 6. Comparison of algorithm convergence.
Table 3
Performance of algorithms on synthetic data and real-world data.
Algorithm Synthetic dataset Real dataset
Delay/s Travel speed Stops Throughput pcu Delay/s Travel speed Stops Throughput (pcu)
(m/s) (m/s)
1 FixedTime 74.55 9.17 2.36 5612.00 124.53 9.69 2.88 3807.00
2 IQL 74.29 ± 0.68 9.33 ± 0.08 2.24 ± 0.09 5633 ± 34.98 98.07 ± 6.75 9.55 ± 0.06 2.21 ± 0.09 3926 ± 46.53
3 DDQN-PER 74.33 ± 1.06 9.39 ± 0.19 2.15 ± 0.17 5644 ± 18.75 100.42 ± 7.85 9.52 ± 0.12 2.24 ± 0.09 3927 ± 61.87
4 MGMQ 72.51 ± 0.77 9.48 ± 0.09 2.03 ± 0.15 5679 ± 32.25 95.03 ± 6.11 9.71 ± 0.09 2.23 ± 0.03 3940 ± 50.17
5 CoLight 73.80 ± 0.60 9.41 ± 0.09 2.12 ± 0.12 5657 ± 24.46 96.12 ± 5.64 9.72 ± 0.09 2.21+0.14 3933+69.78
6 MN_Llight 74.35 ± 0.71 9.37 ± 0.10 2.15 ± 0.07 5643 ± 30.51 97.79 ± 5.23 9.65 ± 0.11 2.20+0.10 3929 ± 47.91
7 QMIX 73.71 ± 0.77 9.42 ± 0.18 2.03 ± 0.03 5696 ± 28.24 95.22 ± 3.60 9.66 ± 0.06 2.33 ± 0.06 3943 ± 31.00
8 QRC-TSC 71.25 ± 2.07 9.51 ± 0.26 2.01 ± 0.11 5714 ± 20.96 94.69 ± 5.27 9.79 ± 0.07 2.24 ± 0.12 3960 ± 31.33
9 MARL_SGAT 70.72 ± 0.93 9.77 ± 0.12 1.89 ± 0.08 5730 ± 6.58 89.68 ± 3.74 10.33 ± 0.05 2.07 ± 0.08 4016 ± 22.30
QRC-TSC: The same processing method as QMIX is adopted to ensure the consistency of global-individual policies. Each
intersection is equipped with a separate communication network and can selectively exchange policy messages of variable
lengths (Bokade et al., 2023).
4.3. Simulation results and discussion
4.3.1. Contrast experiment

In order to ensure the reliability of the results, we conduct 10 test experiments using different random seeds after the model
converged, and the results are summarized in Table 3, with algorithms being sequentially numbered from 1 for clarity. Notably,
the DRL-based control models (models 2–9) exhibit superior control effectiveness compared to the fixed time signal control method
across both datasets. Taking the average vehicle delay in the real road network as an example, models 2–9 show reductions of
21.25%, 19.36%, 23.69%, 22.81%, 21.47%, 23.54%, 23.72%, and 27.99%, respectively, in comparison to the fixed time method.
This indicates the evident advantages of employing DRL-based control models in addressing traffic signal issues. IQL can directly
select the optimal action to adapt to the current environment based on the overall status of the road network. In theory, it can show
better control performance, but the independent learning method also makes this method more likely to fall into local optimality.
Therefore, compared with other DRL algorithms involving information interaction, IQL does not significantly improve the signal
control performance of road networks. The model convergence speed is relatively slow (as shown in Fig. 7). However, due to the
relatively simple network structure, the average training time of IQL is shorter (see Fig. 6).
DDQN-PER has made enhancements to the fundamental elements and model structure of the RL framework, and its control
effectiveness aligns closely with IQL. In terms of model convergence, the settings of dual networks and experience replay mechanisms
have improved the stability of DDQN-PER in action value function learning to a certain extent, hence the model converges slightly
faster than IQL. MGMQ enhances the agent feature extraction process using DDQN, leverages GAT to aggregate lane information
within intersections, and employs GraphSAGE to capture spatiotemporal information from adjacent intersections, thereby providing
rich information for agent action selection. Consequently, the algorithm demonstrates significantly superior control performance
compared to most baseline methods. Taking the average delay per vehicle as an example, in real data, under MGMQ, the delay is
3.10% less than that under IQL, 5.37% less than that under DDQN-PER, 1.15% less than that under CoLight, and 2.82% less than
that under MN_Light. However, when compared with the value decomposition method of partial global optimal control, MGMQ
still falls short in control efficacy. Furthermore, MGMQ incorporates a parameter-sharing mechanism, enabling each agent to learn
from the experiences of others. This approach mitigates the need for each agent to commence learning from scratch, thereby
14
Fig. 7. Average reward during training process.
enhancing learning efficiency and reducing computational costs. CoLight and MN_Light consider the spatiotemporal correlations
in road network traffic flows, utilizing GAT and bidirectional LSTM networks to extract spatiotemporal dependencies in traffic flow
data, resulting in improved control over road network traffic flow. However, during the algorithm design process, the agent only
selects the optimal action based on the individual action-value function. Despite the introduction of GAT and LSTM increasing agent
communication, the optimal choices based on individual action-value functions still do not equate to optimal choices for the global
road network, leaving a possibility for further optimization of their control effectiveness, and the convergence of the two is also at
a medium level.
In comparison to the preceding five model methods, QMIX exhibits superior control effectiveness, compared with MN_Light,
the signal control method based on QMIX can increase the throughput of the road network while reducing the average vehicle
delay by 2.57s (real data set). This is due to the special structural design of QMIX. On the one hand, QMIX captures the deep
connections between road network features by incorporating GRU units into agent networks, allowing the agent to more accurately
describe the environmental state and then select actions. On the other hand, using hybrid networks, QMIX evaluates the joint
action function value as a complex nonlinear combination of each agent’s action function value and imposes structural constraints
to maintain optimal consistency between global and individual policies. The design of consistency constraints allows the agent to
select the optimal action suitable for the overall environment based only on the local environment state, reducing the state dimension
in complex environments. However, the increase in the number of network layers still inevitably increases the training time of
the model. QRC-TSC shares a similar control concept with QMIX but assigns a dedicated communication network to each signal
agent. Based on this network, the semaphore agent can flexibly select message recipients and exchange variable-length characteristic
messages with them, which makes QRC-TSC emerge as the optimal model method among the baseline methods across both datasets.
In this study, we design a state representation that comprehensively encapsulates the actual condition of the intersection and
formulate a reward function that is capable of precisely measuring the impact of actions. Through an amalgamated network,
employing GAT-GRU, we identify the spatiotemporal dependencies within traffic flow data. Furthermore, we simplify the search
process for the optimal action-value function through a decomposition structure within the double dual network, ensuring the
consistent selection of the global optimum for the road network and the individual optimum for the intersection. This accelerates
network convergence. Therefore, in comparison to other baseline methods, our approach exhibits superior performance in evaluation
metrics such as vehicle delay and speed. To more effectively illustrate the optimization performance of each model, we present the
training rewards for some models as Fig. 7.
Analyzing the above figures and tables reveals a significantly superior optimization control performance of MARL_SGAT
compared to other baseline methods. This advantage is particularly prominent in real data sets. In the real data set, even compared
to the optimal method QRC-TSC among the baseline methods, MARL_SGAT still demonstrates a 5.59% reduction in average vehicle
delay, a 7.59% decrease in the number of stops, a 6.39% increase in average vehicle speed, and a 1.41% increment in the number
of vehicles served by the road network. In the synthetic data set, relative to QRC-TSC, MARL_SGAT exhibits a 3.5% reduction in
average vehicle delay, a 5.97% decrease in the number of stops, a 5.83% increase in average vehicle speed, and a marginal 0.46%
rise in the number of vehicles served by the road network.This is because there is relatively little room for improvement in the road
network signal control performance under the synthetic data set. Based on the content in Section 4.1, we know that the synthetic
data set corresponds to a homogeneous road network composed of nine intersections, and its traffic conditions, including flow and
turning ratio, are relatively regular and stable. Since it is not derived from a real road network, the actual signal timing scheme
is not included in the synthetic data set. Therefore, the initial signal timing scheme under this data set is the fixed signal timing
scheme calculated by Webster’s algorithm. Webster’s algorithm has better control performance in a stable traffic environment (Araghi
et al., 2015), hence the timing scheme under initial conditions can achieve better control performance. Under such conditions, it is
reasonable that the optimization effects of each model are not obvious.
Regarding model convergence, there is no significant difference in the convergence algebra and average convergence time of
MARL_SGAT and QRC-TSC. Both networks are essentially value decomposition networks that consider individual-global consistency
15
Fig. 8. Performance of algorithms under the new reward function and old reward function in the real dataset.
Table 4
Performance of ablation models.
Algorithm Synthetic dataset Real dataset
Delay (s) Travel speed (m/s) Stops Delay (s) Travel speed (m/s) Stops
MARL_SGAT 69.72 10.17 1.89 89.68 10.33 2.07
None_ Feature extraction 74.79 9.52 2.05 97.15 9.61 2.21
None_GAT 73.89 9.21 2.10 96.79 10.26 2.22
None_GRU 74.29 9.18 2.08 96.68 10.52 2.24
constraints. The difference is that MARL_SGAT takes advantage of the easier-to-learn advantage function consistency constraints and
uses a dual dueling network to decompose the action value function to speed up network convergence; QRC_TSC directly adopts
the consistency constraint of the action value function and sets up a special communication network to determine the interactive
information between agents.
4.3.2. Ablation experiment

Comparison of heterogeneous reward function To further investigate the efficacy of the novel reward function introduced in this
study, we construct a histogram, as depicted in Fig. 8 (utilizing data from the real data set). The blue bars illustrate the model’s
performance under our reward function, while the orange ones reflect the simulation outcome from Wei et al. (2018) which relies on
a straightforward combination of traditional reward functions. The traditional metrics used here encompass easily accessible traffic
parameters such as queuing time, queuing length, or average vehicle delay. As illustrated in the figure, most model methods exhibit
superior control performance under our proposed reward function. This improvement can be attributed to the reward function’s
design. In real road networks, intersections often display substantial heterogeneity, necessitating distinct evaluation criteria for signal
agents to address different environments. The reward function introduced in this study enables each agent to holistically assess the
impact of action execution on its intersection environment in the current state. This assessment is based on structural information
and traffic flow data from the agent itself, as well as information from upstream and downstream intersections. Subsequently, agents
can then generate the most suitable signal control policy.
Comparison of the attention modules of the space–time graph Based on the MARL_SGAT model, we separate the spatiotemporal feature
extraction module, GAT, and GRU modules from the model individually. The control performance of each model is assessed then on
road network traffic flow using both synthetic and real data sets. The outcomes are presented in Table 4. A substantial decline in the
model’s control effectiveness can be observed from Table 4 after the removal of the feature extraction module. For instance, with the
real dataset, the None_Feature Extraction model exhibits an 8.11% increase in average vehicle delay, a 6.97% decrease in average
vehicle travel speed, and a 6.76% rise in the number of stops compared to MARL_SGAT. Upon reintroducing the GAT and GRU
modules separately, improvements in vehicle traffic situations are observed. Relative to not using the feature extraction module at
all, the average number of vehicle stops in the road network decreases by 0.45% and 1.36%, respectively. While the average vehicle
delay reduction is marginal at 0.17% and 0.28%, these models demonstrate a notable increase in the average vehicle speed of the
16
Table 5
Model convergence comparison.
Convergence index Algorithm
S_MARL MARL_SGAT
Convergent algebra 143 126
Average training time (s) 103.60 99.50
road network by 6.76% and 9.47%, respectively. This outcome to some extent validates the significance of capturing spatial and
temporal dependencies in traffic flow to enhance the control efficacy of signalized intersections.
Comparison of double dual network Table 5 illustrates the convergence of the loss function in two scenarios: MARL_SGAT (where
the individual value function and the advantage function are fitted by separate networks) and S_MARL (where only the individual
action-value function is fitted by a separate network, and the individual state value function and advantage function are derived
from the action-value function) (Wang et al., 2016). The table reveals that the convergence speed of MARL_SGAT surpasses that of
S_MARL: MARL_SGAT achieves convergence in 126 generations, while S_MARL requires 143 generations for full convergence. This
discrepancy arises because the decomposition structure of the dual dueling network in MARL_SGAT aids the agent in identifying
the factors contributing to the optimal action-value function. This diminishes the reliance on actions during the training process
of the state value function, consequently expediting the network convergence to some extent. However, the disparity in average
training time between the two methods is not prominent. Specifically, the average training time of S_MARL is 103.6 s, while that
of MARL_SGAT is 99.5 s, a mere 4.1 s shorter than S_MARL. The increase in network depth will result in a certain degree of
computational overhead for the model.
5. Conclusion
In this research, we address the challenge posed by the considerable heterogeneity observed among adjacent intersections
within urban road networks, termed ‘‘intersection heterogeneity’’, encompassing variations in intersection geometric structure,
traffic demand, signal timing design, and other features. To tackle the complex collaborative control problem arising from such
heterogeneity, we introduce an innovative methodology named MARL_SGAT. Leveraging both authentic and synthesized datasets
derived from real-world intersections, we conduct a comparative analysis of MARL_SGAT against seven distinct signal control
methods. Results substantiate the superior performance of MARL_SGAT, manifesting itself through a substantial reduction in
average vehicle delay and an enhancement in traffic efficiency in the road network. Remarkably, the model exhibits robust
control performance in heterogeneous and homogeneous road network environments, with its advantages notably pronounced
in heterogeneous settings. Furthermore, through ablation experiments, noteworthy improvements are observed in the operational
conditions of road networks across various model methods following the integration of the heterogeneous reward function proposed
in this study. Remarkably, MARL_SGAT demonstrates a 5.31% reduction in the average number of stops under real-world data
scenarios, in contrast to instances without using the proposed reward function, resulting in the high alleviation of frequent vehicle
starts and stops.
Nevertheless, it is important to acknowledge the limitations of this paper. Specifically, we have not accounted for the possible
existence of special situations such as indirect left turns, U-turns, and variable lanes in the road network during the design process
of the agent, potentially affecting the universality of this work. Developing MARL signal control models suitable for specific
environments is a future research direction for us to explore. Additionally, our focus will also include investigating the applicability
of various MARL algorithms, including but not limited to COMA and MADDPG, in large-scale signal control problems to further
address collaborative traffic signal control challenges.
CRediT authorship contribution statement
Yiming Bie: Writing – original draft, Supervision, Methodology, Conceptualization. Yuting Ji: Writing – original draft,
Visualization, Methodology, Conceptualization. Dongfang Ma: Writing – review & editing, Supervision, Resources, Methodology,
Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Data availability
Data will be made available on request.
17
Acknowledgments
This study is supported in part by the National Natural Science Foundation of China under Grant 52172334, Grant 52131202,
Grant 72361137006, and Grant 52131203, and in part by Key Research and Development Program of Zhejiang, China under Grant
2023C01240.
Appendix A. Supplementary data
Supplementary material related to this article can be found online at https://doi.org/10.1016/j.trc.2024.104663.
References
Acar, B., Sterling, M., 2023. Ensuring federated learning reliability for infrastructure-enhanced autonomous driving. J. Intell. Connect. Veh. 6 (3), 125–135.
Araghi, S., Khosravi, A., Creighton, D., 2015. A review on computational intelligence methods for controlling traffic signal timing. Expert Syst. Appl. 42 (3),
1538–1550.
Bokade, R., Jin, X., Amato, C., 2023. Multi-agent reinforcement learning based on representational communication for large-scale traffic signal control. IEEE
Access 11, 47646–47658.
Bouktif, S., Cheniki, A., Ouni, A., El-Sayed, H., 2023. Deep reinforcement learning for traffic signal control with consistent state and reward design approach.
Knowl.-Based Syst. 267, 110440.
Chen, Y., Li, C., Yue, W., Zhang, H., Mao, G., 2021. Engineering a large-scale traffic signal control: A multi-agent reinforcement learning approach. In: IEEE
INFOCOM 2021-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, pp. 1–6.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv preprint arXiv:1406.1078.
Chu, T., Wang, J., Codecà, L., Li, Z., 2019. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans. Intell. Transp. Syst. 21 (3),
1086–1095.
Dong, H., Meng, Z., Wang, Y., Jia, L., Qin, Y., 2021. Multi-step spatial-temporal fusion network for traffic flow forecasting. In: 2021 IEEE International Intelligent
Transportation Systems Conference. ITSC, IEEE, pp. 3412–3419.
Fei, Y., Shi, P., Li, Y., Liu, Y., Qu, X., 2024. Formation control of multi-agent systems with actuator saturation via neural-based sliding mode estimators.
Knowl.-Based Syst. 284, 111292.
Genders, W., Razavi, S., 2019. Asynchronous n-step Q-learning adaptive traffic signal control. J. Intell. Transp. Syst. 23 (4), 319–331.
He, Y., Liu, Y., Yang, L., Qu, X., 2024. Deep adaptive control: Deep reinforcement learning-based adaptive vehicle trajectory control algorithms for different risk
levels. IEEE Trans. Intell. Veh. 9 (1), 1654–1666.
He, L., Wu, L., Wang, M., Li, J., Wu, D., et al., 2021. A spatial-temporal graph attention network for multi-intersection traffic light control. In: 2021 International
Joint Conference on Neural Networks. IJCNN, IEEE, pp. 1–8.
Huang, H., Hu, Z., Lu, Z., Wen, X., 2021. Network-scale traffic signal control via multiagent reinforcement learning with deep spatiotemporal attentive network.
IEEE Trans. Cybern. 53 (1), 262–274.
Ji, J., Bie, Y., Shi, H., Wang, L., 2024. Energy-saving speed profile planning for a connected and automated electric bus considering motor characteristic. J.
Clean. Prod. 448, 141721.
Jiang, Q., Qin, M., Shi, S., Sun, W., Zheng, B., 2022. Multi-agent reinforcement learning for traffic signal control through universal communication method.
arXiv preprint arXiv:2204.12190.
Joo, H., Ahmed, S.H., Lim, Y., 2020. Traffic signal control for smart cities using reinforcement learning. Comput. Commun. 154, 324–330.
Kekuda, A., Anirudh, R., Krishnan, M., 2021. Reinforcement learning based intelligent traffic signal control using n-step SARSA. In: 2021 International Conference
on Artificial Intelligence and Smart Systems. ICAIS, IEEE, pp. 379–384.
Kong, A.Y., Lu, B.X., Yang, C.Z., Zhang, D.M., 2022. A deep reinforcement learning framework with memory network to coordinate traffic signal control. In:
2022 IEEE 25th International Conference on Intelligent Transportation Systems. ITSC, IEEE, pp. 3825–3830.
Kumar, N., Rahman, S.S., Dhakad, N., 2020. Fuzzy inference enabled deep reinforcement learning-based traffic light control for intelligent transportation system.
IEEE Trans. Intell. Transp. Syst. 22 (8), 4919–4928.
Li, C., Ma, X., Xia, L., Zhao, Q., Yang, J., 2020. Fairness control of traffic light via deep reinforcement learning. In: 2020 IEEE 16th International Conference
on Automation Science and Engineering. CASE, IEEE, pp. 652–658.
Li, Z., Yu, H., Zhang, G., Dong, S., Xu, C.-Z., 2021. Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning. Transp.
Res. C 125, 103059.
Liang, X.J., Guler, S.I., Gayah, V.V., 2020. An equitable traffic signal control scheme at isolated signalized intersections using Connected Vehicle technology.
Transp. Res. C 110, 81–97.
Liu, Y., Jia, R., Ye, J., Qu, X., 2022. How machine learning informs ride-hailing services: A survey. Commun. Transp. Res. 2, 100075.
Liu, J., Qin, S., Su, M., Luo, Y., Wang, Y., Yang, S., 2023. Multiple intersections traffic signal control based on cooperative multi-agent reinforcement learning.
Inform. Sci. 647, 119484.
Luo, H., Bie, Y., Jin, S., 2024. Reinforcement learning for traffic signal control in hybrid action space. IEEE Trans. Intell. Transp. Syst. http://dx.doi.org/10.
1109/TITS.2023.3344585.
Ma, C., Li, Y., Meng, W., 2023. A review of vehicle speed control strategies. J. Intell. Connect. Veh. 6 (4), 190–201.
Ma, J., Wu, F., 2022. Feudal multi-Agent reinforcement learning with adaptive network partition for traffic signal control. arXiv preprint arXiv:2205.13836.
Ma, D., Zhou, B., Song, X., Dai, H., 2021. A deep reinforcement learning approach to traffic signal control with temporal traffic pattern mining. IEEE Trans.
Intell. Transp. Syst. 23 (8), 11789–11800.
Mao, F., Li, Z., Li, L., 2022. A comparison of deep reinforcement learning models for isolated traffic signal control. IEEE Intell. Transp. Syst. Mag. 15 (1),
160–180.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602.
Nigam, A., Srivastava, S., 2023. Hybrid deep learning models for traffic stream variables prediction during rainfall. Multimodal Transp. 2 (1), 100052.
Qin, X., Ke, J., Wang, X., Tang, Y., Yang, H., 2022. Demand management for smart transportation: A review. Multimodal Transp. 1 (4), 100038.
Qu, X., Lin, H., Liu, Y., 2023. Envisioning the future of transportation: Inspiration of ChatGPT and large models. Commun. Transp. Res. 3, 100103.
Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J., Whiteson, S., 2020. Monotonic value function factorisation for deep multi-agent reinforcement
learning. J. Mach. Learn. Res. 21 (1), 7234–7284.
Song, X.B., Zhou, B., Ma, D., 2024. Cooperative traffic signal control through a counterfactual multi-agent deep actor critic approach. Transp. Res. C 160, 104528.
18
Su, H., Zhong, Y.D., Chow, J.Y., Dey, B., Jin, L., 2023. EMVLight: A multi-agent reinforcement learning framework for an emergency vehicle decentralized
routing and traffic signal control system. Transp. Res. C 146, 103955.
Tang, J., Zeng, J., 2022. Spatiotemporal gated graph attention network for urban traffic flow prediction based on license plate recognition data. Comput.-Aided
Civ. Infrastruct. Eng. 37 (1), 3–23.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y., 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
Wang, T., Cao, J., Hussain, A., 2021b. Adaptive traffic signal control for large-scale scenario with cooperative group-based multi-agent reinforcement learning.
Transp. Res. C 125, 103046.
Wang, J., Ren, Z., Liu, T., Yu, Y., Zhang, C., 2020a. Qplex: Duplex dueling multi-agent Q-learning. arXiv preprint arXiv:2008.01062.
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N., 2016. Dueling network architectures for deep reinforcement learning. In: International
Conference on Machine Learning. PMLR, pp. 1995–2003.
Wang, M., Wu, L., Li, J., He, L., 2021a. Traffic signal control with reinforcement learning based on region-aware cooperative strategy. IEEE Trans. Intell. Transp.
Syst. 23 (7), 6774–6785.
Wang, M., Wu, L., Li, M., Wu, D., Shi, X., Ma, C., 2022. Meta-learning based spatial-temporal graph attention network for traffic signal control. Knowl.-Based
Syst. 250, 109166.
Wang, Y., Xu, T., Niu, X., Tan, C., Chen, E., Xiong, H., 2020b. STMARL: A spatio-temporal multi-agent reinforcement learning approach for cooperative traffic
light control. IEEE Trans. Mob. Comput. 21 (6), 2228–2242.
Wang, Z., Zhu, H., He, M., Zhou, Y., Luo, X., Zhang, N., 2021c. Gan and multi-agent drl based decentralized traffic light signal control. IEEE Trans. Veh. Technol.
71 (2), 1333–1348.
Wang, T., Zhu, Z., Zhang, J., Tian, J., Zhang, W., 2024. A large-scale traffic signal control algorithm based on multi-layer graph deep reinforcement learning.
Transp. Res. C 162, 104582.
Wei, H., Xu, N., Zhang, H., Zheng, G., Zang, X., Chen, C., Zhang, W., Zhu, Y., Xu, K., Li, Z., 2019. Colight: Learning network-level cooperation for traffic signal
control. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. pp. 1913–1922.
Wei, H., Zheng, G., Gayah, V., Li, Z., 2021. Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation. ACM SIGKDD
Explor. Newsl. 22 (2), 12–18.
Wei, H., Zheng, G., Yao, H., Li, Z., 2018. Intellilight: A reinforcement learning approach for intelligent traffic light control. In: Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 2496–2505.
Yazdani, M., Sarvi, M., Bagloee, S.A., Nassir, N., Price, J., Parineh, H., 2023. Intelligent vehicle pedestrian light (IVPL): A deep reinforcement learning approach
for traffic signal control. Transp. Res. C 149, 103991.
Yoon, J., Ahn, K., Park, J., Yeo, H., 2021. Transferable traffic signal control: Reinforcement learning with graph centric state representation. Transp. Res. C 130,
103321.
Yu, J., Laharotte, P.-A., Han, Y., Leclercq, L., 2023. Decentralized signal control for multi-modal traffic network: A deep reinforcement learning approach. Transp.
Res. C 154, 104281.
Zhang, G., Wang, Y., 2010. Optimizing minimum and maximum green time settings for traffic actuated control at isolated intersections. IEEE Trans. Intell.
Transp. Syst. 12 (1), 164–173.
Zhang, Z., Yang, J., Zha, H., 2019. Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization. arXiv
preprint arXiv:1909.10651.
Zhao, L., Song, Y., Zhang, C., Liu, Y., Wang, P., Lin, T., Deng, M., Li, H., 2019. T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Trans.
Intell. Transp. Syst. 21 (9), 3848–3858.
19

Transportation Research Part C: Yiming Bie, Yuting Ji, Dongfang Ma

Uploaded by

Copyright:

Available Formats

Transportation Research Part C: Yiming Bie, Yuting Ji, Dongfang Ma

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Transportation Research Part C: Yiming Bie, Yuting Ji, Dongfang Ma

Uploaded by

Copyright:

Available Formats

Transportation Research Part C 164 (2024) 104663

Contents lists available at ScienceDirect

Transportation Research Part C

Multi-agent Deep Reinforcement Learning collaborative Traffic

ARTICLE INFO ABSTRACT

2.1. Environment and action

Fig. 1. Example of the 5 × 5 multi-intersection scenario.

2.3. Heterogeneous correlation index and reward

𝑄𝑡𝑖 = 𝐸[𝑟𝑡𝑖 + 𝛾𝑟𝑡+1

𝑄𝑡𝑖 = 𝐸[𝑟𝑡𝑖 + 𝛾 max 𝑄𝑡+1

3.1. Overall framework

3.2. Spatiotemporal feature extraction module

Fig. 2. Model framework.

Concatenating operation ∥ leads to 𝜙𝑡𝑖,𝑖∗

𝐻𝑖𝑡 = (1 − 𝜂𝑖𝑡 ) ⊙ 𝐻𝑖𝑡−1 + 𝜂𝑖𝑡 ⊙ 𝐻̃ 𝐼𝑡 (19)

3.3. Calculation of individual value

𝑉̃𝑖𝑡 = 𝑤𝑡𝑖,TR 𝑉𝑖𝑡 + 𝑏𝑡𝑖,TR (29)

𝐴̃ 𝑡𝑖 = 𝑤𝑡𝑖,TR 𝐴⃗𝑡𝑖 + 𝑏𝑡𝑖,TR (30)

3.4. Calculation of global value

𝜆𝑡𝑖 = 𝑤𝑡𝑖,M 𝜑𝑡𝑖 (38)

4.1. Experimental setting

Algorithm 1 MARL_SGAT Algorithm

Fig. 5. Real traffic scene map.

4.2. Comparison scheme and evaluation metric

Fig. 6. Comparison of algorithm convergence.

4.3. Simulation results and discussion

4.3.1. Contrast experiment

Fig. 7. Average reward during training process.

4.3.2. Ablation experiment

CRediT authorship contribution statement

Declaration of competing interest

Data will be made available on request.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.trc.2024.104663.

You might also like