2021 Comnet Iab-Rl
2021 Comnet Iab-Rl
2021 Comnet Iab-Rl
when the environment’s reply is stochastic. We can perform gate on the performance of different distributed greedy hop-
offline training through realistic instances of the actual en- by-hop path selection to the core network. The work in [12]
vironment statistics, or we can have an online training and proposes a routing scheme using multiple overlapping span-
operating system that learns while sending packets to UEs ning trees and schedules transmissions to minimize the end-
through IAB nodes. Once trained, the agent will be able to to-end delay along a subset of paths computed from the rout-
play the strategy and provide the best reward in front of any ing scheme. Authors in [31] study path selection and rate
instance of the random environment. allocation to maximize the network data rate by leveraging
Routing and scheduling in wireless multi-hop networks Lyapunov stochastic optimization.
has been traditionally considered a hard problem due to inter- Some works on routing and scheduling take into ac-
ference constraints. Solely relying on an RL approach may count the status of links (i.e., line-of-sight (LOS) and non-
lead to largely suboptimal working points. This is the rea- LOS (NLOS)). The work in [9] performs a slot-by-slot link
son why we decided to adopt a hybrid approach, which com- scheduling to maximize the instantaneous throughput con-
bines traditional optimization techniques and recent RL ap- sidering the blockage probability in the current slot de-
proaches to synergically provide a quasi-optimal and adap- scribed according to a discrete-time Markov chain. Authors
tive resource allocation algorithm. in [8] present a joint dynamic routing and scheduling policy
In this article, we formally introduce the optimization based on proportional flow delays. Every packet requires a
problem of flow routing and link scheduling in mmWave multi-hop path to reach its destination, which is selected on
IAB networks, jointly coordinating access and backhaul the base of the current network conditions. [11] performs
parts to maximize throughput in a multi-hop network archi- routing and scheduling to improve the end-to-end through-
tecture. We solve the problem with a Column Generation put in the wireless backhaul, targeting at urban environments
(CG)-based approach and leverage its generated variables to and utilizing 3D models of buildings as primary blockage
populate an optimal candidate action set for the RL agent. sources. [22] maximizes the number of protected flows in
Based on these action sets, built of promising scheduling a wireless backhaul by selecting relay nodes to bypass the
options, we design a Deep Reinforcement Learning (DRL) blockages when they occur.
framework, based on Long Short-Term Memory (LSTM) Recent years have also seen a widespread utilization of
neural networks, which can overcome the static limits of the machine learning techniques, such as reinforcement learn-
optimization approach in dynamic environments. We place ing, in mmWave wireless networks. The work in [13] pro-
emphasis on realistic scenarios and unreliable networks that poses spectrum allocation algorithms based on Double Deep
are vulnerable to recurrent and dynamic blockages. We pro- Q-network (DQN) and Actor Critic for IAB networks. [24]
pose an offline and an online training version of the ap- proposes a semi-distributed multi-armed bandit learning al-
proach, which we evaluate against traditional approaches via gorithm to minimize the end-to-end latency in backhaul net-
numerical simulations. Furthermore, we discuss feasibility works, which is proven to be adaptive to load imbalance,
issues in implementing our solution in a real system. channel variations and link failures. Authors in [31] resort
The rest of the paper is organized as follows. We first to regret RL to perform route selection and tackle the prob-
discuss related works in Section 2 and point out the contri- lem of rate allocation by successive convex approximation
butions of this article, then we provide a system overview method. The work in [30] maximizes data rate by control-
in Section 3. In Section 4 we present the formulation of ling transmitter beamwidth and power by using risk-sensitive
the problem based on Column Generation, whose results are RL, while authors in [7] present a DQN based approach
used in Section 5, where our RL-based approach is detailed. to assign backhaul resources to users with blockages. Fi-
The results of the numerical evaluation are showcased and nally, there have been some successful cases of combina-
discussed in Section 6. Finally, Section 7 concludes the pa- tion of DRL and LSTM to perform resource management in
per with some final remarks. wireless networks [14, 21] as we do in this article. These
works show the good performance of RL methods on dy-
namic problems with different targets in various network
2. Related Work scenarios. However, none of them aims to deal with rout-
Resource management in mmWave access networks has ing and link scheduling problem in mmWave IAB networks
been largely investigated in recent literature, taking into ac- and with the dynamics caused by link blockages.
count the new challenges brought in by directional transmis- Despite the significant results that have been achieved
sions compared with the conventional omnidirectionality as- in literature, the efforts made in dealing with the dynamic
sumption of works on sub-6GHz networks. Several papers blockage scenario characterizing IAB networks are far from
have investigated the optimization of bandwidth allocation being enough. Some of the existing works assume static
[6, 18, 27], power allocation [6, 23, 30, 15, 16, 33, 36], blockages for specific propagation scenarios [6] or those
beamwidth assignment [30], frame / slots design [28, 15, caused by urban buildings [11], while others assume simple
34], transmission delay [8, 24, 12]. dynamic blockages [9, 22]. Indeed, these works either use
Among all these works, it draws remarkable attention steady state distribution of the link status to compute an ex-
that a large part of them dedicates to the traffic routing and pected metric, which cannot capture frame-by-frame or slot-
transmission scheduling problem. Authors in [25] investi- by-slot actual link conditions [9], or make decisions based
from a technological point of view. is the channel gain between IAB-node 𝑖’s transmitter and
UE 𝑞’s receiver when 𝑖 transmits to UE 𝑢 and 𝑞 receives
4.2.1. Link pattern constraints from IAB-node 𝑟. Figure 2 illustrates the considered chan-
During the CG process, we generate admissible patterns nel gain model. Channel gain matrices, together with the
by considering all the technological and practical aspects assumption that transmitting nodes point their antennas to-
arising when we activate simultaneous links in a mmWave wards their intended receivers, allow us to provide a com-
IAB scenario, such as the channel model, the SINR values plete description of the propagation conditions that the play-
required to activate specific MCSs, the availability of power ers of the considered scenario can meet. Note that these ma-
control to reduce the interference impact, half-duplex con- trices can be computed before the optimization, relying on
straints, etc. These aspects also include how devices are en- both statistical models and measurement campaigns. Also,
gineered. Details like the number of antenna panels and the this matrix-based approach makes the model independent of
number of beams that can be simultaneously activated by the any fine technological and propagation detail, which can be
panel in the pattern must be taken into account. The main computed offline and directly included in the matrices.
sets of constraints necessary to capture these aspects are de- Based on the above-mentioned channel gains, SINR con-
scribed in the following. straints for backhaul links and access links can be written as
in Equation (7a), for IAB-node receivers, and Equation (7b),
SINR constraints SINR constraints are fundamental to im- for UEs. The left-hand side of each equation expresses the
plement a realistic MCS selection when a link is activated. power available at receivers of the link (from IAB-node 𝑖
We define two SINR thresholds 𝛾𝑚𝐴𝐶𝐶 and 𝛾𝑚𝐵𝐻 to indicate to IAB-node 𝑗 or from IAB-node 𝑖 to UE 𝑢), where 𝑝𝐵𝐻 𝑖,𝑗
the SINR values necessary to activate MCS 𝑚 over an ac- and 𝑝𝐴𝐶𝐶 respectively refer to the transmission power from
𝑖,𝑢
cess and a backhaul link, respectively. We compute received IAB-node 𝑖 to IAB-node 𝑗 or to UE 𝑢. The right-hand side
powers using a channel gain matrix model, in which the includes the total interference power from other simultane-
transmission power of node 𝑖 is multiplied by the channel ously active links and the noises at IAB-node and UE re-
gain matrix’s element (𝑖, 𝑗) to provide the received power at ceiver, respectively, 𝜂𝐼𝐴𝐵 and 𝜂𝑈 𝐸 . SINR conditions must
node 𝑗. Channel gain includes not only path loss but also be satisfied considering thresholds 𝛾𝑚𝐵𝐻 and 𝛾𝑚𝐴𝐶𝐶 associated
hardware- and antenna-related gain value. Since mmWave to the MCS 𝑚. Note that the 𝑀(⋅) term derives from a Big-
transmissions are highly directional, the antenna’s pointing M linearization technique that deactivates SINR constraints
direction is important for computing correct received power over links that are not selected to be active with the MCS 𝑚
and interference and must appear in channel gain matrices. (or not active at all), namely 𝑧𝐵𝐻𝑖,𝑗,𝑚 = 0 and 𝑤𝑖,𝑢,𝑚 = 0.
𝐴𝐶𝐶
We introduce four channel gain matrices: 𝐺𝑖→𝑗,𝑟→𝑛𝐵𝐵𝐵𝐵 ,
𝐵𝐴𝐵𝐵 , 𝐺𝐵𝐵𝐵𝐴 and 𝐺𝐵𝐴𝐵𝐴 . The first two matrices ex-
𝐺𝑖→𝑢,𝑟→𝑛 𝑖→𝑗,𝑟→𝑞 𝑖→𝑢,𝑟→𝑞
Coverage constraints IAB-node 𝑖 can transmit to UE 𝑢 or
press the channel gains between two IAB-nodes. 𝐺𝑖→𝑗,𝑟→𝑛
𝐵𝐵𝐵𝐵 to another IAB-node 𝑗 only if node 𝑖 can cover 𝑢 or reach
represents the channel gain between transmitter 𝑖 and re- 𝑗. The coverage is evaluated by checking whether the low-
ceiver 𝑛 when 𝑖 transmits to IAB-node 𝑗 and 𝑛 receives est access or backhaul MCS can be achieved over the link
from IAB-node 𝑟. The specification of nodes 𝑗 and 𝑟 al- activated in isolation. If such an MCS cannot be achieved,
lows to properly compute channel gains, considering an- the corresponding elements in the binary matrices 𝐶𝑂𝑉𝑖,𝑢
𝐴𝐶𝐶
tenna directionality and different pointing directions. Sim- and 𝐶𝑂𝑉𝑖,𝑗 are set to 0. They take value 1 otherwise. The
𝐵𝐻
ilarly, 𝐺𝑖→𝑢,𝑟→𝑛
𝐵𝐴𝐵𝐵 indicates the channel gain between IAB- constraints are:
node 𝑖 and IAB-node 𝑛 when 𝑖 transmits to UE 𝑢 and 𝑛 re-
𝑧𝐵𝐻 𝐵𝐻
𝑖,𝑗,𝑚 ≤ 𝐶𝑂𝑉𝑖,𝑗 , ∀(𝑖, 𝑗) ∈ 𝐵𝐻 , 𝑚 ∈ 𝐵𝐻 , (8a)
ceives from IAB-node 𝑟. The other two matrices express
the channel gains between IAB-nodes and UEs. Specif- 𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚 ≤ 𝐴𝐶𝐶
𝐶𝑂𝑉𝑖,𝑢 , ∀(𝑖, 𝑢) ∈ 𝐴𝐶𝐶
,𝑚 ∈ 𝐴𝐶𝐶
. (8b)
ically, 𝐺𝑖→𝑗,𝑟→𝑞
𝐵𝐵𝐵𝐴 indicates the channel gain between IAB-
Note that these constraints are not strictly required, as previ-
node 𝑖’s transmitter and UE 𝑞’s receiver when 𝑖 transmits
ous SINR constraints prevent out-of-coverage links from be-
⎛ ⎞
( ) ∑ ∑
𝐵𝐻 ⎜ 𝐴𝐶𝐶 𝐵𝐴𝐵𝐵 ⎟
𝑝𝐵𝐻
𝑖,𝑗
𝐺 𝐵𝐵𝐵𝐵
𝑖→𝑗,𝑖→𝑗
+ 𝑀 1 − 𝑧 𝐵𝐻
𝑖,𝑗,𝑚
≥ 𝛾 𝑚 ⎜ 𝐼𝐴𝐵
𝜂 + 𝑝 𝐵𝐻 𝐵𝐵𝐵𝐵
ℎ,𝑘
𝐺ℎ→𝑘,𝑖→𝑗
+ 𝑝 ℎ,𝑢
𝐺ℎ→𝑢,𝑖→𝑗 ⎟
, ∀(𝑖, 𝑗) ∈ 𝐵𝐻 , 𝑚 ∈ (7a)
⎜ ℎ,𝑘∈∶ ℎ∈,𝑢∈ ⎟
⎝ 𝑘≠𝑗 ⎠
⎛ ⎞
( ) ⎜ ∑ ∑
𝐵𝐴𝐵𝐴 ⎟
𝑝𝐴𝐶𝐶
𝑖,𝑢
𝐵𝐴𝐵𝐴
𝐺𝑖→𝑢,𝑖→𝑢 +𝑀 1− 𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚
≥ 𝛾𝑚𝐴𝐶𝐶 ⎜𝜂 𝑈 𝐸 + 𝑝𝐵𝐻
ℎ,𝑘
𝐵𝐵𝐵𝐴
𝐺ℎ→𝑘,𝑖→𝑢 + 𝑝𝐴𝐶𝐶
ℎ,𝑞
𝐺ℎ→𝑞,𝑖→𝑢 ⎟, ∀(𝑖, 𝑢) ∈ 𝐴𝐶𝐶 , 𝑚 ∈ . (7b)
⎜ ℎ,𝑘∈ ℎ∈,𝑞∈ ∶ ⎟
⎝ 𝑞≠𝑢 ⎠
ing activated. However, they remarkably speed up the solu- ters that respectively provide the ID of node 𝑖’s panel to be
tion process by efficiently cutting unfeasible solutions from used for the transmission to IAB-node 𝑗 and UE 𝑢, while
the exploration tree of the optimization solver. 𝑅𝑥𝑃𝑖,𝑗𝐵𝐻 ∈ defines the panel from which node 𝑖 receives
𝑖
node 𝑗’s transmissions. Note that these parameters can be
Half-duplex and multiple beams constraints The most computed a priori as they depend on the nodes’ placement.
common realization of an IAB-node consists of a set of 3
or 4 panels mounted on an node according to a triangle or Power allocation constraints We assume that the power
a square form factor to cover the entire node’s surrounding. assigned to each IAB-node beam can change at every slot to
Given a node 𝑖, each panel 𝑝 in the set of node’s panels, 𝑖 , follow SINR constraints and a per-panel power budget can
is equipped with a number of RF chains that allow the panel be applied on IAB nodes3 . Constraints (10a) and (10b) en-
to transmit up to 𝐾𝑖,𝑝 𝑇 𝑋 or to receive up to 𝐾 𝑅𝑋 simultaneous
𝑖,𝑝 sure that no power can be allocated to a transmission if the
streams (beams) to / from neighboring nodes. If at least one associated link is not activated in the pattern.
link in a panel of node 𝑖 is active in reception, the node 𝑖 is We capture a common feature of real hardware for which
declared active in reception and the binary decision variable the panel transmission power can be tuned only between a
𝑏𝑅𝑋
𝑖 is set to 1, as stated by constraint (9a). Due to the power minimum and maximum value. Constraint (10c) enforces
allocation constraints introduced later, we follow a slightly that the overall power allocated for concurrent transmissions
different approach for transmissions. If at least one link in at the panel 𝑝 of node 𝑖 must not exceed the total power avail-
panel 𝑝 of node 𝑖 is active in transmission, the panel is de- able at that panel, 𝑃𝑖,𝑝
𝑀𝐴𝑋 . Constraint (10d) imposes a min-
clared active in transmission and the binary decision variable imum activation power 𝑃𝑖,𝑝 𝑀𝐼𝑁 to panel 𝑝 of node 𝑖, which
𝑏𝑝−𝑇
𝑖,𝑝
𝑋
is set to 1. Similarly, if at least one panel is active in must be shared among simultaneously activated links.
transmission, the entire node 𝑖 is declared active in transmis- ∑
sion and the binary decision variable 𝑏𝑇𝑖 𝑋 is set to 1. This 𝑝𝐵𝐻
𝑖,𝑗 ≤ 𝑃 𝑀𝐴𝑋
𝐵𝐻 𝑧𝐵𝐻
𝑖,𝑗,𝑚 , ∀(𝑖, 𝑗) ∈ 𝐵𝐻 , (10a)
𝑖,𝑇 𝑥𝑃𝑖,𝑗
behavior is enforced by constraints (9b) and (9c). 𝑚∈
∑
The binary variables allow to enforce an IAB-node to 𝑝𝐴𝐶𝐶 ≤ 𝑃 𝑀𝐴𝑋𝐴𝐶𝐶 𝑤𝐴𝐶𝐶 ∀(𝑖, 𝑢) ∈ 𝐴𝐶𝐶 , (10b)
𝑖,𝑢 𝑖,𝑇 𝑥𝑃𝑖,𝑢 𝑖,𝑢,𝑚 ,
operate in a half-duplex mode, that is, it cannot be active both 𝑚∈
in transmission and in reception in the same slot2 , as ensured ∑ ∑
𝐵𝐻
𝑝𝑖,𝑗 + 𝐴𝐶𝐶 𝑀𝐴𝑋
𝑝𝑖,𝑢 ≤ 𝑃𝑖,𝑝 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , (10c)
by constraint (9d). Finally, according to the current hardware 𝑗∈∶ 𝑢∈ ∶
specifications, we assume that UEs can receive from at most 𝐵𝐻 =𝑝 𝑇 𝑥𝑃 𝐴𝐶𝐶 =𝑝
𝑇 𝑥𝑃𝑖,𝑗 𝑖,𝑢
one IAB-node in each pattern (slot), as stated by constraint ∑ ∑
𝑀𝐼𝑁 𝑝−𝑇 𝑋
(9e). The constraints are defined as follows: 𝑝𝐵𝐻
𝑖,𝑗 + 𝑝𝐴𝐶𝐶
𝑖,𝑢 ≥ 𝑃𝑖,𝑝 𝑏𝑖,𝑝 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 .
∑ 𝑗∈∶ 𝑢∈ ∶
𝐴𝐶𝐶 =𝑝
(9a)
𝐵𝐻 =𝑝
𝑧𝐵𝐻 𝑅𝑋 𝑅𝑋
𝑗,𝑖,𝑚 ≤ 𝐾𝑖,𝑝 𝑏𝑖 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , 𝑇 𝑥𝑃𝑖,𝑗 𝑇 𝑥𝑃𝑖,𝑢
𝑗∈,𝑚∈∶ (10d)
BH =𝑝
𝑅𝑥𝑃𝑖,𝑗
∑ ∑ 𝑝−𝑇 𝑋 𝑇 𝑋
𝑧𝐵𝐻
𝑖,𝑗,𝑚 + 𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚 ≤ 𝑏𝑖,𝑝 𝐾𝑖,𝑝 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , 5. Adaptive Resource Allocation
𝑚∈,𝑗∈∶ 𝑚∈,𝑢∈ ∶
𝑇 𝑥𝑃 𝐵𝐻 =𝑝 𝑇 𝑥𝑃 𝐴𝐶𝐶 =𝑝 In this section, the basic aspects of deep reinforcement
𝑖,𝑗 𝑖,𝑢
learning (DRL) and recurrent neural network (RNN) are in-
(9b)
troduced. Subsequently, the flow routing and link schedul-
𝑏𝑝−𝑇
𝑖,𝑝
𝑋
≤ 𝑏𝑇𝑖 𝑋 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , (9c) ing problem is reformulated as a buckets-pipes game. Then,
𝑏𝑅𝑋 + 𝑏𝑇𝑖 𝑋 ≤ 1, ∀𝑖 ∈ , (9d) based on the patterns provided by the CG method, a DRL-
𝑖
∑ based approach is presented in detail. To adapt the DRL
𝑤𝐴𝐶𝐶
𝑗,𝑢 ≤ 1, ∀𝑢 ∈ , (9e) model to the dynamic IAB scenario, an online framework is
𝑗∈ presented at last and feasibility issues are discussed.
where 𝑇 𝑥𝑃𝑖,𝑗
𝐵𝐻 ∈ and 𝑇 𝑥𝑃 𝐴𝐶𝐶 ∈ are input parame-
𝑖 𝑖,𝑢 𝑖
3 This is a realistic assumption based on current prototype device spec-
ifications. However, other types of power budget model (like per-node or
2 Note that this is a hardware constraint that can be easily removed when per-beam) can be easily included in the formulation with small modifica-
the considered devices can work full-duplex by adopting cancellation tech- tions of the power allocation constraints.
niques to maintain a sufficient level of isolation between transmitting and
receiving panels.
described in Section 4, the optimization objective (2a) cor- 5.3.1. State space, action space, and reward function
responds to the long-term expected return RL aims to max- State Space The factors having an impact on the through-
imize. Also, technological constraints (7), (8), (9), and (10) put in an IAB network mainly consist in two elements:
are already satisfied in the sets of candidate actions (pat- the buffer occupation in each IAB-node (i.e., whether re-
terns) available to the agent. Indeed, it is built from link lay nodes store enough bits to be transferred) and links sta-
patterns generated with the CG approach. The overall flow tus (i.e., whether the links are unobstructed or not). As ex-
routing, pattern scheduling, and provided fairness depend plained in Section 3, backhaul links are hardly exposed to
on which patterns the RL agent selects and how the traffic blockages in practice, therefore we consider in the state def-
flows through the system. The RL agent selects one link inition only the status of access links, which, instead, can
pattern (pipes’ valves) to activate in each slot based on the undergo several blockages. Thus, our state vector is built
action probability given by its action policy, while the data from the concatenation of two components: the vector of the
transmission obeys the following rules. Traffic is buffered number of data bits buffered in each IAB-node (excluding
in queues4 at IAB-nodes, and can be transmitted only to the IAB-donor) and the binary vector representing the blockage
reachable IAB-nodes and UEs (see coverage constraints in status of every access link.
(8)). The total number of bits transferred in a slot from an The buffer-occupation vector at the end of slot 𝑡 is
IAB-node through its outgoing links is limited by the num- an (|𝑠𝑢𝑏 |)-dimensional vector storing the number of bits
ber of bits in its queue, as indicated by flow balance equa- buffered at each IAB-node 𝑛, 𝐵𝑛𝑡 , normalized to the num-
tions (2c) and (2d). Similarly, the maximum number of bits ber of bits that can be transferred in a slot over the link
each link can transmit is limited by its capacity, which is de- with the minimum capacity of the whole network, 𝑐𝑚𝑖𝑛 ⋅ 𝛿.
termined by the activated pattern, according to constraints This normalization allows to shrink the state space so as to
(2f) and (2g). UE throughput fairness (2b) is pursued by facilitate the RL agent’s exploration, thus accelerating the
equally sharing the number of currently buffered data bits convergence of the learning process. Namely, the buffer-
transmitted along multiple links originated at each single occupation vector at step 𝑡 is defined as:
IAB-node, which appear in the same link pattern. ⌊ ⌋
𝐵𝑛𝑡
5.3. Flow Allocation and Pattern Scheduling based 𝑆𝑡 = [𝑠𝑛 ]𝑛∈𝑠𝑢𝑏 , with 𝑠𝑛 =
1 𝑡 𝑡
, (11)
𝑐𝑚𝑖𝑛 ⋅ 𝛿
on LSTM-Assisted A2C
The buckets-pipes game formulated in the previous sec- while the link-blockage vector can be written as
tion forms the environment the RL agent interacts with. Note
that, according to 5G IAB specifications, the IAB-donor is 𝑆𝑡2 = [𝑜𝑡𝑙 ]𝑙∈𝐴𝐶𝐶 , (12)
in charge of managing the entire IAB network, therefore we
can assume this RL agent to be hosted in the IAB-donor and where the value of 𝑜𝑡𝑙 depends on the availability of link 𝑙 in
act as a centralized controller. One slot of the frame is equiv- the current slot 𝑡. Two assumptions about the knowledge on
alent to one step of RL interaction. the link status can be made. In a first more ideal scenario,
The proposed approach resorts to 𝑘-step Advantage Ac- the status of all access links can be monitored slot-by-slot,
tor Critic (A2C) [20] to allocate resources under different which we call fully-observable case, and 𝑜𝑡𝑙 is defined as:
IAB network conditions. This technique is applied to our {
scenario because it can take advantage of both value-based 1, 𝑙 is unblocked,
𝑜𝑡𝑙 = ∀𝑙 ∈ 𝐴𝐶𝐶 . (13)
and policy-gradient approaches and it empirically performs 0, 𝑙 is blocked,
better than other similar approaches such as Asynchronous In a second more realistic scenario, we assume that only
Advantage Actor Critic (A3C) and DQN, as we found out the status of those links appearing in the pattern selected
in our preliminary tests. A2C makes sequential action de- in the current slot 𝑡 can be collected, according to whether
cisions based on the current state of the environment, how- transmissions are successful or not. We call this scenario
ever, it is difficult to select the appropriate action for the next partially-observable case. Based on this consideration, if
slot when drastic changes can occur in networks (e.g., dy- an access link is outside the pattern selected, we deem its
namic link blockages). To address this issue, LSTM network status as "unknown", therefore we define 𝑜𝑡𝑙 as:
is adopted to characterize link status variation regularities
and feed A2C with a processed state description indicating ⎧ 1, 𝑙 is unblocked, inside pattern,
at which point of the repeating history the environment sta- ⎪
tus currently is. 𝑜𝑡𝑙 = ⎨ 0, 𝑙 is blocked, inside pattern, ∀𝑙 ∈ 𝐴𝐶𝐶 . (14)
The essential elements of the RL approach (state space, ⎪−1, 𝑙 is outside pattern,
⎩
action space and reward function), the NN’s architecture,
and their training procedures are further elaborated in the The RL state vector is the concatenation of the two afore-
following paragraphs. mentioned vectors: 𝑆𝑡 = [𝑆𝑡1 , 𝑆𝑡2 ].
4 For a fair comparison with the optimization approach in Section 4,
we assume that queue sizes do not limit the performance of the system.
Action Space The link patterns generated in the CG ap-
However, queue limits can be easily added to the RL environments and the proach of Section 4.2 serve as actions among which the RL
agent can be trained accordingly. agent can select the one to perform in each slot. Each pattern
...
...
Denoting the set of generated patterns as , the action at Prob.
Forget
Gate
step 𝑡 is 𝐴𝑡 ∈ .
Input
Gate
...
ing to the current policy 𝜋, thus the links within this pattern
Cell State
Hidden State
Output
Gate
tivated links can truly deliver bits finally depends on whether State
...
IAB nodes have enough bits buffered and on the presence of Value
expressed as: 𝐴𝑡 is selected according to the current policy 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔),
which is determined by the NN vectors 𝜃 and 𝜔 (Line 7).
min 𝐿𝑐 = (𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 ; 𝜓, 𝜔))2 , (16) Then, the set of blocked links 𝐿𝑏 is removed from 𝐴𝑡 (Line
𝜓,𝜔
9) to simulate the occurrence of blockages according to the
∑
𝑘−1
𝐺𝑘 (𝑆𝑡 ) = 𝛾 𝑖 𝑅𝑡+𝑖 + 𝛾 𝑘 𝑣(𝑆𝑡+𝑘 ), (17) model described in Equation (1). Finally, the data transfer is
𝑖=0 performed over unblocked links. (Lines 10-14). The num-
ber of transferred data bits from each transmitter’s buffer to
where the expected k-step return 𝐺𝑘 (𝑆𝑡 ) is computed at step each receiver’s buffer (Lines 12-13) is limited by:
𝑡 + 𝑘 based on 𝑘-step experience.
Considering the policy’s performance measure is the 1. the overall number of bits in transmitter 𝑖’s buffer at
long-term reward (critic), the goal of the actor part of the step 𝑡: 𝐵𝑖𝑡 (Line 11)
framework (pre-processing and actor NN) is to guide the pol- 2. the number of bits equally assigned by the node 𝑖’s
icy parameters 𝜃 and 𝜔 in the direction of ∇𝜃,𝜔 𝔼[𝐺𝑡 ] to de- transmitter to each of the unblocked parallel links,
𝐵𝑖𝑡
rive the best action policy 𝜋(𝑆𝑡 ; 𝜃, 𝜔). As explained in Sec- (𝑖, ⋅), in 𝐴𝑡 : |{(𝑖,⋅)}|
tion 5.1, a low-variance unbiased estimate of the policy gra- 3. the capacity 𝑐𝑖,𝑗 of link (𝑖, 𝑗) that limits to 𝑐𝑖,𝑗 ⋅ 𝛿 the
dient can be considered: ∇𝜃,𝜔 log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔)(𝐺𝑘 (𝑆𝑡 ) − bits transferred over link (𝑖, 𝑗) within a slot.
𝑣(𝑆𝑡 )). Moreover, in order to promote the action exploration,
thus preventing a premature convergence to sub-optimal de- Effective link data transfers determine the reward, which
terministic policies, the policy entropy 𝐻(𝜋(𝑆𝑡 ; 𝜃, 𝜔)) is is set to the number of IAB-nodes’ access links that can
included in the policy error minimization, which is com- transfer a non-null number of bits (Line 14). The next state
∑ 𝑆𝑡+1 is updated considering buffer-occupation vector 𝑆𝑡+1 1
puted as: − 𝐴𝑡 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔) log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔). In this
way, similarly to the value loss, the policy loss to be min- and link-status vector 𝑆𝑡+1 2 , which are filled according to
imized is defined as: Equation (11) and Equation (12) (Lines 15-16). As for the
link status, either the fully-observable case or the partial-
min 𝐿𝑎 = − log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔)(𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 )) observable case can be selected.
𝜃,𝜔
− 𝜂𝐻(𝜋(𝑆𝑡 ; 𝜃, 𝜔)). (18) After the data collection loop, the value of the expected
k-step return 𝐺𝑘 (𝑆𝑡 ) is computed. Its initial value is set in
Let 𝜅 denote the concatenation of 𝜓, 𝜃 and 𝜔, i.e., 𝜅 = Lines 18-19, then the gradients are computed in Lines 21-24
[𝜓, 𝜃, 𝜔]. The ultimate goal of the training process is to iter- based on Equations (19) and (20). Finally, NN parameters’
atively update 𝜅 to minimize the total loss function 𝐿 (19), vector 𝜅 is updated in Line 25. The above operations are
which is the sum of the policy loss and value loss: repeated until the learning phase ends, after 𝑇𝑚𝑎𝑥 steps.
This is an offline training procedure, as in most of the
𝐿 = − log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜅)(𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 )) − 𝜂𝐻(𝜋(𝑆𝑡 ; 𝜅)) DRL applications, in which a deep NN, trained in a virtual
+ 𝛽(𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 ; 𝜅))2 , (19) but realistic environment, can be then used in real network
𝜅 =𝜅 + ∇𝜅 𝐿. (20) operations in a real environment. This training procedure re-
quires some computational effort, however it only needs to
The parameter vector 𝜅 updates via Equation (20). be run once to set the appropriate vector 𝜅, then the trained
The detailed learning procedure is described in Algo- NN can be used, with much less effort, to properly drive the
rithm 1, which includes flow routing and pattern schedul- pattern selection in real-time, usually implemented with ded-
ing aspects in order to compute the state transition and the icated neural engine hardware.
reward. The whole training period spans 𝑇𝑚𝑎𝑥 steps (slots),
while the NN model is updated every 𝑡𝑚𝑎𝑥 steps (slots). Con- 5.4. Online Learning
sidering the IAB network needs to undergo an initial tran- Since the offline training procedure described in the pre-
sient period to reach a realistic steady state (i.e., stationary vious section instructs NNs using a random environment, the
buffer levels and link behaviors), we introduce a warm-up RL agent can easily deal with the intermittent link availabil-
period 𝑡𝑤𝑎𝑟𝑚−𝑢𝑝 . ity according to its implicit statistic. However, if this statistic
After the parameter initialization of pre-processing, ac- is not fully stationary and some drastic change occurs in the
tor and critic networks (Line 1), the system warms up (Line distribution of link blockage probabilities, such pre-trained
3). During the warm-up, rewards are set to 0 to prevent a pol- NNs can incur in a performance degradation.
icy bias in favor of IAB-donor’s direct access transmissions. To tackle drastic changes, a more attractive solution is to
Before each model update, the pointer to the beginning of perform online learning, which means that NN training and
the train sequence 𝑡𝑠𝑡𝑎𝑟𝑡 is updated and the gradient 𝑑𝜅 is set testing are carried out in parallel. In particular, the most-
to 0 (Line 5). recently updated NN model interacts with the current envi-
Data for the main iteration update of gradients and pa- ronment, while in the meantime, the results of these inter-
rameters (Lines 4-25) are collected in a data collection loop actions are collected as training data to be used in the next
in Lines 6-17. The data collection loop focuses on an experi- NN model update. In this way, NNs can catch system dy-
ence of 𝑡𝑚𝑎𝑥 steps (stopping if the interaction reaches a termi- namics on-the-fly and adjust NN weights to adapt to the new
nal state). In each run of the data collection loop, the action environment.
Algorithm 1 Learning Procedures on IAB Resource Allocation Assume IAB system pauses or RL model update takes 0 secs.
tmax
Parameters: total training steps 𝑇𝑚𝑎𝑥 , model update inter-
val steps 𝑡𝑚𝑎𝑥 , warm-up steps 𝑡𝑤𝑎𝑟𝑚−𝑢𝑝 . M0 M1 M2 M3 M4 ... Simulation
timeline
the update procedure last less than 1 ms, namely less than 8 The total time for the offline training spans 1.8e6 steps,
slots of a physical IAB frame. This is a very small time pe- which takes approximately 40 minutes on our Intel(R)
riod if compared with the data collection time 𝑡𝑚𝑎𝑥 (in our Xeon(R) CPU E5-2640 v4 @ 2.40GHz and 125GB RAM
experimental settings, 200 ∼ 500 steps/slots, corresponding machine. Since it takes some time to reach a stationary situa-
to 25 ∼ 62.5 ms). In this sense, the online-learning training tion where IAB-nodes’ queues have enough data to be trans-
period will not be significantly extended. ferred, we have considered a warm-up period of 400 slots,
empirically determined, during which rewards are set to 0.
This allows us to avoid biasing the solution on IAB-donor
6. Experimental Results access transmissions, as they can provide immediate initial
In this section, we evaluate the performance of our reward with respect to IAB-nodes’ ones which initially have
LSTM-assisted A2C-based resource allocation method empty queues.
through a simulation campaign. We first describe the con- The online approach is evaluated in a dynamic IAB net-
sidered network scenario and the NN model settings. Then, work scenario where not only a different number of links
the performance of offline and online schemes is analyzed. undergo random blockages, but also blockage dynamics is
totally changed several times. In particular, the results of the
6.1. IAB Network Scenario offline approach are compared with those of the online ap-
The instances we consider are the results of an internally- proach, demonstrating that the online training solution can
developed random instance generator compliant with 3GPP mostly outperform the offline model via automatically ad-
NR IAB simulation guidelines [2]. The playground consists justing NN weights in accordance with blockage dynamics.
of a 300𝑚 × 300𝑚 square with 1 IAB-donor, 4 IAB-nodes
and 30 UEs randomly deployed. The IAB-donor is placed at 6.3. Offline Approach Performance Analysis
a height of 25m and is equipped with a single 24x16 panel In this part, the LSTM-A2C NN models (referred to as
antenna array that can generate and process 4 simultaneous DRL in the following) are trained offline in scenarios char-
streams. IAB-nodes are placed at a height of 6m and are acterized by the same link blockage statistics as in the test-
equipped with 4-panel 8x6 antenna array with elements per ing scenarios to which the trained NN model is subsequently
panel, each able to create and process 1 stream. UEs are applied. DRL’s performance is compared against three alter-
equipped with omni-directional antennas and set to a height native resource allocation methods:
of 1.5m. The maximum transmission power is set to 32dBm
• CG-RND: where the link pattern to be activated in the
for the IAB-donor and 23dBm for each IAB-node panel.
current slot is randomly chosen among those available
We consider a 3GPP NR TR 38.901 channel model [1].
at the end of pricing process in CG Section 4;
IAB access and backhaul transmissions are carried out at 28
GHz with 400 MHz of bandwidth and NR Numerology #3 • CG-OPT: where the (quasi-)optimal frame is com-
(120 kHz subcarrier spacing). Each frame consists of 80 puted by using the whole CG approach, which con-
slots, each of which lasts 125 𝜇𝑠. We consider a single MCS siders ideal fully-reliable links.
for backhaul and access: 16 and 8, respectively6 . MCS 16 • Multi-Slot: a heuristic algorithm proposed in [26] to
corresponds to a SINR threshold of 5.60 dB and rate of 525.9 perform link scheduling, which coordinates the link
Mbps, while MCS 8 can achieve a rate of 121.4 Mbps with interference to construct sets of links (similar to the
a SINR threshold of -3.77 dB. idea of link patterns in our work) that satisfy SINR
conditions. According to the Multi-Slot algorithm, we
6.2. NN Settings generate patterns and then iteratively apply them in
In the experiments, the same NN settings are used in both sequential order slot by slot.
offline and online approaches. In particular, the NN model is
composed of 1 fully connected layer with 32 units, 1 LSTM All the values are based on the average of 10 instances ran-
layer with 64 hidden units, 8 fully connected layers with 32 domly generated.
units for both actor and critic networks. The output layers of All four methods are radically different and have differ-
actor and critic networks use Softmax and linear functions. ent complexity levels, which are difficult to precisely com-
Reward’s discount factor 𝛾 is set to 0.99, hence a long- pare. Therefore, we resort to their average solution time on
term reward is considered. Weights 𝜂 for policy entropy and our laptop to have an approximate idea. CG-RND solves
𝛽 for value loss in total loss function of Equation (19) are a sequence of integer and linear programming models and
set to 0.01 and 0.25. Learning rate and batch size for the stops when no objective function improvement can be ob-
learning process are 0.007 and 200, respectively. RMSProp tained. Each instance requires a different number of iter-
Optimizer is used to minimize the total loss so as to adjust ations, but we have experimentally checked that the im-
the NN parameters. provement is limited beyond the 200-th iteration. The en-
tire procedure lasts less than 1 minute. The optimization
6 Note that we have considered a single MCS for access and backhaul
approach of CG-OPT needs more time for the final inte-
only for sake of simplicity. Indeed, MCS selection is not in the charge of the
RL agent, whose actions are the different link patterns. MCS is implicitly ger solution compared with CG-RND, thus its total time in-
included in the definition of link patterns, which are automatically generated creases up to more than 10 minutes. Dealing with resource
in the CG procedure. allocation problems over integer resources, both CG-RND
12
Light block. 1.0 Light block., IAB-donor
0 0.0
DRL CG-RND CG-OPT Multi-Slot DRL CG-RND CG-OPT Multi-Slot
(a) Per-frame overall number of downloaded bits by all UEs. (b) Volume shares downloaded at IAB-donor or IAB-nodes.
1.0
Light block.
0.8
1.0 Mid block.
Severe block.
Multi-Slot than that of DRL, we don’t further include Multi- with the same probability, although with a small through-
Slot in CDF curves to simplify the presentation of the results. put. 2) DRL can serve more users than CG-OPT in all three
As can be seen from the upper right corner of the figure, un- blockage situations, showing an advantage not only from a
der three cases of different blockage densities (i.e., light, mid throughput perspective but also in terms of coverage, when
and severe blockages), the maximum rates achieved by DRL the links’ availability is not close to the ideal.
and CG-OPT are near 121.4 Mbps (the maximum achiev-
able rate of MCS 8), while CG-RND can at most reach about 6.3.3. DRL success analysis
60 Mbps. This means that the problem faced is not triv- We analyze now the reasons of the success of DRL
ial and only a careful link-pattern selection can allow us to method with the help of Figure 7, where, for the sake of
reach good performance. In addition, DRL can learn how brevity, we consider only the case of severe blockages, but
to provide high throughput, even in case of blockages, to the same considerations apply to other cases as well. The
those users that are not directly affected by them. The av- heatmaps in the top row show the number of different access
erage per-UE data rates are reported in Table 4: while in links incident to each UE (horizontal axis) scheduled in the
light-blockage conditions the throughput provided by DRL whole testing period of each instance (vertical axis). The
is only 4.5% less than CG-OPT, DRL outperforms CG-OPT bottom row instead shows the percentage of slots in which
by more than 23% in case of severe blockages. the indicated UE appears as a receiver of a link in the acti-
The values of the CDFs at the origin indicate the percent- vated patterns. For each method, the order of instances’ IDs
age of users that cannot be served. This information is better and UEs’ IDs in every heatmap is computed according to the
described by Figure 6(d), which indicates the percentage of descending order of slot percentage.
UEs in the playground that receives a non-null throughput in Comparing the three heatmaps in the top row, we can
the different scenarios. The first and expected result is that see that CG-OPT method activates the smallest number of
as the blockages get denser, more users are excluded from different access links for each UE, as shown by its darkest
the service. Two interesting aspects further emerge: 1) CG- heatmap. DRL and CG-RND, instead, resort to more access
RND and Multi-Slot methods show the highest service per- links, which increase the probability to use alternatives when
centages, because they do not tend to select the best UEs to an access link is blocked. This increases the scheduling ro-
maximize the overall throughput, but rather to reach all UEs bustness. In the bottom row, the lightest heatmap shows that
8 6 4 2 0
8 6 4 2 0
4 4 4
Instances
Instances
Instances
3 3 3
2 2 2
1 1 1
0 0 0
1.0 1.0 1.0
8 6 4 2 0
8 6 4 2 0
8 6 4 2 0
0.8 0.8 0.8
Instances
Instances
Instances
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28
Users Users Users
(a) DRL. (b) CG-RND. (c) CG-OPT.
Figure 7: Comparison of the three methods with severe blockages.
11 11 11
Total bits to UEs per frame (Mbit)
9
9 9
8
8 8
7
7 7
6
Offline Offline Offline
6 6
5 Online Online Online
0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000
Time (s) Time (s) Time (s)
(a) Total Traffic Volume - Reset-all. (b) Total Traffic Volume - Reset-memory. (c) Total Traffic Volume - Reset-none.
Figure 8: Comparison among the three online schemes in Lo-Hi-Lo-blockage scenario.
11 11 11
Offline Offline Offline
Total bits to UEs per frame (Mbit)
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000
Time (s) Time (s) Time (s)
(a) Total Traffic Volume - Reset-all. (b) Total Traffic Volume - Reset-memory. (c) Total Traffic Volume - Reset-none.
Figure 9: Comparison among the three online schemes in Hi-Lo-Hi-blockage scenario.
the online model largely outperforms the offline model. The Acknowledgment
reason is that frequent blockages of the first stage of the
Hi-Lo-Hi-blockage scenario implicitly set a limit to the of- This work was partially supported by China Scholarship
fline model’s policy, which leaves more rooms for the online Council (CSC) Grant No. 201806470077.
model to improve over when more link opportunities become
available in the next stages. References
This allows us to draw further conclusions. If we want [1] 3GPP, a. Study on channel model for frequencies from 0.5 to 100
to use an offline-training approach, we have to pay atten- GHz, TR 38.901.
tion to training it in an environment that is less affected [2] 3GPP, b. Study on integrated access and backhaul, TR 38.874.
by blockages than it can potentially be when severer con- [3] Capone, A., Carello, G., Filippini, I., Gualandi, S., Malucelli, F.,
2010a. Routing, scheduling and channel assignment in wireless mesh
ditions are met. However, the blockage intensity cannot
networks: optimization models and algorithms. Ad Hoc Networks 8,
be too light, otherwise blockage countermeasures cannot be 545–563.
learned. Vice versa, an online-training solution can always [4] Capone, A., Carello, G., Filippini, I., Gualandi, S., Malucelli, F.,
perform best. 2010b. Solving a resource allocation problem in wireless mesh net-
Finally, the comparison across three strategies (i.e., works: A comparison between a CP-based and a classical column
generation. Networks: An International Journal 55, 221–233.
reset-all, reset-memory and reset-none) indicates that the on-
[5] Capone, A., Filippini, I., Gualandi, S., Yuan, D., 2013. Resource opti-
line approach can potentially perform well. Considering the mization in multi-radio multi-channel wireless mesh networks, Wiley
longer convergence time at each stage of the "reset-all" strat- Online Library.
egy and the performance similar to the other two strategies, [6] Du, J., Onaran, E., Chizhik, D., Venkatesan, S., Valenzuela, R.A.,
we suggest to use the "reset-none" strategy as the best trade- 2017. Gbps user rates using mmwave relayed backhaul with high-
gain antennas. IEEE Journal on Selected Areas in Communications
off between performance and complexity.
35, 1363–1372.
[7] Feng, M., Mao, S., 2019. Dealing with limited backhaul capacity
in millimeter-wave systems: a deep reinforcement learning approach.
7. Conclusion IEEE Communications Magazine 57, 50–55.
In this paper, we have proposed a CG-based DRL ap- [8] García-Rois, J., Banirazi, R., González-Castaño, F.J., Lorenzo, B.,
proach for resource allocation in mmWave 5G IAB networks Burguillo, J.C., 2018. Delay-aware optimization framework for pro-
portional flow delay differentiation in millimeter-wave backhaul cellu-
able to face realistic link blockages. The results have shown lar networks. IEEE Transactions on Communications 66, 2037–2051.
that we can outperform the optimization approaches typ- [9] He, Z., Mao, S., Kompella, S., Swami, A., 2015. Minimum
ically used in wireless multi-hop networks, demonstrating time length scheduling under blockage and interference in multi-hop
that our approach can automatically adapt to environmental mmwave networks, in: 2015 IEEE Global Communications Confer-
changes. In addition, we have developed an online version ence (GLOBECOM), IEEE. pp. 1–7.
[10] Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory.
of our approach to increase its robustness in front of dra- Neural Computation 9, 1735–1780.
matically changing environments. Indeed, it can catch sys- [11] Hu, Q., Blough, D.M., 2017. Relay selection and scheduling for mil-
tem dynamics on-the-fly and adjust the training to adapt to limeter wave backhaul in urban environments, in: 2017 IEEE 14th
the change. We have also carried out an analysis of imple- International Conference on Mobile Ad Hoc and Sensor Systems
mentation and feasibility issues of our approach, which has (MASS), IEEE. pp. 206–214.
[12] Kilpi, J., Seppänen, K., Suihko, T., Paananen, J., Chen, D.T., Wainio,
proven how it can be practically implemented just relying on P., 2017. Link scheduling for mmwave WMN backhaul, in: 2017
realistic hardware requirements. IEEE International Conference on Communications (ICC), IEEE. pp.
Finally, we believe that our approach can be seen as one 1–7.
of the examples in which traditional optimization techniques [13] Lei, W., Ye, Y., Xiao, M., 2020. Deep reinforcement learning based
and recent RL approaches can positively coexist and provide spectrum allocation in integrated access and backhaul networks. IEEE
Transactions on Cognitive Communications and Networking .
remarkable advantages by synergically leveraging their re- [14] Li, R., Wang, C., Zhao, Z., Guo, R., Zhang, H., 2020. The lstm-based
spective strengths. advantage actor-critic learning for resource management in network
slicing with user mobility. IEEE Communications Letters .
[15] Li, Y., Luo, J., Xu, W., Vucic, N., Pateromichelakis, E., Caire, G., 2018 IEEE Conference on Computer Communications (INFOCOM),
2017a. A joint scheduling and resource allocation scheme for mil- IEEE. pp. 1205–1213.
limeter wave heterogeneous networks, in: 2017 IEEE Wireless Com- [35] Zhang, B., Devoti, F., Filippini, I., 2020. RL-based resource allo-
munications and Networking Conference (WCNC), IEEE. pp. 1–6. cation in mmwave 5G IAB networks, in: 2020 IEEE Mediterranean
[16] Li, Y., Pateromichelakis, E., Vucic, N., Luo, J., Xu, W., Caire, G., Communication and Computer Networking Conference (MedCom-
2017b. Radio resource management considerations for 5G millimeter Net), IEEE. pp. 1–8.
wave backhaul and access networks. IEEE Communications Maga- [36] Zhuang, H., Chen, J., Wu, D.O., 2017. Joint access and backhaul
zine 55, 86–92. resource management for ultra-dense networks, in: 2017 IEEE Inter-
[17] Luo, J., Rosenberg, C., Girard, A., 2010. Engineering wireless mesh national Conference on Communications (ICC), IEEE. pp. 1–6.
networks: joint scheduling, routing, power control, and rate adapta-
tion. IEEE/ACM Transactions on Networking 18, 1387–1400.
[18] Ma, Z., Li, B., Yan, Z., Yang, M., 2020. Qos-oriented joint op- Bibo Zhang received the B.S. degree in infor-
timization of resource allocation and concurrent scheduling in 5G mation engineering and the M.S. degree in elec-
millimeter-wave network. Computer Networks 166, 106979. tronics and communication engineering from Bei-
[19] MacCartney, G.R., Rappaport, T.S., Rangan, S., 2017. Rapid fad- jing University of Posts and Telecommunications,
ing due to human blockage in pedestrian crowds at 5G millimeter- China, in 2015 and 2018. She is currently pur-
wave frequencies, in: 2017 IEEE Global Communications Confer- suing the Ph.D. degree in information technology,
ence (GLOBECOM), IEEE. pp. 1–7. from Politecnico di Milano, Italy. Her current re-
[20] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, search interests include resource management in
T., Silver, D., Kavukcuoglu, K., 2016. Asynchronous methods for 5G mmWave networks.
deep reinforcement learning, in: International Conference on Ma-
chine Learning (ICML), pp. 1928–1937.
[21] Naparstek, O., Cohen, K., 2018. Deep multi-user reinforcement learn- Francesco Devoti received the B.S., and M.S. de-
ing for distributed dynamic spectrum access. IEEE Transactions on grees in Telecommunication Engineering, and the
Wireless Communications 18, 310–323. Ph.D. degree in Information Technology from the
[22] Niu, Y., Ding, W., Wu, H., Li, Y., Chen, X., Ai, B., Zhong, Z., 2019. Politecnico di Milano, in 2013, 2016, and 2020 re-
Relay-assisted and QoS aware scheduling to overcome blockage in spectively. He is currently a research scientist at
mmwave backhaul networks. IEEE Transactions on Vehicular Tech- NEC Laboratories Europe. His research interests
nology 68, 1733–1744. include millimeter-wave technologies in 5G net-
[23] Niu, Y., Gao, C., Li, Y., Su, L., Jin, D., Zhu, Y., Wu, D.O., 2016. works.
Energy-efficient scheduling for mmwave backhauling of small cells
in heterogeneous cellular networks. IEEE Transactions on Vehicular
Technology 66, 2674–2687. Ilario Filippini received B.S. and M.S. degrees in
[24] Ortiz, A., Asadi, A., Sim, G.H., Steinmetzer, D., Hollick, M., 2019. Telecommunication Engineering and a Ph.D in In-
Scaros: A scalable and robust self-backhauling solution for highly formation Engineering from the Politecnico di Mi-
dynamic millimeter-wave networks. IEEE Journal on Selected Areas lano, in 2003, 2005, and 2009, respectively. He
in Communications . is currently an Associate Professor with the Dipar-
[25] Polese, M., Giordani, M., Roy, A., Castor, D., Zorzi, M., 2018. Dis- timento di Elettronica, Informazione e Bioingeg-
tributed path selection strategies for integrated access and backhaul neria, Politecnico di Milano. His research in-
at mmwaves, in: 2018 IEEE Global Communications Conference terests include planning, optimization, and game
(GLOBECOM), IEEE. pp. 1–7. theoretical approaches applied to wired and wire-
[26] Saad, M., Abdallah, S., 2019. On millimeter wave 5G backhaul link less networks, performance evaluation and re-
scheduling. IEEE Access 7, 76448–76457. source management in wireless access networks,
[27] Saha, C., Afshang, M., Dhillon, H.S., 2018. Bandwidth partition- and traffic management in software defined net-
ing and downlink analysis in millimeter wave integrated access and works. On these topics, he has published over 60
backhaul for 5G. IEEE Transactions on Wireless Communications peer-reviewed articles. He serves in the TPC of
17, 8195–8210. major conferences in networking and as an Asso-
[28] Sahoo, B., Yao, C.H., Wei, H.Y., 2017. Millimeter-wave multi-hop ciate Editor of IEEE Transactions on Mobile Com-
wireless backhauling for 5G cellular networks, in: 2017 IEEE 85th puting and Elsevier Computer Networks. He is an
Vehicular Technology Conference (VTC Spring), IEEE. pp. 1–5. IEEE Senior Member.
[29] Shi, S., Wang, Q., Xu, P., Chu, X., 2016. Benchmarking state-of-the-
art deep learning software tools, in: 2016 7th International Confer-
Danilo De Donno received the B.Sc. and M.Sc.
ence on Cloud Computing and Big Data (CCBD), IEEE. pp. 99–104.
degrees (cum laude) in telecommunication engi-
[30] Vu, T.K., Bennis, M., Debbah, M., Latva-Aho, M., Hong, C.S.,
neering from Politecnico di Milano, Italy, in 2005
2018a. Ultra-reliable communication in 5G mmwave networks: a
and 2008, respectively, and the Ph.D. degree in
risk-sensitive approach. IEEE Communications Letters 22, 708–711.
Information Engineering from the University of
[31] Vu, T.K., Liu, C.F., Bennis, M., Debbah, M., Latva-Aho, M., 2018b.
Salento, Lecce, Italy, in 2012. He was a Post-
Path selection and rate allocation in self-backhauled mmwave net-
Doctoral Fellow with the ElectroMagnetic Lab
works, in: 2018 IEEE Wireless Communications and Networking
Lecce (EML2) of the University of Salento from
Conference (WCNC), IEEE. pp. 1–6.
2012 to 2015 and a Post-Doc Researcher with the
[32] Wang, Y.E., Wei, G.Y., Brooks, D., 2019. Benchmarking TPU,
IMDEA Networks Institute, Madrid, Spain, from
GPU, and CPU platforms for deep learning. arXiv preprint
2015 to 2017. In July 2017, he joined the Huawei
arXiv:1907.10701 .
Research Center in Milan, Italy, as a Wireless Sys-
[33] Yang, G., Xiao, M., Al-Zubaidy, H., Huang, Y., Gross, J., 2018. Anal-
tem Engineer. His areas of interest lie in mm-Wave
ysis of millimeter-wave multi-hop networks with full-duplex buffered
communications, with main research focus on the
relays. IEEE/ACM Transactions on Networking (TON) 26, 576–590.
development of PHY and MAC algorithms for hy-
[34] Yuan, D., Lin, H.Y., Widmer, J., Hollick, M., 2018. Optimal
brid and full-digital system architectures.
joint routing and scheduling in millimeter-wave cellular networks, in: