2021 Comnet Iab-Rl

PRE_PRINT VERSION
Resource Allocation in mmWave 5G IAB Networks: A Reinforcement

Learning Approach based on Column Generation
Bibo Zhanga,∗ , Francesco Devotia,b , Ilario Filippinia and Danilo De Donnoc
a Politecnicodi Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, 20133 Milan, Italy
b NEC Laboratories Europe, Heidelberg, Germany
c Milan Research Center, Huawei Technologies Italia S.r.l., 20147 Milan, Italy
ARTICLE INFO Abstract

Keywords: Millimeter wave (mmWave) communications have been introduced in the 5G standardization pro-
Millimeter-wave Communication cess due to their attractive potential to provide a huge capacity extension to traditional sub-6 GHz
Wireless Access Networks technologies. However, such high-frequency communications are characterized by harsh propagation
IAB Networks conditions, thus requiring base stations to be densely deployed. Integrated access and backhaul (IAB)
Resource Allocation network architecture proposed by 3GPP is gaining momentum as the most promising and cost-effective
Deep Reinforcement Learning solution to this need of network densification.
Long Short-Term Memory (LSTM) IAB networks’ available resources need to be carefully tuned in a complex setting, including direc-
Column Generation tional transmissions, device heterogeneity, and intermittent links with different levels of availability
that quickly change over time. It is hard for traditional optimization techniques to provide alone the
best performance in these conditions. We believe that Deep Reinforcement Learning (DRL) tech-
niques, especially assisted with Long Short-Term Memory (LSTM), can implicitly capture the regu-
larities of environment dynamics and learn the best resource allocation strategy in networks affected
by obstacle blockages. In this article, we propose a DRL based framework based on the Column
Generation (CG) that shows remarkable effectiveness in addressing routing and link scheduling in
mmWawe 5G IAB networks in realistic scenarios.
1. Introduction (IAB)[2]. The idea is to place simpler relay nodes, called

IAB-nodes, in the coverage area of a full-fledged mmWave
The 3GPP 5G standardization has introduced new base station (BS), called IAB-donor, and form a wireless
frequencies above 24 GHz for Radio Access Networks backhaul to forward data packets between the IAB-donor
(RANs), namely Frequency Range 2 (FR2) or millimeter-
and user equipment (UE). The peculiar aspect of this archi-
wave (mmWave) band, as one of the main reliefs from the
tecture is the self-backhauling approach, where both radio
global mobile traffic growth that is challenging the capac- access and wireless backhaul links share the same radio re-
ity of access networks with communication technologies be- sources and interfaces. Therefore, a proper management of
low 6 GHz. Large bandwidths (several hundred MHz) avail- the radio resource allocation is fundamental to operate this
able at those mainly-underutilized spectrum portions have
network and it is carried out by the IAB-donor. In particular,
unleashed plenty of opportunities to deliver the RAN Gbps-
since the proposed media access control (MAC) solution is
throughput promise. based on TDMA, it involves the optimization of the routing
However, this attractive advantage comes at the cost of paths and the scheduling of directional transmissions along
a harsh propagation environment characterized by very high established links.
path losses and no propagation through obstacles, not only Routing and scheduling in wireless multi-hop networks
vehicle and buildings but also human bodies. While direc-
have a long-standing literature focusing on optimization
tional antennas (e.g., phased arrays) can mitigate path losses,
techniques that consider always-available links [34, 3, 5].
although requiring sophisticated hardware and smart beam However, the harsh propagation environment of mmWave
steering procedures, there is very little they can do against re- frequencies and the strong impact of the obstacles on link
current obstacle blockages. Indeed, mmWave deployments availability make these approaches inadequate for mmWave
are typically coverage-limited, thus 5G mmWave access net- IAB networks. Indeed, the optimal performance provided in
works require closer base stations than traditional radio ac-
ideal link conditions can be hindered by their unpredictable
cess networks. This translates into high installation costs for
on-off behavior, thus destroying the advantages of the op-
operators that need to connect many sites with fibers. timization. We could in principle perform an optimization
In order to provide a technically and economically viable each time the network undergoes a change. However, op-
solution to the required network densification, 3GPP release timization algorithms are usually time consuming, which
16 specifications have introduced a new multi-hop wireless makes this solution infeasible for real-time operations.
access architecture, named Integrated Access and Backhaul
We believe Reinforcement Learning (RL) techniques are
∗ Corresponding author the ideal solution for mmWave IAB networks due to the in-
[email protected] (B. Zhang); [email protected] (F. trinsic ability of these algorithms to adapt to environment
Devoti); [email protected] (I. Filippini); conditions. Indeed, RL agents can be trained to play against
[email protected] (D.D. Donno)
ORCID (s):
the environment to understand what the best strategy is, even
B: Zhang et al.: Preprint submitted to Elsevier Page 1 of 19

PRE_PRINT VERSION
Resource Allocation in mmWave 5G IAB Networks
when the environment’s reply is stochastic. We can perform gate on the performance of different distributed greedy hop-
offline training through realistic instances of the actual en- by-hop path selection to the core network. The work in [12]
vironment statistics, or we can have an online training and proposes a routing scheme using multiple overlapping span-
operating system that learns while sending packets to UEs ning trees and schedules transmissions to minimize the end-
through IAB nodes. Once trained, the agent will be able to to-end delay along a subset of paths computed from the rout-
play the strategy and provide the best reward in front of any ing scheme. Authors in [31] study path selection and rate
instance of the random environment. allocation to maximize the network data rate by leveraging
Routing and scheduling in wireless multi-hop networks Lyapunov stochastic optimization.
has been traditionally considered a hard problem due to inter- Some works on routing and scheduling take into ac-
ference constraints. Solely relying on an RL approach may count the status of links (i.e., line-of-sight (LOS) and non-
lead to largely suboptimal working points. This is the rea- LOS (NLOS)). The work in [9] performs a slot-by-slot link
son why we decided to adopt a hybrid approach, which com- scheduling to maximize the instantaneous throughput con-
bines traditional optimization techniques and recent RL ap- sidering the blockage probability in the current slot de-
proaches to synergically provide a quasi-optimal and adap- scribed according to a discrete-time Markov chain. Authors
tive resource allocation algorithm. in [8] present a joint dynamic routing and scheduling policy
In this article, we formally introduce the optimization based on proportional flow delays. Every packet requires a
problem of flow routing and link scheduling in mmWave multi-hop path to reach its destination, which is selected on
IAB networks, jointly coordinating access and backhaul the base of the current network conditions. [11] performs
parts to maximize throughput in a multi-hop network archi- routing and scheduling to improve the end-to-end through-
tecture. We solve the problem with a Column Generation put in the wireless backhaul, targeting at urban environments
(CG)-based approach and leverage its generated variables to and utilizing 3D models of buildings as primary blockage
populate an optimal candidate action set for the RL agent. sources. [22] maximizes the number of protected flows in
Based on these action sets, built of promising scheduling a wireless backhaul by selecting relay nodes to bypass the
options, we design a Deep Reinforcement Learning (DRL) blockages when they occur.
framework, based on Long Short-Term Memory (LSTM) Recent years have also seen a widespread utilization of
neural networks, which can overcome the static limits of the machine learning techniques, such as reinforcement learn-
optimization approach in dynamic environments. We place ing, in mmWave wireless networks. The work in [13] pro-
emphasis on realistic scenarios and unreliable networks that poses spectrum allocation algorithms based on Double Deep
are vulnerable to recurrent and dynamic blockages. We pro- Q-network (DQN) and Actor Critic for IAB networks. [24]
pose an offline and an online training version of the ap- proposes a semi-distributed multi-armed bandit learning al-
proach, which we evaluate against traditional approaches via gorithm to minimize the end-to-end latency in backhaul net-
numerical simulations. Furthermore, we discuss feasibility works, which is proven to be adaptive to load imbalance,
issues in implementing our solution in a real system. channel variations and link failures. Authors in [31] resort
The rest of the paper is organized as follows. We first to regret RL to perform route selection and tackle the prob-
discuss related works in Section 2 and point out the contri- lem of rate allocation by successive convex approximation
butions of this article, then we provide a system overview method. The work in [30] maximizes data rate by control-
in Section 3. In Section 4 we present the formulation of ling transmitter beamwidth and power by using risk-sensitive
the problem based on Column Generation, whose results are RL, while authors in [7] present a DQN based approach
used in Section 5, where our RL-based approach is detailed. to assign backhaul resources to users with blockages. Fi-
The results of the numerical evaluation are showcased and nally, there have been some successful cases of combina-
discussed in Section 6. Finally, Section 7 concludes the pa- tion of DRL and LSTM to perform resource management in
per with some final remarks. wireless networks [14, 21] as we do in this article. These
works show the good performance of RL methods on dy-
namic problems with different targets in various network
2. Related Work scenarios. However, none of them aims to deal with rout-
Resource management in mmWave access networks has ing and link scheduling problem in mmWave IAB networks
been largely investigated in recent literature, taking into ac- and with the dynamics caused by link blockages.
count the new challenges brought in by directional transmis- Despite the significant results that have been achieved
sions compared with the conventional omnidirectionality as- in literature, the efforts made in dealing with the dynamic
sumption of works on sub-6GHz networks. Several papers blockage scenario characterizing IAB networks are far from
have investigated the optimization of bandwidth allocation being enough. Some of the existing works assume static
[6, 18, 27], power allocation [6, 23, 30, 15, 16, 33, 36], blockages for specific propagation scenarios [6] or those
beamwidth assignment [30], frame / slots design [28, 15, caused by urban buildings [11], while others assume simple
34], transmission delay [8, 24, 12]. dynamic blockages [9, 22]. Indeed, these works either use
Among all these works, it draws remarkable attention steady state distribution of the link status to compute an ex-
that a large part of them dedicates to the traffic routing and pected metric, which cannot capture frame-by-frame or slot-
transmission scheduling problem. Authors in [25] investi- by-slot actual link conditions [9], or make decisions based

PRE_PRINT VERSION
on the current blockage situation [22], which only aims to IAB-donor
maximize the instantaneous throughput of the current time IAB-node

UE
slot, without considering the impact on future time slots. Backhaul link
Access link
2.1. Our contribution Pattern 1
Differently from existing works, this article aims to Pattern 2
tackle the problem of joint routing and scheduling in Pattern 3
mmWave IAB networks, with an emphasis on dealing with Pattern 4

Pattern 5
dynamic blockages described by a realistic blockage model
in rapidly changing environments. Extending our previous Figure 1: A toy example of IAB network scenario and five
work [35] that considers a simplified blockage model relying patterns constructed by access and backhaul links.
on a Bernoulli process, we have completely redesigned the
haul), i.e., they interfere with each other. Moreover, we con-
neural network (NN) framework to deal with the temporal
sider here a static 5G Enhanced Mobile Broadband (eMBB)
correlation of the measure-based blockage model developed
use case (like domestic broadband access, high-throughput
in [19]. The main contributions of this article are summa-
gates, digital kiosks, etc.) where UEs can be assumed to be
rized as follows.
static nodes. This is expected to be the first application of
• We propose a hybrid approach where optimization and mmWave IAB networks.
RL techniques jointly contribute to provide an opti- We focus on the downlink traffic flow as it is expected to
mal and adaptive solution to the complicated problem play the most important role in such networks, leaving the in-
of routing and link scheduling in mmWave IAB net- vestigation on the uplink transmission to a future work. The
works. main performance figures we will analyze are the average
UE throughput, in terms of number of data bits transferred
• We consider a very realistic scenario, and specifically
from the IAB-donor to UEs, and the service coverage, which
design an approach tailored to it, in which random
we express as the percentage of UEs in the service area ex-
blockages remarkably impact on the resource alloca-
perimenting a non-null throughput.
tion in mmWave IAB networks.
The network can be represented as a directed graph
• We provide a CG-based formulation of the problem (, ), where  denotes the index set of nodes including
that can be used both as a reference benchmark for IAB-donor, IAB-nodes and UEs, and  includes all the po-
optimality and as a support for our DRL framework. tential links connecting the nodes in . If not specified,
• We develop a DRL framework based on LSTM to IAB-donor is deemed as a special IAB-node. Hence, the
capture the regularity of environment dynamics so as node set  is divided into two subsets, namely, IAB-node
to better adapt to the changing conditions. Its per- set  ⊂  and UE set  ⊂ . Without loss of general-
formance advantages with respect to traditional opti- ity, IAB-donor is regarded as the 0-th IAB-node and the re-
mization approaches clearly emerge from a numerical maining subset of IAB-nodes is represented via their indices
evaluation. 𝑠𝑢𝑏 = 1, 2, … , || − 1.
MmWave 5G access networks are based on a time-
• We address implementation and feasibility issues of
division multiplexing / time-division multiple access (TDM
the proposed DRL framework.
/ TDMA) resource sharing where each frame involves 𝑇 ∈ ℕ
• We implement both offline and online DRL ap- slots with equal duration 𝛿 in a time domain  . IAB-nodes
proaches to discuss strengths and weaknesses of an transmit in these slots achieving a rate that depends on the
online NN parameters’ update rather than only an ini- signal-to-interference-plus-noise ratio (SINR) available at
tial offline NN training. receivers, thus an interference coordination approach must
be adopted not only to activate a sequence of backhaul links
3. System Overview to transport traffic through IAB-nodes, but also to schedule
access links in the same frame backhaul links are scheduled.
In this article, we consider a mmWave 5G access net- In accordance with this premise, IAB networks implement
work featuring an IAB architecture. It consists of a multi- a space-division multiplexing (SDM) approach on top of
hop mmWave wireless network with a gNodeB (IAB-donor), TDM / TDMA to take advantage of the high directivity of
which is directly connected to the core network via a high ca- mmWave antennas, allowing multiple concurrent transmis-
pacity link (i.e., fiber), a set of self-backhauled IAB-nodes sions in each slot. This results in a frame composed of a slot-
that act as relay nodes for the user traffic to / from the by-slot sequence of sets of links simultaneously activated,
IAB-donor, and user equipment (UE) that can reach IAB- i.e., a sequence of link patterns, that satisfy channel con-
donor either via direct links or through multi-hop IAB-node ditions (e.g., interference requirements, antenna patterns),
paths. The backhaul links between IAB-donor and IAB- half-duplex constraints, multi-beam features, power limits,
node, or IAB-node and IAB-node, and the access links be- etc. We will describe them in detail in Section 4. Figure 1
tween IAB-donor and UEs, or IAB-nodes and UEs, are wire- depicts an example of a network scenario with five possible
less and share the same frequency bands (i.e., in-band back- link patterns. Note that a link pattern can include both access

PRE_PRINT VERSION
and backhaul links. The optimal sequence of activated link Table 1

patterns allows to maximize the number of data bits trans- Summary of notations in Section 4.1
ferred from the IAB-donor to UEs, which is equivalent to
Parameter Definition
maximize the downlink throughput of the IAB network.
,  Set of nodes and set of potential links
Frequent obstacle obstructions strongly characterize ,  Set of IAB-nodes and set of UEs
mmWave links, therefore a dynamic and realistic blockage , 𝑅 Pattern set and restricted pattern set
𝐶𝐺 Set of patterns obtained from CG pricing
model has been introduced into our system. When a link is 𝑇, 𝛿 Number of slots in a frame and the length of a slot
blocked by an obstacle, no data can be transferred. How- 𝑒
𝐵𝐵𝐻(𝑖,𝑗) Amount of bits delivered over backhaul link (𝑖, 𝑗) in pattern 𝑒
𝑒
ever, IAB-nodes are expected to be installed at relatively 𝐵𝐴𝐶𝐶(𝑖,𝑢) Amount of bits delivered over access link (𝑖, 𝑢) in pattern 𝑒
𝜌𝑚𝑖𝑛 Minimum data rate required for each user
high places (e.g., lamp posts, roof tops, etc.) to improve vis- 𝐶𝑑𝑜𝑛𝑜𝑟 Capacity value used to assign IAB-donor role in IAB-nodes
ibility and avoid tampering, therefore we can expect few ob- Variable Definition
stacles to exist there, hence it is less likely for backhaul links 𝜆𝑒 Number of slots in which pattern 𝑒 is activated
to be blocked. In contrast, access links are exposed to more 𝐵𝐻
𝑓𝑖,𝑗,𝑘 Flow bits from 𝑖 to 𝑗 directed to 𝑘 (𝑖, 𝑗, 𝑘 ∈ )
recurrent blockages caused by nomadic obstacles, like those
𝐴𝐶𝐶
𝑓𝑖,𝑢 Flow bits from 𝑖 ∈  to UE 𝑢 ∈ 
𝐵𝐻
𝑓𝐶𝑁,𝑟,ℎ Flow bits from core to IAB-donor 𝑟 directed to ℎ ∈ 
produced by pedestrian and transportation traffic. Based on
this observation, we realistically apply blockages only to ac-
number of bits in each slot requires to reduce the impact of
cess links1 . In particular, we adopt the measurement-based
interference, thus to minimize the transmission power for ev-
signal fading model analyzed in [19], which characterizes
ery established link; (2) the ability of our RL approach to
various blockages caused by pedestrian crowds in New York
learn of potentially obstructed links allows not to waste en-
to provide a stochastic blocked duration of a link. A binary
ergy on transmitting data that will not be correctly decoded
semi-Markov link-blockage model is proposed in that work
at the receiver due to the blockage.
where the blocked duration of each access link follows a log-
normal distribution. Motivated by the Poisson nature of ob-
stacle blockage events, the non-blocked (available) duration 4. Optimization Approach to Resource
of a link follows a negative exponential distribution. The Allocation
considered probability density functions are shown in (1)
From an optimization standpoint, the problem of max-
where 𝜇 and 𝜎 are respectively mean and standard deviation
imizing the performance of a mmWave IAB network can
of the blocked-link time duration and 𝜆 is the obstacle block-
be reduced to a variant of the traditional problem of re-
age event arrival rate. To coordinate with a time-slotted sys-
source optimization in wireless multi-hop networks [5], in
tem, the time spans of both blocked and not-blocked phases
which flow routing and link scheduling are jointly optimized
(𝑡B and 𝑡NB ) are rounded up and expressed in number of slots.
while ensuring fairness among UEs and meeting mmWave-
specific physical requirements, such as half-duplex, simul-
1 −(ln 𝑡B −𝜇)2 ∕2𝜎 2 taneous multiple beams, directional interference, and trans-
𝑓 (𝑡B ) = √ 𝑒
𝑡B 𝜎 2𝜋 (1) mission power limitations.
𝑓 (𝑡NB ) = 𝜆𝑒−𝜆𝑡NB
4.1. Pattern-based Formulation
Resource allocation in wireless networks has been tra- Link-based decision variables are usually considered
ditionally carried out via optimization-based approaches, in optimization problem formulations, often leading to in-
which can provide optimal solutions in quasi-static scenar- tractable mathematical programs that can be solved in rea-
ios. When facing the dynamic environment of mmWave ac- sonable time only for instances of very limited size. To over-
cess networks (e.g., frequent topology changes due to recur- come this limitation, a different set of decision variables can
ring link blockages), it is infeasible to resort to optimization be defined. The idea of link patterns introduced in Section 3
methods to compute a near-optimal scheduling scheme on- can be leveraged and decision variables can be set to rep-
the-fly, due to their time-consuming algorithms. To tackle resent those compatible sets of links that can be simulta-
the complicated blockage scenario described above, RL can neously activated meeting both hardware specifications and
be the ideal solution to capture the intrinsic regularities of SINR requirements. We denote with  the set of all the po-
the dynamic environment and learn how to provide a ro- tential patterns, and associate an integer decision variable 𝜆𝑒
bust network schedule well performing in realistic scenarios. to each pattern 𝑒 ∈ . The value of 𝜆𝑒 denotes the number
However, RL can get stuck into local optima. Therefore, we of slots in which pattern 𝑒 is activated. Considering that only
present next our hybrid approach that joins together advan- one pattern can be activated per time slot, the overall num-
tages of both optimization and RL techniques. ber of times all patterns are activated provides the number
Although indirectly, we believe our approach can play a of occupied slots. Resorting to pattern-based decision vari-
part in improving the network energy efficiency. Indeed, (1) ables also has the advantage to allow to separate the routing
the generation of optimized link patterns to transfer a large and scheduling problem from the link coexistence problem
1 Note that this is not a limitation of our scenario, but rather an effort to
originated from physical constraints. Indeed, routing and
make it more realistic. Indeed, backhaul link blockages can be straightfor- scheduling needs only to be fed up with feasible patterns,
wardly included in the approach if the specific use case needs it. regardless the assumptions and the procedures patterns are

PRE_PRINT VERSION
generated with. sequence of activated link patterns defines a total number of

We define flow variable 𝑓𝑖,𝑗,𝑘
𝐵𝐻 as the total number of bits bits in a frame transmitted along each link, and this num-
in a frame of the flow from IAB-node 𝑖 to IAB-node 𝑗 di- ber must be larger than or equal to the one indicated by flow
rected to IAB-node 𝑘 (hence 𝑓𝑖,𝑗,𝑗
𝐵𝐻 denotes the flow directly variables to obtain a consistent solution. This aspect is en-
from IAB-node 𝑖 to IAB-node 𝑗) and 𝑓𝑖,𝑢 𝐴𝐶𝐶 as the number of forced by constraint (2f) for backhaul links and (2g) for ac-
bits in a frame sent over the access link between IAB-node cess links. Finally, constraint (2h) states that the number of
𝑖 and UE 𝑢. Then, the mixed-integer linear programming activated pattern 𝑒, each considered with the multiplicity 𝜆𝑒 ,
(MILP) formulation for the resource allocation problem is must not exceed the frame length 𝑇 .
given by:
Optimization objective (2a)-(2b) Since mmWave access
∑
max 𝐴𝐶𝐶
𝑓𝑟,𝑢 , (2a) networks are envisioned to provide very high throughput
𝑟∈,𝑢∈ to the users in the service area, we jointly optimize routing
∑ and scheduling to maximize the total flow of bits received
𝑠.𝑡. 𝐴𝐶𝐶
𝑓𝑟,𝑢 ⩾ 𝜌𝑚𝑖𝑛 𝑇 𝛿, ∀𝑢 ∈  , (2b)
by UEs in a frame, as indicated in the objective function
𝑟∈
∑ ∑ (2a). However, a mere throughput maximization often
𝐵𝐻
𝑓𝑛,𝑟,𝑟 𝐵𝐻
+ 𝑓𝐶𝑁,𝑟,𝑟 = 𝐴𝐶𝐶
𝑓𝑟,𝑢 , ∀𝑟 ∈ , (2c) leads to solutions that prioritize some well-positioned UEs,
𝑛∈∶
𝑛≠𝑟
𝑢∈
excluding many others from the service. In order to avoid
∑ ∑ such an undesirable behavior, we set a min-rate constraint
𝐵𝐻 𝐵𝐻 𝐵𝐻
𝑓𝑛,𝑟,ℎ + 𝑓𝐶𝑁,𝑟,ℎ = 𝑓𝑟,𝑚,ℎ , ∀𝑟, ℎ ∈  ∶ 𝑟 ≠ ℎ, (2b) to guarantee fairness among users, where 𝜌𝑚𝑖𝑛 is the
𝑛∈∶ 𝑚∈∶
𝑛≠𝑟 𝑚≠𝑟 minimum required data rate that each user has to achieve
(2d) and 𝛿 is the slot temporal duration.
{
∑ 𝐶𝑑𝑜𝑛𝑜𝑟 𝑟 = 0
𝐵𝐻
𝑓𝐶𝑁,𝑟,ℎ ≤ , ∀𝑟 ∈ , (2e) Optimally solving the pattern-based problem formula-
0 𝑟≠0
ℎ∈ tion presented above requires us to provide as input the
∑ ∑
𝑒
𝐵𝐵𝐻(𝑖,𝑗) 𝜆𝑒 ≥ 𝐵𝐻
𝑓𝑖,𝑗,ℎ , ∀𝑖, 𝑗 ∈  ∶ 𝑖 ≠ 𝑗, (2f) whole set of possible patterns that can be activated. How-
𝑒∈ ℎ∈ ever, this set has a cardinality that increases exponentially
∑ with the number of links, thus creating a formulation with
𝑒 𝐴𝐶𝐶
𝐵𝐴𝐶𝐶(𝑖,𝑢) 𝜆𝑒 ≥ 𝑓𝑖,𝑢 , ∀𝑖 ∈ , 𝑢 ∈  , (2g)
𝑒∈
a huge number of variables, still making the problem
∑ potentially intractable. In order to solve this issue, the
𝜆𝑒 ≤ 𝑇 , (2h)
Column Generation (CG) technique can be applied.
𝑒∈
In CG, only a subset of 𝜆𝑒 variables is considered at the
𝐵𝐻
𝑓𝐶𝑁,𝑖,𝑘 𝐵𝐻
, 𝑓𝑖,𝑗,𝑘 𝐴𝐶𝐶
, 𝑓𝑖,𝑢 ∈ ℝ+ , ∀𝑖, 𝑗, 𝑘 ∈ , 𝑢 ∈  , (2i) beginning. Denoting the formulation (2) as Master Problem
𝜆𝑒 ∈ ℤ+ , ∀𝑒 ∈ . (2j) (MP), we refer to Restricted Master Problem (RMP) to indi-
cate formulation (2) with a restricted set of pattern variables
We refer to Table 1 for the complete list of parameters and 𝜆𝑒 ∶ 𝑒 ∈ 𝑅 ⊂ , where 𝑅 is a restricted pattern set. The
variables. solution of the RMP provides a selection of link patterns,
namely a frame, that is optimal for RMP, but may be not for
Routing constraints (2c)-(2e) Traffic originates from the the original MP, as only a subset of variables (patterns) is
core network and flows to UEs via the IAB-donor and the considered. We need a procedure, namely Column Genera-
IAB-nodes. We denote as 𝑓𝐶𝑁,𝑟,ℎ𝐵𝐻 the core-network flow tion, to check whether the solution obtained is also optimal
entering IAB-donor and directed to IAB-node ℎ. A core- for the original MP or to find out the variables (patterns) to
network flow is allowed only at node with a non-null ca- be included into the pattern set to further improve the solu-
pacity 𝐶𝑑𝑜𝑛𝑜𝑟 . If node 𝑟 is an IAB-donor, namely 𝑟 = 0, tion. Every time a new objective-improving pattern is found,
𝐵𝐻
𝑓𝐶𝑁,𝑟,ℎ > 0; otherwise, 𝑓𝐶𝑁,𝑟,ℎ
𝐵𝐻 = 0. This is enforced by it is added to the set of available patterns, and the process it-
constraint (2e). erates until no improving pattern can be generated. At the
Flow balance equation (2c) states that the incoming flow end of the CG process, the final obtained pattern set 𝐶𝐺
to a destination IAB-node 𝑟 from both the other IAB-nodes provides a good set of candidates to solve MP. The practice
and the core network must equalize the flow sent to all the also shows 𝐶𝐺 to be of limited size with respect to the set
UEs served by 𝑟. Similarly, enforced by (2d), the incoming of all possible link patterns.
flow to an intermediate IAB-node 𝑟 along the path to IAB- The set 𝐶𝐺 includes link patterns built considering
node ℎ must equalize to the outgoing flow directed to ℎ. nodes’ hardware constraints, SINR thresholds, channel and
antenna gains, power levels, etc., under the assumption that
Scheduling constraints (2f)-(2h) The flows defined by links are always available, thus no random obstacle is consid-
the previous set of constraints must be supported by the ered. From a point of view, this is a limitation, as scheduled
scheduling of proper link patterns. The activation of link patterns can incur link blockages, limiting the expected bit
pattern 𝑒 provides a number of bits to be transmitted over transfer of the current and next slots. However, set 𝐶𝐺 in-
each link in the single slot given by parameters 𝐵𝐵𝐻(𝑖,𝑗)
𝑒 and cludes the best mix of link activations to support the optimal
𝐵𝐴𝐶𝐶(𝑖,𝑢) for, respectively, backhaul and access links. The
𝑒

PRE_PRINT VERSION
Table 2 dual constraint is violated, the considered variable has a pos-

Summary of notations in Section 4.2 itive reduced cost and therefore can produce an improvement
in the objective function if it is added to the set of the con-
Parameter Definition
sidered variables. Indeed, this inclusion can potentially pro-
𝐵𝐻 , 𝐴𝐶𝐶 Sets of backhaul and access links
𝐵𝐻 , 𝐴𝐶𝐶 Sets of MCSs for backhaul and access vide a positive contribution to the object of the maximiza-
𝑅𝐵𝐻
𝑚 , 𝑅𝑚
𝐴𝐶𝐶 Bitrates with MCS 𝑚 for backhaul and access tion. Therefore, the aim of the pricing procedure is to gen-
𝛾𝑚𝐵𝐻 , 𝛾𝑚𝐴𝐶𝐶 SINR thresholds of MCS 𝑚 for backhaul and access
𝐺 𝐵𝐵𝐵𝐵 ,𝐺 𝐵𝐵𝐵𝐴 Channel gain from backhaul to backhaul and access
erate a new feasible pattern (a new variable) with a positive
𝐺𝐵𝐴𝐵𝐵 , 𝐺𝐵𝐴𝐵𝐴 Channel gain from access to backhaul and access reduced cost such that the related dual constraint is violated.
𝜂𝐼𝐴𝐵 , 𝜂𝑈 𝐸
𝐶𝑂𝑉 𝐵𝐻
Noises at IAB-node and UE receiver
Coverage matrix of backhaul
The variable associated to this pattern must then be added to
𝐶𝑂𝑉 𝐴𝐶𝐶 Coverage matrix of access lin-MP, the pattern included in the link-pattern set, and lin-
𝑖
𝑇𝑋
𝐾𝑖,𝑝
Set of panels of IAB-node 𝑖
Multi-beams at panel 𝑝 of node 𝑖 for transmission
MP solved again. Then, the generation repeats. The genera-
𝑅𝑋
𝐾𝑖,𝑝 Multi-beams at panel 𝑝 of node 𝑖 for reception tion stops when the optimal solution of lin-MP is achieved,
𝐵𝐻
𝑇 𝑥𝑃𝑖,𝑗 Panel ID of node 𝑖 used on backhaul link (𝑖, 𝑗) that is, when no pattern can be built such that the related dual
𝐴𝐶𝐶
𝑇 𝑥𝑃𝑖,𝑢
𝐵𝐻
Panel ID of node 𝑖 used on access link (𝑖, 𝑢)
constraint is violated.
𝑅𝑥𝑃𝑖,𝑗 Panel ID of node 𝑗 from which node 𝑖 receives
𝑀𝐴𝑋
𝑃𝑖,𝑝 Total power available at panel 𝑝 of node 𝑖 Consider formulation (2). Denoting with 𝛽𝑖,𝑗 the dual
𝑀𝐼𝑁
𝑃𝑖,𝑝 Minimum activation power at panel 𝑝 of node 𝑖 variable related to constraint (2f), 𝛼𝑖,𝑢 the dual variable re-
Variable Definition lated to constraint (2g), and 𝜓 the dual variable related to
𝛽𝑖,𝑗 , 𝛼𝑖,𝑢 , 𝜓 Dual variables w.r.t. constraints 2f, 2g, 2h constraint (2h), the dual constraint associated to a given pat-
𝑧𝐵𝐻
𝑖,𝑗,𝑚 Whether to activate backhaul link (𝑖, 𝑗) with MCS 𝑚 tern 𝑒 is
𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚 Whether to activate access link (𝑖, 𝑢) with MCS 𝑚 ∑ ∑
𝑝 , 𝑝𝐴𝐶𝐶
𝐵𝐻 Transmission power allocated to backhaul and access 𝑒
𝐵𝐵𝐻(𝑖,𝑗) 𝛽𝑖,𝑗 + 𝑒
𝐵𝐴𝐶𝐶(𝑖,𝑢) 𝛼𝑖,𝑢 −𝜓 ≤ 0, (3)
𝑅𝑋
𝑏𝑖 Whether IAB-node 𝑖 is in reception
(𝑖,𝑗)∈𝐵𝐻 (𝑖,𝑢)∈𝐴𝐶𝐶
𝑏𝑝−𝑇
𝑖,𝑝
𝑋
Whether panel 𝑝 of IAB-node 𝑖 is in transmission
𝑏𝑇𝑖 𝑋 Whether at least a panel of node 𝑖 is in transmission
where 𝐵𝐻 = {(𝑖, 𝑗) ∈  ×  ∶ 𝑖 ≠ 𝑗} and 𝐴𝐶𝐶 =
{(𝑖, 𝑢) ∈  ×  } are the sets of wireless backhaul links and
routing and scheduling enforced by MP. Therefore, among
access links, respectively. Note that we have considered only
all possible link patterns, 𝐶𝐺 forms the most promising set
dual variables associated to constraints where primal pattern
of actions that an RL agent can play to achieve a good perfor-
variable 𝜆𝑒 does appear, as this is the primal variable we need
mance. Note that set 𝐶𝐺 depends only on hardware config-
to generate. We refer to Table 2 for the complete list of pa-
urations and nodes locations, thus can be computed a priori
rameters and variables.
once the IAB network is deployed: it is not an output of the
To solve the pricing problem we must look for a new pat-
RL process, but rather an input that defines its action space
tern 𝑒. We express with binary variables 𝑧𝐵𝐻𝑖,𝑗,𝑚 and 𝑤𝑖,𝑢,𝑚 that
𝐴𝐶𝐶
and must be computed once for all.
take value 1 when deciding to activate in pattern 𝑒 backhaul
Given the above considerations, we can use the CG ap-
link (𝑖, 𝑗) and access link (𝑖, 𝑢) with Modulation and Coding
proach to achieve a twofold result:
Scheme (MCS) 𝑚, respectively. The new pattern must sat-
1. Solving MP using formulation (2) with the link- isfy feasibility constraints (namely, hardware constraints and
pattern set 𝐶𝐺 to obtain a quasi-optimal solution of SINR thresholds described in Section 4.2.1) and, in order to
instances of any size, whose results will be the main find a new pattern that violates the dual constraint (3), we
benchmark to be compared against those achievable must maximize the following quantity:
with the RL approach we propose in this article. ∑ ∑
𝑅𝐵𝐻 𝑖,𝑢,𝑚 − 𝜓, (4)
𝐵𝐻
2. Exploiting CG to be a link-pattern generator that cre- 𝑚 𝛿𝛽𝑖,𝑗 𝑧𝑖,𝑗,𝑚 + 𝑅𝐴𝐶𝐶
𝑚 𝛿𝛼𝑖,𝑢 𝑤𝐴𝐶𝐶
(𝑖,𝑗)∈𝐵𝐻 , (𝑖,𝑢)∈𝐴𝐶𝐶 ,
ates the action space of our RL approach, in which we 𝑚∈𝐵𝐻 𝑚∈𝐴𝐶𝐶
will learn the best sequence of link-patterns (among
those in 𝐶𝐺 ) to apply in a dynamic environment to where 𝐴𝐶𝐶 and 𝐵𝐻 are, respectively, the set of MCSs
pursue a fair user throughput maximization as in MP. available for access and backhaul links, and 𝑅𝐴𝐶𝐶
𝑚 and 𝑅𝐵𝐻𝑚
are the bitrates achievable with the 𝑚-th MCS in the set of
4.2. Link Pattern Generation those available for access and backhaul links, respectively.
The generation of new promising link patterns in the CG The pricing guarantees that, if a solution with a positive
approach is based on the assessment of the new pattern’s po- objective function value is found, the dual constraint is vio-
tential to improve the MP’s objective function. This proce- lated and the pattern must be added to the set of those avail-
dure is called pricing and relies on the continuous relaxation able. Thus, we should add a new pattern 𝑒𝑛 to  by setting
of the original MP that we name as lin-MP, and in particular, 𝑒𝑛 ∑
𝐵𝐵𝐻(𝑖,𝑗) = 𝑅𝐵𝐻 𝐵𝐻
𝑚 𝛿 𝑧𝑖,𝑗,𝑚 ∀(𝑖, 𝑗) ∈ 
𝐵𝐻
, (5)
on the dual variables of lin-MP.
𝑚∈𝐵𝐻
We recall that, given a solution of a primal problem lin- 𝑒𝑛 ∑
MP, if the dual variables related to such solution are fea- 𝐵𝐴𝐶𝐶(𝑖,𝑢) = 𝑅𝐴𝐶𝐶
𝑚 𝛿 𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚 ∀(𝑖, 𝑢) ∈ 𝐴𝐶𝐶 . (6)
sible for lin-MP’s dual problem, then the given primal so- 𝑚∈𝐴𝐶𝐶
lution is optimal for lin-MP. Besides, each variable (con- Vice versa, if a negative solution or no solution is found,
straint) of lin-MP is associated to a constraint (variable) of we can certify that no other patterns would improve the ob-
its dual problem. Given a primal variable, if the associated jective function value. Therefore, the CG process can stop.

PRE_PRINT VERSION
Alternatively, we can put a maximum number of CG itera-

tions to arbitrarily approximate the optimal set of patterns.
As a final step, when an optimal set of available patterns
𝐶𝐺 has been found for lin-MP, the original problem MP
can be solved removing the continuous relaxation and con-
sidering  = 𝐶𝐺 . This final integer solution will provide a
heuristic result. Indeed, 𝐶𝐺 is not guaranteed to be optimal
for the original problem MP as the generated pattern set is Figure 2: Channel gain model. Blue and green triangles rep-
optimal just for lin-MP, the continuous relaxation, and may resent the transmitter and receiver IAB nodes which are con-
not be so for the integer version. However, many works in nected with backhaul links shown as blue arrows. Purple dots
literature [4, 5, 17] show that results very close to the op- are the UEs served with access links marked by purple arrows.
timum can be achieved and a small gap can be obtained in The dashed black arrows indicate the channel gains.
very short execution time. We present now the constraints
we must consider to provide a link pattern that is feasible to IAB-node 𝑗 and 𝑞 receives from IAB-node 𝑟. 𝐺𝑖→𝑢,𝑟→𝑞 𝐵𝐴𝐵𝐴
from a technological point of view. is the channel gain between IAB-node 𝑖’s transmitter and
UE 𝑞’s receiver when 𝑖 transmits to UE 𝑢 and 𝑞 receives
4.2.1. Link pattern constraints from IAB-node 𝑟. Figure 2 illustrates the considered chan-
During the CG process, we generate admissible patterns nel gain model. Channel gain matrices, together with the
by considering all the technological and practical aspects assumption that transmitting nodes point their antennas to-
arising when we activate simultaneous links in a mmWave wards their intended receivers, allow us to provide a com-
IAB scenario, such as the channel model, the SINR values plete description of the propagation conditions that the play-
required to activate specific MCSs, the availability of power ers of the considered scenario can meet. Note that these ma-
control to reduce the interference impact, half-duplex con- trices can be computed before the optimization, relying on
straints, etc. These aspects also include how devices are en- both statistical models and measurement campaigns. Also,
gineered. Details like the number of antenna panels and the this matrix-based approach makes the model independent of
number of beams that can be simultaneously activated by the any fine technological and propagation detail, which can be
panel in the pattern must be taken into account. The main computed offline and directly included in the matrices.
sets of constraints necessary to capture these aspects are de- Based on the above-mentioned channel gains, SINR con-
scribed in the following. straints for backhaul links and access links can be written as
in Equation (7a), for IAB-node receivers, and Equation (7b),
SINR constraints SINR constraints are fundamental to im- for UEs. The left-hand side of each equation expresses the
plement a realistic MCS selection when a link is activated. power available at receivers of the link (from IAB-node 𝑖
We define two SINR thresholds 𝛾𝑚𝐴𝐶𝐶 and 𝛾𝑚𝐵𝐻 to indicate to IAB-node 𝑗 or from IAB-node 𝑖 to UE 𝑢), where 𝑝𝐵𝐻 𝑖,𝑗
the SINR values necessary to activate MCS 𝑚 over an ac- and 𝑝𝐴𝐶𝐶 respectively refer to the transmission power from
𝑖,𝑢
cess and a backhaul link, respectively. We compute received IAB-node 𝑖 to IAB-node 𝑗 or to UE 𝑢. The right-hand side
powers using a channel gain matrix model, in which the includes the total interference power from other simultane-
transmission power of node 𝑖 is multiplied by the channel ously active links and the noises at IAB-node and UE re-
gain matrix’s element (𝑖, 𝑗) to provide the received power at ceiver, respectively, 𝜂𝐼𝐴𝐵 and 𝜂𝑈 𝐸 . SINR conditions must
node 𝑗. Channel gain includes not only path loss but also be satisfied considering thresholds 𝛾𝑚𝐵𝐻 and 𝛾𝑚𝐴𝐶𝐶 associated
hardware- and antenna-related gain value. Since mmWave to the MCS 𝑚. Note that the 𝑀(⋅) term derives from a Big-
transmissions are highly directional, the antenna’s pointing M linearization technique that deactivates SINR constraints
direction is important for computing correct received power over links that are not selected to be active with the MCS 𝑚
and interference and must appear in channel gain matrices. (or not active at all), namely 𝑧𝐵𝐻𝑖,𝑗,𝑚 = 0 and 𝑤𝑖,𝑢,𝑚 = 0.
𝐴𝐶𝐶
We introduce four channel gain matrices: 𝐺𝑖→𝑗,𝑟→𝑛𝐵𝐵𝐵𝐵 ,
𝐵𝐴𝐵𝐵 , 𝐺𝐵𝐵𝐵𝐴 and 𝐺𝐵𝐴𝐵𝐴 . The first two matrices ex-
𝐺𝑖→𝑢,𝑟→𝑛 𝑖→𝑗,𝑟→𝑞 𝑖→𝑢,𝑟→𝑞
Coverage constraints IAB-node 𝑖 can transmit to UE 𝑢 or
press the channel gains between two IAB-nodes. 𝐺𝑖→𝑗,𝑟→𝑛
𝐵𝐵𝐵𝐵 to another IAB-node 𝑗 only if node 𝑖 can cover 𝑢 or reach
represents the channel gain between transmitter 𝑖 and re- 𝑗. The coverage is evaluated by checking whether the low-
ceiver 𝑛 when 𝑖 transmits to IAB-node 𝑗 and 𝑛 receives est access or backhaul MCS can be achieved over the link
from IAB-node 𝑟. The specification of nodes 𝑗 and 𝑟 al- activated in isolation. If such an MCS cannot be achieved,
lows to properly compute channel gains, considering an- the corresponding elements in the binary matrices 𝐶𝑂𝑉𝑖,𝑢
𝐴𝐶𝐶
tenna directionality and different pointing directions. Sim- and 𝐶𝑂𝑉𝑖,𝑗 are set to 0. They take value 1 otherwise. The
𝐵𝐻
ilarly, 𝐺𝑖→𝑢,𝑟→𝑛
𝐵𝐴𝐵𝐵 indicates the channel gain between IAB- constraints are:
node 𝑖 and IAB-node 𝑛 when 𝑖 transmits to UE 𝑢 and 𝑛 re-
𝑧𝐵𝐻 𝐵𝐻
𝑖,𝑗,𝑚 ≤ 𝐶𝑂𝑉𝑖,𝑗 , ∀(𝑖, 𝑗) ∈ 𝐵𝐻 , 𝑚 ∈ 𝐵𝐻 , (8a)
ceives from IAB-node 𝑟. The other two matrices express
the channel gains between IAB-nodes and UEs. Specif- 𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚 ≤ 𝐴𝐶𝐶
𝐶𝑂𝑉𝑖,𝑢 , ∀(𝑖, 𝑢) ∈ 𝐴𝐶𝐶
,𝑚 ∈  𝐴𝐶𝐶
. (8b)
ically, 𝐺𝑖→𝑗,𝑟→𝑞
𝐵𝐵𝐵𝐴 indicates the channel gain between IAB-
Note that these constraints are not strictly required, as previ-
node 𝑖’s transmitter and UE 𝑞’s receiver when 𝑖 transmits
ous SINR constraints prevent out-of-coverage links from be-

PRE_PRINT VERSION
⎛ ⎞
( ) ∑ ∑
𝐵𝐻 ⎜ 𝐴𝐶𝐶 𝐵𝐴𝐵𝐵 ⎟
𝑝𝐵𝐻
𝑖,𝑗
𝐺 𝐵𝐵𝐵𝐵
𝑖→𝑗,𝑖→𝑗
+ 𝑀 1 − 𝑧 𝐵𝐻
𝑖,𝑗,𝑚
≥ 𝛾 𝑚 ⎜ 𝐼𝐴𝐵
𝜂 + 𝑝 𝐵𝐻 𝐵𝐵𝐵𝐵
ℎ,𝑘
𝐺ℎ→𝑘,𝑖→𝑗
+ 𝑝 ℎ,𝑢
𝐺ℎ→𝑢,𝑖→𝑗 ⎟
, ∀(𝑖, 𝑗) ∈ 𝐵𝐻 , 𝑚 ∈  (7a)
⎜ ℎ,𝑘∈∶ ℎ∈,𝑢∈ ⎟
⎝ 𝑘≠𝑗 ⎠
⎛ ⎞
( ) ⎜ ∑ ∑
𝐵𝐴𝐵𝐴 ⎟
𝑝𝐴𝐶𝐶
𝑖,𝑢
𝐵𝐴𝐵𝐴
𝐺𝑖→𝑢,𝑖→𝑢 +𝑀 1− 𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚
≥ 𝛾𝑚𝐴𝐶𝐶 ⎜𝜂 𝑈 𝐸 + 𝑝𝐵𝐻
ℎ,𝑘
𝐵𝐵𝐵𝐴
𝐺ℎ→𝑘,𝑖→𝑢 + 𝑝𝐴𝐶𝐶
ℎ,𝑞
𝐺ℎ→𝑞,𝑖→𝑢 ⎟, ∀(𝑖, 𝑢) ∈ 𝐴𝐶𝐶 , 𝑚 ∈  . (7b)
⎜ ℎ,𝑘∈ ℎ∈,𝑞∈ ∶ ⎟
⎝ 𝑞≠𝑢 ⎠
ing activated. However, they remarkably speed up the solu- ters that respectively provide the ID of node 𝑖’s panel to be
tion process by efficiently cutting unfeasible solutions from used for the transmission to IAB-node 𝑗 and UE 𝑢, while
the exploration tree of the optimization solver. 𝑅𝑥𝑃𝑖,𝑗𝐵𝐻 ∈  defines the panel from which node 𝑖 receives
𝑖
node 𝑗’s transmissions. Note that these parameters can be
Half-duplex and multiple beams constraints The most computed a priori as they depend on the nodes’ placement.
common realization of an IAB-node consists of a set of 3
or 4 panels mounted on an node according to a triangle or Power allocation constraints We assume that the power
a square form factor to cover the entire node’s surrounding. assigned to each IAB-node beam can change at every slot to
Given a node 𝑖, each panel 𝑝 in the set of node’s panels, 𝑖 , follow SINR constraints and a per-panel power budget can
is equipped with a number of RF chains that allow the panel be applied on IAB nodes3 . Constraints (10a) and (10b) en-
to transmit up to 𝐾𝑖,𝑝 𝑇 𝑋 or to receive up to 𝐾 𝑅𝑋 simultaneous
𝑖,𝑝 sure that no power can be allocated to a transmission if the
streams (beams) to / from neighboring nodes. If at least one associated link is not activated in the pattern.
link in a panel of node 𝑖 is active in reception, the node 𝑖 is We capture a common feature of real hardware for which
declared active in reception and the binary decision variable the panel transmission power can be tuned only between a
𝑏𝑅𝑋
𝑖 is set to 1, as stated by constraint (9a). Due to the power minimum and maximum value. Constraint (10c) enforces
allocation constraints introduced later, we follow a slightly that the overall power allocated for concurrent transmissions
different approach for transmissions. If at least one link in at the panel 𝑝 of node 𝑖 must not exceed the total power avail-
panel 𝑝 of node 𝑖 is active in transmission, the panel is de- able at that panel, 𝑃𝑖,𝑝
𝑀𝐴𝑋 . Constraint (10d) imposes a min-
clared active in transmission and the binary decision variable imum activation power 𝑃𝑖,𝑝 𝑀𝐼𝑁 to panel 𝑝 of node 𝑖, which
𝑏𝑝−𝑇
𝑖,𝑝
𝑋
is set to 1. Similarly, if at least one panel is active in must be shared among simultaneously activated links.
transmission, the entire node 𝑖 is declared active in transmis- ∑
sion and the binary decision variable 𝑏𝑇𝑖 𝑋 is set to 1. This 𝑝𝐵𝐻
𝑖,𝑗 ≤ 𝑃 𝑀𝐴𝑋
𝐵𝐻 𝑧𝐵𝐻
𝑖,𝑗,𝑚 , ∀(𝑖, 𝑗) ∈ 𝐵𝐻 , (10a)
𝑖,𝑇 𝑥𝑃𝑖,𝑗
behavior is enforced by constraints (9b) and (9c). 𝑚∈
∑
The binary variables allow to enforce an IAB-node to 𝑝𝐴𝐶𝐶 ≤ 𝑃 𝑀𝐴𝑋𝐴𝐶𝐶 𝑤𝐴𝐶𝐶 ∀(𝑖, 𝑢) ∈ 𝐴𝐶𝐶 , (10b)
𝑖,𝑢 𝑖,𝑇 𝑥𝑃𝑖,𝑢 𝑖,𝑢,𝑚 ,
operate in a half-duplex mode, that is, it cannot be active both 𝑚∈
in transmission and in reception in the same slot2 , as ensured ∑ ∑
𝐵𝐻
𝑝𝑖,𝑗 + 𝐴𝐶𝐶 𝑀𝐴𝑋
𝑝𝑖,𝑢 ≤ 𝑃𝑖,𝑝 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , (10c)
by constraint (9d). Finally, according to the current hardware 𝑗∈∶ 𝑢∈ ∶
specifications, we assume that UEs can receive from at most 𝐵𝐻 =𝑝 𝑇 𝑥𝑃 𝐴𝐶𝐶 =𝑝
𝑇 𝑥𝑃𝑖,𝑗 𝑖,𝑢
one IAB-node in each pattern (slot), as stated by constraint ∑ ∑
𝑀𝐼𝑁 𝑝−𝑇 𝑋
(9e). The constraints are defined as follows: 𝑝𝐵𝐻
𝑖,𝑗 + 𝑝𝐴𝐶𝐶
𝑖,𝑢 ≥ 𝑃𝑖,𝑝 𝑏𝑖,𝑝 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 .
∑ 𝑗∈∶ 𝑢∈ ∶
𝐴𝐶𝐶 =𝑝
(9a)
𝐵𝐻 =𝑝
𝑧𝐵𝐻 𝑅𝑋 𝑅𝑋
𝑗,𝑖,𝑚 ≤ 𝐾𝑖,𝑝 𝑏𝑖 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , 𝑇 𝑥𝑃𝑖,𝑗 𝑇 𝑥𝑃𝑖,𝑢
𝑗∈,𝑚∈∶ (10d)
BH =𝑝
𝑅𝑥𝑃𝑖,𝑗
∑ ∑ 𝑝−𝑇 𝑋 𝑇 𝑋
𝑧𝐵𝐻
𝑖,𝑗,𝑚 + 𝑤𝐴𝐶𝐶
𝑖,𝑢,𝑚 ≤ 𝑏𝑖,𝑝 𝐾𝑖,𝑝 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , 5. Adaptive Resource Allocation
𝑚∈,𝑗∈∶ 𝑚∈,𝑢∈ ∶
𝑇 𝑥𝑃 𝐵𝐻 =𝑝 𝑇 𝑥𝑃 𝐴𝐶𝐶 =𝑝 In this section, the basic aspects of deep reinforcement
𝑖,𝑗 𝑖,𝑢
learning (DRL) and recurrent neural network (RNN) are in-
(9b)
troduced. Subsequently, the flow routing and link schedul-
𝑏𝑝−𝑇
𝑖,𝑝
𝑋
≤ 𝑏𝑇𝑖 𝑋 , ∀𝑖 ∈ , 𝑝 ∈ 𝑖 , (9c) ing problem is reformulated as a buckets-pipes game. Then,
𝑏𝑅𝑋 + 𝑏𝑇𝑖 𝑋 ≤ 1, ∀𝑖 ∈ , (9d) based on the patterns provided by the CG method, a DRL-
𝑖
∑ based approach is presented in detail. To adapt the DRL
𝑤𝐴𝐶𝐶
𝑗,𝑢 ≤ 1, ∀𝑢 ∈  , (9e) model to the dynamic IAB scenario, an online framework is
𝑗∈ presented at last and feasibility issues are discussed.
where 𝑇 𝑥𝑃𝑖,𝑗
𝐵𝐻 ∈  and 𝑇 𝑥𝑃 𝐴𝐶𝐶 ∈  are input parame-
𝑖 𝑖,𝑢 𝑖
3 This is a realistic assumption based on current prototype device spec-
ifications. However, other types of power budget model (like per-node or
2 Note that this is a hardware constraint that can be easily removed when per-beam) can be easily included in the formulation with small modifica-
the considered devices can work full-duplex by adopting cancellation tech- tions of the power allocation constraints.
niques to maintain a sufficient level of isolation between transmitting and
receiving panels.

PRE_PRINT VERSION
5.1. Deep Reinforcement Learning and Recurrent IAB-donor

Neural Networks
Reinforcement learning (RL) is widely adopted to opti-
mize decision making and control via sequential interactions IAB-node
between an agent and the environment through accumulated
experience following a trial-and-error strategy. Specifically,
at time step 𝑡, the environment is in the state 𝑆𝑡 , and, condi-
tional on this state, the agent selects an action 𝐴𝑡 according to
the current policy 𝜋 and then executes it in the environment. UE
At time step 𝑡+1, based on its reaction to 𝐴𝑡 , the environment
transits to the state 𝑆𝑡+1 with some probability and gives a Figure 3: Buckets-pipes game formulation. The blue tanks
reward 𝑅𝑡 back to the agent. The agent adjusts the policy 𝜋 with water stored represent IAB-donor and IAB nodes, which
so as to maximize the long-term cumulative reward, namely are connected to UEs’ green buckets with pipes. The thick
∑
the expected return 𝔼𝜋 [𝐺𝑡 ] = 𝔼[ 𝑇𝑘=𝑡+1 𝛾 𝑘−𝑡−1 𝑅𝑘 ]. 𝑇 is the and thin pipes are respectively the equivalents of backhaul and
terminal step in an episode, a basic sequence of interactions access links with different capacities. A pipe can be clogged
between the agent and the environment, and 𝛾 is a discount with obstacles denoted as the lower-left red patch. The water
factor controlling the importance of a future reward to the (data) flows shown in blue dashed lines comply with the rules
current utility. This maximization can be achieved by two derived from Section 4.
categories of algorithms, namely based on value functions plied to sequential data in which the time order is strongly
and based on policy gradient. relevant, e.g., in natural language processing tasks or stock
Value-function approaches pick actions at each state in market predictions. Differently from basic NNs, RNN’s out-
accordance with values estimated via value functions: ei- put at the current step depends not only on the current input
ther state-value functions or action-value functions. A state- but also on a hidden state that stores the relevant informa-
value function 𝑣𝜋 (𝑆𝑡 ) = 𝔼𝜋 [𝐺𝑡 |𝑆𝑡 ] is the expected return tion about the data sequence history, so as to capture time-
from state 𝑆𝑡 , following the policy 𝜋. An action-value func- varying dynamics. However, RNNs only perform well when
tion 𝑞𝜋 (𝑆𝑡 , 𝐴𝑡 ) = 𝔼𝜋 [𝐺𝑡 |𝑆𝑡 , 𝐴𝑡 ] is the expected return from short-term memory is required, due to the vanishing gradient
state 𝑆𝑡 , taking action 𝐴𝑡 and following policy 𝜋 afterwards. problem encountered in the back propagation process. To
The best policy 𝜋 ∗ is the strategy that instructs the agent to address this issue, LSTM [10] network has been proposed to
take the action that leads to the largest value, thus expected allow a NN to embrace both short-term and long-term mem-
future reward, at each state. The value functions can be ex- ory by introducing data processing gates (i.e., forget gate,
pressed in tabular form or being approximated as a function input gate, output gate) to regulate the flow of information
of states and actions, such as linear functions, kernels or deep in a memory cell. LSTM networks have seen many success-
neural networks (DNNs). DRL employs DNNs to approxi- ful wireless network applications based on variable memory
mate value functions. length, sometimes in connection with DRL. The most rele-
Policy gradient approaches directly represent policy vant works are discussed in Section 2.
as a probability function of taking an action at a given The ability of LSTM network to deal in a compact way
state depending on the parameter vector 𝜃, or in other with the history of a data sequence makes it the ideal candi-
words, 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃) = 𝑃 𝑟{𝐴𝑡 |𝑆𝑡 ; 𝜃}. Like value functions, date in building the NNs to approximate actor and critic in
𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃) can also be represented by a DNN where 𝜃 stores the considered scenario. Indeed, since a realistic mmWave
the connection weights. 𝜃 is updated by applying approxi- link blockage behavior exhibits a strong temporal compo-
mate gradient ascent to ∇𝜃 𝔼[𝑅𝑡 ], whose unbiased estimate nent, considering the history of this behavior allows the RL
is ∇𝜃 log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃)𝑅𝑡 . To reduce the variance, a baseline agent to select better link patterns to be activated in each slot.
is often subtracted from the unbiased estimate. A common
choice is to use the estimated state-value function 𝑣(𝑆𝑡 ) as 5.2. Buckets-Pipes Game Formulation
baseline: ∇𝜃 log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃)(𝑅𝑡 − 𝑣(𝑆𝑡 )). In this formula, In order to perform flow routing and link-pattern
𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃) behaves as an actor, indicating to the RL agent scheduling in an RL environment, we reformulate the system
the action to perform, and 𝑣(𝑆𝑡 ) as a critic, providing a model as a game for an RL agent. IAB-donor, IAB-nodes,
quality assessment of the achieved state. Both 𝑣(𝑆𝑡 ; 𝜓) and and UEs are regarded as buckets that store data bits as water.
𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃) can be approximated by a NN. The actor-critic Wireless links act as pipes of different capacities connect-
technique derives from policy gradient methods, but incor- ing these buckets, which are controlled by valves. The link
porates the strengths of value functions. Since 𝑅𝑡 is an pattern activated in a slot corresponds to a group of pipes’
estimate of 𝑞(𝐴𝑡 , 𝑆𝑡 ), the scaling factor 𝑅𝑡 − 𝑣(𝑆𝑡 ) of the valves to be opened, letting the water flow through the pipes.
policy gradient can represent the advantage of taking ac- Wireless links experimenting blockage situations are equiv-
tion 𝐴𝑡 over the other actions at state 𝑆𝑡 and be written as alent to temporarily clogged pipes. The game’s objective is
𝑎(𝐴𝑡 , 𝑆𝑡 ) = 𝑞(𝐴𝑡 , 𝑆𝑡 ) − 𝑣(𝑆𝑡 ). This outlines the definition of to maximize the total amount of water reaching UEs’ buckets
advantage actor critic (A2C) approaches [20]. in a frame. The buckets-pipes game is illustrated in Figure 3.
Recurrent neural networks (RNNs) have been widely ap- Note that in comparison with the optimization approach

PRE_PRINT VERSION
described in Section 4, the optimization objective (2a) cor- 5.3.1. State space, action space, and reward function
responds to the long-term expected return RL aims to max- State Space The factors having an impact on the through-
imize. Also, technological constraints (7), (8), (9), and (10) put in an IAB network mainly consist in two elements:
are already satisfied in the sets of candidate actions (pat- the buffer occupation in each IAB-node (i.e., whether re-
terns) available to the agent. Indeed, it is built from link lay nodes store enough bits to be transferred) and links sta-
patterns generated with the CG approach. The overall flow tus (i.e., whether the links are unobstructed or not). As ex-
routing, pattern scheduling, and provided fairness depend plained in Section 3, backhaul links are hardly exposed to
on which patterns the RL agent selects and how the traffic blockages in practice, therefore we consider in the state def-
flows through the system. The RL agent selects one link inition only the status of access links, which, instead, can
pattern (pipes’ valves) to activate in each slot based on the undergo several blockages. Thus, our state vector is built
action probability given by its action policy, while the data from the concatenation of two components: the vector of the
transmission obeys the following rules. Traffic is buffered number of data bits buffered in each IAB-node (excluding
in queues4 at IAB-nodes, and can be transmitted only to the IAB-donor) and the binary vector representing the blockage
reachable IAB-nodes and UEs (see coverage constraints in status of every access link.
(8)). The total number of bits transferred in a slot from an The buffer-occupation vector at the end of slot 𝑡 is
IAB-node through its outgoing links is limited by the num- an (|𝑠𝑢𝑏 |)-dimensional vector storing the number of bits
ber of bits in its queue, as indicated by flow balance equa- buffered at each IAB-node 𝑛, 𝐵𝑛𝑡 , normalized to the num-
tions (2c) and (2d). Similarly, the maximum number of bits ber of bits that can be transferred in a slot over the link
each link can transmit is limited by its capacity, which is de- with the minimum capacity of the whole network, 𝑐𝑚𝑖𝑛 ⋅ 𝛿.
termined by the activated pattern, according to constraints This normalization allows to shrink the state space so as to
(2f) and (2g). UE throughput fairness (2b) is pursued by facilitate the RL agent’s exploration, thus accelerating the
equally sharing the number of currently buffered data bits convergence of the learning process. Namely, the buffer-
transmitted along multiple links originated at each single occupation vector at step 𝑡 is defined as:
IAB-node, which appear in the same link pattern. ⌊ ⌋
𝐵𝑛𝑡
5.3. Flow Allocation and Pattern Scheduling based 𝑆𝑡 = [𝑠𝑛 ]𝑛∈𝑠𝑢𝑏 , with 𝑠𝑛 =
1 𝑡 𝑡
, (11)
𝑐𝑚𝑖𝑛 ⋅ 𝛿
on LSTM-Assisted A2C
The buckets-pipes game formulated in the previous sec- while the link-blockage vector can be written as
tion forms the environment the RL agent interacts with. Note
that, according to 5G IAB specifications, the IAB-donor is 𝑆𝑡2 = [𝑜𝑡𝑙 ]𝑙∈𝐴𝐶𝐶 , (12)
in charge of managing the entire IAB network, therefore we
can assume this RL agent to be hosted in the IAB-donor and where the value of 𝑜𝑡𝑙 depends on the availability of link 𝑙 in
act as a centralized controller. One slot of the frame is equiv- the current slot 𝑡. Two assumptions about the knowledge on
alent to one step of RL interaction. the link status can be made. In a first more ideal scenario,
The proposed approach resorts to 𝑘-step Advantage Ac- the status of all access links can be monitored slot-by-slot,
tor Critic (A2C) [20] to allocate resources under different which we call fully-observable case, and 𝑜𝑡𝑙 is defined as:
IAB network conditions. This technique is applied to our {
scenario because it can take advantage of both value-based 1, 𝑙 is unblocked,
𝑜𝑡𝑙 = ∀𝑙 ∈ 𝐴𝐶𝐶 . (13)
and policy-gradient approaches and it empirically performs 0, 𝑙 is blocked,
better than other similar approaches such as Asynchronous In a second more realistic scenario, we assume that only
Advantage Actor Critic (A3C) and DQN, as we found out the status of those links appearing in the pattern selected
in our preliminary tests. A2C makes sequential action de- in the current slot 𝑡 can be collected, according to whether
cisions based on the current state of the environment, how- transmissions are successful or not. We call this scenario
ever, it is difficult to select the appropriate action for the next partially-observable case. Based on this consideration, if
slot when drastic changes can occur in networks (e.g., dy- an access link is outside the pattern selected, we deem its
namic link blockages). To address this issue, LSTM network status as "unknown", therefore we define 𝑜𝑡𝑙 as:
is adopted to characterize link status variation regularities
and feed A2C with a processed state description indicating ⎧ 1, 𝑙 is unblocked, inside pattern,
at which point of the repeating history the environment sta- ⎪
tus currently is. 𝑜𝑡𝑙 = ⎨ 0, 𝑙 is blocked, inside pattern, ∀𝑙 ∈ 𝐴𝐶𝐶 . (14)
The essential elements of the RL approach (state space, ⎪−1, 𝑙 is outside pattern,
⎩
action space and reward function), the NN’s architecture,
and their training procedures are further elaborated in the The RL state vector is the concatenation of the two afore-
following paragraphs. mentioned vectors: 𝑆𝑡 = [𝑆𝑡1 , 𝑆𝑡2 ].
4 For a fair comparison with the optimization approach in Section 4,
we assume that queue sizes do not limit the performance of the system.
Action Space The link patterns generated in the CG ap-
However, queue limits can be easily added to the RL environments and the proach of Section 4.2 serve as actions among which the RL
agent can be trained accordingly. agent can select the one to perform in each slot. Each pattern

PRE_PRINT VERSION
𝑒 contains links as sub-actions and the number of bits to be Actor Network
transmitted in a single slot indicated by capacities 𝐵𝐵𝐻(𝑖,𝑗)

𝑒
Fully Connected Layer

Pre-processing Network
and 𝐵𝐴𝐶𝐶(𝑖,𝑢) for backhaul and access links, respectively.
𝑒
State LSTM Layer Action
...
...
Denoting the set of generated patterns as , the action at Prob.
Forget
Gate
step 𝑡 is 𝐴𝑡 ∈ .

In each slot (step) 𝑡, the agent selects a pattern 𝐴𝑡 accord-
Input
Gate
...
ing to the current policy 𝜋, thus the links within this pattern

are activated and enabled to transfer data. Whether or not ac-
Cell State
Hidden State
Output
Gate
tivated links can truly deliver bits finally depends on whether State
...
IAB nodes have enough bits buffered and on the presence of Value
random obstacles obstructing the links.

Critic Network
Reward Function Considering this work aims to maxi-

mize the total traffic volume downloaded by UEs in a frame, Figure 4: Neural network architecture for IAB scheduling.
an intuitive idea is to define the immediate reward as the the IAB network status in the next slot. Specifically, it con-
total number of bits UEs receive in each slot. However, tains a fully connected layer which extracts features from the
this design strongly biases the solution towards UEs directly input state vector. Then, the LSTM layer captures the regu-
connected to the IAB-donor. In such a multi-hop network larities in dynamics of IAB nodes’ queue lengths and block-
scenario, data received by UEs in a certain slot are the cu- age behaviors by constantly updating its hidden state and cell
mulative result of the bits moved through the wireless back- state shown as green and orange circulating flows in Figure
haul in previous slots, which doesn’t produce any immediate 4. LSTM’s hidden state and cell state are updated based on
transmission to UEs. Also, we need to normalize the reward forget gate, input gate and output gate, and differently oper-
to improve the agent’s learning, this means normalizing the ate5 . Specifically, the hidden state focuses more on recent
number of transferred bits. In short, we have to provide an experience, while the cell state stores relatively long-term
answer to the following questions: memory with the help of the forget gate to eliminate unessen-
• How could we precisely evaluate the current action tial memories and the input gate to filter in useful fresh infor-
based only on its immediate effect, while simultane- mation from hidden state and new input. LSTM layer out-
ously considering the cumulative effect of previous puts the hidden state to the A2C network, providing a con-
actions? cise prediction about the near-future state, which captures
the upcoming situation, even in case of sudden changes, so
• How could we relate throughput to reward, and also as to assist A2C in selecting the appropriate action.
maintain a normalized reward? Actor network and critic network are both composed of
• How could we avoid the bias on direct connections several fully connected layers to represent the policy and the
between IAB-donor and UEs, which provide a myopic value function. The actor network adopts a Softmax output
immediate advantage? for the action probability distribution, 𝜋(𝐴𝑡 |𝑆𝑡 ), from which
the next action to be executed is sampled. The critic network
Based on these considerations, the reward function at step 𝑡 utilizes a linear output for the estimated state-value function,
is defined as: 𝑣(𝑆𝑡 ). As we can see in Figure 4, the actor network’s output
∑ dimension is the cardinality of the action space, while the
𝑅𝑡 = 𝑡
𝐼𝑛,𝑢 , (15)
critic network has only one unit for the scalar state value.
(𝑛,𝑢)∈𝐴𝑡 ∶
𝑛∈𝑠𝑢𝑏 ,𝑢∈
5.3.3. Training Algorithm
where 𝑡
𝐼𝑛,𝑢 is a binary indicator of whether an access link The NN model presented is trained with momentum-
of IAB-node 𝑛 (IAB-donor excluded) is effective at 𝑡. It as- based methods and back-propagation through time, relying
sumes value 1 when link (𝑛, 𝑢) indeed delivers bits (i.e., there on the data collected from the interaction process involving
are bits in the transmitter’s queue and the link is not blocked), the state space, the action space, and the reward function
0 otherwise. Therefore, the immediate reward 𝑅𝑡 counts the defined in Section 5.3.1. The NN parameters consist in 𝜓
number of access links between IAB-nodes and UEs that ef- for the critic network to estimate state values, 𝜃 for the ac-
fectively transfer data. tor network to generate action probability and 𝜔 for the pre-
processing network.
5.3.2. Neural Network Architecture We can identify a critic part of the framework, which
The neural network architecture we have designed for the consists of the pre-processing and the critic NN, that aims
resource allocation problem in IAB networks is sketched in at minimizing the value loss, namely the mean squared error
Figure 4. It mainly consists of three key components: a pre- between the current and the estimated state value, formally
processing network, an actor network and a critic network. 5 We omit the details of the three gates due to limited space. Please,
The pre-processing network transforms the state vector refer to [10] for more details.
representing the status of the current slot into a prediction of

PRE_PRINT VERSION
expressed as: 𝐴𝑡 is selected according to the current policy 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔),
which is determined by the NN vectors 𝜃 and 𝜔 (Line 7).
min 𝐿𝑐 = (𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 ; 𝜓, 𝜔))2 , (16) Then, the set of blocked links 𝐿𝑏 is removed from 𝐴𝑡 (Line
𝜓,𝜔
9) to simulate the occurrence of blockages according to the
∑
𝑘−1
𝐺𝑘 (𝑆𝑡 ) = 𝛾 𝑖 𝑅𝑡+𝑖 + 𝛾 𝑘 𝑣(𝑆𝑡+𝑘 ), (17) model described in Equation (1). Finally, the data transfer is
𝑖=0 performed over unblocked links. (Lines 10-14). The num-
ber of transferred data bits from each transmitter’s buffer to
where the expected k-step return 𝐺𝑘 (𝑆𝑡 ) is computed at step each receiver’s buffer (Lines 12-13) is limited by:
𝑡 + 𝑘 based on 𝑘-step experience.
Considering the policy’s performance measure is the 1. the overall number of bits in transmitter 𝑖’s buffer at
long-term reward (critic), the goal of the actor part of the step 𝑡: 𝐵𝑖𝑡 (Line 11)
framework (pre-processing and actor NN) is to guide the pol- 2. the number of bits equally assigned by the node 𝑖’s
icy parameters 𝜃 and 𝜔 in the direction of ∇𝜃,𝜔 𝔼[𝐺𝑡 ] to de- transmitter to each of the unblocked parallel links,
𝐵𝑖𝑡
rive the best action policy 𝜋(𝑆𝑡 ; 𝜃, 𝜔). As explained in Sec- (𝑖, ⋅), in 𝐴𝑡 : |{(𝑖,⋅)}|
tion 5.1, a low-variance unbiased estimate of the policy gra- 3. the capacity 𝑐𝑖,𝑗 of link (𝑖, 𝑗) that limits to 𝑐𝑖,𝑗 ⋅ 𝛿 the
dient can be considered: ∇𝜃,𝜔 log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔)(𝐺𝑘 (𝑆𝑡 ) − bits transferred over link (𝑖, 𝑗) within a slot.
𝑣(𝑆𝑡 )). Moreover, in order to promote the action exploration,
thus preventing a premature convergence to sub-optimal de- Effective link data transfers determine the reward, which
terministic policies, the policy entropy 𝐻(𝜋(𝑆𝑡 ; 𝜃, 𝜔)) is is set to the number of IAB-nodes’ access links that can
included in the policy error minimization, which is com- transfer a non-null number of bits (Line 14). The next state
∑ 𝑆𝑡+1 is updated considering buffer-occupation vector 𝑆𝑡+1 1
puted as: − 𝐴𝑡 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔) log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔). In this
way, similarly to the value loss, the policy loss to be min- and link-status vector 𝑆𝑡+1 2 , which are filled according to
imized is defined as: Equation (11) and Equation (12) (Lines 15-16). As for the
link status, either the fully-observable case or the partial-
min 𝐿𝑎 = − log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔)(𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 )) observable case can be selected.
𝜃,𝜔
− 𝜂𝐻(𝜋(𝑆𝑡 ; 𝜃, 𝜔)). (18) After the data collection loop, the value of the expected
k-step return 𝐺𝑘 (𝑆𝑡 ) is computed. Its initial value is set in
Let 𝜅 denote the concatenation of 𝜓, 𝜃 and 𝜔, i.e., 𝜅 = Lines 18-19, then the gradients are computed in Lines 21-24
[𝜓, 𝜃, 𝜔]. The ultimate goal of the training process is to iter- based on Equations (19) and (20). Finally, NN parameters’
atively update 𝜅 to minimize the total loss function 𝐿 (19), vector 𝜅 is updated in Line 25. The above operations are
which is the sum of the policy loss and value loss: repeated until the learning phase ends, after 𝑇𝑚𝑎𝑥 steps.
This is an offline training procedure, as in most of the
𝐿 = − log 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜅)(𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 )) − 𝜂𝐻(𝜋(𝑆𝑡 ; 𝜅)) DRL applications, in which a deep NN, trained in a virtual
+ 𝛽(𝐺𝑘 (𝑆𝑡 ) − 𝑣(𝑆𝑡 ; 𝜅))2 , (19) but realistic environment, can be then used in real network
𝜅 =𝜅 + ∇𝜅 𝐿. (20) operations in a real environment. This training procedure re-
quires some computational effort, however it only needs to
The parameter vector 𝜅 updates via Equation (20). be run once to set the appropriate vector 𝜅, then the trained
The detailed learning procedure is described in Algo- NN can be used, with much less effort, to properly drive the
rithm 1, which includes flow routing and pattern schedul- pattern selection in real-time, usually implemented with ded-
ing aspects in order to compute the state transition and the icated neural engine hardware.
reward. The whole training period spans 𝑇𝑚𝑎𝑥 steps (slots),
while the NN model is updated every 𝑡𝑚𝑎𝑥 steps (slots). Con- 5.4. Online Learning
sidering the IAB network needs to undergo an initial tran- Since the offline training procedure described in the pre-
sient period to reach a realistic steady state (i.e., stationary vious section instructs NNs using a random environment, the
buffer levels and link behaviors), we introduce a warm-up RL agent can easily deal with the intermittent link availabil-
period 𝑡𝑤𝑎𝑟𝑚−𝑢𝑝 . ity according to its implicit statistic. However, if this statistic
After the parameter initialization of pre-processing, ac- is not fully stationary and some drastic change occurs in the
tor and critic networks (Line 1), the system warms up (Line distribution of link blockage probabilities, such pre-trained
3). During the warm-up, rewards are set to 0 to prevent a pol- NNs can incur in a performance degradation.
icy bias in favor of IAB-donor’s direct access transmissions. To tackle drastic changes, a more attractive solution is to
Before each model update, the pointer to the beginning of perform online learning, which means that NN training and
the train sequence 𝑡𝑠𝑡𝑎𝑟𝑡 is updated and the gradient 𝑑𝜅 is set testing are carried out in parallel. In particular, the most-
to 0 (Line 5). recently updated NN model interacts with the current envi-
Data for the main iteration update of gradients and pa- ronment, while in the meantime, the results of these inter-
rameters (Lines 4-25) are collected in a data collection loop actions are collected as training data to be used in the next
in Lines 6-17. The data collection loop focuses on an experi- NN model update. In this way, NNs can catch system dy-
ence of 𝑡𝑚𝑎𝑥 steps (stopping if the interaction reaches a termi- namics on-the-fly and adjust NN weights to adapt to the new
nal state). In each run of the data collection loop, the action environment.

PRE_PRINT VERSION
Algorithm 1 Learning Procedures on IAB Resource Allocation Assume IAB system pauses or RL model update takes 0 secs.
tmax
Parameters: total training steps 𝑇𝑚𝑎𝑥 , model update inter-
val steps 𝑡𝑚𝑎𝑥 , warm-up steps 𝑡𝑤𝑎𝑟𝑚−𝑢𝑝 . M0 M1 M2 M3 M4 ... Simulation
timeline
1: Initialize neural network weight vector 𝜅 = [𝜓, 𝜃, 𝜔];

Apply M0 to
2: Initialize step 𝑡 ← 1; IAB system
3: Warm up the IAB system within 𝑡𝑤𝑎𝑟𝑚−𝑢𝑝 ;

Practical
4: while 𝑡 < 𝑇𝑚𝑎𝑥 do M0 M0 M1 M1 M2 M2 M3 M3 ... solution
5: 𝑡𝑠𝑡𝑎𝑟𝑡 ← 𝑡; Get state 𝑆𝑡 ; Reset gradient 𝑑𝜅 ← 0; tmax
6: while 𝑡 − 𝑡𝑠𝑡𝑎𝑟𝑡 < 𝑡𝑚𝑎𝑥 and 𝑆𝑡 is not terminal do Update M0 to M1 25~62.5 ms < 1 ms
7: Select pattern 𝐴𝑡 based on 𝜋(𝐴𝑡 |𝑆𝑡 ; 𝜃, 𝜔); Apply M0 to IAB system (200~500 slots) (TPU/GPU)
8: 𝑅𝑡 ← 0; Figure 5: Ideal simulation timeline and feasible solution in

9: Eliminate blocked link set 𝐴𝑡 ← 𝐴𝑡 ⧵ 𝐿𝑏 ; practice.
10: for (𝑖, 𝑗) ∈ 𝐴𝑡 do
11: if 𝐵𝑖𝑡 > 0 then LSTM layer (i.e., the cell state and hidden state) and
𝐵𝑖𝑡
12: 𝐵𝑖𝑡 ← 𝐵𝑖𝑡 − min{ |{(𝑖,⋅)}| , 𝑐𝑖,𝑗 ⋅ 𝛿}; leave all the other parameters the same.
𝐵𝑡 • Reset-none: Make no changes to the entire NN model.
13: 𝐵𝑗𝑡 ← 𝐵𝑗𝑡 + min{ |{(𝑖,⋅)}|
𝑖
, 𝑐𝑖,𝑗 ⋅ 𝛿};
14: if 𝑖 ∈ 𝑠𝑢𝑏 , 𝑗 ∈  then 𝑅𝑡 ← 𝑅𝑡 + 1; In the simulation experiments, we compare the results ob-
15: Get 𝑆𝑡+1
1 , 𝑆 2 based on Eqs. (11) and (12); tained with these three strategies to understand which one
𝑡+1
1 , 𝑆 2 ]; performs best.
16: 𝑆𝑡+1 ← [𝑆𝑡+1 𝑡+1
17: 𝑡 ← 𝑡 + 1; 5.4.2. Dealing with implementation and feasibility
18: if 𝑆𝑡 is not terminal then 𝐺 ← 𝑣(𝑆𝑡 ; 𝜓, 𝜔); In the offline training of Algorithm 1, we select the best
19: else 𝐺 ← 0; actions according to the current NN model and apply them
20: 𝑖 ← 𝑡 − 1; to the IAB network in the virtual environment for 𝑡𝑚𝑎𝑥 steps
21: while 𝑖 ⩾ 𝑡𝑠𝑡𝑎𝑟𝑡 do (slots) to collect a new training data batch. Then, we use this
22: 𝐺 ← 𝑅𝑖 + 𝛾𝐺; data batch as new input for the NN whose parameters has to
23: Update 𝑑𝜅 ← 𝑑𝜅 + ∇𝜅 𝐿 based on Eq. (19); be updated. The updated NN model will be used in the next
24: 𝑖 ← 𝑖 − 1; round of 𝑡𝑚𝑎𝑥 steps to collect further data and the process
25: Update 𝜅 using 𝑑𝜅 based on Eq. (20); repeats until the end of the training is reached.
This approach is applicable only to an offline learning
Despite being a promising solution, the online training scheme, with no training interactions with the real IAB net-
of LSTM-based NNs presents two big challenges: 1) how to work. Indeed, a mere application of Algorithm 1 to an online
deal with the memory incorporated in the LSTM layer when learning paradigm would require either the update time of
the environment changes; 2) how such an online learning NN parameters to be negligible or the IAB network to pause
approach can be realistically implemented in real-time. and wait for the NN model to be updated, which are two ap-
proaches not always possible in practice. To address this is-
5.4.1. Dealing with LSTM memory sue of the online training, we propose the following scheme,
The LSTM layer of our framework is in charge of detect- sketched in Figure 5. We refer to an ideal simulation time-
ing potential regularities in the past link blockage behavior. line (drawn in red) in which the data collected during a batch
This is of fundamental importance in stationary conditions of 𝑡𝑚𝑎𝑥 interactions by using NN model 𝑀𝑥 are processed at
to select appropriate actions, however this memory can bias the end of the batch to immediately update the NN model and
NN training after a sudden environmental change. On the obtain the new model 𝑀𝑥+1 , which will be used in the next
opposite, the complete removal of the LSTM layer would batch. In the practical solution (drawn in blue), we introduce
negatively affect the performance as well, due to the lack of transient periods (shaded) during which the new NN param-
good predictions on the network status RL agent’s actions eters are computed and, in the meantime, the old model 𝑀𝑥
will have to face. Therefore, we have investigated on three is applied, but data are not collected. When a transient pe-
strategies in order to strike the balance between these two riod ends, the updated NN model 𝑀𝑥+1 is put to use and
opposite aspects and better adapt the online training to block- a new batch of 𝑡𝑚𝑎𝑥 interactions is collected. In doing so,
age behaviors. Whenever a radical change happens, which the IAB network can keep running while the model can be
can be identified by a sharp drop in the cumulative reward, trained as well.
we apply the following strategies: The impact of these transient periods on the convergence
• Reset-all: Reset the memory factors in the LSTM speed is limited. Indeed, the NN model update procedure
layer (i.e., the cell state and hidden state) and all the requires less than 100 ms on our general-purpose laptop.
parameters in the whole NN. Therefore, with the help of optimized coding and specialized
hardware, like TPUs and GPUs [32, 29], we can reasonably
• Reset-memory: Reset only the memory factors in the assume to have an improvement of a factor 100, which makes

PRE_PRINT VERSION
the update procedure last less than 1 ms, namely less than 8 The total time for the offline training spans 1.8e6 steps,
slots of a physical IAB frame. This is a very small time pe- which takes approximately 40 minutes on our Intel(R)
riod if compared with the data collection time 𝑡𝑚𝑎𝑥 (in our Xeon(R) CPU E5-2640 v4 @ 2.40GHz and 125GB RAM
experimental settings, 200 ∼ 500 steps/slots, corresponding machine. Since it takes some time to reach a stationary situa-
to 25 ∼ 62.5 ms). In this sense, the online-learning training tion where IAB-nodes’ queues have enough data to be trans-
period will not be significantly extended. ferred, we have considered a warm-up period of 400 slots,
empirically determined, during which rewards are set to 0.
This allows us to avoid biasing the solution on IAB-donor
6. Experimental Results access transmissions, as they can provide immediate initial
In this section, we evaluate the performance of our reward with respect to IAB-nodes’ ones which initially have
LSTM-assisted A2C-based resource allocation method empty queues.
through a simulation campaign. We first describe the con- The online approach is evaluated in a dynamic IAB net-
sidered network scenario and the NN model settings. Then, work scenario where not only a different number of links
the performance of offline and online schemes is analyzed. undergo random blockages, but also blockage dynamics is
totally changed several times. In particular, the results of the
6.1. IAB Network Scenario offline approach are compared with those of the online ap-
The instances we consider are the results of an internally- proach, demonstrating that the online training solution can
developed random instance generator compliant with 3GPP mostly outperform the offline model via automatically ad-
NR IAB simulation guidelines [2]. The playground consists justing NN weights in accordance with blockage dynamics.
of a 300𝑚 × 300𝑚 square with 1 IAB-donor, 4 IAB-nodes
and 30 UEs randomly deployed. The IAB-donor is placed at 6.3. Offline Approach Performance Analysis
a height of 25m and is equipped with a single 24x16 panel In this part, the LSTM-A2C NN models (referred to as
antenna array that can generate and process 4 simultaneous DRL in the following) are trained offline in scenarios char-
streams. IAB-nodes are placed at a height of 6m and are acterized by the same link blockage statistics as in the test-
equipped with 4-panel 8x6 antenna array with elements per ing scenarios to which the trained NN model is subsequently
panel, each able to create and process 1 stream. UEs are applied. DRL’s performance is compared against three alter-
equipped with omni-directional antennas and set to a height native resource allocation methods:
of 1.5m. The maximum transmission power is set to 32dBm
• CG-RND: where the link pattern to be activated in the
for the IAB-donor and 23dBm for each IAB-node panel.
current slot is randomly chosen among those available
We consider a 3GPP NR TR 38.901 channel model [1].
at the end of pricing process in CG Section 4;
IAB access and backhaul transmissions are carried out at 28
GHz with 400 MHz of bandwidth and NR Numerology #3 • CG-OPT: where the (quasi-)optimal frame is com-
(120 kHz subcarrier spacing). Each frame consists of 80 puted by using the whole CG approach, which con-
slots, each of which lasts 125 𝜇𝑠. We consider a single MCS siders ideal fully-reliable links.
for backhaul and access: 16 and 8, respectively6 . MCS 16 • Multi-Slot: a heuristic algorithm proposed in [26] to
corresponds to a SINR threshold of 5.60 dB and rate of 525.9 perform link scheduling, which coordinates the link
Mbps, while MCS 8 can achieve a rate of 121.4 Mbps with interference to construct sets of links (similar to the
a SINR threshold of -3.77 dB. idea of link patterns in our work) that satisfy SINR
conditions. According to the Multi-Slot algorithm, we
6.2. NN Settings generate patterns and then iteratively apply them in
In the experiments, the same NN settings are used in both sequential order slot by slot.
offline and online approaches. In particular, the NN model is
composed of 1 fully connected layer with 32 units, 1 LSTM All the values are based on the average of 10 instances ran-
layer with 64 hidden units, 8 fully connected layers with 32 domly generated.
units for both actor and critic networks. The output layers of All four methods are radically different and have differ-
actor and critic networks use Softmax and linear functions. ent complexity levels, which are difficult to precisely com-
Reward’s discount factor 𝛾 is set to 0.99, hence a long- pare. Therefore, we resort to their average solution time on
term reward is considered. Weights 𝜂 for policy entropy and our laptop to have an approximate idea. CG-RND solves
𝛽 for value loss in total loss function of Equation (19) are a sequence of integer and linear programming models and
set to 0.01 and 0.25. Learning rate and batch size for the stops when no objective function improvement can be ob-
learning process are 0.007 and 200, respectively. RMSProp tained. Each instance requires a different number of iter-
Optimizer is used to minimize the total loss so as to adjust ations, but we have experimentally checked that the im-
the NN parameters. provement is limited beyond the 200-th iteration. The en-
tire procedure lasts less than 1 minute. The optimization
6 Note that we have considered a single MCS for access and backhaul
approach of CG-OPT needs more time for the final inte-
only for sake of simplicity. Indeed, MCS selection is not in the charge of the
RL agent, whose actions are the different link patterns. MCS is implicitly ger solution compared with CG-RND, thus its total time in-
included in the definition of link patterns, which are automatically generated creases up to more than 10 minutes. Dealing with resource
in the CG procedure. allocation problems over integer resources, both CG-RND

PRE_PRINT VERSION
Table 3 generated patterns randomly at each slot, which implies pat-

Random blockage distribution settings for offline models. terns must be carefully designed; (2) the path routing and
Block. types Light Mid Severe scheduling also need to be optimized in addition to pat-
𝜇 5.58 7.37 7.88 terns, this is the reason why CG-OPT performs much bet-
1∕𝜆 5000 3500 2300 ter, in particular under light blockage conditions; (3) adap-
tive schemes are essential for a scheduling method to be ap-
plied in dynamic networks, as shown by the comparison with
Table 4
DRL; (4) Multi-Slot does not consider queue lengths of IAB-
Average per-UE data rates (Mbps).
nodes that is important to define a good pattern sequence in a
Block. types Light Mid Severe frame. Comparing CG-OPT, CG-RND and DRL, the order
CG-RND 25.756 22.052 16.110 of the throughput reduction is CG-OPT > DRL > CG-RND,
CG-OPT 38.588 26.843 17.598 which is due to the fact that the CG-OPT method, working
DRL 36.821 28.802 21.752 under ideal link behavior assumptions, is more severely af-
and CG-OPT are NP-hard. The Multi-Slot algorithm is a fected by blockages. By contrast, the CG-RND method, us-
greedy polynomial algorithm with a worst-case complexity ing all patterns with the same probability, is less influenced
of 𝑂(𝑛3 ), where 𝑛 is the number of links. It is very fast and by blockages as the inefficiencies in the multi-hop delivery
takes about 50 ms to provide a solution. All these methods tend to dominate over blockages, i.e., the limited number of
provide the final frame structure at the end of their execu- delivered bits is due more to the recurrent lack of bits in the
tion. Our DRL approach is based on a NN formed by totally transmission buffer when a link is activated than to a block-
about 1000 units in 20 layers. It needs about 40 minutes to age that occasionally prevents the link from transmitting.
be trained, which is done once for all, and it takes about 1 ms Figure 6(a) also shows a fundamental aspect of our ap-
to be applied to define each slot of the frame. Clearly, sim- proach: DRL delivers more bits than CG-OPT after the
ple approaches have a time advantage, but their performance blockage intensity increases over a certain level (e.g., set-
has strong limitations, which we discuss later on. tings for mid and severe blockages). Thanks to the adapt-
Three levels of link-blockage intensities are imposed on ability of RL to a dynamic environment, the DRL method,
IAB networks. They are defined by three sets of parameters assisted by LSTM, can learn the regularities in access links’
𝜇, 𝜎, 𝜆 of the distributions in Equation (1), whose values are availability during the training phase and select the ap-
shown in Table 3. We set 𝜎 to 0.5 ms and adjust 𝜇 to obtain propriate link-pattern sequence accordingly. The CG-OPT
ratios of average blocked duration to average non-blocked method, which provides the best choices in ideal conditions,
duration of respectively 0.06, 0.51, and 1.30. They corre- can still perform well when light blockages make the sce-
spond to increasing blockage intensities, hence we refer to nario quasi-ideal. However, as soon as the blockage inten-
them as light, mid and severe blockages in the rest of the ar- sity impairs this ideality, its performance quickly decreases.
ticle. Note that, since we have noticed a fast convergence of The effect on perceived UE throughput can also be evaluated
the DRL approach, we have accelerated the blockage model by the average data rates reported in Table 4.
of a factor 10, for both online and offline versions, to shorten Figure 6(b) provides an insight into how the data bits are
the simulation duration. delivered to UEs via the IAB network. The upper translucent
A first result is related to the two different state spaces bars represent the traffic volume percentage received by UEs
described in Section 5.3.1, which consider either the status from IAB-nodes through multi-hops, while the lower solid
of all access links (fully-observable) or only that of those bars report the complementary volume percentage directly
links activated in the currently selected pattern (partially- received from the IAB-donor. We can see that DRL and
observable). We performed all the experiments with both CG-OPT methods can more efficiently utilize the hops of the
alternatives and the results were very similar. Therefore, for wireless backhaul. This efficiency is almost independent of
the sake of brevity, we show in this article only the ones us- the blockage intensity, demonstrating that a smart resource
ing the partially-observable state, which refers to a more re- allocation is necessary to operate a multi-hop wireless back-
alistic approach. haul in any conditions. Indeed, the CG-RND and Multi-Slot
methods result in more bits directly sent by the IAB-donor,
6.3.1. IAB Network Traffic Delivery which reduce when the blockage intensity increases only be-
The performance of the four methods can be evaluated cause UEs can be reached by more than one IAB-node but
considering the average overall traffic volume (number of only one IAB-donor, and thus, IAB-node delivery is more
bits) delivered to all UEs in a frame, which corresponds to robust than IAB-donor’s one. Moreover, CG-RND delivers
the value of the objective function (2a). In Figure 6(a), the less bits directly from IAB-donor than Multi-Slot because
differences among the four methods over the three block- good patterns allow to better exploit the wireless backhaul.
age levels are illustrated. In general, the total traffic vol-
ume decreases as more intense blockages are introduced into 6.3.2. UE’s Quality of Service
the scenario. Multi-Slot has a remarkably low performance The cumulative distribution function (CDF) curves of
because (1) the quality of its generated patterns is lower achievable UEs’ data rates are shown in Figure 6(c). Since
than that of CG-RND, even if CG-RND schedules its CG- previous analysis has shown much worse performance of

PRE_PRINT VERSION
12
Light block. 1.0 Light block., IAB-donor
Total bits to UEs per frame (Mbit)

Mid block. Mid block., IAB-donor
10 Severe block.
Percentage of data transfer

Severe block., IAB-donor
0.8 Light block., IAB-nodes
8 Mid block., IAB-nodes
Severe block., IAB-nodes
0.6
6
0.4
4
2 0.2
0 0.0
DRL CG-RND CG-OPT Multi-Slot DRL CG-RND CG-OPT Multi-Slot
(a) Per-frame overall number of downloaded bits by all UEs. (b) Volume shares downloaded at IAB-donor or IAB-nodes.
1.0
Light block.
0.8
1.0 Mid block.
Severe block.
Percentage of UEs served

0.8
0.6 Light block., CG-OPT
CDF
Light block., CG-RND 0.6

Light block., DRL
0.4 Mid block., CG-OPT
Mid block., CG-RND 0.4
0.2
Mid block., DRL
Severe block., CG-OPT
Severe block., CG-RND 0.2
0.0 Severe block., DRL
0 20 40 60 80 100 120 0.0
UE data rate (Mbps) DRL CG-RND CG-OPT Multi-Slot
(c) Per-UE data rate CDF. (d) Percentage of served UE.
Figure 6: Performance of offline DRL approach considering different blockage intensities, compared with CG-RND, CG-OPT and
Multi-Slot.
Multi-Slot than that of DRL, we don’t further include Multi- with the same probability, although with a small through-
Slot in CDF curves to simplify the presentation of the results. put. 2) DRL can serve more users than CG-OPT in all three
As can be seen from the upper right corner of the figure, un- blockage situations, showing an advantage not only from a
der three cases of different blockage densities (i.e., light, mid throughput perspective but also in terms of coverage, when
and severe blockages), the maximum rates achieved by DRL the links’ availability is not close to the ideal.
and CG-OPT are near 121.4 Mbps (the maximum achiev-
able rate of MCS 8), while CG-RND can at most reach about 6.3.3. DRL success analysis
60 Mbps. This means that the problem faced is not triv- We analyze now the reasons of the success of DRL
ial and only a careful link-pattern selection can allow us to method with the help of Figure 7, where, for the sake of
reach good performance. In addition, DRL can learn how brevity, we consider only the case of severe blockages, but
to provide high throughput, even in case of blockages, to the same considerations apply to other cases as well. The
those users that are not directly affected by them. The av- heatmaps in the top row show the number of different access
erage per-UE data rates are reported in Table 4: while in links incident to each UE (horizontal axis) scheduled in the
light-blockage conditions the throughput provided by DRL whole testing period of each instance (vertical axis). The
is only 4.5% less than CG-OPT, DRL outperforms CG-OPT bottom row instead shows the percentage of slots in which
by more than 23% in case of severe blockages. the indicated UE appears as a receiver of a link in the acti-
The values of the CDFs at the origin indicate the percent- vated patterns. For each method, the order of instances’ IDs
age of users that cannot be served. This information is better and UEs’ IDs in every heatmap is computed according to the
described by Figure 6(d), which indicates the percentage of descending order of slot percentage.
UEs in the playground that receives a non-null throughput in Comparing the three heatmaps in the top row, we can
the different scenarios. The first and expected result is that see that CG-OPT method activates the smallest number of
as the blockages get denser, more users are excluded from different access links for each UE, as shown by its darkest
the service. Two interesting aspects further emerge: 1) CG- heatmap. DRL and CG-RND, instead, resort to more access
RND and Multi-Slot methods show the highest service per- links, which increase the probability to use alternatives when
centages, because they do not tend to select the best UEs to an access link is blocked. This increases the scheduling ro-
maximize the overall throughput, but rather to reach all UEs bustness. In the bottom row, the lightest heatmap shows that

PRE_PRINT VERSION
5 5 5
Transmission percent #(Ingress links)

8 6 4 2 0
8 6 4 2 0
8 6 4 2 0
4 4 4
Instances
Instances
Instances
3 3 3
2 2 2
1 1 1
0 0 0
1.0 1.0 1.0
8 6 4 2 0
8 6 4 2 0
8 6 4 2 0
0.8 0.8 0.8
Instances
Instances
Instances
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28
Users Users Users
(a) DRL. (b) CG-RND. (c) CG-OPT.
Figure 7: Comparison of the three methods with severe blockages.
11 11 11

10 10 10
9
9 9
8
8 8
7
7 7
6
Offline Offline Offline
6 6
5 Online Online Online
0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000
Time (s) Time (s) Time (s)
(a) Total Traffic Volume - Reset-all. (b) Total Traffic Volume - Reset-memory. (c) Total Traffic Volume - Reset-none.
Figure 8: Comparison among the three online schemes in Lo-Hi-Lo-blockage scenario.
the DRL’s access link diversity is obtained by selecting pat- Table 5

terns with more access links than CG-OPT and CG-RND, Distribution settings for blockage levels in the online frame-
this stresses again the idea that the best scheduling strategy work.
is to select patterns with redundant access links so that the Levels 1st 2nd 3rd 4th 5th
probability that at least one is effective is higher. 𝜇 5.58 6.83 7.37 7.66 7.88
1∕𝜆 5000 4250 3500 2900 2300
6.4. Online Model Performance Analysis
We conclude this section by testing the online training The three online strategies (reset-all, reset-memory, and
framework, where a NN model is continuously trained while reset-none) presented in Section 5.4 are tested in these two
being applied to an IAB network. Since the ability of our scenarios.
approach to adapt to random blockages and provide high In Figures 8 and 9, the results show the performance in
throughput has already been shown in the previous para- terms of total traffic volume delivered to UEs in a frame. The
graphs, we intend to evaluate here whether the online model horizontal axis indicates the timeline under the simplified,
keeps learning from the ongoing interactions with the IAB but reasonable, as shown in Section 5.4.2, assumption of
network and can automatically re-adapt on-the-fly to chang- negligible NN update time. In these figures, the blue curves
ing dynamics. All the results presented in this section are the represent the performance in the testing period of the pre-
averages over 10 random instances and we apply a moving trained offline model, while the red curves describe the sys-
average of 0.0625s to all reported plots. tem behaviors during online training. Vertical dashed lines
In the following tests, the blockage intensities are di- delimit the stages of the experiment by indicating sudden
vided into 5 levels from lightest to severest (from 1st to 5th changes of blockage intensities. Moreover, while the offline
level, correspondingly) which is defined by the parameters model is pre-trained over a scenario with the same statistics
in Table 5 referring to the blockage model in Equation (1). as those of the first stage, the online model starts from a ran-
As in the offline case, 𝜎 is still set to 0.5 ms for simplicity. dom initialization and converges to a stable performance by
The tests are conducted in two representative scenarios: the end of the first stage.
As a first observation, we can see that the more the stage
• Lo-Hi-Lo-blockage: where the experiment begins
blockage intensity differs from the one of the first stage, the
with the lightest (1st-level) blockage intensity and pro-
higher performance advantage the online model takes over
ceeds with 3rd, 5th, 4th, and 2nd-level blockage inten-
the offline one. This confirms that faced scenarios are in-
sities;
deed different and, ideally, different NN weights would be
• Hi-Lo-Hi-blockage: where the experiment begins required. Another interesting finding is that the online model
with the severest (5th-level) blockage intensity and has only a small lead over the offline model in the Lo-Hi-Lo-
proceeds with 3rd, 1st, 2nd, and 4th-level blockage blockage scenario, while in the Hi-Lo-Hi-blockage scenario,
intensities.

PRE_PRINT VERSION
11 11 11
Offline Offline Offline

10 Online 10 Online 10 Online
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000
Time (s) Time (s) Time (s)
(a) Total Traffic Volume - Reset-all. (b) Total Traffic Volume - Reset-memory. (c) Total Traffic Volume - Reset-none.
Figure 9: Comparison among the three online schemes in Hi-Lo-Hi-blockage scenario.
the online model largely outperforms the offline model. The Acknowledgment
reason is that frequent blockages of the first stage of the
Hi-Lo-Hi-blockage scenario implicitly set a limit to the of- This work was partially supported by China Scholarship
fline model’s policy, which leaves more rooms for the online Council (CSC) Grant No. 201806470077.
model to improve over when more link opportunities become
available in the next stages. References
This allows us to draw further conclusions. If we want [1] 3GPP, a. Study on channel model for frequencies from 0.5 to 100
to use an offline-training approach, we have to pay atten- GHz, TR 38.901.
tion to training it in an environment that is less affected [2] 3GPP, b. Study on integrated access and backhaul, TR 38.874.
by blockages than it can potentially be when severer con- [3] Capone, A., Carello, G., Filippini, I., Gualandi, S., Malucelli, F.,
2010a. Routing, scheduling and channel assignment in wireless mesh
ditions are met. However, the blockage intensity cannot
networks: optimization models and algorithms. Ad Hoc Networks 8,
be too light, otherwise blockage countermeasures cannot be 545–563.
learned. Vice versa, an online-training solution can always [4] Capone, A., Carello, G., Filippini, I., Gualandi, S., Malucelli, F.,
perform best. 2010b. Solving a resource allocation problem in wireless mesh net-
Finally, the comparison across three strategies (i.e., works: A comparison between a CP-based and a classical column
generation. Networks: An International Journal 55, 221–233.
reset-all, reset-memory and reset-none) indicates that the on-
[5] Capone, A., Filippini, I., Gualandi, S., Yuan, D., 2013. Resource opti-
line approach can potentially perform well. Considering the mization in multi-radio multi-channel wireless mesh networks, Wiley
longer convergence time at each stage of the "reset-all" strat- Online Library.
egy and the performance similar to the other two strategies, [6] Du, J., Onaran, E., Chizhik, D., Venkatesan, S., Valenzuela, R.A.,
we suggest to use the "reset-none" strategy as the best trade- 2017. Gbps user rates using mmwave relayed backhaul with high-
gain antennas. IEEE Journal on Selected Areas in Communications
off between performance and complexity.
35, 1363–1372.
[7] Feng, M., Mao, S., 2019. Dealing with limited backhaul capacity
in millimeter-wave systems: a deep reinforcement learning approach.
7. Conclusion IEEE Communications Magazine 57, 50–55.
In this paper, we have proposed a CG-based DRL ap- [8] García-Rois, J., Banirazi, R., González-Castaño, F.J., Lorenzo, B.,
proach for resource allocation in mmWave 5G IAB networks Burguillo, J.C., 2018. Delay-aware optimization framework for pro-
portional flow delay differentiation in millimeter-wave backhaul cellu-
able to face realistic link blockages. The results have shown lar networks. IEEE Transactions on Communications 66, 2037–2051.
that we can outperform the optimization approaches typ- [9] He, Z., Mao, S., Kompella, S., Swami, A., 2015. Minimum
ically used in wireless multi-hop networks, demonstrating time length scheduling under blockage and interference in multi-hop
that our approach can automatically adapt to environmental mmwave networks, in: 2015 IEEE Global Communications Confer-
changes. In addition, we have developed an online version ence (GLOBECOM), IEEE. pp. 1–7.
[10] Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory.
of our approach to increase its robustness in front of dra- Neural Computation 9, 1735–1780.
matically changing environments. Indeed, it can catch sys- [11] Hu, Q., Blough, D.M., 2017. Relay selection and scheduling for mil-
tem dynamics on-the-fly and adjust the training to adapt to limeter wave backhaul in urban environments, in: 2017 IEEE 14th
the change. We have also carried out an analysis of imple- International Conference on Mobile Ad Hoc and Sensor Systems
mentation and feasibility issues of our approach, which has (MASS), IEEE. pp. 206–214.
[12] Kilpi, J., Seppänen, K., Suihko, T., Paananen, J., Chen, D.T., Wainio,
proven how it can be practically implemented just relying on P., 2017. Link scheduling for mmwave WMN backhaul, in: 2017
realistic hardware requirements. IEEE International Conference on Communications (ICC), IEEE. pp.
Finally, we believe that our approach can be seen as one 1–7.
of the examples in which traditional optimization techniques [13] Lei, W., Ye, Y., Xiao, M., 2020. Deep reinforcement learning based
and recent RL approaches can positively coexist and provide spectrum allocation in integrated access and backhaul networks. IEEE
Transactions on Cognitive Communications and Networking .
remarkable advantages by synergically leveraging their re- [14] Li, R., Wang, C., Zhao, Z., Guo, R., Zhang, H., 2020. The lstm-based
spective strengths. advantage actor-critic learning for resource management in network
slicing with user mobility. IEEE Communications Letters .

PRE_PRINT VERSION
[15] Li, Y., Luo, J., Xu, W., Vucic, N., Pateromichelakis, E., Caire, G., 2018 IEEE Conference on Computer Communications (INFOCOM),
2017a. A joint scheduling and resource allocation scheme for mil- IEEE. pp. 1205–1213.
limeter wave heterogeneous networks, in: 2017 IEEE Wireless Com- [35] Zhang, B., Devoti, F., Filippini, I., 2020. RL-based resource allo-
munications and Networking Conference (WCNC), IEEE. pp. 1–6. cation in mmwave 5G IAB networks, in: 2020 IEEE Mediterranean
[16] Li, Y., Pateromichelakis, E., Vucic, N., Luo, J., Xu, W., Caire, G., Communication and Computer Networking Conference (MedCom-
2017b. Radio resource management considerations for 5G millimeter Net), IEEE. pp. 1–8.
wave backhaul and access networks. IEEE Communications Maga- [36] Zhuang, H., Chen, J., Wu, D.O., 2017. Joint access and backhaul
zine 55, 86–92. resource management for ultra-dense networks, in: 2017 IEEE Inter-
[17] Luo, J., Rosenberg, C., Girard, A., 2010. Engineering wireless mesh national Conference on Communications (ICC), IEEE. pp. 1–6.
networks: joint scheduling, routing, power control, and rate adapta-
tion. IEEE/ACM Transactions on Networking 18, 1387–1400.
[18] Ma, Z., Li, B., Yan, Z., Yang, M., 2020. Qos-oriented joint op- Bibo Zhang received the B.S. degree in infor-
timization of resource allocation and concurrent scheduling in 5G mation engineering and the M.S. degree in elec-
millimeter-wave network. Computer Networks 166, 106979. tronics and communication engineering from Bei-
[19] MacCartney, G.R., Rappaport, T.S., Rangan, S., 2017. Rapid fad- jing University of Posts and Telecommunications,
ing due to human blockage in pedestrian crowds at 5G millimeter- China, in 2015 and 2018. She is currently pur-
wave frequencies, in: 2017 IEEE Global Communications Confer- suing the Ph.D. degree in information technology,
ence (GLOBECOM), IEEE. pp. 1–7. from Politecnico di Milano, Italy. Her current re-
[20] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, search interests include resource management in
T., Silver, D., Kavukcuoglu, K., 2016. Asynchronous methods for 5G mmWave networks.
deep reinforcement learning, in: International Conference on Ma-
chine Learning (ICML), pp. 1928–1937.
[21] Naparstek, O., Cohen, K., 2018. Deep multi-user reinforcement learn- Francesco Devoti received the B.S., and M.S. de-
ing for distributed dynamic spectrum access. IEEE Transactions on grees in Telecommunication Engineering, and the
Wireless Communications 18, 310–323. Ph.D. degree in Information Technology from the
[22] Niu, Y., Ding, W., Wu, H., Li, Y., Chen, X., Ai, B., Zhong, Z., 2019. Politecnico di Milano, in 2013, 2016, and 2020 re-
Relay-assisted and QoS aware scheduling to overcome blockage in spectively. He is currently a research scientist at
mmwave backhaul networks. IEEE Transactions on Vehicular Tech- NEC Laboratories Europe. His research interests
nology 68, 1733–1744. include millimeter-wave technologies in 5G net-
[23] Niu, Y., Gao, C., Li, Y., Su, L., Jin, D., Zhu, Y., Wu, D.O., 2016. works.
Energy-efficient scheduling for mmwave backhauling of small cells
in heterogeneous cellular networks. IEEE Transactions on Vehicular
Technology 66, 2674–2687. Ilario Filippini received B.S. and M.S. degrees in
[24] Ortiz, A., Asadi, A., Sim, G.H., Steinmetzer, D., Hollick, M., 2019. Telecommunication Engineering and a Ph.D in In-
Scaros: A scalable and robust self-backhauling solution for highly formation Engineering from the Politecnico di Mi-
dynamic millimeter-wave networks. IEEE Journal on Selected Areas lano, in 2003, 2005, and 2009, respectively. He
in Communications . is currently an Associate Professor with the Dipar-
[25] Polese, M., Giordani, M., Roy, A., Castor, D., Zorzi, M., 2018. Dis- timento di Elettronica, Informazione e Bioingeg-
tributed path selection strategies for integrated access and backhaul neria, Politecnico di Milano. His research in-
at mmwaves, in: 2018 IEEE Global Communications Conference terests include planning, optimization, and game
(GLOBECOM), IEEE. pp. 1–7. theoretical approaches applied to wired and wire-
[26] Saad, M., Abdallah, S., 2019. On millimeter wave 5G backhaul link less networks, performance evaluation and re-
scheduling. IEEE Access 7, 76448–76457. source management in wireless access networks,
[27] Saha, C., Afshang, M., Dhillon, H.S., 2018. Bandwidth partition- and traffic management in software defined net-
ing and downlink analysis in millimeter wave integrated access and works. On these topics, he has published over 60
backhaul for 5G. IEEE Transactions on Wireless Communications peer-reviewed articles. He serves in the TPC of
17, 8195–8210. major conferences in networking and as an Asso-
[28] Sahoo, B., Yao, C.H., Wei, H.Y., 2017. Millimeter-wave multi-hop ciate Editor of IEEE Transactions on Mobile Com-
wireless backhauling for 5G cellular networks, in: 2017 IEEE 85th puting and Elsevier Computer Networks. He is an
Vehicular Technology Conference (VTC Spring), IEEE. pp. 1–5. IEEE Senior Member.
[29] Shi, S., Wang, Q., Xu, P., Chu, X., 2016. Benchmarking state-of-the-
art deep learning software tools, in: 2016 7th International Confer-
Danilo De Donno received the B.Sc. and M.Sc.
ence on Cloud Computing and Big Data (CCBD), IEEE. pp. 99–104.
degrees (cum laude) in telecommunication engi-
[30] Vu, T.K., Bennis, M., Debbah, M., Latva-Aho, M., Hong, C.S.,
neering from Politecnico di Milano, Italy, in 2005
2018a. Ultra-reliable communication in 5G mmwave networks: a
and 2008, respectively, and the Ph.D. degree in
risk-sensitive approach. IEEE Communications Letters 22, 708–711.
Information Engineering from the University of
[31] Vu, T.K., Liu, C.F., Bennis, M., Debbah, M., Latva-Aho, M., 2018b.
Salento, Lecce, Italy, in 2012. He was a Post-
Path selection and rate allocation in self-backhauled mmwave net-
Doctoral Fellow with the ElectroMagnetic Lab
works, in: 2018 IEEE Wireless Communications and Networking
Lecce (EML2) of the University of Salento from
Conference (WCNC), IEEE. pp. 1–6.
2012 to 2015 and a Post-Doc Researcher with the
[32] Wang, Y.E., Wei, G.Y., Brooks, D., 2019. Benchmarking TPU,
IMDEA Networks Institute, Madrid, Spain, from
GPU, and CPU platforms for deep learning. arXiv preprint
2015 to 2017. In July 2017, he joined the Huawei
arXiv:1907.10701 .
Research Center in Milan, Italy, as a Wireless Sys-
[33] Yang, G., Xiao, M., Al-Zubaidy, H., Huang, Y., Gross, J., 2018. Anal-
tem Engineer. His areas of interest lie in mm-Wave
ysis of millimeter-wave multi-hop networks with full-duplex buffered
communications, with main research focus on the
relays. IEEE/ACM Transactions on Networking (TON) 26, 576–590.
development of PHY and MAC algorithms for hy-
[34] Yuan, D., Lin, H.Y., Widmer, J., Hollick, M., 2018. Optimal
brid and full-digital system architectures.
joint routing and scheduling in millimeter-wave cellular networks, in:

2021 Comnet Iab-Rl

Uploaded by

Copyright:

Available Formats

2021 Comnet Iab-Rl

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2021 Comnet Iab-Rl

Uploaded by

Copyright:

Available Formats

PRE_PRINT VERSION

Resource Allocation in mmWave 5G IAB Networks: A Reinforcement

ARTICLE INFO Abstract

1. Introduction (IAB)[2]. The idea is to place simpler relay nodes, called

B: Zhang et al.: Preprint submitted to Elsevier Page 1 of 19

B: Zhang et al.: Preprint submitted to Elsevier Page 2 of 19

on the current blockage situation [22], which only aims to IAB-donor

maximize the instantaneous throughput of the current time IAB-node

mmWave IAB networks, with an emphasis on dealing with Pattern 4

B: Zhang et al.: Preprint submitted to Elsevier Page 3 of 19

and backhaul links. The optimal sequence of activated link Table 1

B: Zhang et al.: Preprint submitted to Elsevier Page 4 of 19

generated with. sequence of activated link patterns defines a total number of

B: Zhang et al.: Preprint submitted to Elsevier Page 5 of 19

Table 2 dual constraint is violated, the considered variable has a pos-

B: Zhang et al.: Preprint submitted to Elsevier Page 6 of 19

Alternatively, we can put a maximum number of CG itera-

B: Zhang et al.: Preprint submitted to Elsevier Page 7 of 19

B: Zhang et al.: Preprint submitted to Elsevier Page 8 of 19

5.1. Deep Reinforcement Learning and Recurrent IAB-donor

B: Zhang et al.: Preprint submitted to Elsevier Page 9 of 19

B: Zhang et al.: Preprint submitted to Elsevier Page 10 of 19

𝑒 contains links as sub-actions and the number of bits to be Actor Network

transmitted in a single slot indicated by capacities 𝐵𝐵𝐻(𝑖,𝑗)

Fully Connected Layer

Fully Connected Layer

Fully Connected Layer

Fully Connected Layer

Fully Connected Layer

random obstacles obstructing the links.

Reward Function Considering this work aims to maxi-

B: Zhang et al.: Preprint submitted to Elsevier Page 11 of 19

B: Zhang et al.: Preprint submitted to Elsevier Page 12 of 19

1: Initialize neural network weight vector 𝜅 = [𝜓, 𝜃, 𝜔];

3: Warm up the IAB system within 𝑡𝑤𝑎𝑟𝑚−𝑢𝑝 ;

8: 𝑅𝑡 ← 0; Figure 5: Ideal simulation timeline and feasible solution in

B: Zhang et al.: Preprint submitted to Elsevier Page 13 of 19

B: Zhang et al.: Preprint submitted to Elsevier Page 14 of 19

Table 3 generated patterns randomly at each slot, which implies pat-

B: Zhang et al.: Preprint submitted to Elsevier Page 15 of 19

Total bits to UEs per frame (Mbit)

Percentage of data transfer

Percentage of UEs served

Light block., CG-RND 0.6

B: Zhang et al.: Preprint submitted to Elsevier Page 16 of 19

Transmission percent #(Ingress links)

Transmission percent #(Ingress links)

Transmission percent #(Ingress links)

Total bits to UEs per frame (Mbit)

Total bits to UEs per frame (Mbit)

the DRL’s access link diversity is obtained by selecting pat- Table 5

B: Zhang et al.: Preprint submitted to Elsevier Page 17 of 19

Total bits to UEs per frame (Mbit)

Total bits to UEs per frame (Mbit)

B: Zhang et al.: Preprint submitted to Elsevier Page 18 of 19

B: Zhang et al.: Preprint submitted to Elsevier Page 19 of 19

You might also like