
2020 IEEE 31st Annual International Symposium on Personal, Indoor and Mobile Radio Communications: Track 2: Networking and MAC

Q-Learning based Link Adaptation in 5G


Shangbin Wu, Galini Tsoukaneri, Belkacem Mouhouche
Samsung R&D Institute UK, Communications House, Staines-upon-Thames, TW18 4QE, United Kingdom.
Email: {shangbin.wu, g.tsoukaneri, b.mouhouche}@samsung.com

Abstract—This paper proposes a novel method of constructing a Q-learning framework for link adaptation (LA) in the fifth generation (5G) mobile network. The state-action function is approximated via a neural network (NN). The state space relies on the hybrid automatic repeat request (HARQ) and channel state information (CSI) reports from the user equipment (UE). The output of the Q-learning based LA (QLA) approach consists of the assigned modulation and coding schemes (MCSs) and the number of layers, which are used to construct the action space. The reward is calculated based on the HARQ information and the transport block size (TBS). System level simulation in a typical indoor hotspot scenario has been performed, showing that the proposed QLA outperforms the ordinary LA approach in terms of user throughput.

Index Terms—Q-learning, link adaptation, neural network, 5G.

I. INTRODUCTION

The emerging demand for high-speed reliable communications with significantly improved user experience has been driving the development of the fifth generation (5G) wireless communication network [1]. Currently, 5G networks have introduced technologies such as massive multiple-input multiple-output (MIMO), millimeter wave (mmWave), and vehicle-to-everything (V2X) to cover an unprecedentedly wide range of application scenarios. This means that a 5G network should be able to deliver information in various channel conditions. Additionally, it can be expected that the application of artificial intelligence and machine learning [2] can further enhance the capability of 5G networks.

Link adaptation (LA) is used to ensure link throughput and reliability according to different channel conditions. LA relies on the channel measurement feedback and packet acknowledgement from a UE to meet a certain block error rate (BLER) and has played an important role in both the fourth generation (4G) and 5G mobile networks.

LA consists of adjusting the modulation order and code rate (MCS) and the number of MIMO layers, also known as the transmission rank. A higher MCS and transmission rank can potentially increase the data rate but at the same time increase the BLER. Ordinary LA (OLA) approaches such as [3] and [4] use filtering methods to decide the increment and decrement of MCSs based on the statistics of the HARQ information of a UE. These rely on many heuristically tuned parameters in the LA design process to optimize performance. The main benefits of OLA are simple implementation and smooth transitions between MCSs and transmission ranks. However, there are two main drawbacks in the OLA approach. First, heuristically tuned parameters can lead to suboptimal performance. Second, the number of these parameters can explode because the number of 5G multi-service scenarios becomes massive.

As a result, researchers have been studying how to introduce network intelligence, in order to simultaneously minimize the number of heuristically tuned parameters and provide optimal performance [5]. One consideration is reinforcement learning [6]. The authors in [7] considered the state-action-reward-state-action (SARSA) learning approach for LA. However, the number of layers, which plays a crucial role in a 5G MIMO system, has not been taken into account in the framework in [7]. Also, the state in [7] does not consider any past acknowledgement information, which can result in sudden fluctuations in instantaneous UE throughput. In a multi-cell mobile network, two architectures of reinforcement learning are widely considered. One is the single-agent method, which assumes that there exists a central agent which controls all the actions of the base stations. This architecture is close to optimum but it is less practical because the overhead is high when the central agent attempts to collect information from all base stations. The other architecture is the multi-agent method, i.e., there are multiple agents performing reinforcement learning individually. There may be shared information across all agents, which can still increase overhead.

This paper proposes a Q-learning framework for LA (QLA). The key contributions of this paper are two-fold.

1) The proposed framework combines the benefits of OLA and a reinforcement-learning based method. The proposed QLA considers past HARQ information and CSI feedback from the user in a state. This design takes the filtering element from OLA to guarantee smooth transitions between MCSs and transmission ranks. In the action space, the number of transmission ranks is also obtained in addition to the MCSs. Additionally, a neural network (NN) is applied to approximate the state-action function. The corresponding architecture of the QLA-based radio access network (RAN) is depicted in Fig. 1, where multiple QLA modules are equipped in the medium access control (MAC) layer. Each module can be assigned to a UE and controls the LA process.


Fig. 1. Architecture of the proposed QLA-based RAN.

2) The proposed QLA framework is a multi-agent architecture without shared information. Although this results in a moving target problem [8], where the behavior of each agent can impact the behavior of other agents, it is the architecture with the lowest overhead and the most feasible for real mobile networks.

The remainder of this paper is organized as follows. Section II outlines the system model and describes ordinary LA. This is followed by the proposed Q-learning based LA (QLA) in Section III. Simulation results and analysis are provided in Section IV. Conclusions are drawn in Section V.

II. SYSTEM MODEL AND ORDINARY LA

In this paper, an indoor network deployment is considered. The deployment is a 120 m × 50 m area with 12 transmission-reception points (TRPs), which is aligned with the 5G Indoor Hotspot-eMBB scenario described in [9]. In a practical system, the BS will decide the number of layers Nlayers and the MCS of each codeword for a packet to be transmitted to the UE. Another variable of the packet decided at the network side is the transport block size (TBS), which is usually a function of the MCSs. After receiving the data packet, the UE will attempt to decode it. If the packet is successfully decoded, the UE will feed back an acknowledgement (ACK) to the BS. Otherwise, a negative acknowledgement (NACK) will be sent by the UE.

Ordinary LA uses the UE HARQ information and CSI feedback to decide the assigned MCSs and Nlayers for the next transmission to the UE [3], [4]. The decision is related to a metric β updated as

    β_new = β_old + Δβ                                            (1)

where Δβ is a function of the HARQ information and the reported CSI. For example, Δβ can be represented as

    Δβ = Δβ(a_K, r_K)                                             (2)

where a_K ∈ {0, 1}^K is a vector recording the HARQ information and r_K ∈ {1, 2, 3, 4}^K is a vector recording the rank information of the past K transmissions. Then, the metric β is mapped to the MCSs and Nlayers to be assigned in the next transmission. If β is larger than a heuristic threshold, the MCS and/or Nlayers will be increased, whereas if β is smaller than a heuristic threshold, the MCS and/or Nlayers will be decreased. When K is small, the output MCSs and Nlayers are more closely tied to the instantaneous CSI report, causing larger fluctuations in user throughput. When K is large, the output MCSs and Nlayers rely more on past information, resulting in smoother but more conservative user throughput. Also, as mentioned before, the number of heuristic parameters such as K and the thresholds can grow massively when more scenarios need to be covered by the system.
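For illustration, a minimal sketch of such an OLA filter is given below in Python. The step sizes, thresholds, and the exact mapping from β to MCS/rank adjustments are not specified in the paper, so the values used here are purely illustrative assumptions.

# Minimal OLA sketch: beta is filtered from the last K ACK/NACK reports and then
# mapped to an MCS/rank adjustment via heuristic thresholds (all values illustrative).
def update_beta(beta, ack_history, step_up=0.1, step_down=1.0):
    """Outer-loop style update of (1): raise beta on ACKs, lower it on NACKs."""
    delta = sum(step_up if ack else -step_down for ack in ack_history)
    return beta + delta / max(len(ack_history), 1)

def map_beta_to_action(beta, mcs, n_layers, up_thr=0.5, down_thr=-0.5,
                       max_mcs=14, max_layers=4):
    """Increase MCS and/or Nlayers if beta exceeds up_thr, decrease if below down_thr."""
    if beta > up_thr:
        return min(mcs + 1, max_mcs), min(n_layers + 1, max_layers)
    if beta < down_thr:
        return max(mcs - 1, 0), max(n_layers - 1, 1)
    return mcs, n_layers

# Example with K = 4: HARQ history ACK, ACK, NACK, ACK
beta = update_beta(0.0, [1, 1, 0, 1])
mcs, n_layers = map_beta_to_action(beta, mcs=7, n_layers=2)

The point of the sketch is only to make the parameter count visible: every deployment scenario may need its own K, step sizes, and thresholds, which is exactly the drawback the QLA below is designed to remove.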
III. QLA

To overcome the drawbacks of OLA and introduce intelligence into the network, a reinforcement-learning based QLA is proposed.

A. Overview of reinforcement learning

Reinforcement learning (RL) is a machine learning technique that allows a learner to make decisions by observing (sampling) the environment. A sample from the environment is called a state. The decisions are actions that maximize the reward. It should be noted that the reward is the ultimate goal and not an intermediate result. The action taken in each step influences the potential actions in later steps.

Reinforcement learning is usually described via a Markov decision process (MDP) [6], which is characterized by four elements, E = (S, A, P, R). The state space S consists of all possible states of the environment. A state s ∈ S is the learner's perception of the environment. The action space A contains the potential actions to be taken by the learner. If an action a ∈ A is applied in state s, the hidden transfer function P : S × A × S → R transits the current state to another state according to a certain probability. The reward function R : S × A × S → R describes the reward obtained during the state transition. The learner attempts to learn a policy π, which maps a state s to an action. Depending on the policy type, a deterministic policy directly maps a state to an action, π : S → A, while a randomized policy maps a state to a vector of real numbers, π : S × A → R, each entry representing the probability of taking action a ∈ A in state s. In the policy learning process, the learner keeps updating the state-action value function (Q function) Qπ(s, a), which stores the estimated aggregated reward obtained under policy π.

When the model of the environment is accessible (model-based learning), i.e., the hidden transfer function P is known, the expected values of the Q function can be computed iteratively with dynamic programming.


According to [6], the optimal policy satisfies the optimal Bellman equation and can be found by iteratively selecting the action that maximizes the state-action value. The state-action values increase monotonically each time the policy is updated with the best action. Therefore, when the policy converges, it converges to the optimal policy.

However, in practice, it is usually difficult to obtain the model of the environment; namely, the hidden transfer function P is unknown. In this case, model-free learning can be applied. Model-free learning assumes no knowledge of the environment and relies on approximating the Q function by sampling the environment, states, and rewards. A widely used model-free learning method is the Monte Carlo (MC) method [6], where the value function and policies are updated only when an episode of samples is finished. Another model-free learning method is temporal-difference (TD) learning [6], where the value function and policies are updated in a step-by-step manner. A TD learning method known as Q-learning is used in this paper. The main characteristic of Q-learning is that the approximation of the Q function is independent of the policy being followed, which largely reduces complexity [6].

Fig. 2. Diagram of LA within a reinforcement learning framework.

B. QLA

As shown in Fig. 2, the decision and feedback processes in the network are a natural fit for the reinforcement learning framework. In this paper we consider the Q-learning algorithm for LA (QLA).

1) State space: A state s ∈ S is characterized by a 6-tuple, i.e.,

    s = (CQI_reported, RI_reported, RI_prev,                      (3)
         MCS1_prev, MCS2_prev, HARQ_prev)                         (4)

where CQI_reported and RI_reported are the most recent channel quality indicator (CQI) and rank indicator (RI) reported via CSI feedback, respectively, RI_prev consists of the assigned RIs in the past K transmissions, MCS1_prev and MCS2_prev are the two MCSs assigned to the two codewords in the previous transmission, and HARQ_prev consists of the HARQ information of the past K transmissions. If RI_prev equals 1, then MCS2_prev is NULL. Since the state involves past HARQ information, fluctuations of the user data rate can be reduced, as with the filtering method in (1).
temporal-difference (TD) learning [6] where the value
function and policies are updated in a step-by-step manner. where TBSprev is the assigned TBS of the previous
A TD learning method known as Q-learning is used in transmission. As the TBS is related to MCSs [10], the
this paper. The main characteristic of Q-learning is that reward is higher if the previous MCSs were higher and
the approximation of the Q function is independent of the the transmission was successful. If the transmission failed,
policy being followed, which largely reduces complexity the reward becomes zero.
[6]. 5) NN-based state-action value function: The state-
action function (Q function) Q(s, a) records the accumu-
B. QLA lated reward of taking action a in state s and is updated
As shown in Fig. 2, the decision and feedback processes incrementally as
in the network are a natural fit for the reinforcement Q(s, a) = Q(s, a) + α (r + γQ(s , a ) − Q(s, a)) (7)
learning framework. In this paper we considered the Q-
learning algorithm for LA (QLA). where s is the new state consisting of the HARQ infor-
1) State space: A state s ∈ S is characterized by a mation and CSI, a is the action according to the new
6-tuple, i.e., state and policy π (·), α is the learning rate and γ is the
 discount factor.
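A minimal sketch of the ε-greedy selection in (5), assuming the Q values of all actions in the current state are available as a vector (e.g., the output of the NN described later):

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick argmax_a Q(s, a) with Pr. 1 - epsilon (exploitation), a uniform action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # exploration: each action with Pr. eps/|A|
    return int(np.argmax(q_values))               # exploitation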
4) Reward: The reward is measured by how many bits are successfully delivered to the UE. The BS acquires the reward via the HARQ information fed back from the UE, and it can be expressed as

    r = { TBS_prev,    if ACK
        { 0,           if NACK                                    (6)

where TBS_prev is the assigned TBS of the previous transmission. As the TBS is related to the MCSs [10], the reward is higher if the previous MCSs were higher and the transmission was successful. If the transmission failed, the reward becomes zero.
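In code form, (6) is a one-liner; the TBS lookup from the assigned MCSs (per the tables in [10]) is assumed to be available elsewhere:

def compute_reward(ack, tbs_prev):
    """Reward of (6): the previously assigned TBS if the transmission was ACKed, else 0."""
    return tbs_prev if ack else 0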
5) NN-based state-action value function: The state-action function (Q function) Q(s, a) records the accumulated reward of taking action a in state s and is updated incrementally as

    Q(s, a) = Q(s, a) + α (r + γ Q(s', a') − Q(s, a))             (7)

where s' is the new state consisting of the HARQ information and CSI, a' is the action according to the new state and the policy π(·), α is the learning rate, and γ is the discount factor.

The proposed QLA starts with the BS choosing an action, i.e., deciding (Nlayers, MCS1, MCS2), according to π_ε(s). Then, the UE feeds back HARQ information and CSI reports to the BS, forming the new state s'. The BS computes the reward r according to the HARQ information and the TBS. Next, the BS finds the new action a' using the policy π(s'), i.e., a' = π(s'), and updates the state-action function according to (7). The policy is then updated by using the action with the highest expected reward, which can be presented as


    π(s) = argmax_u Q(s, u).                                      (8)

This procedure is repeated iteratively and the state-action function is updated accordingly. Pseudo code of the proposed QLA is presented in Fig. 3.

 1: Q(s, a) = 0, ∀s, a; π(s, a) = 1/|A(s)|, ∀a;
 2: for t = 1, 2, 3, . . . do
 3:     BS chooses action a according to π_ε(s);
 4:     BS assigns the MCSs and Nlayers corresponding to a and transmits data packets to the UE;
 5:     UE feeds back ACK/NACK and CSI to the BS, forming the new state s';
 6:     BS calculates reward r and computes a' = π(s');
 7:     BS updates the state-action function
            Q(s, a) = Q(s, a) + α (r + γ Q(s', a') − Q(s, a));
 8:     BS updates the policy π(s) = argmax_u Q(s, u);
 9:     s = s', a = a';
10: end for

Fig. 3. Pseudo code of QLA.
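To make the loop of Fig. 3 concrete, a minimal runnable sketch is given below. It uses a tabular Q function and a toy stand-in for the BS/UE round trip, whereas the paper approximates Q with a NN and obtains the feedback from real HARQ/CSI reports; transmit_and_get_feedback, its toy TBS/error models, and the toy state are illustrative assumptions only.

import random
from collections import defaultdict

NUM_MCS = 15
# Action space of Section III-B: (Nlayers, MCS1, MCS2); MCS2 is None for rank-1 actions.
ACTIONS = [(1, m, None) for m in range(NUM_MCS)] + \
          [(2, m1, m2) for m1 in range(NUM_MCS) for m2 in range(NUM_MCS)]

alpha, gamma, epsilon = 0.2, 0.8, 0.1        # QLA settings reported in Section IV
Q = defaultdict(float)                       # tabular stand-in for the NN-approximated Q

def greedy(state):
    """pi(s) = argmax_u Q(s, u), eq. (8)."""
    return max(range(len(ACTIONS)), key=lambda i: Q[(state, i)])

def transmit_and_get_feedback(action):
    """Toy BS/UE round trip: returns (ack, tbs_of_the_assignment, new_state)."""
    n_layers, mcs1, mcs2 = action
    tbs = 100 * (mcs1 + 1) + (0 if mcs2 is None else 100 * (mcs2 + 1))    # toy TBS model
    ack = random.random() < max(0.1, 1.0 - 0.05 * (mcs1 + 2 * n_layers))  # toy error model
    new_state = (random.randint(0, 15), n_layers, mcs1, int(ack))         # toy state
    return ack, tbs, new_state

state = (8, 1, 0, 1)                                                      # arbitrary start
for t in range(5000):                                                     # line 2 (one 5000-TTI episode)
    a = greedy(state) if random.random() > epsilon else random.randrange(len(ACTIONS))  # line 3
    ack, tbs, new_state = transmit_and_get_feedback(ACTIONS[a])           # lines 4-5
    r = tbs if ack else 0                                                 # line 6, reward of (6)
    a_next = greedy(new_state)                                            # line 6, a' = pi(s')
    Q[(state, a)] += alpha * (r + gamma * Q[(new_state, a_next)] - Q[(state, a)])  # line 7, eq. (7)
    state = new_state                                  # line 9; line 8's argmax is implicit in greedy()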

Since the state space and the action space are both massive, a NN is applied to learn the Q function. A three-layer NN with dense (linear) layers is constructed as shown in Fig. 4. The activation of each neuron is the rectified linear unit (ReLU) function [11]. The input of the NN is the state vector and the output of the NN is the Q function. Other settings of the NN graph are listed in Table I.

TABLE I
SETTINGS OF THE NN GRAPH

    Input layer neurons      128
    Output layer neurons     (Nlayers − 1) × 15 × 15 + 15
    Hidden layer neurons     twice the output layer neurons
    Optimizer                AdamOptimizer [12]

Fig. 4. NN graph of the state-action function.

During the training of the NN, a regular choice of objective function is the root mean squared error (RMSE). However, it should be pointed out that the RMSE is not suitable for this LA problem because the Q values of different state-action pairs (s, a) can differ significantly. For example, let (s_L, a_L) correspond to a single-layer low-MCS transmission and (s_H, a_H) correspond to a multi-layer high-MCS transmission. The training error in (s_L, a_L) can be in the order of 10^2 bits while that of (s_H, a_H) can be in the order of 10^4 bits. If both errors are equally weighted when computing the RMSE, the objective function will be highly biased towards multi-layer high-MCS actions. To overcome this problem, a relative metric can be used. As a result, the objective function considered in this paper is defined by the root mean squared logarithmic error (RMSLE), which considers the relative error instead of absolute error values. The RMSLE can be expressed as

    error = sqrt( (1/N) Σ_{i=1}^{N} ( log(Q_i^target + 1) − log(Q_i^predict + 1) )^2 )        (9)

where Q_i^predict is the ith element of the NN output vector and Q_i^target is the ith element of the target vector.
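For reference, a minimal sketch of a network matching the Table I dimensions is given below, written with TensorFlow/Keras as one possible realization. The paper only states the layer sizes, the ReLU activation [11], and AdamOptimizer [12]; the framework choice, the maximum rank assumed for the output size, and the use of Keras' built-in MSLE loss (the square of the RMSLE in (9), minimized at the same point) are assumptions.

import tensorflow as tf

n_layers_max = 2                              # assumed maximum rank for illustration
n_in = 128                                    # input layer neurons (Table I)
n_out = (n_layers_max - 1) * 15 * 15 + 15     # one Q value per action (Table I)
n_hidden = 2 * n_out                          # hidden layer: twice the output size (Table I)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_hidden, activation="relu", input_shape=(n_in,)),  # ReLU [11]
    tf.keras.layers.Dense(n_out, activation="relu"),   # non-negative Q estimates per action
])

# mean_squared_logarithmic_error is the square of (9); minimizing it also minimizes (9).
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.MeanSquaredLogarithmicError())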

IV. RESULTS AND ANALYSIS

System level simulations have been performed to study the proposed QLA method. Each TRP is equipped with 4/8 antennas and serves three full-buffer traffic UEs. Each UE has 2/4 antennas. There are 15 MCS levels (up to 256 quadrature amplitude modulation (QAM)). Downlink transmissions have up to rank 4 (depending on the number of UE antennas) and use transmission mode 4 (TM4) and transmission mode 10 (TM10) codebook-based closed-loop spatial multiplexing. The operating carrier frequency is 3.5 GHz. The settings of the QLA are as follows: the exploration probability ε is 0.1, the learning rate α is 0.2, and the discount factor γ is 0.8.

A. Intercell-interference-free scenario

To demonstrate Q-learning's adaptation process, we first consider the MCS adaptation process in an intercell-interference-free scenario. In this case, the throughput of the network is only bounded by the MCSs. Fig. 5 shows the MCS of the first codeword with QLA and OLA. It can be observed that the QLA requires approximately 2400 iterations to reach the peak MCS and stabilize. During the MCS adaptation process, exploration steps are taken, resulting in fluctuations in the MCS. This phenomenon is an intrinsic property of Q-learning and occurs even when the MCS reaches its peak. On the contrary, OLA converges to the peak much faster (approximately 35 iterations) than the QLA in an intercell-interference-free scenario, because its MCS increases monotonically.



Fig. 5. Comparison of MCS1 between QLA and OLA.

Fig. 7. RMSLE with respect to the number of iterations.


Fig. 6. Comparison of the number of layers between QLA and OLA.


Fig. 8. Normalized user throughput vs episode number.

However, the convergence of the QLA in terms of Nlayers is comparable to that of OLA, as shown in Fig. 6. Both approaches reach their peaks after approximately 50 iterations. These comparisons reveal that the drawback of the QLA is mainly its slow convergence in MCS. However, after convergence, the QLA has similar performance to OLA.

B. Intercell-interference-limited scenario

Next, an intercell-interference-limited scenario is considered. In this case, all intercell interference is enabled. The training convergence of the NN is depicted in Fig. 7. Training convergence can be observed after 500 iterations, reaching an RMSLE in the order of 10^−4. Although explorations cause fluctuations, the convergence trend continues to decrease as the number of iterations increases.

The learning process of the NN is shown in Fig. 8, which plots the normalized user throughput with respect to the episode number. Here, an episode means a 5000 transmission time interval (TTI) simulation run. The normalized value is the ratio of the user throughput to the user throughput of the first episode. The NN learns to increase the user throughput sharply in the first eighteen episodes, resulting in a 70% gain. Then, the learning process tends to stabilize, increasing by another 10% in the later 27 episodes.

The analysis of the behavior of the number of layers and the MCSs in the intercell-interference-limited scenario is less straightforward, as snapshots do not provide statistical meaning and the distributions of rank, HARQ, and MCS need to be considered jointly. Probability mass functions (PMFs) of the number of layers and histograms of MCSs in OLA and QLA are illustrated in Table II and Fig. 9, respectively. It can be observed that OLA operates with more rank 1 transmissions (e.g., 89%) and more ambitious MCSs such as MCS levels 11 and 12. On the contrary, QLA learns to use mainly rank 2 transmissions (e.g., 99%) and more conservative MCSs instead. These conservative MCSs, such as MCS levels 4 and 6 in QLA, result in a higher ACK percentage than OLA.


TABLE II
PMFS OF NUMBERS OF LAYERS IN OLA AND QLA

                         Rank 1    Rank 2
    Rank PMF    OLA        89%       11%
                QLA         1%       99%
    ACK PMF     OLA        57%       42%
                QLA        70%       85%

Fig. 10. User throughput comparison between QLA and OLA.

TABLE III
USER THROUGHPUT GAIN OF QLA OVER OLA

                            Nt = 8, Nr = 2    Nt = 8, Nr = 4
    5% throughput                  1%                3%
    Median throughput             13%               12%
    95% throughput                 5%                5%
    Mean throughput                6%                8%
Fig. 9. Histogram of MCSs in OLA and QLA.

The consequence of the QLA-based control of rank and MCSs is shown in Fig. 10, which illustrates cumulative distribution functions (CDFs) of user throughput under different antenna configurations and LA approaches. It can be seen that the QLA outperforms OLA by more than 10% at the median throughput, and by 5−7% at the mean and cell-center (95%) throughput. However, the gain at the cell-edge (5%) throughput is less significant. Detailed comparisons are also listed in Table III.

V. CONCLUSION

A QLA has been proposed in this paper. The state space is constructed using the HARQ information and reported CSI from a UE. Since past information has been included in a state, user throughput fluctuations can be reduced. The action space is constructed with the number of layers and the MCSs to be assigned to the UE in the next transmission.

In general, the QLA requires more iterations to converge than the OLA. However, in an intercell-interference-limited scenario, it has been demonstrated via system level simulations that the proposed QLA outperforms OLA in all key performance metrics. After analyzing the statistics of the MCS levels and the number of layers, it can be observed that the gain in throughput comes from a better balance between the MCS level and the number of layers. For future work, other factors such as physical resource block allocations and power control can be included in the state and action spaces, to improve not only user throughput but also power efficiency.

REFERENCES

[1] Samsung, "5G vision," White paper. [Online]. Available: http://www.samsung.com/global/business-images/insights/2015/Samsung-5G-Vision-0.pdf
[2] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, "Intelligent 5G: when cellular networks meet artificial intelligence," IEEE Wireless Commun., vol. 24, no. 5, pp. 175-183, Oct. 2017.
[3] A. Duran, M. Toril, F. Ruiz, and A. Mendo, "Self-optimization algorithm for outer loop link adaptation in LTE," IEEE Commun. Lett., vol. 19, no. 11, pp. 2005-2008, Nov. 2015.
[4] F. Blanquez-Casado, G. Gomez, M. C. Aguayo-Torres, and J. T. Entrambasaguas, "eOLLA: an enhanced outer loop link adaptation for cellular networks," EURASIP J. Wireless Commun. Netw., vol. 2016, no. 20, pp. 1-16, Jan. 2016.
[5] H. Kim, Y. Jiang, R. B. Rana, S. Kannan, S. Oh, and P. Viswanath, "Communication algorithms via deep learning," in Proc. ICLR'18, Vancouver, Canada, Apr. 2018, pp. 1-19.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, Cambridge, MA, 2018.
[7] R. Karmakar, S. Chattopadhyay, and S. Chakraborty, "SmartLA: reinforcement learning-based link adaptation for high throughput wireless access networks," Elsevier Computer Commun., vol. 110, pp. 1-25, May 2017.
[8] L. Kraemer and B. Banerjee, "Multi-agent reinforcement learning as a rehearsal for decentralized planning," Neurocomputing, vol. 190, pp. 82-94, May 2016.
[9] ITU-R M.2412, "Guidelines for evaluation of radio interface technologies for IMT-2020," International Telecommunication Union, Oct. 2017.
[10] 3GPP TS 38.214, "Physical layer procedures for data," V15.6.0, June 2019.
[11] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Machine Learning Research, Fort Lauderdale, FL, USA, Apr. 2011, pp. 315-323.
[12] D. P. Kingma and J. L. Ba, "Adam: a method for stochastic optimization," in Proc. ICLR'15, San Diego, CA, USA, May 2015, pp. 1-15.

