Abstract—As a key technique for enabling artificial intelligence, machine learning (ML) is capable of solving complex problems without explicit programming. Motivated by its successful applications to many practical tasks like image recognition, both industry and the research community have advocated the applications of ML in wireless communication. This paper comprehensively surveys the recent advances of the applications of ML in wireless communication, which are classified as: resource management in the MAC layer, networking and mobility management in the network layer, and localization in the application layer. The applications in resource management further include power control, spectrum management, backhaul management, cache management, beamformer design and computation resource management, while ML based networking focuses on the applications in clustering, base station switching control, user association and routing. Moreover, the literature in each aspect is organized according to the adopted ML techniques. In addition, several conditions for applying ML to wireless communication are identified to help readers decide whether to use ML and which kind of ML techniques to use. Traditional approaches are also summarized together with their performance comparison with ML based approaches, based on which the motivations of the surveyed works to adopt ML are clarified. Given the extensiveness of the research area, challenges and unresolved issues are presented to facilitate future studies. Specifically, ML based network slicing, infrastructure updates to support ML based paradigms, open data sets and platforms for researchers, theoretical guidance for ML implementation and so on are discussed.

Index Terms—Wireless network, machine learning, resource management, networking, mobility management, localization.

I. INTRODUCTION

Since the rollout of the first generation wireless communication system, wireless technology has been continuously evolving from supporting basic coverage to satisfying more advanced needs [1]. In particular, the fifth generation (5G) mobile communication system is expected to achieve a considerable increase in data rates, coverage and the number of connected devices, with latency and energy consumption significantly reduced [2]–[5]. Moreover, 5G is also expected to provide more accurate localization, especially in indoor environments [6].

These goals can be potentially met by enhancing the system from different aspects. For example, computing and caching resources can be deployed at the network edge to fulfill the demands for low latency and reduce energy consumption [7], [8]. The cloud computing based BBU pool can provide high data rates with the use of large-scale collaborative signal processing among BSs [9], [10] and can save much energy via statistical multiplexing [11]. Furthermore, the co-existence of heterogeneous nodes, including macro BSs (MBSs), small base stations (SBSs) and user equipments (UEs) with device-to-device (D2D) capability, can boost the throughput and simultaneously guarantee seamless coverage [12]. However, the involvement of computing resources, cache resources and heterogeneous nodes cannot alone satisfy the stringent requirements of 5G. The algorithmic design enhancement for resource management, networking, mobility management and localization is essential as well. Faced with the characteristics of 5G, current resource management, networking, mobility management and localization algorithms expose several limitations.

First, with the proliferation of smart phones, the expansion of network scale and the diversification of services in the 5G era, the amount of data related to applications, users and networks will experience explosive growth, which can contribute to improved system performance if properly utilized [13]. However, many of the existing algorithms are incapable of processing and/or utilizing these data, meaning that much valuable information or many patterns are wasted. Second, to adapt to the dynamic network environment, algorithms like radio resource management (RRM) algorithms are often fast but heuristic. Since the resulting system performance can be far from optimal, these algorithms can hardly meet the performance requirements of 5G. To obtain better performance, research has been done based on optimization theory to develop more effective algorithms that reach optimal or suboptimal solutions. However, many studies assume a static network environment. Considering that 5G networks will be more complex, hence leading to more complex mathematical formulations, the developed algorithms can possess high complexity. Thus, these algorithms will be inapplicable in a real dynamic network, due to their long decision-making time. Third, given the large number of nodes in future 5G networks, traditional centralized algorithms for network management can be infeasible due to the high computing burden and high cost to collect global information.

Yaohua Sun (e-mail: [email protected]), Mugen Peng (e-mail: [email protected]), Yangcheng Zhou (e-mail: [email protected]), and Yuzhe Huang (e-mail: [email protected]) are with the State Key Laboratory of Networking and Switching Technology (SKL-NST), Beijing University of Posts and Telecommunications, Beijing, China. Shiwen Mao ([email protected]) is with the Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849-5201, USA. (Corresponding author: Mugen Peng)

This work is supported in part by the State Major Science and Technology Special Project under 2017ZX03001025-006 and 2018ZX03001023-005, the National Natural Science Foundation of China under No. 61831002 and No. 61728101, the National Program for Special Support of Eminent Professionals, and the US National Science Foundation under grant CNS-1702957.
localization. The principles of different learning techniques are introduced, and useful guidelines are provided. To facilitate future applications of machine learning, challenges and open issues are identified. Overall, this survey aims to fill the gaps found in the previous papers [19]–[33], and our contributions are threefold:

1) Popular machine learning techniques utilized in wireless networks are comprehensively summarized, including their basic principles and general applications, which are
TABLE I
SUMMARY OF ABBREVIATIONS
by involving a weight for each neighbor that is proportional to the inverse of its distance to the test point. In addition, one of the keys to applying KNN is the tradeoff of the parameter K, and the value selection can be referred to [35].

Fig. 2. An illustration of the KNN model.

B. Unsupervised Learning

Unsupervised learning is a machine learning task that aims to learn a function to describe a hidden structure from unlabeled data. In the surveyed works, the following unsupervised learning techniques are utilized.

1) K-Means Clustering Algorithm: In K-means clustering, the aim is to partition n data points into K clusters, and each data point belongs to the cluster with the nearest mean. The most common version of the K-means algorithm is based on iterative refinement. At the beginning, K means are randomly initialized. Then, in each iteration, each data point is assigned to exactly one cluster, whose mean has the least Euclidean distance to the data point, and the mean of each cluster is updated. The algorithm continuously iterates until the members in each cluster do not change. The basic principle is illustrated in Fig. 3.

2) Principal Component Analysis: Principal component analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that
Fig. 3. An illustration of the K-means model.
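To make the assign-then-update loop described above concrete, the following minimal sketch (illustrative only; the variable names, the random initialization and the convergence test are our own choices, and NumPy is assumed) implements the iterative refinement of K-means:

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Minimal K-means: assign each point to the nearest mean, then update the means."""
    rng = np.random.default_rng(seed)
    # Randomly initialize the K means by picking K distinct data points.
    means = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignment = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # memberships no longer change, so the algorithm stops
        assignment = new_assignment
        # Update step: recompute the mean of every non-empty cluster.
        for j in range(k):
            if np.any(assignment == j):
                means[j] = points[assignment == j].mean(axis=0)
    return means, assignment

# Example: partition 200 random 2-D points into K = 3 clusters.
pts = np.random.default_rng(1).random((200, 2))
centers, labels = k_means(pts, k=3)
```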
Fig. 7. The process of joint utility and strategy estimation based learning.

Fig. 8. The main components and process of the basic DRL.

4) Joint Utility and Strategy Estimation Based Learning: In this algorithm, shown in Fig. 7, each agent holds an estimation of the expected utility, whose update is based on the immediate reward. The probability to select each action, named the strategy, is updated in the same iteration based on the utility estimation [42]. The main benefit of this algorithm lies in that it is fully distributed when the reward can be directly calculated locally, as, for example, the data rate between a transmitter and its paired receiver. Based on this algorithm, one can further estimate the regret of each action based on the utility estimations and the received immediate reward, and then update the strategy using the regret estimations. In the surveyed works, this algorithm is often connected with equilibrium concepts in game theory like Logit equilibrium and coarse correlated equilibrium.
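As a concrete illustration of the update just described, the sketch below shows one agent's joint utility and strategy estimation loop. It is a simplified formulation of our own: the learning rate, the logit (Boltzmann) mapping from utility estimates to a strategy, and the toy reward function are assumptions rather than the exact rules of [42].

```python
import numpy as np

def logit_strategy(utility_est, temperature=1.0):
    """Map utility estimates to action probabilities via a logit/Boltzmann rule."""
    z = utility_est / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_agent(reward_fn, num_actions, num_iters=500, lr=0.1, seed=0):
    """Fully distributed learning: only the agent's own instantaneous reward is needed."""
    rng = np.random.default_rng(seed)
    utility_est = np.zeros(num_actions)            # estimated expected utility per action
    strategy = np.full(num_actions, 1.0 / num_actions)
    for _ in range(num_iters):
        a = rng.choice(num_actions, p=strategy)    # sample an action from the current strategy
        r = reward_fn(a)                           # e.g., the pair's achieved data rate
        utility_est[a] += lr * (r - utility_est[a])  # update the utility estimation
        strategy = logit_strategy(utility_est)     # update the strategy in the same iteration
    return strategy, utility_est

# Toy example: action 3 yields the highest (noisy) reward, so its probability grows.
strategy, estimates = run_agent(lambda a: a + np.random.default_rng().normal(0, 0.1),
                                num_actions=4)
```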
Fig. 9. The architecture of a DNN.

5) Deep Reinforcement Learning: In [43], the authors propose to use a deep NN, called a deep Q network (DQN), to approximate optimal Q values, which allows the agent to learn from high-dimensional sensory data directly, and reinforcement learning based on a DQN is known as deep reinforcement learning (DRL). Specifically, state transition samples generated by interacting with the environment are stored in the replay memory and sampled to train the DQN, and a target DQN is adopted to generate target values, which both help stabilize the training procedure of DRL. Recently, some enhancements to DRL have come out, and readers can refer to [44] for more details. The main components and working process of the basic DRL are illustrated in Fig. 8.
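The following sketch makes the two stabilizing ingredients mentioned above concrete: sampling transitions from a replay memory and computing targets with a periodically synchronized target network. It is illustrative only; a simple linear Q-function stands in for the deep NN so that the gradient step can be written explicitly, and all dimensions and hyper-parameters are assumptions.

```python
import random
from collections import deque
import numpy as np

STATE_DIM, NUM_ACTIONS = 8, 4
GAMMA, LR, BATCH, SYNC_EVERY = 0.95, 0.01, 32, 100

# Linear Q-function: Q(s, a) = w[a] . s  (a stand-in for the DQN).
w = np.zeros((NUM_ACTIONS, STATE_DIM))
w_target = w.copy()                        # target-network weights
replay = deque(maxlen=10_000)              # replay memory of (s, a, r, s') transitions

# Transitions are appended to `replay` while the agent interacts with the environment.

def train_step(step):
    global w_target
    if len(replay) < BATCH:
        return
    for s, a, r, s_next in random.sample(list(replay), BATCH):   # sample stored transitions
        # Target value from the target network: r + gamma * max_a' Q_target(s', a').
        y = r + GAMMA * np.max(w_target @ s_next)
        td_error = y - w[a] @ s
        w[a] += LR * td_error * s          # gradient step on the squared TD error
    if step % SYNC_EVERY == 0:
        w_target = w.copy()                # periodic update of the target network
```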
Fig. 11. The architecture of a CNN.

In deep learning and reinforcement learning, knowledge can be represented by weights and Q values, respectively. Specifically, when deep learning is adopted for image recognition, one can use the weights that have been well trained for another image recognition task as the initial weights, which can help achieve a satisfactory performance with a small training set. For reinforcement learning, Q values learned by an agent in a former environment can be involved in the Q value update in a new but similar environment to make a wiser decision at the initial stage of learning. Specifications on integrating transfer learning with reinforcement learning can be referred to [47]. However, when transfer learning is utilized, the negative impact of former knowledge on the performance should be carefully handled, since there still exist some differences between tasks or environments.
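One minimal way to realize this kind of knowledge transfer in tabular Q learning is sketched below. The blending rule and its decaying weight are our own illustrative assumptions, not the exact mechanism of [47]: the Q table learned in the former environment is mixed into the evaluation of the current state-action pair, and its influence fades over time so that residual task differences do not dominate.

```python
import numpy as np

def transfer_q_update(q_new, q_old, s, a, r, s_next,
                      step, alpha=0.1, gamma=0.9, decay=0.995):
    """One Q update in the new environment, biased by Q values from a former environment."""
    transfer_weight = decay ** step              # influence of old knowledge fades with time
    # Blend old and new knowledge when evaluating the current state-action pair.
    q_blend = (1 - transfer_weight) * q_new[s, a] + transfer_weight * q_old[s, a]
    target = r + gamma * np.max(q_new[s_next])   # standard one-step target in the new task
    q_new[s, a] = q_blend + alpha * (target - q_blend)
    return q_new
```

At the initial learning stage the old Q table steers the evaluation; as the transfer weight decays, decisions rely increasingly on experience gathered in the new environment, which is one way to limit the negative impact of prior knowledge noted above.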
F. Some Implicit Assumptions
TABLE II
THE APPLICATIONS OF MACHINE LEARNING METHODS SURVEYED IN THIS PAPER

Extreme-Learning Machine: [78], [128]
Dense Neural Network: [15], [109], [111], [129]
Convolutional Neural Network: [55], [81], [110]
Recurrent Neural Network: [63], [64], [79], [80]
Transfer Learning: [57], [82], [77], [18], [78], [100], [101], [103]
Collaborative Filtering: [90]
Other Reinforcement Learning Techniques: [60], [87]
Other Supervised/Unsupervised Techniques: [119], [120], [121], [126]
Faced with the above issues, machine learning techniques including model-free reinforcement learning and NNs can be employed. Specifically, reinforcement learning can learn a good resource management policy based on only the reward/cost fed back by the environment, and quick decisions can be made for a dynamic network once a policy is learned. In addition, owing to the superior approximation capabilities of deep NNs, some high-complexity resource management algorithms can be approximated, and similar network performance can be achieved but with much lower complexity. Moreover, NNs can be utilized to learn the content popularity, which helps fully make use of limited cache resources, and distributed Q-learning can endow each node with autonomous decision capability for resource allocation. In the following, the applications of machine learning in power control, spectrum management, backhaul management, beamformer design, computation resource management and cache management will be introduced.

A. Machine Learning Based Power Control

In the spectrum sharing scenario, effective power control can reduce inter-user interference, and hence increase system throughput. In the following, reinforcement, supervised, and transfer learning based power control are elaborated.

1) Reinforcement Learning Based Approaches: In [48], authors focus on inter-cell interference coordination (ICIC) based on Q learning with Pico BSs (PBSs) and a macro BS (MBS) as the learning agents. In the case of time-domain ICIC, the action performed by each PBS is to select the bias value for cell range expansion and the transmit power on each resource block (RB), and the action of the MBS is only to choose the transmit power. The state of each agent is defined by a tuple of variables, each of which is related to the SINR condition of each UE, while the received cost of each agent is defined to meet the total transmit power constraint and make the SINR of each served UE approach a target value. In each iteration of the algorithm, each PBS first selects an action leading to the smallest Q value for the current state, and then the MBS selects its action in the same way. In the case of frequency-domain ICIC, the only difference lies in the action definition of Q learning. Utilizing Q learning, the Pico and Macro tiers can autonomously optimize system performance with little coordination. In [49], authors use Q learning to optimize the transmit power of SBSs in order to reduce the interference on each RB. With learning capabilities, each SBS does not need to acquire the strategies of other players explicitly. Instead, the experience is preserved in the Q values during the interaction with other SBSs. To apply Q learning, the state of each SBS is represented as a binary variable that indicates whether the QoS requirement is violated, and the action is the selection of power levels. When the QoS requirement is met, the reward is defined as the achieved instantaneous rate, and it equals zero otherwise. The simulation result demonstrates that Q-learning can increase the long-term expected data rate of SBSs.

Another important scenario requiring power control is the CR scenario. In [50], distributed Q learning is conducted to manage the aggregated interference generated by multiple CRs at the receivers of primary (licensed) users, and the secondary BSs are taken as learning agents. The state set defined for each agent is composed of a binary variable indicating whether the secondary system generates excess interference to primary receivers, the approximate distance between the secondary user (SU) and the protection contour, and the transmission power corresponding to the current SU. The action set of the secondary BS is the set of power levels that can be adopted, with the cost function designed to limit the interference at the primary receivers. Taking into account that the agent cannot always obtain an accurate observation of the interference indicator, the authors then discuss two cases, namely complete information and partial information, and handle the latter by involving belief states in Q learning. Moreover, two different ways of Q value representation are discussed, utilizing look-up tables and neural networks, and the memory as well as computation overheads are also examined. Simulations show that the proposed scheme can allow agents to learn a series of optimization policies that will keep the aggregated interference under a desired value.

In addition, some research utilizes reinforcement learning to achieve the equilibrium state of wireless networks, where power control problems are modeled as non-cooperative games among multiple nodes. In [51], the power control of femto BSs (FBSs) is conducted to mitigate the cross-tier interference to macrocell UEs (MUEs). Specifically, the power control and carrier selection is modeled as a normal form game among FBSs in mixed strategies, and a reinforcement learning algorithm based on joint utility and strategy estimation is proposed to help FBSs reach Logit equilibrium. In each iteration of the algorithm, each FBS first selects an action according to its current strategy, and then receives a reward which equals its data rate if the QoS of the MUE is met and equals zero otherwise. Based on the received reward, each FBS updates its utility estimation and strategy by the process proposed in [42]. By numerical simulations, it is demonstrated that the proposed algorithm can achieve a satisfactory performance while each FBS does not need to know any information of the game but its own reward. It is also shown that taking identical utilities for FBSs benefits the whole system performance.

In [52], authors model the channel and power level selection of D2D pairs in a heterogeneous cellular network as a stochastic non-cooperative game. The utility of each pair is defined by considering the SINR constraint and the difference between its achieved data rate and the cost of power consumption. To avoid the considerable amount of information exchange among pairs incurred by conventional multi-agent Q learning, an autonomous Q learning algorithm is developed based on the estimation of pairs' beliefs about the strategies of all the other pairs. Finally, simulation results indicate that the proposal possesses a relatively fast convergence rate and can achieve near-optimal performance.

2) Supervised Learning Based Approaches: Considering the high complexity of traditional optimization based resource allocation algorithms, authors in [15], [55] propose to utilize deep NNs to develop power allocation algorithms that can achieve real-time processing. Specifically, different from the
traditional ways of approximating iterative algorithms where each iteration is approximated by a single layer of the NN, authors in [15] adopt a generic dense neural network to approximate the classic WMMSE algorithm for power control in a scenario with multiple transceiver pairs. Notably, the number of layers and the number of ReLUs and binary units that are needed for achieving a given approximation error are rigorously analyzed, from which it can be concluded that the approximation error has only a small impact on the size of the deep NN. As for NN training, the training data set is generated by running the WMMSE algorithm under varying channel realizations, and the channel realizations together with the corresponding power allocation results output by the WMMSE algorithm constitute the labeled data. Via simulations, it is demonstrated that the adopted fully connected NN can achieve similar performance but with much lower computation time compared with the WMMSE algorithm.

In [55], a CNN based power control scheme is developed for the same scenario considered in [15], where the full channel gain information is normalized and taken as the input of the CNN, while the output is the power allocation vector. Similar to [15], the CNN is first trained to approximate the traditional WMMSE algorithm to guarantee a basic performance, and then the loss function is further set as a function of spectral efficiency (SE) or energy efficiency (EE). Simulation results show that the proposal can achieve almost the same or even higher SE and EE than WMMSE at a faster computing speed. Also using NNs for power allocation, authors in [56] adopt an NN architecture formed by stacking multiple encoders from pre-trained auto-encoders and a pre-trained softmax layer. The architecture takes the CSI and the location indicators as the input, with each indicator representing whether a user is a cell-edge user, and the output is the resource allocation result. The training data set is generated by solving a sum rate maximization problem under different CSI realizations via the genetic algorithm.

3) Transfer Learning Based Approaches: In heterogeneous networks, when femtocells share the same radio resources with macrocells, power control is needed to limit the inter-tier interference to MUEs. However, faced with a dynamic environment, it is difficult for femtocells to meet the QoS constraints of MUEs during the entire operation time. In [57], distributed Q-learning is utilized for inter-tier interference management, where the femtocells, as learning agents, aim at optimizing their own capacity while satisfying the data rate requirement of MUEs. Due to the frequent changes in RB scheduling, i.e., the RB allocated to each UE is different from time to time, and the backhaul latency, the power control policy learned by femtocells will no longer be useful and can cause the violation of the data rate constraints of MUEs. To deal with this problem, the authors propose to let the MBS inform femtocells about the future RB scheduling, which facilitates the power control knowledge transfer between different environments. Hence, femtocells can still avoid interference to the MUE, even if its RB allocation is changed. In this study, the power control knowledge is represented by the complete Q-table learned for a given RB. System level simulations demonstrate that this scheme can work properly in a multi-user OFDMA network and is superior to the traditional power control algorithm in terms of the average capacity of cells.

4) Lessons Learned: It can be learned from [48]–[52], [57] that distributed Q learning and learning based on joint utility and strategy estimation can both help to develop self-organizing and autonomous power control schemes for CRNs and heterogeneous networks. Moreover, according to [50], Q values can be represented in a tabular form or by a neural network, and the two representations have different memory and computation overheads. In addition, as indicated by [57], Q learning can be enhanced to make agents better adapt to a dynamic environment by involving transfer learning for a heterogeneous network. Following [51], better system performance can be achieved by making the utility of agents identical to the system's goal. Finally, according to [15], [55], [56], using NNs to approximate traditional high-complexity power allocation algorithms is a potential way to realize real-time power allocation.

TABLE III
MACHINE LEARNING BASED POWER CONTROL

B. Machine Learning Based Spectrum Management

With the explosive increase of data traffic, spectrum shortages have drawn big concerns in the wireless communication community, and efficient spectrum management is desired to improve spectrum utilization. In the following, reinforcement learning and unsupervised learning based spectrum management are introduced.

1) Reinforcement Learning Based Approaches: In [58], spectrum management in millimeter-wave, ultra-dense networks is investigated and temporal-spatial reuse is considered as a method to improve spectrum utilization. The spectrum management problem is formulated as a non-cooperative game among devices, which is proved to be an ordinary potential game guaranteeing the existence of a Nash equilibrium (NE). To help devices achieve the NE without global information, a novel distributed Q learning algorithm is designed, which facilitates devices to learn environments from the individual reward. The action and reward of each device are channel selection and channel capacity, respectively. Different from traditional Q learning where the Q value is defined over state-action pairs, the Q value in the proposal is defined over actions, that is, each action corresponds to a Q value. In each time slot, the Q value of the played action is updated as a weighted sum of the current Q value and the immediate reward, while the Q values of other actions remain the same. In addition, based on rigorous analysis, a key conclusion is drawn that less coupling in learning agents can help speed up the convergence of learning. Simulations demonstrate that the proposal can converge faster and is more stable than several baselines, and also leads to a small latency.

Similar to [58], authors in [59] focus on temporal-spatial spectrum reuse but with the use of MAB theory. In order to overcome the high computation cost brought by the centralized channel allocation policy, a distributed three-stage policy is proposed. The goal of the first two stages is to help each SU find the optimal channel access rank, while the third stage, based on MAB, is for the optimal channel allocation. Specifically, with probability 1-ε, each SU chooses a channel based on the channel access rank and empirical idle probability estimates,
and uniformly chooses a channel at random otherwise. Then, the SU senses the selected channel and will receive a reward equal to 1 if neither the primary user nor other SUs transmit over this channel. By simulations, it is shown that the proposal can achieve significantly smaller regrets than the baselines in the spectrum temporal-spatial reuse scenario. The regret is defined as the difference between the total reward of a genie-aided rule and the expected reward of all SUs.

In [60], authors study a multi-objective spectrum access problem in a heterogeneous network. Specifically, the concerned problem aims at minimizing the received intra-/inter-tier interference at the femtocells and the inter-tier interference from femtocells to eNBs simultaneously under QoS constraints. Considering the lack of global and complete channel information, the unknown number of nodes, and so on, the formulated problem is very challenging. To handle this issue, a reinforcement learning approach based on joint utility and strategy estimation is proposed, which contains two sequential levels. The purpose of the first level is to identify available spectrum resources for femtocells, while the second level is responsible for the optimization of resource selection. Two different utilities are designed for the two levels, namely the spectrum modeling and spectrum selection utilities. In addition, three different learning algorithms, including the gradient follower, the modified Roth-Erev, and the modified Bush and Mosteller learning algorithms, are available for each femtocell. To determine the action selection probabilities based on the propensity of each action output by the learning process, logistic functions are utilized, which are commonly used in machine learning to transform full-range variables into the limited range of a probability. Using the proposed approach, higher cell throughput is achieved owing to the significant reduction in intra-tier and inter-tier interference.

In [61], the joint communication mode selection and subchannel allocation of D2D pairs is solved by joint utility and strategy estimation based reinforcement learning for a D2D enabled C-RAN shown in Fig. 13. In the proposal, each action of a D2D pair is a tuple of a communication mode and a subchannel. Once each pair has selected its action, distributed RRH association and power control are executed, and then each pair receives the system SE as the instantaneous utility, based on which the utility estimation for each action is updated. Via simulation, it is demonstrated that a near optimal performance can be achieved by properly setting the parameter controlling the balance between exploration and exploitation.

To overcome the challenges existing in current solutions to spectrum sharing between operators, an inter-operator proximal spectrum sharing (IOPSS) scheme is presented in [62], in which a BS is able to intelligently offload users to its neighboring BSs based on spectral proximity. To achieve this
goal, a Q learning framework is proposed, resulting in a self-organizing, spectrally efficient network. The state of a BS is the experienced load whose value is discretized, while an action of a BS is a tuple of spectral sharing parameters related to each neighboring BS. These parameters include the number of RBs requiring each neighboring BS to be reserved, the probability of each user served by the neighboring BS with the strongest SINR, and the reservation proportion of the requested RBs. The cost function of each BS is related to both the QoE of its users and the change in the number of RBs it requests. Through extensive simulations with different loads, the distributed, dynamic IOPSS based Q learning can help mobile network operators provide users with a high QoE and reduce operational costs.

Fig. 13. The considered uplink D2D enabled C-RAN scenario.

In addition to adopting classical reinforcement learning methods, a novel reinforcement learning approach involving recurrent neural networks is utilized in [63] to handle the management of both licensed and unlicensed frequency bands in LTE-U systems. Specifically, the problem is formulated as a non-cooperative game with SBSs and an MBS as game players. Solving the game is a challenging task since each SBS may know only a little information about the network, especially in a dense deployment scenario. To achieve a mixed-strategy NE, multi-agent reinforcement learning based on echo state networks (ESNs) is proposed, which are easy to train and can track the state of a network over time. Each BS is an ESN agent using two ESNs, namely ESN α and ESN β, to approximate the immediate and the expected utilities, respectively. The input of the first ESN comprises the action profile of all the other BSs, while the input of the latter is the user association of the BS. Compared to traditional RL approaches, the proposal can quickly learn to allocate resources without much training data. During the algorithm execution, each BS needs to broadcast only the action currently taken and its optimal action. The simulation result shows that the proposed approach improves the sum rate of the 50th percentile of users by up to 167% compared to Q learning. A similar idea that combines RNNs with reinforcement learning has been adopted by [64] in a wireless network supporting virtual reality (VR) services. Specifically, a complete VR service consists of two communication stages. In the uplink, BSs collect tracking information from users, while BSs transmit three-dimensional images together with audio to VR users in the downlink. Thus, it is essential for resource block allocation to jointly consider both the uplink and downlink. To address this problem, each BS is taken as an ESN agent whose actions are its resource block allocation plans for both uplink and downlink. The input of the ESN maintained by each BS is a vector containing the indexes of the probability distributions that all the BSs currently use, while the output is a vector of utility values associated with each action, based on which the BS selects its action. Simulation results show that the proposed algorithm yields significant gains in terms of VR QoS utility.

2) Lessons Learned: First, it is learned from [58] that less coupling in learning agents can help speed up the convergence of distributed reinforcement learning for a spatial-temporal spectrum reuse scenario. Second, following [61], near-optimal system SE can be achieved for a D2D enabled C-RAN by joint utility and strategy estimation based reinforcement learning, if the parameter balancing exploration and exploitation becomes larger as time goes by. Third, as indicated by [63], [64], when each BS is allowed to manage its own spectrum resource, RNNs can be used by each BS to predict its utility, which can help the system reach a stable resource allocation outcome.

C. Machine Learning Based Backhaul Management

In wireless networks, in addition to the management of radio resources like power and spectrum, the management of backhaul links connecting SBSs and MBSs or connecting BSs and the core network is essential as well to achieve better system performance. This subsection will introduce the literature related to backhaul management based on reinforcement learning.

1) Reinforcement Learning Based Approaches: In [65], Jaber et al. propose a backhaul-aware cell range extension (CRE) method based on RL to adaptively set the CRE offset value. In this method, the observed state for each small cell is defined as a value reflecting the violation of its backhaul capacity, and the action to be taken is the CRE bias of a cell considering whether the backhaul is available or not. The definition of the cost for each small cell intends to maximize the utilization of total backhaul capacity while keeping the backhaul capacity constraint of each cell satisfied. Q learning is adopted to minimize this cost through an iterative process, and simulation results show that the proposal relieves the backhaul congestion in macrocells and improves the QoE of users. In [66], authors concentrate on load balancing to improve backhaul resource utilization by learning system bias values via a distributed Q learning algorithm. In this algorithm, Xu et al. take the backhaul utilization, quantified to several levels, as the environment state based on which each SBS determines an action, that is, the bias value. Then, with the reward function defined as the weighted difference between the backhaul resource utilization and the outage probability for each SBS, Q learning is utilized to learn the bias value selection strategy, achieving a balance between system-centric performance and user-centric performance. Numerical results show that this algorithm is able to optimize the utilization of backhaul resources under the promise of guaranteeing user QoS.
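Both schemes above follow the standard tabular Q learning recipe. A generic sketch is given below; it is illustrative only, with the discretized backhaul-utilization states, the candidate bias values, and the cost signal as placeholders rather than the exact designs of [65], [66].

```python
import numpy as np

STATES  = 5          # e.g., quantized backhaul utilization levels
ACTIONS = 7          # e.g., candidate CRE bias values
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

Q = np.zeros((STATES, ACTIONS))
rng = np.random.default_rng(0)

def select_action(state):
    """Epsilon-greedy selection over the candidate bias values."""
    if rng.random() < EPS:
        return int(rng.integers(ACTIONS))    # explore
    return int(np.argmin(Q[state]))          # exploit: smallest expected cost

def update(state, action, cost, next_state):
    """One Q-learning step toward minimizing the long-term cost."""
    target = cost + GAMMA * np.min(Q[next_state])
    Q[state, action] += ALPHA * (target - Q[state, action])
```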
TABLE IV
MACHINE LEARNING BASED SPECTRUM MANAGEMENT
Unlike those in [65] and [66], authors in [67] and [68] model backhaul management from a game-theoretic perspective. The problems are solved employing an RL approach based on joint utility and strategy estimation. Specifically, the backhaul management problem is formulated as a minority game in [67], where SBSs are the players and have to decide whether to download files for predicted requests while serving the urgent demands. In order to approximate the mixed NE, an RL-based algorithm, which enables each SBS to update its strategy based on only the received utility, is proposed. In contrast to previous similar RL algorithms, this scheme is mathematically proved to converge to a unique equilibrium point for the formulated game. In [68], MUEs can communicate with the MBS with the help of SBSs serving as relays. The backhaul links between SBSs and the MBS are heterogeneous, including both wired and wireless backhaul. The competition among MUEs is modeled as a non-cooperative game, where their actions are the selections of transmission power, the assisting SBSs, and rate splitting parameters. Using the proposed RL approach, the coarse correlated equilibrium of the game is reached. In addition, it is demonstrated that the proposal achieves better average throughput and delay for the MUEs than existing benchmarks do.

2) Lessons Learned: It can be learned from [65] that Q learning can help with alleviating backhaul congestion and improving the QoE of users by autonomously adjusting CRE parameters. Following [66], when one intends to balance between system-centric performance and user-centric performance, the reward fed back to the Q learning agent can be defined as a weighted difference between them. As indicated by [67], joint utility and strategy estimation based learning can help achieve a balance between downloading files potentially requested in the future and serving current traffic. According to [68], distributed reinforcement learning facilitates UEs to select heterogeneous backhaul links in a heterogeneous network scenario with SBSs acting as relays for MUEs, which results in an improved rate and delay.

D. Machine Learning Based Cache Management

Due to the proliferation of smart devices and intelligent applications, such as augmented reality, virtual reality, ubiquitous social networking, and IoT, wireless communication systems have experienced a tremendous data traffic increase over the past couple of years. Additionally, it has been envisioned that the cellular network will produce about 30.6 exabytes of data per month by 2020 [72]. Faced with the explosion of data demands, the caching paradigm is introduced for the future wireless network to shorten latency and alleviate the transmission burden on backhaul [73]. Recently, many excellent research studies have adopted ML techniques to manage cache resources with great success.

1) Reinforcement Learning Based Approaches: Considering the various spatial and temporal content demands among different small cells, authors in [74] develop a decentralized caching update scheme based on joint utility-and-strategy-estimation RL. With this approach, each SBS can optimize a caching probability distribution over content classes using only the received instantaneous utility feedback. In addition, by taking a weighted sum of the caching strategies of each SBS and the cloud, a tradeoff between local content popularity and global popularity can be achieved. In [75], authors also focus on distributed caching design. Different from [74], BSs are allowed to cooperate with each other in the sense that each BS can get the locally missing content from other BSs via backhaul, which can be a more cost-efficient solution. Meanwhile, D2D offloading is also considered to improve the cache utilization. Then, to minimize the system transmission cost, a distributed Q learning algorithm is utilized. For each BS, the content placement is taken as the observed state, and the adjustment of cached contents is taken as the action.
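A highly simplified sketch of a caching-probability update in the spirit of [74] is given below. It is our own construction: the reinforcement rule, the normalization, and the mixing weight with the cloud's global popularity are assumptions, not the exact scheme of [74].

```python
import numpy as np

def update_caching_distribution(local_probs, chosen_class, utility,
                                global_popularity, lr=0.05, mix=0.3):
    """Adjust an SBS's caching distribution from its own utility feedback,
    then blend it with the cloud's global content popularity."""
    probs = local_probs.copy()
    probs[chosen_class] += lr * utility          # reinforce the class that paid off
    probs = np.clip(probs, 1e-6, None)
    probs /= probs.sum()                         # keep it a probability distribution
    # Weighted sum of the local strategy and the global popularity profile.
    return (1 - mix) * probs + mix * global_popularity
```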
TABLE V
MACHINE LEARNING BASED BACKHAUL MANAGEMENT
scheme integrating a 3D CNN for generic video feature extraction, an SVM for generating representation vectors of videos, and then a regression model for predicting the video popularity by taking the corresponding representation vector as input. After the popularity of each video is obtained, the optimal portion of each video cached at the BS is derived to minimize the backhaul load in each time period. The advantage of the proposal lies in the ability to predict the popularity of newly uploaded videos with no statistical information required.

3) Transfer Learning Based Approaches: Generally, the content popularity profile plays a key role in deriving efficient caching policies, but its estimation with high accuracy suffers from the long time incurred by collecting user file request samples. To overcome this issue, authors in [82] involve the idea of transfer learning by integrating the file request samples from the social network domain into the file popularity estimation formula. By theoretical analysis, the training time is expressed as a function of the "distance" between the probability distribution of the requested files and that of the source domain samples. In addition, transfer learning based approaches are also adopted in [83] and [84]. Specifically, although collaborative filtering (CF) can be utilized to estimate the file popularity matrix, it faces the problem of data sparseness. Hence, authors in [83] propose a transfer learning based CF approach to extract collaborative social behavior information from the interaction of D2D users within a social community (source domain), which improves the estimation of the (large-scale) file popularity matrix in the target domain. In [84], a transfer learning based caching scheme is developed, which is executed at each SBS. Particularly, by using contextual information like users' social ties that are acquired from D2D interactions (source domain), cache placement at each small cell is carried out, taking estimated content popularity, traffic load and backhaul capacity into account. Via simulation, it is demonstrated that the proposal can well deal with data sparsity and cold-start problems, which results in a significant enhancement in users' QoE.

4) Lessons Learned: It can be learned from [78]–[80] that the content popularity profile can be accurately predicted or estimated by supervised learning like recurrent neural networks and extreme learning machines, which is useful for the cache management problem formulation. Second, based on [82]–[84], involving the content request information from other domains, such as the social network domain, can help reduce the time needed for popularity estimation. At last, when the file popularity profile is difficult to acquire, reinforcement learning is an effective way to directly optimize the caching policy, as indicated by [74], [75], [77].

TABLE VI
MACHINE LEARNING BASED CACHE MANAGEMENT

TABLE VII
MACHINE LEARNING BASED COMPUTATION RESOURCE MANAGEMENT

E. Machine Learning Based Computation Resource Management

In [85], authors investigate a wireless network that provides MEC services, and a computation offloading decision problem is formulated for a representative mobile terminal, where multiple BSs are available for computation offloading. More specifically, the problem takes environmental dynamics into account, including the time-varying channel quality and the task arrival and energy status at the mobile device. To develop the optimal offloading decision policy, a double DQN based learning approach is proposed, which does not need the complete information about network dynamics and can handle state spaces with high dimension. Simulation results show that the proposal can improve computation offloading performance significantly compared with several baseline policies.

1) Lessons Learned: From [85], it is learned that DRL based on double DQN can be used to optimize the computation offloading policy without knowing the information about network dynamics, such as channel quality dynamics, and can meanwhile handle the issue of state space explosion for a wireless network providing MEC services. However, authors in [85] only consider a single user. Hence, in the future, it is interesting to study computation offloading for the scenario with multiple users based on DRL, whose offloading decisions can be coupled due to interference and constrained MEC resources.

F. Machine Learning Based Beamforming

Considering the ever-increasing QoS requirements and the need for real-time processing in practical systems, authors in [86] propose a supervised learning based resource allocation framework to quickly output the optimal or a near optimal resource allocation solution for the current scenario. Specifically, the data related to historical scenarios is collected and a feature vector is extracted for each scenario. Then, the optimal or near optimal resource allocation plan can be searched off-line by taking advantage of cloud computing. After that, those feature vectors with the same resource allocation solution are labeled with the same class index. The remaining task to determine the resource allocation for a new scenario is then to identify the class of its corresponding feature vector; that is, the resource allocation problem is transformed into a multi-class classification problem, which can be handled by supervised learning.

To make the application of the proposal more intuitive, an example using KNN to optimize beam allocation in a single cell with multiple users is presented, and simulation results show an improvement in sum rate compared to a state-of-the-art method. However, it should be noted that there exists a delay caused by KNN comparing the similarities between the current scenario and past scenarios. When such similarities can only be calculated one by one and the number of past scenarios is large, it is possible that the environment has changed before KNN outputs a beam allocation result, which can lead to poor performance. Therefore, a low-complexity classifier with a small response delay is essential for machine learning based beamforming to well adapt to the time-varying user phase information and channel state information.

1) Lessons Learned: As indicated by [86], resource management in wireless networks can be transformed into a supervised classification task, where the labeled data set is composed of feature vectors representing different scenarios and their corresponding classes. The feature vectors belonging to the same class correspond to the same resource allocation solution. Then, various machine learning techniques for classification can be applied to determine the resource allocation for
a new scenario. When the classification algorithm is of low complexity, it is possible to achieve near real-time resource allocation. At last, it should be highlighted that the key to the success of this framework lies in the proper feature vector construction for the communication scenario and the design of low-complexity multi-class classifiers.

IV. MACHINE LEARNING BASED NETWORKING

With the rapid growth of data traffic and the expansion of the network, networking in future wireless communications requires more efficient solutions. In particular, the imbalance of traffic loads among heterogeneous BSs needs to be addressed, and meanwhile, wireless channel dynamics and newly emerging vehicle networks both incur a big challenge for traditional networking algorithms that are mainly designed for static networks. To overcome these issues, research on ML based user association, BS switching control, routing, and clustering has been conducted.

A. Machine Learning Based BS Association

1) Reinforcement Learning Based Approaches: In the vehicle network, the introduction of economical SBSs greatly reduces the network operation cost. However, proper association schemes between vehicles and BSs are needed for load balancing among SBSs and MBSs. Most previous algorithms assume static channel quality, which is not feasible in the real world. Fortunately, the traffic flow in vehicular networks possesses spatial-temporal regularity. Based on this observation, Li et al. in [87] propose an online reinforcement learning approach (ORLA) for a vehicular network shown in Fig. 15. The proposal is divided into two learning phases: initial reinforcement learning and history-based reinforcement learning. In the initial learning model, the vehicle-BS association problem is seen as a multi-armed bandit problem. The action of each BS is the decision on the association with vehicles, and the reward is defined to minimize the deviation of the data rates of the vehicles from the average rate of all the vehicles. In the second learning phase, considering the spatial-temporal regularities of vehicle networks, the association patterns obtained in the initial RL stage enable the load balancing of BSs through history-based RL when the environment dynamically changes. Specifically, each BS will calculate the similarity between the current environment and each historical pattern, and the association matrix is output based on the historical association pattern. Compared with the max-SINR scheme and distributed dual decomposition optimization, the proposed ORLA reaches the minimum load variance of multiple cells.

Besides the information related to SINR, backhaul capacity constraints and diverse attributes related to the QoE of users should also be taken into account for user association. In [88], authors propose a distributed, user-centric, backhaul-
TABLE VIII
MACHINE LEARNING BASED BEAMFORMING DESIGN
TABLE IX
MACHINE LEARNING BASED USER ASSOCIATION
… to the policy values in actor-critic reinforcement learning. The simulation result shows that combining RL with transfer learning outperforms the method using only RL, in terms of both energy saving and convergence speed.

Though some TL based methods have been employed to develop BS sleeping strategies, WiFi network scenarios have not been covered. In the context of WiFi networks, knowledge of the real-time data gathered from the APs about the present environment is utilized for developing the switching on-off policy in [101], where the actor-critic algorithm is used. These works have indicated that TL can offer much help in finding optimal BS switching strategies, but it should be noted that TL may have a negative influence on the network, since there are still differences between the source task and the target task. To resolve this problem, authors in [18] propose to diminish the impact of the prior knowledge on decision making as time goes by.

3) Unsupervised Learning Based Approaches: K-means, which is a kind of unsupervised learning algorithm, can help enhance BS switching on-off strategies. In [102], based on the similarity of the location and traffic load of BSs, K-means clustering is used to group BSs into different clusters. In each cluster, the interference is mitigated by allocating orthogonal resources among communication links, and the traffic of off-BSs can be offloaded to on-BSs. Simulation shows that involving K-means results in a lower average cost per BS when the number of users is large. In [103], by applying K-means, different values of RSRQ are grouped into different clusters. Consequently, users are grouped into clusters depending on their corresponding RSRQ values. After that, the cluster information is considered as a part of the system state in Q learning to find the optimal BS switching strategy. With the help of K-means, the proposed method achieves lower average energy consumption than the method without K-means.

4) Lessons Learned: First, according to [94], the actor-critic learning based method enables BSs to make wise switching decisions without the need for knowledge about the traffic loads within the BSs. Second, following [96], [97], the power cost incurred by BS switching should be included in the cost fed back to the reinforcement learning agent, which makes the energy consumption optimization more reasonable. Third, as indicated by [18], integrating transfer learning into actor-critic learning based BS switching can achieve better performance at the beginning as well as faster convergence compared to traditional actor-critic learning. At last, based on [102], [103], by properly clustering BSs and users with K-means before optimizing BS on-off states, better performance can be gained.

C. Machine Learning Based Routing

To fulfill stringent traffic demands in the future, many new RAN technologies continuously come into being, including C-RANs, CR networks (CRNs) and ultra-dense networks. To realize effective networking in these scenarios, routing strategies play key roles. Specifically, by deriving proper paths for data transmission, transmission delay and other types of performance can be optimized. Recently, machine learning has emerged as a breakthrough for providing efficient routing protocols to enhance the overall network performance [33]. In this vein, we provide a summary of novel machine learning based routing schemes.

1) Reinforcement Learning Based Approaches: To overcome the challenge incurred by dynamic channel availability in CRNs, authors in [105] propose a clustering and reinforcement learning based multi-hop routing scheme, which provides high stability and scalability. Using Q learning, the availability of the bottleneck channel along the route can be well estimated, which guides the routing node selection. Specifically, the source node maintains a Q table, where each Q value corresponds to a pair composed of a destination node and the next-hop node. After Q values are learned and given the destination, the source node chooses the next-hop node with the largest Q value.

Also focusing on multi-hop routing in CRNs, two different routing schemes based on reinforcement learning, namely a traditional RL scheme and an RL-based scheme with average Q value, are investigated in [106]. In both schemes, the definitions of the action and state are the same as those in [105], and the reward is defined as the channel available time of the bottleneck link. Compared to the traditional RL scheme, the RL-based scheme with average Q value helps select more stable routes. The superior performance of these two schemes is verified by implementing a test bed and comparing with a highest-channel route selection scheme.

In [107], authors study the influences of several network characteristics, such as network size, on the performance of Q learning based routing for a cognitive radio ad hoc network. It is found that network characteristics have slight impacts on the end-to-end delay and packet loss rate of SUs. In addition, reinforcement learning is also a promising paradigm for developing routing protocols for the unmanned robotic network. Specifically, to save network overhead in high-mobility scenarios, a Q learning based geographic routing strategy is introduced in [108], where each state represents a mobile node and each action defines a routing decision. A characteristic of the Q learning adopted in this paper is the novel design of the reward, which incorporates packet travel speed. Simulation using NS-3 confirms a better packet delivery ratio with a lower network overhead compared to existing methods.

2) Supervised Learning Based Approaches: In [109], a routing scheme based on DNNs is developed, which enables each router in heterogeneous networks to predict the whole path to the destination. More concretely, each router trains a DNN to predict the next proper router for each potential destination, using the training data generated by following the Open Shortest Path First (OSPF) protocol. The input and output of the DNN are the traffic patterns of all the routers and the index of the next router, respectively. Moreover, instead of training all the weights of the DNN at the same time, a greedy layer-wise training method is adopted. By simulations, lower signaling overhead and higher throughput are observed compared with the OSPF routing strategy.
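To make this supervised formulation more tangible, the sketch below trains a small fully connected network to map a traffic-pattern vector to a next-hop index, with labels produced by a stand-in forwarding rule that plays the role of the recorded OSPF decisions in [109]. The topology, the labelling rule and the scikit-learn model are illustrative assumptions, and the greedy layer-wise pre-training of the cited work is not reproduced.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # For one router, learn a mapping from the traffic pattern observed at all
    # routers to the index of the next hop.  Labels come from a stand-in
    # shortest-path rule standing in for recorded OSPF decisions.
    rng = np.random.default_rng(0)
    NUM_ROUTERS, SAMPLES = 8, 2000
    X = rng.random((SAMPLES, NUM_ROUTERS))              # traffic generation rates
    # Hypothetical labelling rule: forward towards the least-loaded neighbour
    # among routers 1..3 (the assumed neighbours of this router).
    y = 1 + np.argmin(X[:, 1:4], axis=1)

    dnn = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
    dnn.fit(X[:1500], y[:1500])                         # imitate the protocol's choices
    print("held-out next-hop accuracy:", dnn.score(X[1500:], y[1500:]))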
TABLE X
MACHINE LEARNING BASED BS SWITCHING
To improve the routing performance for the wireless backbone, deep CNNs are exploited in [110], which can learn from the experienced congestion. Specifically, a CNN is constructed for each routing strategy, and the CNN takes traffic pattern information collected from routers, such as the traffic generation rate, as input to predict whether the corresponding routing strategy can cause congestion. If yes, the next routing strategy will be evaluated, until it is predicted that there will be no congestion. Meanwhile, it should be noted that these constructed CNNs are trained in an online manner with the training data set continuously updated, and hence the routing decision becomes more accurate.

3) Lessons Learned: First, according to [105], [106], when one applies Q learning to route selection in CRNs, the reward feedback can be set as a metric representing the quality of the bottleneck link along the route, such as the channel available time of the bottleneck link. Second, following [107], network characteristics, such as network size, have limited impacts on the performance of Q learning based routing for a cognitive radio ad hoc network, in terms of the end-to-end delay and packet loss rate of secondary users. Third, from [109], it can be learned that DNNs can be trained using the data generated by the OSPF routing protocol, and the resulting DNN model based routing can achieve lower signaling overhead and higher throughput. At last, as discussed in [110], OSPF can be substituted by CNN based routing, which is trained in an online fashion and can avoid past, faulty routing decisions compared to OSPF.

D. Machine Learning Based Clustering

In wireless networking, it is common to divide nodes or users into different clusters to conduct some cooperation or coordination within each cluster, which can further improve network performance. Based on the introduction to ML, it can be seen that the clustering problem can be naturally dealt with by the K-means algorithm, as some papers do. Moreover, supervised learning and reinforcement learning can be utilized as well.

1) Supervised Learning Based Approaches: To reduce content delivery latency in a cache enabled small cell network, a user clustering based TDMA transmission scheme is proposed in [111] under pre-determined user association and content placement. The user cluster formation and the time duration to serve each cluster need to be optimized. Since the number of potential clusters grows exponentially with respect to the number of users served by an SBS, a DNN is constructed to predict whether each user is in a cluster, which takes the user channel gains and user demands as input. In this manner, users joining clusters can be quickly identified, reducing the search space to obtain the optimal user cluster formation.
TABLE XI
MACHINE LEARNING BASED ROUTING STRATEGY
2) Unsupervised Learning Based Approaches: In [112], K-means clustering is considered for clustering hotspots in densely populated areas with the goal of maximizing spectrum utilization. In this scheme, a mobile device accessing the cellular network can act as a hotspot to provide broadband access to nearby users called slaves. The fundamental problem to be solved is to identify which devices play the role of hotspots and the set of users associated with each hotspot. To this end, authors first adopt a modified version of the constrained K-means clustering algorithm to group the set of users into different clusters based on their locations, where both the maximum and minimum numbers of users in a cluster are set. Then, the user with the minimum average distance to both the center of the cluster and the BS is selected as the hotspot in each cluster. After that, a graph-coloring approach is utilized to assign spectrum resources to each cluster, and power and spectrum resource allocation to all slaves and hotspots is performed later on. The simulation result shows that the proposal can significantly increase the total number of users that can be served in the system with lower cost and complexity.

In [102], authors use clustering of SBSs to realize the coordination between them. The similarity between two SBSs considers both their distance and the heterogeneity between their traffic loads, so that two BSs with a shorter distance and a higher load difference have more chances to cooperate. Since the similarity matrix possesses the properties of a Gaussian similarity matrix, the SBS clustering problem can be handled by K-means clustering, with each SBS corresponding to an attribute vector composed of its coordinates and traffic load. By intra-cluster coordination, the number of switched-off BSs can be increased by offloading UEs from SBSs that are switched off to active SBSs, compared to the case without clustering and coordination.

3) Reinforcement Learning Based Approaches: In [113], to mitigate the interference in a downlink wireless network containing multiple transceiver pairs operating in the same frequency band, a cache-enabled opportunistic interference alignment (IA) scheme is adopted. Facing dynamic channel state information and content availability at each transmitter, a deep reinforcement learning based approach is developed to determine communication link scheduling at each time slot, and those scheduled transceiver pairs then perform IA. To efficiently handle the raw collected data like channel state information, the deep Q network in DRL is built using a convolutional neural network. Simulation results demonstrate the improved performance of the proposal, in terms of system sum rate and energy efficiency, compared to an existing scheme.

4) Lessons Learned: First, based on [111], DNNs can help identify those users or BSs that are not necessary to join clusters, which facilitates the search for the optimal cluster formation due to the reduction of the search space. Second, following [102], [112], the clustering problem can be naturally solved using K-means clustering. Third, as indicated by [113], deep reinforcement learning can be used to directly select the members forming a cluster in a dynamic network environment with time-varying CSI and cache states.

V. MACHINE LEARNING BASED MOBILITY MANAGEMENT

In wireless networks, mobility management is a key component to guarantee successful service delivery. Recently, machine learning has shown its significant advantages in user mobility prediction, handover parameter optimization, and so on. In this section, machine learning based mobility management schemes are comprehensively surveyed.

A. Reinforcement Learning Based Approaches

In [114], authors focus on a two-tier network composed of macro cells and small cells, and propose a dynamic fuzzy Q learning algorithm for mobility management. To apply Q learning, the call drop rate together with the signaling load caused by handover constitutes the system state, while the action space is defined as the set of possible values for the adjustment of the handover margin. The aim is to achieve a tradeoff between the signaling cost incurred by handover and the user experience affected by the call dropping ratio (CDR). Simulation results show that the proposed scheme is effective in minimizing the number of handovers while keeping the CDR at a desired level. In addition, Klein et al. in [115] also apply the framework based on fuzzy Q learning to jointly optimize TTT and Hys. Specifically, the framework includes three key components, namely a fuzzy inference system (FIS), a heuristic exploration/exploitation policy (EEP) and Q learning. The FIS input consists of the magnitude of the hysteresis margin and the errors of several KPIs like CDR, and ε-greedy EEP is adopted for each rule in the rule set.
TABLE XII
MACHINE LEARNING BASED CLUSTERING
To show the superior performance of the proposal, a trend-based handover optimization scheme and a TTT assignment scheme based on velocity estimation are taken as baselines. It is observed that the fuzzy Q learning based approach greatly reduces handover failures compared to the two schemes.

Moreover, achieving load balancing during the handover process is essential. In [116], a fuzzy-rule based RL system is proposed for small cell networks, which aims to balance traffic load by selecting the transmit power (TXP) and Hys of BSs. Considering that the CBR and the OR change significantly when the load in a cell is heavy, the two parameters jointly comprise the observation state. The adjustments of Hys and TXP are the system actions, and the reward is defined such that user satisfaction is optimized. As a result, the optimal adjustment strategy for Hys and TXP is generated by the Q learning system based on fuzzy rules, which can minimize the localized congestion of small cell networks.

For the LTE network with multiple SON functions, it is inevitable that optimization conflicts exist. In [117], authors propose a comprehensive solution for SON functions including handover optimization and load balancing. In this scheme, the fuzzy Q learning controller is utilized to adjust the Hys and TTT parameters simultaneously, while the heuristic Diff-Load algorithm optimizes the handover offset according to load measurements in the cell. To apply fuzzy Q learning, radio link failure, handover failure and handover ping-pong, which are key performance indicators (KPIs) in the handover process, are defined as the input to the fuzzy system. LTE-Sim simulation results show that the proposal enables the joint optimization of the KPIs above.

In addition to Q learning, authors in [16] and [118] utilize deep reinforcement learning for mobility management. In [16], to overcome the challenge of intelligent wireless network management when a large number of RANs and devices are deployed, Cao et al. propose an artificial intelligence framework based on DRL. The framework is divided into four parts: the real environment, the environment capsule, the feature extractor, and the policy network. Wireless facilities in the real environment upload information such as the RSSI to the environment capsule. Then, the capsule transmits the stacked data to the wireless signal feature extraction part consisting of a CNN and an RNN. After that, the extracted feature vectors are input to the policy network, which is based on a deep Q network, to select the best action for real network management. Finally, this novel framework is applied to a seamless handover scenario with one user and multiple APs. Using the measurement of RSSI as input, the user is guided to select the best AP, which maximizes network throughput.

In [118], authors propose a two-layer framework to optimize the handover process and reach a balance between the handover rate and system throughput. The first step is to apply a centralized control method to classify the UEs according to their mobility patterns with unsupervised learning. Then, the multi-user handover process in each cluster is optimized in a distributed manner using DRL. Specifically, the RSRQ received by the user from the candidate BS and the current serving BS index make up the state vector, and the weighted sum of the average handover rate and throughput is defined as the system reward. In addition, considering that new state exploration in DRL may start from some unexpected initial points, the performance of UEs will fluctuate greatly. In this framework, Wang et al. apply the output of the traditional 3GPP handover scheme as training data to initialize the deep Q network through supervised learning, which can compensate for the negative effects caused by exploration at the early stage of learning.
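A minimal sketch of the kind of state and reward described for [118] is given below: the state concatenates the RSRQ values reported from candidate BSs with the serving BS index, and the reward is a weighted sum trading throughput against the cost of triggering a handover. The network size, radio model, weights and the simplified single-step update (no replay buffer, no target network, no supervised initialization from the 3GPP scheme) are all assumptions made to keep the example short.

    import torch, torch.nn as nn, random

    NUM_BS, W_TPUT, W_HO = 4, 1.0, 0.5
    qnet = nn.Sequential(nn.Linear(NUM_BS + 1, 32), nn.ReLU(), nn.Linear(32, NUM_BS))
    opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)

    def step_reward(rsrq, serving, action):
        throughput = rsrq[action].item()                 # toy: better RSRQ, better rate
        handover = float(action != serving)
        return W_TPUT * throughput - W_HO * handover     # weighted-sum reward

    serving = 0
    for t in range(2000):
        rsrq = torch.rand(NUM_BS)
        state = torch.cat([rsrq, torch.tensor([float(serving)])])
        action = random.randrange(NUM_BS) if random.random() < 0.1 else int(qnet(state).argmax())
        reward = step_reward(rsrq, serving, action)
        next_rsrq = torch.rand(NUM_BS)
        next_state = torch.cat([next_rsrq, torch.tensor([float(action)])])
        with torch.no_grad():                            # one-step TD(0) target, for brevity
            target = reward + 0.9 * qnet(next_state).max()
        loss = (qnet(state)[action] - target) ** 2
        opt.zero_grad(); loss.backward(); opt.step()
        serving = action
    print("sample Q-values:", qnet(state).detach().numpy().round(2))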
B. Supervised Learning Based Approaches

Beyond the current location of the mobile equipment, learning an individual's next location enables novel mobile applications and a seamless handover process. In general, location prediction utilizes the user's historical trajectory information to infer the next position of a user. In order to overcome the issue of lacking historical information, Yu et al. propose a supervised learning based prediction method based on user activity patterns [119]. The core idea is to first predict the user's next activity and then predict its next location. Simulation results demonstrate the robust performance of the proposal.

C. Unsupervised Learning Based Approaches

Considering the impact of RF conditions at the cell edge on the setting of handover parameters, authors in [120] propose an unsupervised-shapelets based method to help BSs become automatically aware of the RF conditions at their cell edge by finding useful patterns in the RSRP information reported by users. In addition, RSRP information can be employed to derive the position at the point of a handover trigger. In [121], authors propose a modified self-organizing map (SOM) based method to determine whether indoor users should be switched to another external BS based on their location information. SOM is a type of unsupervised NN that generates a low dimensional output space from high dimensional discrete input. The input data in this scheme is the RSRP together with the angle of arrival of the mobile terminal, based on which the real physical location of a user can be determined by the SOM algorithm. After that, the handover decision can be made for the user according to pre-defined prohibited and permitted areas. Through evaluation using the network simulator, it is shown that the proposal decreases the number of unnecessary handovers by 70%.

D. Lessons Learned

First, based on [114]–[117], fuzzy Q learning is a common approach to handover parameter optimization, and KPIs, such as radio link failure, handover failure and CDR, are suitable candidates for the observation state because of their close relationship with the handover process. Second, according to [16], [118], deep reinforcement learning can be used to directly make handover decisions on the user-BS association, taking only measurements from users, like RSSI and RSRQ, as the input. Third, following [119], the lack of user history trajectory information can be overcome by first predicting the user's next activity and then predicting its next location. At last, as indicated by [120], unsupervised shapelets can help find useful patterns in the RSRP information reported by users and further make BSs aware of the RF conditions at their cell edge, which is critical to handover parameter setting.

VI. MACHINE LEARNING BASED LOCALIZATION

In recent years, we have witnessed an explosive proliferation of location based services, whose service quality is highly dependent on the accuracy of localization. The mature Global Positioning System (GPS) technique has been widely used for outdoor localization. However, when it comes to indoor localization, GPS signals from a satellite are heavily attenuated, which makes GPS unsuitable for indoor use. Furthermore, indoor environments are more complex, as there are many obstacles such as tables and wardrobes, which makes localization more difficult. In this situation, to locate indoor mobile users precisely, many wireless technologies can be utilized, such as WLAN, ultra-wide bandwidth (UWB) and Bluetooth. Moreover, common measurements used in indoor localization include time of arrival (TOA), TDOA, channel state information (CSI) and received signal strength (RSS). To solve various problems associated with indoor localization, research has been conducted by adopting machine learning in scenarios with different wireless technologies.

1) Supervised Learning Based Approaches: Instead of assuming that equal signal differences account for equal geometrical distances, as traditional KNN based localization approaches do, authors in [122] propose a feature scaling based KNN (FS-KNN) localization algorithm. This algorithm is inspired by the fact that the relation between signal differences and geometrical distances actually depends on the measured RSS. Specifically, in the signal distance calculation between the fingerprint of each RP and the RSS vector reported by the user, the square of each element-wise difference is multiplied by a weight that is a function of the corresponding RSS value measured by the user. To identify the parameters involved in the weight function, an iterative training procedure is used, and a performance evaluation on a test set is made, whose metric is taken as the objective function of a simulated annealing algorithm to tune those parameters. After the model is well trained, the distance between a newly received RSS vector and each fingerprint is calculated, and then the location of the user is determined by calculating a weighted mean of the locations of the k nearest RPs.

To deal with the high energy consumption incurred by frequent AP scanning via WiFi interfaces, an energy-efficient indoor localization system is developed in [123], where ZigBee interfaces are used to collect WiFi signals. To improve localization accuracy, three KNN based localization approaches adopting different distance metrics are evaluated, including weighted Euclidean distance, weighted Manhattan distance and relative entropy. The principle for weight setting is that an AP with more redundant information is assigned a lower weight. In [124], authors theoretically analyze the optimal number of nearest RPs used to identify the user location in a KNN based localization algorithm, and it is shown that k = 1 and k = 2 outperform the other settings for static localization.

To avoid regularly training localization models from scratch, authors in [125] propose an online independent support vector machine (OISVM) based localization system that employs the RSS of Wi-Fi signals. Compared to traditional SVM, OISVM is capable of learning in an online fashion and allows a balance to be struck between accuracy and model size, facilitating its adoption on mobile devices. The constructed system includes two phases, i.e., the offline phase and the online phase. The offline phase further includes kernel parameter selection, data under-sampling to deal with the imbalanced data problem, and offline training using a pre-collected RSS data set whose RSS samples are appended with the corresponding RP labels. In the online phase, location estimation is conducted for new RSS samples, and meanwhile online learning is performed as new training data arrives, which can be collected via crowdsourcing. The simulation result shows that the proposal can reduce the location estimation error by 0.8 m, while the prediction time and training time are decreased significantly compared to traditional methods.
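The weighted KNN estimation step shared by the fingerprinting schemes in [122]–[124] can be summarized by the short sketch below, where the position estimate is the inverse-distance-weighted mean of the coordinates of the k reference points whose stored RSS fingerprints best match a new measurement. The synthetic radio map and the plain Euclidean signal distance are simplifying assumptions; FS-KNN would additionally learn RSS-dependent feature weights.

    import numpy as np

    rng = np.random.default_rng(1)
    NUM_RP, NUM_AP, K = 50, 6, 3
    rp_xy = rng.uniform(0, 20, size=(NUM_RP, 2))               # RP coordinates (m)
    fingerprints = rng.normal(-70, 8, size=(NUM_RP, NUM_AP))   # stored RSS (dBm)

    def locate(rss):
        d = np.linalg.norm(fingerprints - rss, axis=1)         # signal-space distance
        nearest = np.argsort(d)[:K]
        w = 1.0 / (d[nearest] + 1e-6)                          # closer RPs weigh more
        return (w[:, None] * rp_xy[nearest]).sum(axis=0) / w.sum()

    measurement = fingerprints[10] + rng.normal(0, 2, NUM_AP)  # noisy sample near RP 10
    print("true RP:", rp_xy[10].round(2), "estimate:", locate(measurement).round(2))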
TABLE XIII
MACHINE LEARNING BASED MOBILITY MANAGEMENT
Given that non-line-of-sight (NLOS) radio blockage can lower the localization accuracy, it is beneficial to identify NLOS signals. To this end, authors in [126] develop a relevance vector machine (RVM) based method for ultra-wide bandwidth TOA localization. Specifically, an RVM based classifier is used to identify the LOS and NLOS signals received by the agent with unknown position from the anchor whose position is already known, while an RVM regressor is adopted for ranging error prediction. Both models take a feature vector as input data, which consists of the received energy, maximum amplitude, and so on. The advantage of RVM over SVM is that RVM uses a smaller number of relevance vectors than the number of support vectors in the SVM case, hence reducing the computational complexity. On the contrary, authors in [127] propose an SVM based method, where a mapping between features extracted from the received waveform and the ranging error is directly learned. Hence, explicit LOS and NLOS signal identification is not needed anymore.

In addition to KNN and SVM, some researchers have applied NNs to localization. In order to reduce the time cost of the training procedure, authors in [128] utilize ELM. The RSS fingerprints and their corresponding physical coordinates are used to train the output weights. After the model is trained, it can predict the physical coordinate for a new RSS vector. Also adopting a single layer NN, in [129], an NN based method is proposed for an LTE downlink system, aiming at estimating the UE position. The employed NN contains three layers, namely an input layer, a hidden layer and an output layer, with the input and the output being channel parameters and the corresponding coordinates, respectively. The Levenberg-Marquardt algorithm is applied to iteratively adjust the weights of the NN based on the mean squared error. Once the NN is trained, it can predict the location given new data. Preliminary experimental results show that the proposed method yields a median positioning error distance of 6 meters for the indoor scenario.

In a WiFi network with RSS based measurement, a deep NN is utilized in [130] for indoor localization without using the pathloss model or comparing with the fingerprint database. The training set contains the RSS vectors appended with the central coordinates and indexes of the corresponding grid areas. The implementation procedure of the NN can be divided into three parts, namely transforming, denoising and localization. Particularly, this method pre-trains the transforming section and the denoising section by using auto-encoder blocks. Experiments show that the proposed method can achieve higher localization accuracy compared with maximum likelihood estimation, the generalised regression neural network and fingerprinting methods.

2) Unsupervised Learning Based Approaches: To reduce computation complexity and save storage space for fingerprinting based localization systems, authors in [131] first divide the radio map into multiple sub radio maps. Then, Kernel Principal Component Analysis (KPCA) is used to extract features of each sub radio map to get a low dimensional version. Results show that the size of the radio map can be reduced by 72% with a 2 m localization error. In [132], PCA is also employed together with linear discriminant analysis to extract lower dimensional features from raw RSS measurements.

Considering the drawbacks of adopting RSS measurements as fingerprints, such as their high randomness and loose correlation with propagation distance, authors in [133] propose to utilize calibrated CSI phase information for indoor fingerprinting. Specifically, in the off-line phase, a deep autoencoder network is constructed for each position to reconstruct the collected calibrated CSI phase information, and the weights are recorded as the fingerprints. In the online phase, a new location is obtained by a probabilistic method that performs a weighted average of all the reference locations. Simulation results show that the proposal outperforms traditional CSI or RSS based methods in two representative scenarios. The whole process of the proposal is summarized in Fig. 17.

Moreover, ideas similar to [133] are adopted in [134] and [135]. In [134], CSI amplitude responses are taken as the input of the deep NN, which is trained by a greedy learning algorithm.
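The fingerprinting idea of [133], namely one autoencoder per reference position with reconstruction quality used to weight the reference locations online, can be sketched as follows. The synthetic "CSI phase" vectors, the network sizes and the softmax weighting are illustrative assumptions rather than the exact design of the cited work.

    import torch
    import torch.nn as nn

    # One autoencoder per reference location is trained to reconstruct the CSI
    # phase vectors collected there; its weights act as the fingerprint.
    torch.manual_seed(0)
    DIM, LOCS = 30, 4
    ref_xy = torch.rand(LOCS, 2) * 10.0                  # reference coordinates (m)

    fingerprints = []
    for loc in range(LOCS):
        phase = torch.randn(200, DIM) * 0.1 + loc        # toy calibrated CSI phase
        ae = nn.Sequential(nn.Linear(DIM, 16), nn.ReLU(), nn.Linear(16, DIM))
        opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
        for _ in range(200):                             # offline phase: per-location training
            loss = nn.functional.mse_loss(ae(phase), phase)
            opt.zero_grad(); loss.backward(); opt.step()
        fingerprints.append(ae)

    with torch.no_grad():                                # online phase: weighted average
        test = torch.randn(DIM) * 0.1 + 2                # measurement taken near location 2
        errors = torch.stack([nn.functional.mse_loss(ae(test), test) for ae in fingerprints])
        weights = torch.softmax(-errors, dim=0)          # low error -> high weight
        print("estimated position:", (weights[:, None] * ref_xy).sum(dim=0))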
Fig. 17. Workflow of the CSI phase fingerprinting proposal in [133]: CSI collection, phase extraction, phase data at each reference location, deep learning of per-location weights, fingerprint database construction, and a position algorithm applied to the phase data at the test location.

Following [126], the RVM classifier can be preferred over the SVM classifier, since the former has lower computation complexity owing to a smaller number of relevance vectors. Moreover, according to [131], the size of the radio map can be effectively reduced by KPCA, saving the storage capacity of user terminals. At last, as revealed in [17], [133]–[135], the autoencoder is able to extract useful and robust information from RSS data or CSI data, which contributes to higher localization accuracy.

VII. CONDITIONS FOR THE APPLICATION OF MACHINE LEARNING

In this section, several conditions for the application of ML are elaborated, in terms of the type of the problem to be solved, training data, time cost, implementation complexity and the differences between machine learning techniques in the same category. These conditions should be checked one by one before making the final decision about whether to adopt ML techniques and which kind of ML techniques to use.
TABLE XIV
MACHINE LEARNING BASED LOCALIZATION
… algorithm. At this time, the training data is generated by running these algorithms under different network scenarios multiple times. In [63], [64], since spectrum allocation among BSs is modeled as a non-cooperative game, the data used to train the RNN model at each BS is generated by the continuous interactions of BSs. For cache management, authors in [81] utilize a mixture of the data sets YUPENN [137] and UCF101 [137] as their own data set, while authors in [78] collect their data set by using the YouTube Application Programming Interface, which consists of 12500 YouTube videos. In [79], [80], the content request data that the RNN uses to train and predict the content request distribution is obtained from the Youku network video index of China.

In [86], which studies KNN based beamformer design, a feature vector in the training data set under a specific resource allocation scenario is composed of time-variant parameters, such as the number of users and CSI, and these parameters can be collected and stored by the cloud. In [109], DNN based routing is investigated, and the training data set is obtained by recording the traffic patterns and paths while running traditional routing strategies such as OSPF. In [110], the training data set is generated in an online fashion and is collected by each router in the network. In the works [17], [122]–[125], [128]–[130], [133]–[135] focusing on machine learning based localization, the training data is based on CSI data, RSSI data or some channel parameters. The data comes from practical measurements in a real scenario using a certain device such as a cell phone. To obtain these wireless data, different methods can be adopted. For example, authors in [133] use one mobile device equipped with an IWL 5300 NIC that can read CSI data from the slightly modified device driver, while authors in [122] develop a client program for the mobile device to facilitate RSS measurement.

As for the other surveyed works, they mainly adopt reinforcement learning to solve resource management, networking and mobility management problems. In reinforcement learning, the reward/cost, which is fed back by the environment after the learning agent takes an action, can be seen as training data.
The environment is often a virtual environment created using software such as MATLAB, and the reward function is defined to reflect the objective of the studied problem. For example, authors in [61] aim at optimizing system spectral efficiency, which is hence taken as the reward fed back to each D2D pair. In [97], the reward for the DRL agent is defined as the difference between the maximum possible total power consumption and the actual total power consumption, which helps minimize the system power cost. In summary, the second condition for the application of ML is that essential training data can be acquired.

C. Time Cost

Another important aspect, which can prevent the application of machine learning, is the time cost. Here, it is necessary to distinguish between two different time metrics, namely training time and response time, as per [26]. The former represents the amount of time that ensures a machine learning algorithm is fully trained. This is important for supervised and unsupervised learning to make accurate predictions for future inputs, and also important for reinforcement learning to learn a good strategy or policy. As for response time, for a trained supervised or unsupervised learning algorithm, it refers to the time needed to output a prediction given an input, while it refers to the time needed to output an action for a trained reinforcement learning model.

In some applications, there can be a stringent requirement on response time. For example, resource management decisions should be made on a timescale of milliseconds. To figure out the feasibility of machine learning in these applications, we first make a discussion from the perspective of response time. In our surveyed papers, machine learning techniques applied to resource management can be coarsely divided into neural network based approaches and other approaches.

1) Neural network based approaches: For the response time of neural networks after being trained, it has been reported that Graphics Processing Unit (GPU)-based parallel computing can enable them to make predictions within milliseconds [139]. Note that even without GPUs, a trained DNN can make a power control decision for a network with 30 users in 0.0149 ms on average (see Table I in [15]). Hence, it is possible for the proposed neural network based power allocation approaches in [15], [55], [56] to make resource management decisions in time. Similarly, since deep reinforcement learning selects the resource allocation action based on the Q-values output by the deep Q network, which is actually a neural network, a trained deep reinforcement learning model can also be suitable for resource management in wireless networks.

2) Other approaches: First, traditional reinforcement learning, which includes Q learning and joint utility and strategy estimation based learning, aims at deriving a strategy or policy for the learning agent under a dynamic environment. Once reinforcement learning algorithms converge after being trained for sufficient time, the strategy or policy becomes fixed. In Q learning, the policy is represented by a set of Q values, each of which is associated with a pair of a system state and an action. The learning agent chooses the resource allocation action with the maximal Q value under the current system state. As for the joint utility and strategy estimation based learning, the agent's strategy is composed of probabilities of selecting each action. The agent only needs to generate a random number between 0 and 1 to identify which action to play. Therefore, a well-trained agent using these two learning techniques can make quick decisions within milliseconds. Second, the KNN based resource management adopted in [86] relies on the calculation of similarity between the feature vector of the new network scenario and that of each past network scenario. However, owing to cloud computing, the similarities can be calculated in parallel. Hence, the most similar K past scenarios can be identified within a very short time, and the resource management decision can be made very quickly by directly taking the resource configuration adopted by the majority of the K past scenarios.

Next, the discussion is made from the perspective of training time, and note that the KNN based approach does not have a training procedure. Specifically, for a deep NN model, training may take a long time. Therefore, it is possible that the patterns of the communication environment have changed before the model learns a mapping rule. Moreover, training a reinforcement learning model in a complex environment can also cost much time. Hence, it is possible that the elements of the communication environment, such as the environmental dynamics and the set of agents, have become different before training is completed. Such a mismatch between the training time and the timescale on which the characteristics of communication environments change can harm the actual performance of a trained model. Nevertheless, training time can be reduced for neural network based approaches with the help of GPUs and transfer learning, while the training of traditional reinforcement learning can be accelerated using transfer learning as well, as shown in [18]. In summary, the third condition for the application of machine learning is that the time cost, including response time and training time, should meet the requirements of applications and match the timescale on which the communication environment varies.

D. Implementation Complexity

In this subsection, the implementation complexity of machine learning algorithms is discussed by taking the algorithms for mobility management as examples. Related surveyed works can be categorized into directly optimizing handover parameters using fuzzy Q learning [114]–[117], handing over users to proper BSs using deep reinforcement learning [16], [118], helping improve handover performance by predicting the user's next location using a probabilistic model [119], analyzing RSRP patterns using unsupervised shapelets [120] and identifying the area where handover is prohibited using the self-organizing map [121].

First, to implement fuzzy Q learning, a table needs to be maintained to store q values, each of which corresponds to a pair of a rule and one of its actions. In addition, only simple mathematical operations are involved in the learning process, and the inputs are common network KPIs, such as the call dropping ratio or the value of the parameter to be adjusted, which can all be easily acquired by the current network. Hence, it can be claimed that fuzzy Q learning possesses low implementation complexity.
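For concreteness, the sketch below shows the kind of bookkeeping such a fuzzy Q learning controller needs: one q value per (rule, action) pair, triangular memberships over a KPI input, a defuzzified parameter adjustment, and updates scaled by each rule's firing strength. The membership functions, KPI model and reward are toy assumptions, not those of [114]–[117].

    import random

    ACTIONS = [-0.5, 0.0, 0.5]            # candidate hysteresis adjustments (dB)
    CENTRES = [0.0, 0.05, 0.10]           # rule centres over the CDR input
    q = [[0.0 for _ in ACTIONS] for _ in CENTRES]
    ALPHA, GAMMA, EPS = 0.2, 0.9, 0.1

    def memberships(cdr):
        # Triangular memberships of the CDR reading in each rule, normalised.
        mu = [max(0.0, 1.0 - abs(cdr - c) / 0.05) for c in CENTRES]
        s = sum(mu) or 1.0
        return [m / s for m in mu]

    hys, cdr = 2.0, 0.06
    for t in range(3000):
        mu = memberships(cdr)
        # epsilon-greedy choice of an action index per rule
        idx = [random.randrange(len(ACTIONS)) if random.random() < EPS
               else max(range(len(ACTIONS)), key=lambda a: q[r][a])
               for r in range(len(CENTRES))]
        delta = sum(m * ACTIONS[i] for m, i in zip(mu, idx))   # defuzzified action
        hys = min(max(hys + delta, 0.0), 5.0)
        cdr = max(0.0, 0.05 + 0.02 * (3.0 - hys) + random.gauss(0, 0.005))  # toy KPI
        reward = -cdr                                          # fewer drops is better
        mu_next = memberships(cdr)
        v_next = sum(m * max(q[r]) for r, m in enumerate(mu_next))
        for r, m in enumerate(mu):
            td = reward + GAMMA * v_next - q[r][idx[r]]
            q[r][idx[r]] += ALPHA * m * td
    print("learned hysteresis:", round(hys, 2))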
Second, deep reinforcement learning is based on neural networks, and it can be conveniently implemented by utilizing rich frameworks for deep learning, such as TensorFlow and Keras. Its inputs are the Received Signal Strength Indicator (RSSI) in [16] and the Reference Signal Received Quality (RSRQ) in [118], which can both be collected by the current system. However, to accelerate the training process, it is preferred to run deep reinforcement learning programs on GPUs, and this may incur a high implementation cost. Third, the approach in [119] is based on a simple probabilistic model, while the approaches in [120], [121] only utilize the information reported by user terminals in current networks. Hence, the proposals in [119]–[121] are not that complex for practical implementation.

Although only the implementation complexity of machine learning methods for mobility management is discussed, other surveyed methods can be analyzed in a similar way. Specifically, the implementation complexity should consider the complexity of data storage, the mathematical operations involved in the algorithms, the complexity of collecting the necessary information and the requirements on software and hardware. In summary, the fourth condition for the application of machine learning is that the implementation complexity should be acceptable.

E. Comparison Between Machine Learning Techniques

Problems in surveyed works can be generally categorized into regression, classification, clustering and decision making. However, for each kind of problem, different machine learning techniques may be available. In this section, a comparison between machine learning methods that can handle problems of the same type is conducted, which reveals the reasons for surveyed works to adopt a certain machine learning technique instead of others. Most importantly, readers can be guided to select the suitable machine learning technique.

1) Machine Learning Techniques for Regression and Classification: Machine learning models applied to regression and classification tasks in surveyed works mainly include SVM, KNN and neural networks. KNN is a basic classification algorithm that is known to be very simple to implement. Generally, KNN is used as a multi-class classifier, whereas standard SVM has been regarded as one of the most robust and successful algorithms for designing low-complexity binary classifiers [86]. When data is not linearly separable, KNN can be a good choice compared to SVM. This is because the regularization term and the kernel parameters should be selected for SVM, while one needs to choose only the k parameter and the distance metric for KNN. Compared with SVM and KNN, deep neural networks are powerful in feature extraction, and their performance improves more significantly with the increase of the training data size. However, due to the optimization of a large number of parameters, their training can be time consuming. Hence, when enough training data and GPU resources are available, deep neural networks are preferred. In addition, common neural networks further include DNN, CNN, RNN and the extreme learning machine. Compared with DNN, the number of weights can be reduced by CNN, which makes the training and inference procedures faster with lower overheads. In addition, CNN is good at learning spatial features, such as the features of a channel gain matrix. RNN is suitable for processing time series to learn features in the time domain, while the advantage of the extreme learning machine lies in good generalization performance at an extremely fast learning speed without iteratively tuning the hidden layer parameters.

2) Machine Learning Techniques for Decision Making: Machine learning algorithms applied to decision making in dynamic environments in surveyed works mainly include actor-critic learning, Q learning, joint utility and strategy estimation based learning (JUSEL) and deep reinforcement learning. Compared with Q learning, actor-critic learning is able to learn an explicit stochastic policy that may be useful in non-Markov environments [140]. In addition, since the value function and the policy are updated separately, policy knowledge transfer is easier to achieve [141]. JUSEL is very suitable for multi-agent scenarios and is able to achieve stable systems in which the gain obtained is bounded when an agent unilaterally deviates from its mixed strategy. Compared with Q learning and actor-critic learning, one of the advantages of deep reinforcement learning lies in its ability to learn from high dimensional input states, owing to the deep Q network [142]–[145]. On the contrary, since both Q learning and actor-critic learning need to store an evaluation for each state-action pair, they are not suitable for communication systems with states of large dimension. Another advantage of deep reinforcement learning is its ability to infer a good action under an unfamiliar state [146]. Nevertheless, training deep reinforcement learning can incur a high computing burden. Finally, in addition to deep reinforcement learning, fuzzy Q learning, as a variant of Q learning, can also address the situation of continuous states but with lower computation cost. However, setting membership functions requires prior experience, and the number of rules in the rule base can increase exponentially when the state dimension is high.

In summary, the fifth condition for the application of machine learning is that the advantages of the adopted machine learning technique fit well into the studied problem, while its disadvantages are tolerable.
VIII. ALTERNATIVES TO MACHINE LEARNING AND MOTIVATIONS

In this section, we review and elaborate traditional approaches that are taken as baselines in the surveyed works and are not based on machine learning. By comparing with these alternatives, the motivations to adopt machine learning are carefully summarized.

A. Alternatives for Power Allocation

Basic alternatives are some simple heuristic schemes, such as uniform power allocation among resource blocks [48], transmitting with full power [49] and smart power control [57]. However, considering the sophisticated radio environment faced by each BS, where the power allocation decisions are coupled among BSs, these heuristic schemes can have poor performance. Specifically, in [48], it is reported that the proposed Q learning based scheme achieves a 125% performance improvement compared to uniform power allocation, while the average femtocell capacity is enhanced by 50.79% using Q learning compared with smart power control in [57]. When power levels are discrete, numerical search can be adopted, including exhaustive search and the genetic algorithm, which is a heuristic search algorithm inspired by the theory of natural evolution [147]. In [52], it is shown that multi-agent Q learning can reach near optimal performance but with a huge reduction in control signalling compared to centralized exhaustive search. In [56], the trained deep learning model based on the autoencoder is capable of outputting the resource allocation solution obtained by the genetic algorithm 86.3% of the time, with less computation complexity. Another classical approach to power allocation is the WMMSE algorithm, utilized to generate the training data in [15], [55]. The WMMSE algorithm is originally designed to optimize beamformer vectors, and it transforms the weighted sum-rate maximization problem into a higher dimensional space to make the problem more tractable [148]. In [15], for a network with 20 users, the DNN based power control algorithm is demonstrated to achieve over 90% of the sum rate obtained by the WMMSE algorithm, while its CPU time only accounts for 4.8% of the latter's CPU time.

B. Alternatives for Spectrum Management

In [60], the proposed reinforcement learning based spectrum management scheme is compared with a centralized dynamic spectrum sharing (CDSS) scheme developed in [149]. The simulation result reveals that reinforcement learning can reach nearly the same average cell throughput as the CDSS approach without information sharing between BSs. In [61], the adopted distributed reinforcement learning is shown to achieve system spectral efficiency similar to that obtained by exhaustive search, which can place a huge computing burden on the cloud. Moreover, a simple approach to resource block allocation is the proportional fair algorithm [150], which is utilized as a baseline in [64], where the proposed RNN based resource allocation approach significantly outperforms the proportional fair algorithm in terms of user delay.

C. Alternatives for Backhaul Management

In [65], Q learning based cell range extension offset (CREO) adjustment greatly reduces the number of users in cells with congested backhaul compared to a static CREO setting. In [66], a branch-and-bound based centralized approach is developed as the benchmark, aiming at maximizing the total backhaul resource utilization. Compared to the benchmark, the proposed distributed Q learning can achieve a competitive performance in terms of the average throughput per user. Authors in [67] use a centralized greedy backhaul management strategy as a comparison. The idea is to identify some BSs to download a fixed number of predicted files based on a fairness rule. According to the simulation, reinforcement learning based backhaul management can reach much higher performance than the centralized greedy approach, in terms of remaining backhaul capacity, at a lower signalling cost. In [68], to verify the effectiveness of the reinforcement learning based method, a simple baseline is that the messages of MUEs are not transmitted via backhaul links between SBSs and the MBS, which leads to poor MUE rates.

D. Alternatives for Cache Management

In [74], random caching and caching based on time-averaged content popularity are taken as baselines. Reinforcement learning based cache management can achieve 13% and 56% higher per-BS utility than these two heuristic schemes for a dense network scenario. Also compared with the two schemes, the extreme learning machine based caching scheme in [78] decreases downloading delay significantly. Moreover, the transfer learning based caching in [84] outperforms random caching and achieves performance close to caching the most popular contents with popularity perfectly known, in terms of backhaul load and user satisfaction ratio. Another two well-known heuristic cache management strategies are the least recently used (LRU) strategy [151] and the least frequently used (LFU) strategy [152]. Under the LRU strategy, the cached content that has been requested least recently is replaced by the new content when the cache storage is full, while the cached content that has been requested the fewest times is replaced under the LFU strategy. In [75], Q learning is shown to greatly outperform the LRU and LFU strategies in reducing transmission cost, and the deep reinforcement learning approach in [77] achieves a higher cache hit rate compared with these two strategies.

E. Alternatives for Beamforming

In [86], authors use KNN to address a beam allocation problem. The alternatives include exhaustive search and a low complexity beam allocation (LBA) algorithm proposed in [153]. The latter is based on submodular optimization theory, which is a powerful tool for solving combinatorial optimization problems. Via simulation, it is observed that the KNN based allocation algorithm can approach the optimal average sum rate and outperforms the LBA algorithm as the training data size increases.

F. Alternatives for Computation Resource Management

In [85], three heuristic computation offloading mechanisms are taken as baselines, namely mobile execution, server execution and greedy execution. In the first and second schemes, the mobile user processes computation tasks locally and offloads computation tasks to the MEC server, respectively. For greedy execution, the mobile user makes the offloading decision to minimize the immediate task execution delay. Numerical results reveal that much better long-term utility performance can be achieved by the deep reinforcement learning based approach compared with these baselines. Meanwhile, the proposal does not need to know the information of network dynamics.

G. Alternatives for User Association
G. Alternatives for User Association

In [39], it is observed that multi-armed bandits learning based association can achieve similar performance to exhaustive search but with lower complexity and lower overhead caused by information acquisition. In [87], reinforcement learning based user association is compared with two common association schemes. One is max-SINR based association, which means each user chooses the BS with the maximal SINR to associate with, and the other one is based on optimization theory, adopting gradient descent and dual decomposition [154]. By simulation, it can be seen that reinforcement learning can make the user experience more uniform and meanwhile deliver higher rates for vehicles. The max-SINR based association is also used as a baseline in [90], where it leads to poor QoS for UEs.
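For reference, the max-SINR baseline amounts to the simple rule sketched below; the SINR values are hypothetical placeholders for measurements that would normally be reported by the UEs.

# Minimal sketch of max-SINR user association: each user simply attaches to
# the BS providing the highest measured SINR. The SINR values are made up.

sinr_db = {                       # sinr_db[user][bs], hypothetical measurements
    "ue1": {"bs1": 12.0, "bs2": 7.5, "bs3": -1.0},
    "ue2": {"bs1": 3.0,  "bs2": 9.2, "bs3": 8.8},
    "ue3": {"bs1": -2.5, "bs2": 0.5, "bs3": 4.1},
}

association = {ue: max(per_bs, key=per_bs.get) for ue, per_bs in sinr_db.items()}
print(association)   # e.g. {'ue1': 'bs1', 'ue2': 'bs2', 'ue3': 'bs3'}

Because the rule looks only at signal strength, strong but heavily loaded BSs can attract too many users, which is the kind of behavior the learning based schemes above are designed to avoid.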
H. Alternatives for BS Switching Control

Assuming full knowledge of traffic loads, authors in [155] greedily turn off as many BSs as possible to obtain the optimal BS switching solution, which is taken by [18] as a comparison scheme to verify the effectiveness of transfer learning based BS switching. In [94], it is observed that actor-critic based BS switching consumes only a little more energy than an exhaustive search based scheme. However, the learning based approach does not need the knowledge of traffic loads in advance. In [97], two alternative schemes to control BS on-off states are considered, namely single-BS association (SA) and full coordinated association (FA). In the SA scheme, each user associates with a BS randomly and BSs not serving any users are turned off, while all the BSs are active in the FA scheme. Compared to these heuristic approaches, the deep reinforcement learning based method can achieve lower energy consumption while meeting users' demands.
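None of the surveyed works publish their exact training loops; purely as an illustration of the tabular Q learning machinery that several of them build on, the sketch below learns when to switch a single small cell on or off under a toy traffic and reward model. The state space, dynamics, reward weights, and hyperparameters are all invented for the example.

# Toy tabular Q-learning loop for switching a single small cell on/off.
# States: discretized traffic load (0 = low, 1 = medium, 2 = high).
# Actions: 0 = sleep, 1 = active. Reward trades served traffic against energy.

import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(3)]          # Q[state][action]

def next_traffic(_state):
    return random.choices([0, 1, 2], weights=[0.4, 0.4, 0.2])[0]

def reward(state, action):
    served = state if action == 1 else 0     # traffic served only when active
    energy = 1.0 if action == 1 else 0.1     # an active BS burns more energy
    return 2.0 * served - energy

state = 0
for _ in range(20000):
    action = random.randrange(2) if random.random() < EPSILON \
        else max((0, 1), key=lambda a: Q[state][a])
    r = reward(state, action)
    nxt = next_traffic(state)
    # Standard Q-learning update toward the bootstrapped target.
    Q[state][action] += ALPHA * (r + GAMMA * max(Q[nxt]) - Q[state][action])
    state = nxt

print("learned policy per traffic level:",
      ["sleep" if Q[s][0] >= Q[s][1] else "active" for s in range(3)])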
I. Alternatives for Network Routing

In [105], authors utilize the spectrum-aware ad hoc on-demand distance vector routing (SA-AODV) approach as a baseline to demonstrate the superiority of their reinforcement learning based routing. Specifically, their proposal leads to a lower route discovery frequency. In [107], two intuitive routing strategies are presented for performance comparison, namely shortest path (SP) routing and optimal primary user (PU) aware shortest path (PASP) routing. The first scheme aims at minimizing the number of hops in the route, while the second scheme intends to minimize the accumulated amount of PU activity. It has been shown that the proposed learning based routing achieves lower end-to-end delay than the two schemes. According to Fig. 4 in [109], the deep learning based routing algorithm reduces average per-hop delay by around 93% compared with open shortest path first (OSPF) routing. In [110], OSPF is also taken as a baseline that always leads to high delay and packet loss rate. Instead, the proposed deep convolutional neural network based routing strategy can reduce both significantly.
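To make the SP baseline concrete, the following is a minimal Dijkstra-style sketch that minimizes hop count over a hypothetical topology; it is meant only to show the fixed rule that the learning based routing schemes are compared against.

# Minimal hop-count shortest path (SP) routing baseline over a toy topology.
# The graph below is hypothetical; every link has unit cost (one hop).

import heapq

graph = {                       # adjacency list of a small example network
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D", "E"],
    "D": ["B", "C", "F"],
    "E": ["C", "F"],
    "F": ["D", "E"],
}

def shortest_path(src, dst):
    # Dijkstra with unit edge weights, i.e. hop-count minimization.
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        for nb in graph[node]:
            if d + 1 < dist.get(nb, float("inf")):
                dist[nb], prev[nb] = d + 1, node
                heapq.heappush(heap, (d + 1, nb))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

print(shortest_path("A", "F"))   # e.g. ['A', 'B', 'D', 'F']

Such a rule recomputes the same route regardless of observed congestion or PU activity, which is precisely the limitation the learning based schemes in [107], [109], and [110] address.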
J. Alternatives for Clustering

In [113], deep reinforcement learning is used to select transmitter-receiver pairs to form a cluster, within which cooperative interference alignment is performed. By comparing with two existing user selection schemes proposed in [156] and [157] that are designed for static environments, deep reinforcement learning is shown to be capable of greatly improving the user sum rate in an environment with time-varying CSI.
K. Alternatives for Mobility Management

In [16], authors compare their deep reinforcement learning approach with a handover policy in [158] that compares the RSSI values from the current AP and other APs. The simulation result indicates that deep reinforcement learning can mitigate the ping-pong effect while maintaining a high data rate. In [115], a fuzzy Q learning algorithm is developed to adjust hysteresis and time-to-trigger values. To verify the effectiveness of the algorithm, two baseline schemes are considered, namely trend-based handover optimization proposed in [159] and a scheme setting time-to-trigger values based on velocity estimates. As for performance comparison, it is observed that fuzzy Q learning based hysteresis adjustment significantly outperforms the two baselines in terms of the number of early handovers. Another alternative for mobility management is using a fuzzy logic controller (FLC). In [114] and [116], numerical simulation has demonstrated the advantages of fuzzy Q learning over FLC, whose performance is limited by the available prior knowledge. Specifically, it is reported in [114] that fuzzy Q learning can still achieve competitive performance even without enough prior knowledge, while it is shown to reach better long-term performance in [116] compared with the FLC based method.

L. Alternatives for Localization

Surveyed works mainly utilize localization methods that are based on probability theory for performance comparison. These methods include FIFS [136] and Horus [160]. In [122], the simulation result shows that the mean error distance achieved by the proposed feature scaling based KNN localization algorithm is 1.82 times better than that achieved by Horus. In [134], the deep auto-encoder based approach improves the mean of the location errors by 20% and 31% compared to FIFS and Horus, respectively. The superiority of deep learning in enhancing localization performance has also been verified in [133], [135], [161]. In [127], authors propose to use machine learning to estimate the ranging error for UWB localization. They make comparisons with two schemes purely based on norm optimization without ranging error mitigation, which leads to poor localization performance. The other surveyed papers, like [128] and [130], mainly compare their proposals with approaches that are also based on machine learning.
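As background for the KNN based fingerprinting discussed above, the sketch below estimates a position by averaging the coordinates of the k nearest fingerprints in received-signal-strength space; the fingerprint database and the choice of k are hypothetical and do not reproduce the feature scaling algorithm of [122].

# Minimal k-nearest-neighbor fingerprint localization sketch.
# Each fingerprint maps an RSS vector (one entry per access point, in dBm)
# to a known (x, y) position. All values below are invented for illustration.

import math

fingerprints = [
    ([-40.0, -70.0, -65.0], (0.0, 0.0)),
    ([-55.0, -60.0, -70.0], (5.0, 0.0)),
    ([-70.0, -45.0, -60.0], (10.0, 5.0)),
    ([-65.0, -55.0, -45.0], (5.0, 10.0)),
]

def knn_locate(rss, k=3):
    nearest = sorted(
        (math.dist(rss, fp_rss), pos) for fp_rss, pos in fingerprints
    )[:k]
    # Unweighted average of the k closest reference positions.
    x = sum(p[0] for _, p in nearest) / k
    y = sum(p[1] for _, p in nearest) / k
    return x, y

print(knn_locate([-50.0, -62.0, -68.0]))

Deterministic rules of this kind work directly on raw signal strength, which motivates the probabilistic (Horus, FIFS) and deep learning alternatives compared in the surveyed localization works.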
M. Motivations to Apply Machine Learning

After summarizing traditional schemes, the motivations of the authors of the surveyed literature to adopt machine learning based approaches are clarified as follows.

• Developing Low-complexity Algorithms for Wireless Problems: This is a main reason for researchers to use deep neural networks to approximate high complexity resource allocation algorithms. Particularly, it has been shown in [15] that a well trained deep neural network can greatly reduce the time for power allocation with an acceptable performance loss compared to the WMMSE approach (an illustrative sketch is given after this list). In addition, this is also a reason for some researchers to use reinforcement learning. For example, authors in [96] use distributed Q learning that leads to a low-complexity sleep mode control algorithm for small cells. In summary, this motivation applies to [15], [48], [52], [55], [56], [58], [61], [66], [86], [88], [96], [111].

• Overcoming the Lack of Network Information/Knowledge: Although centralized optimization approaches can achieve superior performance, they often need to know global network information, which can be difficult to acquire. For example, the baseline scheme for BS switching in [18] requires full knowledge of traffic loads a priori, which is challenging to obtain precisely in advance. However, with transfer learning, the past experience in BS switching can be utilized to guide current switching control even without the knowledge of traffic loads. To adjust handover parameters, fuzzy logic controller based approaches can be used. The controller is based on a set of pre-defined rules, each of which specifies a deterministic action under a certain system state. However, the setting of the action is highly dependent on expert knowledge about the network, which can be unavailable for a new communication system or environment. In addition, knowing the content popularity of users is the key to properly managing cache resources, and this popularity can be accurately learned by RNN and extreme learning machine. Moreover, model free reinforcement learning can help network nodes make optimized decisions without knowing the information about network dynamics. Overall, this motivation is a basic reason for adopting machine learning that applies to all the surveyed literature.

• Facilitating Self-organization Capabilities: To reduce CAPEX and OPEX, and to simplify the coordination, optimization and configuration procedures of the network, self-organizing networks have been widely studied [162]. In particular, some researchers consider machine learning techniques as potential enablers to realize self-organization capabilities. By involving machine learning, especially reinforcement learning, each BS can self-optimize its resource allocation, handover parameter configuration, and so on. In summary, this motivation applies to [48], [51], [57], [60], [62]–[67], [70], [71], [74], [96], [98], [114]–[116].

• Reducing Signalling Overhead: When distributed reinforcement learning is used, each learning agent only needs to acquire partial network information to make a decision, which helps avoid large signalling overhead. On the contrary, traditional approaches may require many information exchanges and hence lead to huge signalling cost. For example, as pointed out in [105], ad hoc on-demand distance vector routing will cause the constant flooding of routing messages in a CRN. In [58], the centralized approach taken as the baseline allocates spectrum resources based on the complete information about SUs. This motivation has been highlighted in [39], [52], [58], [60], [66]–[68], [70], [75], [102], [105], [109], [110].

• Avoiding Past Faults: Heuristic and classical approaches that are based on fixed rules are incapable of avoiding unsatisfying results that have occurred previously, which means they are incapable of learning. Such approaches include the OSPF routing strategy taken as the baseline in [110], the handover strategy based on the comparison of RSSI values that is taken as the baseline by [16], the heuristic BS switching control strategies for comparison in [97], max-SINR based user association, and so on. In [110], authors present an intuitive example, where OSPF routing leads to congestion at a router under a certain situation. Then, when this situation recurs, the OSPF routing protocol will make the same routing decision that causes congestion again. However, with deep learning trained on historical network data, it can be predicted whether a routing strategy will lead to congestion under the current traffic pattern. The other approaches listed face the same kind of problem. This issue can be overcome by the reinforcement learning approaches in [16], [87], [97], which evaluate each action based on its past performance. Hence, actions with bad performance can be avoided in the future. For the surveyed literature, this motivation applies to [16], [48], [49], [57], [64], [65], [67], [68], [74], [75], [77], [85], [87], [97], [107], [110], [115].

• Learning Robust Patterns: With the help of neural networks, useful patterns related to networks and users can be extracted. These patterns are useful in resource management, localization, and so on. Specifically, authors in [55] use a CNN to learn the spatial features of the channel gain matrix to make wiser power control decisions than WMMSE. For fingerprint based localization, traditional approaches, such as Horus, rely directly on received signal strength data that can be easily affected by the complex indoor propagation environment. This fact has motivated researchers to improve localization accuracy by learning more robust fingerprint patterns using neural networks. For the surveyed literature, this motivation applies to [15]–[17], [55], [56], [63], [64], [76]–[81], [85], [91], [97], [109], [110], [113], [118], [128]–[130], [133]–[135].

• Achieving Better Performance than Traditional Optimization: Traditional optimization methods include submodular optimization theory, dual decomposition, and so on. In [86], authors have demonstrated that their designed KNN based beam allocation algorithm can outperform a beam allocation algorithm based on submodular optimization theory as the training data size increases. In [87], it is shown that reinforcement learning based user association achieves better performance than the approach based on dual decomposition. Hence, it can be inferred that machine learning has the potential to reach better system performance compared to traditional optimization approaches. In the surveyed literature, this motivation applies to [55], [86], [87].
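The sketch below illustrates the "approximate an optimizer with a neural network" workflow referred to in the first motivation. It does not reproduce WMMSE or the architectures of [15] or [55]; a trivial placeholder allocator stands in for the expensive optimizer, PyTorch is assumed to be available, and the network sizes and hyperparameters are illustrative.

# Sketch of learning to imitate a power allocation routine with a small MLP.
# The "expensive optimizer" below is a placeholder, not WMMSE.

import torch
import torch.nn as nn

N_LINKS = 4                      # transmitter-receiver pairs
P_MAX = 1.0                      # per-link power budget

def placeholder_allocator(gains):
    # Stand-in for a costly optimizer: give more power to links with stronger
    # direct channel gains (purely illustrative behavior).
    direct = torch.diagonal(gains, dim1=-2, dim2=-1)
    return P_MAX * direct / direct.max(dim=-1, keepdim=True).values

model = nn.Sequential(
    nn.Linear(N_LINKS * N_LINKS, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_LINKS), nn.Sigmoid(),   # outputs scaled to [0, P_MAX]
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    gains = torch.rand(128, N_LINKS, N_LINKS)        # random channel samples
    target = placeholder_allocator(gains)
    pred = P_MAX * model(gains.flatten(start_dim=1))
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, a single forward pass replaces the optimizer at run time.
test_gains = torch.rand(1, N_LINKS, N_LINKS)
print(P_MAX * model(test_gains.flatten(start_dim=1)))

The point of the sketch is only the workflow: once trained offline, the forward pass is far cheaper than rerunning an iterative optimizer for every new channel realization, which is the complexity argument made by the surveyed works.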
IX. CHALLENGES AND OPEN ISSUES

Although many studies have been conducted on the applications of ML in wireless communications, several challenges and open issues are identified in this section to facilitate further research in this area.

A. Machine Learning Based Heterogeneous Backhaul/Fronthaul Management

In future wireless networks, various backhaul/fronthaul solutions will coexist [163], including wired backhaul/fronthaul like fiber and cable as well as wireless backhaul/fronthaul like the sub-6 GHz band. Each solution has a different amount of energy consumption and a different bandwidth, and hence the management of backhaul/fronthaul is important to the whole system performance. In this case, ML based techniques can be utilized to select suitable backhaul/fronthaul solutions based on the extracted traffic patterns and performance requirements of users.

B. Infrastructure Update

To prepare for the deployment of ML based communication systems, current wireless network infrastructures should be evolved. For example, servers equipped with GPUs can be deployed at the network edge to implement deep learning based signal processing, resource management and localization. Storage devices are needed at the network edge as well to achieve in-time data analysis. Moreover, network function virtualization (NFV) should be introduced into the wireless network, which decouples network functions from hardware so that network functions can be implemented as software. On the basis of NFV, machine learning can be adopted to realize flexible network control and configuration.

C. Machine Learning Based Network Slicing

As a cost-efficient way to support diverse use cases, network slicing has been advocated by both academia and industry [164], [165]. The core of network slicing is to allocate appropriate resources, including computing, caching, backhaul/fronthaul and radio resources, on demand to guarantee the performance requirements of different slices under slice isolation constraints. Generally speaking, network slicing can benefit from ML in the following aspects. First, ML can be used to learn the mapping from service demands to resource allocation plans, and hence a new network slice can be quickly constructed. Second, by employing transfer learning, knowledge about resource allocation plans for different use cases in one environment can act as useful knowledge in another environment, which can speed up the learning process. Recently, authors in [166] and [167] have applied DRL to network slicing, and the advantages of DRL are demonstrated via simulations.

D. Standard Datasets and Environments for Research

To make researchers pay full attention to learning algorithm design and conduct fair comparisons between different ML based approaches, it is essential to identify some common problems in wireless networks. These problems should be provided together with corresponding labeled/unlabeled data for supervised/unsupervised learning based approaches, similar to the open dataset MNIST that is often used in computer vision. For reinforcement learning based approaches, standard network control problems together with well defined environments should be built, similar to the standard environment MountainCar-v0.
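For readers unfamiliar with what such a standard environment provides, the snippet below shows the kind of interaction loop MountainCar-v0 exposes through the classic OpenAI Gym API (assuming a pre-0.26 gym installation, where step returns a 4-tuple); a standardized wireless control environment would keep the same reset/step interface but replace states, actions, and rewards with network quantities.

# Interaction loop of a standard RL environment, here MountainCar-v0 via the
# classic OpenAI Gym API (gym < 0.26).

import gym

env = gym.make("MountainCar-v0")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # random policy, for illustration
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return with a random policy:", total_reward)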
E. Theoretical Guidance for Algorithm Implementation

It is known that the performance of ML algorithms is affected by the selection of hyperparameters such as the learning rate, the loss function, and so on. Trying different hyperparameters directly is a time-consuming task, especially when the training time of the model under a fixed set of hyperparameters is long. Moreover, the theoretical analysis of the dataset size needed for training, the performance bounds of deep learning architectures, and the generalization ability of different learning models are still open questions. Since stability is one of the main features of communication systems, rigorous theoretical studies are essential to ensure that ML based approaches always work well in practical systems.

F. Transfer Learning Based Approaches

Transfer learning promises to transfer the knowledge learned from one task to another similar task. By avoiding training learning models from scratch, the learning process in new environments can be sped up, and the ML algorithm can achieve good performance even with a small amount of training data. Therefore, transfer learning is critical for the practical implementation of learning models, considering the cost of training without prior knowledge. Using transfer learning, network operators can solve new but similar problems in a cost-efficient manner. However, the negative effects of prior knowledge on system performance should be addressed, as pointed out in [18], and need further investigation.
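As a minimal illustration of this idea in the tabular case, the sketch below warm-starts Q learning in a new "target" cell from a Q table learned in a similar "source" cell instead of starting from zeros; the environments, reward functions, and the suggestion that this shortens training are assumptions made only for the example.

# Toy illustration of transfer via warm-starting a Q table learned elsewhere.
# States, actions, and rewards here are invented placeholders.

import random

N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

def run_q_learning(reward_fn, q_init, episodes=2000):
    q = [row[:] for row in q_init]               # copy the initial table
    state = 0
    for _ in range(episodes):
        a = random.randrange(N_ACTIONS) if random.random() < EPS \
            else max(range(N_ACTIONS), key=lambda x: q[state][x])
        r = reward_fn(state, a)
        nxt = random.randrange(N_STATES)
        q[state][a] += ALPHA * (r + GAMMA * max(q[nxt]) - q[state][a])
        state = nxt
    return q

source_reward = lambda s, a: 1.0 if a == s % 2 else -1.0
target_reward = lambda s, a: 0.9 if a == s % 2 else -0.9   # similar task

zeros = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
q_source = run_q_learning(source_reward, zeros)

# Transfer step: initialize the target learner from the source knowledge and
# continue training with a much smaller budget.
q_target = run_q_learning(target_reward, q_source, episodes=500)
print("transferred policy:",
      [max(range(N_ACTIONS), key=lambda a: q_target[s][a]) for s in range(N_STATES)])

When the source and target tasks differ more than assumed here, the inherited table can also mislead the learner, which is the negative-transfer effect noted in [18].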
X. CONCLUSIONS

This paper has surveyed the state-of-the-art applications of ML in wireless communication and outlined several unresolved problems. Faced with the intricacies of these applications, we have broadly divided the body of knowledge into resource management in the MAC layer, networking and mobility management in the network layer, and localization in the application layer. Within each of these topics, we have surveyed the diverse ML based approaches that have been proposed to enable wireless networks to run intelligently. Nevertheless, considering that the applications of ML in wireless communications are still at an initial stage, there are quite a number of problems that need further investigation. For example, infrastructure updates are required for the implementation of ML based paradigms, and open data sets and environments are expected to facilitate future research on ML applications across a wide range of areas.
REFERENCES

matching game perspective,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3761–3774, Jun. 2017.
[50] A. Galindo-Serrano and L. Giupponi, “Distributed Q-learning for aggregated interference control in cognitive radio networks,” IEEE Trans. Veh. Technol., vol. 59, no. 4, pp. 1823–1834, May 2010.
[51] M. Bennis, S. M. Perlaza, P. Blasco, Z. Han, and H. V. Poor, “Self-organization in small cell networks: A reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 12, no. 7, pp. 3202–3212, Jul. 2013.
[52] A. Asheralieva and Y. Miyanaga, “An autonomous learning-based algorithm for joint channel and power level selection by D2D pairs in heterogeneous cellular networks,” IEEE Trans. Commun., vol. 64, no. 9, pp. 3996–4012, Sep. 2016.
[53] L. Xu and A. Nallanathan, “Energy-efficient chance-constrained resource allocation for multicast cognitive OFDM network,” IEEE J. Sel. Areas Commun., vol. 34, no. 5, pp. 1298–1306, May 2016.
[54] M. Lin, J. Ouyang, and W. P. Zhu, “Joint beamforming and power control for device-to-device communications underlaying cellular networks,” IEEE J. Sel. Areas Commun., vol. 34, no. 1, pp. 138–150, Jan. 2016.
[55] W. Lee, M. Kim, and D. Cho, “Deep power control: Transmit power control scheme based on convolutional neural network,” IEEE Commun. Lett., vol. 22, no. 6, pp. 1276–1279, Apr. 2018.
[56] K. I. Ahmed, H. Tabassum, and E. Hossain, “Deep learning for radio resource allocation in multi-cell networks,” IEEE Netw., Apr. 2019, doi: 10.1109/MNET.2019.1900029, submitted for publication.
[57] A. Galindo-Serrano, L. Giupponi, and G. Auer, “Distributed learning in multiuser OFDMA femtocell networks,” in Proceedings of VTC, Yokohama, Japan, May 2011, pp. 1–6.
[58] C. Fan, B. Li, C. Zhao, W. Guo, and Y. C. Liang, “Learning-based spectrum sharing and spatial reuse in mm-wave ultra dense networks,” IEEE Trans. Veh. Technol., vol. 67, no. 6, pp. 4954–4968, Jun. 2018.
[59] Y. Zhang, W. P. Tay, K. H. Li, M. Esseghir, and D. Gaiti, “Learning temporal-spatial spectrum reuse,” IEEE Trans. Commun., vol. 64, no. 7, pp. 3092–3103, Jul. 2016.
[60] G. Alnwaimi, S. Vahid, and K. Moessner, “Dynamic heterogeneous learning games for opportunistic access in LTE-based macro/femtocell deployments,” IEEE Trans. Wireless Commun., vol. 14, no. 4, pp. 2294–2308, Apr. 2015.
[61] Y. Sun, M. Peng, and H. V. Poor, “A distributed approach to improving spectral efficiency in uplink device-to-device enabled cloud radio access networks,” IEEE Trans. Commun., vol. 66, no. 12, pp. 6511–6526, Dec. 2018.
[62] M. Srinivasan, V. J. Kotagi, and C. S. R. Murthy, “A Q-learning framework for user QoE enhanced self-organizing spectrally efficient network using a novel inter-operator proximal spectrum sharing,” IEEE J. Sel. Areas Commun., vol. 34, no. 11, pp. 2887–2901, Nov. 2016.
[63] M. Chen, W. Saad, and C. Yin, “Echo state networks for self-organizing resource allocation in LTE-U with uplink-downlink decoupling,” IEEE Trans. Wireless Commun., vol. 16, no. 1, pp. 3–16, Jan. 2017.
[64] M. Chen, W. Saad, and C. Yin, “Virtual reality over wireless networks: Quality-of-service model and learning-based resource management,” IEEE Trans. Commun., vol. 66, no. 11, pp. 5621–5635, Nov. 2018.
[65] M. Jaber, M. Imran, R. Tafazolli, and A. Tukmanov, “An adaptive backhaul-aware cell range extension approach,” in Proceedings of ICCW, London, UK, Jun. 2015, pp. 74–79.
[66] Y. Xu, R. Yin, and G. Yu, “Adaptive biasing scheme for load balancing in backhaul constrained small cell networks,” IET Commun., vol. 9, no. 7, pp. 999–1005, Apr. 2015.
[67] K. Hamidouche, W. Saad, M. Debbah, J. B. Song, and C. S. Hong, “The 5G cellular backhaul management dilemma: To cache or to serve,” IEEE Trans. Wireless Commun., vol. 16, no. 8, pp. 4866–4879, Aug. 2017.
[68] S. Samarakoon, M. Bennis, W. Saad, and M. Latva-aho, “Backhaul-aware interference management in the uplink of wireless small cell networks,” IEEE Trans. Wireless Commun., vol. 12, no. 11, pp. 5813–5825, Nov. 2013.
[69] J. Lun and D. Grace, “Cognitive green backhaul deployments for future 5G networks,” in Proc. Int. Workshop Cognitive Cellular Systems (CCS), Germany, Sep. 2014, pp. 1–5.
[70] P. Blasco, M. Bennis, and M. Dohler, “Backhaul-aware self-organizing operator-shared small cell networks,” in Proceedings of ICC, Budapest, Hungary, Jun. 2013, pp. 2801–2806.
[71] M. Jaber, M. A. Imran, R. Tafazolli, and A. Tukmanov, “A multiple attribute user-centric backhaul provisioning scheme using distributed SON,” in Proceedings of GLOBECOM, Washington DC, USA, Dec. 2016, pp. 1–6.
[72] Cisco Visual Networking Index: “Global mobile data traffic forecast update 2015–2020,” 2016.
[73] Z. Chang, L. Lei, Z. Zhou, S. Mao, and T. Ristaniemi, “Learn to cache: Machine learning for network edge caching in the big data era,” IEEE Wireless Commun., vol. 25, no. 3, pp. 28–35, Jun. 2018.
[74] S. Hassan, S. Samarakoon, M. Bennis, M. Latva-aho, and C. S. Hong, “Learning-based caching in cloud-aided wireless networks,” IEEE Commun. Lett., vol. 22, no. 1, pp. 137–140, Jan. 2018.
[75] W. Wang et al., “Edge caching at base stations with device-to-device offloading,” IEEE Access, vol. 5, pp. 6399–6410, Mar. 2017.
[76] Y. He, N. Zhao, and H. Yin, “Integrated networking, caching and computing for connected vehicles: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., vol. 67, no. 1, pp. 44–55, Jan. 2018.
[77] C. Zhong, M. Gursoy, and S. Velipasalar, “A deep reinforcement learning-based framework for content caching,” in Proceedings of CISS, Princeton, NJ, USA, Mar. 2018, pp. 1–6.
[78] S. M. S. Tanzil, W. Hoiles, and V. Krishnamurthy, “Adaptive scheme for caching YouTube content in a cellular network: Machine learning approach,” IEEE Access, vol. 5, pp. 5870–5881, Mar. 2017.
[79] M. Chen et al., “Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience,” IEEE J. Sel. Areas Commun., vol. 35, no. 5, pp. 1046–1061, May 2017.
[80] M. Chen, W. Saad, C. Yin, and M. Debbah, “Echo state networks for proactive caching in cloud-based radio access networks with mobile users,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3520–3535, Jun. 2017.
[81] K. N. Doan, T. V. Nguyen, T. Q. S. Quek, and H. Shin, “Content-aware proactive caching for backhaul offloading in cellular network,” IEEE Trans. Wireless Commun., vol. 17, no. 5, pp. 3128–3140, May 2018.
[82] B. N. Bharath, K. G. Nagananda, and H. V. Poor, “A learning-based approach to caching in heterogenous small cell networks,” IEEE Trans. Commun., vol. 64, no. 4, pp. 1674–1686, Apr. 2016.
[83] E. Bastug, M. Bennis, and M. Debbah, “Anticipatory caching in small cell networks: A transfer learning approach,” in Proceedings of Workshop Anticipatory Netw., Germany, Sep. 2014.
[84] E. Bastug, M. Bennis, and M. Debbah, “A transfer learning approach for cache-enabled wireless networks,” in Proceedings of WiOpt, Mumbai, India, May 2015, pp. 161–166.
[85] X. Chen et al., “Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning,” arXiv:1805.06146v1, May 2018, accessed on Jun. 15, 2018.
[86] J. Wang et al., “A machine learning framework for resource allocation assisted by cloud computing,” IEEE Netw., vol. 32, no. 2, pp. 144–151, Apr. 2018.
[87] Z. Li, C. Wang, and C. J. Jiang, “User association for load balancing in vehicular networks: An online reinforcement learning approach,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 8, pp. 2217–2228, Aug. 2017.
[88] F. Pervez, M. Jaber, J. Qadir, S. Younis, and M. A. Imran, “Fuzzy Q-learning-based user-centric backhaul-aware user cell association scheme,” in Proceedings of IWCMC, Valencia, Spain, Jun. 2017, pp. 1840–1845.
[89] T. Kudo and T. Ohtsuki, “Cell range expansion using distributed Q-learning in heterogeneous networks,” in Proceedings of VTC, Las Vegas, USA, Sep. 2013, pp. 1–5.
[90] Y. Meng, C. Jiang, L. Xu, Y. Ren, and Z. Han, “User association in heterogeneous networks: A social interaction approach,” IEEE Trans. Veh. Technol., vol. 65, no. 12, pp. 9982–9993, Dec. 2016.
[91] U. Challita, W. Saad, and C. Bettstetter, “Cellular-connected UAVs over 5G: Deep reinforcement learning for interference management,” arXiv:1801.05500v1, Jan. 2018, accessed on Jun. 15, 2018.
[92] H. Yu et al., “Mobile data offloading for green wireless networks,” IEEE Wireless Commun., vol. 24, no. 4, pp. 31–37, Aug. 2017.
[93] I. Ashraf, F. Boccardi, and L. Ho, “SLEEP mode techniques for small cell deployments,” IEEE Commun. Mag., vol. 49, no. 8, pp. 72–79, Aug. 2011.
[94] R. Li, Z. Zhao, X. Chen, and H. Zhang, “Energy saving through a learning framework in greener cellular radio access networks,” in Proceedings of GLOBECOM, Anaheim, USA, Dec. 2012, pp. 1556–1561.
[95] X. Gan et al., “Energy efficient switch policy for small cells,” China Commun., vol. 12, no. 1, pp. 78–88, Jan. 2015.
[96] G. Yu, Q. Chen, and R. Yin, “Dual-threshold sleep mode control scheme for small cells,” IET Commun., vol. 8, no. 11, pp. 2008–2016, Jul. 2014.
[97] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” in Proceedings of ICC, Paris, France, May 2017, pp. 1–6.
[98] S. Fan, H. Tian, and C. Sengul, “Self-optimized heterogeneous networks for energy efficiency,” Eurasip J. Wireless Commun. Netw., vol. 2015, no. 21, pp. 1–11, Feb. 2015.
[99] P. Y. Kong and D. Panaitopol, “Reinforcement learning approach to dynamic activation of base station resources in wireless networks,” in Proceedings of PIMRC, London, UK, Sep. 2013, pp. 3264–3268.
[100] S. Sharma, S. J. Darak, and A. Srivastava, “Energy saving in heterogeneous cellular network via transfer reinforcement learning based policy,” in Proceedings of COMSNETS, Bengaluru, India, Jan. 2017, pp. 397–398.
[101] S. Sharma, S. Darak, A. Srivastava, and H. Zhang, “A transfer learning framework for energy efficient Wi-Fi networks and performance analysis using real data,” in Proceedings of ANTS, Bangalore, India, Nov. 2016, pp. 1–6.
[102] S. Samarakoon, M. Bennis, W. Saad, and M. Latva-aho, “Dynamic clustering and on/off strategies for wireless small cell networks,” IEEE Trans. Wireless Commun., vol. 15, no. 3, pp. 2164–2178, Mar. 2016.
[103] Q. Zhao, D. Grace, A. Vilhar, and T. Javornik, “Using k-means clustering with transfer and Q-learning for spectrum, load and energy optimization in opportunistic mobile broadband networks,” in Proceedings of ISWCS, Brussels, Belgium, Aug. 2015, pp. 116–120.
[104] M. Miozzo, L. Giupponi, M. Rossi, and P. Dini, “Distributed Q-learning for energy harvesting heterogeneous networks,” in 2015 IEEE Intl. Conf. on Commun. Workshop (ICCW), London, 2015, pp. 2006–2011.
[105] Y. Saleem et al., “Clustering and reinforcement-learning-based routing for cognitive radio networks,” IEEE Wireless Commun., vol. 24, no. 4, pp. 146–151, Aug. 2017.
[106] A. Syed et al., “Route selection for multi-hop cognitive radio networks using reinforcement learning: An experimental study,” IEEE Access, vol. 4, pp. 6304–6324, Sep. 2016.
[107] H. A. A. Al-Rawi, K. L. A. Yau, H. Mohamad, N. Ramli, and W. Hashim, “Effects of network characteristics on learning mechanism for routing in cognitive radio ad hoc networks,” in Proceedings of CSNDSP, Manchester, USA, Jul. 2014, pp. 748–753.
[108] W. Jung, J. Yim, and Y. Ko, “QGeo: Q-learning-based geographic ad hoc routing protocol for unmanned robotic networks,” IEEE Commun. Lett., vol. 21, no. 10, pp. 2258–2261, Oct. 2017.
[109] N. Kato et al., “The deep learning vision for heterogeneous network traffic control: Proposal, challenges, and future perspective,” IEEE Wireless Commun., vol. 24, no. 3, pp. 146–153, Jun. 2017.
[110] F. Tang et al., “On removing routing protocol from future wireless networks: A real-time deep learning approach for intelligent traffic control,” IEEE Wireless Commun., vol. 25, no. 1, pp. 154–160, Feb. 2018.
[111] L. Lei et al., “A deep learning approach for optimizing content delivering in cache-enabled HetNet,” in Proceedings of ISWCS, Bologna, Italy, Aug. 2017, pp. 449–453.
[112] H. Tabrizi, G. Farhadi, and J. M. Cioffi, “CaSRA: An algorithm for cognitive tethering in dense wireless areas,” in Proceedings of GLOBECOM, Atlanta, USA, Dec. 2013, pp. 3855–3860.
[113] Y. He et al., “Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks,” IEEE Trans. Veh. Technol., vol. 66, no. 11, pp. 10433–10445, Nov. 2017.
[114] J. Wu, J. Liu, Z. Huang, and S. Zheng, “Dynamic fuzzy Q-learning for handover parameters optimization in 5G multi-tier networks,” in Proceedings of WCSP, Nanjing, China, Oct. 2015, pp. 1–5.
[115] A. Klein, N. P. Kuruvatti, J. Schneider, and H. D. Schotten, “Fuzzy Q-learning for mobility robustness optimization in wireless networks,” in Proceedings of GC Wkshps, Atlanta, USA, Dec. 2013, pp. 76–81.
[116] P. Muñoz, R. Barco, J. M. Ruiz-Avilés, I. de la Bandera, and A. Aguilar, “Fuzzy rule-based reinforcement learning for load balancing techniques in enterprise LTE femtocells,” IEEE Trans. Veh. Technol., vol. 62, no. 5, pp. 1962–1973, Jun. 2013.
[117] K. T. Dinh and S. Kukliński, “Joint implementation of several LTE-SON functions,” in Proceedings of GC Wkshps, Atlanta, USA, Dec. 2013, pp. 953–957.
[118] Z. Wang, Y. Xu, L. Li, H. Tian, and S. Cui, “Handover control in wireless systems via asynchronous multi-user deep reinforcement learning,” IEEE Internet Things J., vol. 5, no. 6, pp. 4296–4307, Dec. 2018.
[119] C. Yu et al., “Modeling user activity patterns for next-place prediction,” IEEE Syst. J., vol. 11, no. 2, pp. 1060–1071, Jun. 2017.
[120] D. Castro-Hernandez and R. Paranjape, “Classification of user trajectories in LTE HetNets using unsupervised shapelets and multiresolution wavelet decomposition,” IEEE Trans. Veh. Technol., vol. 66, no. 9, pp. 7934–7946, Sep. 2017.
[121] N. Sinclair, D. Harle, I. A. Glover, J. Irvine, and R. C. Atkinson, “An advanced SOM algorithm applied to handover management within LTE,” IEEE Trans. Veh. Technol., vol. 62, no. 5, pp. 1883–1894, Jun. 2013.
[122] D. Li, B. Zhang, Z. Yao, and C. Li, “A feature scaling based k-nearest neighbor algorithm for indoor positioning system,” in Proceedings of GLOBECOM, Austin, USA, Dec. 2014, pp. 436–441.
[123] J. Niu, B. Wang, L. Shu, T. Q. Duong, and Y. Chen, “ZIL: An energy-efficient indoor localization system using ZigBee radio to detect WiFi fingerprints,” IEEE J. Sel. Areas Commun., vol. 33, no. 7, pp. 1431–1442, Jul. 2015.
[124] Y. Xu, M. Zhou, W. Meng, and L. Ma, “Optimal KNN positioning algorithm via theoretical accuracy criterion in WLAN indoor environment,” in Proceedings of GLOBECOM, Miami, USA, Dec. 2010, pp. 1–5.
[125] Z. Wu et al., “A fast and resource efficient method for indoor positioning using received signal strength,” IEEE Trans. Veh. Technol., vol. 65, no. 12, pp. 9747–9758, Dec. 2016.
[126] T. Van Nguyen, Y. Jeong, H. Shin, and M. Z. Win, “Machine learning for wideband localization,” IEEE J. Sel. Areas Commun., vol. 33, no. 7, pp. 1357–1380, Jul. 2015.
[127] H. Wymeersch, S. Marano, W. M. Gifford, and M. Z. Win, “A machine learning approach to ranging error mitigation for UWB localization,” IEEE Trans. Commun., vol. 60, no. 6, pp. 1719–1728, Jun. 2012.
[128] Z. Qiu, H. Zou, H. Jiang, L. Xie, and Y. Hong, “Consensus-based parallel extreme learning machine for indoor localization,” in Proceedings of GLOBECOM, Washington DC, USA, Dec. 2016, pp. 1–6.
[129] X. Ye, X. Yin, X. Cai, A. Pérez Yuste, and H. Xu, “Neural-network-assisted UE localization using radio-channel fingerprints in LTE networks,” IEEE Access, vol. 5, pp. 12071–12087, Jun. 2017.
[130] H. Dai, W. H. Ying, and J. Xu, “Multi-layer neural network for received signal strength-based indoor localisation,” IET Commun., vol. 10, no. 6, pp. 717–723, Apr. 2016.
[131] Y. Mo, Z. Zhang, W. Meng, and G. Agha, “Space division and dimensional reduction methods for indoor positioning system,” in Proceedings of ICC, London, UK, Jun. 2015, pp. 3263–3268.
[132] J. Yoo, K. H. Johansson, and H. Jin Kim, “Indoor localization without a prior map by trajectory learning from crowdsourced measurements,” IEEE Trans. Instrum. Meas., vol. 66, no. 11, pp. 2825–2835, Nov. 2017.
[133] X. Wang, L. Gao, and S. Mao, “CSI phase fingerprinting for indoor localization with a deep learning approach,” IEEE Internet Things J., vol. 3, no. 6, pp. 1113–1123, Dec. 2016.
[134] X. Wang, L. Gao, S. Mao, and S. Pandey, “CSI-based fingerprinting for indoor localization: A deep learning approach,” IEEE Trans. Veh. Technol., vol. 66, no. 1, pp. 763–776, Jan. 2017.
[135] X. Wang, L. Gao, and S. Mao, “BiLoc: Bi-modal deep learning for indoor localization with commodity 5GHz WiFi,” IEEE Access, vol. 5, pp. 4209–4220, Mar. 2017.
[136] J. Xiao, K. Wu, Y. Yi, and L. M. Ni, “FIFS: Fine-grained indoor fingerprinting system,” in Proceedings of IEEE ICCCN, Munich, Germany, Jul. 2012, pp. 1–7.
[137] K. Derpanis, M. Lecce, K. Daniilidis, and R. Wildes, “Dynamic scene understanding: The role of orientation features in space and time in scene classification,” in Proceedings of IEEE Conf. on Computer Vision and Pattern Recognition, Washington, DC, USA, Jun. 2012, pp. 1306–1313.
[138] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CoRR, vol. abs/1212.0402, 2012.
[139] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, 2016.
[140] K. Zhou, “Robust cross-layer design with reinforcement learning for IEEE 802.11n link adaptation,” in Proceedings of IEEE ICC 2011, Kyoto, Japan, Jun. 2011, pp. 1–5.
[141] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Trans. Syst., Man, Cybern. C, vol. 42, no. 6, pp. 1291–1307, Nov. 2012.
[142] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning based mode selection and resource management for green fog radio access networks,” IEEE Internet Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
[143] M. Feng and S. Mao, “Dealing with limited backhaul capacity in millimeter-wave systems: A deep reinforcement learning approach,” IEEE Commun. Mag., vol. 57, no. 3, pp. 50–55, Mar. 2019.
[144] K. Xiao, S. Mao, and J. K. Tugnait, “TCP-Drinc: Smart congestion control based on deep reinforcement learning,” IEEE Access, vol. 7, pp. 11892–11904, Jan. 2019.
[145] D. Shi et al., “Deep Q-network based route scheduling for TNC vehicles with passengers’ location differential privacy,” IEEE Internet Things J., Mar. 2019, doi: 10.1109/JIOT.2019.2902815, submitted for publication.
[146] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1277–1290, Jun. 2019.
[147] S. Sivanandam and S. Deepa, “Genetic algorithm optimization problems,” in Introduction to Genetic Algorithms. Springer, 2008, pp. 165–209.
[148] Q. Shi, M. Razaviyayn, Z. Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331–4340, Sep. 2011.
[149] G. Alnwaimi, K. Arshad, and K. Moessner, “Dynamic spectrum allocation algorithm with interference management in co-existing networks,” IEEE Commun. Lett., vol. 15, no. 9, pp. 932–934, Sep. 2011.
[150] F. Kelly, “Charging and rate control for elastic traffic,” European Transactions on Telecommunications, vol. 8, no. 1, pp. 33–37, 1997.
[151] M. Abrams, C. R. Standridge, G. Abdulla, S. Williams, and E. A. Fox, “Caching proxies: Limitations and potentials,” in Proc. WWW-4 Boston Conf., 1995, pp. 119–133.
[152] M. Arlitt, L. Cherkasova, J. Dilley, R. Friedrich, and T. Jin, “Evaluating content management techniques for Web proxy caches,” ACM SIGMETRICS Perform. Eval. Rev., vol. 27, no. 4, pp. 3–11, 2000.
[153] J. Wang, H. Zhu, L. Dai, N. J. Gomes, and J. Wang, “Low-complexity beam allocation for switched-beam based multiuser massive MIMO systems,” IEEE Trans. Wireless Commun., vol. 15, no. 12, pp. 8236–8248, Dec. 2016.
[154] Q. Ye et al., “User association for load balancing in heterogeneous cellular networks,” IEEE Trans. Wireless Commun., vol. 12, no. 6, pp. 2706–2716, Jun. 2013.
[155] K. Son, H. Kim, Y. Yi, and B. Krishnamachari, “Base station operation and user association mechanisms for energy-delay tradeoffs in green cellular networks,” IEEE J. Sel. Areas Commun., vol. 29, no. 8, pp. 1525–1536, Sep. 2011.
[156] M. Deghel, E. Bastug, M. Assaad, and M. Debbah, “On the benefits of edge caching for MIMO interference alignment,” in Proceedings of SPAWC, Stockholm, Sweden, Jun. 2015, pp. 655–659.
[157] N. Zhao, F. R. Yu, H. Sun, and M. Li, “Adaptive power allocation schemes for spectrum sharing in interference alignment (IA)-based cognitive radio networks,” IEEE Trans. Veh. Technol., vol. 65, no. 5, pp. 3700–3714, May 2016.
[158] G. Cao et al., “Demo: SDN-based seamless handover in WLAN and 3GPP cellular with CAPWAN,” presented at the 13th International Symposium on Wireless Communication Systems, Poznan, Poland, 2016.
[159] T. Jansen, I. Balan, J. Turk, I. Moerman, and T. Kürner, “Handover parameter optimization in LTE self-organizing networks,” in Proceedings of VTC, Ottawa, ON, Canada, Sep. 2010, pp. 1–5.
[160] M. Youssef and A. Agrawala, “The Horus WLAN location determination system,” in Proceedings of ACM MobiSys, Seattle, WA, USA, Jun. 2005, pp. 205–218.
[161] X. Wang, L. Gao, and S. Mao, “PhaseFi: Phase fingerprinting for indoor localization with a deep learning approach,” in Proceedings of IEEE Globecom, San Diego, USA, Dec. 2015, pp. 1–6.
[162] O. G. Aliu, A. Imran, M. A. Imran, and B. Evans, “A survey of self organisation in future cellular networks,” IEEE Commun. Surveys Tuts., vol. 15, no. 1, pp. 336–361, First Quarter, 2013.
[163] Z. Yan, M. Peng, and C. Wang, “Economical energy efficiency: An advanced performance metric for 5G systems,” IEEE Wireless Commun., vol. 24, no. 1, pp. 32–37, Feb. 2017.
[164] H. Xiang, W. Zhou, M. Daneshmand, and M. Peng, “Network slicing in fog radio access networks: Issues and challenges,” IEEE Commun. Mag., vol. 55, no. 12, pp. 110–116, Dec. 2017.
[165] Y. Sun, M. Peng, S. Mao, and S. Yan, “Hierarchical radio resource allocation for network slicing in fog radio access networks,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3866–3881, Apr. 2019.
[166] Z. Zhao et al., “Deep reinforcement learning for network slicing,” arXiv:1805.06591v3, Nov. 2018, accessed on Jun. 13, 2019.
[167] X. Chen et al., “Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach,” arXiv:1807.09350v2, Jun. 2019, accessed on Jun. 13, 2019.

Yaohua Sun received the bachelor’s degree (with first class Hons.) in telecommunications engineering (with management) from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2014. He is a Ph.D. student at the Key Laboratory of Universal Wireless Communications (Ministry of Education), BUPT. His research interests include game theory, resource management, deep reinforcement learning, network slicing, and fog radio access networks. He was the recipient of the National Scholarship in 2011 and 2017, and he has been a reviewer for IEEE Transactions on Communications, IEEE Transactions on Mobile Computing, IEEE Systems Journal, Journal on Selected Areas in Communications, IEEE Communications Magazine, IEEE Wireless Communications Magazine, IEEE Wireless Communications Letters, IEEE Communications Letters, IEEE Access, and IEEE Internet of Things Journal.

Mugen Peng (M’05, SM’11) received the Ph.D. degree in communication and information systems from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2005. Afterward, he joined BUPT, where he has been a Full Professor with the School of Information and Communication Engineering since 2012. During 2014 he was also an academic visiting fellow at Princeton University, USA. He leads a research group focusing on wireless transmission and networking technologies at BUPT. He has authored and co-authored over 90 refereed IEEE journal papers and over 300 conference proceeding papers. His main research areas include wireless communication theory, radio signal processing, cooperative communication, self-organization networking, heterogeneous networking, cloud communication, and the Internet of Things.
Dr. Peng was a recipient of the 2018 Heinrich Hertz Prize Paper Award, the 2014 IEEE ComSoc AP Outstanding Young Researcher Award, and the Best Paper Award in JCN 2016, IEEE WCNC 2015, IEEE GameNets 2014, IEEE CIT 2014, ICCTA 2011, IC-BNMT 2010, and IET CCWMC 2009. He is currently or has been on the Editorial/Associate Editorial Boards of IEEE Communications Magazine, IEEE Access, IEEE Internet of Things Journal, IET Communications, and China Communications.

Yangcheng Zhou received the bachelor’s degree in telecommunications engineering from Southwest Jiaotong University, Chengdu, China, in 2017. She is currently a Master student at the Key Laboratory of Universal Wireless Communications (Ministry of Education), Beijing University of Posts and Telecommunications, Beijing, China. Her research interests include intelligent resource allocation and cooperative communications for fog radio access networks.

Yuzhe Huang received the bachelor’s degree in telecommunications engineering (with management) from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2016. He is currently a Master student at the Key Laboratory of Universal Wireless Communications (Ministry of Education), BUPT, Beijing, China. His research interest is user personal data mining.