A Very Brief Introduction To Machine Learning With Applications To Communication Systems
A Very Brief Introduction To Machine Learning With Applications To Communication Systems
A Very Brief Introduction To Machine Learning With Applications To Communication Systems
Abstract—Given the unprecedented availability of data This paper provides a very brief introduction to key
and computing resources, there is widespread renewed concepts in machine learning and to the literature on
interest in applying data-driven machine learning methods machine learning for communication systems. Unlike
to problems for which the development of conventional other review papers such as [9]–[11], the presentation
arXiv:1808.02342v4 [cs.IT] 5 Nov 2018
1
(a) acquisition acquisition
of domain of data
knowledge
hypothesis training set
physics-based acquisition
class
mathematical model of domain learning
algorithm knowledge
machine
development
algorithm with
Fig. 2. Machine learning methodology that integrates domain knowl-
performance edge during model selection.
guarantees
2
input points that are close to each other, hence advantages. The following criteria, inspired by [20], offer
assigning a label – the cluster index – to each useful guidelines on the type of engineering tasks that
input point (clusters are delimited by dashed lines). can benefit from the use of machine learning tools.
Applications include clustering of documents with 1. The traditional engineering flow is not applicable or
similar topics. It is emphasized that clustering is is undesirable due to a model deficit or to an algorithm
only one of the learning tasks that fall under the deficit [21].
category of unsupervised learning (see Sec. V).
• With a model deficit, no physics-based mathematical
• Reinforcement learning: Reinforcement learning
models exist for the problem due to insufficient
lies, in a sense, between supervised and unsuper-
domain knowledge. As a result, a conventional
vised learning. Unlike unsupervised learning, some
model-based design is inapplicable.
form of supervision exists, but this does not come
• With an algorithm deficit, a well-established math-
in the form of the specification of a desired output
ematical model is available, but existing algorithms
for every input in the data. Instead, a reinforcement
optimized on the basis of such model are too com-
learning algorithm receives feedback from the envi-
plex to be implemented for the given application.
ronment only after selecting an output for a given
In this case, the use of hypothesis classes including
input or observation. The feedback indicates the
efficient “machines”, such as neural network of lim-
degree to which the output, known as action in re-
ited size or with tailored hardware implementations
inforcement learning, fulfils the goals of the learner.
(see, e.g., [22], [23] and references therein), can
Reinforcement learning applies to sequential deci-
yield lower-complexity solutions.
sion making problems in which the learner interacts
with an environment by sequentially taking actions 2. A sufficiently large training data sets exist or can be
– the outputs – on the basis of its observations – created.
its inputs – while receiving feedback regarding each 3. The task does not require the application of logic,
selected action. common sense, or explicit reasoning based on back-
Most current machine learning applications fall in ground knowledge.
the supervised learning category, and hence aim at 4. The task does not require detailed explanations for
learning an existing pattern between inputs and outputs. how the decision was made. The trained machine is by
Supervised learning is relatively well-understood at a and large a black box that maps inputs to outputs. As
theoretical level [14], [15], and it benefits from well- such, it does not provide direct means to ascertain why a
established algorithmic tools. Unsupervised learning has given output has been produced in response to an input,
so far defied a unified theoretical treatment [16]. Never- although recent research has made some progress on
theless, it arguably poses a more fundamental practical this front [24]. This contrasts with engineered optimal
problem in that it directly tackles the challenge of learn- solutions, which can be typically interpreted on the
ing by direct observation without any form of explicit basis of physical performance criteria. For instance, a
feedback. Reinforcement learning has found extensive maximum likelihood decoder chooses a given output
applications in problems that are characterized by clear because it minimizes the probability of error under the
feedback signals, such as win/lose outcomes in games, assumed model.
and that entail searches over large trees of possible 5. The phenomenon or function being learned is station-
action-observation histories [17], [18]. ary for a sufficiently long period of time. This is in order
This paper only covers supervised and unsupervised to enable data collection and learning.
learning. Reinforcement learning requires a different 6. The task has either loose requirement constraints,
analytical framework grounded in Markov Decision Pro- or, in the case of an algorithm deficit, the required
cesses and will not be discussed here (see [17]). For a performance guarantees can be provided via numeri-
broader discussion on the technical aspects of supervised cal simulations. With the conventional engineering ap-
and unsupervised learning, we point to [19] and refer- proach, theoretical performance guarantees can be ob-
ences therein. tained that are backed by a physics-based mathematical
model. These guarantees can be relied upon insofar as
C. When to Use Machine Learning? the model is trusted to be an accurate representation
Based on the discussion in Sec. I-A, the use of a of reality. If a machine learning approach is used to
machine learning approach in lieu of a more conventional address an algorithm deficit and a physics-based model
engineering design should be justified on a case-by- is available, then numerical results may be sufficient in
case basis on the basis of its suitability and potential order to compute satisfactory performance measures. In
3
5 Core
(a) Core Cloud
Network
4
Cloud
Edge
3
Access Edge
Network Cloud
2
Wireless
1 Edge
0
4 5 6 7 8 9 Fig. 4. A generic cellular wireless network architecture that dis-
5 tinguishes between edge segment, with base stations, access points,
(b) and associated computing resources, and cloud segment, consisting
4 of core network and associated cloud computing platforms.
3
Throughout, we focus on tasks carried out at the
2
network side, rather than at the users, and organize the
applications along two axes. On one, with reference to
1
Fig. 4, we distinguish tasks that are carried out at the
edge of the network, that is, at the base stations or
0 access points and at the associated computing platforms,
4 5 6 7 8 from tasks that are instead responsibility of a centralized
cloud processor connected to the core network (see, e.g.,
Fig. 3. Illustration of (a) supervised learning and (b) unsupervised [25]). The edge operates on the basis of timely local
learning. information collected at different layers of the protocol
stack, which may include all layers from the physical up
to the application layer. In contrast, the centralized cloud
contrast, weaker guarantees can be offered by machine processes longer-term and global information collected
learning in the absence of a physics-based model. In this from multiple nodes in the edge network, which typically
case, one can provide performance bounds only under encompasses only the higher layers of the protocol stack,
the assumptions that the hypothesis class is sufficiently namely networking and application layers. Examples of
general to include “machines” that can perform well on data that may be available at the cloud and at the edge
the problem and that the data is representative of the can be found in Table I and Table II, respectively.
actual data distribution to be encountered at runtime (see, As a preliminary discussion, it is useful to ask which
e.g., [19][Ch. 5]). The selection of a biased hypothesis tasks of a communication network, if any, may benefit
class or the use of an unrepresentative data set may hence from machine learning through the lens of the criteria re-
yield strongly suboptimal performance. viewed in Sec. I-C. First, as seen, there should be either a
We will return to these criteria when discussing ap- model deficit or an algorithm deficit that prevents the use
plications to communication systems. of a conventional model-based engineering design. As an
example of model deficit, proactive resource allocation
II. M ACHINE L EARNING FOR C OMMUNICATION that is based on predictions of human behaviour, e.g., for
N ETWORKS caching popular contents, may not benefit from well-
established and reliable models, making a data-driven
In order to exemplify applications of supervised and approach desirable (see, e.g., [26], [27]). For an instance
unsupervised learning, we will offer annotated pointers of algorithm deficit, consider the problem of channel
to the literature on machine learning for communication decoding for channels with known and accurate models
systems. Rather than striving for a comprehensive, and based on which the maximum likelihood decoder entails
historically minded, review, the applications and refer- an excessive complexity.
ences have been selected with the goal of illustrating Assuming that the problem at hand is characterized
key aspects regarding the use of machine learning in by model or algorithm deficits, one should then consider
engineering problems. the rest of the criteria discussed in Sec. I-C. Most are
4
TABLE I
E XAMPLES OF DATA AVAILABLE AT THE EDGE SEGMENT OF A COMMUNICATION NETWORK
Layer Data
Physical Baseband signals, channel state information
Medium Access Control/ Link Throughput, FER, random access load and latency
Network Location, traffic loads across services, users’ device types, battery levels
Application Users’ preferences, content demands, computing loads, QoS metrics
TABLE II
E XAMPLES OF DATA AVAILABLE AT THE CLOUD SEGMENT OF A COMMUNICATION NETWORK
Layer Data
Network Mobility patterns, network-wide traffic statistics, outage rates
Application User’s behaviour patterns, subscription information, service usage statistics, TCP/IP traffic statistics
5
4.5 B. Defining Supervised Learning
4 Having introduced the goal of supervised learning, we
? now provide a more formal definition of the problem.
3.5
Throughout, we use Roman font to denote random
3 variables and the corresponding letter in regular font for
2.5
realizations.
As a starting point, we assume that the training set D
2 is generated as
1.5 (xn , tn ) ∼ p(x, t), n = 1, ..., N , (1)
i.i.d.
1
that is, each training sample pair (xn , tn ) is generated
0.5 from the same true joint distribution p(x, t) and the sam-
4 5 6 7 8 9
ple pairs are independent identically distributed (i.i.d.).
As discussed, based on the training set D, we wish
Fig. 6. Illustration of the supervised learning problem of classi- to obtain a predictor t̂(x) that performs well on any
fication: Given input-output training examples (xn , tn ), with n =
possible relevant input x. This requirement is formalized
1, ..., N , how should we predict the output t for an unobserved value
of the input x? by imposing that the predictor is accurate for any test
pair (x, t) ∼ p(x, t), which is generated independently
of all the pairs in the training set D.
The quality of the prediction t̂(x) for a test pair (x, t)
variables that take a finite number of possible values. The is measured by a given loss function `(t, t̂) as `(t, t̂(x)).
value of the output t for a given input x indicates the Typical examples of loss functions include the quadratic
class to which x belongs. For instance, the label t is a loss `(t, t̂) = (t − t̂)2 for regression problems; and the
binary variable as in Fig. 6 for a binary classification error rate `(t, t̂) = 1(t 6= t̂), which equals 1 when the
problem. Based on the training set D, the goal is to prediction is incorrect, i.e., t 6= t̂, and 0 otherwise, for
predict the label, or the class, t for a new, as of yet classification problems.
unobserved, input x. The formal goal of learning is that of minimizing the
average loss on the test pair, which is referred to as the
To sum up, the goal of both regression and clas- generalization loss. For a given predictor t̂, this is defined
sification is to derive from the training data set D a as
predictor t̂(x) that generalizes the input-output mapping Lp (t̂) = E(x,t)∼p(x,t) [`(t, t̂(x))]. (2)
in D to inputs x that are not present in D. As such,
learning is markedly distinct from memorizing: while The generalization loss (2) is averaged over the distribu-
memorizing would require producing a value tn for some tion of the test pair (x, t).
recorded input xn in the training set, learning is about Before moving on to the solution of the problem of
generalization from the data set to the rest of the relevant minimizing the generalization loss, we mention that the
input space. formulation provided here is only one, albeit arguably
the most popular, of a number of alternative formula-
The problem of extrapolating a predictor from the tions of supervised learning. The frequentist framework
training set is evidently impossible unless one is willing described above is in fact complemented by other view-
to make some assumption about the underlying input- points, including Bayesian and Minimum Description
output mapping. In fact, the output t may well equal Length (MDL) (see [19] and references therein).
any value for an unobserved x if nothing else is specified
about the problem. This impossibility is formalized by
the no free-lunch theorem: without making assumptions C. When The True Distribution p(x, t) is Known: Infer-
about the relationship between input and output, it is not ence
possible to generalize the available observations outside Consider first the case in which the true joint dis-
the training set [14]. The set of assumptions made in tribution p(x, t) relating input and output is known.
order to enable learning are known as inductive bias. This scenario can be considered as an idealization of
As an example, for the regression problem in Fig. 5, the situation resulting from the conventional engineering
a possible inductive bias is to postulate that the input- design flow when the available physics-based model is
output mapping is a polynomial function of some order. accurate (see Sec. I). Under this assumption, the data set
6
D is not necessary, since the mapping between input and D. When the True Distribution p(x, t) is Not Known:
output is fully described by the distribution p(x, t). Machine Learning
If the true distribution p(x, t) is known, the problem Consider now the case of interest in which domain
of minimizing the generalization loss reduces to a stan- knowledge is not available and hence the true joint
dard inference problem, i.e., an estimation problem in a distribution is unknown. In such a scenario, we have a
regression set-up, in which the outputs are continuous learning problem and we need to use the examples in the
variables, or a detection problem in a classification set- training set D in order to obtain a meaningful predictor
up, in which the outputs are finite discrete variables. that approximately minimizes the generalization loss.
In an inference problem, the optimal predictor t̂ can At a high level, the methodology applied by machine
be directly computed from the posterior distribution learning follows three main steps, which are described
p(x, t) next.
p(t|x) = , (3) 1. Model selection (inductive bias): As a first step,
p(x)
one needs to commit to a specific class of hypotheses that
where p(x) is the marginal distribution of the input x. the learning algorithm may choose from. The hypothesis
The latter can be computed from the joint distribution class is also referred to as model. The selection of the hy-
p(x, t) by summing or integrating out all the values of t. pothesis class characterizes the inductive bias mentioned
In fact, given a loss function `(t, t̂), the optimal predictor above as a pre-requisite for learning. In a probabilistic
for any input x is obtained as framework, the hypothesis class, or model, is defined
by a family of probability distributions parameterized
t̂∗ (x) = arg min Et∼p(t|x) [`(t, t̂)|x]. (4) by a vector θ. Specifically, there are two main ways
t̂
of specifying a parametric family of distributions as a
In words, the optimal predictor t̂∗ (x) is obtained by model for supervised learning:
identifying the value (or values) of t that minimizes the • Generative model: Generative models specify a
average loss, where the average is taken with respect family of joint distributions p(x, t|θ);
to the posterior distribution p(t|x) of the output given • Discriminative model: Discriminative models pa-
the input. Given that the posterior p(t|x) yields the rameterize directly the predictive distribution as
optimal predictor, it is also known as the true predictive p(t|x, θ).
distribution. Broadly speaking, discriminative models do not make
The optimal predictor (4) can be explicitly evaluated any assumptions about the distribution of the inputs
for given loss functions. For instance, for the quadratic x and hence may be less prone to bias caused by a
loss, which is typical for regression, the optimal predictor misspecification of the hypothesis class. On the flip side,
is given by the mean of the predictive distribution, or the generative models may be able to capture more of the
posterior mean, i.e., structure present in the data and consequently improve
the performance of the predictor [29]. For both types of
t̂∗ (x) = Et∼p(t|x) [t|x], (5)
models, the hypothesis class is typically selected from
while, with the error rate loss, which is typical for a common set of probability distributions that lead to
classification, problems, the optimal predictor is given efficient learning algorithms in Step 2. Furthermore, any
by the maximum of the predictive distribution, or the available basic domain knowledge can be in principle
maximum a posteriori (MAP) estimate, i.e., incorporated in the selection of the model (see also Sec.
VII).
t̂∗ (x) = arg max p(t|x). (6) 2. Learning: Given data D, in the learning step, a
t
learning criterion is optimized in order to obtain the
For a numerical example, consider binary inputs parameter vector θ and identify a distribution p(x, t|θ)
and outputs and the joint distribution p(x, t) such that or p(t|x, θ), depending on whether a generative or dis-
p(0, 0) = 0.05, p(0, 1) = 0.45, p(1, 0) = 0.4 and criminative model was selected at Step 1.
p(1, 1) = 0.1. The predictive distribution for input x = 0 3. Inference: In the inference step, the learned model
is then given as p(t = 1|x = 0) = 0.9, and hence is used to obtain the predictor t̂(x) by using (4) with
we have the optimal predictor given by the average the learned model in lieu of the true distribution. Note
t̂∗ (x = 0) = 0.9 × 1 + 0.1 × 0 = 0.9 for the quadratic that generative models require the calculation of the
loss, and by the MAP solution t̂∗ (x = 0) = 1 for the predictive distribution p(t|x) via marginalization, while
error rate loss. discriminative models provide directly the predictive
7
distribution. As mentioned, the predictor should be eval-
uated on test data that is different from the training set
D. As we will discuss, the design cycle typically entails
a loop between validation of the predictor at Step 3 and
model selection at Step 1.
The next examples illustrate the three steps introduced
above for a binary classification problem.
Example 1: Consider a binary classification problem Fig. 7. An illustration of the hypothesis class p(t|x, w) assumed by
in which the input is a generic D-dimensional vector logistic regression using a neural network representation: functions
x = [x1 , ..., xD ]T and the output is binary, i.e., t ∈ φi , with i = 1, ..., D0 , are fixed and compute features of the input
{0, 1}. The superscript “T ” represents transposition. In vector x = [x1 , ..., xD ]. The learnable parameter vector θ here
corresponds to the weights w used to linearly combine the features
Step 1, we select a model, that is, a parameterized family in (7).
of distributions. A common choice is given by logistic
regression1 , which is a discriminative model whereby
the predictive distribution p(t|x, θ) is parameterized as
illustrated in Fig. 7. The model first computes D0 fixed of the predictor should be tested on new, test, input-
features φ(x) = [φ1 (x) · · · φD0 (x)]T of the input, where output pairs, e.g., new emails in the spam classification
a feature is a function of the data. Then, it computes the example.
predictive probability as
Example 2: Logistic regression requires to specify a
suitable vector of features φ(x). As seen in the email
p(t = 1|x, w) = σ(wT φ(x)), (7)
spam classification example, this entails the availability
where w is the set of learnable weights – i.e., the pa- of some domain knowledge to be able to ascertain which
rameter θ defined above – and σ(a) = (1 + exp(−a))−1 functions of the input x may be more relevant for the
is the sigmoid function. classification task at hand. As discussed in Sec. I, this
knowledge may not be available due to, e.g., cost or
Under logistic regression, the probability that the label
time constraints. Multi-layer neural networks provide an
is t = 1 increases as the linear combination of features
alternative model choice at Step 1 that obviates the need
becomes more positive, and we have p(t = 1|x, w) > 0.5
for hand-crafted features. The model is illustrated in
for wT φ(x) > 0. Conversely, the probability that the
Fig. 8. Unlike linear regression, in a multi-layer neural
label is t = 0 increases as the linear combination of
network, the feature vector φ(x) used by the last layer to
features becomes more negative, with p(t = 0|x, w) >
compute the logit, or LLR, that determines the predictive
0.5 for wT φ(x) < 0. As a specific instance of this
probability (7) is not fixed a priori. Rather, the feature
problem, if we wish to classify emails between spam
vector is computed by the previous layers. To this end,
and non-spam ones, possible useful features may count
each neuron, represented as a circle in Fig. 8, computes
the number of times that certain suspicious words appear
a fixed non-linear function, e.g., sigmoid, of a linear
in the text.
combination of the values obtained from the previous
Step 2 amounts to the identification of the weight
layer. The weights of these linear combinations are part
vector w on the basis of the training set D with the
of the learnable parameters θ, along with the weights of
ideal goal of minimizing the generalization loss (2). This
the last layer. By allowing the weights at all layers of the
step will be further discussed in the next subsection.
model to be trained simultaneously, multi-layer neural
Finally, in Step 3, the optimal predictor is obtained
networks enable the joint learning of the last-layer linear
by assuming that the learned model p(t|x, w) is the
classifier and of the features φ(x) the classifier operates
true predictive distribution. Assuming an error rate loss
on. As a notable example, deep neural networks are
function, following the discussion in Sec. III-C, the
characterized by a large number of intermediate layers
optimal predictor is given by the MAP choice t̂∗ (x) = 1
that tend to learn increasingly abstract features of the
if wT φ(x) > 0 and t̂∗ (x) = 0 otherwise. It is noted that
input [7].
the linear combination wT φ(x) is also known as logit
or log-likelihood ratio (LLR). This rule can be seen to In the rest of this section, we first provide some
correspond to a linear classifier [19]. The performance technical details about Step 2, i.e., learning, and then we
return to Step 1, i.e., model selection. As it will be seen,
1
The term ”regression” may be confusing, since the model applies this order is dictated by the fact that model selection
to classification. requires some understanding of the learning process.
8
where we have defined as γ > 0 the learning rate, and,
for simplicity of notation, we have considered a mini-
batch given by a single example (xn , tn ). It is noted
that, with multi-layer neural networks, the computation
of the gradient ∇θ ln p(tn |xn , θ) yields the standard
backpropagation algorithm [7], [19]. The learning rate is
an example of hyperparameters that define the learning
algorithm. Many variations of SGD have been proposed
Fig. 8. An illustration of the hypothesis class p(t|x, w) assumed that aim at improving convergence (see, e.g., [7], [19]).
by multi-layer neural networks. The learnable parameter vector θ ML has evident drawbacks as an indirect means of
here corresponds to the weights wL used at the last layer to linearly
combine the features φ(x) and the weight matrices W 1 , ..., W L−1
minimizing the generalization error. In fact, ML only
used at the preceding layers in order to compute the feature vector. considers the fit of the probabilistic model on the training
set without any consideration for the performance on
unobserved input-output pairs. This weakness can be
E. Learning somewhat mitigated by regularization [7], [19] during
Ideally, a learning rule should obtain a predictor learning and by a proper selection of the model via
that minimizes the generalization error (2). However, as validation, as discussed in the next subsection. Regu-
discussed in Sec. III-C, this task is out of reach without larization adds a penalty term to the log-likelihood that
knowledge of the true joint distribution p(x, t). There- depends on the model parameters θ. The goal is to
fore, alternative learning criteria need to be considered prevent the learned model parameters θ to assume values
that rely on the training set D rather than on the true that are a priori too unlikely and that are hence possible
distribution. symptoms of overfitting. As an example, for logistic
In the context of probabilistic models, the most basic regression, one can add a penalty that is proportional
learning criterion is Maximum Likelihood (ML). ML to the norm ||w||2 of the weight vector w in order to
selects a value of θ in the parameterized family of models prevent the weights to assume excessively high values
p(x, t|θ) or p(t|x, θ) that is the most likely to have when fitting the data in the learning step.
generated the observed training set D. Mathematically,
ML solves the problem of maximizing the log-likelihood
function F. Model Selection
maximize ln p(D|θ) (8)
We now discuss the first, key, step of model selection,
over θ, where p(D|θ) is the probability of the data set which defines the inductive bias adopted in the learning
D for a given value of θ. Given the assumption of i.i.d. process. In order to illustrate the main ideas, here we
data points in D (see Sec. III-B), the log-likelihood can study a particular aspect of model selection, namely that
be written as of model order selection. To this end, we consider a
XN hierarchical set of models of increasing complexity and
ln p(D|θ) = ln p(tn |xn , θ), (9) we address the problem of selecting (in Step 1) the order,
n=1 or the complexity, of the specific model to be posited
where we have used as an example the case of discrim- for learning (in Step 2). As an example of model order
inative models. Note that most learning criteria used in selection, one may fix a set of models including multi-
practice can be interpreted as ML problems, including layer networks of varying number of intermediate layers
the least squares criterion – ML for Gaussian models – and focus on determining the number of layers. It is
and cross-entropy – ML for categorical models. emphasized that the scope of model selection goes much
The ML problem (8) rarely has analytical solutions beyond model order selection, including the possible
and is typically addressed by Stochastic Gradient De- incorporation of domain knowledge and the tuning of
scent (SGD). Accordingly, at each iteration, subsets of the hyperparameters of the learning algorithm.
examples, also known as mini-batches, are selected from For concreteness, we focus on the regression problem
the training set, and the parameter vector is updated in illustrated in Fig. 5 and assume a set of discriminative
the direction of gradient of the log-likelihood function models p(t|x, w) under which the output t is distributed
as evaluated on these examples. The resulting learning as
rule can be written as XM
wm xm + N (0, 1). (11)
θnew ← θold + γ∇θ ln p(tn |xn , θ)|θ=θold , (10)
m=0
9
fitting cannot be ascertained on the basis of the training
3
data as it refers to the performance of the predictor out-
2 side D. It follows that model selection cannot be carried
M= 9
out by observing only the training set. Rather, some
1 information must be available about the generalization
performance of the predictor. This is typically obtained
0
M= 1
by means of validation. In its simplest instantiation,
validation partitions the available data into two sets, a
-1
training set D and a validation set. The training set is
-2 M= 3 used for learning as discussed in Sec. III-E, while the
validation set is used to estimate the generalization loss.
-3 This is done by computing the average in (12) only over
0 0.2 0.4 0.6 0.8 1
the validation set. More sophisticated forms of validation
exist, including cross-validation [7].
Fig. 9. Training set in Fig. 5, along with a predictor trained by Keeping some data aside for validation, one can obtain
using the discriminative model (11) and ML for different values of a plot as in Fig. 10, where the training loss (12) is
the model order M .
compared with the generalization loss (2) estimated
via validation. The figure allows us to conclude that,
In words, the output t is given by a polynomial function when M is large enough, the generalization loss starts
of order M of the input x plus zero-mean Gaussian noise increasing, indicating overfitting. Note, in contrast, that
of power equal to one. The learnable parameter vector underfitting is detectable by observing the training loss.
θ is given by the weights w = [w0 , ..., wM −1 ]T . Model A figure such as Fig. 10 can be used to choose a value
selection, to be carried out in Step 1, amounts to the of M that approximately minimizes the generalization
choice of the model order M . loss.
Having chosen M in Step 1, the weights w can be More generally, validation allows for model selection,
learned in Step 2 using ML, and then the optimal pre- as well as for the selection of the parameters used
dictor can be obtained for inference in Step 3. Assuming by learning the algorithm, such as the learning rate γ
the quadratic loss, the optimal in (10). To this end, one compares the generalization
PM predictor m
is given by the
posterior mean t̂(x) = m=0 wm x for the learned loss, estimated via validation, for a number of models
parameters w. This predictor is plotted in Fig. 9 for and then chooses the one with the smallest estimated
different values of M , along with the training set of Fig. generalization loss.
5. Finally, it is important to remark that the performance
With M = 1, the predictor t̂(x) is seen to underfit of the model selected via validation should be estimated
the training data. This is in the sense that the model is on the basis of a separate data set, typically called
not rich enough to capture the variations present in the the test set. This is because the generalization loss
training data, and, as a result, we obtain a large training estimated using validation is a biased estimate of the
loss true generalization loss (2) due to the process of model
N
1 X selection. In particular, the loss on the validation set will
LD (w) = (tn − t̂(xn ))2 . (12) tend to be small, since the model was selected during
N
n=1
validation with the aim of minimizing it. Importantly, the
The training loss measures the quality of the predictor test set should never be used during the three steps that
defined by weights w on the points in the training set. In make up the machine learning methodology and should
contrast, with M = 9, the predictor fits well the training ideally only be used once to test the trained predictor.
data – so much so that it appears to overfit it. In other
words, the model is too rich and, in order to account
IV. A PPLICATIONS OF S UPERVISED L EARNING TO
for the observations in the training set, it appears to
C OMMUNICATION S YSTEMS
yield inaccurate predictions outside it. As a compromise
between underfitting and overfitting, the selection M = 3 In this section, we provide some pointers to existing
seems to be preferable. applications of supervised learning to communication
As implied by the discussion above, underfitting can networks. The discussion is organized by following the
be detected by observing solely the training data D via approach described in Sec. II. Accordingly, we distin-
the evaluation of the training loss (12). In contrast, over- guish between tasks carried out at edge and cloud (see
10
check the remaining criteria listed in Sec. II, particularly
1.6
underfitting overfitting those regarding the rate of change of the phenomenon
1.4 under study and the requirements in terms of perfor-
root average squared loss
11
model available based on domain knowledge is only a links in the core network, as well as regarding the status
coarse approximation of the physical model, the resulting of the queues at the network routers. In the presence
training set can be used to augment the data in order to of wireless or optical communications, the quality of a
carry out a preliminary training of a machine learning link may not be available at the network controller, but
model [45]2 . it may be predicted using available historical data [33],
For an application at a full-duplex transceiver, we refer [54] in the absence of agreed-upon dynamic availability
to [47], which learns how to cancel self-interference in models. In a similar manner, predicting congestion can
order to overcome the lack of well-established models be framed as a data-aided classification problem [55].
for the transmitter-receiver chain of non-linearities. 2) Application: Finally, a relevant supervised learning
2) Link and Medium Access Control Layers: At the task is that of traffic classification, whereby data streams
medium access control layer, we highlight some ap- are classified on the basis of some extracted features,
plications of machine learning that tackle the lack of such as packet sizes and inter-arrival times, in terms of
mathematical models for complex access protocols and their applications, e.g., Voice over IP. [56]
communication environments. In [48], a mechanism is
proposed to predict whether a channel decoder will suc- V. U NSUPERVISED L EARNING
ceed on the basis of the outputs of the first few iterations As introduced in Sec. I, unlike supervised learning,
of the iterative decoding process. This binary predictor unsupervised learning tasks operate over unlabelled data
is useful in order to request an early retransmission at sets consisting solely of the inputs xn , with n = 1, ..., N ,
the link layer using Automatic Retransmission Request and the general goal is that of discovering properties
(ARQ) or Hybrid ARQ (HARQ) in order to reduce of the data. We start this section by reviewing some
latency. At the medium access control layer, data-aided of the typical specific unsupervised learning tasks. We
methods can instead be used to predict the availability then cover methodology, models, and learning, includ-
of spectrum in the presence of interfering incumbent ing advanced methods such as Generative Adversarial
devices with complex activation patterns for cognitive Networks (GANs) [7].
radio applications [49] (see also [50]). An approach
that leverages depth images to detect the availability of
mmwave channels is proposed in [51]. A. Goals and Definitions
3) Network and Application Layers: A task that In unsupervised learning, taking a frequentist formu-
is particularly well-suited for machine learning is the lation (see Sec. III-A), we are given a training set D
caching of popular contents for reduced latency and consisting of N i.i.d. samples xn ∼ p(x) with n =
network congestion [52]. Caching may take place at the 1, ..., N generated from an unknown true distribution
edge and, more traditionally, within the core network p(x). The high-level goal is that of learning some useful
segment. Caching at the edge has the advantage of properties of the distribution p(x). More specifically, we
catering directly to the preference of the local population can identify the following tasks.
of users, but it generally suffers from a reduced hit rate • Density estimation: Density estimation aims at es-
due to the smaller available storage capacity. Optimizing timating directly the distribution p(x). This may be
the selection of contents to be stored at the edge can be useful, for example, for use in plug-in estimators of
formulated as a classification problem that can benefit information-theoretic quantities, for the design of
from a data-driven approach in order to adapt to the compression algorithms, or to detect outliers;
specific features of the local traffic [52]. • Clustering: Clustering aims at partitioning all points
in the data set D in groups of similar objects, where
B. At the Cloud the notion of similarity is domain-dependent;
• Dimensionality reduction, representation, and fea-
We now turn to some relevant tasks to be carried out
ture extraction: These three related tasks represent
at the cloud at both network and application layers.
each data point xn in a different space, typically
1) Network: The main task of the network layer is
of lower dimensionality, in order to highlight in-
routing (see [53] for further discussion). Considering a
dependent explanatory factors and/or to ease visu-
software-defined networking implementation, routing re-
alization, interpretation, or the implementation of
quires the availability at a network controller of informa-
successive tasks, e.g., classification;
tion regarding the quality of individual communication
• Generation of new samples: Given the data set D ,
2
This can be thought of as an example of experience learning as we wish to learn a machine that produces sam-
part of small-sample learning techniques [46]. ples that are approximately distributed according
12
to p(x). As an example, if the data set contains
images of celebrities, the idea is to produce plausi-
ble images of non-existent celebrities. This can be
useful, e.g., to produce artificial scenes for video
parameterizes or films.
As suggested by the variety of tasks listed above,
unsupervised learning does not have a formal unified
formulation as supervised learning. Nevertheless, the
general methodology follows three main steps in a
manner similar to supervised learning (see Sec. III-D). In
Step 1 (model selection), a model, or a hypothesis class,
is selected, defining the inductive bias of the learning
process. This is done by positing a family of probability
distributions p(x|θ) parameterized by a vector θ. In
Step 2 (learning), the data D is used to optimize a Fig. 11. Illustration of typical unsupervised learning models: (a)
directed generative models; (b) undirected generative models; (c)
learning criterion with the aim of choosing a value for the discriminative models; and (d) autoencoders.
parameter vector θ. Finally, in Step 3, the trained model
is leveraged in order to carry out the task of interest,
e.g., clustering or sample generation. models and autoencoders, and we point to [19] and
In the following, we discuss Step 1 (model selection) references therein for a discussion about the remaining
and Step 2 (learning). For the formulation of specific models.
tasks to be carried out at Step 3, we refer to, e.g., [7], As illustrated in Fig. 11(a), directed generative models
[19], [57]. assume that each data point x is caused4 by a hidden
variable z. This is in the sense that the joint distribution
B. Models p(x, z|θ) is parameterized as p(x, z|θ) = p(z|θ)p(x|z, θ),
Unsupervised learning models, selected at Step 1 of where p(z|θ) is the distribution of the hidden cause and
the machine learning process, typically involve a hidden p(x|z, θ) is the conditional distribution of the data x
or latent (vector of) variables zn for each data point xn . given the cause z. As a result, under a directed generative
For example, in a clustering problem, the latent variable model, the distribution of an observation x = x can be
zn represents the cluster index of xn . Latent variables are written as
hidden or unobserved in the sense that they do not appear X
p(x|θ) = p(z|θ)p(x|z, θ) = Ez∼p(z|θ) [ln p(x|z, θ)],
for any of the data points xn in D.3 The relationship z
between latent variables zn and observable variables xn (13)
can be modelled in different ways, giving rise to a where the sum in the second term should be replaced by
number of different types of models for unsupervised an integration for continuous hidden variables, and the
learning. These are illustrated in Fig. 11 and discussed last equality expresses the marginalization over z as an
next. expectation.
By way of a short round-up of types of models, As an example, for the problem of document clus-
with reference to Fig. 11, directed generative models, tering, variable x represents a document in the training
illustrated by Fig. 11(a), posit that there exist hidden set and z is interpreted as a latent topic that “causes”
causes z yielding the observation x. Undirected genera- the generation of the document. Model selection requires
tive models, represented in Fig. 11(b) model the mutual the specification of a parameterized distribution p(z|θ)
correlation between x and z. Discriminative models, over the topics, e.g., a categorical distribution with
illustrated by Fig. 11(c) model the extraction of the parameters equals to the probability of each possible
latent representation z from x. Finally, autoencoders, value, and the distribution p(x|z, θ) of the document
represented in Fig. 11(d) assume that x is encoded into given a topic. Basic representatives of directed generative
a latent representation z in such as way that x can then models include mixture of Gaussians and likelihood-free
be approximately recovered from z. In the following, we models [19], [58].
provide some additional details about directed generative
4
The use of the term “cause” is meant to be taken in an intuitive,
3
Problems in which some of the inputs in D are labelled by a value rather than formal, way. For a discussion on the study of causality,
zn are filed under the rubric of semi-supervised learning [29]. we refer to [8].
13
As represented in Fig. 11(d), autoencoders model over the hidden variables and on the optimization of a
encoding from data x to hidden variables z, as well as de- tractable lower bound on the log-likelihood known as
coding from hidden variables back to data. Accordingly, the Evidence Lower BOund (ELBO). To elaborate, for
model selection for autoencoders requires the specifica- any fixed value x and any distribution q(z) on the latent
tion of a parameterized family of encoders p(z|x, θ) and variables z (possibly dependent on x), the ELBO L(q, θ)
decoders p(x|z, θ). As an example, autoencoders can be is defined as
used to learn how to compress an input signal x into a
representation z in a smaller space so as to ensure that L(q, θ) = Ez∼q(z) [ln p(x|z, θ)]−KL(q(z)||p(z|θ)), (15)
x can be recovered from z within an admissible level of
distortion. Representatives of autoencoders, which cor- where KL(q||p) = Ez∼q(z) [ln(q(z)/p(z))] is the
respond to specific choices for the encoder and decoder Kullback-Leibler (KL) divergence. The latter is a mea-
families of distributions, include Principal Component sure of the distance between the two distributions, as
Analysis (PCA), dictionary learning, and neural network- we will further discuss in Sec. V-D (see [59], [60]).
based autoencoders [19], [57], [58]. The analytical advantages of the ELBO L(q, θ) over
the original log-likelihood are that: (i) it entails an
expectation of the logarithm of the model p(x|z, θ),
C. Learning which, as mentioned, is typically a tractable function;
We now discuss learning, to be carried out as Step 2. and (ii) the average is over a fixed distribution q(z),
For brevity, we focus on directed generative models and which does not depend on the model parameter θ.
refer to [19] and references therein for a treatment of Using Jensen’s inequality, it can be seen that the
learning for the other models in Fig. 11. In this regard, ELBO (15) is a global lower bound on the log-likelihood
we note that the problem of training autoencoders is function, that is,
akin to supervised learning in the sense that autoencoders
specify the desired output for each input in the training ln p(x|θ) ≥ L(q, θ). (16)
set.
As for supervised learning, the most basic learning An illustration of the lower bounding property of the
criterion for probabilistic models is ML. Following the ELBO can be found in Fig. 12. An important feature
discussion in Sec. III-E, ML tackles the problem of of this inequality is that the ELBO “touches” the log-
maximizing the log-likelihood of the data, i.e., likelihood function at values θ0 , if any, for which the
distribution q(z) satisfies the equality
maximize ln p(x|θ) = ln Ez∼p(z|θ) [ln p(x|z, θ)]. (14)
θ
q(z) = p(z|x, θ0 ). (17)
Note that problem (14) considers only one data point x in
the data set for the purpose of simplifying the notation, In words, the ELBO is tight if the variational distribution
but in practice the log-likelihood needs to be summed is selected to equal the posterior distribution of the
over the N examples in D. hidden variables given the observation x under the model
Unlike the corresponding problem for supervised parameter θ0 . Stated less formally, in order to ensure
learning (8), the likelihood in (14) requires an average that the ELBO is tight at a value θ0 , one needs to solve
over the hidden variables. This is because the value the problem of inferring the distribution of the hidden
of the hidden variables z is not known, and hence the variables z given the observation x under the model
probability of the observation x needs to account for identified by the value θ0 .
all possible values of z weighted by their probabilities The property (16) leads to the natural idea of the
p(z|θ). This creates a number of technical challenges. Expectation-Maximization (EM) algorithm as a means
First, the objective in (14) is generally more complex to to tackle the ML problem. As illustrated in Fig. 13,
optimize, since the average over z destroys the typical EM maximizes the ELBO iteratively, where the ELBO
structure of the model p(x|z, θ), whose logarithm is often at each iteration is computed to be tight at the current
selected as a tractable function (see, e.g., logistic re- iterate for θ. More formally, the EM algorithm can be
gression). Second, the average in (14) cannot be directly summarized as follows5 . The model vector is initialized
approximated using Monte Carlo methods if the goal is to some value θold and then for each iteration the
to optimize over the model parameters θ, given that the following two steps are performed.
distribution p(z|θ) generally depends on θ itself.
To tackle these issues, a standard approach is based 5
EM is an instance of the more general Majorization-Minimization
on the introduction of a variational distribution q(z) algorithm [61].
14
-1.5
LL LL
-2
-2.5
Log-likelihood
-3
ELBO ( 0= 3)
-3.5
-4
ELBO ( 0
= 2)
-4.5
-5
-4 -3 -2 -1 0 1 2 3 4
old new ...
Fig. 12. The ELBO (15) is a global lower bound on the log-likelihood Fig. 13. Illustration of the EM algorithm: At each iteration, a tight
that is tight at values of the model parameters θ0 for which equality ELBO is evaluated in the E step by solving the problem of estimating
(17) holds. the latent variables (via the posterior distribution p(z|x, θ)), and then
the ELBO is maximized in the M step by solving a problem akin to
supervised learning with the estimated latent variables.
• Expectation, or E, step: For fixed parameter vector
θold , solve the problem
learning with probabilistic models entail some approxi-
maximize L(q, θold ). (18) mation of the EM algorithm. Notably, the E step can be
q
approximated by parametrizing the variational distribu-
The solution of this problem is given by q new (z) = tion with some function q(z|ϕ), or q(z|x, ϕ) to include
p(z|x, θold ). In fact, as discussed, the tightest (i.e., the dependence on x, and by maximizing ELBO over
largest) value of the ELBO is obtained by choosing the variational parameters ϕ. This approach underlies
the variational distribution q(z) as the posterior the popular variational autoencoder technique [7]. In the
of the latent variables under the current model M step, instead, one can approximate the expectation
θold . This step can be interpreted as estimating the in (19) using Monte Carlo stochastic approximation
latent variables z, via the predictive distribution based on randomly sampled values of z from the current
p(z|x, θold ), assuming that the current model θold distribution q(z). Finally, gradient descent can be used
is correct. to carry out the mentioned optimizations for both E and
• Maximization, or M, step: For fixed variational M steps (see, e.g., [62]).
distribution q new (z), solve the problem
maximize L(q new , θ) = Ez∼qnew (z) [ln p(x, z|θ)] . D. Advanced Learning Methods
θ
(19) As discussed in the previous section, ML is generally
This optimization is akin to that carried out in prone to overfitting for supervised learning. For unsu-
the corresponding supervised learning problem with pervised learning, the performance of ML depends on
known latent variables z with the difference that the task of interest. For example, consider the tasks of
these are randomly selected from the fixed varia- density estimation or of generation of new samples (see
tional distribution q new (z) obtained in the E step. Sec. V-A). In order to illustrate some of the typical issues
Given that the EM algorithm maximizes at each step encountered when applying the ML criterion, in Fig. 14
a lower bound on the log-likelihood that is tight at the we report a numerical result for a problem in which
current iterate θold , EM guarantees decreasing objective the true data distribution p(x) is multi-modal and the
values along the iterations, which ensures convergence model distribution p(x|θ) is assumed to be a mixture
to a local optimum of the original problem. We refer to of Gaussians, i.e., a directed generative model. The
[57], [58] for detailed examples. ML problem is tackled by using EM based on samples
The EM algorithm is generally impractical for large- generated from the true distribution (see [19] for details).
scale problems due to the complexity of computing the The learned distribution is seen to be a rather“blurry”
posterior of the latent variables in the E step and of estimate that misses the modes of p(x) in an attempt
averaging over such distribution in the M step. Many of being inclusive of the full support of p(x). Being a
state-of-the-art solutions to the problem of unsupervised poor estimate of the true distribution, the learned model
15
0.35 𝑥~𝑝(𝑥)
𝑝 𝑥 if 𝑇 𝑥 large
0.3
𝑇(𝑥) 𝑞 𝑥 if 𝑇 𝑥 small
0.25
𝑥~𝑞(𝑥)
0.2
discriminator
0
-5 0 5
the latter hypothesis otherwise. Intuitively, one should
expect that, the more distinct the two distributions p(x)
Fig. 14. Illustration of the limitations of ML unsupervised learning,
here obtained via the EM algorithm: The ML solution tends to be
and q(x) are, the easier it is to design a discriminator
blurry, missing the modes of the true distribution p(x). that is able to choose the correct hypothesis with high
probability.
The connection between the hypothesis testing prob-
can clearly also be problematic for sample generation in lem in Fig. 15 and the KL divergence becomes evident
the sense that samples generated from the model would if one recalls that the LLR ln(p(x)/q(x)) is known
tend to be quite different from the data samples. In the to be the best statistic T (x) in the Neyman-Pearson
rest of this section, we briefly review advanced learning sense [63]. The KL divergence is hence associated to
methods that address this limitation of ML. a particular way of evaluating the performance of the
In order to move beyond ML, we first observe that discriminator between the two distributions. Considering
ML can be proven to minimize the KL divergence a broader formulation of the problem of designing the
pD (x) discriminator in Fig. 15, one can generalize the notion
KL(pD (x)||p(x|θ)) = Ez∼pD (x) ln (20)
p(x|θ) of KL divergence to the class of f -divergences. These
between the empirical distribution, or histogram, of the are defined as
data
N [x]
pD (x) = , (21) Df (p||q) = max Ex∼p(x) [T (x)] − Ex∼q(x) [g(T (x))],
N T (x)
where N [x] counts the number of occurrences of value (22)
x in the data, and the parameterized model distribution for some concave increasing function g(·). The expres-
p(x|θ). In other words, ML fits the model to the his- sion above can be interpreted as measuring the perfor-
togram of the data by using the KL divergence as a mance of the best discriminator T (x) when the design
measure of fitness. Indeed, as mentioned in Sec. V-C, the criterion is given by the right-hand side of (22), i.e.,
KL divergence is a quantitative measure of “difference” Ex∼p(x) [T (x)] − Ex∼q(x) [g(T (x))], for a given function
between two distributions. More precisely, as per (20), g(·). Note that this criterion is indeed larger for a
the KL divergence KL(p||q) quantifies the difference discriminator that is able to output a large value of the
between two distributions p(x) and q(x) by evaluating statistic T (x) under p(x) and a small value under q(x).
the average of the LLR ln(p(x)/q(x)) with respect to The KL divergence corresponds to a specific choice of
p(x). such function (see [19] for details).
Consider now the problem illustrated in Fig. 15, in In order to move beyond ML, one can then consider
which a discriminator wishes to distinguish between two fitting the model distribution to the data histogram by
hypotheses, namely the hypothesis that the data x is a using a divergence measure that is tailored to the data and
sample from distribution p(x) and the hypothesis that it that captures the features of the empirical distribution
is instead generated from q(x). To fix the ideas, one can that are most relevant for a given application. Such
focus as an example on the case where p(x) and q(x) a divergence measure can be obtained by choosing a
are two Gaussian distributions with different means. To suitable function g(·) in (22) and by optimizing (22) over
this end, the discriminator computes a statistic, that is, a parameterized (differentiable) discriminator function
a function, T (x) of the data x, and then decides for the Tϕ (x). Integrating the evaluation of the divergence with
former hypothesis if T (x) is sufficiently large and for the problem of learning the model parameters yields the
16
min-max problem Autoencoders can also be used for their capacity to
denoise the input signal by means of filtering through
min max Ex∼pD (x) [Tϕ (x)] − Ex∼p(x|θ) [g(Tϕ (x))]. (23)
θ ϕ the lower dimensional representation z . This is done
This can be famously interpreted as a game between the in [70] for the task of localization on the basis of the
learner, which optimizes the model parameters θ, and the received baseband signal. To this end, an autoencoder
discriminator, which tries to find the best function Tϕ (x) is learned for every reference position in space with the
to distinguish between data and generated samples. The objective of denoising signals received from the given
resulting method, known as GAN, has recently led to location. At test time, the location that corresponds to
impressive improvements of ML for sample generation the autoencoder with the smallest reconstruction error is
[64]. taken as an estimate of the unknown transmitting device.
We now review some applications of the generative
VI. A PPLICATIONS OF U NSUPERVISED L EARNING TO models illustrated in Fig. 11(a). A natural idea is that
C OMMUNICATION S YSTEMS of using generative models to learn how to generate
samples from a given channel [71], [72]. This approach
In this section, we highlight some applications of
is sound for scenarios that lack tractable channel models.
unsupervised learning to communication networks.
As a pertinent example, generative models can be used to
mimic and identify non-linear channels for satellite com-
A. At the Edge munications [2]. The early works on the subject carried
1) Physical Layer: Let us first consider some appli- out in the nineties are also notable for the integration
cations of autoencoders at the physical layer as imple- of the domain knowledge into the definition of machine
mented by the network edge nodes. A fundamental idea learning models (see Sec. IV). In fact, mindful of the
is to treat the chain of encoder, channel, and decoder in strong linear components of the channels, these works
a communication link as an autoencoder, where, with posit a learnable model that includes linear filters and
reference to Fig. 11(d), the input message is x, the non-linearities [2].
transmitted codewords and received signals represent Another approach that can be considered as unsu-
the intermediate representation z , and the output of the pervised was proposed in [73] in order to solve the
decoder should match the input [30]. Note that, for this challenging problem of power control for interference
particular autoencoder, the mapping p(x|z) can only be channels. The approach tackles the resulting algorithm
partially learned, as it includes not only the encoder but deficit by means of a direct optimization of the sum-rate
also the communication channel, while the conditional with the aim of obtaining the power allocation vector (as
distribution p(x|z) defining the decoder can be learned. fractions of the maximal available powers) at the output
We should now ask when this viewpoint can be beneficial of a neural network. Related supervised learning methods
in light of the criteria reviewed in Sec. I-C. were discussed in Sec. IV. A similar approach – also
To address this question, one should check whether based on the idea of directly maximizing the criterion
a model or algorithm deficit exists to justify the use of of interest so as to obtain an approximate solution at the
machine learning tools. Training an autoencoder requires output of a neural network – was considered in [74] for
the availability of a model for the channel, and hence minimum mean squared error channel estimation with
a model deficit would make this approach inapplicable non-Gaussian channels, e.g., multi-path channels.
unless further mechanisms are put in place (see below). 2) Medium Access Layer: At the medium access
Examples of algorithm deficit include channels with layer, generative models have been advocated in [75] as a
complex non-linear dynamical models, such as optical way to generate new examples so as to augment a data
links [65]; Gaussian channels with feedback, for which set used to train a classifier for spectrum sensing (see
optimal practical encoding schemes are not known [66]; Sec. IV). An unsupervised learning task that has found
multiple access channels with sparse transmission codes many applications in communications is clustering. For
[67]; and joint source-channel coding [68]. example, in [76], clustering is used to support radio
Other applications at the physical layer leverage the resource allocation in a heterogeneous network.
use of autoencoders as compressors (see Sec. V-B) or
denoisers. For channels with a complex structure with B. At the Cloud
unavailable channel models or with unknown optimal 1) Network Layer: Another typical application of
compression algorithms, autoencoders can be used to clustering is to enable hierarchical clustering for routing
compress channel state information for the purpose in self-organizing multi-hop networks. Thanks to cluster-
of feedback on frequency-division duplex links [69]. ing, routing can be carried out more efficiently by routing
17
first at the level of clusters, and then locally within of work investigates the resulting interplay between
each cluster [77]. For an application of the unsupervised computation and communication [80].
learning task of density estimation, consider the problem
of detecting anomalies in networks. For instance, by R EFERENCES
learning the typical distribution of the features of a [1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed,
working link, one can identify malfunctioning ones. This N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath
et al., “Deep neural networks for acoustic modeling in speech
approach may be applied, e.g., to optical networks [54]. recognition: The shared views of four research groups,” IEEE
2) Application Layer: Finally, we point to two in- Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
stances of unsupervised learning at the application layer [2] M. Ibnkahla, “Applications of neural networks to digital
communications–a survey,” Signal processing, vol. 80, no. 7,
that are usually carried out at data centers in the cloud. pp. 1185–1215, 2000.
These tasks follow a conceptually different approach [3] H. J. Levesque, Common Sense, the Turing Test, and the Quest
as they are based on discovering structure in graphs. for Real AI: Reflections on Natural and Artificial Intelligence.
The first problem is community detection in social MIT Press, 2017.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning
networks. This amounts to a clustering problem whereby internal representations by error propagation,” California Univ
one wishes to isolate communities of nodes in a social San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
graph on the basis of the observation of a realization of [5] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum
likelihood from incomplete data via the em algorithm,” Journal
the underlying true graph of relationships [78]. Another
of the royal statistical society. Series B (methodological), pp.
application is the ranking of webpages based on the 1–38, 1977.
graph of hyperlinks carried out by PageRank [19], [79]. [6] C. Watkins, “Learning form delayed rewards,” Ph. D. thesis,
King’s College, University of Cambridge, 1989.
[7] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep
VII. C ONCLUDING R EMARKS learning. MIT press Cambridge, 2016, vol. 1.
[8] J. Pearl and D. Mackenzie, The Book of Why: The New Science
In the presence of modelling or algorithmic deficien- of Cause and Effect. Basic Books, 2018.
cies in the conventional engineering flow based on the [9] M. A. Alsheikh, S. Lin, D. Niyato, and H.-P. Tan, “Machine
learning in wireless sensor networks: Algorithms, strategies,
acquisition of domain knowledge, data-driven machine and applications,” IEEE Communications Surveys & Tutorials,
learning tools can speed up the design cycle, reduce vol. 16, no. 4, pp. 1996–2018, 2014.
the complexity and cost of implementation, and improve [10] C. Jiang, H. Zhang, Y. Ren, Z. Han, K.-C. Chen, and L. Hanzo,
over the performance of known algorithms. To this end, “Machine learning paradigms for next-generation wireless net-
works,” IEEE Wireless Communications, vol. 24, no. 2, pp. 98–
machine learning can leverage the availability of data 105, 2017.
and computing resources in many engineering domains, [11] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep Learning
including modern communication systems. Supervised, in Physical Layer Communications,” ArXiv e-prints, Jul. 2018.
[12] S. Lin and D. J. Costello, Error control coding. Pearson
unsupervised, and reinforcement learning paradigms lend
Education India, 2001.
themselves to different tasks depending on the availabil- [13] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep
ity of examples of desired behaviour or of feedback. learning-based channel decoding,” in CISS 2017, 2017, pp. 1–6.
The applicability of learning methods hinges on specific [14] S. Shalev-Shwartz and S. Ben-David, Understanding machine
learning: From theory to algorithms. Cambridge university
features of the problem under study, including its time press, 2014.
variability and its tolerance to errors. As such, a data- [15] D. Arpit, S. Jastrzȩbski, N. Ballas, D. Krueger, E. Bengio, M. S.
driven approach should not be considered as a universal Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and
solution, but rather as a useful tool whose suitability S. Lacoste-Julien, “A Closer Look at Memorization in Deep
Networks,” ArXiv e-prints, Jun. 2017.
should be assessed on a case-by-case basis. Further- [16] T. Hastie, R. Tibshirani, and J. Friedman, “Unsupervised learn-
more, machine learning tools allow for the integration ing,” in The elements of statistical learning. Springer, 2009,
of traditional model-based engineering techniques and pp. 485–585.
[17] R. S. Sutton, A. G. Barto et al., Reinforcement learning: An
of existing domain knowledge in order to leverage the introduction. MIT press, 2018.
complementarity and synergy of the two solutions (see [18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
Fig. 2). Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershel-
As a final note, while this paper has focused on appli- vam, M. Lanctot et al., “Mastering the game of go with deep
neural networks and tree search,” Nature, vol. 529, no. 7587,
cations of machine learning to communication systems, p. 484, 2016.
communication is conversely a key element of distributed [19] O. Simeone, “A brief introduction to machine learning for en-
machine learning platforms. In these systems, learning gineers,” Foundations and Trends in Signal Processing, vol. 12,
no. 3-4, pp. 200–431, 2018.
tasks are carried out at distributed machines that need
[20] E. Brynjolfsson and T. Mitchell, “What can machine learning
to coordinate via communication, e.g., by transferring do? Workforce implications,” Science, vol. 358, no. 6370, pp.
the results of intermediate computations. A recent line 1530–1534, 2017.
18
[21] S. Kannan, H. Kim, and S. Oh, “Deep learning and information [41] H. Agirman-Tosun, Y. Liu, A. M. Haimovich, O. Simeone,
theory: An emerging interface,” IEEE ISIT 2018 Tutorial. W. Su, J. Dabin, and E. Kanterakis, “Modulation classification
[22] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. of mimo-ofdm signals by independent component analysis and
Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: support vector machines,” in Proc. ASILOMAR 2011, 2011, pp.
A neuromorphic manycore processor with on-chip learning,” 1903–1907.
IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018. [42] S.-H. Fang and T.-N. Lin, “Indoor location system based on
[23] A. Bagheri, O. Simeone, and B. Rajendran, “Training proba- discriminant-adaptive neural network in ieee 802.11 environ-
bilistic spiking neural networks with first-to-spike decoding,” ments,” IEEE Transactions on Neural networks, vol. 19, no. 11,
arXiv preprint arXiv:1710.10704, 2017. pp. 1973–1978, 2008.
[24] J. Chen, L. Song, M. J. Wainwright, and M. I. Jordan, “Learn- [43] Q. Wang, H. Li, Z. Chen, D. Zhao, S. Ye, and J. Cai, “Su-
ing to explain: An information-theoretic perspective on model pervised and Semi-Supervised Deep Neural Networks for CSI-
interpretation,” arXiv preprint arXiv:1802.07814, 2018. Based Authentication,” ArXiv e-prints, Jul. 2018.
[25] M. Polese, R. Jana, V. Kounev, K. Zhang, S. Deb, and M. Zorzi, [44] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropou-
“Machine Learning at the Edge: A Data-Driven Architecture los, “Learning to optimize: Training deep neural networks for
with Applications to 5G Cellular Networks,” ArXiv e-prints, wireless resource management,” in IEEE Signal Processing
Aug. 2018. Advances in Wireless Communications (SPAWC) 2017, 2017,
[26] G. Paschos, E. Bastug, I. Land, G. Caire, and M. Debbah, pp. 1–6.
“Wireless caching: Technical misconceptions and business bar- [45] A. Zappone, M. Di Renzo, M. Debbah, T. T. Lam, and X. Qian,
riers,” IEEE Communications Magazine, vol. 54, no. 8, pp. 16– “Model-Aided Wireless Artificial Intelligence: Embedding Ex-
22, 2016. pert Knowledge in Deep Neural Networks Towards Wireless
[27] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, Systems Optimization,” ArXiv e-prints, Aug. 2018.
“Machine learning for wireless networks with artificial in- [46] J. Shu, Z. Xu, and D. Meng, “Small Sample Learning in Big
telligence: A tutorial on neural networks,” arXiv preprint Data Era,” ArXiv e-prints, Aug. 2018.
arXiv:1710.02913, 2017. [47] A. Balatsoukas-Stimming, “Non-linear digital self-interference
[28] M. Angjelichinoski, K. F. Trillingsgaard, and P. Popovski, cancellation for in-band full-duplex radios using neural net-
“A statistical learning approach to ultra-reliable low latency works,” arXiv preprint arXiv:1711.00379, 2017.
communication,” arXiv preprint arXiv:1809.05515, 2018. [48] N. Strodthoff, B. Göktepe, T. Schierl, C. Hellge, and W. Samek,
[29] M. Seeger, “A taxonomy for semi-supervised learning methods,” “Enhanced Machine Learning Techniques for Early HARQ
MIT Press, Tech. Rep., 2006. Feedback Prediction in 5G,” ArXiv e-prints, Jul. 2018.
[30] T. J. O’Shea and J. Hoydis, “An introduction to machine [49] V. K. Tumuluru, P. Wang, and D. Niyato, “A neural net-
learning communications systems,” arXiv preprint, vol. 1702, work based spectrum prediction scheme for cognitive radio,”
2017. in IEEE International Conference on Communications (ICC
[31] N. Farsad and A. Goldsmith, “Neural network detection of 2010), 2010, pp. 1–5.
data sequences in communication systems,” arXiv preprint [50] D. Del Testa, M. Danieletto, G. M. Di Nunzio, and M. Zorzi,
arXiv:1802.02046, 2018. “Estimating the number of receiving nodes in 802.11 networks
[32] S. Bouchired, D. Roviras, and F. Castanié, “Equalisation of via machine learning techniques,” in IEEE Global Communica-
satellite mobile channels with neural network techniques,” tions Conference (GLOBECOM), 2016, pp. 1–7.
Space Communications, vol. 15, no. 4, pp. 209–220, 1998. [51] H. Okamoto, T. Nishio, K. Nakashima, Y. Koda, K. Ya-
[33] Y. Wang, M. Martonosi, and L.-S. Peh, “A supervised learning mamoto, M. Morikura, Y. Asai, and R. Miyatake, “Machine-
approach for routing optimizations in wireless sensor networks,” learning-based future received signal strength prediction using
in Proc. Int. Workshop on Multi-hop ad hoc Networks. ACM, depth images for mmwave communications,” arXiv preprint
2006, pp. 79–86. arXiv:1803.09698, 2018.
[34] G. De Veciana and A. Zakhor, “Neural net-based continuous [52] M. Chen, W. Saad, C. Yin, and M. Debbah, “Echo state
phase modulation receivers,” IEEE Transactions on Communi- networks for proactive caching in cloud-based radio access
cations, vol. 40, no. 8, pp. 1396–1408, 1992. networks with mobile users,” IEEE Transactions on Wireless
[35] X. Jin and H.-N. Kim, “Deep Learning Detection Networks in Communications, vol. 16, no. 6, pp. 3520–3535, 2017.
MIMO Decode-Forward Relay Channels,” ArXiv e-prints, Jul. [53] M. Zorzi, A. Zanella, A. Testolin, M. D. F. De Grazia, and
2018. M. Zorzi, “Cognition-based networks: A new perspective on
[36] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Bur- network optimization using learning and distributed intelli-
shtein, and Y. Be’ery, “Deep learning methods for improved gence,” IEEE Access, vol. 3, pp. 1512–1530, 2015.
decoding of linear codes,” IEEE Journal of Selected Topics in [54] F. Musumeci, C. Rottondi, A. Nag, I. Macaluso, D. Zibar,
Signal Processing, vol. 12, no. 1, pp. 119–131, 2018. M. Ruffini, and M. Tornatore, “A survey on application of ma-
[37] L. Lugosch and W. J. Gross, “Neural offset min-sum decoding,” chine learning techniques in optical networks,” arXiv preprint
in IEEE int. Symp. Information Theory (ISIT 2017). IEEE, arXiv:1803.07976, 2018.
2017, pp. 1361–1365. [55] F. Tang, B. Mao, Z. M. Fadlullah, N. Kato, O. Akashi, T. Inoue,
[38] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, “Scaling and K. Mizutani, “On removing routing protocol from future
deep learning-based decoding of polar codes via partitioning,” wireless networks: A real-time deep learning approach for intel-
in IEEE GLOBECOM 2017, 2017, pp. 1–6. ligent traffic control,” IEEE Wireless Communications, vol. 25,
[39] S. Schibisch, S. Cammerer, S. Dörner, J. Hoydis, and S. t. no. 1, pp. 154–160, 2018.
Brink, “Online label recovery for deep learning-based com- [56] T. T. Nguyen and G. Armitage, “A survey of techniques for
munication through error correcting codes,” arXiv preprint internet traffic classification using machine learning,” IEEE
arXiv:1807.00747, 2018. Communications Surveys & Tutorials, vol. 10, no. 4, pp. 56–76,
[40] F. Liang, C. Shen, and F. Wu, “An iterative bp-cnn architecture 2008.
for channel decoding,” IEEE Journal of Selected Topics in [57] C. M. Bishop, Pattern recognition and machine learning.
Signal Processing, vol. 12, no. 1, pp. 144–159, Feb 2018. springer, 2006.
19
[58] K. P. Murphy, Machine learning: a probabilistic perspective. [80] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Redundancy
MIT press, 2012. techniques for straggler mitigation in distributed optimization
[59] T. M. Cover and J. A. Thomas, Elements of information theory. and learning,” arXiv preprint arXiv:1803.05397, 2018.
John Wiley & Sons, 2012.
[60] O. Simeone, “Introducing information measures via inference
[lecture notes],” IEEE Signal Processing Magazine, vol. 35,
no. 1, pp. 167–171, 2018.
[61] Y. Sun, P. Babu, and D. P. Palomar, “Majorization-minimization
algorithms in signal processing, communications, and machine
learning,” IEEE Transactions on Signal Processing, vol. 65,
no. 3, pp. 794–816, 2017.
[62] A. Mnih and K. Gregor, “Neural variational inference and
learning in belief networks,” arXiv preprint arXiv:1402.0030,
2014.
[63] H. V. Poor, An introduction to signal detection and estimation.
Springer Science & Business Media, 2013.
[64] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial
networks,” arXiv preprint arXiv:1701.00160, 2016.
[65] B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow,
D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end deep
learning of optical fiber communications,” arXiv preprint
arXiv:1804.04097, 2018.
[66] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath,
“Deepcode: Feedback codes via deep learning,” arXiv preprint
arXiv:1807.00801, 2018.
[67] M. Kim, N. I. Kim, W. Lee, and D. H. Cho, “Deep learning-
aided scma,” IEEE Communications Letters, vol. 22, no. 4, pp.
720–723, April 2018.
[68] E. Bourtsoulatze, D. Burth Kurka, and D. Gunduz, “Deep Joint
Source-Channel Coding for Wireless Image Transmission,”
ArXiv e-prints, Sep. 2018.
[69] C.-K. Wen, W.-T. Shih, and S. Jin, “Deep learning for massive
mimo csi feedback,” IEEE Wireless Communications Letters,
2018.
[70] C. Xiao, D. Yang, Z. Chen, and G. Tan, “3-d ble indoor
localization based on denoising autoencoder,” IEEE Access,
vol. 5, pp. 12 751–12 760, 2017.
[71] T. J. O’Shea, T. Roy, and N. West, “Approximating the void:
Learning stochastic channel models from observation with
variational generative adversarial networks,” arXiv preprint
arXiv:1805.06350, 2018.
[72] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, “Channel
agnostic end-to-end learning based communication systems
with conditional gan,” arXiv preprint arXiv:1807.00447, 2018.
[73] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards Optimal Power
Control via Ensembling Deep Neural Networks,” ArXiv e-prints,
Jul. 2018.
[74] D. Neumann, T. Wiese, and W. Utschick, “Learning the mmse
channel estimator,” IEEE Transactions on Signal Processing,
2018.
[75] K. Davaslioglu and Y. E. Sagduyu, “Generative ad-
versarial learning for spectrum sensing,” arXiv preprint
arXiv:1804.00709, 2018.
[76] A. Abdelnasser, E. Hossain, and D. I. Kim, “Clustering and
resource allocation for dense femtocells in a two-tier cellular
ofdma network,” IEEE Transactions on Wireless Communica-
tions, vol. 13, no. 3, pp. 1628–1641, 2014.
[77] A. A. Abbasi and M. Younis, “A survey on clustering algo-
rithms for wireless sensor networks,” Computer communica-
tions, vol. 30, no. 14-15, pp. 2826–2841, 2007.
[78] E. Abbe, A. S. Bandeira, and G. Hall, “Exact recovery in the
stochastic block model,” arXiv preprint arXiv:1405.3267, 2014.
[79] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank
citation ranking: Bringing order to the web.” Stanford InfoLab,
Tech. Rep., 1999.
20