A Survey of Preference-Based Reinforcement Learning Methods
Abstract
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires substantial task-specific prior knowledge. The designer needs to consider different objectives that not only influence the learned behavior but also the learning progress. To alle-
viate these issues, preference-based reinforcement learning algorithms (PbRL) have been
proposed that can directly learn from an expert’s preferences instead of a hand-designed
numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, to learn from non-numeric rewards, and to reduce the dependence on expert knowledge. We provide a unified framework for PbRL
that describes the task formally and points out the different design principles that affect
the evaluation task for the human as well as the computational complexity. The design
principles include the type of feedback that is assumed, the representation that is learned
to capture the preferences, the optimization problem that has to be solved as well as how
the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings
of current algorithms, propose open research questions and briefly survey practical tasks
that have been solved using PbRL.
Keywords: Reinforcement Learning, Preference Learning, Qualitative Feedback, Markov
Decision Process, Policy Search, Temporal Difference Learning, Preference-based Reinforce-
ment Learning
1. Introduction
Due to its solid theoretical foundations, reinforcement learning (RL) has become a very
attractive tool that has seen many successful applications. It achieved one of the first widely recognized successes of machine learning when TD-Gammon learned to play backgammon (Tesauro, 1995), and
its applications, which used to focus on games and robotics (Mnih et al., 2015; Kober et al., 2013), now extend to diverse problems in industrial and medical domains (Wang and Usher, 2005; Zhao et al., 2011). However, the success of reinforcement learning applications
often crucially depends on the prior knowledge that is put into the definition of the reward
function. Although a reward function is a transferable and compact definition of the learning
task, the learned policy may be sensitive to small changes of the reward, possibly yielding
very different behaviors, depending on the relative values of the rewards. Thus, the choice
of the reward function may have a crucial impact on the success of the learner and many
tasks, in particular in robotics, require a high amount of reward-engineering. Four problems
are mainly responsible for the difficulties in reward engineering.
(1) Reward Hacking: The agent may maximize the given reward, without performing the
intended task (Amodei et al., 2016). For example, consider a robotic vacuum cleaner
obtaining a positive reward in case all dirt has been removed. As the robot only
observes dirt via camera, it may try to cover up the dirt, effectively removing it from
its sensor space.
(2) Reward Shaping: In many applications, the reward does not only define the goal but
also guides the agent to the correct solution, which is also known as reward shaping
(Ng et al., 1999). Striking a balance between the goal definition and the guidance task
is also often very difficult for the experimenter.
(3) Infinite rewards: Some applications require infinite rewards, for instance, if certain
events should be avoided at any cost. For example, in the cancer treatment problem
(Zhao et al., 2009), the death of a patient should be avoided at any cost. However,
an infinitely negative reward breaks classic reinforcement learning algorithms, and arbitrary, finite values have to be selected.

(4) Multi-objective trade-offs: Many tasks involve multiple, possibly conflicting objectives that have to be combined into a single, scalar reward, which requires fixing the trade-off between the objectives in advance.
The reward shaping problem has especially gained attention in recent years, as it is closely related to the intrinsic vs. extrinsic motivation problem (Barto, 2013). An optimal reward would
define an extrinsic motivation, e.g., for achieving a defined goal, while also motivating the
agent to perform useful actions in states where extrinsic motivation is absent. Hence, the
intrinsic motivation can be seen as a guidance method. Several methods have been proposed
to learn such a reward automatically, but usually only using an extrinsic goal description
(Singh et al., 2009, 2004). However, experts may have an intuition about useful actions or
intrinsic motivation that should be used to guide the agent. Defining such a motivation can
be challenging, especially in terms of scalar, numeric rewards, and several publications show that an agent's performance can be very sensitive to the chosen values (Singh et al., 2009; Lu
et al., 2016).
Preference-based reinforcement learning (PbRL) is a paradigm for learning from non-
numerical feedback in sequential domains. Its key idea is that the requirement for a numer-
ical feedback signal is replaced with the assumption of a preference-based feedback signal
that indicates relative instead of absolute utility values. Preferences enable a definition of
feedback that is not subject to arbitrary reward choices, reward shaping, reward engineering
or predefined objective trade-offs. PbRL aims at rendering reinforcement learning applicable
to a wider spectrum of tasks and non-expert users. Several algorithms have been developed
for solving this class of problems, however, the field still lacks a coherent framework. Fur-
thermore, most publications do not explicitly state the assumptions that are relevant for the
application of the algorithms. In this paper, we will survey and categorize preference-based
formulations of reinforcement learning and make these assumptions explicit. We also show
the relation to other RL-based settings that deviate from the basic setting, such as inverse
reinforcement learning or learning with advice.
This paper is based on a preliminary survey (Wirth and Fürnkranz, 2013c), but has
been extended in almost every aspect. In particular, the design principles have been re-
worked to better capture all relevant aspects, we provide substantially more depth about
the differences, advantages and disadvantages of the surveyed methods, and we considerably
broadened the scope of the survey. The paper starts with the problem setting and the task
description in Sec. 2, including a short discussion of related approaches. We then explain
the design principles of the algorithms that fit in this framework in Sec. 3. Experimental do-
mains, including some real-world problems that have been solved with PbRL, are described
in Sec. 4. Open research questions are pointed out in Sec. 5 before the paper concludes.
2. Preliminaries
Preference-based reinforcement learning algorithms try to solve the reinforcement learning
problem (Sutton and Barto, 1998) using preferences between states, actions or trajectories.1
The goal is to learn a policy that is most consistent with the preferences of the expert.
Hence, a PbRL algorithm must determine the intention of the expert and find a policy
that is maximally consistent with it. For improving its estimate, the algorithms create
trajectories which are evaluated by the expert, thereby, exploring the state-action space.
Therefore, many algorithms use the policy evaluation/policy improvement cycle (Sutton
and Barto, 1998) encountered in classical reinforcement learning. Yet, this cycle differs in
the evaluation method from classical RL as we cannot obtain and propagate numeric feedback,
but can only use pairwise preferences.
1. State and action preferences can be mapped to trajectory preferences, as we discuss in Sec. 3.1
In both settings, the learner does not receive the correct target choice, but instead feedback about a single option (in reinforcement learning) or a comparison between two options (in preference learning).
A preference z_i ≻ z_j denotes that z_i is preferred over z_j for two choices z_i and z_j in a set of possible choices Z. Several short-hand notations can be used (Kreps, 1988).
Depending on what constitutes the set of choices Z, Fürnkranz and Hüllermeier (2010)
distinguish between learning from object preferences, where the training data is given in
the form of pairwise comparisons between data objects, and learning from label preferences,
where the training data is given in the form of pairwise comparisons between labels that
are attached to the objects. In the former case, a common performance task is to rank a
new set of objects (object ranking; Kamishima et al. 2010), whereas in the latter case, the
performance task is to rank the set of labels for a new object (label ranking; Vembu and
Gärtner 2010). For example, in a reinforcement learning context, states may be considered
as objects, and actions may be considered as labels, so that the decision problem of ranking
the states according to their desirability is an object ranking problem, whereas ranking the
available actions in a given state may be considered as a label ranking problem (Fürnkranz
et al., 2012).
There are two main approaches to learning representations of preferences, namely utility
functions, which evaluate individual alternatives, and preference relations, which compare
pairs of competing alternatives. From a machine learning point of view, the two approaches
give rise to two kinds of learning problems, to which we will return in Sec. 3.2.
Some algorithms (Wirth et al., 2016; Wilson et al., 2012; Akrour et al., 2011, 2012; Jain et al., 2013, 2015; Gritsenko and Berenson, 2014; Zucker et al., 2010) also require the definition of a feature-based description of a trajectory, which we denote as ψ(τ).
As in the standard MDP definition, the state transition model δ(s′ | s, a) is assumed
to be stochastic and the parameter γ ∈ [0, 1) is the discount factor. A policy π(a | s) is
a conditional distribution that assigns probabilities to action choices based on the current
state. The policy is typically given by a parametric distribution π(a | s, θ), however, non-
parametric approaches are also considered (Akrour et al., 2012; Wirth et al., 2016; Busa-
Fekete et al., 2013, 2014).
In contrast to conventional MDPs, we do not assume a numeric reward signal r(s, a).
Instead, the agent can observe a preference relation over trajectories τ_i ≻ τ_j. Later, in
Sec. 3.1, we will also consider preferences between states and actions, and show that they
can be reduced to preferences between trajectories. We further assume that a preference
for a given pair of trajectories is generated stochastically with a probability distribution ρ
because the expert can err and therefore introduce noise. We use ρ(τ_i ≻ τ_j) for denoting the probability with which τ_i ≻ τ_j holds for a given pair of trajectories (τ_i, τ_j). The distribution ρ is typically not known, yet the agent can observe a set of preferences ζ = {τ_i1 ≻ τ_i2}_{i=1,...,N} which has been sampled using ρ. Υ denotes the set of all trajectories that are part of ζ. A key problem in PbRL is to obtain a representative set of preferences ζ. Two common simplifications are to disregard the stochasticity of the expert and to assume strict preferences. The strict preference assumption implies a total order, i.e., for each pair τ_i and τ_j, exactly one of the two strict preference relations holds. Phrased otherwise, ρ(τ_i ≻ τ_j) = 1 − ρ(τ_j ≻ τ_i).
As a consequence, there are no incomparable pairs of trajectories. Incomparability can occur
when evaluating trajectories based on multiple criteria, where it is impossible to improve one criterion without decreasing another one. If there are incomparable pairs where no preference
relation can be defined, the preferences form a partial order.
2.3 Objective
The general objective of the agent is to find a policy π ∗ that maximally complies with the
given set of preferences ζ. A preference τ_1 ≻ τ_2 ∈ ζ is satisfied if the preferred trajectory has a higher probability of being realized by the policy, i.e.,

Pr_π(τ_1) > Pr_π(τ_2),    (2)

with

Pr_π(τ) = μ(s_0) ∏_{t=0}^{|τ|} π(a_t | s_t) δ(s_{t+1} | s_t, a_t),
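To make the objective concrete, the following Python sketch computes the realization probability Pr_π(τ) for a small discrete MDP and applies the satisfaction test of Eq. (2); it is a toy illustration only, and all quantities (μ, π, δ and the two trajectories) are invented for this example.

```python
# Toy illustration of Pr_pi(tau) and the preference-satisfaction test; all
# quantities below are assumptions made up for this example.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
mu = np.array([1.0, 0.0, 0.0])                       # initial state distribution mu(s_0)
pi = np.full((n_states, n_actions), 0.5)             # uniform policy pi(a | s)
delta = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # delta[s, a, s']

def realization_prob(states, actions):
    """Pr_pi(tau) = mu(s_0) * prod_t pi(a_t | s_t) * delta(s_{t+1} | s_t, a_t)."""
    prob = mu[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a] * delta[states[t], a, states[t + 1]]
    return prob

# Two trajectories given as (state sequence, action sequence).
tau_1 = ([0, 1, 2], [0, 1])
tau_2 = ([0, 2, 1], [1, 0])
# The preference tau_1 > tau_2 is satisfied if tau_1 is more likely to be realized (Eq. 2).
print(realization_prob(*tau_1) > realization_prob(*tau_2))
```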
Based on what is learned from the preferences, three main approaches can be distinguished (cf. Fig. 1):
• learning a policy computes a policy that tries to maximally comply with the prefer-
ences;
• learning a preference model learns a model for approximating the expert’s preference
relation, and
Figure 1: PbRL: Learning policies from preferences via direct (dashed path) and surrogate-
based (dotted path) approaches.
• learning a utility function estimates a numeric function for the expert’s evaluation
criterion.
The dashed path in Fig. 1 relates to policy learning. Both other approaches, learning either
a preference model or a utility function, are surrogate-based in that they learn a surrogate
from the obtained preferences, which is in turn used for deriving a maximizing policy (cf.
the dotted path in Fig. 1). For this optimization, often conventional reward-based policy
improvement (PI) algorithms can be used. All three cases are explained in more detail
in Sec. 3.2, but the result is always a new policy π_new that can be used to sample new
trajectories.
Typically, new trajectories are evaluated repeatedly, and hence, learning from preferences
becomes an interactive process between the expert and the algorithm, where the algorithm
presents new trajectory pairs to the expert and the expert evaluates them. However, in the
interactive setting, a criterion has to be defined for the selection of the preference queries
that should be presented to the expert, as described in Sec. 3.4. This criterion must also
resolve the exploration/exploitation trade-off as we can only obtain feedback for explicitly
requested preferences.
A single pass through the PbRL cycle is often performed in a batch setting, where
available trajectories and preference data are used to compute a policy offline.
Furthermore, as all trajectories are given by an expert, IRL approaches are not
able to obtain new feedback for trajectories and, hence, can not improve upon the expert
demonstrations.
There are three different types of preference feedback that can be found in the literature,
action, state and trajectory preferences. The most important difference of the preference
types is that they impose different challenges for the expert and the algorithm.
An action preference compares two actions for the same state. The action preference a_i1 ≻_s a_i2 expresses that in state s, action a_i1 should be preferred to a_i2.
We need to distinguish between short-term optimality (e.g. optimal, immediate reward
in terms of reinforcement learning) and long-term optimality (optimal, expected return).
Knox and Stone (2012) and Thomaz and Breazeal (2006) analyzed this problem in an ad-
vice setting and observed that feedback relating to immediate rewards is difficult to use
and feedback concerning the expected, long-term return should be preferred. This obser-
vation can be explained by the argument that it is difficult to aggregate short-term action
preferences, as they are only valid for a given state. It is unclear how the short-term prefer-
ences relate to long-term optimality as we are not able to trade off preferences for different states. Action preferences are demanding for the expert, as she needs to be familiar with the expected long-term outcome. Computationally, this problem is fairly simple as it is sufficient
to select the most preferred action in every state, which already implies the best long-term
outcome. Hence, only generalization issues remain. The action-preference-based approach
of Fürnkranz et al. (2012) also uses long-term action preferences, but not concerning an
unknown, optimal policy. They supply a policy for computing the long term expectation.
Hence, the expert does not need to be familiar with the expected, optimal policy but can
consider samples from the provided policy for its comparison.
A state preference s_i1 ≻ s_i2 determines that state s_i1 is preferred to state s_i2. A state
preference is equivalent to saying that there is an action in state si1 that is preferred to all
actions available in state si2 .
State preferences are more informative than action preferences as they define relations
between parts of the global state space. Yet, they also suffer from the long-term/short-
term optimality problem. Long-term state preferences define a clearly defined setting as it
is sufficient to discover the most preferred successor state for each state and maximize its
selection probability, as analyzed by Wirth and Fürnkranz (2012, 2015). Short-term state
preferences do not define a trade-off, because it is unclear whether visiting an undominated
state once should be preferred over visiting a rarely dominated state multiple times. Short-
term state preferences are used by Zucker et al. (2010) where they try to resolve the trade-off
problem using an approximated reward.
State preferences are slightly less demanding for the expert as it is not required to
compare actions for a given state. However, the expert still needs to estimate the future
outcome of the policy for a given state. State preferences do not directly imply a policy, but one can easily be derived with knowledge of the transition model, as explained in Sec. 3.2.3.
Approximating the Policy Distribution. Wilson et al. (2012) approximate the policy
distribution via a Bayesian likelihood function subject to the preference-based data proba-
bility. A parameterized policy space π(a | s, θ) is used for defining the policy distribution
Pr(π | ζ). Algorithm 1 shows how to collect the preferences. The basic strategy is to sam-
ple two policies, described by a parameter vector, from the posterior distribution Pr(π | ζ)
induced by the preferences. Each policy pair is used to create one or more trajectories,
defining a potential preference query by pairing the resulting trajectories. One of multiple
selection criteria, that we discuss in Sec. 3.4.3, is used to select the query to pose to the
expert. Variants of the criteria also depend on the policy that created the trajectory or
allow to stop the sampling process before m policy pairs are created.
The likelihood Pr(π | ζ) is modeled by comparing trajectories τ π that are realized by a
policy π with the preferred and dominated trajectories in ζ. The likelihood is high if the realized trajectories τ^π of the policy π are closer to the preferred trajectory, i.e.,

Pr(τ_1 ≻ τ_2 | π) = Φ( (E[d(τ_2, τ^π)] − E[d(τ_1, τ^π)]) / (√2 σ_p) ),    (4)
where the function Φ is the c.d.f of N (0, 1), which resembles a sigmoidal squashing function.
The parameter σp accounts for feedback noise to allow the expert to err. The function d
is a distance function comparing two trajectories. The policy distribution is then given by
applying Bayes' theorem, i.e.,

Pr(π | ζ) ∝ Pr(π) ∏_i Pr(τ_i1 ≻ τ_i2 | π),    (5)
and the posterior is approximated using Markov Chain Monte Carlo simulation (Andrieu
et al., 2003). This approach requires the specification of a meaningful distance function,
which is hard to define in many domains and requires a large amount of domain knowledge.
Wilson et al. (2012) use a Euclidean distance function. Yet, Euclidean distances are hard
to use in many domains such as those with high-dimensional continuous state spaces. Fur-
thermore, if the state space is not continuous it may be hard to define a suitable distance
measure d. The definition of a parametric policy also requires domain knowledge, although
good parametric policy representations are known for many domains, such as robotics (Kober
et al., 2013). Often, such policy representations reduce the dimensionality of the policy and
therefore the learning complexity.
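The following sketch illustrates this scheme with a simple Metropolis-Hastings sampler over the policy parameters, using a Euclidean trajectory distance in the likelihood of (4) and the posterior of (5). It is only a loose approximation of the approach of Wilson et al. (2012): trajectories are plain feature vectors, rollout() is an assumed toy simulator, and all parameter values are invented.

```python
# Hedged sketch of a distance-based policy posterior; all data are assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma_p = 0.5                                        # feedback-noise parameter of Eq. (4)

def rollout(theta):
    """Assumed simulator: the trajectory features realized by policy parameters theta."""
    return theta                                     # deterministic toy stand-in

def pref_likelihood(tau_pref, tau_dom, theta):
    """Eq. (4): Phi((E[d(tau_dom, tau_pi)] - E[d(tau_pref, tau_pi)]) / (sqrt(2) sigma_p))."""
    tau_pi = rollout(theta)
    d_pref = np.linalg.norm(tau_pref - tau_pi)       # Euclidean distance, as in Wilson et al.
    d_dom = np.linalg.norm(tau_dom - tau_pi)
    return norm.cdf((d_dom - d_pref) / (np.sqrt(2) * sigma_p))

def log_posterior(theta, prefs):
    """Eq. (5): log Pr(theta | zeta) up to a constant, with a Gaussian prior on theta."""
    log_prior = -0.5 * theta @ theta
    return log_prior + sum(np.log(pref_likelihood(tp, td, theta) + 1e-12) for tp, td in prefs)

# Preferences as (preferred, dominated) trajectory-feature pairs (toy data).
prefs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))] * 5

# Metropolis-Hastings over the policy parameters.
theta = np.zeros(2)
for _ in range(2000):
    proposal = theta + rng.normal(scale=0.2, size=theta.shape)
    if np.log(rng.uniform()) < log_posterior(proposal, prefs) - log_posterior(theta, prefs):
        theta = proposal
print(theta)   # the chain drifts towards policies whose rollouts are closer to the preferred trajectory
```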
Comparing and Ranking Policies. Alternatively, we can directly compare two policies
π1 and π2 by querying the expert for different trajectory pairs τ π1 and τ π2 that have been
realized by the two policies (Busa-Fekete et al., 2013, 2014). The method maintains a set of
policies Πi , described by their parameter vectors. The preference set ζ is used to compute a
ranking over policies, as shown in Algorithm 2. A confidence bound method that we discuss
in Sec. 3.4.1 determines which policies to select for generating the next trajectory pair for
which a preference query is posed.
The outcome of this comparison is then used to compute new policies that are more likely
to realize preferred trajectories. A policy π1 is preferred over a policy π2 if the generated
trajectories τ π1 are on expectation preferred over the trajectories τ π2 realized by the second
policy, i.e.,
Pr(π_1 ≻ π_2) = E_{τ^{π_1}, τ^{π_2}} [ρ(τ^{π_1} ≻ τ^{π_2})].    (6)
For an observed set of preferences ζ, the expectation in (6) can be approximated as
Pr(π_1 ≻ π_2) ≈ (1/N) ∑_{i=1}^{N} I(τ_i^{π_1} ≻ τ_i^{π_2}).    (7)
The objective is to find the set of optimal policies π* = arg max_π ∑_{π′∈Π} Pr(π ≻ π′). In
contrast to Wilson et al. (2012), no distance function is required. However, preferences can
not be reused as we need to obtain new preferences for each policy pair and, hence, a large number of preferences is required. Busa-Fekete et al. (2013, 2014) deal with incomparabilities and indifference by assuming that such pairs are equally preferable, i.e., ρ(τ_1 ≻ τ_2) = 0.5.
The optimization itself can be performed with algorithms similar to evolutionary direct
policy search (EDPS; Heidrich-Meisner and Igel 2009). Policy comparison approaches fol-
lowing the ideas of Busa-Fekete et al. (2013, 2014) can be seen as cases of preference-based
multi-armed bandit optimization, where each arm represents one policy. For a survey of
such preference-based bandit methods, we refer to Busa-Fekete and Hüllermeier (2014).
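A minimal sketch of this comparison scheme is given below: the probability Pr(π_1 ≻ π_2) is estimated from repeated preference queries as in (7), and the policies are ranked by their summed pairwise win probabilities. The expert, the rollouts, and the hidden policy qualities are all stand-in assumptions.

```python
# Minimal sketch of policy comparison via trajectory preferences (Eqs. 6-7);
# the policies, rollouts, and expert are toy assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(1)
policies = {"pi_a": 0.8, "pi_b": 0.5, "pi_c": 0.2}   # hidden "quality" per policy (assumption)

def sample_trajectory(policy_name):
    """Assumed rollout: returns a scalar outcome of one trajectory."""
    return rng.normal(loc=policies[policy_name], scale=0.3)

def expert_prefers(outcome_1, outcome_2):
    """Stand-in for the expert's stochastic preference rho(tau_1 > tau_2)."""
    return outcome_1 > outcome_2

def estimate_win_prob(p1, p2, n_queries=50):
    """Eq. (7): Pr(p1 > p2) ~ (1/N) * sum_i I(tau_i^p1 > tau_i^p2)."""
    wins = sum(expert_prefers(sample_trajectory(p1), sample_trajectory(p2))
               for _ in range(n_queries))
    return wins / n_queries

# Rank policies by their summed pairwise win probabilities (cf. the objective in Sec. 3.2.1).
scores = {p: 0.0 for p in policies}
for p1, p2 in itertools.permutations(policies, 2):
    scores[p1] += estimate_win_prob(p1, p2)
print(max(scores, key=scores.get))   # the policy winning most comparisons
```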
Instead of directly learning a policy, one can also try to learn a model C(a ≻ a′ | s) that predicts the expected preference relation between a and a′ for a given state s. The preference relation between actions can be used to obtain a ranking for actions given a state, from which a policy can be derived in turn. The problem of learning C(a ≻ a′ | s) can be phrased as
a classification problem trying to correctly predict all observed preferences. As shown in
Algorithm 3, preference-based approximate policy iteration obtains multiple trajectories for
each action in k sampled states (Fürnkranz et al., 2012). This allows deriving multiple action preferences by comparing trajectories, based on the initial actions. Action preferences are directly used as training data (s_i, a_i1) ≻ (s_i, a_i2) ⇔ (a_i1 ≻ a_i2 | s_i) for C. From these, a separate classifier C_ij(s) is trained for each action pair, which predicts whether a_i ≻ a_j will hold in a state s.
For deriving a ranking over all actions in a given state, the pairwise preference predictions
of the trained classifiers Cij may be combined via voting or weighted voting (Hüllermeier
et al., 2008), where each prediction issues a vote for its preferred action. The resulting count
k(s, a_i) for each action a_i in state s,

k(s, a_i) = ∑_{a_j ∈ A(s), a_j ≠ a_i} C(a_i ≻ a_j | s) = ∑_{a_j ∈ A(s), a_j ≠ a_i} C_ij(s),    (8)
correlates with ρ((s, a_i) ≻ (s, a_j)) if the classifiers C approximate ρ well. Hence, we can derive a policy

π*(a | s) = 1 if a = arg max_{a′} k(s, a′), and 0 else,
that maximizes the realization probability for the most preferred action, satisfying (2).
Wirth and Fürnkranz (2013a,b) also use action preferences in a similar way. However,
they treat C as a continuous function, returning a weighted vote for each action instead of
a binary vote. Hence, k(s, a) now also contains the uncertainty of the estimated preference,
which allows for more efficient exploration. In this case, the function C is defined by a
tabular representation with a single value for each state/action/action triplet, and updated
using a gradient-based method. Fürnkranz et al. (2012) use a neural network (Bishop,
1995) that is retrained completely in every iteration, allowing for generalization at the cost
of potential approximation errors.
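The sketch below illustrates the preference-model approach with pairwise classifiers and voting as in (8). The toy preference data and features are assumptions; any binary classifier could play the role of C_ij, and the weighted vote follows the variant of Wirth and Fürnkranz (2013a,b).

```python
# Sketch of deriving a policy from pairwise action-preference classifiers via voting;
# the toy data, features, and classifier choice are assumptions.
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

actions = [0, 1, 2]
# Training data: (state features, preferred action, dominated action) triples (toy assumption).
pref_data = [(np.array([0.0, 1.0]), 0, 1), (np.array([0.0, 1.0]), 0, 2),
             (np.array([1.0, 0.0]), 2, 0), (np.array([1.0, 0.0]), 2, 1),
             (np.array([0.5, 0.5]), 1, 0), (np.array([0.5, 0.5]), 1, 2)]

# One classifier C_ij(s) per action pair, predicting whether a_i > a_j holds in s.
classifiers = {}
for i, j in itertools.combinations(actions, 2):
    X, y = [], []
    for s, a_pref, a_dom in pref_data:
        if {a_pref, a_dom} == {i, j}:
            X.append(s)
            y.append(1 if a_pref == i else 0)
    if len(set(y)) == 2:
        classifiers[(i, j)] = LogisticRegression().fit(X, y)

def policy(state):
    """Greedy policy from the vote counts k(s, a) of Eq. (8)."""
    k = {a: 0.0 for a in actions}
    for (i, j), clf in classifiers.items():
        p_ij = clf.predict_proba([state])[0, 1]   # weighted vote instead of a binary one
        k[i] += p_ij
        k[j] += 1.0 - p_ij
    return max(k, key=k.get)

print(policy(np.array([0.0, 1.0])))   # expected to pick action 0 on this toy data
```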
The expected utility of the trajectories realized by a policy, E_{Pr_π(τ)}[U(τ)] (9), is comparable to the expected return in classic reinforcement learning. Besides exploring the policy space, as in classic reinforcement learning, we also need to explore the utility space in preference-based RL algorithms. Estimating a utility function has the advantage
that we can evaluate new trajectories without asking the expert for explicit feedback. Yet,
utility function approaches may suffer from the approximation errors induced by the esti-
mated utility function. A common utility-based approach is to assume a scalar utility for
the trajectories, i.e.,
τ_i1 ≻ τ_i2 ⇔ U(τ_i1) > U(τ_i2).
A scalar utility function always exists provided that there are no incomparable trajectory
pairs (von Neumann and Morgenstern, 1944).
Algorithm 4 summarizes a general process of utility-based PbRL, which forms the basis
of a variety of different realizations. Preference queries are generated by sampling one or
2. The definition of a utility function may differ from the functions encountered in utility theory, but is
inline with existing PbRL approaches and the preference learning literature.
more trajectories from a policy and deriving queries by exhaustive or selective comparison
(cf. Sec. 3.4.1 & 3.4.2). The preferences are then used to compute a utility function (Fried-
man and Savage, 1952) which, in turn, can be used to optimize policies according to the
expected utility. Different learning algorithms can be used for modeling the utility function.
In the following, we primarily discriminate between linear and non-linear utility models.
Linear utility functions. The most common approach is to use utility functions that
are linear in a feature vector. We may use state action features ϕ(s, a) resulting in a utility
U(s, a) = θ^T φ(s, a), or trajectory features ψ(τ) yielding U(τ) = θ^T ψ(τ). In order to find such a linear utility function, we can define a loss function L(θ, ζ) which is given by the (weighted) sum of the pairwise disagreement losses L(θ, ζ_i), i.e.,

L(θ, ζ) = ∑_{i=1}^{|ζ|} α_i L(θ, ζ_i),    (10)
as described in Sec. 2.3. Using the linearity of the utility function, we can define the utility
difference
d(θ, ζ_i) = θ^T (ψ(τ_i1) − ψ(τ_i2))    (11)
for two trajectories. Different definitions of the pairwise disagreement loss L(θ, ζi ) have been
used in the literature and most of them use the utility difference. An intuitive loss directly
correlating with the obtained binary feedback is the indicator loss, which states that the utility difference always has to be larger than a given threshold, as used by Sugiyama et al. (2012). However, such a binary loss function is typically difficult to optimize and
there is no notion of preference violation (or fulfillment) for trading off multiple preferences.
Table 1: Likelihood functions for preferences found in the literature, based on the utility difference d(θ, ζ_i) of Eq. (11):

0/1:  p^{0/1}_θ(ζ_i) = I[d(θ, ζ_i) < 0]  — Sugiyama et al. (2012)

combined:  p_θ(ζ_i) = ((|ζ|−1)/|ζ|) p^{0/1}_θ(ζ_i) + (1/|ζ|) p^{sig}_θ(ζ_i)  — Wirth et al. (2016)

integrated piecewise:  p_θ(ζ_i) = ∫_0^{ε_max} c(d(θ, ζ_i), ε) dε, with c(x, ε) = 0 if x < −ε, 1 if x > ε, and (ε + x)/(2ε) else  — Akrour et al. (2013, 2014)
Therefore, continuous loss functions are more often used. A number of continuous loss
functions have been proposed in the literature.
Tbl. 1 shows a summary of loss functions and likelihood functions that can be found
in the literature. All these approaches try to find a good trade-off between potentially
violating preferences and optimizing the utility difference of preferred trajectories. Some
algorithms (Akrour et al., 2011, 2012; Zucker et al., 2010; Wirth and Fürnkranz, 2012, 2015;
Runarsson and Lucas, 2012, 2014) define the loss function L (θ, ζi ) as hinge loss (Fig. 2a)
that can be optimized using SVM ranking algorithms (Joachims, 2002; Herbrich et al., 1999),
others (Wirth et al., 2016) as a piecewise linear loss function (Fig. 2b) inspired by inverse
reinforcement learning algorithms (Ng and Russell, 2000).
Other algorithms use sigmoidal loss functions to saturate the effect of high utility differences, in order to overcome the limitations of a linear loss. Such sigmoidal loss functions are often modeled as likelihood functions p_θ(ζ_i) for the preferences. In this case, we have to optimize the log likelihood, i.e., the loss (10) with L(θ, ζ_i) = − log(p_θ(ζ_i)). Most algorithms (Akrour et al., 2014; Wirth et al., 2016;
Kupcsik et al., 2015) do not directly maximize the likelihood, but obtain the posterior
Figure 2: Shape of the loss functions over the utility difference d(θ, ζ_i): (a) hinge loss, (b) IRL loss.

Figure 3: Shape of likelihood functions and the according negative log likelihood over the utility difference d(θ, ζ_i): (a) cumulative likelihood, (b) sigmoid likelihood, (c) integrated piecewise likelihood, (d) 0/1 likelihood, (e) combined likelihood, (f) cumulative log loss, (g) combined log loss.
distribution

Pr(θ | ζ) ∝ Pr(θ) ∏_i p_θ(ζ_i).    (13)
The expected utility of a new trajectory is then obtained by computing the expectation over
the posterior. The posterior is computed using MCMC (Andrieu et al., 2003) by Akrour
et al. (2014) and Wirth et al. (2016), whereas the latter uses elliptic slice sampling (ELS;
Murray et al. 2010). These procedures are sampling based and can only obtain a costly
approximation of the posterior. The utility function’s posterior is also computed by Kupcsik
et al. (2015), but using more efficient convex optimization. Wirth et al. (2016) also compare
with a direct maximization of the likelihood. The direct maximum likelihood approach is
more optimistic in the sense that it disregards the uncertainty.
The only approach that does not use an explicit cost function for obtaining the utility is
by Jain et al. (2013, 2015). It uses the preferences to compute a gradient
for the parameter vector of the utility function, i.e., θ_{k+1} = θ_k + α (ψ(τ_k1) − ψ(τ_k2)), where α is a step-
size. In this case, the parameter vector is updated such that it is correlated with the
feature vector of the preferred trajectory. Such a gradient can be mapped to the simple
loss function L(θ, ζi ) = −d(θ, ζi ). In general, it is unclear how the aggregated utility loss
L(θ, ζ) is related to the policy loss L(π, ζ) (see Sec. 2.3), as the policy is subject to the
system dynamics whereas the utility is not. Nevertheless, it is assumed that maximizing the
expected utility (9) yields more preferred trajectories.
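As an illustration of this family of methods, the following sketch fits a linear trajectory utility U(τ) = θ^T ψ(τ) by gradient descent on a sigmoidal negative log likelihood of the utility difference (11). The trajectory features and the hidden expert utility are synthetic assumptions; none of the surveyed algorithms is reproduced exactly.

```python
# Sketch: linear utility from trajectory preferences via a sigmoidal negative
# log likelihood of the utility difference; all data are synthetic assumptions.
import numpy as np

rng = np.random.default_rng(2)
dim = 4
theta_true = rng.normal(size=dim)                       # hidden expert utility (assumption)

def make_preference():
    """Sample two trajectory feature vectors psi(tau) and label the preferred one."""
    psi_1, psi_2 = rng.normal(size=dim), rng.normal(size=dim)
    if theta_true @ psi_1 >= theta_true @ psi_2:
        return psi_1, psi_2                              # (preferred, dominated)
    return psi_2, psi_1

prefs = [make_preference() for _ in range(200)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = np.zeros(dim)
lr = 0.1
for _ in range(500):
    grad = np.zeros(dim)
    for psi_pref, psi_dom in prefs:
        diff = psi_pref - psi_dom                        # d(theta, zeta_i) = theta^T diff
        grad += (1.0 - sigmoid(theta @ diff)) * diff     # ascent direction of log sigmoid(d)
    theta += lr * grad / len(prefs)

# The learned utility should agree with the hidden one on most held-out pairs.
test = [make_preference() for _ in range(100)]
acc = np.mean([theta @ (p - d) > 0 for p, d in test])
print(f"preference accuracy: {acc:.2f}")
```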
Non-Linear Utility Functions. Few approaches allow the use of non-linear utility func-
tions. The utility feature space is usually assumed to be defined in advance, shifting the
problem to the user. However, defining such a feature space usually requires expert knowl-
edge. Methods that can use non-linear utilities are easier to apply but may require a higher
amount of preferences or trajectory samples, because the learning problem is more complex.
Gritsenko and Berenson (2014) convert the preferences into a ranking and do not operate
directly on the preference level. Instead, their algorithm tries to learn a utility distance
d(U, ζ_i), which models the rank difference k_i of a trajectory pair τ_i1 and τ_i2. The resulting regression loss can be minimized by any regression algorithm. For example, Gritsenko
and Berenson (2014) use M5’ (Wang and Witten, 1997).
The rank values are subject to a somewhat arbitrary scale (e.g., the difference between
mispredictions of neighboring ranks is always the same). As an alternative, the rank values
can be encoded as classes instead of numeric values (Gritsenko and Berenson, 2014). Such a
multi-class problem can be learned by any conventional classification algorithm, Gritsenko
and Berenson (2014) use C4.5 (Quinlan, 1993). The classification model is again used as
the utility function where the class is used as utility value. The multi-class loss can not be
directly mapped to a preference loss L (θ, ζi ) as the classification problem is discrete and
disregards the rank or utility difference.
Christiano et al. (2017) use deep neural networks for approximating a non-linear utility function. They directly minimize the negative log likelihood −∑_{i=1}^{N} log(p_θ(ζ_i)) with a sigmoid likelihood function. The utility function is not linear in the features, but it is differentiable. Hence, it is possible to compute a gradient for the parameters θ and use common
backpropagation techniques for minimizing the loss.
Wirth et al. (2016) and Kupcsik et al. (2015) also allow for non-linear utility functions
by using kernel-based feature spaces. While the likelihood function is linear in the resulting,
infinite feature space, the effective feature space depends non-linearly on the used samples.
Wirth et al. (2016) use Gaussian kernel functions whereas Kupcsik et al. (2015) employ
Gaussian process preference learning (Chu and Ghahramani, 2005).
In case the utility function should be applicable to states or state/action pairs, the
linearity of the MDP's reward has to be considered (Wirth et al., 2016). Hence, the utility must be linear over the state/action pairs within a trajectory and non-linearity has to be achieved by mapping the state/action features themselves, as we will discuss in Sec. 3.3.3.
Given a value-based utility, a policy can be derived as a greedy policy maximizing the expected value of the next state s′. However, this ex-
pectation can only be computed exactly if the transition model δ is known and the state
space is discrete. Moreover, value-based utilities always depend on a specific, optimal policy.
Hence, it is difficult to transfer the learned value-utility function to other domains.
Kupcsik et al. (2015) define the utility over the parameter vector that specifies the policy. Instead of using a linear feature representation, a Gaussian process is used to model the return utility U(τ), see Sec. 3.2.3.
Most other approaches assume trajectory features that accumulate the state/action features, i.e.,

ψ(τ) = ∑_{t=0}^{|τ|} γ^t φ(s_t, a_t),    (16)
with U(s_t, a_t) = θ^T φ(s_t, a_t), which coincides with the definition of the return in reinforce-
ment learning. The trajectory utility U (τ ) is a linear sum over the state/action utilities
but the state/action utilities themselves can be non-linear. As an example, Wirth et al.
(2016) use a kernel function to obtain such a non-linear utility by applying the kernel on a
state/action level. Sugiyama et al. (2012) proposed an alternative learning method, based
on the sum of future utilities
U(s, a) = ∑_{s′∈S} δ(s′ | s, a) ( U(s, s′) + max_{a′∈A} ∑_{s″∈S} δ(s″ | s′, a′) U(s′, s″) ),    (17)

U(s, s′) = θ^T φ(s, s′),

with U(s, s′) as the utility for the state transition from s to s′. However, this idea is only applicable
with a known transition function and a discrete action space.
More recently, a method was proposed that does not require the utility to be linear in
the state/action features. Christiano et al. (2017) define a learning method for a trajectory utility U(τ) = ∑_{t=0}^{|τ|} U(s_t, a_t) that uses a deep neural network for U(s_t, a_t). This method
greatly improves on the scalability of the learning process and allows more expressive utility
functions, but requires a high amount of preferences.
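A small sketch of this decomposition is shown below: with γ = 1, a linear per-step utility U(s, a) = θ^T φ(s, a) makes the trajectory utility linear in the summed features ψ(τ) of (16), so the loss functions of Sec. 3.2.3 can be reused unchanged. The feature map φ and all values are toy assumptions.

```python
# Sketch of the reward-like decomposition of Eq. (16); phi() and all values are
# toy assumptions.
import numpy as np

def phi(state, action):
    """Assumed state/action feature map."""
    return np.array([state, action, state * action], dtype=float)

def psi(trajectory, gamma=1.0):
    """psi(tau) = sum_t gamma^t phi(s_t, a_t); gamma = 1 for finite trajectories."""
    return sum(gamma ** t * phi(s, a) for t, (s, a) in enumerate(trajectory))

def trajectory_utility(theta, trajectory):
    """U(tau) = sum_t U(s_t, a_t) = theta^T psi(tau)."""
    return theta @ psi(trajectory)

theta = np.array([0.5, -0.2, 0.1])
tau_1 = [(1, 0), (2, 1), (3, 1)]
tau_2 = [(1, 1), (1, 1), (2, 0)]
# Trajectory preferences constrain the same theta that defines the per-step utility.
print(trajectory_utility(theta, tau_1) > trajectory_utility(theta, tau_2))
```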
In contrast to conventional reinforcement learning, a discount factor of γ = 1 is used for
U (τ ) in all approaches because the expert should not need to consider the effects of decay.
However, using an undiscounted return is not a major problem as all considered trajectories
have a finite length. The parameters θ can now be estimated by using one of the loss
functions presented in Sec. 3.2.3, where trajectory preferences are used as constraints for
the utility return U (τ ). The underlying assumption of using immediate utility functions is
that the expert’s evaluation function complies with the Markov property. If this is not the
case, the decomposition of the utility return into immediate utilities is error-prone. Instead
of using state-action utilities U (s, a), some algorithms (Wirth et al., 2016) also disregard
action costs and just use a state utility function, i.e., U (s, a) = U (s).
In contrast to value-based utility functions, reward-based utility functions can be easily
transferred across domains. Moreover, a reward-based utility is also independent of the
system dynamics and, therefore, often has a simpler structure than value-based utilities, rendering it simpler to learn and generalize.
In many approaches, a generated trajectory is not necessarily used for preference generation. Hence, the number of trajectories can be consid-
erably higher than the number of generated preferences. Yet, many algorithms ignore the
potentially higher costs of preference generation in comparison to trajectory generation, as
we will elaborate in the following discussion.
learning method explained in Sec. 3.4.3. However, they also only use a single, homogeneous
exploration method. The policy improvement is computed by updating the posterior dis-
tribution of a parametric policy space, given a preference-based likelihood function. A new
preference query is then created by applying two policies to the MDP, selected by the di-
rected exploration criterion. The computation of these criteria requires multiple trajectories,
but they are not used for the queries themselves or for the policy improvement step.
Exhaustive Preference Query Generation. Some approaches require that each tra-
jectory is evaluated by the expert by defining preferences regarding every possible trajectory
pair, either for comparing and selecting policies (Busa-Fekete et al., 2013, 2014) or for eval-
uating and updating a single policy (Jain et al., 2013, 2015; Wirth and Fürnkranz, 2013a,b;
Kupcsik et al., 2015). These approaches ignore that generating preferences can be more
costly than generating trajectories and, hence, typically require a large amount of expert
queries.
Greedy Preference Query Generation. In greedy approaches (Akrour et al., 2014), we
fully optimize a given objective, described in Sec. 3.4.3, which can be either an undirected
or directed exploration objective. After the optimization is finished, we use a trajectory
generated from the optimized policy to query the expert’s evaluation function. While this
approach is typically much more efficient in the number of expert queries, it may need many
trajectories to be generated in order to optimize the posed objective. Furthermore, many of
these trajectories might have been unnecessarily generated as the posed objective is typically
quite error-prone in the beginning of the learning process. Hence, the optimization might
explore poor trajectories which could have been avoided if new preferences had been requested earlier, before reaching the maximum of the given objective.
Interleaved Preference Query Generation. The above-mentioned problem was recog-
nized by several authors (Akrour et al., 2012, 2011; Wirth et al., 2016; Wilson et al., 2012;
Christiano et al., 2017). A simple solution to alleviate it is to prematurely stop the optimiza-
tion of the given objective and request new preferences. The preferences are subsequently
used to update the given optimization objective (see Sec. 3.4.3). This approach is followed by
Akrour et al. (2012, 2011) where the search in parameter space is stopped early, after a given
number of iterations. Wirth et al. (2016); Christiano et al. (2017) request new preferences
after a single iteration of the used policy optimization algorithm. In each iteration, several
trajectories are generated by the current policy but only a subset (typically one) is used for
an expert query. Wirth et al. (2016) propose to use the trajectory with the highest expected
accumulated utility, whereas Christiano et al. (2017) use a variance estimate, based on an
ensemble technique. Wilson et al. (2012) apply the same idea to a parametric policy space,
using a selection criterion based on expected belief change. An alternative method selects
two policies that are used for creating trajectories for a new preference query. In contrast
to Wirth et al. (2016), this algorithm completely discards trajectories that are not used for
preference queries, as it is only able to use explicitly evaluated trajectories. Hence, not all
trajectories are evaluated, but only evaluated trajectories are used to update the policy.
The resulting algorithms allow to choose the trade-off between the cost of trajectory
generation and preference generation by controlling the ratio of generated trajectories to
generated preferences. Akrour et al. (2012, 2011); Wirth et al. (2016) can also efficiently
explore the trajectory space as well as the preference space as they can use trajectories not
evaluated by the expert for policy optimization.
where d(τ i , τ j ) is a trajectory difference function (see (4)). If two policies disagree greatly
on what trajectories to create, it is assumed to be beneficial to rule out one. Therefore,
samples from πi and πj are evaluated and a query is posed in case the value of the criterion
exceeds a given threshold.
The second criterion computes the expected belief change over the policy space that would result from adding the preference of a potential query, estimated by sampling policies from the current distribution. Different measures
could be used as V , e.g. the Kullback-Leibler divergence. Wilson et al. (2012) use an
approximation of the variational distance. The query maximizing the criterion, based on a
finite trajectory sample set, is then posed to the expert.
Akrour et al. (2011, 2012, 2014) define a utility function that includes an exploration
term. The simplest form, used by Akrour et al. (2011), adds a diversity term ∆(π, π′) to the expected utility of a policy, comparing the states in S_1, only visited by π, the states in S_2, only visited by π′, and the states in S_3, visited by both policies. Π_t denotes all policies obtained up to the current iteration. Candidate policies are generated by an evolutionary strategy and E_{Pr_π(τ)}[U(τ)] is approximated via samples. The diversity function ∆(π, π′) rewards differences in the visited states and is only applicable to discrete state spaces; however, a discrete description can be
obtained using a clustering algorithm (see Sec. 3.3.2). Akrour et al. (2012, 2014) propose a
different selection criterion that maximizes the expected utility of selection, with τ* as the currently undominated trajectory. It computes the expected utility for the undominated trajectory, differentiating the cases where the new trajectory τ is preferred over the current best and where it does not improve upon it. Pr(τ ≻ τ* | U) is the probability of improvement, given the current distribution over utility functions. The expected utility E_{ζ∪{τ≻τ*}}[U(τ)] is then computed under the assumption that the new preference is added to the preference set ζ. The two publications (Akrour et al., 2012, 2014) differ in the method
used to approximate the criterion.
Wirth et al. (2016) also maintain the currently best trajectory for preference queries. The trajectory maximizing the expected utility within a sample set τ^[i] is selected as the second trajectory for the preference query, where τ^[i] is a set of trajectories obtained by sampling from the current policy π_i. Trajectories obtained from sampling former policies are not considered.
Christiano et al. (2017) also use trajectories obtained from the current policy, but eval-
uate pairs based on the variance of the utility. They approximate the variance by obtaining
utility samples from an ensemble of utility functions. Each utility function uses a randomly
sampled subset of the available preferences, hence, the learned functions may differ.
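The following sketch captures the spirit of this ensemble-based query selection: several utility functions are fit to random subsets of the preference set, and the candidate pair with the largest disagreement (variance of the predicted preference probability) is posed to the expert. The linear utilities and toy data are assumptions and merely stand in for the deep networks used by Christiano et al. (2017).

```python
# Sketch of ensemble-variance query selection; linear utilities and toy data are
# assumptions standing in for the original deep-network implementation.
import numpy as np

rng = np.random.default_rng(3)
dim = 4

def fit_utility(prefs, n_steps=300, lr=0.1):
    """Fit a linear utility to (preferred, dominated) feature pairs via a sigmoid NLL."""
    theta = np.zeros(dim)
    for _ in range(n_steps):
        grad = np.zeros(dim)
        for psi_p, psi_d in prefs:
            diff = psi_p - psi_d
            grad += (1.0 - 1.0 / (1.0 + np.exp(-theta @ diff))) * diff
        theta += lr * grad / max(len(prefs), 1)
    return theta

# Existing preference set and an ensemble trained on random subsets of it.
prefs = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(30)]
ensemble = []
for _ in range(5):
    idx = rng.choice(len(prefs), size=len(prefs) // 2, replace=False)
    ensemble.append(fit_utility([prefs[i] for i in idx]))

# Candidate trajectory pairs generated by the current policy (toy stand-ins).
candidates = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(10)]

def disagreement(pair):
    """Variance over the ensemble of the predicted preference probability."""
    probs = [1.0 / (1.0 + np.exp(-theta @ (pair[0] - pair[1]))) for theta in ensemble]
    return np.var(probs)

query = max(candidates, key=disagreement)   # pair posed to the expert next
print(disagreement(query))
```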
A major consideration is how the policy is optimized and how the policy optimization method
affects the optimality of the learned policy and the sample requirements of the algorithm.
Direct policy search approaches typically directly search in the parameter space of the pol-
icy and do not assume the Markov property or use dynamic programming such as standard
reinforcement learning approaches. They maximize the objective defined by (2) and (3),
however, this is difficult to compute directly due to the dependence on the transition dynam-
ics. Hence, methods like policy likelihood or ranking (cf. Sec. 3.2.1) are used. Alternatively,
common return-based optimization can be employed by defining a utility-based return, as
introduced in Sec. 3.3.2.
Direct policy search methods are typically not very sample-efficient, as the temporal
structure of the problem is ignored, but they can obtain good results because of their
simplicity. These methods can typically be used for policies with a moderate number of parameters, usually less than one hundred.
Wilson et al. (2012) employ Markov Chain Monte Carlo (MCMC; Andrieu et al. 2003)
sampling to obtain the posterior distribution of a parametric policy space. The sampling is computationally costly, but does not require new trajectory samples. A sampling approach
is also used by Busa-Fekete et al. (2013, 2014), but within an evolutionary algorithm where
policies are described with a set of parameters. Akrour et al. (2011, 2012, 2014) also use an
evolutionary strategy (1 + λ ES; Auger 2005) for optimization in a parametric policy space,
which requires a high amount of trajectory samples.
The contextual REPS (Kupcsik et al., 2014) algorithm, as employed by Kupcsik et al.
(2015), is also a strategy for optimizing parameters of the policy. In the contextual setting,
we can learn to adapt the policy parameters to the context of the task. For example, it
can learn to adapt the robot’s trajectory if we need to throw a ball for a larger distance
(Kupcsik et al., 2014).
Value-based approaches exploit the temporal structure of the problem by using the Markov
property, which, in principle, allows for more sample-efficient policy optimization. However,
these methods might suffer from approximation errors which can lead to instabilities and
divergence. Typically, value-based methods use some variant of policy iteration that is
again composed of a policy evaluation step and a policy improvement step. In the policy
evaluation step, the estimated reward-based utility (Akrour et al. 2014; Wirth et al. 2016;
Christiano et al. 2017; cf. Sec. 3.2.3) is used to compute a value function for the policy.
This value function can, for example, be estimated by temporal difference methods such as
least squares temporal difference learning (LSTD; Boyan 1999). Given the value function,
a new policy can be obtained in the policy improvement step. For example, Wirth et al.
(2016) use a variant of the relative entropy policy search (REPS) algorithm (Peters et al.,
2010) to update the policy. The REPS algorithm performs a soft-greedy policy update that
stabilizes the policy iteration process and is also applicable in continuous action spaces.
The TRPO (Schulman et al., 2015) algorithm employed by Christiano et al. (2017) works
comparably, but uses deep neural networks for approximating the value function and the
policy. A separate algorithm for computing the value function is not required.
In discrete action spaces, greedy updates using the max-operator have been used (Akrour
et al., 2014). However, this approach requires a sufficient amount of transition samples to
be obtained beforehand which again stabilizes the policy iteration process.
Other authors (Fürnkranz et al., 2012; Wirth and Fürnkranz, 2012, 2013a,b, 2015;
Runarsson and Lucas, 2012, 2014; Sugiyama et al., 2012) assume a utility function directly
defining the expected outcome of a state or action. Hence, an optimal policy is derived
by directly selecting the action maximizing the utility. In case of state values (Wirth and
Fürnkranz, 2012, 2015; Runarsson and Lucas, 2012, 2014), this step requires the transition
model to be known.
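A minimal sketch of this model-based derivation is given below: with a known transition model δ(s′ | s, a) and a learned state utility U(s), the greedy policy simply picks the action with the highest expected successor utility. All arrays are toy assumptions.

```python
# Sketch: greedy policy from a learned state utility with a known transition model;
# U and delta below are toy assumptions.
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 4, 2
U = np.array([0.1, 0.4, 0.9, 0.2])                                     # learned state utility U(s)
delta = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # delta[s, a, s']

def greedy_action(state):
    """Select the action with maximal expected successor utility under the known model."""
    expected_utility = delta[state] @ U                                # shape (n_actions,)
    return int(np.argmax(expected_utility))

print([greedy_action(s) for s in range(n_states)])                     # one greedy action per state
```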
3.5.3 Planning
Many other approaches (Zucker et al., 2010; Jain et al., 2013, 2015; Gritsenko and Beren-
son, 2014) use planning algorithms for optimizing policies. Planning-based algorithms use
utility-based approaches to derive an evaluation measure for trajectories (or state/actions).
Furthermore, a model of the system dynamics is required to determine the effects of actions.
It is then possible to plan a sequence of actions that maximizes the given utility. However,
a model of the system dynamics is in many cases not available with a sufficient accuracy.
The use of sampling-based rapidly-exploring random trees (RRT; LaValle and Kuffner
1999) and their constrained bi-directional variant (CBiRRT; Berenson et al. 2009) was ex-
plored in (Jain et al., 2013, 2015; Gritsenko and Berenson, 2014). RRTs are guaranteed to
find a solution to a planning problem, but provide no guarantees for the optimality of the
found solution. Other planning algorithms come with such guarantees, such as anytime A*
(ARA*; Likhachev et al. 2003), as used by Zucker et al. (2010). Planning methods require
an accurate model of the system dynamics and inherently suffer from model errors. Small
inaccuracies in the model might be exploited by the planning method which may result in
solutions that are not feasible on the real system.
The reviewed algorithms also differ with respect to the amount of available model knowledge, which has a big influence on the tasks to which the approaches are applicable. Model-based
approaches assume that the system dynamics are known in advance, which allows the use
of planning algorithms (Jain et al., 2013, 2015; Gritsenko and Berenson, 2014; Zucker et al.,
2010) or to directly derive an optimal policy (Wirth and Fürnkranz, 2012, 2015; Runars-
son and Lucas, 2012, 2014; Sugiyama et al., 2012). In addition, some approaches require
a model to be able to simulate the system initialized at arbitrary states (Fürnkranz et al.,
2012; Wilson et al., 2012).
The transition dynamics of real world problems are usually not known, as they are
subject to a wide variety of (unknown) factors. Hence, many model-based approaches try
to learn the dynamics based on observed transitions (Grünewälder et al., 2012). However,
model learning can be a complicated task in particular for continuous systems and is a
research field on its own (Nguyen-Tuong and Peters, 2011). It might require a large number of samples, a suitable model class needs to be chosen, and the resulting model errors often decrease the quality of the resulting policy.
Tbl. 2 shows a tabular summary of the discussed algorithms, grouped by the different learn-
ing problems, introduced in Sec. 3.2. We list the introduced key concepts and the different
requirements, stated throughout this paper. The preferences column is related to the dif-
ferent types of preferences mentioned in Sec. 3.1. The trajectory generation and preference
(query) generation columns are based on the subsections of Sec. 3.4, where we mention how
trajectories and preference queries are generated to resolve the exploration problem. The
various approaches use different optimization strategies for performing policy optimization
and surrogate loss minimization, if applicable. The columns surrogate optimization and
policy optimization show the strategies mentioned in Sec. 3.2 and Sec. 3.5 explicitly.
Learning a Policy Preferences Traj. Gen. Pref. Gen. Surrogate Opt. Policy Opt. Require.
Busa-Fekete et al. (2013, 2014) Trajectory Homogen Exhaustive - CMA-ES PP
Wilson et al. (2012) Trajectory Homogen Interleaved - MCMC PP,M/S0
Learning a Preference Model Preferences Traj. Gen. Pref. Gen. Surrogate Opt. Policy Opt. Require.
Fürnkranz et al. (2012) Action - Exhaustive MLP Direct Max. M/S0 , DA
Wirth and Fürnkranz (2013a,b) Trajectory Homogen Exhaustive Gradient Direct Max. DS, DA
Learning a Utility Function Preferences Traj. Gen. Pref. Gen. Surrogate Opt. Policy Opt. Require.
Akrour et al. (2011, 2012) Trajectory Heterogen Interleaved SVM 1 + λ-ES PP
Akrour et al. (2014) Trajectory Heterogen Greedy MCMC LSTD, CMA-ES MA or PP
Christiano et al. (2017) Trajectory Homogen Interleaved Gradient TRPO, A3C
Gritsenko and Berenson (2014) Trajectory - - C4.5, M5P CBiRRT M
Jain et al. (2013, 2015) Trajectory Homogen+Guided Exhaustive Gradient RRT M
Kupcsik et al. (2015) Trajectory Homogen Exhaustive convex solver C-REPS PP
Wirth and Fürnkranz (2012, 2015) State - - SVM, LP Direct Max. M
Wirth et al. (2016) Trajectory Homogen Interleaved LP, ELS LSTD, REPS
Zucker et al. (2010) State Homogen Guided SVM ARA* M
Table 2: Overview of PbRL approaches by design principle
Algorithms: 1 + λ-ES : 1 + λ Evolution Strategy (Auger, 2005), A3C : Asynchronous Advantage Actor-Critic (Mnih et al., 2016),
ARA* : Anytime A* (Likhachev et al., 2003), C-REPS : Contextual Relative Entropy Policy Search (Kupcsik et al., 2014), C4.5 : C4.5
Classifier (Quinlan, 1993), CBiRRT : Constrained Bi-directional Rapidly-Exploring Random Tree (Berenson et al., 2009), CMA-ES :
Covariance Matrix Adaptation Evolution Strategy (Hansen and Kern, 2004), ELS : Elliptic Slice Sampling (Murray et al., 2010), LP : Linear
Programming, LSTD: Least Squares Temporal Difference Learning (Boyan, 1999), M5P : M5P Regression Model Trees (Wang and Witten,
1997), MCMC :Markov Chain Monte Carlo (Andrieu et al., 2003), MLP : Multi-layer Perceptron (Bishop, 1995), REPS : Relative Entropy
Policy Search (Peters et al., 2010), RRT : Rapidly-exploring Random Tree (LaValle and Kuffner, 1999), SVM : Ranking Support Vector
Machine (Joachims, 2002; Herbrich et al., 1999), TRPO: Trust Region Policy Optimization (Schulman et al., 2015)
Requirements: DA: discrete action space, DS : discrete state space, M : transition model is known, MA: pre-generated samples required, PP :
parametric policy, S0 : all states can be used as initial state
4. Experiment Domains
Most practical applications for PbRL focus on robot teaching tasks and on board game
domains. These domains can be subject to complex reward shaping problems (Ng et al.,
1999) as it is required to not only solve the task itself, but also consider constraints like motor
limitations or obstacle avoidance. Furthermore, a robot acting in uncontrolled environments
needs to cope with changing environments, interact with non-expert users and adapt to
new tasks. For such tasks, PbRL is a suitable tool as it allows robots to learn without
expert knowledge or predefined rewards. In the following, we highlight a few domains where
preference-based approaches have been successfully employed in the past.
Figure 4: The robot handover task with a sitting, standing and running person
As described in Strabala et al. (2013), this task consists of three stages: approach, signal
and transfer. First, the robot needs to approach the human and get ready for the actual
handover. In the signaling phase, the robot and the human need to exchange information
for determining the intended joint position for the handover. This information exchange is
often non-verbal. Lastly, the joint contact needs to be established. This physical transfer
should be dynamic and should not require a stationary contact phase. The focus of Kupcsik
et al. (2015) is on this last phase. Experiments are performed with a real 7-DoF KUKA
LBR arm and a Robotiq 3-finger hand in a laboratory setup, as shown in Fig. 5. The robot
performs several handover trials with a human receiving the object. Feedback is obtained in
terms of trajectory preferences, but can be enhanced with an ordinal evaluation on a 1–10
scale. The trajectories are considered to be samples of a parametric policy and the related
feedback is used to learn an evaluation function for policies, as described in Sec. 3.2.3 and
3.4.1. The evaluation function uses the policy parameters as input, allowing to define a
likelihood function for the feedback, given a parametric policy. Using Bayesian inference
and a Gaussian process prior, it is possible to infer the policy evaluation function explicitly.
Policies can be evaluated without obtaining trajectory samples by directly evaluating the
policy parameters. Hence, no costly environment calls are needed for optimizing the policy.
The performed evaluation shows a steady improvement, averaged over 5 trials. Further-
more, the return of the policy approaches a plateau after 80 policy updates. However, the exact number of preferences used as well as the performance difference to an optimal policy remain unclear.
In general, the approach is very interesting for domains where (low-dimensional) para-
metric policies are available and trajectory generation is expensive. Only learning the utility
function requires trajectory sampling. Furthermore, the process can probably be improved
further by considering specific preference function exploration criteria, as introduced in
Sec. 3.4.3. However, the approach is most likely not suited for high-dimensional policy
spaces because of the Gaussian process. Gaussian process-based methods are computationally
expensive and require a large number of samples to sufficiently capture complex policies.
Hence, a large number of trajectory samples and preferences would be required.
• heavy-fragile movement, e.g., moving a heavy object with a fragile object close by;
Overall, for evaluating their approach, they defined 35 tasks by changing the objects and/or
environment. The robot demonstrates a set of trajectories, which are ranked by the al-
gorithm. The user can then provide preferences by re-ranking these trajectories or by supplying new
ones, which are assumed to improve upon all trajectories demonstrated by the robot and are
therefore inserted at the top of the ranking. Preferences are
then obtained by computing all pairwise preferences contained in the ranking.
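As a small illustration of how such a re-ranking translates into training data, the sketch below enumerates all pairwise preferences implied by a ranking; preferences_from_ranking and add_user_trajectory are hypothetical helper names, not functions from Jain et al. (2013, 2015).

```python
from itertools import combinations

def preferences_from_ranking(ranked_ids):
    """ranked_ids is ordered from most to least preferred; every pair
    (earlier, later) yields one preference (better, worse)."""
    return list(combinations(ranked_ids, 2))

def add_user_trajectory(ranked_ids, new_id):
    """A trajectory supplied by the user is assumed to improve on all
    demonstrated ones, so it enters the ranking at the top."""
    return [new_id] + list(ranked_ids)

# Example: the robot demonstrates trajectories 0-4, the user re-ranks them
# and additionally supplies trajectory 5.
ranking = add_user_trajectory([2, 0, 4, 1, 3], new_id=5)
pairwise_prefs = preferences_from_ranking(ranking)   # 15 preference pairs
```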
For evaluating the approach, an expert scored 2100 trajectories in total on a 1–5
scale. From these scores, the authors derive a ranking and compare it with a ranking based
on the learned utility. They achieve nDCG@1 (nDCG@3) ranking scores (Manning et al.,
2008) of 0.78 (0.66) to 0.93 (0.92) after obtaining feedback 20 times. The scores are
averaged over multiple tasks, grouped into three categories. The feedback used is only a
re-ranking of the five presented trajectories, obtained by picking the most preferred one. The exact number
of derived preferences is unclear.
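For reference, nDCG@k can be computed as sketched below. This uses the common exponential-gain formulation, which may differ in detail from the variant described by Manning et al. (2008), and the example scores are made up.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranking."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(((2.0 ** rel - 1.0) / discounts).sum())

def ndcg_at_k(expert_scores_in_learned_order, k):
    """nDCG@k: DCG of the learned ranking, normalised by the DCG of the
    ideal ranking of the same items (1.0 means perfect agreement)."""
    ideal = sorted(expert_scores_in_learned_order, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(expert_scores_in_learned_order, k) / idcg if idcg > 0 else 0.0

# Hypothetical example: expert scores (1-5 scale) of five trajectories,
# listed in the order induced by the learned utility function.
print(ndcg_at_k([4, 5, 3, 2, 1], k=1), ndcg_at_k([4, 5, 3, 2, 1], k=3))
```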
The work of Jain et al. (2013, 2015) is especially interesting because it was demonstrated
to be applicable in a wide range of domains. In contrast to Kupcsik et al. (2015), a low-
dimensional policy space is not required, only a predefined trajectory feature space. Arguably,
this requirement is easier to satisfy in practice. However, the utility function has to be
linear in the given feature space, although the method could be combined with non-linear utility
functions, as introduced by Wirth et al. (2016).
3. http://portablegamenotation.com/Symbols.html
remarked that White’s move 13.Bd2! was a very good move (indicated by the exclama-
tion mark at the end of the move), and provides three alternative replies for the move
13...fxe5 that was played in the game: 13...a5?! is dubious, 13...Qxc2? is a mistake,
and 13...Nxc2?? is a losing move. The positions at the end of these variations are also
annotated with symbols like ±, indicating a small advantage for White, or +−, indicating
a decisive advantage for White. It is straightforward to derive preferences over moves and
positions from such annotations. The preferences can then be used to learn a state utility
function which estimates the expected outcome of a position. A policy is then derived by
maximizing this utility, using the already known chess transition function.
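A hedged sketch of the first step, deriving move preferences from annotation symbols, is given below. The numeric scale attached to the symbols is an illustrative assumption (only the implied ordering matters), and the example reuses the moves from the annotated game discussed above.

```python
# Ordinal quality implied by standard chess annotation symbols (higher is
# better); the concrete numbers are only an assumed encoding of the ordering.
MOVE_SYMBOLS = {"!!": 3, "!": 2, "!?": 1, "": 0, "?!": -1, "?": -2, "??": -3}

def move_preferences(annotated_moves):
    """annotated_moves: (move, symbol) pairs that are alternatives in the
    same position. Returns all implied preferences as (better, worse)."""
    prefs = []
    for m1, s1 in annotated_moves:
        for m2, s2 in annotated_moves:
            if MOVE_SYMBOLS[s1] > MOVE_SYMBOLS[s2]:
                prefs.append((m1, m2))
    return prefs

# The played move 13...fxe5 (unannotated) and its annotated alternatives:
alternatives = [("fxe5", ""), ("a5", "?!"), ("Qxc2", "?"), ("Nxc2", "??")]
print(move_preferences(alternatives))   # yields 6 preference pairs
```

A similar mapping over position symbols such as ± or +− could be used to derive preferences between positions.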
The results in this domain show that it is possible to approach the performance of a near-
optimal handcrafted utility function. However, the best results require 2.12M preferences
and use a pre-defined utility feature space with only 15 dimensions.
It can be concluded that this approach is interesting for domains where it is easy to
derive a high number of (noisy) preferences. However, if optimal policies are of
importance, it should probably only be used for computing an initial solution that can be
further refined by applying one of the other PbRL approaches.
and the bicycle balance (Randløv and Alstrøm, 1998) domains. Two approaches also use a
simple grid world defined by Akrour et al. (2014). This grid world is easy to solve from a
reinforcement learning point of view, but difficult for preference learning as the ground-truth
reward values are very similar for some states. Typical state-of-the-art approaches require
5–40 preferences to compute an optimal policy in these domains. Several algorithms are
also evaluated on a cancer treatment task, derived from a reward-based setting by Zhao
et al. (2009). More complex toy domains have been used by Christiano et al. (2017), who
employ the Arcade Learning Environment (Bellemare et al., 2013) and the MuJoCo framework
(Todorov et al., 2012) for learning movement and other robotic tasks. Interestingly, they
have been able to outperform state-of-the-art RL approaches with numeric rewards in several
domains; however, they required several hundred preferences.
Runarsson and Lucas (2012, 2014) use the board game Othello, in a setting that is sim-
ilar to the experiments by Wirth and Fürnkranz (2012, 2015). However, preferences are not
derived from human annotations, but by assuming that observed state/action choices are
preferable to unseen alternatives. This relates the approach closely to inverse reinforcement
learning, although the system directly learns a value function instead of a reward function.
Sugiyama et al. (2012) derive preferences from numeric ratings and also learn a value func-
tion, but in a human dialog setting. They show improvements over classic RL and
IRL approaches. A detailed overview of experiment domains and evaluated approaches is
given in Tbl. 3. Domain setups are not always identical and may differ in details such as
the noise level, the feature space used, or action limits. Hence, even in the few cases where two
papers publish results for the same evaluation domain, the results are usually not directly
comparable.

Table 3: Overview of experiment domains (classic discrete and continuous tasks, robotics tasks, and games, including Grid world, River Swim, End of Maze, Mountain Car, Acrobot, Cancer, Inverted Pendulum, Swing Up, Bicycle Balance, MuJoCo, NAO Robot, Dialog Data, object handover, path planning, terrain locomotion, unhang wheel, Chess, Othello, and ATARI) and the approaches evaluated on them: Akrour et al. (2011, 2012, 2014), Busa-Fekete et al. (2013, 2014), Christiano et al. (2017), Fürnkranz et al. (2012), Gritsenko and Berenson (2014), Jain et al. (2013, 2015), Kupcsik et al. (2015), Runarsson and Lucas (2012, 2014), Sugiyama et al. (2012), Wilson et al. (2012), Wirth and Fürnkranz (2012, 2013a,b, 2015), Wirth et al. (2016), and Zucker et al. (2010).
5. Discussion
While there has been much progress in PbRL in recent years, many open questions and
shortcomings of the approaches remain. For example, none of the utility-based approaches
are able to produce a set of Pareto-optimal policies in case of incomparabilities. Only
the direct policy learning method of Busa-Fekete et al. (2014) can handle incomparabilities
by assuming that the incomparable trajectories should be evaluated equally, i.e., that both
trajectories receive the same probability. A potential solution to cope with incomparabilities
would be to learn a non-scalar utility function, allowing the utility functions to differ in
multiple dimensions. A set of Pareto-optimal policies could then be learned using multi-
objective reinforcement learning (Liu et al., 2015).
The question of preference-space exploration has also not yet been addressed satis-
factorily. While a few approaches for preference-space exploration have been proposed
(Sec. 3.4.3), a principled integration into the policy-space exploration strategy is still miss-
ing. Both tasks should be combined efficiently to reduce the amount of trajectory samples
and preferences that are required. In addition, very few utility-based approaches are able
to learn non-linear utility functions. Gritsenko and Berenson (2014) learn non-linear util-
ities but require that the PbRL problem is turned into an ordinal ranking problem first
(Sec. 3.2.3). Ordinal rankings are problematic in that arbitrary distances between ranks
are introduced when they are treated as regression problems, whereas the classification case
suffers from a severe loss of information. Wirth et al. (2016) and Kupcsik et al. (2015) employ
kernels for overcoming this problem, an idea applicable to all linear utility function learning
problems. However, this results in a high-dimensional utility space and is computationally
costly. Hence, these approaches do not scale well to high-dimensional spaces. The deep neural
network-based approach by Christiano et al. (2017) overcomes the scalability issues of kernel-
based methods but requires a large number of preferences. Further research is required to
either scale preference-based learning with kernels to larger problem spaces or to improve
the results of DNN-based approaches with small amounts of training data.
Another important research question is to find principled ways for combining various
types of feedback from the human expert, such as preferences, ordinal or numeric feedback,
and demonstrations. In particular, inverse reinforcement learning from demonstrated tra-
jectories could be used to provide good initializations for learning a utility function. This
utility could then be refined using the expert's preferences.
Furthermore, PbRL currently cannot be used to perform risk-averse optimization
(Coraluppi and Marcus, 1999; Geibel, 2001). Risk-averse optimization guarantees that
the worst-case performance of the algorithm is bounded. In many cases, it is required that
certain trajectories are not realized, for example, trajectories that would damage the robot
or its environment. Such risk-averse behavior could be realized with weighted preference
loss functions such as (3), i.e., by substantially increasing the weights of preferences that
concern risky trajectories.
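Since Equation (3) is not reproduced here, the sketch below uses a generic weighted pairwise logistic loss as a stand-in to illustrate the idea: preferences that rule out risky trajectories simply receive much larger weights.

```python
import numpy as np

def weighted_preference_loss(utility, prefs, weights):
    """utility maps a trajectory to a scalar, prefs is a list of
    (preferred, dominated) trajectory pairs, and weights holds one
    positive weight per pair."""
    loss = 0.0
    for (tau_better, tau_worse), w in zip(prefs, weights):
        margin = utility(tau_better) - utility(tau_worse)
        loss += w * np.log1p(np.exp(-margin))   # weighted logistic preference loss
    return loss

# Preferences stating "safe trajectory > risky trajectory" could, for instance,
# be weighted ten times higher than ordinary preferences:
# weights = [10.0 if is_risky(worse) else 1.0 for (_, worse) in prefs]
```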
Another important open issue, which would substantially extend the range of possible
applications of PbRL, is to develop techniques that are able to use preferences between
trajectories starting from different initial states. A potential solution could be to take
the expected value of initial states into account and compute the preference based on the
advantage function.
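One possible formalization of this idea, stated here only as a sketch: let $U(\tau)$ denote the learned utility (or return estimate) of trajectory $\tau$, $s_0(\tau)$ its initial state, and $V$ the expected value of a state under the current policy. Preferences could then be interpreted on the advantage level,

```latex
\tau_1 \succ \tau_2 \quad\Longleftrightarrow\quad
U(\tau_1) - V\bigl(s_0(\tau_1)\bigr) \;>\; U(\tau_2) - V\bigl(s_0(\tau_2)\bigr),
```

so that a trajectory is preferred if it improves more over what is expected from its own starting state, which would make comparisons across different initial states meaningful.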
As mentioned in Secs. 3.4 and 3.5, most approaches are lacking in terms of sample
efficiency. While sample efficiency can already be improved by using reward-based utilities
(see Sec. 3.2.3), which allow sample reuse for policy optimization (Wirth et al., 2016), more
efficient model-based policy search strategies such as the PILCO algorithm (Deisenroth and
Rasmussen, 2011) or guided policy search (Levine and Koltun, 2013) could also be employed;
these learn a system dynamics model from the data and use it for policy optimization.
Moreover, we have seen that the complexity of the tasks that have been used for learning
is limited. More complex policies, which, e.g., can directly use sensory feedback as inputs,
could be acquired by incorporating recent successes from deep reinforcement learning (Mnih
et al., 2015, 2016).
Furthermore, PbRL is still lacking a unified evaluation framework. The different pub-
lications use a wide variety of testing domains, usually derived from classic RL problems.
However, the specific setups depend on the publication and usually modify the basic con-
figuration by adding noise or removing terminal conditions. Moreover, many algorithms
disregard specific subproblems like policy generalization or utility exploration. Hence, a
satisfactory comparison of algorithms is currently not possible.
6. Conclusion
Preference-based reinforcement learning (PbRL) is a suitable tool for learning from quali-
tative, non-numeric rewards. On the one hand, it can provide solutions in domains where
numeric feedback is not readily available, and, on the other hand, it may reduce the de-
mands and prior knowledge that is needed for applying RL to real-world tasks where a
numeric feedback signal has to be shaped. This survey provided a first overview of the
design choices that have to be considered for PbRL and presented current work in a coherent
framework, shedding light on the advantages and limitations of various approaches.
The most substantial problem seems to be the complexity of the learning problem. The
presented approaches require a large number of preferences or well-defined feature spaces,
either on the policy or the trajectory level. The definition of suitable, low-dimensional feature
spaces requires expert knowledge, if it is possible at all. In many domains, high-dimensional
feature spaces are required to approximate optimal policies to an acceptable degree. Fur-
thermore, preferences are costly to obtain in an incremental setting with human evaluations.
Even in cases where it is possible to obtain preferences from non-expert users, human labor
is usually quite expensive. Hence, it is very important to use the available information as
efficiently as possible. Utility-based approaches make it possible to generalize the obtained preference
feedback and are an important step towards practical algorithms. Especially reward-based
utilities seem to be appropriate for domains without well-defined policy spaces, because
they allow the application of efficient value-based reinforcement learning methods. Hence,
it is possible to build on the substantial progress that has already been achieved in this field.
However, it is also important to consider the novel problems introduced by using preference-
based feedback. Foremost, efficient methods for exploring the preference function space are
required.
Recently, interesting real-world applications have been solved by preference-based meth-
ods. Yet, several open questions remain. Foremost, sample efficiency and scalability issues
still prevent applications to more complex tasks. DNN-based approaches can tackle the
scalability issues to some degree, but improved solutions to the preference-space exploration
problem are required to limit the number of required preferences. Furthermore, PbRL is
still limited by several assumptions like the total order assumption, the uniform preference
weighting and the single starting state. Without these assumptions, it would be possible to
apply PbRL in multi-objective domains, to risk-averse problems and in settings that cannot
be simulated. Methods that can use multiple feedback forms, like preferences, demon-
strations or rewards, are also of interest as they are possibly able to reduce the cognitive
demand even further.
We believe that this line of research could provide the foundation for making reinforcement
learning applicable to the non-scientific community: despite many recent successes
in reinforcement learning, the required machine learning expertise is still
a key limitation when it comes to applying RL to real-world tasks.
Acknowledgments
This work was supported by the German Research Foundation (DFG) as part of the Priority Pro-
gramme 1527 “Autonomous Learning”, the research project FU 580/10, and under the EU H2020
project ’RoMaNS’, Ref. 645582. We would like to thank Eyke Hüllermeier, Robert Busa-Fekete and
Stefan Riezler for interesting and stimulating discussions that greatly influenced this paper.
References
P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Pro-
ceedings of the 21st International Conference on Machine learning (ICML-04), volume 69.
2004.
A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with
multi-point bandit feedback. In Proceedings of the 23rd Conference on Learning Theory
(COLT-10), pages 28–40, 2010.
A. Gritsenko and D. Berenson. Learning cost functions for motion planning from human
preferences. In Proceedings of the IROS 2014 Workshop on Machine Learning in Planning
and Control of Robot Motion, volume 1, pages 48–6, 2014.
N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test func-
tions. In Proceedings of the 8th International Conference on Parallel Problem Solving from
Nature (PPSN-04), volume 3242 of LNCS, pages 282–291. 2004.
V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in
evolutionary direct policy search. In Proceedings of the 26th International Conference on
Machine Learning (ICML-09), pages 401–408. 2009.
R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression.
In Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN-
99), volume 1, pages 97–102, 1999.
A. Jain, B. Wojcik, T. Joachims, and A. Saxena. Learning trajectory preferences for manip-
ulators via iterative improvement. In Advances in Neural Information Processing Systems
26 (NIPS-13), pages 575–583, 2013.
W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The
TAMER framework. In Proceedings of the 5th International Conference on Knowledge
Capture (K-CAP-09), pages 9–16, 2009.
W. B. Knox and P. Stone. Combining manual feedback with subsequent MDP reward
signals for reinforcement learning. In Proceedings of the 9th International Conference on
Autonomous Agents and Multiagent Systems (AAMAS-10), pages 5–12. 2010.
W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP
reward. In Proceedings of the 11th International Conference on Autonomous Agents and
Multiagent Systems (AAMAS-12), pages 475–482. 2012.
D. Kreps. Notes on the theory of choice. Underground classics in economics. Westview Press,
1988.
S. Levine and V. Koltun. Guided policy search. In Proceedings of the 30th International
Conference on Machine Learning (ICML-13), pages 1–9, 2013.
M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* search with provable bounds
on sub-optimality. In Advances in Neural Information Processing Systems 16 (NIPS-03),
pages 767–774. 2003.
D. Nguyen-Tuong and J. Peters. Model learning for robot control: a survey. Cognitive
Processing, 12(4):319–340, 2011.
J. Peters, K. Muelling, and Y. Altun. Relative entropy policy search. In Proceedings of the
24th National Conference on Artificial Intelligence (AAAI-10), pages 1607–1612. 2010.
J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and
shaping. In Proceedings of the 15th International Conference on Machine Learning (ICML-
98), pages 463–471, 1998.
T. P. Runarsson and S. M. Lucas. Preference learning for move prediction and evaluation
function approximation in othello. IEEE Transactions on Computational Intelligence and
AI in Games, 6(3):300–313, 2014.
T. P. Runarsson and S. M. Lucas. Imitating play from game trajectories: Temporal difference
learning versus preference learning. In IEEE Conference on Computational Intelligence
and Games (CIG-12), pages 79–82. IEEE, 2012.
J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy op-
timization. In Proceedings of the 32nd International Conference on Machine Learning
(ICML-15), pages 1889–1897, 2015.
O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-
point feedback. Journal of Machine Learning Research, 18:1–11, 2017.
A. Sokolov, J. Kreutzer, C. Lo, and S. Riezler. Learning structured predictors from bandit
feedback for interactive NLP. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (ACL-16), 2016.
E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS-12), pages 5026–5033, 2012.
S. Vembu and T. Gärtner. Label ranking algorithms: A survey. In Fürnkranz and Hüller-
meier (2010), pages 45–64.
J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton
University Press, 1944.
Y. Wang and I. H. Witten. Induction of model trees for predicting continuous classes. In
Poster papers of the 9th European Conference on Machine Learning (ECML-97). 1997.
Y.-C. Wang and J. M. Usher. Application of reinforcement learning for agent-based produc-
tion scheduling. Engineering Applications of Artificial Intelligence, 18(1):73–82, 2005.
P. Weng. Markov decision processes with ordinal rewards: Reference point-based prefer-
ences. In Proceedings of the 21st International Conference on Automated Planning and
Scheduling (ICAPS-11), 2011.
A. Wilson, A. Fern, and P. Tadepalli. A bayesian approach for policy learning from trajectory
preference queries. In Advances in Neural Information Processing Systems 25 (NIPS-12),
pages 1142–1150. 2012.
C. Wirth and J. Fürnkranz. First steps towards learning from game annotations. In Pro-
ceedings of the ECAI Workshop on Preference Learning: Problems and Applications in
AI, pages 53–58, 2012.
C. Wirth and J. Fürnkranz. A policy iteration algorithm for learning from preference-based
feedback. In Advances in Intelligent Data Analysis XII: 12th International Symposium
(IDA-13), volume 8207 of LNCS, pages 427–437. 2013a.
C. Wirth and J. Fürnkranz. EPMC: Every visit preference Monte Carlo for reinforcement
learning. In Proceedings of the 5th Asian Conference on Machine Learning, (ACML-13),
volume 29 of JMLR Proceedings, pages 483–497. 2013b.
Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem.
Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
Y. Zhao, M. Kosorok, and D. Zeng. Reinforcement learning design for cancer clinical trials.
Statistics in Medicine, 28(26):3294–3315, 2009.
S. Zhifei and E. Meng Joo. A survey of inverse reinforcement learning techniques. Interna-
tional Journal of Intelligent Computing and Cybernetics, 5(3):293–311, 2012.