Provably Efficient Maximum Entropy Exploration
... several mainstream RL tasks in Section 5.

1.1. Informal statement of contributions

To facilitate exploration in potentially unknown MDPs within a restricted policy class Π, we assume access to the environment through the following two oracles:

Approximate planning oracle: Given a reward function (on states) r : S → R and a sub-optimality gap ε, the planning oracle returns a policy π = ApproxPlan(r, ε) with the guarantee that V(π) ≥ max_{π′∈Π} V(π′) − ε, where V(π) is the value of policy π.

State distribution estimate oracle: A state distribution oracle estimates the state distribution d̂_π = DensityEst(π, ε) of any given (non-stationary) policy π, guaranteeing that ‖d_π − d̂_π‖_∞ ≤ ε.

Given access to these two oracles, we describe a method that provably optimizes any continuous and smooth objective over the state-visitation frequencies. Of special interest are the maximum entropy and relative entropy objectives.

Theorem 1.1 (Main Theorem, informal). There exists an efficient algorithm (Algorithm 1) such that for any β-smooth measure R and any ε > 0, in O((1/ε) log(1/ε)) calls to ApproxPlan and DensityEst, it returns a policy π̄ with

R(d_π̄) ≥ max_{π∈Π} R(d_π) − ε.

There has been a host of recent empirical success using deep RL methods which encourage exploration in some form (Mnih et al., 2015; Silver et al., 2016). The approaches are based on a few related ideas: encouraging exploration through state visitation frequencies (e.g. Ostrovski et al., 2017; Bellemare et al., 2016; Tang et al., 2017); intrinsic reward signals derived from novelty or prediction error (Lopes et al., 2012; Pathak et al., 2017; Savinov et al., 2018; Fu et al., 2017; Mohamed & Jimenez Rezende, 2015; Houthooft et al., 2016; Weber et al., 2017); aligning an intrinsic reward to the target objective (Kaelbling, 1993; Chentanez et al., 2005; Singh et al., 2010; 2009; Zheng et al., 2018); or sample-based approaches to tracking value function uncertainty (Osband et al., 2016; 2018).

Intrinsic Learning: Chentanez et al. (2005) and Singh et al. (2009; 2010) established computational theories of intrinsic reward signals (and of how they might help with downstream learning of tasks), and other works have shown how to incorporate intrinsic rewards in the absence of any true reward signal (Warde-Farley et al., 2018; Burda et al., 2018b;a; Nair et al., 2018). The potential benefit is that such learning may help the agent reach a variety of achievable goals and do well on other extrinsically defined tasks, not just the one task whose specific reward function it was explicitly trained under (see, e.g., Chentanez et al., 2005; Singh et al., 2009; Warde-Farley et al., 2018; Nair et al., 2018).
The (discounted) value V_π of a policy π is the expected cumulative reward that an action sequence sampled from the policy π gathers:

V_π = E_{τ∼P(·|π)} V(τ) = (1 − γ) E_{τ∼P(·|π)} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ].

Induced state distributions: The t-step state distributions and the (discounted) state distributions induced by a policy π are

d_{t,π}(s) = P(s_t = s | π) = Σ_{all τ with s_t = s} P(τ | π),    (2.1)
d_{t,π}(s, a) = P(s_t = s, a_t = a | π) = Σ_{all τ with s_t = s, a_t = a} P(τ | π),    (2.2)
d_π(s) = (1 − γ) Σ_{t=0}^{∞} γ^t d_{t,π}(s),    (2.3)
d_π(s, a) = (1 − γ) Σ_{t=0}^{∞} γ^t d_{t,π}(s, a).    (2.4)

The latter distribution can be viewed as the analogue of the stationary distribution in the infinite-horizon setting.

Mixtures of stationary policies: Given a sequence of k policies C = (π_0, . . . , π_{k−1}) and α ∈ Δ_k (the simplex), we define π_mix = (α, C) to be a mixture over stationary policies. The (non-stationary) policy π_mix is one where, at the first timestep t = 0, we sample policy π_i with probability α_i and then use this policy for all subsequent timesteps. In particular, a mixture π_mix behaves on an MDP by inducing infinite-length trajectories τ = (s_0, a_0, s_1, a_1, . . .) with the probability law

P(τ | π_mix) = Σ_{i=0}^{k−1} α_i P(τ | π_i),    (2.5)

and the induced state distribution is

d_{π_mix}(s) = Σ_{i=0}^{k−1} α_i d_{π_i}(s).    (2.6)

Note that such a distribution over policies need not be representable as a stationary stochastic policy (even if the π_i's are stationary), because the sampled actions are no longer conditionally independent given the states.

3. The Objective: MaxEnt Exploration

As each policy induces a distribution over states, we can associate a concave reward functional R(·) with this induced distribution. We say that a policy π* is a maximum-entropy exploration policy, also referred to as the max-ent policy, if the corresponding induced state distribution attains the maximum possible R(d_π) among the policy class Π, i.e.,

π* ∈ arg max_{π∈Π} R(d_π).

When considering the class of all policies, Lemma 3.3 assures us that the search over the class of stationary policies is sufficient. Our goal is to find a policy that induces a state distribution with a comparable value of the reward functional.

3.1. Examples of reward functionals

A quantity of interest that motivates such functionals is the entropy of the induced distribution²:

max_{π∈Π} { H(d_π) = −E_{s∼d_π} log d_π(s) }.

The same techniques we derive can also be used to optimize other entropic measures. For example, we may be interested in minimizing

min_{π∈Π} KL(d_π ‖ Q) = E_{s∼d_π} log( d_π(s) / Q(s) )

for some given distribution Q(s). Alternatively, we may seek to minimize a cross-entropy measure:

min_{π∈Π} E_{s∼Q} log( 1 / d_π(s) ) = KL(Q ‖ d_π) + H(Q),

where the expectation is now under Q. For uniform Q, this latter measure may be more aggressive in forcing π to have more uniform coverage than the entropy objective.

3.2. Landscape of the objective function

In this section, we establish that the entropy of the state distribution is not a concave function of the policy. Similar constructions can establish analogous statements for other non-trivial functionals. Subsequently, when the policy class in consideration is exhaustive, we discuss a possible convex reformulation of the objective in the space of induced distributions, which constitute a convex set.

3.2.1. Non-convexity in the policy space

Despite the concavity of the entropy function, our overall maximization problem is not concave, as the state distribution is not an affine function of the policy. This is stated precisely in the following lemma.

Lemma 3.1. H(d_π) is not concave in π.

Proof. Figure 1 demonstrates the behavior of π_0, π_1, π_2 on a 6-state MDP with binary actions. Note that for sufficiently large γ → 1 and any policy π, the discounted …

² Please note the distinction from the conditional entropy of actions given the state, e.g. (Todorov, 2007; Haarnoja et al., 2018).
Algorithm 1 Maximum-entropy policy computation.
1: Input: Step size η, number of iterations T, planning oracle error tolerance ε_1 > 0, state distribution oracle error tolerance ε_0 > 0, reward functional R.
2: Set C_0 = {π_0}, where π_0 is an arbitrary policy.
3: Set α_0 = 1.
4: for t = 0, . . . , T − 1 do
5:   Call the state distribution oracle on π_mix,t = (α_t, C_t):
         d̂_{π_mix,t} = DensityEst(π_mix,t, ε_0).
6:   Define the reward function r_t as
         r_t(s) = ∇R(d̂_{π_mix,t}) := dR(X)/dX |_{X = d̂_{π_mix,t}}.
7:   Compute the (approximately) optimal policy on r_t:
         π_{t+1} = ApproxPlan(r_t, ε_1).
8:   Update π_mix,t+1 = (α_{t+1}, C_{t+1}) to be
         C_{t+1} = (π_0, . . . , π_t, π_{t+1}),    (4.1)
         α_{t+1} = ((1 − η)α_t, η).    (4.2)

We shall assume in the following discussion that the reward functional R is β-smooth and B-bounded, and that it satisfies the following inequalities for all X, Y:

‖∇R(X) − ∇R(Y)‖_∞ ≤ β‖X − Y‖_∞,    (4.3)
−βI ⪯ ∇²R(X) ⪯ βI;   ‖∇R(X)‖_∞ ≤ B.    (4.4)

Theorem 4.1 (Main Theorem). For any ε > 0, set ε_1 = 0.1ε, ε_0 = 0.1β⁻¹ε, and η = 0.1β⁻¹ε. When Algorithm 1 is run for T iterations, where

T ≥ 10βε⁻¹ log(10Bε⁻¹),

we have that

R(d_{π_mix,T}) ≥ max_{π∈Π} R(d_π) − ε.

Before we begin the proof, we state the implication for maximizing the entropy of the induced distribution. While the entropy objective is, strictly speaking, not smooth, one may consider the smoothed alternative H_σ defined by

H_σ(d_π) = −E_{s∼d_π} log(d_π(s) + σ).

When the algorithm is fed H_σ as the proxy reward functional, it is possible to make sub-optimality guarantees on the true objective H. Lemma A.1 (D) relates the entropy functional H to its smoothed variant H_σ, while the rest of the lemma quantifies the smoothness of H_σ. The factors of |S| incurred below are a consequence of the smoothing imposed on H, and are not necessary for naturally smooth objectives.

Corollary 4.2. For any ε > 0, set σ = 0.1ε / (2|S|), ε_1 = 0.1ε, ε_0 = 0.1ε² / (80|S|), and η = 0.1ε² / (40|S|). When Algorithm 1 is run for T iterations with the reward functional H_σ, where

T ≥ (40|S| / (0.1ε²)) log( log|S| / (0.1ε) ),

we have that

H(d_{π_mix,T}) ≥ max_{π∈Π} H(d_π) − ε.

We continue with the proof of the main theorem.

Proof of Theorem 4.1. Let π* be a maximum-entropy policy, i.e. π* ∈ arg max_{π∈Π} R(d_π). By Equation 2.6 and smoothness,

R(d_{π_mix,t+1}) = R((1 − η)d_{π_mix,t} + η d_{π_{t+1}})
  ≥ R(d_{π_mix,t}) + η⟨d_{π_{t+1}} − d_{π_mix,t}, ∇R(d_{π_mix,t})⟩ − η²β‖d_{π_{t+1}} − d_{π_mix,t}‖²_2.

Moreover,

⟨d_{π_{t+1}}, ∇R(d_{π_mix,t})⟩ ≥ ⟨d_{π_{t+1}}, ∇R(d̂_{π_mix,t})⟩ − β‖d_{π_mix,t} − d̂_{π_mix,t}‖_∞
  ≥ ⟨d_{π*}, ∇R(d̂_{π_mix,t})⟩ − βε_0 − ε_1
  ≥ ⟨d_{π*}, ∇R(d_{π_mix,t})⟩ − 2βε_0 − ε_1.

The first and last inequalities invoke the assumptions laid out in Equation 4.3. The second inequality follows from the defining character of the planning oracle: with respect to the reward vector r_t = ∇R(d̂_{π_mix,t}), for any policy π′ ∈ Π it holds that

V_{π_{t+1}} = ⟨d_{π_{t+1}}, r_t⟩ ≥ V_{π′} − ε_1 = ⟨d_{π′}, r_t⟩ − ε_1.

In particular, this statement holds⁵ for the choice π′ = π*. Using the above fact and continuing,

R(d_{π_mix,t+1}) ≥ R(d_{π_mix,t}) + η⟨d_{π*} − d_{π_mix,t}, ∇R(d_{π_mix,t})⟩ − 2ηβε_0 − ηε_1 − η²β
  ≥ (1 − η)R(d_{π_mix,t}) + ηR(d_{π*}) − 2ηβε_0 − ηε_1 − η²β.

⁴ See Section 2.1 in (Bubeck et al., 2015) for equivalent definitions of smoothness in terms of the function value and the Hessian.
⁵ Even when Π is chosen to be the set of all stationary policies, this argument does not rely on π* being a stationary policy, since π_{t+1} is an optimal policy for the reward function r_t among the class of all policies.
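As an aside before the proof resumes, the oracle-based loop of Algorithm 1 is short enough to write down directly as code. The sketch below assumes black-box callables plan_oracle, density_oracle, and grad_R standing in for ApproxPlan, DensityEst, and the gradient ∇R of the reward functional; their exact signatures, and the representation of the mixture as parallel lists, are our assumptions.

```python
import numpy as np

def max_ent_policy_computation(plan_oracle, density_oracle, grad_R,
                               pi0, T, eta, eps0, eps1):
    """Conditional-gradient (Frank-Wolfe style) loop of Algorithm 1 over a
    growing mixture of policies, returned as (policies, mixture weights)."""
    policies = [pi0]            # C_0 = {pi_0}, an arbitrary initial policy
    alphas = np.array([1.0])    # alpha_0 = 1
    for t in range(T):
        # Step 5: estimate the state distribution of the current mixture.
        d_hat = density_oracle(policies, alphas, eps0)
        # Step 6: the planner's reward is the gradient of R at that estimate.
        r_t = grad_R(d_hat)
        # Step 7: best response to r_t, up to planning error eps1.
        pi_next = plan_oracle(r_t, eps1)
        # Step 8: append it with weight eta, downweighting the old mixture.
        policies.append(pi_next)
        alphas = np.append((1.0 - eta) * alphas, eta)
    return policies, alphas
```

With the tabular stand-ins sketched earlier (adapted so the density oracle averages the per-policy distributions with weights α, as in Equation 2.6) and grad_R = grad_entropy, this loop gives a tabular instance of the procedure that Theorem 4.1 and Corollary 4.2 analyze, up to the oracle errors ε_0 and ε_1.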
The last step here utilizes the concavity of R; indeed, the inequality follows immediately from the sub-gradient characterization of concave functions. Now, with the aid of the above, we observe the following inequality:

R(d_{π*}) − R(d_{π_mix,t+1}) ≤ (1 − η)(R(d_{π*}) − R(d_{π_mix,t})) + 2ηβε_0 + ηε_1 + η²β.

Algorithm 2 Sample-based planning for an unknown MDP.
1: Input: Reward r, error tolerance ε > 0, exact planning oracle tolerance ε_1 > 0, oversampling parameter m, number of rollouts n, rollout length t_0.
2: Initialize a persistent data structure C ∈ R^{|S|²×|A|}, which is maintained across different calls to the planning algorithm to keep transition counts, to C(s′|s, a) = 0 for every (s′, s, a) ∈ S² × A.
3: repeat
4:   Declare K = {s : min_{a∈A} Σ_{s′∈S} C(s′|s, a) ≥ m}, and set
         P̂(s′|s, a) = C(s′|s, a) / Σ_{s′∈S} C(s′|s, a)   if s ∈ K.
7:   Run π on the true MDP M to obtain n independently sampled t_0-length trajectories (τ_1, . . . , τ_n), and increment the corresponding counts in C(s′|s, a).
8:   If and only if no trajectory τ_i contains a state s ∈ S − K, mark π as stable.
9: until π is stable.
10: return π.

Algorithm 3 Sample-based estimate of the state distribution.
1: Input: A policy π, termination length t_0, oversampling parameter m.
2: Sample m trajectories (τ_0, . . . , τ_{m−1}) of length t_0 following the policy π.
3: For every t < t_0, calculate the empirical state distribution d̂_{t,π}:
         d̂_{t,π}(s) = |{i < m : τ_i = (s_0, a_0, . . .) with s_t = s}| / m.
4: return d̂_π = ((1 − γ) / (1 − γ^{t_0})) Σ_{t=0}^{t_0−1} γ^t d̂_{t,π}.

The total number of episodes sampled across all the invocations is n(T + m|S||A|) = Õ( BT/ε + B³|S|²|A| / (ε³(1 − γ)²) ), each episode being of length t_0.

… steps, and is then able to reset. Our measure of sample complexity in this setting is the number of Õ((1 − γ)⁻¹)-length episodes the agent must sample to achieve an ε-suboptimal performance guarantee.

Lemma 4.7. For any ε_0, δ > 0, when Algorithm 3 is run with m = (200/ε_0²) log(2|S| t_0 / δ) and t_0 = log(0.1ε_0) / log γ, the estimate d̂_π satisfies ‖d̂_π − d_π‖_∞ ≤ ε_0 with probability at least 1 − δ. In this process, the algorithm samples m episodes of length t_0.

Proof of Theorem 4.4. The claim follows immediately from the invocations of the two lemmas above with the parameter settings proposed in Theorem 4.1.

5. Proof of Concept Experiments

5.1. Environments and density estimation

Pendulum. The 2-dimensional state space for Pendulum (from (Brockman et al., 2016)) was discretized evenly into an 8 × 8 grid. The dimensions represent angular position and angular velocity. For Pendulum, the maximum torque and velocity were capped at 1.0 and 7.0, respectively.

Ant. The 29-dimensional state space for Ant (with a MuJoCo engine) was first reduced to dimension 7, combining the agent's x and y locations in the gridspace with a 5-dimensional random projection of the remaining 27 states. The x and y dimensions were discretized into 16 bins in the range [−12, 12]. The other dimensions were each normalized and discretized into 15 bins in the range [−1, 1]. While the planning agent had access to the full state representation, the density estimation was performed exclusively on the reduced representation.

Humanoid. The 376-dimensional state space for the MuJoCo Humanoid environment was too large and complex to discretize effectively without significantly affecting the accuracy of estimation. In view of this, the agent's state-space density was estimated using kernel density estimation (from scikit-learn (Pedregosa et al., 2011)) with an Epanechnikov kernel of bandwidth 0.10.

⁶ The open-source implementations may be found at https://github.com/abbyvansoest/maxent.
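To complement the pseudocode above, here is a minimal Monte Carlo sketch of Algorithm 3. It assumes a callable sample_trajectory that rolls out the policy for t_0 steps in the environment and returns the sequence of visited state indices; that interface, and the dense count matrix, are our assumptions rather than details fixed by the paper.

```python
import numpy as np

def sample_state_distribution(sample_trajectory, n_states, gamma, t0, m):
    """Monte Carlo estimate of the truncated discounted state distribution
    (Algorithm 3): empirical d_hat_{t,pi} for each t < t0, then a discounted
    average normalized by (1 - gamma) / (1 - gamma**t0)."""
    counts = np.zeros((t0, n_states))
    for _ in range(m):
        states = sample_trajectory(t0)      # t0 visited state indices
        for t, s in enumerate(states):
            counts[t, s] += 1
    d_t = counts / m                        # row t is the empirical d_hat_{t,pi}
    weights = (1.0 - gamma) * gamma ** np.arange(t0) / (1.0 - gamma ** t0)
    return weights @ d_t                    # d_hat_pi, a probability vector over states
```

Each of the m rollouts contributes one state per timestep, so every row of d_t sums to one and the weights sum to one, making the returned estimate a proper distribution; Lemma 4.7 quantifies how large m and t_0 must be for it to be ε_0-accurate in the infinity norm.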
Figure 2. Results of the preliminary experiments. In each plot, blue represents the MaxEnt agent and orange represents the random baseline. Panels 2a, 2b, and 2c show the entropy of the policy evolving with the number of epochs. Panels 2d, 2e, and 2f show the log-probability of occupancy of the two-dimensional state space. In 2e and 2f, the infinite xy grid is limited to [−20, 20] × [−20, 20].
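For the continuous Humanoid states behind the occupancy plots in Figure 2, Section 5.1 reports using scikit-learn's kernel density estimation with an Epanechnikov kernel of bandwidth 0.10. The sketch below shows that setup on placeholder data; the helper function, the random inputs, and the query pattern are illustrative assumptions, with only the kernel choice and bandwidth taken from the text.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_state_density(visited_states, bandwidth=0.10):
    """Fit a KDE over visited (reduced) state vectors, one row per visit."""
    kde = KernelDensity(kernel="epanechnikov", bandwidth=bandwidth)
    kde.fit(visited_states)
    return kde

# Placeholder usage: 5000 fake 7-dimensional reduced states.
rng = np.random.default_rng(0)
visits = rng.normal(size=(5000, 7))
kde = fit_state_density(visits)
log_p = kde.score_samples(visits[:10])   # log-density at the first ten states
```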
Acknowledgements

Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, the DARPA award FA8650-18-2-7836, and the ONR award N00014-18-1-2247. The authors thank Shie Mannor for helpful discussions.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.

Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29, pp. 1471–1479, 2016.

Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, 2005.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018a.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018b.

Chentanez, N., Barto, A. G., and Singh, S. P. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 17, pp. 1281–1288. MIT Press, 2005.

Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818–2826, 2015.

De Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pp. 1466–1475, 2018.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics (NRL), 3(1-2):95–110, 1956.

Fu, J., Co-Reyes, J., and Levine, S. EX2: Exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 2577–2587, 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018.

Houthooft, R., Chen, X., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems 29, pp. 1109–1117, 2016.

Kaelbling, L. P. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambery, France, 1993. Morgan Kaufmann.

Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Lattimore, T. and Hutter, M. Near-optimal PAC bounds for discounted MDPs. Theoretical Computer Science, 558:125–143, 2014.

Lim, S. H. and Auer, P. Autonomous exploration for navigating in MDPs. In Conference on Learning Theory, pp. 40.1–40.24, 2012.

Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems 25, pp. 206–214, 2012.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mohamed, S. and Jimenez Rezende, D. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 28, pp. 2125–2133, 2015.

Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. CoRR, abs/1807.04742, 2018.

Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Ng, A. Y., Harada, D., and Russell, S. J. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29, pp. 4026–4034, 2016.

Osband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8625–8637, 2018.

Ostrovski, G., Bellemare, M. G., van den Oord, A., and Munos, R. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, pp. 2721–2730, 2017.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Peters, J., Mulling, K., and Altun, Y. Relative entropy policy search. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Ross, S., Gordon, G. J., and Bagnell, J. A. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T. P., and Gelly, S. Episodic curiosity through reachability. CoRR, abs/1810.02274, 2018.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Singh, S., Lewis, R. L., and Barto, A. G. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society (CogSci), 2009.

Singh, S. P., Lewis, R. L., Barto, A. G., and Sorg, J. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2:70–82, 2010.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888. ACM, 2006.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Szita, I. and Szepesvári, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038, 2010.

Tang, H., Houthooft, R., Foote, D., Stooke, A., Xi Chen, O., Duan, Y., Schulman, J., DeTurck, F., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 2753–2762, 2017.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376, 2007.

Warde-Farley, D., de Wiele, T. V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. CoRR, abs/1811.11359, 2018.