arXiv:1808.09940
Deep Reinforcement Learning in Portfolio Management
Zhipeng Liang∗†, Kangkang Jiang∗†, Hao Chen∗†, Junhao Zhu∗†, Yanran Li∗†
∗ Likelihood Technology
† Sun Yat-sen University
Abstract—In this paper, we implement two state-of-the-art continuous reinforcement learning algorithms, Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO), in portfolio management. Both are widely used in game playing and robot control. What's more, PPO has appealing theoretical properties that make it a promising candidate for portfolio management. We present their performance under different settings, including different learning rates, objective functions, markets and feature combinations, in order to provide insights for parameter tuning, feature selection and data preparation.

Index Terms—Reinforcement Learning; Portfolio Management; Deep Learning; DDPG; PPO

I. INTRODUCTION

Utilizing deep reinforcement learning in portfolio management is gaining popularity in the area of algorithmic trading. However, deep learning is notorious for its sensitivity to neural network structure, feature engineering and so on. Therefore, in our experiments we explored the influence of different optimizers and network structures on trading agents built with two deep reinforcement learning algorithms, deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO). Our experiments were conducted on datasets from the China and America stock markets. Our code can be viewed on GitHub¹.

II. SUMMARY

This paper is mainly composed of three parts. First, portfolio management concerns optimal asset allocation over time, seeking high return as well as low risk. Several major categories of portfolio management approaches have been proposed, including "Follow-the-Winner", "Follow-the-Loser", "Pattern-Matching" and "Meta-Learning Algorithms". Deep reinforcement learning is in fact a combination of "Pattern-Matching" and "Meta-Learning" [1].

Reinforcement learning is a way to learn by interacting with an environment and gradually improving performance by trial and error, and it has been proposed as a candidate for a portfolio management strategy. Xin Du et al. applied Q-learning and policy gradient and found that the direct reinforcement (policy search) algorithm enables a simpler problem representation than value-function-based search algorithms [2]. Saud Almahdi et al. extended recurrent reinforcement learning and built an optimal variable-weight portfolio allocation under the expected maximum drawdown [3]. Xiu Gao et al. used absolute profit and relative risk-adjusted profit as performance functions to train the system and employed a committee of two networks, which was found to generate appreciable profits from trading in the foreign exchange markets [4].

Thanks to the development of deep learning, well known for its ability to detect complex features in speech recognition and image identification, the combination of reinforcement learning and deep learning, so-called deep reinforcement learning, has achieved great performance in robot control and game playing with little feature engineering and can be implemented end to end [5]. Function approximation has long been an approach to solving large-scale dynamic programming problems [6]. Deep Q-Learning, which uses a neural network as an approximator of the Q-value function and a replay buffer for learning, achieves remarkable performance in playing different games without changing the network structure or hyperparameters [7]. Deep Deterministic Policy Gradient (DDPG), one of the algorithms we choose for our experiments, uses an actor-critic framework to stabilize the training process and achieve higher sampling efficiency [8]. Another algorithm, Proximal Policy Optimization (PPO), instead aims to derive monotone improvement of the policy [9].

Due to the complicated, nonlinear patterns and the low signal-to-noise ratio in financial market data, deep reinforcement learning is believed to have potential in this domain. Zhengyao Jiang et al. proposed a framework for deep reinforcement learning in portfolio management and demonstrated that it can outperform conventional portfolio strategies [10]. Yifeng Guo et al. refined the log-optimal strategy and combined it with reinforcement learning [11]. However, most previous works use American stock data, which says little about performance in the more volatile China stock market. What's more, few works investigated the influence of the scale of the portfolio or of combinations of different features. To take a closer look at the true performance and uncover pitfalls of reinforcement learning in portfolio management, we choose the mainstream algorithms DDPG and PPO and run intensive experiments using different hyperparameters, optimizers and so on.

The paper is organized as follows. In the second section we formally model the portfolio management problem. We show that the existence of transaction cost turns the problem from a pure prediction problem, whose globally optimal policy can be obtained by a greedy algorithm, into a computationally expensive dynamic programming problem. Most reinforcement learning algorithms focus on game playing and robot control, while we show that some key characteristics of portfolio management require modifications. In the third part we present our experimental setup, in which we introduce our data processing, our algorithms and our investigation into the effects of different hyperparameters on the accumulated portfolio value. In the fourth part we demonstrate our experimental results. In the fifth part we come to our conclusion and future work on deep reinforcement learning in portfolio management.

¹ https://github.com/qq303067814/Reinforcement-learning-in-portfolio-management-

We now formally model the portfolio management problem. The portfolio weight vector $w_{t-1}$ satisfies $\sum_i w_{i,t-1} = 1$. We assume the initial wealth is $P_0$. Definitions of state, action and reward in portfolio management are as below.

• State ($s$): one state includes the previous open, closing, high and low prices, volume, or some other financial indexes in a fixed window.

• Action ($a$): the desired allocation weights. $a_{t-1} = (a_{0,t-1}, a_{1,t-1}, \dots, a_{m,t-1})^T$ is the allocation vector at period $t-1$, subject to the constraint $\sum_{i=0}^{m} a_{i,t-1} = 1$. Due to the price movement within a day, the weight vector $a_{t-1}$ at the beginning of the day evolves into $w_{t-1}$ at the end of the day:

$$w_{t-1} = \frac{y_{t-1} \odot a_{t-1}}{y_{t-1} \cdot a_{t-1}}$$
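To make the weight-evolution formula above concrete, the following NumPy sketch (our illustration, not code from the paper's repository) computes how an allocation vector $a_{t-1}$ drifts into $w_{t-1}$ under a relative price vector $y_{t-1}$; the function and variable names are ours.

```python
import numpy as np

def evolve_weights(a, y):
    """Portfolio weights at the end of the day, w = (y * a) / (y . a).

    a : allocation chosen at the start of the period (sums to 1;
        index 0 is cash, whose relative price is 1).
    y : relative price vector, y_i = close_i(t) / close_i(t - 1).
    """
    a = np.asarray(a, dtype=float)
    y = np.asarray(y, dtype=float)
    return (y * a) / np.dot(y, a)

# Example: cash plus two stocks; the stock that rallies ends the day over-weighted.
a = np.array([0.2, 0.4, 0.4])
y = np.array([1.00, 1.10, 0.95])
print(evolve_weights(a, y))  # approximately [0.196, 0.431, 0.373]
```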
A. Deep Deterministic Policy Gradient

DDPG is an actor-critic algorithm using a neural network as its function approximator, based on Deterministic Policy Gradient algorithms [12]. To illustrate its idea, we briefly introduce Q-learning and policy gradient and then come to DDPG.

Q-learning is reinforcement learning based on a Q-value function. To be specific, a Q-value function gives the expected accumulated reward when executing action $a$ in state $s$ and following policy $\pi$ afterwards:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}[R_t \mid s_t, a_t]$$

Policy gradient methods, in contrast, directly optimize a parameterized policy $\pi_\theta$ to maximize the expected return:

$$\pi_{\theta}^{*} = \arg\max_{\pi_\theta} J(\pi_\theta) = \arg\max_{\pi_\theta} \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t \gamma^t r(s_t, a_t)\Big] = \arg\max_{\pi_\theta} \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)] = \arg\max_{\pi_\theta} \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$

In deep reinforcement learning, gradient descent is the most common method to optimize a given objective function, which is usually non-convex and high-dimensional. Taking the derivative of the objective function amounts to taking the derivative of the policy. Assuming the time horizon is finite, we can write the policy in product form:

$$\pi_\theta(\tau) = \pi_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

However, such a form is difficult to differentiate with respect to $\theta$. To make it more tractable, a transformation turns it into summation form:

$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$$

$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \Big(\log p(s_1) + \sum_{t=1}^{T} \big(\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big)\Big) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \qquad (1)$$

Therefore, we can rewrite the differentiation of the objective function in terms of the logarithm of the policy:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} \gamma^t r(s_t, a_t)\Big)\Big]$$
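As a concrete companion to the log-derivative trick in Eq. (1), here is a minimal PyTorch sketch (ours, not the authors' code) of the resulting Monte-Carlo policy-gradient estimator for a single trajectory; the Gaussian policy head in the usage comment is an assumed placeholder.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Score-function estimator from Eq. (1):
    grad_theta J ~= (sum_t grad_theta log pi(a_t|s_t)) * (sum_t gamma^t r_t).

    log_probs : 1-D tensor of log pi_theta(a_t|s_t) along one trajectory
                (must carry gradients w.r.t. the policy parameters).
    rewards   : list or 1-D tensor of rewards r(s_t, a_t).
    Minimizing the returned scalar performs gradient ascent on J.
    """
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
    discounted_return = (discounts * rewards).sum()
    return -(log_probs.sum() * discounted_return)

# Usage sketch with a Gaussian policy head (mu, sigma produced by a policy network):
# dist = torch.distributions.Normal(mu, sigma)
# actions = dist.sample()
# loss = reinforce_loss(dist.log_prob(actions).sum(-1), rewards)
```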
In deep deterministic policy gradient, four networks are required: an online actor, an online critic, a target actor and a target critic. Combining Q-learning and policy gradient, the actor is the function $\mu$ and the critic is the Q-value function. The agent observes a state, and the actor provides an "optimal" action in the continuous action space. The online critic then evaluates the actor's proposed action and is used to update the online actor. The target actor and target critic, in turn, are used to update the online critic.

Formally, the update scheme of DDPG is as below. For the online actor:

$$\nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t \sim \rho^\beta}\big[\nabla_{\theta^\mu} Q(s, a \mid \theta^Q)\big|_{s = s_t,\, a = \mu(s_t \mid \theta^\mu)}\big] = \mathbb{E}_{s_t \sim \rho^\beta}\big[\nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_t,\, a = \mu(s_t)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s = s_t}\big]$$

For the online critic, the update rule is similar. The target actor and target critic are updated softly from the online actor and online critic. We leave the details to the presentation of the algorithm:

Algorithm 1 DDPG
1: Randomly initialize actor $\mu(s \mid \theta^\mu)$ and critic $Q(s, a \mid \theta^Q)$
2: Create target networks $Q'$ and $\mu'$ with $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
3: Initialize replay buffer $R$
4: for $i = 1$ to $M$ do
5:  Initialize an Ornstein-Uhlenbeck process $\mathcal{N}$ for exploration
6:  Receive initial observation state $s_1$
7:  for $t = 1$ to $T$ do
8:   Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$
9:   Execute action $a_t$ and observe $r_t$ and $s_{t+1}$
10:   Save transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
11:   Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
12:   Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
13:   Update the critic by minimizing the loss $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i \mid \theta^Q))^2$
14:   Update the actor policy by the policy gradient:
    $$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i,\, a = \mu(s_i \mid \theta^\mu)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$$
15:   Update the target networks:
    $$\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}$$
16:  end for
17: end for
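The following PyTorch sketch illustrates one DDPG update step corresponding to lines 12-15 of Algorithm 1. It is a minimal sketch under assumptions: `actor`, `critic`, their target copies and the two optimizers are pre-built torch modules and optimizers supplied by the caller, and the minibatch is already packed into tensors; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=1e-2):
    """One DDPG step on a sampled minibatch (lines 12-15 of Algorithm 1)."""
    s, a, r, s_next = batch  # tensors: states, actions, rewards, next states

    # Line 12: bootstrapped target y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Line 13: critic regression onto the TD target
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Line 14: deterministic policy gradient, i.e. maximize Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Line 15: soft update of the target networks
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```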
B. Proximal Policy Optimization

Most algorithms for policy optimization can be classified into three broad categories: (1) policy iteration methods, (2) policy gradient methods and (3) derivative-free optimization methods. Proximal Policy Optimization (PPO) falls into the second category. Since PPO is based on Trust Region Policy Optimization (TRPO) [13], we introduce TRPO first and then PPO.

TRPO finds a lower bound for policy improvement so that policy optimization can work with a surrogate objective function. This guarantees monotone improvement in policies. Formally, let $\pi$ denote a stochastic policy $\pi: S \times A \to [0, 1]$, which gives a distribution over the continuous action space in a given state to represent every action's fitness. Let

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \qquad s_0 \sim \rho_0(s_0),\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$$

We follow the standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$ and the advantage function $A_\pi$:

$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big]$$

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big]$$

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$$

The expected return of another policy $\tilde{\pi}$ over $\pi$ can be expressed in terms of the advantage accumulated over timesteps:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]$$

The above equation can be rewritten in terms of states:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a) = \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)$$

where $\rho_{\tilde{\pi}}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$ denotes the discounted visitation frequency of state $s$ under policy $\tilde{\pi}$.

However, the dependence on policy $\tilde{\pi}$ makes the equation difficult to compute. Instead, TRPO proposes the following local approximation:

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)$$

The lower bound on policy improvement, one of the key results of TRPO, provides a theoretical guarantee of monotonic policy improvement:

$$\eta(\pi_{\text{new}}) \ge L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2$$

where $\epsilon = \max_{s,a} |A_\pi(s, a)|$ and $\alpha = D_{TV}^{\max}(\pi_{\text{old}}, \pi_{\text{new}}) = \max_s D_{TV}(\pi_{\text{old}}(\cdot \mid s) \,\|\, \pi_{\text{new}}(\cdot \mid s))$. Here $D_{TV}(p \,\|\, q) = \frac{1}{2} \sum_i |p_i - q_i|$ is the total variation divergence between two discrete probability distributions.

Since $D_{KL}(p \,\|\, q) \ge D_{TV}(p \,\|\, q)^2$, we can derive the following inequality, which is used in the construction of the algorithm:

$$\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4 \epsilon \gamma}{(1 - \gamma)^2}$$

The proofs of the above results are available in [13].

To go further into the details, let $M_i(\pi) = L_{\pi_i}(\pi) - C D_{KL}^{\max}(\pi_i, \pi)$. Two properties follow without much difficulty:

$$\eta(\pi_i) = M_i(\pi_i), \qquad \eta(\pi_{i+1}) \ge M_i(\pi_{i+1})$$

Therefore, we can give a lower bound on the policy improvement:

$$\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$$

Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing. Considering parameterized policies $\pi_{\theta_i}$, the policy optimization can be turned into:

$$\max_{\pi_{\theta_i}} \big[L_{\pi_{\theta_{i-1}}}(\pi_{\theta_i}) - C D_{KL}^{\max}(\pi_{\theta_{i-1}}, \pi_{\theta_i})\big]$$

However, the penalty coefficient $C$ from the theoretical result would give the policy update step sizes that are too small. In the final TRPO algorithm, an alternative optimization problem is proposed after careful consideration of the structure of the objective function:

$$\max_{\pi_{\theta_i}} L_{\pi_{\theta_{i-1}}}(\pi_{\theta_i}) \quad \text{s.t.} \quad D_{KL}^{\rho_{\pi_{\theta_{i-1}}}}(\pi_{\theta_{i-1}}, \pi_{\theta_i}) \le \delta$$

where $D_{KL}^{\rho}(\pi_{\theta_1}, \pi_{\theta_2}) = \mathbb{E}_{s \sim \rho}[D_{KL}(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s))]$.

Further approximations are needed to make the optimization tractable. Recall that the original optimization problem can be written as:

$$\max_{\pi_{\theta_i}} \sum_s \rho_{\pi_{\theta_{i-1}}}(s) \sum_a \pi_{\theta_i}(a \mid s) A_{\theta_{i-1}}(s, a)$$

After some approximations, including importance sampling, the final optimization becomes:

$$\max_{\pi_{\theta_i}} \mathbb{E}_{s \sim \rho_{\pi_{\theta_{i-1}}},\, a \sim q}\Big[\frac{\pi_{\theta_i}(a \mid s)}{q(a \mid s)} A_{\pi_{\theta_{i-1}}}(s, a)\Big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_{\theta_{i-1}}}}[D_{KL}(\pi_{\theta_{i-1}}(\cdot \mid s) \,\|\, \pi_{\theta_i}(\cdot \mid s))] \le \delta$$

PPO replaces this constrained problem with a clipped surrogate objective,

$$L^{CLIP}(\theta) = \mathbb{E}\big[\min\big(r(\theta) A,\; \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) A\big)\big]$$

where $r(\theta)$ denotes the probability ratio $\pi_\theta(a \mid s) / \pi_{\theta_{\text{old}}}(a \mid s)$. This new surrogate objective function constrains the update step in a much simpler manner, and experiments show that it outperforms the original objective function in terms of sample complexity.
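To show how the clipped surrogate is typically computed in practice, here is a short PyTorch sketch (our illustration, not the paper's code); the log-probabilities come from the current and old policies, and the advantage estimates are treated as constants.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate L^CLIP(theta) = E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    with r(theta) = pi_theta(a|s) / pi_theta_old(a|s).

    All arguments are 1-D tensors over a batch of (s, a) pairs;
    log_prob_old and advantage are detached so they act as constants.
    """
    ratio = torch.exp(log_prob_new - log_prob_old.detach())
    adv = advantage.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Negated so that minimizing this loss maximizes the surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```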
Algorithm 2 PPO
1: Initialize actor $\mu: S \to \mathbb{R}^{m+1}$ and $\sigma: S \to \text{diag}(\sigma_1, \sigma_2, \dots, \sigma_{m+1})$
2: for $i = 1$ to $M$ do
3:  Run policy $\pi_\theta \sim \mathcal{N}(\mu(s), \sigma(s))$ for $T$ timesteps and collect $(s_t, a_t, r_t)$
4:  Estimate advantages $\hat{A}_t = \sum_{t' > t} \gamma^{t' - t} r_{t'} - V(s_t)$
5:  Update old policy $\pi_{\text{old}} \leftarrow \pi_\theta$
6:  for $j = 1$ to $N$ do
7:   Update the actor policy by the policy gradient $\sum_i \nabla_\theta L_i^{CLIP}(\theta)$
8:   Update the critic by $\nabla L(\phi) = -\sum_{t=1}^{T} \nabla_\phi \hat{A}_t^2$
9:  end for
10: end for
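Line 4 of Algorithm 2 estimates the advantage as discounted reward-to-go minus a value baseline. A small NumPy sketch of that estimator follows (ours; `values` would come from the critic network, and we include $r_t$ itself in the reward-to-go, which is the usual convention).

```python
import numpy as np

def advantages(rewards, values, gamma=0.99):
    """A_hat_t = (discounted reward-to-go from t) - V(s_t), as in Algorithm 2.

    rewards, values : 1-D arrays of length T collected along one rollout.
    """
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # discounted reward-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values, dtype=float)
```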
V. EXPERIMENTS

A. Data preparation

Our experiments are conducted on China stock data and America stock data from investing.com, Wind and Shinging-Midas Private Fund. We select two baskets of stocks with low or even negative correlation from these markets to demonstrate our agent's capability to allocate between different assets. In order to uphold our assumptions, we choose stocks with large volume so that our trades would not affect the market. In the China stock market, we choose 18 stocks in order to test the algorithm in a large-scale portfolio management setting. In the America stock market we choose 6 stocks. What's more, we choose the last 3 years as our training and testing period, with 2015/01/01-2016/12/31 as the training period and 2017/01/01-2018/01/01 as the testing period. The stock codes we select are as follows:
market  code      market  code
China   000725    USA     AAPL
China   000002    USA     ADBE
China   600000    USA     BABA
China   000862    USA     SNE
China   600662    USA     V
China   002066
China   600326
China   000011
China   600698
China   600679
China   600821
China   600876
China   600821
China   000151
China   000985
China   600962

In order to derive a general agent which is robust across different stocks, we normalize the price data. To be specific, we divide the opening price, closing price, high price and low price by the closing price on the last day of the period. For missing data, which occurs during weekends and holidays, in order to maintain time-series consistency we fill the empty price data with the closing price of the previous day, and we also set the volume to 0 to indicate that the market is closed on that day.
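A minimal pandas sketch of the preprocessing just described (our own, with an assumed per-stock DataFrame layout): non-trading days are filled with the previous closing price, volume is set to 0 on those days, and prices are divided by the closing price on the last day of the period.

```python
import pandas as pd

def preprocess(df, calendar):
    """Normalize one stock's prices and fill non-trading days.

    df       : DataFrame indexed by date with columns open, close, high, low,
               volume (a hypothetical layout; the paper's repository may differ).
    calendar : complete DatetimeIndex covering the training/testing period.
    """
    df = df.reindex(calendar)                 # expose weekend/holiday gaps as NaN
    prev_close = df["close"].ffill()          # last observed closing price
    for col in ["open", "high", "low", "close"]:
        df[col] = df[col].fillna(prev_close)  # fill prices with the previous close
    df["volume"] = df["volume"].fillna(0)     # volume 0 marks a closed market
    price_cols = ["open", "close", "high", "low"]
    df[price_cols] = df[price_cols] / df["close"].iloc[-1]  # divide by last close
    return df
```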
B. Network structure

Motivated by Jiang et al., we use so-called Identical Independent Evaluators (IIE). IIE means that the network flows independently for the m+1 assets while the network parameters are shared among these streams. The network evaluates one stock at a time and outputs a scalar to represent its preference for investing in this asset. Then the m+1 scalars are normalized by a softmax function and compressed into a weight vector as the next period's action. IIE has some crucial advantages over an integrated network, including scalability in portfolio size, data-usage efficiency and plasticity to the asset collection. The explanation can be reviewed in [10] and we are not going to repeat it here.

We find that in other works on deep learning in portfolio management, CNNs outperform RNNs and LSTMs in most cases. However, different from Jiang et al., we replace the CNN with a deep residual network. The depth of a neural network plays an important role in its performance. However, a conventional CNN is prevented from going deeper by gradient vanishing and gradient explosion as the depth of the network increases. A deep residual network solves this problem by adding shortcuts that allow layers to connect directly to deeper layers, which prevents the network from deteriorating as the depth grows. Deep residual networks have achieved remarkable performance in image recognition and have greatly contributed to the development of deep learning.
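As a rough illustration of the shared-evaluator idea, the sketch below scores every asset with the same small network and turns the m+1 scores into portfolio weights with a softmax. The class name, layer sizes and input layout are our assumptions for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class SharedEvaluator(nn.Module):
    """Scores each asset independently with shared parameters (IIE-style)."""

    def __init__(self, n_features, window):
        super().__init__()
        self.score = nn.Sequential(
            nn.Flatten(start_dim=2),          # (batch, assets, window * features)
            nn.Linear(window * n_features, 32),
            nn.ReLU(),
            nn.Linear(32, 1),                 # one scalar preference per asset
        )

    def forward(self, x):
        # x: (batch, assets, window, features); the same weights are applied to
        # every asset stream, then a softmax yields the allocation weights.
        scores = self.score(x).squeeze(-1)    # (batch, assets)
        return torch.softmax(scores, dim=-1)

# Usage sketch: weights = SharedEvaluator(4, 10)(torch.randn(8, 7, 10, 4))
```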
TABLE II
HYPERPARAMETERS IN OUR EXPERIMENTS

               |        DDPG        |              PPO
               | Actor   | Critic   | Actor           | Critic
Optimizer      | Adam    | Adam     | GradientDescent | GradientDescent
Learning Rate  | 10^-3   | 10^-1    | 10^-3           | 10^-3
τ              | 10^-2   | 10^-2    | 10^-2           | 10^-2

C. Result

1) Learning rate: The learning rate plays an essential role in neural network training. However, it is also very subtle. A high learning rate makes the training loss decrease fast at the beginning but occasionally drop into a local minimum, or even oscillate around the optimal solution without reaching it. A low learning rate makes the training loss decrease very slowly even after a large number of epochs. Only a proper learning rate can help the network achieve a satisfactory result. Therefore, we implement DDPG and test it using different learning rates. The results show that learning rates have a significant effect on critic loss even though the actor's learning rate does not directly control the critic's training. We find that when the actor learns new patterns, the critic loss jumps. This indicates that the critic does not have sufficient generalization ability towards new states. Only when the actor becomes stable can the critic loss decrease.

Fig. 6. Critic loss under different critic learning rates

2) Risk: Due to the limitation of training data, our reinforcement learning agent may underestimate risk when trained in a bull market, which may cause disastrous deterioration of its performance in a real trading environment. Different approaches in finance can help evaluate the current portfolio risk to alleviate the effect of biased training data. Inspired by Almahdi et al., in which the objective function is risk-adjusted, and by Jacobsen et al., which shows that volatility clusters over time, we modify our objective function as follows:

$$R = \sum_{t=1}^{T} \gamma^t \big(r(s_t, a_t) - \beta \sigma_t^2\big)$$

where $\sigma_t^2 = \frac{1}{L} \sum_{t'=t-L+1}^{t} \sum_{i=1}^{m+1} (y_{i,t'} - \bar{y}_{i,t'})^2 \cdot w_{i,t}$ and $\bar{y}_{i,t'} = \frac{1}{L} \sum_{t'=t-L+1}^{t} y_{i,t'}$ measure the volatility of the returns of asset $i$ over the last $L$ days. The objective function is thus constrained by reducing the profit from investing in highly volatile assets, which would expose our portfolio to excessive risk.
Fig. 7. Comparison of portfolio value with different risk penalties (β)

Fig. 8. Comparison of critic loss with different feature combinations