Developing A Multi-Agent and Self-Adaptive Framework With Deep Reinforcement Learning For Dynamic Portfolio Risk Management
arXiv:2402.00515v3 [q-fin.PM] 5 Sep 2024

ABSTRACT
Deep or reinforcement learning (RL) approaches have been adapted as reactive agents to quickly learn and respond with new investment strategies for portfolio management under the highly turbulent financial market environments in recent years. In many cases, due to the very complex correlations among various financial sectors, and the fluctuating trends in different financial markets, a deep or reinforcement learning based agent can be biased in maximising the total returns of the newly formulated investment portfolio while neglecting its potential risks under the turmoil of various market conditions in the global or regional sectors. Accordingly, a multi-agent and self-adaptive framework namely the MASA is proposed in which a sophisticated multi-agent reinforcement learning (RL) approach is adopted through two cooperating and reactive agents to carefully and dynamically balance the trade-off between the overall portfolio returns and their potential risks. Besides, a very flexible and proactive agent as the market observer is integrated into the MASA framework to provide some additional information on the estimated market trends as valuable feedback for the multi-agent RL approach to quickly adapt to the ever-changing market conditions. The obtained empirical results clearly reveal the potential strengths of our proposed MASA framework based on the multi-agent RL approach against many well-known RL-based approaches on the challenging data sets of the CSI 300, Dow Jones Industrial Average and S&P 500 indexes over the past 10 years. More importantly, our proposed MASA framework sheds light on many possible directions for future investigation.

KEYWORDS
Deep Reinforcement Learning; Multi-Agent; Risk Management; Self-Adaptive; Portfolio Optimisation

ACM Reference Format:
Zhenglong Li, Vincent Tam, and Kwan L. Yeung. 2024. Developing A Multi-Agent and Self-Adaptive Framework with Deep Reinforcement Learning for Dynamic Portfolio Risk Management. In Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), Auckland, New Zealand, May 6 – 10, 2024, IFAAMAS, 9 pages.

This work is licensed under a Creative Commons Attribution International 4.0 License.
Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), N. Alechina, V. Dignum, M. Dastani, J.S. Sichman (eds.), May 6 – 10, 2024, Auckland, New Zealand. © 2024 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org).

1 INTRODUCTION
Computational Finance (CF) [12, 28, 37] is a very active research area involving the studies of computational approaches to tackle many different challenging and practical problems in Finance. Conventionally, algorithmic methods had been employed to simulate various investment strategies and their plausible results in the financial markets. Recently, many researchers have tried to explore the potential uses of machine learning approaches [31] including support vector machines [35], deep learning (DL) or reinforcement learning (RL) approaches [17, 39, 42, 43] in a diversity of real-world applications [9, 25, 27, 46] in CF. Among these applications, DL or RL approaches such as the Twin Delayed DDPG (TD3) algorithm [10] for dynamic environments with continuous action spaces have been adapted as reactive agents [6] to quickly learn and respond with new investment strategies for portfolio management under the highly turbulent financial market environments in recent years. In many cases, due to the very complex correlations among various financial sectors, and the fluctuating trends in different financial markets, a deep or reinforcement learning based agent can be mainly focused on maximising the total returns of the newly formulated investment portfolio while ignoring its potential risks under the turmoil of various market conditions such as the unpredictable and sudden changes of the market trends frequently occurring in the global or regional sectors of financial markets, especially due to the COVID-19 pandemic, natural disasters brought by extreme weather, and local conflicts across different regions, etc.

To overcome the above pitfall, a multi-agent and self-adaptive framework namely the MASA is proposed in this work in which two cooperating and reactive agents are utilised to implement a radically new multi-agent RL scheme so as to carefully and dynamically balance the trade-off between the overall returns of the newly revised portfolio and their potential risks especially when the concerned financial markets are highly turbulent. The first cooperating agent is based on the TD3 algorithm targeted to optimise the overall returns of the current investment portfolio while the second intelligent agent is based on a complete constraint solver, or possibly any efficient local optimisers such as the evolutionary algorithms [36] or particle swarm optimisation (PSO) methods [19], trying to adjust the current investment portfolio in order to minimise its potential risks after considering the estimated market trend as provided by another adaptive agent as the market observer in the proposed MASA framework. Clearly, the multi-agent RL scheme of the proposed MASA framework may help to produce more balanced investment portfolios in terms of both portfolio returns and potential
risks with the clear division of works between the two cooperating agents to continuously learn and adapt from the underlying financial market environment. It is worth noting that multi-agent RL-based frameworks have been actually considered in some previous research studies. For instance, a TD3-based multi-agent deep reinforcement learning (DRL) approach [45] was investigated in a previous work to improve the function approximation error and complex mission adaptability through applications to the mixed cooperation-competition environment in a general perspective. Yet instead of relying on the complex and dual-centered Q-network to reduce the bias of function estimation as in the previous work, our proposal has uniquely focused on using the TD3-based agent to firstly optimise the overall returns of the newly revised portfolio, with some possibly under-estimated bias/error in its potential risks to be quickly rectified by the second solver-based agent using a loosely-coupled and pipelining computational model to tackle this specific and challenging problem of dynamic portfolio risk management in the real-world applications of CF. It should be noted that by adopting the loosely-coupled and pipelining computational model, the proposed MASA framework will become more resilient and reliable since the overall framework will continue to work even when any particular agent fails. Moreover, to make the proposed MASA framework more adaptive to the extremely volatile environments of financial markets, the market observer, as a very flexible and proactive agent, continuously provides the estimated market trends as valuable feedback for the other two cooperating agents to quickly adapt to the ever-changing market conditions. Undoubtedly, this simply highlights another key difference of our proposal on the multi-agent RL scheme when compared to those multi-agent RL-based frameworks examined in the previous studies. Furthermore, when the market observer agent is implemented as a deep neural network such as the multi-layer perceptron (MLP) [40] model, the resulting MASA framework can be extended as a DRL approach for dynamic portfolio management in CF.

To demonstrate the effectiveness of our proposal, a prototype of the proposed MASA framework is implemented in Python and tested on a GPU server installed with the Nvidia RTX 3090 GPU card. The attained empirical results demonstrate the potential strengths of our proposed MASA framework based on the multi-agent RL approach against many well-known RL-based approaches on the challenging data sets of the CSI 300, Dow Jones Industrial Average (DJIA) and S&P 500 indexes over the past 10 years. More importantly, our proposed MASA framework sheds light on many possible directions for our future investigation, including the exploration of utilising different meta-heuristic based optimisers such as the PSO for the solver-based agent, various machine learning approaches for the market observer agent, or the potential applications of the proposed MASA framework to various resource allocation, planning or disaster recovery problems in which the risk management is very critical.

2 THE PRELIMINARIES

2.1 Reinforcement Learning
As one of the active research areas in machine learning [11], RL [20] is mainly focused on how intelligent agents make rational decisions on actions based on specific observations in a possibly unexplored environment in order to maximise the cumulative rewards of the performed actions with respect to the underlying environment. The key focus of RL approaches is to strive for a balance between the exploration of the unknown environment and the exploitation of the current knowledge gained through the iterative learning process. The underlying environment is usually stated in the form of a partially observable Markov decision process (POMDP) [32] in which dynamic programming techniques [29] can be employed to solve the involved POMDP. Yet the RL approaches are targeted to handle large POMDPs where exact methods like the dynamic programming techniques may fail since the RL approaches do not need to assume any prior knowledge of the involved POMDP to represent the underlying environment. Clearly, the RL approaches are very suitable to explore the uncharted and also unpredictable environments of various financial markets when solving a diversity of real-world problems in CF.

In recent years, the RL approaches have attained remarkable successes for portfolio optimisation in which RL-based investment strategies have demonstrated adaptive and fast learning abilities to adjust the portfolios for maximising the overall returns after a targeted trading period. Among the numerous RL approaches, a successful example is the TD3 algorithm [10] as a model-free, online, off-policy reinforcement learning method. Generally speaking, a TD3 agent is an actor-critic reinforcement learning agent that is aimed to look for an optimal policy to maximise the expected cumulative long-term reward. For portfolio optimisation, the expected cumulative long-term reward of the TD3 agent can be straightforwardly formulated as the expected overall returns of the concerned portfolio after a specific trading period. Yet with the highly volatile financial market conditions, it can be difficult for most RL approaches to strive for a good balance between the intrinsically conflicting objectives of maximising returns and also minimising the risks of portfolios over a specific trading period. In most cases with turbulent market conditions, increasing the portfolio returns will likely increase the potential risks that may possibly lead to great and sudden losses of the investment portfolio in an extremely short period of time due to some unexpected crises.

2.2 Multi-Agent Systems
Multi-agent systems (MAS) [18] is a core and very active research area of artificial intelligence [47] in which many different perspectives and methodologies including the neural networks [13] or evolutionary algorithms [8] have been adopted and contributed to the latest development of MAS. In many real-world applications such as various challenging problems in CF, multiple intelligent agents may try to optimise their own returns and/or other objective(s), which may unavoidably collide with the interests of other investors with the same objective(s).
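The remark above, that the cumulative long-term reward for portfolio optimisation can be formulated as the overall return of the portfolio over the trading period, can be sketched numerically: if each per-step reward is the log growth rate of the portfolio value, the cumulative reward telescopes to the overall log return of the episode. A minimal NumPy sketch (the function name and setup are illustrative, not from the paper):

```python
import numpy as np

def episode_reward(portfolio_values):
    """Per-step reward as the log growth rate of the portfolio value;
    the cumulative reward then telescopes to the overall log return."""
    v = np.asarray(portfolio_values, dtype=float)
    step_rewards = np.log(v[1:] / v[:-1])   # reward at t: log(C_t / C_{t-1})
    return step_rewards, step_rewards.sum()

# The summed per-step log returns equal the overall log return:
values = [100.0, 103.0, 101.0, 108.0]
steps, total = episode_reward(values)
assert abs(total - np.log(values[-1] / values[0])) < 1e-12
```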
Besides, there are other research studies [3, 4] describing how MAS may facilitate the simulations in research studies of CF and Computational Economics. When modelled more precisely from the perspective of MAS, each agent in the multi-agent market simulation environment may find it difficult to learn a static investment strategy due to the fluctuating market dynamics. Thus, the involved agents may need to deploy intelligent algorithms capable of learning to compete well with adaptive mechanisms in the adversarial market environments. In addition, studying intelligent trading through such simulations from a multi-agent perspective can lead to many exciting research directions with possible findings of relevance to policy makers and investors. An example is the market simulator (MAXE) [3] to examine different types of agent behaviour, market rules and anomalies on market dynamics through the simulation of large-scale MAS.

It is worth noting that there are some previous research studies investigating the potential uses of RL-based algorithms in MAS for many applications. For instance, a TD3-based multi-agent DRL approach [45] was examined in a previous work to improve the function approximation error and complex mission adaptability through applications to the mixed cooperation-competition environment from a general perspective. Essentially, the TD3-based DRL approach makes use of the complex and dual-centered Q-network to reduce the bias of function estimation. On the other hand, our proposal of the MASA framework has focused on using the TD3-based agent to firstly optimise the overall returns of the newly revised portfolio while leaving the potential risks as the possibly under-estimated error to be effectively handled by the second solver-based agent with a loosely-coupled and pipelining mechanism for dynamic portfolio risk management in CF. Through adopting the loosely-coupled and pipelining computational model, the proposed MASA framework will become a dependable MAS with high availability and reliability since the resulting MAS will continue to work even when any specific agent fails.

2.3 Portfolio Optimisation in Computational Finance
Portfolio optimisation is a very challenging multi-objective optimisation problem in CF where the uncharted and highly volatile financial market environments can be difficult for many intelligent algorithms or well-known mathematical programming [34] approaches to tackle. Conventionally, many investors and researchers utilised specific financial indicators such as the moving averages [33] or the relative strength index [1], together with the heuristic or machine learning approaches including the follow-the-winner, follow-the-loser, pattern-matching or meta-learning algorithms [22], to try to capture the momentum of price changes. Recently, there have been many interesting research studies trying to apply DL or RL techniques to explore the turbulent and uncharted financial market environments. For instance, [25, 44] consider the news data as additional information for portfolio management while [17, 42] utilise specific modules as intelligent agents to carefully deal with the assets information and then capture the correlations among the involved assets.

To facilitate our subsequent discussion, some essential concepts including the portfolio value, both the short-term and long-term risks of a portfolio, etc. related to portfolio management in CF are given as below.

Definition 2.1. (Portfolio Value) The total value of a portfolio at time t is
$$C_t = \sum_{i=1}^{N} a_{t,i} \times p^{c}_{t,i}, \quad (1)$$
where N is the number of assets in a portfolio, a_{t,i} is the weight of the i-th asset, and p^c_{t,i} is the close price of the i-th asset at time t.

Accordingly, each investment portfolio is constrained as below:
$$\forall a_{t,i} \in A_t : a_{t,i} \geq 0, \qquad \sum_{i=1}^{N} a_{t,i} = 1, \quad (2)$$
where A_t ∈ A is the weight vector at time t. Clearly, the summation of all the allocation weights a_{t,i} for the total N assets of a complete portfolio should be 1.

Based on the well-known Markowitz model [29], both the short-term and long-term risks of a portfolio over a specific trading period can be defined in terms of the corresponding covariance-weighted risk and the volatility of strategies as follows.

Definition 2.2. (Short-term Portfolio Risk) The short-term portfolio risk σ_{p,t} at time t is defined as below:
$$\sigma_{p,t} = \sigma_{\beta} + \sigma_{\alpha,t}, \qquad \sigma_{\alpha,t} = \sqrt{A_t^{T} \Sigma_k A_t} = \|\Sigma_k A_t\|_2, \quad (3)$$
where σ_{α,t} is the trading strategy risk, σ_β is the market risk and A_t ∈ R^{N×1} is the vector of weights. The covariance matrix Σ_k ∈ R^{N×N} between any two assets can be calculated by the rate of daily returns of assets in the past k days.

Definition 2.3. (Long-term Portfolio Risk) The long-term portfolio risk Vol_p is defined as the strategy volatility, that is, the sample variance of the daily return rates r_{p,t} of a trading strategy over the whole trading period, where r̄_{p,t} is the average daily return rate:
$$Vol_p = \sqrt{\frac{252}{T-1} \sum_{t=1}^{T} \left( r_{p,t} - \bar{r}_{p,t} \right)^{2}}. \quad (4)$$

Besides, the following gives a formal definition of the Sharpe ratio as one of the most widely adopted performance measures on the risk-adjusted relative returns of a portfolio.

Definition 2.4. (Sharpe Ratio) The Sharpe Ratio (SR) is a performance indicator for evaluating a portfolio in terms of the total annualized returns R_p, risk-free rate r_f and annualized long-term portfolio risk Vol_p:
$$SR = \frac{R_p - r_f}{Vol_p}. \quad (5)$$

More importantly, it should be noted that the portfolio optimisation problem in CF is used in this work to demonstrate the feasibility of our proposed MASA framework for risk management under the highly volatile and unknown environments. In the future investigation, it would be interesting to explore how the multi-agent RL-based approach of the proposed MASA framework can be adapted to various planning or resource allocation problems under certain hostile and unknown environments such as those for disaster recovery or emergency management.
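Definitions 2.1–2.4 above translate directly into code. A hedged NumPy sketch (the function names are ours; the paper's prototype is in Python but its implementation is not shown here):

```python
import numpy as np

def portfolio_value(weights, close_prices):
    # Eq. (1): C_t = sum_i a_{t,i} * p^c_{t,i}
    return float(np.dot(weights, close_prices))

def is_valid_allocation(weights, tol=1e-9):
    # Eq. (2): non-negative weights summing to one
    w = np.asarray(weights, dtype=float)
    return bool(np.all(w >= -tol) and abs(w.sum() - 1.0) <= tol)

def strategy_risk(weights, daily_returns):
    # Eq. (3): sigma_alpha = sqrt(A^T Sigma_k A), with Sigma_k estimated
    # from the past k days of daily return rates (rows = days, cols = assets)
    cov = np.cov(np.asarray(daily_returns, dtype=float), rowvar=False)
    w = np.asarray(weights, dtype=float)
    return float(np.sqrt(w @ cov @ w))

def annualised_volatility(daily_strategy_returns):
    # Eq. (4): Vol_p = sqrt(252/(T-1) * sum_t (r_t - mean)^2)
    r = np.asarray(daily_strategy_returns, dtype=float)
    return float(np.sqrt(252.0 / (len(r) - 1) * np.sum((r - r.mean()) ** 2)))

def sharpe_ratio(annual_return, risk_free_rate, vol):
    # Eq. (5): SR = (R_p - r_f) / Vol_p
    return (annual_return - risk_free_rate) / vol
```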
3 THE PROPOSED MULTI-AGENT AND SELF-ADAPTIVE FRAMEWORK
To overcome the pitfall of the RL-based approaches to bias on optimising the investment returns, a multi-agent and self-adaptive framework namely the MASA is proposed in this work in which two cooperating and reactive agents, namely the RL-based and solver-based agents, are utilised to implement a radically new multi-agent RL scheme in order to dynamically balance the trade-off between the overall returns of the newly revised portfolio and potential risks especially when the financial markets are highly turbulent.

[Figure 1: The System Architecture of the Proposed MASA Framework. The trading market environment feeds states to the RL-based agent, whose softmax output a^RL is adjusted by the solver-based agent solving minimise ½xᵀPx + qᵀx s.t. Gx + s = h, s ≥ 0 into a^Ctrl and a^Final, guided by the market observer's σ_s and v_m, with the return reward J_r and action reward J_JS(a^RL‖a^Final) fed back from the environment.]

Figure 1 reviews the overall system architecture of the proposed MASA framework in which the RL-based agent is based on the TD3 algorithm to optimise the overall returns of the current investment portfolio while the solver-based agent is based on a complete constraint solver, or possibly any efficient local optimisers such as the evolutionary algorithms [36] or PSO methods [19], that works to further adapt the investment portfolio returned by the RL-based agent so as to minimise its potential risks after considering the estimated market trend as provided by the market observer of the proposed MASA framework. In essence, through the clear division of works between both RL-based and solver-based agents to continuously learn and adapt from the underlying financial market environment with the support of the market observer agent, the multi-agent RL scheme of the proposed MASA framework may help to attain more balanced investment portfolios in terms of both portfolio returns and potential risks when compared to those portfolios returned by the RL-based approaches. It is worth noting that the proposed MASA framework adopts a loosely-coupled and pipelining computational model among the three cooperating and intelligent agents, thus making the overall multi-agent RL-based approach more resilient and reliable since the overall framework will continue to work in the worst case of any individual agent failing.

In addition, to make the proposed MASA framework more adaptive to the extremely volatile environments of financial markets, the market observer agent will continuously provide the estimated market trends as valuable feedback for both RL-based and solver-based agents to quickly adapt to the ever-changing market conditions. Furthermore, when the market observer agent is implemented as a deep neural network such as the MLP [40] model, the resulting MASA framework can be extended as a multi-agent DRL approach for dynamic portfolio management in CF. The empirical evaluation results of the market observer agent implemented as an algorithmic approach [38], the MLP and other deep learning models are carefully analysed in Section 4.

The pseudo-code of the training procedure of the proposed MASA framework is shown in Algorithm 1 to illustrate how the 3 cooperating agents are working with each other to adaptively achieve the conflicting objectives of optimising returns and minimising risks in response to the possibly highly turbulent financial market conditions. Firstly, before the iterative training process is started, all the relevant information including the RL policy, the market state information stored in the market observer agent, etc. are initialised. During the training process, the current market state information o_t such as the most recent downward or upward trend of the underlying financial market over the past few trading days will be collected as the basic information for the subsequent computation of the market observer agent. Besides, the reward of the previously executed action a^Final_{t−1} will be computed as the feedback for the RL-based algorithm to revise its RL policy.

Algorithm 1 The Training Procedure of the MASA Framework
1: Input: T as the total number of trading days, MaxEpisode as the maximum number of episodes, the settings of the RL-based agent, and the selected market observer agent.
2: Output: The revised RL policy π* and (possibly) updated market observer agent.
3: Initialise the RL policy π_0, the market observer agent and memory tuples D̂ and M̂.
4: for k = 1 to MaxEpisode do
5:   Reset the trading environment and set the initial actions a^Final_0 and a^RL_0.
6:   for t = 1 to T do
7:     Observe the current market state o_t
8:     Calculate the reward r_{t−1} by a^Final_{t−1}
9:     Store tuple (o_{t−1}, a^Final_{t−1}, a^RL_{t−1}, o_t, r_{t−1}) in D̂
10:    Store tuple (o_{t−1}, o_t, σ_{s,t−1}, v_{m,t−1}) in M̂
11:    Invoke the market observer agent to compute the suggested risk boundary σ_{s,t} and market vector v_{m,t} as the additional feedback for updating both the RL-based and solver-based agents
12:    Invoke the RL-based agent π_t to generate the current action a^RL_t as portfolio weights
13:    Invoke the solver-based agent to generate the adjusted action a^Ctrl_t
14:    Adjust the current portfolio by a^Final_t = a^RL_t + a^Ctrl_t
15:    Execute the portfolio order with a^Final_t
16:    if the RL policy update condition is triggered then
17:      Update the RL policy π by learning the historical trading data from D̂
18:    end if
19:    if the predefined update condition of the market observer agent is triggered then
20:      Update the market observer agent by learning the historical profile M̂
21:    end if
22:  end for
23: end for
24: return the best RL policy π* and the possibly updated market observer agent
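Lines 13–15 of Algorithm 1 combine the two cooperating agents' actions into the executed portfolio. A minimal sketch of that combination step (the clip-and-renormalise step and the toy risk adjustment below are illustrative choices of ours, not the paper's quadratic-programming formulation of the solver-based agent):

```python
import numpy as np

def combine_actions(a_rl, a_ctrl):
    # Algorithm 1, line 14: a_final = a_rl + a_ctrl, then clipped and
    # renormalised so the executed portfolio stays on the simplex
    # (non-negative weights summing to one, as in Eq. (2)).
    a = np.asarray(a_rl, dtype=float) + np.asarray(a_ctrl, dtype=float)
    a = np.clip(a, 0.0, None)
    return a / a.sum()

def toy_risk_adjustment(a_rl, risk_boundary, cov):
    # Illustrative stand-in for the solver-based agent: if the strategy
    # risk of a_rl exceeds the suggested risk boundary sigma_s, take a
    # half-step toward the equally weighted portfolio; otherwise leave
    # the RL action unchanged.
    a_rl = np.asarray(a_rl, dtype=float)
    risk = float(np.sqrt(a_rl @ cov @ a_rl))
    if risk <= risk_boundary:
        return np.zeros_like(a_rl)
    uniform = np.full_like(a_rl, 1.0 / len(a_rl))
    return 0.5 * (uniform - a_rl)
```

Here `combine_actions` keeps the constraint of Equation (2) satisfied regardless of the adjustment returned by the solver-based agent.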
The market observer agent will then be invoked to compute the suggested risk boundary σ_{s,t} and market vector v_{m,t} as some additional feedback for updating both the RL-based and solver-based agents on the latest market conditions. As aforementioned, to maintain the flexibility and self-adaptivity of the proposed MASA framework, there can be various approaches for the market observer, including the algorithmic approach such as the directional changes [2, 38], or deep neural networks such as the MLP or other DL approaches [15, 41], that are considered in more detail in Section 4. More importantly, it should be noted that both the RL-based and solver-based agents are already secured with the current market information as the most valuable feedback obtained from the existing trading environment as shown in Figure 1. The suggested market condition information provided by the market observer agent is used solely as additional information to quickly adapt and enhance the performance of both the RL-based and solver-based agents especially when the latest market conditions are highly volatile. In the worst cases, the suggested market condition produced by the market observer agent can be incorrect 'noise' that misleads the search of both RL-based and solver-based agents toward possibly biased actions in specific trading days. Yet the self-adaptive nature of the reward mechanism of the RL-based agent to adapt from the underlying trading environment in the subsequent iterations of training, and also the auto-corrective learning capability of the intelligent market observer algorithm, will help to ensure that such misleading noises can be effectively and quickly fixed over a longer period of trading to gain more valuable domain knowledge and insights about the underlying market conditions through updating the learning history profile M̂ of the market observer agent. Interestingly, as observed from the empirical evaluation results obtained in Section 4, there can be fairly impressive enhancements in the ultimate performance of both RL-based and solver-based agents even when some relatively simple algorithmic approach based on the directional changes is used to implement the market observer agent in the proposed MASA framework on the challenging data sets of the CSI 300, DJIA and S&P 500 indexes over the past 10 years. Clearly, for a deeper understanding of the ultimate impacts of the suggested information by the market observer agent on the other two intelligent agents, the proposed MASA framework should be applied to more challenging data sets in CF or other application domains for more in-depth and thorough analyses in the future research studies. After the market observer agent is invoked, the RL-based agent will be triggered to generate the current action a^RL_t as portfolio weights that can be further revised by the subsequent solver-based agent after considering its own risk management strategy and the suggested market condition provided by the market observer agent. All in all, through adopting this loosely-coupled and pipelining computational model, the resulting MASA framework will continue to work as a dependable MAS even when any individual agent fails.

[Figure 2: An Illustration of the Guiding Mechanism of the MASA Framework to Gradually Enhance the Constructed Policies of the RL-Based Agent — comparing (1) the single-agent method, (2) the no-guidance method and (3) the MASA framework in moving the policy π_t toward the optimal policy set of the RL-based agent and the policy set of the solver-based agent.]

Figure 2 demonstrates the strengths of the reward-based guiding mechanism adopted by the proposed MASA framework to gradually enhance the various policies constructed by the single-agent RL-based approach, the proposed MASA framework without the reward-based guiding mechanism, and the proposed MASA framework utilising the reward-based guiding mechanism. The single-agent RL-based approach can update the policy π_{t+1} into the relatively more optimal set (i.e., the blue shaded area) by maximising the total returns of portfolios, yet it may possibly neglect the potential risks. On the other hand, the red shaded area of Figure 2 represents the policy set as recommended by the solver-based agent to minimise the potential risks for portfolio management. When working independently, each of the agents cannot combine the best advantages to achieve a more optimal portfolio for both objectives on the overall returns and potential risks. Besides, as shown in Figure 2, when all the 3 proposed agents work together without any intelligent guiding mechanism such as the reward-based guidance, the resulting framework can easily get stuck in specific local minima. Accordingly, through the reward-based guiding mechanism as adopted by the MASA framework to carefully respond to the ever-changing environment, both the RL-based and solver-based agents can iteratively enhance the current investment portfolio with respect to both objectives of the overall returns and potential risks after considering the valuable feedback from the third market observer agent. At the same time, the reward-based guiding mechanism of the MASA framework utilises an entropy-based divergence measure such as the Jensen–Shannon divergence (JSD) [26, 30] for promoting the diversity of the generated action sets as an intelligent and self-adaptive strategy to cater for the highly volatile environments of various financial markets.

The augmented reward function for the RL-based agent is depicted as follows:
$$J(\theta) = \lambda_1 J_r(\theta) + \lambda_2 J_{JS}(\theta), \quad (6)$$
where λ_1 and λ_2 are the learning rates of the return reward J_r(θ) and the action reward J_JS(θ). To maximise the overall returns of the current investment portfolio, J_r(θ) can be computed as the sum of the logarithm of returns as stated in Equation (7):
$$J_r(\theta) = \frac{1}{T} \log\Big( C_0 \prod_{t=1}^{T} r_t \Big) = \frac{1}{T} \Big( \log C_0 + \sum_{t=1}^{T} \log r_t \Big), \quad (7)$$
where C_0 is the initial portfolio value, T is the number of trading days, and r_t = C_t / C_{t−1} is the growth rate of the portfolio at t.
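Since r_t = C_t / C_{t−1}, the product inside Equation (7) telescopes to C_T, so J_r reduces to (1/T) log C_T. A short sketch of Equations (6)–(7) (function names are illustrative, not from the paper):

```python
import math

def return_reward(portfolio_values):
    # Eq. (7): J_r = (1/T) * (log C_0 + sum_t log r_t),
    # with r_t = C_t / C_{t-1} the growth rate of the portfolio at t.
    c0 = portfolio_values[0]
    growth = [portfolio_values[t] / portfolio_values[t - 1]
              for t in range(1, len(portfolio_values))]
    T = len(growth)
    return (math.log(c0) + sum(math.log(r) for r in growth)) / T

def augmented_reward(j_r, j_js, lam1, lam2):
    # Eq. (6): J = lambda_1 * J_r + lambda_2 * J_JS
    return lam1 * j_r + lam2 * j_js
```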
The action-guided reward J_JS(θ) to promote the diversity of the action sets as generated by the proposed MASA framework is defined as below:
$$J_{JS}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} D_{JS}\left( a_t^{RL} \,\|\, a_t^{Final} \right), \quad (8)$$
where a^RL_t is the action generated by the RL-based agent at t, a^Final_t is the adjusted action after considering the actions as recommended by both the RL-based and solver-based agents, and D_JS is the JSD to measure the similarity between a^RL_t and a^Final_t as two probability distributions of the actions generated by the MASA framework.

4 AN EMPIRICAL EVALUATION
Datasets: To demonstrate the effectiveness of the proposed MASA framework in tackling real-world portfolio risk management with conflicting objectives under the mostly uncharted and highly volatile financial market environments, a preliminary prototype of the proposed MASA framework is implemented in Python, and evaluated on a GPU server machine installed with the AMD Ryzen 9 3900X 12-core processor running at 3.8 GHz and two Nvidia RTX 3090 GPU cards. Furthermore, the MASA framework is compared with other methods on three challenging yet representative data sets of the CSI 300, Dow Jones Industrial Average (DJIA) and S&P 500 indexes from September 2013 to August 2023 in which the first five years of data are used to train the model, followed by the subsequent data set of two years to validate the trained model. Lastly, all the validated models of various approaches are evaluated on the data set of the latest three years. The top 10 stocks of each index are selected to construct the investment portfolio in terms of the company capital. In addition, all the involved data sets contain both upward and downward trends of stock prices, and also various patterns of fluctuation for different market conditions so as to avoid any possible bias toward a specific approach under the evaluation.

Comparative Methods: Ten representative methods based on algorithmic or RL approaches are carefully selected to compare against the MASA framework. The Constant Rebalanced Portfolio (CRP)

the balance between the portfolio returns and risks as attained by each approach. All the reported results are averaged over 10 runs.

Performance Analysis: Table 1 reviews the performance of various well-known RL-based approaches against that of the proposed MASA framework using different market observers, with the symbol ↑ denoting the preference for a larger value in the metrics of AR and SR while the symbol ↓ denotes the favour of a smaller value in MDD and Risk. From the results of the CSI 300 data set, the AR of the MASA frameworks is at least 1.5% larger than those of other methods while maintaining the portfolio risks at a relatively low level. In particular, the MASA framework integrated with an MLP-based market observer achieves the highest AR at 8.87% and the highest SR at 0.27, thus demonstrating the higher capability of all the proposed agents in the MASA framework to balance the trade-offs among different objectives.

For the attained results on the DJIA index, the MASA-DC approach utilising the directional changes (DC) method [38] as the market observer agent to estimate the market trends significantly outperforms other baseline models in all metrics. Specifically, the MASA-DC approach attains an AR about 4% higher than that of the single-agent TD3-Profit approach while reducing the maximum possible losses by 3% when compared to the TD3-Profit in terms of MDD. For a clear presentation of the overall results, Figure 3 shows the changes of portfolio values of each approach under evaluation. In addition, Figure 4 shows an interesting example of upward trends in the DJIA market where the MASA-DC approach achieves competitive returns while maintaining the potential risks at a relatively lower level as compared to those of other approaches. On the contrary, when the financial market stays for long periods of downward trends, the portfolio values of those baseline models dramatically decrease as shown in Figure 3. Yet the MASA-DC approach can manage to effectively minimise the losses during such adverse market conditions when compared to the other approaches. Figure 4 reveals such a challenging example of downward trends in which the short-term risk of the involved portfolio can be managed
[5] is the vanilla strategy of equal weighting. The Exponential Gra- well by the MASA-DC approach with less fluctuation even if the
dient (EG) method [14] is based on the follow-the-winner approach market index drops over 10%, thus confirming the effectiveness of
while the Online Moving Average Reversion (OLMAR) [21], Pas- the DC-based market observer agent to timely capture the environ-
sive Aggressive Mean Reversion (PAMR) [24], and Robust Median ment changes as the valuable feedback for the solver-based agent
Reversion (RMR) [16] approaches follow the loser assets during to adjust the actions for balancing multiple objectives. Furthermore,
trading. The Correlation-driven Nonparametric Learning Strategy a similar performance is attained by the MASA approach on the
(CORN) [23] is a heuristic strategy to match historical investment other two indexes for which the corresponding graphs of portfolio
patterns. Moreover, the four latest RL-based portfolio optimisation values and risk comparison can be found in the Appendix for a
approaches are considered. Ensemble of Identical Independent Eval- more detailed investigation.
uators (EIIE) [17] is based on a convolution-based neural network to Table 1 reviews the performance of various approaches on the
extract the features of assets while Portfolio Policy Network (PPN) S&P 500 data set in which the OLMAR obtains a relatively high AR
[46] consists of a recurrent-based and a convolution-based neural due to its loser tracking strategy that may typically invest almost
networks to capture the sequential information and correlations the whole capital in a single asset. Yet such strategy may not be
between assets. Besides, Relation-Aware Transformer (RAT) [42] is able to balance the risk in a portfolio and possibly fail in financial
a transformer-based model to learn the patterns from price series. markets of downward trends. Thus, the OLMAR gets a relatively
Lastly, the TD3 with a profit maximisation strategy (TD3-Profit) higher MDD of 68.52% where it may suffer from huge potential
[10] as the classical RL approach is included for the comparison. risks. Similar to the results obtained on the CSI 300 and DJIA data
Besides, to evaluate the profitability and risk management of sets, both the MASA-MLP and MASA-LSTM approaches obtain
the concerned approaches, four commonly adopted performance the best performance on balancing the returns and potential risks,
metrics including the Annual Return (AR), Maximum Drawdown achieving a SR of around 0.9 and a MDD of 26%.
(MDD), Sharpe Ratio (SR), and short-term portfolio risk (Risk) are Moreover, the Wilcoxon rank-sum test [7] is used to compare
considered. Specifically, the SR is a comprehensive metric to indicate the statistical significance of the MASA framework against the
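To make the JSD-based objective of Eq. (8) concrete, the following minimal Python sketch treats each action as a portfolio weight vector normalised into a probability distribution. The function names and the smoothing constant `eps` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # D_JS(p || q): symmetrised, smoothed KL divergence; with natural log
    # it is non-negative and bounded above by ln(2).
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def j_js(rl_actions, final_actions):
    # J_JS(theta) = -(1/T) * sum_t D_JS(a_t^RL || a_t^Final), as in Eq. (8):
    # identical action distributions give 0, diverging ones a negative value.
    pairs = list(zip(rl_actions, final_actions))
    return -sum(js_divergence(a, b) for a, b in pairs) / len(pairs)
```

Since $D_{JS}$ is non-negative, $J_{JS}$ peaks at 0 exactly when the RL-based agent's actions already coincide with the final risk-adjusted actions, which is why it can serve as an auxiliary reward signal for the RL-based agent.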
Table 1: The Performance of Various Well-Known RL-Based Approaches Against the Proposed MASA Framework on Different
Challenging Data Sets of Financial Indexes
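For reference, the AR, MDD and SR metrics reported in Table 1 are commonly computed from a series of portfolio values as sketched below. The exact definitions used in the paper are not spelled out in this section, so the annualisation convention (252 trading days, zero risk-free rate) is an assumption:

```python
import numpy as np

def annual_return(values, periods_per_year=252):
    # Geometric annualised return from a series of daily portfolio values.
    total = values[-1] / values[0]
    years = (len(values) - 1) / periods_per_year
    return total ** (1.0 / years) - 1.0

def max_drawdown(values):
    # Largest relative peak-to-trough loss over the whole series.
    values = np.asarray(values, float)
    peaks = np.maximum.accumulate(values)
    return float(np.max((peaks - values) / peaks))

def sharpe_ratio(values, risk_free=0.0, periods_per_year=252):
    # Annualised mean excess return over the volatility of daily returns.
    r = np.diff(values) / np.asarray(values[:-1], float)
    excess = r - risk_free / periods_per_year
    return float(np.mean(excess) / np.std(excess) * np.sqrt(periods_per_year))
```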
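The DC method [38] used by the MASA-DC market observer summarises a price series into alternating upward and downward trends: a directional change is confirmed once the price reverses by at least a threshold θ from the running extreme. A simplified sketch of DC event detection follows; the threshold value and the event encoding are illustrative assumptions, not the paper's exact formulation:

```python
def directional_changes(prices, theta=0.02):
    # Return (index, 'up'/'down') confirmation events: an event fires when
    # the price reverses by at least theta (as a fraction) from the running
    # extreme observed since the previous event.
    events = []
    high = low = prices[0]          # running extremes since the last event
    mode = None                     # trend direction; None until confirmed
    for i, p in enumerate(prices[1:], start=1):
        high, low = max(high, p), min(low, p)
        if mode != 'up' and p >= low * (1 + theta):
            events.append((i, 'up'))     # upturn confirmed
            mode, high, low = 'up', p, p
        elif mode != 'down' and p <= high * (1 - theta):
            events.append((i, 'down'))   # downturn confirmed
            mode, high, low = 'down', p, p
    return events
```

The resulting event stream (rather than fixed clock-time samples) is what a DC-based observer would summarise into trend estimates for the solver-based agent.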
Ablation Study: Table 2 shows the results of the ablation study of the proposed MASA framework on the CSI 300 index, in which three variants of the TD3-based models are used for comparison against the proposed MASA framework.

Table 2 shows the effectiveness of the reward of the action generated by the RL-based agent in the proposed MASA framework. The MASA variant that does not consider the reward of the action generated by the RL-based agent still performs better than the single-agent or dual-agent framework in balancing the profits and risks, especially when the MLP or LSTM model is used for the market observer agent. When the rewards of the generated actions are considered, the risk-aware information provided by the solver-based agent can guide the policy of the RL-based agent toward higher profits and less potential risks, for which the MASA framework can enhance the AR by 0.5% and the MDD by 1%.

Furthermore, the top 20 and 30 stocks of each index are selected to study the scalability of the MASA framework on large-scale portfolios, except for the CSI 300 index due to the limited data sources. When constructing a portfolio of 20 assets in the DJIA market, the MASA-MLP achieves the highest SR of 0.80 and the highest AR of 14%, while the best baseline approach obtains a SR of only 0.61 and an AR of only 11%. After increasing the portfolio size to 30 assets, the MASA framework still attains a significant improvement over the other approaches on all metrics. Similar results are obtained by the MASA framework in the S&P 500 market, which can be found in the Appendix. Undoubtedly, all the obtained results validate that the MASA framework can achieve a better performance in balancing different goals when the problem size increases.

Figure 5: The Contribution of Solver-based Agents on Different Market States (figure omitted; x-axis: DJIA Index Changes (%); legend: Market Trends)

5 CONCLUSION

In this work, a multi-agent and self-adaptive framework, namely the MASA, is proposed, in which a sophisticated multi-agent RL approach is adopted to balance the trade-off between the overall portfolio returns and their potential risks. In addition, a very flexible and proactive agent, namely the market observer, is integrated into the proposed MASA framework to provide the estimated market conditions and trends as additional information for the multi-agent RL approach to carefully consider, so as to quickly adapt to the ever-changing market conditions.

To demonstrate the potential advantages of our proposal, a prototype of the proposed MASA framework is evaluated against various well-known RL-based approaches on the challenging data sets of the CSI 300, Dow Jones Industrial Average and S&P 500 indexes over the past 10 years. The obtained empirical results clearly reveal the remarkable performance of our proposed MASA framework based on the multi-agent RL approach when compared against those of other well-known RL-based approaches on the three data sets of widely recognised financial indexes in China and the United States. More importantly, our proposed MASA framework sheds light on many possible directions for future investigation. First, a thorough investigation of using different meta-heuristic-based optimisers, such as evolutionary algorithms or the PSO, for the solver-based agent should be interesting. Besides, experimenting with various intelligent approaches for the market observer agent is worth exploring. Last but not least, the potential applications of the proposed MASA model to various resource allocation, planning or disaster recovery problems, in which risk management is critical and time-sensitive, should be very valuable for our future studies.

ACKNOWLEDGMENTS
The authors wish to express our deepest gratitude to Professor Edward Tsang for his fruitful discussion on this work, and also to the anonymous reviewers for their valuable feedback.

REFERENCES
[1] Rodrigo Alfaro and Andres Sagner. 2010. Financial Forecast for the Relative Strength Index. (2010).
[2] Amer Bakhach, Edward PK Tsang, and Hamid Jalalian. 2016. Forecasting directional changes in the FX markets. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 1–8.
[3] Peter Belcak, Jan-Peter Calliess, and Stefan Zohren. 2022. Fast Agent-Based Simulation Framework with Applications to Reinforcement Learning and the Study of Trading Latency Effects. In Multi-Agent-Based Simulation XXII, Koen H. Van Dam and Nicolas Verstaevel (Eds.). Springer International Publishing, Cham, 42–56.
[4] Jan-Peter Calliess and Stefan Zohren. 2021. Agent-Based Models in Finance and Market Simulations. (2021). https://oxford-man.ox.ac.uk/projects/agent-based-models-in-finance-and-market-simulations/
[5] Thomas M Cover. 1991. Universal Portfolios. Mathematical Finance (1991).
[6] Nigel Cuschieri, Vince Vella, and Josef Bajada. 2021. TD3-Based Ensemble Reinforcement Learning for Financial Portfolio Optimisation. FinPlan 2021 (2021), 6.
[7] Joaquín Derrac, Salvador García, Daniel Molina, and Francisco Herrera. 2011. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation 1, 1 (2011), 3–18.
[8] Rafał Dreżewski and Krzysztof Doroz. 2017. An agent-based co-evolutionary multi-objective algorithm for portfolio optimization. Symmetry 9, 9 (2017), 168.
[9] Yitong Duan, Lei Wang, Qizhong Zhang, and Jian Li. 2022. FactorVAE: A Probabilistic Dynamic Factor Model Based on Variational Autoencoder for Predicting Cross-sectional Stock Returns. In Proceedings of the AAAI Conference on Artificial Intelligence.
[10] Scott Fujimoto, Herke Hoof, and David Meger. 2018. Addressing Function Approximation Error in Actor-critic Methods. In Proceedings of the International Conference on Machine Learning. PMLR.
[11] John W Goodell, Satish Kumar, Weng Marc Lim, and Debidutta Pattnaik. 2021. Artificial intelligence and machine learning in finance: Identifying foundations, themes, and research clusters from bibliometric analysis. Journal of Behavioral and Experimental Finance 32 (2021), 100577.
[12] Abhishek Gunjan and Siddhartha Bhattacharyya. 2023. A brief review of portfolio optimization techniques. Artificial Intelligence Review 56, 5 (2023), 3847–3886.
[13] Reza Hafezi, Jamal Shahrabi, and Esmaeil Hadavandi. 2015. A bat-neural network multi-agent system (BNNMAS) for stock price prediction: Case study of DAX stock price. Applied Soft Computing 29 (2015), 196–210.
[14] David P Helmbold, Robert E Schapire, Yoram Singer, and Manfred K Warmuth. 1998. On-line Portfolio Selection Using Multiplicative Updates. Mathematical Finance (1998).
[15] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[16] Dingjiang Huang, Junlong Zhou, Bin Li, Steven CH Hoi, and Shuigeng Zhou. 2016. Robust median reversion strategy for online portfolio selection. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2480–2493.
[17] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. 2017. A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv preprint arXiv:1706.10059 (2017).
[18] Michael Kampouridis, Panagiotis Kanellopoulos, Maria Kyropoulou, Themistoklis Melissourgos, and Alexandros A. Voudouris. 2022. Multi-agent systems for computational economics and finance. AI Communications 35, 4 (2022), 369–380. https://doi.org/10.3233/aic-220117
[19] James Kennedy and Russell Eberhart. 1995. Particle swarm optimization. In Proceedings of ICNN'95 - International Conference on Neural Networks, Vol. 4. IEEE, 1942–1948.
[20] Petter N Kolm and Gordon Ritter. 2020. Modern perspectives on reinforcement learning in finance. The Journal of Machine Learning in Finance 1, 1 (2020).
[21] Bin Li and Steven CH Hoi. 2012. On-Line Portfolio Selection with Moving Average Reversion. In Proceedings of the International Conference on Machine Learning. PMLR.
[22] Bin Li and Steven CH Hoi. 2014. Online Portfolio Selection: A Survey. ACM Computing Surveys (CSUR) (2014).
[23] Bin Li, Steven CH Hoi, and Vivekanand Gopalkrishnan. 2011. CORN: Correlation-driven Nonparametric Learning Approach for Portfolio Selection. ACM Transactions on Intelligent Systems and Technology (2011).
[24] Bin Li, Peilin Zhao, Steven CH Hoi, and Vivekanand Gopalkrishnan. 2012. PAMR: Passive Aggressive Mean Reversion Strategy for Portfolio Selection. Machine Learning (2012).
[25] Qianqiao Liang, Mengying Zhu, Xiaolin Zheng, and Yan Wang. 2021. An Adaptive News-Driven Method for CVaR-sensitive Online Portfolio Selection in Non-Stationary Financial Markets. In Proceedings of the IJCAI Conference on Artificial Intelligence.
[26] J. Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37, 1 (1991), 145–151. https://doi.org/10.1109/18.61115
[27] Yang Liu, Qi Liu, Hongke Zhao, Zhen Pan, and Chuanren Liu. 2020. Adaptive Quantitative Trading: An Imitative Deep Reinforcement Learning Approach. In Proceedings of the AAAI Conference on Artificial Intelligence.
[28] Costis Maglaras, Ciamac C Moallemi, and Muye Wang. 2022. A deep learning approach to estimating fill probabilities in a limit order book. Quantitative Finance 22, 11 (2022), 1989–2003.
[29] Harry M Markowitz and G Peter Todd. 2000. Mean-variance Analysis in Portfolio Choice and Capital Markets. John Wiley & Sons.
[30] M.L. Menéndez, J.A. Pardo, L. Pardo, and M.C. Pardo. 1997. The Jensen-Shannon divergence. Journal of the Franklin Institute 334, 2 (1997), 307–318. https://doi.org/10.1016/S0016-0032(96)00063-4
[31] Mojtaba Nabipour, Pooyan Nayyeri, Hamed Jabani, S Shahab, and Amir Mosavi. 2020. Predicting stock market trends using machine learning and deep learning algorithms via continuous and binary data; a comparative analysis. IEEE Access 8 (2020), 150199–150212.
[32] Martin L Puterman. 1990. Markov decision processes. Handbooks in Operations Research and Management Science 2 (1990), 331–434.
[33] Aistis Raudys, Vaidotas Lenčiauskas, and Edmundas Malčius. 2013. Moving averages for financial data smoothing. In Information and Software Technologies: 19th International Conference, ICIST 2013, Kaunas, Lithuania, October 2013. Proceedings 19. Springer, 34–45.
[34] Bartosz Sawik. 2012. Bi-criteria portfolio optimization models with percentile and symmetric risk measures by mathematical programming. Przeglad Elektrotechniczny 88, 10B (2012), 176–180.
[35] M Sivaram, E Laxmi Lydia, Irina V Pustokhina, Denis Alexandrovich Pustokhin, Mohamed Elhoseny, Gyanendra Prasad Joshi, and K Shankar. 2020. An optimal least square support vector machine based earnings prediction of blockchain financial products. IEEE Access 8 (2020), 120321–120330.
[36] Rainer Storn and Kenneth Price. 1997. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 4 (1997), 341–359.
[37] Edward PK Tsang. 2023. AI for Finance. CRC Press.
[38] Edward PK Tsang, Ran Tao, Antoaneta Serguieva, and Shuai Ma. 2017. Profiling high-frequency equity price movements in directional changes. Quantitative Finance 17, 2 (2017), 217–225.
[39] Heyuan Wang, Shun Li, Tengjiao Wang, and Jiayi Zheng. 2021. Hierarchical Adaptive Temporal-Relational Modeling for Stock Trend Prediction. In Proceedings of the IJCAI Conference on Artificial Intelligence.
[40] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021. 1785–1797.
[41] Zhicheng Wang, Biwei Huang, Shikui Tu, Kun Zhang, and Lei Xu. 2021. DeepTrader: a deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 643–650.
[42] Ke Xu, Yifan Zhang, Deheng Ye, Peilin Zhao, and Mingkui Tan. 2021. Relation-aware Transformer for Portfolio Policy Learning. In Proceedings of the IJCAI Conference on Artificial Intelligence.
[43] Mengyuan Yang, Xiaolin Zheng, Qianqiao Liang, Bing Han, and Mengying Zhu. 2022. A Smart Trader for Portfolio Management based on Normalizing Flows. In Proceedings of the IJCAI Conference on Artificial Intelligence.
[44] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo Li. 2020. Reinforcement-learning Based Portfolio Management with Augmented Asset Movement Prediction States. In Proceedings of the AAAI Conference on Artificial Intelligence.
[45] Fengjiao Zhang, Jie Li, and Zhi Li. 2020. A TD3-based multi-agent deep reinforcement learning method in mixed cooperation-competition environment. Neurocomputing 411 (2020), 206–215. https://doi.org/10.1016/j.neucom.2020.05.097
[46] Yifan Zhang, Peilin Zhao, Qingyao Wu, Bin Li, Junzhou Huang, and Mingkui Tan. 2020. Cost-sensitive Portfolio Selection via Deep Reinforcement Learning. IEEE Transactions on Knowledge and Data Engineering (2020).
[47] Xiao-lin Zheng, Meng-ying Zhu, Qi-bing Li, Chao-chao Chen, and Yan-chao Tan. 2019. FinBrain: when finance meets AI 2.0. Frontiers of Information Technology & Electronic Engineering 20, 7 (2019), 914–924.