**30** **Learning Wealth Flow Matrices**

JIANFEI YIN,College of Computer Science and Software Engineering and National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University

RUILI WANG,School of Natural and Computational Sciences, Massey University

YEQING GUO,Tisson Regaltc Communication Technology

YIZHE BAI, SHUNDA JU, WEILI LIU, and JOSHUA ZHEXUE HUANG,Shenzhen University

This article proposes a deep learning solution to the online portfolio selection problem based on learning a latent structure directly from a price time series. It introduces a novel wealth flow matrix for representing a latent structure that has special regular conditions to encode the knowledge about the relative strengths of assets in portfolios. Therefore, a wealth flow model (WFM) is proposed to learn wealth flow matrices and maximize portfolio wealth simultaneously. Compared with existing approaches, our work has several distinctive benefits: (1) the learning of wealth flow matrices makes our model more generalizable than models that only predict wealth proportion vectors, and (2) the exploitation of wealth flow matrices and the exploration of wealth growth are integrated into our deep reinforcement learning algorithm for the WFM. These benefits, in combination, lead to a highly effective approach that generates reasonable investment behavior, including short-term trend following, the following of a few losers, no self-investment, and sparse portfolios. Extensive experiments on five benchmark datasets from real-world stock markets confirm the theoretical advantage of the WFM, which achieves Pareto improvements over the state-of-the-art algorithms in terms of multiple performance indicators and the steady growth of wealth.

CCS Concepts: • **Theory of computation** → **Online learning algorithms**; • **Computing methodologies** → **Sequential decision making**; • **Applied computing** → *Economics*;

Additional Key Words and Phrases: Online portfolio selection, wealth flow matrix, deep reinforcement learning, regret bound

We thank the editors and reviewers for their expert comments and constructive suggestions, and the support of the Shenzhen Basic Research Fund (Grant No: JCYJ20200813091134001) and National Natural Science Foundation of China (Grant No: 61972261).

Authors’ addresses: J. Yin, College of Computer Science and Software Engineering and National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Nanshan District, Shenzhen 518060, China; email: [email protected]; R. Wang (corresponding author), School of Natural and Computational Sciences, Massey University, Auckland 102904, New Zealand; email: [email protected]; Y. Guo, Tisson Regaltc Communication Technology, Guangzhou 510623, China; email: [email protected]; Y. Bai, S. Ju, W. Liu, and J. Z. Huang, Shenzhen University, Shenzhen 518060, China; emails: {baiyizhe2017, jushunda2017}@email.szu.edu.cn, {liuwl, zx.huang}@szu.edu.cn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.

Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2021 Association for Computing Machinery.

1556-4681/2021/08-ART30 $15.00 https://doi.org/10.1145/3464308

ACM Transactions on Knowledge Discovery from Data, Vol. 16, No. 2, Article 30. Publication date: August 2021.

**ACM Reference format:**

Jianfei Yin, Ruili Wang, Yeqing Guo, Yizhe Bai, Shunda Ju, Weili Liu, and Joshua Zhexue Huang. 2021. Wealth Flow Model: Online Portfolio Selection Based on Learning Wealth Flow Matrices. *ACM Trans. Knowl. Discov. Data* 16, 2, Article 30 (August 2021), 27 pages. https://doi.org/10.1145/3464308

**1** **INTRODUCTION**

In this article, we propose a deep learning solution to the online portfolio selection problem based on learning a latent structure directly from price time series. Online portfolio selection [2, 7, 10, 13, 15, 46] is about how to rebalance wealth proportion vectors **x**_t among the *N* assets of a portfolio during each trading period *t* ∈ [1, *T*] to maximize the wealth of the portfolio at the end of period *T*. Many studies have demonstrated that markets have short-term trends, such as mean reversion [56, 67] and trend following [26]. Therefore, when we make rebalancing decisions, the short-term market trends need to be naturally embedded in a latent structure that defines different capital preferences for the various assets of portfolios.

Since learning latent structures is an effective way to capture structural knowledge [18, 24], we propose a wealth flow matrix **Y**_t for representing the latent structure; it encodes knowledge about the relative strengths of assets in portfolios. Wealth flow matrices **Y**_t can be learned directly from the price time series by our algorithm. The *i*th row of **Y**_t represents the output wealth proportions of asset *i* relative to the other assets *j* ≠ *i*. The sum of each row vector of **Y**_t is required to be one to meet the self-financing property of portfolios [37]. Given a wealth flow matrix **Y**_t, we can obtain the wealth proportion vector **x**_t from **Y**_t by using simple linear operations, but not vice versa. As such, a portfolio selection model that can predict wealth flow matrices is more generalizable than models that can only predict wealth proportion vectors.
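The simple linear operations mentioned here are made precise later by the function *h* in Equation (5); as a minimal numerical sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def wealth_proportions_from_flow(x_t, Y_t):
    """Sketch of the paper's linear map h (Eq. (5)): add the wealth flowing
    into each asset (Y^T x) and subtract the wealth flowing out (x * (Y 1))."""
    inflow = Y_t.T @ x_t               # wealth received by each asset
    outflow = x_t * Y_t.sum(axis=1)    # wealth sent out by each asset
    return x_t + inflow - outflow

x_t = np.array([0.5, 0.3, 0.2])
Y_t = np.array([[0.0, 0.5, 0.5],
                [0.2, 0.0, 0.8],
                [0.3, 0.7, 0.0]])      # zero diagonal, rows sum to one
x_next = wealth_proportions_from_flow(x_t, Y_t)
print(x_next)                          # new proportions, still on the simplex
```

Because each row of **Y**_t sums to one, the outflow equals **x**_t itself, so the output stays on the simplex; the reverse map from **x**_t to **Y**_t is underdetermined, which is the sense in which **Y**_t carries more information.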

Although there exist portfolio selection algorithms that can represent the latent structures of portfolios, such as the stability and sparsity of wealth proportions [17, 34, 67], directly from price data, they only treat these latent structures as regular conditions in their objective functions for limiting the search space of candidate solutions (i.e., wealth proportions). These latent structures are implicitly presented in the solutions. That is, their solutions cannot learn these latent structures from a dataset *D*1 and then reuse them in another dataset *D*2 without searching *D*2 again. Also, these latent structures cannot represent the relative strength relationship between any pair of assets in portfolios. There exist a few studies on learning useful signals from out-of-band sources, such as the use of news [60, 61] to generate wealth proportions from neural networks.

The signals found by these methods have longer delays than the signals provided by our wealth flow matrices **Y**_t, since **Y**_t are learned directly from price time series.

With the concept of the wealth flow matrix in mind, we build a **wealth flow model** (**WFM**) to learn wealth flow matrices and maximize portfolio wealth simultaneously. First, we define a linear function *h* to bridge the wealth flow matrices **Y**_t and wealth proportion vectors **x**_t. Since all **Y**_t represent probabilistic latent structures of portfolios, based on the variational Bayesian method [6], we use the Kullback–Leibler divergence [54] to measure the similarity between the predictive and posterior distributions of **Y**_t so that **Y**_t can be learned online from samples. Second, to address both the exploitation of wealth flow matrices and the exploration of wealth growth, we investigate the regular conditions for updating **x**_t when updating directions **Z**_t for **Y**_t are given. We find that the regular conditions are ϕ_{t+1} = 1/(*H* *t*) and (∇_t − **m**_t)^⊤(**x**_t − **x**^∗) ≥ 0, where ϕ_{t+1} are the learning rates for **x**_t, **m**_t are the updating directions for **x**_t given **Z**_t, and *H* is some scalar constant. The regular conditions are derived by letting **x**_t chase a regret bound under the online convex optimization framework [21]. The regular conditions are later integrated into our deep reinforcement learning algorithm for the WFM.

To learn wealth flow matrices effectively from high-dimensional and massive price time series, we design a neural network to implement the WFM. The neural network of the WFM is a recurrent neural network that uses long short-term memory [23] neural cells and runs on graphics processing units. We strengthen the neural cells by designing an attention mechanism based on prior work [57, 59]. Therefore, the correlations of hidden features can be learned automatically to improve the prediction accuracy with respect to wealth flow matrices. Furthermore, we design a deep reinforcement learning algorithm to train the neural network of the WFM online. The algorithm estimates future losses by using a replay buffer [32, 48] without using the recursive estimation method used by Q-learning [20, 28] and the policy gradient [42, 43], because we have a precise definition of the loss at each time step, instead of the uncertain rewards defined by the Q-value and action-value functions [28, 48].

In summary, we make the following main contributions in this article:

— We propose a novel wealth flow matrix to represent latent structures of portfolios. The wealth flow matrix has special regular conditions for encoding knowledge about the relative strengths of assets in portfolios. A model that can predict wealth flow matrices is more generalizable than models that can only predict wealth proportion vectors. The wealth flow matrix can be considered as side information [16, 65], but it can be learned directly from the given price time series without using out-of-band sources.

— We propose a novel portfolio selection model called the WFM, which learns wealth flow matrices and maximizes portfolio wealth simultaneously. The learning of wealth flow matrices is based on the principle of the variational Bayesian method [6]. The Kullback–Leibler divergence [54] is used to measure the similarities between the predictive and posterior distributions of wealth flow matrices. We investigate the conditions for updating the wealth proportion vectors when the updating directions for the wealth flow matrices are given. These conditions are later integrated into our deep reinforcement learning algorithm for the WFM so that the exploitation of wealth flow matrices and the exploration of wealth growth are integrated together in our training algorithm.

— We design a neural network to implement the WFM and a deep reinforcement learning algorithm to train the neural network. The neural network of the WFM is a recurrent neural network in which the cells are built with an attention mechanism. The deep reinforcement learning algorithm uses a replay buffer [32, 48] to estimate the loss at every time step without recursively estimating the Q-value [20, 28] and action-value [42, 43] functions.

The rest of this article is organized as follows. Section 2 reviews related work regarding online portfolio selection. Section 3 gives the preliminaries and notations used throughout this article. Section 4 defines the proposed WFM for online portfolio selection. Section 5 presents the neural network for the WFM and the deep reinforcement learning algorithm. Section 6 presents a behavior analysis of the WFM neural network and some performance comparison experiments. Finally, we present our conclusion and future work in Section 7.

**2** **RELATED WORK**

Online portfolio selection is a special case of an online optimization problem [4, 14, 63], in which an agent needs to make periodic decisions based on partially observable streaming data, such as financial time series. Cover [15] proposed the universal portfolio algorithm, which has a good sublinear regret bound *O*(*N* log *T*) with respect to the best constant rebalanced portfolio in hindsight. The universal portfolio algorithm belongs to a class of sampling-based algorithms [4]. Blum et al. [7] provided a proof of the regret bound of the universal portfolio algorithm based on *α*-parameterized integration. Following the logic of the proof given by Blum et al. [7], one can show that if an algorithm A gives any proportion of its wealth to its sub-algorithms that mimic the policy of the universal portfolio algorithm, then algorithm A obtains a sublinear regret bound [9, 65]. As such, sampling-based algorithms provide reasonable explorations of wealth growth in the space of wealth proportion vectors.

To improve the time complexity of the universal portfolio algorithm while keeping sublinear regret bounds, some other algorithms were proposed. The exponentiated gradient algorithm [22] is a non-sampling-based algorithm that has a regret bound *O*(√(*N*/*T*)). The algorithm can be formalized as a special case of the multiplicative weight update method [4]. According to their method, if an algorithm A computes the probability of choosing portfolio decisions by the product of weights, algorithm A can have a sublinear regret bound. Agarwal et al. [2] proposed the online Newton step (ONS) algorithm, which has a bound *O*(*GDN* log *T*) obtained by computing the inverse of the sum of matrices. Thus, the ONS algorithm exhibits poor scalability when the number of stocks *N* is large. Luo et al. [46] designed an efficient portfolio selection algorithm based on the online mirror descent framework [51].

The regret bounds of many algorithms [2, 15, 22, 46] are estimated in the sense of asymptotic behavior. These algorithms are distribution independent and do not consider the short-term patterns of inputs [31], such as mean reversion [56] and the sparse weights of assets [17, 34, 67]. As tested on several benchmark datasets [9, 15, 38], these algorithms perform poorly,^1 compared with some other practical approaches [9, 36]. These practical algorithms are not proven to have sublinear regret bounds but perform well on benchmark datasets. For example, Borodin et al. [9] proposed the ANTICOR algorithm, which predicts wealth proportions by calculating the sample correlation coefficients between each pair of stocks in two adjacent time windows. Li et al. [36] proposed the online portfolio selection with moving average reversion algorithm, which predicts wealth proportions based on a moving average estimation of the relative price vector **r**_{t+1} at the next trading period. Many online portfolio selection algorithms [10, 13, 17, 34, 67] have been proposed using different prior structures, and these have achieved good performances on benchmark datasets.

If users are most concerned about the intermediate losses over the contract life of a portfolio, they can adopt appropriate risk control strategies, such as introducing the indicator maximum drawdown into the optimization target [49], executing long-and-short-term risk control [5], optimizing for transaction costs [40], and conducting market sentiment analysis [60, 62]. If trading costs are so high as to restrict high-frequency trading, timing of trading is one of the most effective trading methods [25, 66]. The basic idea of the method is to separate the given market sequence into sub-sequences and run an online portfolio selection algorithm A for each sub-sequence. Kozat et al. [33] proved that this method still possesses a sublinear regret bound if the algorithm A has a sublinear regret bound. The mean variance analysis approach [12] uses the covariance of rates of returns to measure the risk of the given portfolio. Xing et al. [61] proposed an algorithm using market sentiment views [11, 47] to estimate the mean vector and covariance matrix in the Black–Litterman model so that the wealth proportions from the first-order condition of the mean variance analysis can be computed.

Many regret bound-based online algorithms [2, 15, 22, 46] and other high-performance algorithms [9, 34, 36] design handcrafted features [50] for making decisions and use a minimal amount of historical data, usually only one to several data points, so that algorithms with limited computing power can make quick decisions when a new batch of data arrives. If the reality is that we only

^1 The test can be verified by using the open source codes at https://github.com/Marigold/universal-portfolios.

Fig. 1. Temporal relationship of variables in online portfolio selection. Trading time *t*1 is generally the closing time in period *t*. The portfolio wealth *w*_{t+1} is evaluated immediately after trading assets at time *t*1. The *i*th entry of the relative price vector **r**_{t+1} is the ratio of the *i*th asset's price at time (*t*+1)1 to the price at time *t*1.

need to rebalance assets on a day-to-day basis (since relative price vectors are given daily); we need to manage a large number of assets, for example, *N* > 1000; the market has years of long-term memory [29]; the short-term behavior of the market does not meet the assumptions of the Markov process; or we do have sufficient computing power (e.g., in terms of graphics processing units), then the performances of these algorithms will suffer from using too little data [3].

Online portfolio selection can be considered as an online game. Some **deep reinforcement learning** (**DRL**)-based methods [27, 28, 30, 42, 52] have been proposed for online trading and online portfolio selection. DRL is an integrated framework that combines high-dimensional feature learning and decision making. DRL uses graphics processing units to optimize complex functions built by neural networks, and it has been successful in many related areas, such as online video games [48] and the complex game Go [58]. To successfully apply DRL to solve the online portfolio selection problem, we need to carefully design reward functions that match the problem definitions to make the regrets of the obtained solutions bounded [27, 31, 52]. Q-learning [48] and policy gradient-based methods [42, 43] may be helpful for designing reward functions.

**3** **PRELIMINARIES AND NOTATIONS**

In this section, we present the basic definition of the rebalancing process for a portfolio so that we can understand the objective functions defined in the following sections. The notations used in this article are also given in this section for reference.

**3.1** **Wealth Dynamic Model of Online Portfolio Selection**

A portfolio has *N* ≥ 1 tradable assets, which can be risk-free or risky assets. There are *T* periods for trading assets. At the end of each period *t*, *t* ∈ [1, *T*), we can trade the *N* assets so that the portfolio has a new wealth proportion **x**_{t+1} ∈ Δ^*N*, where Δ^*N* ≝ {**x** : **x** ∈ R_+^*N*, **x**^⊤**1** = 1}. For example, we sell and purchase assets at time *t*1 during the period *t* to make the wealth of the *i*th asset equal to *w*_{t+1} *x*_{t+1,i} for *i* ∈ [1, *N*], where *w*_{t+1} is the portfolio wealth just after trading. The temporal relationship of portfolio selection is illustrated in Figure 1. Since the prices of the assets have changed from *t*1 to (*t*+1)1, the portfolio wealth *w*_{t+1} and its wealth proportion **x**_{t+1} will be changed implicitly by the market to

*w*′_{t+1} = *w*_{t+1} **x**_{t+1}^⊤ **r**_{t+1},  **x**′_{t+1} = (**x**_{t+1} ∘ **r**_{t+1}) / (**x**_{t+1}^⊤ **r**_{t+1}), (1)

where ∘ denotes the entry-wise product of two vectors. Note that rebalancing the portfolio at time *t*1 will shrink the portfolio wealth from *w*′_t to

*w*_{t+1} = *μ*_{t+1} *w*′_t, (2)

by a factor *μ*_{t+1} ∈ (0, 1) if trading costs are considered. The factor *μ*_{t+1} satisfies a balance equation under the self-financing assumption [19]:

∑_{i=1}^{N} (1 − *cs*_i)(*x*′_{t,i} − *μ*_{t+1} *x*_{t+1,i})^+ = ∑_{i=1}^{N} (1 + *cp*_i)(*μ*_{t+1} *x*_{t+1,i} − *x*′_{t,i})^+, (3)

where *cp*_i and *cs*_i are the commission ratios of purchasing and selling assets, respectively. Typically, we set *cp*_i = *cs*_i = 0 for trading risk-free assets and *cp*_i = *cs*_i = *c* for trading risky assets. (*y*)^+ = *y* if *y* ≥ 0; otherwise, (*y*)^+ = 0. The factor *μ*_{t+1} can be solved from the following iterative equation:

*μ*_{t+1} = 1 − *c*_s ∑_{i=2}^{N} (*x*′_{t,i} − *μ*_{t+1} *x*_{t+1,i})^+ − *c*_p ∑_{i=2}^{N} (*μ*_{t+1} *x*_{t+1,i} − *x*′_{t,i})^+,

where the initial value of *μ*_{t+1} is (1 − *c*_s)/(1 + *c*_p). By applying Equation (1) and Equation (2) recursively on *w*′_T, we know that the cumulative wealth of the portfolio at the end of period *T* will be *w*′_T = *w*′_1 ∏_{t=2}^{T} *μ*_t **x**_t^⊤ **r**_t, where *w*′_1 is the initial wealth. The goal of portfolio selection is to maximize the cumulative wealth *w*′_T at the end of period *T* under random relative price vectors **r**_t.
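The dynamics above can be checked numerically. The sketch below is our own illustrative code: it applies the market update of Equation (1) and then solves the iterative equation for *μ*_{t+1} by fixed-point iteration, with asset 1 (index 0) taken as the risk-free asset, matching the sums that start at *i* = 2.

```python
import numpy as np

def market_update(w, x, r):
    """Eq. (1): portfolio wealth and proportions after prices move by r."""
    growth = x @ r                     # x^T r
    return w * growth, (x * r) / growth

def solve_mu(x_end, x_new, cp, cs, iters=100):
    """Fixed-point iteration for the transaction-cost factor mu_{t+1};
    asset 1 (index 0) is risk-free, so the sums run over i = 2..N."""
    mu = (1 - cs) / (1 + cp)           # initial value given in the text
    for _ in range(iters):
        sold = np.maximum(x_end[1:] - mu * x_new[1:], 0.0).sum()
        bought = np.maximum(mu * x_new[1:] - x_end[1:], 0.0).sum()
        mu = 1 - cs * sold - cp * bought
    return mu

w, x = 1.0, np.array([0.2, 0.5, 0.3])
r = np.array([1.0, 1.02, 0.97])        # relative prices over the period
w_end, x_end = market_update(w, x, r)
x_new = np.array([0.1, 0.6, 0.3])      # next period's target proportions
mu = solve_mu(x_end, x_new, cp=0.001, cs=0.001)
w_new = mu * w_end                     # Eq. (2): wealth after rebalancing
```

With small commission ratios, *μ*_{t+1} stays slightly below one, so each rebalancing shrinks wealth by a small factor, exactly as Equation (2) states.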

Hereafter, we assume that the relative price *r*_i of any single asset *i* in a portfolio is bounded by 0 < *L*_r ≤ *r*_i ≤ *U*_r, where *L*_r and *U*_r are some scalar constants. For the sake of readability, unless necessary, we suppress the suffix *t* for the time index when the meaning of time-dependent variables, e.g., **x**, **Y**, can be understood from the context. We use the notation **y** to represent the vectorization of the wealth flow matrix **Y**. The notations used in this article are summarized in Table 1.

**4** **OPTIMIZATION MODEL WITH WEALTH FLOW**

This section presents an optimization model, i.e., the WFM, for online portfolio selection. The model is based on the proposed prior structure of the wealth flow matrix. Several gradient-based properties of the WFM are investigated to enable the learning of wealth flow matrices and the maximization of portfolio wealth simultaneously.

**4.1** **Wealth Flow Matrix**

Numerous studies support the existence of different short-term market trends [26, 56]. That is, before we make rebalancing decisions, the short-term market trends need to be naturally embedded in a latent structure that defines different capital preferences for the assets of portfolios. Based on this observation, we propose a wealth flow matrix to represent such latent structures. Each wealth flow matrix **Y** encodes knowledge about the relative strengths of the assets in the context of a portfolio.

**Y** is an asymmetric, nonnegative matrix. An example of **Y** is given as follows:

    ⎡ 0    0.5  0.5 ⎤
    ⎢ 0.2  0    0.8 ⎥
    ⎣ 0.3  0.7  0   ⎦

The example has *N* = 3 assets. The *i*th row vector represents the proportions of wealth that flow from asset *i* to the other assets *j* ≠ *i*. For example, *y*_{1,2} and *y*_{1,3} are the proportions of wealth that flow from asset *i* = 1 to asset *j* = 2 and asset *j* = 3, respectively. We assume that there is no cache of wealth remaining in asset *i*'s virtual account when asset *i*'s wealth flows out to the virtual accounts of other assets. Thus, the diagonal elements of a wealth flow matrix are zeros, and the sum of each row is one.
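The two structural constraints (zero diagonal, unit row sums) are easy to enforce programmatically; the helper below is our own illustration, not part of the paper's algorithm:

```python
import numpy as np

def project_to_wealth_flow(M):
    """Turn a matrix with positive off-diagonal entries into a valid wealth
    flow matrix: zero the diagonal (no self-investment) and normalize each
    row to sum to one (self-financing)."""
    Y = np.array(M, dtype=float)
    np.fill_diagonal(Y, 0.0)
    return Y / Y.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Y = project_to_wealth_flow(rng.uniform(0.1, 1.0, size=(4, 4)))
assert np.allclose(np.diag(Y), 0.0)      # Tr(Y) = 0
assert np.allclose(Y.sum(axis=1), 1.0)   # Y 1 = 1
```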

Table 1. Glossary of Notations

| Notation | Representation |
|---|---|
| *N*, *T* | *N* assets of a portfolio are traded in *T* trading periods |
| Δ^*N* | *N*-dimensional simplex space, Δ^*N* = {**x** : **x** ∈ R_+^*N*, **x**^⊤**1** = 1} |
| **e**_i, **e**_w | vectors in which the *i*th or *w*th component is 1 and the others are zeros |
| **1**, **I** | **1** denotes a vector whose components are all ones; **I** denotes a unit matrix |
| **x**_t, **x**′_t | wealth proportion vectors at the beginning and end of trading period *t*, respectively |
| *w*′_t, *w*_{t+1} | portfolio wealth before and after the rebalancing at the end of trading period *t*, respectively |
| *c*_p, *c*_s | commission ratios for purchasing and selling assets |
| *μ*_{t+1} | discount factor for calculating the wealth *w*_{t+1} |
| **r**_t | relative price vector at the end of trading period *t* |
| **Y**_t, **y**_t | **Y**_t denotes the wealth flow matrix at the beginning of trading period *t*; **y**_t denotes the vectorization of **Y**_t, i.e., **y**_t = vec(**Y**_t) |
| *h* | function that maps a **Y**_t to an **x**_t |
| **Z**_t | updating direction matrix for **Y**_t |
| **m**_t | updating direction vector for **x**_t when updating **Y**_t |
| *ρ*, *φ*, *ϕ* | *ρ* = *φ*/*ϕ*, where *φ* and *ϕ* are the learning rates for **Y** and **x**, respectively |
| *ϵ*, *λ*, Ψ_1, Ψ_2 | penalty weights used in objective functions |
| *f*_{t+1}(**Y**) | objective function for **Y** |
| *д*_t(**x**) | evaluation function for **x** at trading period *t* |
| *θ*, *J*_{t+1}(*θ*) | weights and objective functions for the WFM neural network, respectively |
| R | replay buffer |
| *L*, *R* | loss function and regular function used by *J*_{t+1}(*θ*), respectively |
| p, q | density functions |
| *H*, *G* | constants used for estimating regret bounds |
| *u* | dimension of hidden features |
| *v* | number of convolutional kernels |
| *w* | window length of a sequence of **r**_t |
| *k* | next *k* trading periods for computing the loss function *L* |

Note that we can use the elements of a wealth flow matrix **Y** to compute a wealth proportion **x**, but we cannot do the reverse. This means that a wealth flow matrix **Y** provides more information than is yielded by a wealth proportion **x**. It is supposed that if a model can predict the wealth flow matrices **Y** and the wealth proportions **x** simultaneously, then the performance of the model will be improved.

**4.2** **Optimization Model**

Since wealth flow information can be used as a prior to derive decisions regarding wealth proportions, we can represent the joint probability of the wealth proportion **x** and the wealth flow matrix **Y** as p(**x**, **Y**) = p(**x** | **Y**) p(**Y**). Inspired by the method of maximum a posteriori estimation [53, 64], we design the WFM to maximize the logarithm of the joint probability p(**x**, **Y**). That is,

max_**Y** *f*_{t+1}(**Y**) = log(**r**_t^⊤ *h*_{t+1}(**Y**)) − (*ϵ*/2) ‖*h*_{t+1}(**Y**)‖² − *λ* ∑_{i=1}^{N} ∑_{j≠i} *y*_{i,j} log(*y*_{i,j} / *ŷ*_{i,j}), (4)

s.t. **Y** ∈ R_+^{N×N}, Tr(**Y**) = 0, **Y1** = **1**,

where the three terms of the objective are the single-period goal, the long-term goal, and the short-term goal, respectively,

and where *ϵ* and *λ* are small positive numbers, Tr(**Y**) denotes the trace of matrix **Y**, ‖·‖₂ denotes the *l*²-norm function, and the linear function *h*_{t+1} : R_+^{N×N} × R_+^{N} → R_+^{N} is defined by

*h*_{t+1}(**Y**) = **x**_t + **Y**^⊤**x**_t − **x**_t ∘ (**Y1**). (5)

For a future period *t*+1, *h*_{t+1} computes the net incomes (which may be negative) for all assets by the vector **Y**^⊤**x**_t − **x**_t ∘ (**Y1**). This vector is then added to the previous wealth proportion **x**_t to output the new proportion **x**_{t+1}. The symbol ∘ denotes the entry-wise product. Since we do not allow assets to store wealth flows in themselves when rebalancing a portfolio, the diagonal elements *y*_{i,i} of **Y** are constrained to zeros by Tr(**Y**) = 0, and the sum of each row is normalized to 1 by the constraint **Y1** = **1**.
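The objective of Equation (4) can be evaluated directly; the snippet below is our own numerical sketch (parameter values and function names are ours), with all off-diagonal entries of **Y** and **Ŷ**_t kept strictly positive so the Kullback–Leibler term is defined:

```python
import numpy as np

def h(Y, x_t):
    """Eq. (5): predicted proportions from a wealth flow matrix."""
    return x_t + Y.T @ x_t - x_t * Y.sum(axis=1)

def f(Y, x_t, r_t, Y_hat, eps=1e-3, lam=1e-3):
    """Eq. (4): single-period + long-term + short-term goals (a sketch)."""
    xp = h(Y, x_t)
    off = ~np.eye(len(x_t), dtype=bool)          # off-diagonal mask (j != i)
    single_period = np.log(r_t @ xp)
    long_term = -0.5 * eps * np.sum(xp ** 2)
    short_term = -lam * np.sum(Y[off] * np.log(Y[off] / Y_hat[off]))
    return single_period + long_term + short_term

x_t = np.array([0.5, 0.3, 0.2])
r_t = np.array([1.01, 0.99, 1.00])
Y = np.array([[0.0, 0.5, 0.5], [0.2, 0.0, 0.8], [0.3, 0.7, 0.0]])
Y_hat = np.array([[0.0, 0.4, 0.6], [0.3, 0.0, 0.7], [0.5, 0.5, 0.0]])
value = f(Y, x_t, r_t, Y_hat)
```

A gradient-based learner would maximize this value over the feasible matrices **Y**; the constraints Tr(**Y**) = 0 and **Y1** = **1** are handled by the mask and by the row normalization of the candidate matrices.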

The short-term goal defined in Equation (4) is to make profits by following the short-term market trends provided by the sample matrices **Ŷ**_t, i.e., samples of the wealth flow matrices. We use the Kullback–Leibler divergence [54] to measure the similarity between the empirical distribution q(**Y**) and the posterior distribution p(**Y** | **Ŷ**_t), as shown in Equation (4). The Kullback–Leibler divergence expression is derived by applying the variational Bayesian method [6]:

min_q KL(q(**Y**) ‖ p(**Y** | **Ŷ**_t)) = E_**Y** [q(**Y**) log(q(**Y**)) − q(**Y**) log(p(**Y** | **Ŷ**_t))]. (6)

The sample matrices **Ŷ**_t can be learned by the neural networks designed in Section 5.
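One simple way such a network could emit valid samples **Ŷ**_t is a row-wise softmax with the diagonal masked out (a hypothetical output head, not necessarily the paper's exact design):

```python
import numpy as np

def masked_row_softmax(logits):
    """Map raw N x N network outputs to a valid wealth flow matrix sample:
    the diagonal is forced to zero and each row is softmax-normalized."""
    N = logits.shape[0]
    masked = np.where(np.eye(N, dtype=bool), -np.inf, logits)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))  # exp(-inf) == 0
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
Y_hat = masked_row_softmax(rng.normal(size=(4, 4)))
assert np.allclose(np.diag(Y_hat), 0.0)
assert np.allclose(Y_hat.sum(axis=1), 1.0)
```

Any differentiable head with this output structure satisfies the constraints Tr(**Ŷ**_t) = 0 and **Ŷ**_t**1** = **1** by construction, so the KL term in Equation (4) stays well defined during training.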

The single-period goal is to make profits by assuming that the relative price vector **r**_{t+1} in the near future will not change much from the current vector **r**_t. The long-term goal in Equation (4) is to encourage the WFM to output an increasingly balanced wealth proportion **x**_{t+1}, which may cope with risks from market reversals. The long-term goal is based on the fact that the uniform vector (1/*N*)**1** is the solution for minimizing the *l*²-norm of the vectors in Δ^*N*. The uniform vector (1/*N*)**1** can be thought of as a bias for the predicted **x**_{t+1}. Since the expression of the single-period goal alone is concave but not strictly concave, the long-term goal makes the expression log(**r**_t^⊤ *h*_{t+1}(**Y**)) − (*ϵ*/2)‖*h*_{t+1}(**Y**)‖²₂ an *H*-strong concave function by setting a tiny number for *ϵ*, as proved by the following lemma.

Lemma 4.1. *Let д_t(**x**) = log(**r**_t^⊤ **x**) − (ϵ/2)‖**x**‖²; then, when ϵ ≥ H > 0, д_t is an H-strong concave function, that is,*

∀**x**_t ∈ R^*N* : ∇²*д*_t(**x**_t) ⪯ −*H***I**.

*Here, we use the notation **A** ⪯ **B** if **B** − **A** is a positive semidefinite matrix.*

Proof. By the definitions of log and the *l*²-norm, we have

∇² log(**r**_t^⊤ **x**) = − (**r**_t **r**_t^⊤) / (**r**_t^⊤ **x**)²,  ∇²(−(*ϵ*/2)‖**x**‖²) = −*ϵ***I**.

Thus, let **y** ∈ R^*N*; then, we have

**y**^⊤(−*H***I** − ∇²*д*_t)**y** = (*ϵ* − *H*)‖**y**‖² + (**r**_t^⊤ **y**)² / (**r**_t^⊤ **x**)²,

which means that *ϵ* ≥ *H* > 0 is a sufficient condition for making the above expression nonnegative. □
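Lemma 4.1 can be sanity-checked numerically: with *ϵ* ≥ *H* > 0, every eigenvalue of ∇²*д*_t(**x**) + *H***I** should be nonpositive. A small self-check (our own code; the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, eps, H = 5, 0.1, 0.05                 # eps >= H > 0, as the lemma requires
r = rng.uniform(0.9, 1.1, size=N)        # bounded relative prices
x = rng.dirichlet(np.ones(N))            # a point in the simplex

# Hessian of g_t(x) = log(r^T x) - (eps/2) ||x||^2
hess = -np.outer(r, r) / (r @ x) ** 2 - eps * np.eye(N)

# hess + H I must be negative semidefinite, i.e., hess <= -H I
max_eig = np.linalg.eigvalsh(hess + H * np.eye(N)).max()
assert max_eig <= 1e-10
```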

**4.3** **Gradient Information about Wealth Proportions and Flows**

In this section, we analyze the gradient information related to the wealth proportion vector **x** and the wealth flow matrix **Y** to understand the optimization process of the WFM and the dynamics of the wealth flows between each pair of assets in a portfolio.

In the WFM's definition (4), the wealth proportion **x** = *h*(**Y**) is, in fact, a dependent variable, which is driven by the independent variable, i.e., the wealth flow matrix **Y**. In the following lemma, we apply differential calculus to analyze how the dependent variable **x** changes with the independent variable **Y**.

Lemma 4.2. *Let **Y** ← **Y** + φ**Z** be the updating rule for the independent variable **Y** of the model (4), and let **x** ← **x** + ϕ**m** be the corresponding change in the dependent variable **x** = h(**Y**), where the learning rates are φ > 0 and ϕ > 0. Let ρ = φ/ϕ; then, we have m_i = ρ ∑_{j≠i}(z_{j,i} x_j − z_{i,j} x_i), i, j ∈ [1, N], and ‖**m**‖₂ ≤ 2√N ρ.*

Proof. Let the symbol *d* denote the forward difference operator; then, we have

*d***Y** = *φ***Z**,  *d***x** = *ϕ***m**. (7)

Recall that by Equation (5), the function *h* is used to obtain a new **x**, that is,

*h*(**Y**) = **x** + **Y**^⊤**x** − **x** ∘ (**Y1**). (8)

If we fix **x** and make a very small change to **Y** by *d***Y** in Equation (8), then *dh* can be used to approximate *d***x**, that is,

*d***x** = *d***Y**^⊤**x** − **x** ∘ (*d***Y** **1**). (9)

By the constraints of **Y** defined in Equation (4) and Equation (7), the elements *dy*_{i,j} of the difference matrix *d***Y** are

*dy*_{i,j} = *φz*_{i,j} if *i* ≠ *j*, 0 otherwise, where *z*_{i,j} ∈ [−1, 1]. (10)

Let **a** = *d***Y**^⊤**x** and **b** = **x** ∘ (*d***Y** **1**). After a simple calculation, we know that the elements of **a** and **b** are

*a*_i = ∑_{j≠i} (*x*_j *dy*_{j,i}),  *b*_i = *x*_i ∑_{j≠i} *dy*_{i,j}. (11)

Combining Equations (9) and (11), we have

*dx*_i = ∑_{j≠i} (*x*_j *dy*_{j,i}) − *x*_i ∑_{j≠i} *dy*_{i,j}.

Furthermore, using Equations (7) and (10), we arrive at

*m*_i = (*φ*/*ϕ*) ∑_{j≠i} (*z*_{j,i} *x*_j − *z*_{i,j} *x*_i). (12)

Let *ρ* = *φ*/*ϕ*; then, the squared *l*²-norm of **m** is

‖**m**‖²₂ = *ρ*² ∑_{i=1}^{N} (∑_{j≠i} (*z*_{j,i} *x*_j − *z*_{i,j} *x*_i))² ≤ 4*Nρ*²,

by *z*_{i,j} ∈ [−1, 1] and **x** ∈ Δ^*N*. □

Equation (12) gives the net-income rate*m**i* for asset*i*by using the gradient matrix* Z*. The net-
income rate

*m*

_{i}is then scaled by the learning rate

*ϕ*and added to the previous wealth proportion

*x*

*i*. This updating logic demonstrates the validity of using differential calculus for the definition of the function

*h*, as shown in Equation (9). The bound

*m*2 ≤2√

*N ρ*established by Lemma4.2
shows that the amplitude of the search direction for finding the optimal wealth proportion**x**^{∗}can
be estimated by the relative learning ratio*ρ*. The parameter*ρ*will be further used in the proof of
the regular conditions for the wealth proportions**x***t* in Section4.4.
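Equation (12) and the norm bound of Lemma 4.2 can be checked numerically; the learning rates and random inputs below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
lr_Y, lr_x = 0.02, 0.01          # phi and varphi of Lemma 4.2 (illustrative values)
rho = lr_Y / lr_x

Z = rng.uniform(-1, 1, (N, N))   # gradient matrix with z_{i,j} in [-1, 1]
np.fill_diagonal(Z, 0.0)
x = rng.dirichlet(np.ones(N))    # wealth proportions on the simplex

# Eq. (12): m_i = rho * sum_{j != i} (z_{j,i} x_j - z_{i,j} x_i).
m = rho * (Z.T @ x - x * Z.sum(axis=1))
print(np.linalg.norm(m) <= 2 * np.sqrt(N) * rho)   # Lemma 4.2 bound
```

The components of **m** also sum to zero, reflecting that the update only redistributes wealth between assets.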

The gradient information related to the wealth flow matrix **Y** demonstrates an interesting property about the directions of the wealth flows between pairs of assets in a portfolio. Note that the gradient of the single-period goal defined in Equation (4) is as follows:

∇_**Y** log(**r**^⊤ h(**Y**)) = ∇_**Y** log(**r**^⊤(**x** + **Y**^⊤**x** − **x** ∘ (**Y****1**)))

= (**r** **x**^⊤ − ((**r** ∘ **x**) **1**^⊤)^⊤) / (**r**^⊤(**x** + **Y**^⊤**x** − **x** ∘ (**Y****1**))). (13)

An example of the numerator in Equation (13) is the following 3-by-3 matrix:

⎡ 0               (r_1 − r_2)x_2   (r_1 − r_3)x_3 ⎤   → output of asset 1
⎢ (r_2 − r_1)x_1   0               (r_2 − r_3)x_3 ⎥
⎣ (r_3 − r_1)x_1   (r_3 − r_2)x_2   0             ⎦
      ↑ input of asset 1

in which the element (r_1 − r_2)x_2 in row 1 and column 2 means that if the relative price r_1 is greater than r_2, then the first asset outputs a wealth rate (r_1 − r_2), scaled by x_2, to the second asset, where the weight x_2 is the previous wealth proportion of the second asset. This wealth flow behavior is consistent with that of a mean-reversion market. Note that the gradient matrix in Equation (13) is obtained by using the numerator layout of the scalar-by-matrix derivative [8]. If the denominator layout [8] is used instead, the gradient matrix is the transpose of the matrix in Equation (13), and trend-following behavior is discovered.
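The following sketch (with hypothetical prices and proportions) builds the numerator of Equation (13) and confirms that the row-1, column-2 entry equals (r_1 − r_2)x_2:

```python
import numpy as np

r = np.array([1.05, 0.98, 1.02])   # hypothetical relative prices of 3 assets
x = np.array([0.5, 0.3, 0.2])      # previous wealth proportions

# Numerator of Eq. (13): r x^T - ((r ∘ x) 1^T)^T, so entry (i, j) = (r_i - r_j) x_j.
G = np.outer(r, x) - np.outer(np.ones(3), r * x)
print(np.allclose(np.diag(G), 0))  # no wealth flows from an asset to itself
```

The zero diagonal matches the Tr(**Y**) = 0 constraint: an asset never flows wealth to itself.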

**4.4** **Conditions for Updating Wealth Proportion Vectors**

To integrate both the exploitation of wealth flow matrices and the exploration of wealth growth in our model, in this section, we investigate the regular conditions for updating the wealth proportion vectors **x**_t when the updating directions **Z**_t for the wealth flow matrices **Y**_t are given. The regular conditions are derived by using an online convex optimization framework [21], where the regret bound defined for algorithm A can be written as follows.

*Definition 1.*

Regret(A, [g_1, . . . , g_T]) ≝ max_{**x** ∈ △^N} Σ_{t=1}^{T} g_t(**x**) − E_{**x**_t ∼ A(**x**_1, ..., **x**_{t−1})} [ Σ_{t=1}^{T} g_t(**x**_t) ]. (14)

The real-valued functions g_t are evaluation functions for computing regret bounds. The first term of Equation (14) is the profit obtained by algorithm A's adversary, who can see all the market data, i.e., the relative price vectors **r**_1, . . . , **r**_T, before making only one decision regarding the wealth proportion. The second term is the expected profit obtained by algorithm A. The game rules defined for A are the same as those defined in Section 3.1. Note that A can output a probabilistic wealth proportion **x**_t given the previous proportions **x**_1, . . . , **x**_{t−1}.

Based on the definition of a regret bound in Equation (14), we have the following result.

Theorem 4.3. Let g_t(**x**) = log(**r**_t^⊤ **x**) − (ϵ/2)‖**x**‖² be the evaluation functions required by the regret bound definition (14). Let **x**^∗ = arg max_{**x** ∈ △^N} Σ_{t=1}^{T} g_t(**x**) be the solution found by the adversary in hindsight, and let **x**_t and **m**_t be the wealth proportions and gradients given by Lemma 4.2, respectively. If the conditions ϵ ≥ H > 0 and (∇_t − **m**_t)^⊤(**x**_t − **x**^∗) ≥ 0, ∀t, are satisfied, then the wealth proportions **x**_t have regrets bounded by 2Nρ²(1 + log T)/H.

Proof. Let ∇_t = ∇g_t(**x**_t) and ∇²_t = ∇²g_t(**x**_t) be the gradient and Hessian of g_t at the point **x**_t. Since each g_t is twice differentiable, we can apply a Taylor series, up to second order, around the point **x**_t to represent the adversary's value g_t(**x**^∗). Let **ξ** = α**x**_t + (1 − α)**x**^∗, α ∈ [0, 1], and let ∇²_t(**ξ**) be the Hessian at **ξ**; thus, we have

g_t(**x**^∗) = g_t(**x**_t) + ∇_t^⊤(**x**^∗ − **x**_t) + (1/2)(**x**^∗ − **x**_t)^⊤ ∇²_t(**ξ**)(**x**^∗ − **x**_t). (15)

By Lemma 4.1, if ϵ ≥ H > 0, then ∇²_t(**x**) ⪯ −H**I**, ∀**x** ∈ R^N; thus,

2[g_t(**x**^∗) − g_t(**x**_t)] ≤ −2∇_t^⊤(**x**_t − **x**^∗) − H‖**x**_t − **x**^∗‖²_2. (16)
We define two variables **d**_t = **x**_t − **x**^∗ and d²_t = ‖**x**_t − **x**^∗‖². Given (∇_t − **m**_t)^⊤**d**_t ≥ 0, inequality (16) becomes

2[g_t(**x**^∗) − g_t(**x**_t)] ≤ −2∇_t^⊤**d**_t − H d²_t ≤ −2**m**_t^⊤**d**_t − H d²_t. (17)
We need to find a good upper bound for the last expression in inequality (17).

By Lemma 4.2, we know that **x**_{t+1} = **x**_t + ϕ_{t+1}**m**_t; thus,

d²_{t+1} = ‖Proj_{△^N}(**x**_t + ϕ_{t+1}**m**_t) − **x**^∗‖² ≤ ‖**d**_t + ϕ_{t+1}**m**_t‖² = d²_t + 2ϕ_{t+1}**m**_t^⊤**d**_t + ϕ²_{t+1}‖**m**_t‖², (18)

where Proj_{△^N}(·) denotes the projection operation that maps its input into the simplex △^N. Let a constant G be an upper bound of ‖**m**_t‖_2, i.e., ‖**m**_t‖_2 ≤ G; then, by inequality (18), the last expression in inequality (17) is upper bounded by

−2**m**_t^⊤**d**_t − H d²_t ≤ [ (1/ϕ_{t+1})(d²_t − d²_{t+1}) − H d²_t ] + ϕ_{t+1}G², (19)
where the bracketed part of inequality (19) must not depend on d_t if the sum of the right-hand side of inequality (19) is to be sublinearly bounded. Thus, the repeated terms in the sum must vanish, which requires ϕ_t and H to solve the following equation:

(1/ϕ_{t+1}) d²_t − (1/ϕ_t) d²_t − H d²_t = 0 ⇔ 1/ϕ_{t+1} − 1/ϕ_t − H = 0. (20)

If we treat difference equations as differential equations, the equation above means that 1/ϕ_t is a linear function of t with gradient H. A simple solution is 1/ϕ_{t+1} = Ht;

thus, summing inequality (17) over t gives

Σ_{t=1}^{T} [g_t(**x**^∗) − g_t(**x**_t)] ≤ (G²/2) Σ_{t=1}^{T} ϕ_{t+1} ≤ (G²/2H) Σ_{t=1}^{T} 1/t ≤ (G²/2H)(1 + log T). (21)

By Lemma 4.2, the l²-norm of **m**_t has the upper bound 2√N ρ, which can be chosen as G; substituting G = 2√N ρ into inequality (21) yields the stated bound 2Nρ²(1 + log T)/H. □
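The proof uses Proj_{△^N} only as a black box, and the paper does not commit to a particular implementation. One standard choice is the O(N log N) sort-based Euclidean projection onto the probability simplex, sketched here as an assumption, not the paper's stated method:

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex.

    Standard sort-based method: find the threshold theta such that
    max(v - theta, 0) sums to 1.
    """
    u = np.sort(v)[::-1]                  # sort coordinates in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = k[u - css / k > 0][-1]          # largest index with a positive gap
    theta = css[rho - 1] / rho
    return np.maximum(v - theta, 0.0)

# One update step of Theorem 4.3: x_{t+1} = Proj(x_t + phi_{t+1} m_t).
x = proj_simplex(np.array([0.4, 0.5, 0.3]))
print(np.isclose(x.sum(), 1.0) and (x >= 0).all())
```

Points already inside the simplex are fixed points of the projection, as expected of a Euclidean projection onto a convex set.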

The condition ϵ ≥ H > 0 leads to a relatively large constant factor 1/H in the regret bound 2Nρ²(1 + log T)/H of Theorem 4.3. However, it enables us to regulate the wealth proportions **x**_t by using only the first-order gradients ∇_t and **m**_t, which can be easily obtained from neural network optimizers. If the condition ϵ ≥ H > 0 is relaxed, a better regret bound [21] can be obtained by using the second-order gradients of the evaluation function g_t(**x**) = log(**r**_t^⊤ **x**); however, computing second-order gradients incurs higher time complexity and greater numerical instability than computing first-order gradients.

**5** **NEURAL NETWORK BASED ALGORITHM**

In this section, we design a neural network for the WFM by using deep neural network architectures [23, 59]. The WFM neural network can learn high-dimensional features [45] for the prediction of wealth flow matrices directly from massive financial data, and it can implement the goals of the WFM by using automatic differentiation frameworks [1] and graphics processing units. Based on the DRL paradigm [63], the main aspects of the WFM neural network are as follows:

— The negative of the objective function f_{t+1} defined in Equation (4) is used as the loss function J_{t+1} of the WFM neural network. J_{t+1} differs from the general value functions used in DRL in that the loss is recalculated for every trading period rather than accumulated during the execution of our reinforcement learning algorithm. Using single-period losses enables us to avoid the problem of unbounded value functions in Q-learning [20, 28] and to remove the dependencies between the Q-values [48] and action-values [42, 43] for different states.

— The wealth flow matrices **Y** are parameterized and vectorized as **y**(**θ**). The parameter **θ** is predicted by our reinforcement learning algorithm. We add a new loss term to the loss function J_{t+1}: the difference between the profit obtained by using the best constant rebalanced portfolio strategy in the next k trading periods and the profit obtained by using **y**_{t+1}^{t+k}(**θ**). The constraint Tr(**Y**) = 0 is approximated by adding a penalty item with a large weight to J_{t+1}. We also add a sparsity penalty item to J_{t+1}, since using the wealth proportions obtained from sparse wealth flow matrices can yield higher returns with lower risks.

— The WFM neural network is based on the recursive neural network architecture [23] for effectively predicting wealth flow matrices by learning high-dimensional hidden features [45], since short-term trends are stateful and relative price vectors are high-dimensional data. In the WFM neural network, we use long short-term memory [23] neural cells optimized with an attention mechanism [57, 59] to capture the correlations of the hidden feature vectors. Because the hidden feature vectors represent the relative price vectors in the hidden layers of the WFM neural network, the predictions of wealth flow matrices can be improved by this attention optimization approach.

For the above reasons, the improved objective function J_{t+1}(**θ**) is

min_**θ** J_{t+1}(**θ**) = −log(**r**_t^⊤ h_{t+1}(**y**_{t+1}(**θ**))) + (ϵ/2)‖h_{t+1}(**y**_{t+1}(**θ**))‖² + L(**y**_{t+1}^{t+k}(**θ**)) + R(**y**_{t+1}(**θ**)), (22)
where **y**_{t+1}(**θ**) = vec(**Y**_{t+1}), **y**_{t+1}(**θ**) ∈ R^{N²}, is a wealth flow vector predicted by the WFM neural network for trading period t + 1. The vectorization operator vec rearranges the elements of a matrix **A** into a column vector **a** in the column-wise order of **A**. The loss function L(**y**_{t+1}^{t+k}(**θ**)) is defined by

L(**y**_{t+1}^{t+k}(**θ**)) = E_{**r**_τ ∼ p(**r**_τ)} [ max_**x** Σ_{τ=t+1}^{t+k} log(**r**_τ^⊤ **x**) − Σ_{τ=t+1}^{t+k} log(**r**_τ^⊤ h_τ(**y**_τ(**θ**))) ], (23)

which is the expected difference between the profit achieved by using the best constant rebalanced portfolio strategy over the k future trading periods and the profit achieved by the forecasts of the WFM. Since the relative price vectors **r**_τ follow some unknown distribution p(**r**_τ), we use the historical relative price vectors, which are stored in the replay buffer, as an empirical distribution of p(**r**_τ), so that the loss function L(**y**_{t+1}^{t+k}(**θ**)) can be estimated.
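To make Equation (23) concrete, the following sketch estimates L by sampling k-period windows from a replay buffer of past relative prices. The two-asset grid search for the best constant rebalanced portfolio and the fixed stand-in for h_τ(**y**_τ(**θ**)) are deliberate simplifications, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(3)
k, N = 5, 2
buffer = rng.uniform(0.9, 1.1, (200, N))   # replay buffer of past relative prices

def estimate_L(x_model, n_samples=100):
    """Monte-Carlo estimate of Eq. (23), with a grid-searched best CRP (N = 2)."""
    grid = [np.array([a, 1 - a]) for a in np.linspace(0, 1, 101)]
    total = 0.0
    for _ in range(n_samples):
        window = buffer[rng.choice(len(buffer), k)]   # r_{t+1..t+k} ~ empirical p(r)
        best = max(np.log(window @ x).sum() for x in grid)
        total += best - np.log(window @ x_model).sum()
    return total / n_samples

L_hat = estimate_L(np.array([0.5, 0.5]))   # fixed proportion stands in for h_tau(y_tau(theta))
print(L_hat >= 0)
```

The estimate is nonnegative because the best CRP found in hindsight can never underperform the fixed stand-in portfolio on the same window.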

The regularization term R(**y**_{t+1}(**θ**)) in the new objective function J_{t+1}(**θ**) enforces the short-term goal and the constraints defined in the objective function f_{t+1}(**Y**). That is,

R(**y**_{t+1}(**θ**)) = λ Σ_{i=1,...,N²} [i mod (N+1) ≠ 1] y_{t+1,i}(**θ**) log( y_{t+1,i}(**θ**) / ŷ_{t,i} )
 + Ψ₁ Σ_{i=1,...,N²} [i mod (N+1) = 1] y_{t+1,i}(**θ**) + Ψ₂ ‖**y**_{t+1}‖_1, (24)

where the first part is the Kullback–Leibler divergence between the current estimate **y**_{t+1} and the sample **ŷ**_t; [a] is an indicator function that is 1 if its argument a is true and 0 otherwise; and Ψ₁, Ψ₂ ≫ 0 are large numbers for enforcing the constraint Tr(**Y**_{t+1}) = 0 defined in Equation (4) and the sparsity of **y**_{t+1}, respectively. The l¹-norm of **y**_{t+1} is used to measure the sparsity of **y**_{t+1}. If the constraint Tr(**Y**_{t+1}) = 0 is approximately satisfied via the second part of Equation (24), the constraint **Y****1** = **1** can be satisfied by applying the softmax function to the output **y**_{t+1}(**θ**).

The data processing mechanism of the WFM neural network is given by the following series of tensor operations:

**h**_t^1 = **W**_1 **r**_t + **b**_1, (25a)

**H**_t^{j+1} = lstm_a(**H**_t^j), (25b)

**H**_t^{j+1} = BatchNorm ∘ Dropout(**H**_t^{j+1}), j = 1, . . . , 2, (25c)

**y**_t = **W**_y vec([**H**_t^1 **e**_w, **H**_t^2 **e**_w, **H**_t^3 **e**_w]), (25d)
where the*j*th matrix**H**^{j} =[**h**_{t−w+1}^{j} , . . . ,**h**_{t}^{j}]is a set of the last*w* feature vectors**h**_{t}^{j} ∈ R^{u} sorted
by their time-steps*t*. Each feature vector**h***t* is a high-dimensional representation of each relative
price vector**r***t*, as shown in Equation (25a). There are three hidden layers for the three feature
matrices**H**^{j}. Equations (25a) to (25c) are used for nonlinear processing, which maps a relative price
vector**r***t*to a feature vector**h**_{t}^{j}. Equation (25d) is a linear transformation that outputs a wealth flow