
Wealth Flow Model: Online Portfolio Selection Based on Learning Wealth Flow Matrices

JIANFEI YIN, College of Computer Science and Software Engineering and National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University

RUILI WANG, School of Natural and Computational Sciences, Massey University

YEQING GUO, Tisson Regaltc Communication Technology

YIZHE BAI, SHUNDA JU, WEILI LIU, and JOSHUA ZHEXUE HUANG, Shenzhen University


This article proposes a deep learning solution to the online portfolio selection problem based on learning a latent structure directly from a price time series. It introduces a novel wealth flow matrix for representing a latent structure that has special regular conditions to encode the knowledge about the relative strengths of assets in portfolios. Therefore, a wealth flow model (WFM) is proposed to learn wealth flow matrices and maximize portfolio wealth simultaneously. Compared with existing approaches, our work has several distinctive benefits: (1) the learning of wealth flow matrices makes our model more generalizable than models that only predict wealth proportion vectors, and (2) the exploitation of wealth flow matrices and the exploration of wealth growth are integrated into our deep reinforcement algorithm for the WFM. These benefits, in combination, lead to a highly effective approach for generating reasonable investment behavior, including short-term trend following, the following of a few losers, no self-investment, and sparse portfolios. Extensive experiments on five benchmark datasets from real-world stock markets confirm the theoretical advantage of the WFM, which achieves Pareto improvements in terms of multiple performance indicators and the steady growth of wealth over the state-of-the-art algorithms.

CCS Concepts: • Theory of computation → Online learning algorithms; • Computing methodologies → Sequential decision making; • Applied computing → Economics;

Additional Key Words and Phrases: Online portfolio selection, wealth flow matrix, deep reinforcement learning, regret bound

We thank the editors and reviewers for their expert comments and constructive suggestions, and the support of the Shenzhen Basic Research Fund (Grant No: JCYJ20200813091134001) and National Natural Science Foundation of China (Grant No: 61972261).

Authors’ addresses: J. Yin, College of Computer Science and Software Engineering and National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Nanshan District, Shenzhen 518060, China; email: [email protected]; R. Wang (corresponding author), School of Natural and Computational Sciences, Massey University, Auckland 102904, New Zealand; email: [email protected]; Y. Guo, Tisson Regaltc Communication Technology, Guangzhou 510623, China; email: [email protected]; Y. Bai, S. Ju, W. Liu, and J. Z. Huang, Shenzhen University, Shenzhen 518060, China; emails: {baiyizhe2017, jushunda2017}@email.szu.edu.cn, {liuwl, zx.huang}@szu.edu.cn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.

Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2021 Association for Computing Machinery.

1556-4681/2021/08-ART30 $15.00 https://doi.org/10.1145/3464308

ACM Transactions on Knowledge Discovery from Data, Vol. 16, No. 2, Article 30. Publication date: August 2021.


ACM Reference format:

Jianfei Yin, Ruili Wang, Yeqing Guo, Yizhe Bai, Shunda Ju, Weili Liu, and Joshua Zhexue Huang. 2021. Wealth Flow Model: Online Portfolio Selection Based on Learning Wealth Flow Matrices. ACM Trans. Knowl. Discov. Data 16, 2, Article 30 (August 2021), 27 pages.



1 Introduction

In this article, we propose a deep learning solution to the online portfolio selection problem based on learning a latent structure directly from price time series. Online portfolio selection [2, 7, 10, 13, 15, 46] is about how to rebalance wealth proportion vectors x_t among N assets of a portfolio during each trading period t ∈ [1, T] to maximize the wealth of the portfolio at the end of period T. Many studies have demonstrated that markets have short-term trends, such as mean reversion [56, 67] and trend following [26]. Therefore, when we make rebalancing decisions, the short-term market trends need to be naturally embedded in a latent structure that defines different capital preferences for the various assets of portfolios.

Since learning latent structures is an effective way to capture structural knowledge [18, 24], we propose a wealth flow matrix Y_t for representing the latent structure, and it encodes knowledge about the relative strengths of assets in portfolios. Wealth flow matrices Y_t can be learned directly from the price time series by our algorithm. The ith row of Y_t represents the output wealth proportions of asset i relative to other assets j ≠ i. The sum of each row vector of Y_t is required to be one to meet the self-financing property of portfolios [37]. Given a wealth flow matrix Y_t, we can obtain the wealth proportion vector x_t from Y_t by using simple linear operations, but not vice versa. As such, a portfolio selection model that can predict wealth flow matrices is more generalizable than models that can only predict wealth proportion vectors.

Although there exist portfolio selection algorithms that can represent the latent structures of portfolios, such as the stability and sparsity of wealth proportions [17, 34, 67], directly from price data, they only treat these latent structures as regular conditions in their objective functions for limiting the search space of candidate solutions (i.e., wealth proportions). These latent structures are implicitly presented in the solutions. That is, their solutions cannot learn these latent structures from the dataset D_1 and then reuse them in another dataset D_2 without searching D_2 again. Also, these latent structures cannot represent the relative strength relationship between any pair of assets in portfolios. There exist a few studies on learning useful signals from out-of-band sources, such as the use of news [60, 61] to generate wealth proportions from neural networks. The signals found by these methods have longer delays than the signals provided by our wealth flow matrices Y_t, since Y_t are learned directly from price time series.

With the concept of the wealth flow matrix in mind, we build a wealth flow model (WFM) to learn wealth flow matrices and maximize portfolio wealth simultaneously. First, we define a linear function h to bridge the wealth flow matrices Y_t and wealth proportion vectors x_t. Since all Y_t represent probabilistic latent structures of portfolios, based on the variational Bayesian method [6], we use the Kullback–Leibler divergence [54] to measure the similarity between the predictive and posterior distributions of Y_t so that Y_t can be learned online from samples. Second, to address both the exploitation of wealth flow matrices and the exploration of wealth growth, we investigate the regular conditions for updating x_t when updating directions Z_t for Y_t are given. We find that the regular conditions are ϕ_{t+1} = 1/(Ht) and (∇_t − m_t)^T (x_t − x*) ≥ 0, where ϕ_{t+1} are the learning rates for x_t, m_t are the updating directions for x_t given Z_t, and H is some scalar constant. The regular conditions are derived by letting x_t chase a regret bound under the online convex optimization framework [21]. The regular conditions are later integrated into our deep reinforcement learning algorithm for the WFM.

To learn wealth flow matrices effectively from high-dimensional and massive price time series, we design a neural network to implement the WFM. The neural network of the WFM is a recursive neural network that uses long short-term memory [23] neural cells and runs on graphics processing units. We strengthen the neural cells by designing an attention mechanism based on the work [57, 59]. Therefore, the correlations of hidden features can be learned automatically to improve the prediction accuracy with respect to wealth flow matrices. Furthermore, we design a deep reinforcement learning algorithm to train the neural network of the WFM online. The algorithm estimates future losses by using a replay buffer [32, 48] without using the recursive estimation method used by Q-learning [20, 28] and the policy gradient [42, 43], because we have a precise definition of loss at each time step, instead of the uncertain rewards defined by the Q-value and action-value functions [28, 48].
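The replay-buffer idea above can be grounded with a minimal sketch. This is illustrative only (the class name `ReplayBuffer`, the capacity, and the stored tuple layout are our assumptions, not the authors' code): because the WFM defines a precise loss at each time step, training can sample past transitions uniformly instead of bootstrapping a Q-value.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform-sampling replay buffer (illustrative sketch)."""

    def __init__(self, capacity=10000):
        # A deque with maxlen evicts the oldest transition automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        """Store one transition, e.g., (price_window, flow_matrix, loss)."""
        self.buffer.append(transition)

    def sample(self, batch_size):
        """Uniformly sample a mini-batch of stored transitions."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=3)
for t in range(5):          # push 5 items into a capacity-3 buffer
    buf.push(("price_window", t))
batch = buf.sample(2)
```

With a fixed capacity, only the most recent transitions survive, which keeps the sampled losses close to the current market regime.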

In summary, we make the following main contributions in this article:

— We propose a novel wealth flow matrix to represent latent structures of portfolios. The wealth flow matrix has special regular conditions for encoding knowledge about the relative strengths of assets in portfolios. A model that can predict wealth flow matrices is more generalizable than models that can only predict wealth proportion vectors. The wealth flow matrix can be considered as side information [16, 65], but it can be learned directly from the given price time series without using out-of-band sources.

— We propose a novel portfolio selection model called the WFM, which learns wealth flow matrices and maximizes portfolio wealth simultaneously. The learning of wealth flow matrices is based on the principle of the variational Bayesian method [6]. The Kullback–Leibler divergence [54] is used to measure the similarities between the predictive and posterior distributions of wealth flow matrices. We investigate the conditions for updating the wealth proportion vectors when the updating directions for the wealth flow matrices are given. These conditions are later integrated into our deep reinforcement learning algorithm for the WFM so that the exploitation of wealth flow matrices and the exploration of wealth growth are integrated together in our training algorithm.

— We design a neural network to implement the WFM and a deep reinforcement learning algorithm to train the neural network. The neural network of the WFM is a recursive neural network in which cells are built with an attention mechanism. The deep reinforcement learning algorithm uses a replay buffer [32, 48] to estimate the loss at every time step without recursively estimating the Q-value [20, 28] and action-value [42, 43] functions.

The rest of this article is organized as follows. Section 2 reviews some related work regarding online portfolio selection. Section 3 gives the preliminaries and notations used throughout this article. Section 4 defines the proposed WFM for online portfolio selection. Section 5 presents the neural network for the WFM and the deep reinforcement learning algorithm. Section 6 presents a behavior analysis of the WFM neural network and some performance comparison experiments. Finally, we present our conclusion and future work in Section 7.


2 Related Work

Online portfolio selection is a special case of an online optimization problem [4, 14, 63], in which an agent needs to make periodic decisions based on partially observable streaming data, such as financial time series. Cover [15] proposed the universal portfolio algorithm, which has a good sublinear regret bound O(N log T) with respect to the best constant rebalanced portfolio in hindsight. The universal portfolio algorithm belongs to a class of sampling-based algorithms [4]. Blum et al. [7] provided a proof of the regret bound in the universal portfolio algorithm based on α-parameterized integration. Following the logic of the proof given by Blum et al. [7], one can show that if an algorithm A gives any proportion of its wealth to its sub-algorithms that mimic the policy of the universal portfolio algorithm, then algorithm A obtains a sublinear regret bound [9, 65]. As such, sampling-based algorithms provide reasonable explorations of wealth growth in the space of wealth proportion vectors.

To improve the time complexity of the universal portfolio algorithm while keeping sublinear regret bounds, some other algorithms were proposed. The exponentiated gradient algorithm [22] is a non-sampling-based algorithm that has a regret bound O(√(T log N)). The algorithm can be formalized as a special case of the multiplicative weight update method [4]. According to their method, if an algorithm A computes the probability of choosing portfolio decisions by the product of weights, algorithm A can have a sublinear regret bound. Agarwal et al. [2] proposed the online Newton step algorithm that has a bound O(GDN log T) by computing the inverse of the sum of matrices. Thus, the ONS algorithm exhibits poor scalability when the number of stocks N is large. Luo et al. [46] designed an efficient portfolio selection algorithm based on the online mirror descent framework [51].
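As an illustration of the multiplicative weight update method mentioned above, the textbook exponentiated gradient step for portfolios can be sketched as follows (the learning rate `eta` and the function name are our choices; this is the standard update from the literature, not code taken from [22]):

```python
import math

def eg_update(x, r, eta=0.05):
    """One exponentiated gradient step: each weight is multiplied by
    exp(eta * r_i / (x . r)) and the result is renormalized, so the
    decision is a product of weights over time."""
    ret = sum(xi * ri for xi, ri in zip(x, r))   # portfolio return x . r
    w = [xi * math.exp(eta * ri / ret) for xi, ri in zip(x, r)]
    z = sum(w)
    return [wi / z for wi in w]

x_new = eg_update([1 / 3, 1 / 3, 1 / 3], [1.02, 0.99, 1.00])
```

Because the update is multiplicative, assets with a higher relative price receive a larger share after renormalization, and the iterate stays inside the simplex.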

The regret bounds of many algorithms [2, 15, 22, 46] are estimated in the sense of asymptotic behavior. These algorithms are distribution independent and do not consider the short-term patterns of inputs [31], such as the mean reversion [56] and the sparse weights of assets [17, 34, 67]. As tested on several benchmark datasets [9, 15, 38], these algorithms perform poorly,¹ compared with some other practical approaches [9, 36]. These practical algorithms are not proven to have sublinear regret bounds but perform well on benchmark datasets. For example, Borodin et al. [9] proposed the ANTICOR algorithm, which predicts wealth proportions by calculating the sample correlation coefficients between each pair of stocks in two adjacent time windows. Li et al. [36] proposed the online portfolio selection with the moving average reversion algorithm, which predicts wealth proportions based on a moving average estimation for the relative price vector r_{t+1} at the next trading period. Many online portfolio selection algorithms [10, 13, 17, 34, 67] have been proposed using different prior structures, and these have achieved good performances on benchmark datasets.

If users are most concerned about the intermediate losses in the contract life of a portfolio, they can adopt appropriate risk control strategies, such as introducing the indicator maximum drawdown into the optimization target [49], executing long-and-short-term risk control [5], optimizing for transaction costs [40], and conducting market sentiment analysis [60, 62]. If trading costs are so high as to restrict high-frequency trading, timing of trading is one of the most effective trading methods [25, 66]. The basic idea of the method is to separate the given market sequence into sub-sequences and run an online portfolio selection algorithm A for each sub-sequence. Kozat et al. [33] proved that this method still has the property of possessing a sublinear regret bound if the algorithm A has a sublinear regret bound. The mean variance analysis approach [12] uses the covariance of rates of returns to measure the risk of the given portfolio. Xing et al. [61] proposed an algorithm using market sentiment views [11, 47] to estimate the mean vector and covariance matrix in the Black–Litterman model so that the wealth proportions from the first-order condition of the mean variance analysis can be computed.

Many regret bound-based online algorithms [2, 15, 22, 46] and other high-performance algorithms [9, 34, 36] design handcrafted features [50] for making decisions and use a minimal amount of historical data, usually only one to several data points, so that algorithms with limited computing power can make quick decisions when a new batch of data arrives. If the reality is that we only

¹The test can be verified by using the open source codes at https://github.com/Marigold/universal-portfolios.


Fig. 1. Temporal relationship of variables in online portfolio selection. Trading time t_1 is generally the closing time in period t. The portfolio wealth w_{t+1} is evaluated immediately after trading assets at time t_1. The ith entry of relative price vector r_{t+1} is the ratio of the ith asset's price at time (t+1)_1 to the price at time t_1.

need to rebalance assets on a day-to-day basis (since relative price vectors are given daily); we need to manage a large number of assets, for example, N > 1000; the market has years of long-term memory [29]; the short-term behavior of the market does not meet the assumptions of the Markov process; or we do have sufficient computing power (e.g., in terms of graphics processing units), then the performances of these algorithms will suffer from using too little data [3].

Online portfolio selection can be considered as an online game. Some deep reinforcement learning (DRL)-based methods [27, 28, 30, 42, 52] have been proposed for online trading and online portfolio selection. DRL is an integrated framework that combines high-dimensional feature learning and decision making. DRL uses graphics processing units to optimize complex functions built by neural networks, and it has been successful in many related areas, such as online video games [48] and the complex game Go [58]. To successfully apply DRL to solve the online portfolio selection problem, we need to carefully design reward functions that match the problem definitions to make the regrets of the obtained solutions bounded [27, 31, 52]. Q-learning [48] and policy gradient-based methods [42, 43] may be helpful for designing reward functions.


3 Preliminaries and Notations

In this section, we present the basic definition of the rebalancing process for a portfolio so that we can understand the objective functions defined in the following sections. The notations used in this article are also given in this section for reference.

3.1 Wealth Dynamic Model of Online Portfolio Selection

A portfolio has N ≥ 1 tradable assets, which can be risk-free or risky assets. There are T periods for trading assets. At the end of each period t, t ∈ [1, T), we can trade N assets so that the portfolio has a new wealth proportion x_{t+1} ∈ Δ_N, where Δ_N def= {x : x ∈ R_+^N, x^T 1 = 1}. For example, we sell and purchase assets at time t_1 during the period t to make the wealth of the ith asset equal to w_{t+1} x_{t+1,i} for i ∈ [1, N], where w_{t+1} is the portfolio wealth just after trading. The temporal relationship of portfolio selection is illustrated in Figure 1. Since the prices of the assets have changed from t_1 to (t+1)_1, the portfolio wealth w_{t+1} and its wealth proportion x_{t+1} will be changed implicitly by the market to

w′_{t+1} = w_{t+1} x_{t+1}^T r_{t+1},   x′_{t+1} = (x_{t+1} ∘ r_{t+1}) / (x_{t+1}^T r_{t+1}),   (1)

where ∘ denotes the entry-wise product of two vectors. Note that rebalancing the portfolio at time t_1 will shrink the portfolio wealth from w′_t to

w_{t+1} = μ_{t+1} w′_t,   (2)


by a factor μ_{t+1} ∈ (0, 1) if trading costs are considered. The factor μ_{t+1} satisfies a balance equation under the self-financing assumption [19]:

Σ_{i=1}^N (1 − cs_i)(x′_{t,i} − μ_{t+1} x_{t+1,i})_+ = Σ_{i=1}^N (1 + cp_i)(μ_{t+1} x_{t+1,i} − x′_{t,i})_+,   (3)

where cp_i and cs_i are the commission ratios of purchasing and selling assets, respectively. Typically, we set cp_i = cs_i = 0 for trading risk-free assets and cp_i = cs_i = c for trading risky assets. (y)_+ = y if y ≥ 0; otherwise, (y)_+ = 0. The factor μ_{t+1} can be solved iteratively from the following iterative equation:


μ_{t+1} = 1 − cs Σ_{i=2}^N (x′_{t,i} − μ_{t+1} x_{t+1,i})_+ − cp Σ_{i=2}^N (μ_{t+1} x_{t+1,i} − x′_{t,i})_+,

where the initial value of μ_{t+1} is (1 − cs)/(1 + cp). By applying Equation (1) and Equation (2) recursively on w′_T, we know that the cumulative wealth of a portfolio at the end of period T will be w′_T = w_1 ∏_{t=2}^T μ_t x_t^T r_t, where w_1 is the initial wealth. The goal of portfolio selection is to maximize the cumulative wealth w′_T at the end of period T under random relative price vectors r_t.
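The wealth dynamics above can be checked numerically. The sketch below implements the market update of Equation (1) and a fixed-point iteration for the transaction cost factor, assuming asset 1 is risk-free with zero commissions and a single commission ratio c for all risky assets; all function and variable names are illustrative.

```python
def market_update(w, x, r):
    """Equation (1): the market moves wealth w and proportions x by the
    relative price vector r."""
    growth = sum(xi * ri for xi, ri in zip(x, r))        # x^T r
    w_prime = w * growth
    x_prime = [xi * ri / growth for xi, ri in zip(x, r)]
    return w_prime, x_prime

def solve_mu(x_prime, x_next, c=0.001, iters=100):
    """Fixed-point iteration for the transaction cost factor, assuming
    asset 1 (index 0) is risk-free with zero commissions and the risky
    assets share the commission ratio c (cs = cp = c)."""
    pos = lambda y: y if y > 0.0 else 0.0
    mu = (1 - c) / (1 + c)                               # initial value from the text
    for _ in range(iters):
        sell = sum(pos(xp - mu * xn) for xp, xn in zip(x_prime[1:], x_next[1:]))
        buy = sum(pos(mu * xn - xp) for xp, xn in zip(x_prime[1:], x_next[1:]))
        mu = 1 - c * sell - c * buy
    return mu

w_prime, x_prime = market_update(1.0, [0.2, 0.5, 0.3], [1.0, 1.05, 0.97])
mu = solve_mu(x_prime, [1 / 3, 1 / 3, 1 / 3])
```

For small commission ratios the iteration converges quickly, and the factor stays just below one, so each rebalancing shrinks wealth only slightly.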

Hereafter, we assume that the relative price r_i of any single asset i in a portfolio is bounded by 0 < L_r ≤ r_i ≤ U_r, where L_r and U_r are some scalar constants. For the sake of readability, unless necessary, we suppress the suffix t for the time index when the meaning of time-dependent variables, e.g., x, r, can be understood from the context. We use the notation y to represent the vectorization of the wealth flow matrix Y. The notations used in this article are summarized in Table 1.


4 Wealth Flow Model

This section presents an optimization model, i.e., the WFM for online portfolio selection. The model is based on the proposed prior structure of the wealth flow matrix. Several gradient-based properties of the WFM are investigated to enable the learning of wealth flow matrices and the maximization of portfolio wealth simultaneously.

4.1 Wealth Flow Matrix

Numerous studies support the existence of different short-term market trends [26, 56]. That is, before we make rebalancing decisions, the short-term market trends need to be naturally embedded in a latent structure that defines different capital preferences for the assets of portfolios. Based on this observation, we propose a wealth flow matrix to represent such latent structures. Each wealth flow matrix Y encodes knowledge about the relative strengths of the assets in the context of a portfolio. Y is an asymmetric and nonnegative matrix. An example of Y is given as follows:

    0     0.5   0.5
    0.2   0     0.8
    0.3   0.7   0


The example has N = 3 assets. The ith row vector represents the proportions of wealth that flow from asset i to the other assets j ≠ i. For example, y_{1,2} and y_{1,3} are the proportions of wealth that flow from asset i = 1 to asset j = 2 and asset j = 3, respectively. We assume that there is no cache of wealth remaining in asset i's virtual account when asset i's wealth flows out to the virtual accounts of other assets. Thus, the diagonal elements of a wealth flow matrix are zeros, and the sum of each row is one.


Table 1. Glossary of Notations

Notation         Representation
N, T             N assets of a portfolio are traded in T trading periods
Δ_N              N-dimensional simplex space, Δ_N = {x : x ∈ R_+^N, x^T 1 = 1}
e_i, e_w         vectors in which the ith or wth component is 1 and the others are zeros
1, I             1 denotes a vector whose components are all ones; I denotes an identity matrix
x_t, x′_t        wealth proportion vectors at the beginning and end of the trading period t, respectively
w′_t, w_{t+1}    portfolio wealth before and after the rebalancing at the end of the trading period t, respectively
cp, cs           commission ratios for purchasing and selling assets
μ_{t+1}          discount factor for calculating the wealth w_{t+1}
r_t              relative price vector at the end of the trading period t
Y_t, y_t         Y_t denotes the wealth flow matrix at the beginning of the trading period t; y_t denotes the vectorization of Y_t, i.e., y_t = vec(Y_t)
h                function that maps a Y_t to an x
Z_t              updating direction matrix for Y_t
m_t              updating direction vector for x_t when updating Y_t
ρ, φ, ϕ          ρ = φ/ϕ, where φ and ϕ are the learning rates for Y and x, respectively
ϵ, λ_1, λ_2      penalty weights used in objective functions
f_{t+1}(Y)       objective function for Y
д_t(x)           evaluation function for x at the trading period t
θ, J_{t+1}(θ)    weights and objective functions for the WFM neural network, respectively
R                replay buffer
L, R             loss function and regular function used by J_{t+1}(θ), respectively
p, q             density functions
H, G             constants used for estimating regret bounds
u                dimension of hidden features
v                number of convolutional kernels
w                window length of a sequence of r_t
k                next k trading periods for computing the loss function L

Note that we can use the elements of a wealth flow matrix Y to compute a wealth proportion x, but we cannot do the reverse. This means that a wealth flow matrix Y provides more information than yielded by a wealth proportion x. It is supposed that if a model can predict wealth flow matrices Y and the wealth proportion x simultaneously, then the performance of the model will be improved.

4.2 Optimization Model

Since wealth flow information can be used as a prior to derive decisions regarding wealth proportions, we can represent the joint probability of wealth proportion x and wealth flow matrix Y as p(x, Y) = p(x|Y) p(Y). Inspired by the method of maximum a posteriori estimation [53, 64], we design a WFM to maximize the logarithm of the joint probability p(x, Y). That is,

max_Y f_{t+1}(Y) = log(r_t^T h_{t+1}(Y))                         [single-period goal]
                 − (ϵ/2) ‖h_{t+1}(Y)‖_2^2                        [long-term goal]
                 − λ Σ_{i=1}^N Σ_{j≠i} y_{i,j} log(y_{i,j}/ŷ_{i,j})   [short-term goal]
s.t. Y ∈ R_+^{N×N}, Tr(Y) = 0, Y1 = 1,   (4)

where ϵ and λ are small positive numbers, Tr(Y) denotes the trace of matrix Y, ‖.‖_2 denotes the l_2-norm function, and the linear function h_{t+1} : R_+^{N×N} × R_+^N → R_+^N is defined by

h_{t+1}(Y) = x_t + Y^T x_t − x_t ∘ (Y1).   (5)

For a future period t+1, h_{t+1} computes the net incomes (which may be negative) for all assets by the vector Y^T x_t − x_t ∘ (Y1). This vector is then added to the previous wealth proportion x_t to output the new proportion x_{t+1}. The symbol ∘ denotes the entry-wise product. Since we do not allow assets to store wealth flows in themselves when rebalancing a portfolio, the diagonal elements y_{i,i} of Y are constrained to zeros by Tr(Y) = 0, and the sum of each row is normalized to 1 by the constraint Y1 = 1.
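A minimal sketch of the mapping h in Equation (5), using plain Python lists (names are illustrative). It also checks that h preserves total wealth: when Y has zero diagonal and row sums equal to one, the total inflow Σ_i Σ_j Y[j][i] x[j] equals the total outflow Σ_i x[i] Σ_j Y[i][j], so the entries of the new proportion still sum to one.

```python
def h(Y, x):
    """Map a wealth flow matrix Y and current proportions x to new
    proportions: x + inflow - outflow, where inflow_i = sum_j Y[j][i]*x[j]
    and outflow_i = x[i] * sum_j Y[i][j]."""
    n = len(x)
    inflow = [sum(Y[j][i] * x[j] for j in range(n)) for i in range(n)]
    outflow = [x[i] * sum(Y[i]) for i in range(n)]
    return [x[i] + inflow[i] - outflow[i] for i in range(n)]

# The 3-asset example matrix from Section 4.1.
Y = [[0.0, 0.5, 0.5],
     [0.2, 0.0, 0.8],
     [0.3, 0.7, 0.0]]
x_new = h(Y, [0.5, 0.3, 0.2])
```

The result is again a valid wealth proportion vector, which is the self-financing property the row-sum constraint is designed to enforce.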

The short-term goal defined in Equation (4) is to make profits by following the short-term market trends provided by the sample matrices Ŷ_t, i.e., samples of the wealth flow matrices. We use the Kullback–Leibler divergence [54] to measure the similarity between the empirical distribution q(Y) and the posterior distribution p(Y|Ŷ_t), as shown in Equation (4). The Kullback–Leibler divergence expression is derived by applying the variational Bayesian method [6]:

KL(q(Y) ‖ p(Y|Ŷ_t)) = ∫ q(Y) log(q(Y)) dY − ∫ q(Y) log(p(Y|Ŷ_t)) dY.   (6)

The sample matrices Ŷ_t can be learned by the neural networks designed in Section 5.

The single-period goal is to make profits by assuming that the relative price vector r_{t+1} in the near future will not change much from the current vector r_t. The long-term goal in Equation (4) is to encourage the WFM to output an increasingly balanced wealth proportion x_{t+1}, which may cope with risks from market reversals. The long-term goal is based on the fact that the uniform vector (1/N)1 is the solution for minimizing the l_2-norm of the vectors in Δ_N. The uniform vector (1/N)1 can be thought of as a bias for the predicted x_{t+1}. Since the expression of the single-period goal alone is concave but not strictly concave, the long-term goal makes the expression log(r_t^T h_{t+1}(Y)) − (ϵ/2)‖h_{t+1}(Y)‖_2^2 an H-strong concave function by setting a tiny number for ϵ, as proved by the following lemma.

Lemma 4.1. Let д_t(x) = log(r_t^T x) − (ϵ/2)‖x‖^2; then, when ϵ − H > 0, д_t is an H-strong concave function, that is,

∀x_t ∈ R^N : ∇^2 д_t(x_t) ⪯ −HI.

Here, we use the notation A ⪯ B if B − A is a positive semidefinite matrix.

Proof. By the definitions of log and the l_2-norm, we have

∇^2 log(r_t^T x) = −(r_t r_t^T) / (r_t^T x)^2,   ∇^2 (−(ϵ/2)‖x‖^2) = −ϵI.

Thus, let y ∈ R^N; then, we have

y^T (−HI − ∇^2 д_t) y = (ϵ − H)‖y‖^2 + (r_t^T y)^2 / (r_t^T x)^2,

which means that ϵ − H > 0 is a sufficient condition for making the above expression nonnegative. □
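The quadratic form from this proof can be checked numerically. The sketch below evaluates (ϵ − H)‖y‖² + (r^T y)²/(r^T x)² at random directions y and confirms it never goes negative when ϵ − H > 0; the function name and test values are our choices.

```python
import random

def strong_concavity_check(r, x, eps, H, trials=200):
    """Check the quadratic form from the proof of Lemma 4.1:
    y^T(-H I - Hess g_t) y = (eps - H)*||y||^2 + (r^T y)^2 / (r^T x)^2,
    which should be nonnegative for every y whenever eps - H > 0."""
    random.seed(0)
    n = len(r)
    rx = sum(ri * xi for ri, xi in zip(r, x))
    for _ in range(trials):
        y = [random.uniform(-1.0, 1.0) for _ in range(n)]
        ry = sum(ri * yi for ri, yi in zip(r, y))
        yy = sum(yi * yi for yi in y)
        if (eps - H) * yy + (ry / rx) ** 2 < 0.0:
            return False
    return True

ok = strong_concavity_check(r=[1.02, 0.99, 1.00], x=[0.3, 0.3, 0.4],
                            eps=0.01, H=0.005)
```

Since both terms of the quadratic form are nonnegative under ϵ − H > 0, the check passes for every sampled direction, mirroring the sufficiency argument in the proof.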


4.3 Gradient Information about Wealth Proportions and Flows

In this section, we analyze the gradient information related to the wealth proportion vector x and the wealth flow matrix Y to understand the optimization process of the WFM and the dynamics of the wealth flows between each pair of assets in a portfolio.

In the WFM's Definition (4), the wealth proportion x = h(Y) is, in fact, a dependent variable, which is driven by the independent variable, i.e., the wealth flow matrix Y. In the following lemma, we apply differential calculus to analyze how the dependent variable x changes with the independent variable Y.

Lemma 4.2. Let Y ← Y + φZ be the updating rule for the independent variable Y of the model (4) and x ← x + ϕm be the corresponding change in the dependent variable x = h(Y), where the learning rates are φ > 0 and ϕ > 0. Let ρ = φ/ϕ; then, we have m_i = ρ Σ_{j≠i} (z_{j,i} x_j − z_{i,j} x_i), i, j ∈ [1, N], and ‖m‖_2 ≤ 2√N ρ.

Proof. Let the symbol d denote the forward difference operator; then, we have

dY = φZ,   dx = ϕm.   (7)

Recall that by Equation (5), the function h is used to obtain a new x, that is,

h(Y) = x + Y^T x − x ∘ (Y1).   (8)

If we fix x and make a very small change to Y by dY in Equation (8), then dh can be used to approximate dx, that is,

dx = dY^T x − x ∘ (dY1).   (9)

By the constraints of Y defined in Equation (4) and Equation (7), the elements dy_{i,j} of the difference matrix dY are

dy_{i,j} = φ z_{i,j} if i ≠ j, and 0 otherwise, where z_{i,j} ∈ [−1, 1].   (10)

Let a = dY^T x and b = x ∘ (dY1). After a simple calculation, we know that the elements of a and b are

a_i = Σ_{j≠i} x_j dy_{j,i},   b_i = x_i Σ_{j≠i} dy_{i,j}.   (11)

Combining Equations (9) and (11), we have

dx_i = Σ_{j≠i} x_j dy_{j,i} − x_i Σ_{j≠i} dy_{i,j}.

Furthermore, using Equations (7) and (10), we arrive at

m_i = (φ/ϕ) Σ_{j≠i} (z_{j,i} x_j − z_{i,j} x_i).   (12)

Let ρ = φ/ϕ; then, the squared l_2-norm of m is

‖m‖_2^2 = ρ^2 Σ_{i=1}^N ( Σ_{j≠i} (z_{j,i} x_j − z_{i,j} x_i) )^2 ≤ 4Nρ^2,

by z_{i,j} ∈ [−1, 1] and x ∈ Δ_N. □

Equation (12) gives the net-income rate m_i for asset i by using the gradient matrix Z. The net-income rate m_i is then scaled by the learning rate ϕ and added to the previous wealth proportion x_i. This updating logic demonstrates the validity of using differential calculus for the definition of the function h, as shown in Equation (9). The bound ‖m‖_2 ≤ 2√N ρ established by Lemma 4.2 shows that the amplitude of the search direction for finding the optimal wealth proportion x can be estimated by the relative learning ratio ρ. The parameter ρ will be further used in the proof of the regular conditions for the wealth proportions x_t in Section 4.4.
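The formula for the net-income rates in Lemma 4.2 can be evaluated directly. The direction matrix Z below is an illustrative choice with entries in [−1, 1]; a useful sanity check is that the rates sum to zero (wealth is only redistributed among assets) and that ‖m‖_2 stays below the bound 2√N ρ.

```python
import math

def net_income_rates(Z, x, rho):
    """Lemma 4.2: m_i = rho * sum_{j != i} (z_{j,i}*x_j - z_{i,j}*x_i)."""
    n = len(x)
    return [rho * sum(Z[j][i] * x[j] - Z[i][j] * x[i]
                      for j in range(n) if j != i)
            for i in range(n)]

Z = [[0.0, 0.6, -0.4],      # an illustrative updating direction matrix
     [-0.6, 0.0, 1.0],
     [0.4, -1.0, 0.0]]
x = [0.5, 0.3, 0.2]
m = net_income_rates(Z, x, rho=0.5)
norm = math.sqrt(sum(v * v for v in m))
```

Each pairwise term z_{j,i} x_j appears once with a plus sign (in m_i) and once with a minus sign (in m_j), which is why the components of m cancel in the sum.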

The gradient information related to wealth flow matrix Y demonstrates an interesting property about the directions of the wealth flows between pairs of assets in a portfolio. Note that the gradient for the single-period goal defined in Equation (4) is as follows:

∇_Y log(r^T h(Y)) = ∇_Y log(r^T (x + Y^T x − x ∘ (Y1)))
                  = (r x^T − 1 (r ∘ x)^T) / (r^T (x + Y^T x − x ∘ (Y1))).   (13)


An example of the numerator in Equation (13) is the following 3-by-3 matrix:

    0               (r_1 − r_2)x_2   (r_1 − r_3)x_3    → output of asset 1
    (r_2 − r_1)x_1   0               (r_2 − r_3)x_3
    (r_3 − r_1)x_1   (r_3 − r_2)x_2   0
    ↑ input of asset 1

in which the element (r_1 − r_2)x_2 in row 1 and column 2 means that if the relative price r_1 is greater than r_2, then the first asset will output a wealth rate (r_1 − r_2), scaled by x_2, to the second asset, where the weight x_2 is the previous wealth proportion of the second asset. This behavior of wealth flow is consistent with that of the mean reversion market. Note that the gradient matrix in Equation (13) is obtained by using the numerator layout of the scalar-by-matrix derivative [8]. If the denominator layout [8] is used, then the gradient matrix must be the transpose of the matrix in Equation (13), and trend-following behavior will be discovered.
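The numerator matrix of Equation (13) can be built with one comprehension; the function name and the numeric r, x values are illustrative.

```python
def flow_gradient_numerator(r, x):
    """Numerator of Equation (13): entry (i, j) = (r_i - r_j) * x_j for
    i != j, and 0 on the diagonal (no self-investment)."""
    n = len(r)
    return [[(r[i] - r[j]) * x[j] if i != j else 0.0 for j in range(n)]
            for i in range(n)]

G = flow_gradient_numerator(r=[1.03, 0.98, 1.00], x=[0.5, 0.3, 0.2])
```

Here G[0][1] = (r_1 − r_2)x_2 is positive because r_1 > r_2, i.e., the gradient pushes wealth out of the recent winner toward the loser, which is the mean-reversion direction discussed above.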

4.4 Conditions for Updating Wealth Proportion Vectors

To integrate both the exploitation of wealth flow matrices and the exploration of wealth growth in our model, in this section, we investigate the regular conditions for updating the wealth proportion vectors x_t when the updating directions Z_t for the wealth flow matrices Y_t are given. The regular conditions are derived by using an online convex optimization framework [21], where the regret bound defined for algorithm A can be written as follows.

Definition 1.

Regret(A, [д_1, . . . , д_T]) def= max_{x ∈ Δ_N} Σ_{t=1}^T д_t(x) − E_{x_t ∼ A(x_1, ..., x_{t−1})} [ Σ_{t=1}^T д_t(x_t) ].   (14)


The real-valued functions д_t are evaluation functions for computing regret bounds. The first item of Equation (14) is the profit obtained by algorithm A's adversary, who can see all the market data, i.e., the relative price vectors r_1, . . . , r_T, before making only one decision regarding the wealth proportion. The second item is the expected profit obtained by algorithm A. The game rules defined for A are the same as those defined in Section 3.1. Note that A can output a probabilistic wealth proportion x_t given the previous proportions x_1, . . . , x_{t−1}.

Based on the definition of a regret bound in Equation (14), we have the following result.

Theorem 4.3. Let д_t(x) = log(r_t^T x) − (ϵ/2)‖x‖^2 be the evaluation functions required by the regret bound definition (14). Let x* = arg max_{x ∈ Δ_N} Σ_{t=1}^T д_t(x) be the solution found by the adversary in hindsight, and let x_t and m_t be the wealth proportions and gradients given by Lemma 4.2, respectively. If conditions ϵ − H > 0 and (∇_t − m_t)^T (x_t − x*) ≥ 0, ∀t, are satisfied, then the wealth proportions x_t have regrets bounded by 2Nρ^2(1 + log T)/H.

Proof. Let $\nabla_t = \nabla g_t(x_t)$ and $\nabla_t^2 = \nabla^2 g_t(x_t)$ be the gradient and Hessian of $g_t$ at point $x_t$. Since each $g_t$ is twice differentiable, we can apply a Taylor series, up to second order, around our point $x_t$ to represent the adversary's value $g_t(x^*)$. Let $\xi = \alpha x_t + (1 - \alpha)x^*$, $\alpha \in [0, 1]$, and let $\nabla_t^2(\xi)$ be the Hessian at $\xi$; thus, we have

$$g_t(x^*) = g_t(x_t) + \nabla_t^\top (x^* - x_t) + \frac{1}{2}(x^* - x_t)^\top \nabla_t^2(\xi)(x^* - x_t). \tag{15}$$

By Lemma 4.1, if $\epsilon \geq H > 0$, then $\nabla_t^2(x) \preceq -HI$, $\forall x \in \mathbb{R}^N$; thus,

$$2[g_t(x^*) - g_t(x_t)] \leq -2\nabla_t^\top (x_t - x^*) - H\|x_t - x^*\|_2^2. \tag{16}$$

We define two variables $d_t = x_t - x^*$ and $d_t^2 = \|x_t - x^*\|_2^2$. Given $(\nabla_t - m_t)^\top d_t \geq 0$, the inequality (16) becomes

$$2[g_t(x^*) - g_t(x_t)] \leq -2\nabla_t^\top d_t - Hd_t^2 \leq -2m_t^\top d_t - Hd_t^2. \tag{17}$$

We need to find a good upper bound for the last expression in inequality (17).

By Lemma 4.2, we know $x_{t+1} = \mathrm{Proj}_{\Delta_N}(x_t + \phi_{t+1} m_t)$; thus,

$$d_{t+1}^2 = \|\mathrm{Proj}_{\Delta_N}(x_t + \phi_{t+1} m_t) - x^*\|^2 \leq \|x_t + \phi_{t+1} m_t - x^*\|^2 = d_t^2 + 2\phi_{t+1} m_t^\top d_t + \phi_{t+1}^2 \|m_t\|^2, \tag{18}$$

where $\mathrm{Proj}_{\Delta_N}(\cdot)$ denotes the projection operation that maps its inputs into the space $\Delta_N$. Let a constant $G$ be an upper bound of $\|m_t\|_2$, i.e., $\|m_t\|_2 \leq G$; then, by the inequality (18), the last expression in inequality (17) is upper bounded by

$$-2m_t^\top d_t - Hd_t^2 \leq \left[\frac{1}{\phi_{t+1}}\left(d_t^2 - d_{t+1}^2\right) - Hd_t^2\right] + \phi_{t+1} G^2, \tag{19}$$

where the bracketed part of the inequality (19) must not depend on $d_t$ to make the sum of the right-hand side of the inequality (19) sublinearly bounded. Thus, the repeated terms in the sum have to be zeros, which is achieved by making $\phi_t$ and $H$ solve the following equation:

$$\frac{1}{\phi_{t+1}} d_t^2 - \frac{1}{\phi_t} d_t^2 - H d_t^2 = 0 \;\Longleftrightarrow\; \frac{1}{\phi_{t+1}} - \frac{1}{\phi_t} - H = 0. \tag{20}$$

If we treat difference equations as differential equations, the equation above means that $\frac{1}{\phi_t}$ is a linear function of $t$ with a gradient $H$. A simple solution is

$$\frac{1}{\phi_{t+1}} = Ht; \tag{21}$$

thus, the sum of the inequality (17) is

$$\sum_{t=1}^{T} \left[g_t(x^*) - g_t(x_t)\right] \leq \frac{G^2}{2} \sum_{t=1}^{T} \phi_{t+1} = \frac{G^2}{2H} \sum_{t=1}^{T} \frac{1}{t} \leq \frac{G^2}{2H}(1 + \log T).$$

By Lemma 4.2, the $\ell_2$-norm of $m_t$ has an upper bound $2\sqrt{N}\rho$, which can be a choice for $G$; substituting $G = 2\sqrt{N}\rho$ gives the claimed bound $2N\rho^2(1 + \log T)/H$. □

The condition $\epsilon \geq H > 0$ leads to a relatively high constant factor $1/H$ in the regret bound $2N\rho^2(1 + \log T)/H$ according to Theorem 4.3. However, it enables us to regulate the wealth proportions $x_t$ by using only the first-order gradients $\nabla_t$ and $m_t$, which can easily be obtained from neural network optimizers. If the condition $\epsilon \geq H > 0$ is relaxed, a better regret bound [21] can be obtained by using the second-order gradients of the evaluation function $g_t(x) = \log(r_t^\top x)$, but computing second-order gradients has higher time complexity and more numerical instability than computing first-order gradients.
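For intuition, the projected first-order update analyzed in Theorem 4.3 can be sketched as follows (a simplified stand-in: the exact gradient of $g_t$ replaces the $m_t$ of Lemma 4.2, the simplex projection uses a standard sort-based routine, and the price data are made-up):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def ogd_portfolio(R, eps=0.5, H=0.5):
    """x_{t+1} = Proj(x_t + phi_{t+1} m_t) with 1/phi_{t+1} = H t,
    for g_t(x) = log(r_t . x) - (eps/2) ||x||^2; eps >= H > 0 holds."""
    T, N = R.shape
    x = np.ones(N) / N
    xs = [x]
    for t in range(1, T + 1):
        r = R[t - 1]
        m = r / (r @ x) - eps * x   # first-order gradient of g_t at x_t
        phi = 1.0 / (H * t)         # step size from the proof
        x = project_simplex(x + phi * m)
        xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(1)
R = 1.0 + 0.05 * rng.standard_normal((100, 4))  # made-up relative prices
xs = ogd_portfolio(R)
```

Every iterate stays on the simplex, so the wealth proportions remain valid portfolios throughout.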


In this section, we design a neural network for the WFM by using deep neural network architectures [23, 59]. The WFM neural network can learn high-dimensional features [45] for the prediction of wealth flow matrices directly from massive financial data, and it can also implement the goals of the WFM by using automatic differentiation frameworks [1] and graphics processing units. Based on the DRL paradigm [63], the main aspects of the WFM neural network are as follows:

— The negative of the objective function $f_{t+1}$ defined in Equation (4) is used as the loss function $J_{t+1}$ of the WFM neural network. $J_{t+1}$ is different from the general value functions used in DRL; that is, the loss is recalculated for every trading period and not accumulated during the execution of our reinforcement learning algorithm. Using single-period losses enables us to avoid the problem of unbounded value functions in Q-learning [20, 28] and to remove the dependencies between the Q-values [48] and action-values [42, 43] for different states.

— The wealth flow matrices $Y$ are parameterized and vectorized as $y(\theta)$. The parameter $\theta$ is predicted by our reinforcement learning algorithm. We add a new loss term to the loss function $J_{t+1}$: the difference between the profit obtained by using the best constant rebalanced portfolio strategy in the next $k$ trading periods and the profit obtained by using $y_{t+1}^{t+k}(\theta)$. The constraint $\mathrm{Tr}(Y) = 0$ is approximated by adding a penalty term with a large weight to $J_{t+1}$. We also add a sparsity penalty term to $J_{t+1}$, since using the wealth proportions obtained from sparse wealth flow matrices can yield higher returns with lower risks.

— The WFM neural network is based on the recurrent neural network architecture [23] for effectively predicting wealth flow matrices by learning high-dimensional hidden features [45], since short-term trends are stateful and relative price vectors are high-dimensional data. In the WFM neural network, we use long short-term memory [23] neural cells optimized with an attention mechanism [57, 59] to capture the correlations of hidden feature vectors. Because hidden feature vectors represent the relative price vectors in the hidden layers of the WFM neural network, the predictions of wealth flow matrices can be improved by using this attention optimization approach.
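As a rough structural sketch of the attention step (pure-numpy stand-ins; the real model uses trained LSTM cells, and all shapes and values here are made-up), each hidden feature vector can attend to the others in the window so that correlations between time steps reweight the features:

```python
import numpy as np

def attention(H):
    """Scaled dot-product attention over the columns of H (u x w):
    each of the w hidden feature vectors is reweighted by its
    similarity to the others, capturing cross-time correlations."""
    u, w = H.shape
    scores = H.T @ H / np.sqrt(u)                      # (w, w) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return H @ weights.T                               # reweighted features

rng = np.random.default_rng(3)
H = rng.standard_normal((8, 5))   # u = 8 hidden features, w = 5 time steps
H_att = attention(H)
```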

For the above reasons, the improved objective function $J_{t+1}(\theta)$ is

$$\min_{\theta} J_{t+1}(\theta) = -\log\left(r_t^\top h_{t+1}(y_{t+1}(\theta)) + \epsilon\right) + L\left(y_{t+1}^{t+k}(\theta)\right) + R(y_{t+1}(\theta)), \tag{22}$$

where $y_{t+1}(\theta) = \mathrm{vec}(Y_{t+1})$, $y_{t+1}(\theta) \in \mathbb{R}^{N^2}$, is a wealth flow vector predicted by the WFM neural network for trading period $t+1$. The vectorization operator $\mathrm{vec}$ rearranges the elements of a matrix $A$ into a column vector $a$ in the column-wise order of $A$. The loss function $L(y_{t+1}^{t+k}(\theta))$ is defined by





$$L\left(y_{t+1}^{t+k}(\theta)\right) = \mathbb{E}\left[\sum_{\tau=t+1}^{t+k} \log\left(r_\tau^\top x^*\right) - \sum_{\tau=t+1}^{t+k} \log\left(r_\tau^\top h_\tau(y_\tau(\theta))\right)\right], \tag{23}$$

which is the expected difference between the profit generated by using the best constant rebalanced portfolio strategy and that generated by the WFM in $k$ future trading periods. Since the relative price vectors $r_\tau$ follow some unknown distribution $p(r_\tau)$, we use the historic relative price vectors, which are stored in the replay buffer, as an empirical distribution of $p(r_\tau)$ so that the loss function $L(y_{t+1}^{t+k}(\theta))$ can be estimated.
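A rough sketch of this estimation (hypothetical helper names; the BCRP arg max is approximated by random simplex search rather than an exact solver, and `x_model` stands in for the portfolio induced by $y_{t+1}^{t+k}(\theta)$):

```python
import numpy as np

rng = np.random.default_rng(2)

def bcrp_log_return(window, n_samples=2000):
    """Approximate the best constant rebalanced portfolio's log-profit
    on a window via random simplex search (an exact arg max would need
    a convex solver)."""
    N = window.shape[1]
    samples = rng.dirichlet(np.ones(N), size=n_samples)   # (n_samples, N)
    return np.max(np.log(window @ samples.T).sum(axis=0))

def estimate_L(replay_buffer, x_model, k=5):
    """Empirical version of Equation (23): draw a length-k window of past
    relative prices from the replay buffer and compare the BCRP's
    log-profit with that of the model's portfolio x_model."""
    start = rng.integers(0, len(replay_buffer) - k + 1)
    window = replay_buffer[start:start + k]
    return bcrp_log_return(window) - np.sum(np.log(window @ x_model))

buffer = 1.0 + 0.05 * rng.standard_normal((200, 3))  # made-up price history
L_hat = estimate_L(buffer, np.ones(3) / 3)
```

Averaging `estimate_L` over many sampled windows would give a lower-variance estimate of the expectation in Equation (23).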

The regularization term $R(y_{t+1}(\theta))$ in the new objective function $J_{t+1}(\theta)$ is utilized for the short-term goal and the constraints defined in the objective function $f_{t+1}(Y)$. That is,

$$R(y_{t+1}(\theta)) = \sum_{i=1,\ldots,N^2} y_{t+1,i}(\theta) \log\frac{y_{t+1,i}(\theta)}{\hat{y}_{t,i}} + \Psi_1 \sum_{i=1,\ldots,N^2} [i \bmod (N+1) = 1]\, y_{t+1,i}(\theta) + \Psi_2 \|y_{t+1}\|_1, \tag{24}$$

where the first part is the Kullback–Leibler divergence between the current estimation $y_{t+1}$ and the sample $\hat{y}_t$; $[a]$ is an indicator function: it is 1 if its argument $a$ is true and 0 otherwise; and $\Psi_1, \Psi_2 \gg 0$ are large numbers for enforcing the constraint $\mathrm{Tr}(Y_{t+1}) = 0$ defined in Equation (4) and the sparsity of $y_{t+1}$, respectively. The $\ell_1$-norm of $y_{t+1}$ is used to measure the sparsity of $y_{t+1}$. If the constraint $\mathrm{Tr}(Y_{t+1}) = 0$ is approximately satisfied by the second part of Equation (24), the constraint $\|Y\|_1 = 1$ can be satisfied by applying the softmax function to the output $y_{t+1}(\theta)$.
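A direct numerical reading of Equation (24) can be sketched as follows (illustrative only; the weights $\Psi_1$, $\Psi_2$ and the sample $\hat{y}_t$ are made-up, and 0-based indexing shifts the diagonal test to $i \bmod (N+1) = 0$):

```python
import numpy as np

def regularizer(y, y_hat, N, psi1=1e3, psi2=1e-2, eps=1e-12):
    """R(y) = KL(y || y_hat) + psi1 * (sum of diagonal entries of Y)
    + psi2 * ||y||_1, with y = vec(Y) in column-wise order, so the
    diagonal of Y sits at 0-based indices i with i mod (N + 1) == 0."""
    kl = np.sum(y * np.log((y + eps) / (y_hat + eps)))
    idx = np.arange(N * N)
    trace_pen = psi1 * np.sum(y[idx % (N + 1) == 0])  # Tr(Y) penalty
    sparsity = psi2 * np.sum(np.abs(y))               # l1-norm penalty
    return kl + trace_pen + sparsity

N = 3
Y = np.array([[0.0, 0.2, 0.1],
              [0.3, 0.0, 0.1],
              [0.1, 0.2, 0.0]])      # made-up wealth flow matrix, Tr(Y) = 0
y = Y.flatten(order="F")             # column-wise vectorization vec(Y)
y_hat = np.full(N * N, 1.0 / (N * N))
R_val = regularizer(y, y_hat, N)
```

With a zero diagonal the trace penalty vanishes, so the large weight $\Psi_1$ only punishes candidates that violate the no-self-investment constraint.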

The data processing mechanism of the WFM neural network is given by the following series of tensor operations:

$$h_t^1 = W_1 r_t + b_1, \tag{25a}$$
$$H_t^{j+1} = \mathrm{lstm}_a(H_t^j), \tag{25b}$$
$$H_t^{j+1} = \mathrm{BatchNorm} \circ \mathrm{Dropout}(H_t^{j+1}), \quad j = 1, 2, \tag{25c}$$
$$y_t = W_y\, \mathrm{vec}\left(\left[H_t^1 e_w, H_t^2 e_w, H_t^3 e_w\right]\right), \tag{25d}$$

where the $j$th matrix $H^j = [h_{t-w+1}^j, \ldots, h_t^j]$ is the set of the last $w$ feature vectors $h_t^j \in \mathbb{R}^u$ sorted by their time steps $t$. Each feature vector $h_t$ is a high-dimensional representation of each relative price vector $r_t$, as shown in Equation (25a). There are three hidden layers for the three feature matrices $H^j$. Equations (25a) to (25c) are used for nonlinear processing, which maps a relative price vector $r_t$ to a feature vector $h_t^j$. Equation (25d) is a linear transformation that outputs a wealth flow

