International Journal of Information Management Data Insights 2 (2022) 100094

How are reinforcement learning and deep learning algorithms used for big
data based decision making in financial industries–A review and research
agenda
Vinay Singh a,b,e,∗, Shiuann-Shuoh Chen b, Minal Singhania a, Brijesh Nanavati c, Arpan Kumar Kar d, Agam Gupta d

a BASF SE, Pfalzgrafenstraße 1, 67061 Ludwigshafen am Rhein, Germany
b Department of Business Administration, National Central University, No. 300, Zhongda Road, Zhongli District, Taoyuan City 320, Taiwan
c BASF Services Europe GmbH, Rotherstrasse 11, 10245 Berlin, Germany
d Department of Management Studies and School of Artificial Intelligence, Indian Institute of Technology Delhi, New Delhi 110016, India
e Universität Siegen, Adolf-Reichwein-Straße 2, 57076 Siegen

Keywords: Big data; Markov decision process; Online learning; Reinforcement learning; Financial applications; Deep reinforcement learning

Abstract

Data availability and accessibility have brought in unseen changes in the finance systems and new theoretical and computational challenges. For example, in contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems that rely heavily on model assumptions, new developments from reinforcement learning (RL) can make full use of a large amount of financial data with fewer model assumptions and improve decisions in complex economic environments. This paper reviews the developments and use of Deep Learning (DL), RL, and Deep Reinforcement Learning (DRL) methods in information-based decision-making in financial industries. Therefore, it is necessary to understand the variety of learning methods, related terminology, and their applicability in the financial field. First, we introduce Markov decision processes, followed by various algorithms focusing on value- and policy-based methods that do not require any model assumptions. Next, connections are made with neural networks to extend the framework to encompass deep RL algorithms. Finally, the paper concludes by discussing the application of these RL and DRL algorithms in various decision-making problems in finance, including optimal execution, portfolio optimization, option pricing, hedging, and market-making. The survey results indicate that RL and DRL can provide better performance and higher efficiency than traditional algorithms while facing real economic problems in risk parameters and ever-increasing uncertainties. Moreover, it offers academics and practitioners insight and direction on the state-of-the-art application of deep learning models in finance.

∗ Corresponding author: Vinay Singh, BASF SE, Pfalzgrafenstraße 1, 67061 Ludwigshafen am Rhein, Germany.
E-mail address: vinaysingh.bvp@gmail.com (V. Singh).
https://doi.org/10.1016/j.jjimei.2022.100094
Received 8 February 2022; Received in revised form 1 June 2022; Accepted 19 June 2022
2667-0968/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

1. Introduction

Machine learning (ML) based applications have exploded in the past decade; almost everyone interacts with modern artificial intelligence many times every day. ML methods enable machines to conduct complex tasks such as detecting faces, understanding speech, recommending alternatives, targeted marketing, or finding anomalies in business processes (Sharma et al., 2021; Singh et al., 2022; Verma et al., 2021). However, financial institutions cannot comprehend medium and large enterprises' static and dynamic economic development business information (Theate & Ernst, 2021). As a result, they are not always aware of the decisions made by investors, who are not entirely rational. Therefore, machine learning methods generally decide based on known properties learned from the training data, using many principles and tools from statistics.

However, machine learning models aspire to a generalized predictive pattern. For example, most learning problems can be seen as optimizing a cost: minimizing a loss or maximizing a reward. Learning algorithms seek to optimize a criterion (loss, reward, regret) on both training and unseen samples (Khadjeh et al., 2014). Financial decision-making problems have traditionally been modeled using stochastic processes and methods arising from stochastic control. However, the availability of large amounts of financial data has revolutionized data processing and statistical modeling techniques in finance and brought new theoretical and computational challenges. In contrast, the classical stochastic control approach describes how agents acting within some system might learn to make optimal decisions through repeated experience gained by interacting with the system.
RL is a robust mathematical framework in which the agent interacts directly with the environment. It is experience-driven autonomous learning: the agent enhances its efficiency by trial and error to optimize the cumulative reward, and it does not require labeled data to do so. For autonomous learning, policy search and value function approximation are vital tools. RL applies a gradient-based or gradient-free approach to detect an optimal stochastic policy for continuous and discrete state-action settings (Sutton and Barto, 2014). When considering an economic problem, in contrast to traditional approaches, reinforcement learning methods prevent suboptimal performance by imposing significant market constraints, which leads to finding an optimal strategy in terms of market analysis and forecast. Despite RL's successes in recent years, these results lack scalability and cannot manage high-dimensional problems. By combining RL and DL methods, the DRL technique, in which DL supplies vigorous function approximation, the representation learning properties of deep neural networks (DNN), and the handling of complex and nonlinear patterns of economic data, can efficiently overcome these problems.
This study contributes to the literature in the following ways. First, we systematically review the state-of-the-art applications of reinforcement learning in the field of finance. Second, we summarize multiple RL models for specified finance domains and identify the optimal RL model for each application scenario. The study uses the data processing method as the basis for analysis. Third, our review attempts to bridge the technological and application levels of RL and finance, respectively. Fourth, we identify the features of various RL models and highlight their feasibility for different finance domains. This will help researchers assess the feasibility of RL models for a specified financial field. Due to the lack of connections between core economic domains and numerous RL models, practitioners usually face difficulties; this study fills that literature gap and guides financial analysts. We are guided by the following research question(s):

RQ1. How and in which areas of finance can we use RL and DRL models to make better-informed decisions?
RQ2. How can the existing DL, RL, and DRL approaches, based on their application, be classified for better understanding and application?
RQ3. What are the open issues and challenges in current deep reinforcement learning models for information-based decision-making in finance?

The rest of this paper is organized as follows. Section 2 provides a background of DL, RL, and DRL techniques. Section 3 introduces our research framework and methodology. Section 4 analyzes the established RL models. Section 5 examines deep reinforcement learning methods. Section 6 captures a review of existing financial applications using various RL, DL, and DRL methods. Section 7 discusses the influencing factors in the performance of financial RL models. Finally, Section 8 concludes and outlines the scope for promising future studies.

2. Preliminary discussion on terms related to reinforcement learning

Since we aim to use RL for our study, we will first discuss terms from RL and understand them in the context of finance. We will emphasize the idea of exploration and exploitation and the critical role it plays. We will conclude the section with a clear understanding of reinforcement learning and deep reinforcement learning and of how deep reinforcement learning can complement traditional RL.

2.1. Markov decision process

The Markov decision process (MDP) is the classical formalisation of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those future rewards (Puterman, 2014; Van Otterlo & Wiering, 2012). It involves delayed rewards and the need to trade off immediate and delayed rewards.

2.1.1. Introduction to Markov decision processes

MDPs are mathematically idealized forms for decision-theoretic planning (DTP), reinforcement learning, and other learning problems in stochastic domains for which precise theoretical statements can be made (Sutton and Barto, 2016).

MDPs are meant to be a precise framing of learning from interaction to achieve a goal. The learner and decision-maker is the agent (Van Otterlo & Wiering, 2012). The thing it interacts with, comprising everything outside the agent, is called the environment (Van Otterlo & Wiering, 2012). These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions. MDPs consist of states, actions, transitions between states, and a reward function (Puterman, 2014). Figure 1 shows the agent-environment interaction in an MDP.

[Fig. 1. Agent-environment interaction in an MDP (adopted from Sutton and Barto, 2016).]

2.1.2. States

A state is a unique characterization of all that is important in a state of the modeled problem (Van Otterlo & Wiering, 2012). The state space can be defined as the finite set {s1, . . ., sN}, where the size of the state space is N, i.e., |S| = N. For instance, in the tic-tac-toe game (see Fig. 2), a state could be a tuple of 9 components corresponding to the cells of the game's board. Each component may contain 0, 1, or 2 (or any other three different symbols) depending on whether the cell is empty or occupied by one of the players. In our research, we confine ourselves to the discrete state set S in which each state is represented by a distinct symbol and all states s ∈ S are legal. Figure 2 shows a typical state in a tic-tac-toe game with possible actions.

[Fig. 2. Representation of state and possible action (one of the cases) in the game of tic-tac-toe.]
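To make the state definition concrete, the following is a minimal sketch (our illustration, not from the paper) of encoding a tic-tac-toe board as a discrete state: a tuple of nine cells, each holding 0 (empty), 1, or 2, together with the set of legal actions (empty cells) available in that state. Representing states as hashable tuples like this makes it easy to use them later as keys of a tabular value function.

```python
from itertools import product

EMPTY, PLAYER_1, PLAYER_2 = 0, 1, 2

# A state is a tuple of 9 cells, read row by row (cells 0-8).
state = (1, 0, 0,
         2, 1, 0,
         0, 0, 2)

def legal_actions(state):
    """Actions A(s): indices of the empty cells that can still be played."""
    return [cell for cell, mark in enumerate(state) if mark == EMPTY]

def apply_action(state, cell, player):
    """Deterministic transition: place `player`'s mark and return the new state."""
    board = list(state)
    board[cell] = player
    return tuple(board)

# An upper bound on the discrete state space: every cell takes one of 3 symbols,
# so |S| <= 3**9 = 19,683 (not all of these are reachable in a real game).
print(len(list(product((EMPTY, PLAYER_1, PLAYER_2), repeat=9))))  # 19683
print(legal_actions(state))            # e.g. [1, 2, 5, 6, 7]
print(apply_action(state, 1, PLAYER_2))
```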


2.1.3. Action

Actions can be used to control the system state (Puterman, 2014). The set of actions that can be applied in some state s ∈ S is denoted A(s), where A(s) ⊆ A. The set of actions A is defined as the finite set {a1, . . ., aK}, where the size of the action space is K, i.e., |A| = K. Though not all actions can be applied in every state, in general we will assume that A(s) = A for all s ∈ S.

2.1.4. Transition function

By applying an action a ∈ A in a state s ∈ S, the system transitions from s to a new state s′ ∈ S, based on a probability distribution over the set of possible transitions (Puterman, 2014). The transition function T is defined as:

T : S × A × S → [0, 1] (1)

i.e., the probability of ending up in state s′ after doing action a in state s is denoted T(s, a, s′). It is required that for all actions a and all states s and s′: T(s, a, s′) ≥ 0 and T(s, a, s′) ≤ 1. Furthermore, for all states s and actions a,

Σ_{s′∈S} T(s, a, s′) = 1 (2)

i.e., T defines a proper probability distribution over the possible next states. To talk about the order in which actions occur, one defines a discrete global clock, t ∈ {0, 1, 2, …}, so that s_t and a_t denote the state and action at time t, respectively. The system being controlled is Markovian if the result of an action does not depend on the previously executed actions and visited states but only on the current state (Otterlo & Wiering, 2012):

P(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …) = P(s_{t+1} | s_t, a_t) = T(s_t, a_t, s_{t+1}) (3)

2.1.5. Reward function

Rewards are used to determine how the MDP system should be controlled (Van Otterlo & Wiering, 2012). The agent should control the system by taking actions that lead to more (positive) rewards over time. For example, in classical optimization problems, we try to find cost functions and aim to maximize or minimize them under certain conditions (Sutton and Barto, 2016). In RL, the agent's goal is formalized by a special signal (the reward) passing from the environment to the agent. A reward function can be defined as:

R : S × A × S → ℝ (4)

2.1.6. Markov decision process

A Markov decision process (MDP) is a tuple (S, A, T, R) in which S is a finite set of states, A is a finite set of actions, T a transition probability function T : S × A × S → [0, 1], and R a reward function R : S × A × S → ℝ. Let us take the example of moving a pawn to a destination on a grid (as shown in Fig. 3 below):

[Fig. 3. Moving a pawn to a destination on a grid.]

The Markov decision process for the above figure can be stated as:

• States S = {s0, s1, s2, . . ., s7}
• Actions A = {up, down, left, right}, depending on the current state s
• Transition probabilities = {T(s0, up, s3) = 0.9; T(s0, right, s1) = 0.1; . . .}
• Rewards = {R(s6, right, s3) = +10; R(s2, up, s4) = −10; otherwise, R = 0}
• Start in s0
• Game over when reaching s7

With the above definition of an MDP, we can model several distinct systems, such as episodic and continuing tasks. First, there is the notion of episodes of some length, where the goal is to take the agent from some starting state to a goal state. In these cases, one can distinguish between fixed horizon tasks, in which each episode consists of a fixed number of steps, and indefinite horizon tasks, which do end but whose episodes can have an arbitrary length (Otterlo & Wiering, 2012). An example of the former type would be tic-tac-toe, and of the latter, the chess board game. Another example of an indefinite horizon task would be a robot trying to learn how to run or walk. In the beginning, it will fall early (so that episodes will be short), but if it is learning properly, at some point it will be able to run or walk forever. The system's initial state is initialized with some initial state distribution in each episode:

I : S → [0, 1] (5)

In continuing or infinite horizon tasks, the system does not end unless done on purpose (theoretically, it never ends). A possible example would be software dedicated to sustaining the energy supply for a city by controlling the power plants (Sutton and Barto, 2016).

2.1.7. Policies

Given an MDP (S, A, T, R), a policy is a computable function that outputs for each state s ∈ S an action a ∈ A (or a ∈ A(s)). Formally, a deterministic policy 𝜋 is a function defined as 𝜋 : S → A. It is also possible to define a stochastic policy as 𝜋 : S × A → [0, 1] such that for each state s ∈ S, it holds that 𝜋(s, a) ≥ 0 and Σ_{a∈A} 𝜋(s, a) = 1. A policy 𝜋 can be used to make evolve, i.e., to control, an MDP system in the following way:

• Starting from an initial state s0 ∈ S, the next action the agent will take is a0 = 𝜋(s0).
• After the action is performed by the agent, according to the transition probability function T and the reward function R, a transition is made from s0 to some state s1 with probability T(s0, a0, s1), and a reward r0 = R(s0, a0, s1) is obtained.
• By iterating this process, one obtains a sequence s0, a0, r0, s1, a1, r1, … of state-action-reward triples over S × A × ℝ, which constitutes a trajectory (or path) of the MDP.

The policy is part of the agent, whose aim is to control the environment modelled as an MDP. MDPs provide a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker, which could be very useful for studying optimization problems solved via dynamic programming in finance.
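As an illustration of the grid MDP of Fig. 3 and the policy-controlled rollout just described, here is a minimal sketch (ours, not the authors'; the transition probabilities, rewards, and policy below are illustrative assumptions, since the paper lists only a few of the grid's values) that samples one trajectory s0, a0, r0, s1, … under a fixed deterministic policy until the terminal state s7 is reached.

```python
import random

# A toy MDP in the (S, A, T, R) form of Section 2.1.6. The numbers below are
# illustrative assumptions, not the exact dynamics of the grid in Fig. 3.
T = {  # T[s][a] = list of (next_state, probability)
    "s0": {"up":    [("s3", 0.9), ("s1", 0.1)],
           "right": [("s1", 0.9), ("s3", 0.1)]},
    "s1": {"up":    [("s4", 1.0)]},
    "s3": {"right": [("s4", 1.0)]},
    "s4": {"right": [("s7", 1.0)]},
    "s7": {},                       # terminal: game over when reaching s7
}
R = {("s4", "right", "s7"): +10.0}  # all other rewards default to 0

def policy(state):
    """A fixed deterministic policy pi: S -> A (an assumption for the demo)."""
    return {"s0": "up", "s1": "up", "s3": "right", "s4": "right"}[state]

def step(state, action):
    """Sample s' ~ T(state, action, .) and return (s', r)."""
    next_states, probs = zip(*T[state][action])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R.get((state, action, s_next), 0.0)

def rollout(start="s0"):
    """Control the MDP with the policy and record the trajectory s0, a0, r0, s1, ..."""
    s, trajectory = start, []
    while T[s]:                     # stop in the terminal state s7
        a = policy(s)
        s_next, r = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

print(rollout())  # e.g. [('s0','up',0.0), ('s3','right',0.0), ('s4','right',10.0)]
```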


2.2. Exploration versus exploitation

To maximize the accumulated reward over time, the agent learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration). Exploration provides opportunities to improve performance from the current sub-optimal solution towards the ultimate globally optimal one. Yet it is time-consuming and computationally expensive, as over-exploration may impair the convergence to the optimal solution. Meanwhile, pure exploitation, i.e., myopically picking the current solution based solely on experience, though easy to implement, yields sub-optimal global solutions. Therefore, an appropriate trade-off between exploration and exploitation is crucial in designing RL algorithms to improve learning and optimization performance. Bergemann & Välimäki (1996) provided a friendly economic application of the exploration-exploitation dilemma. In this model, the actual value of each seller's product to the buyer is initially unknown, but additional information can be gained by experimentation. When prices are assumed to be given exogenously, the buyer's problem is a common multi-armed bandit problem.

The exploration-exploitation dilemma is a very general problem that can be encountered in most data-driven decision-making processes whenever there is some feedback loop between data gathering and decision-making. An efficient approach must address this dilemma.

2.3. Classification of reinforcement learning algorithms

Broadly, there are two main branching points in an RL algorithm – whether the agent has access to (or learns) a model of the environment, and what to learn. Based on these two aspects, an RL algorithm will include one (or more) of the following components (Hambly et al., 2021):

• a value function that provides a prediction of how good each state (or state-action pair) is,
• a representation of the policy,
• a model of the environment.

The first two components are related to model-free RL, whereas when a model of the environment is available, the algorithm is referred to as model-based RL. In the MDP setting, model-based algorithms maintain an approximate MDP model by estimating the transition probabilities and the reward function and deriving a value function from the approximate MDP. The policy is then derived from the value function. Examples include (Theate et al., 2021; Sutton and Barto, 2016; Jiang & Liang, 2017; Jiang et al., 2017). Another line of model-based algorithms makes structural assumptions on the model, using prior knowledge and utilizing the structural information for algorithm design.

Unlike model-based methods, model-free algorithms directly learn a value (or state-value) function or the optimal policy without inferring the model. Model-free algorithms can be further divided into two categories: value-based methods and policy-based methods. Policy-based methods explicitly build a policy representation and keep it in memory during learning. Examples include policy gradient methods and trust-region policy optimization methods. As an alternative, value-based methods store only a value function without an explicit policy during the learning process (Singh and Sharma, 2022). Here, the policy is implicit and can be derived directly by picking the action with the best value.

Based on the problem, one can choose a model-based or model-free approach. Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.

2.4. Optimal policies, value functions and Bellman equations

To solve a reinforcement learning problem, we need to find a policy that gathers a lot of reward over a period. So, defining a model of optimality is a necessary step in this regard. The approach has two aspects: the first is the agent's goal, i.e., what is being optimized, and the second is how optimally the goal is being pursued. The first aspect focuses on gathering rewards, while the second is related to the efficiency and optimality of algorithms.

There are three models of optimality in the MDP, related to the types of tasks at hand – finite horizon, discounted (infinite horizon), and average rewards (Van Otterlo and Wiering, 2016). In our study, we will focus on the discounted average reward criteria for finite and infinite horizons (Koenig et al., 2019). The discounted sum of rewards obtained by the agent can be defined as:

R_t^γ = Σ_{k=0}^{T−t} γ^k R_{t+k} = R_t + γ R_{t+1} + γ² R_{t+2} + ⋯ + γ^{T−t} R_T (6)

where T < ∞ for episodic tasks and T = ∞ for continuing tasks, and γ ∈ [0, 1] is a discount factor for future rewards (γ < 1 if the task is continuing).

One should note that γ plays an important role in how future rewards are weighted. The agent will be myopic if γ = 0 (considering γ⁰ = 1), because the only reward (when we put γ = 0 in the above equation) would be the present reward, whereas if γ ≈ 1 the agent will look for long-term rewards. For episodic tasks with γ = 1, R_t^γ is the sum of the rewards at each step of an episode. The goal of the discounted average reward criteria in the context of an MDP is to find a policy 𝜋∗ that maximizes the expected return E_𝜋[R_t^γ]. The expectation E_𝜋 denotes the expected value when the MDP is controlled by policy 𝜋. To link the optimality criteria to policies, we use value functions. A value function is an estimate of how good it is for the agent to be in a certain state (state value function, V^𝜋) or to take a certain action when in some state (action-value function, Q^𝜋). The state value function V^𝜋(s) of an MDP is the expected reward starting from state s and then following policy 𝜋 (Otterlo & Wiering, 2012). Considering the discounted average reward, it can be represented as:

V^𝜋(s) = E_𝜋[R_t^γ | S_t = s] = E_𝜋[Σ_{k=0}^{T−t} γ^k R_{t+k} | S_t = s] (7)

where T < ∞ for episodic tasks and T = ∞ for continuing tasks.

The state-action value function Q^𝜋(s, a) is the expected reward starting from state s, taking action a, and then following policy 𝜋. Considering the discounted average reward, it can be represented as:

Q^𝜋(s, a) = E_𝜋[R_t^γ | S_t = s, A_t = a] = E_𝜋[Σ_{k=0}^{T−t} γ^k R_{t+k} | S_t = s, A_t = a] (8)

where T < ∞ for episodic tasks and T = ∞ for continuing tasks.

While 𝜋 can be any policy, 𝜋∗ denotes the optimal one, with the highest expected cumulative reward, i.e., V^{𝜋∗}(s) ≥ V^𝜋(s) for all s ∈ S and all policies 𝜋. So,

V∗(s) = max_𝜋 V^𝜋(s) (9)

Similarly, the optimal Q-value Q∗ is:

Q∗(s, a) = max_𝜋 Q^𝜋(s, a) (10)

One fundamental property of value functions is that they satisfy certain recursive properties. For any policy 𝜋 and any MDP (S, A, T, R), the value functions (V^𝜋 and Q^𝜋) can be recursively defined in terms of the so-called Bellman equation (Bellman, 2003):

V^𝜋(s) = E_𝜋[R_t + γ V^𝜋(S_{t+1}) | S_t = s] (11)

Q^𝜋(s, a) = E_𝜋[R_t + γ Q^𝜋(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] (12)

for all s ∈ S and a ∈ A.

Bellman equations are used to derive algorithms such as Q-learning, which we have also used in solving the optimization problem.
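Since the Bellman equation underlies algorithms such as Q-learning, the following is a minimal, self-contained sketch (our illustration, not the authors' implementation) of a tabular Q-learning update, Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)], run on a toy chain environment whose dynamics are pure assumptions for the demo. It also uses the ε-greedy rule from Section 2.2 to balance exploration and exploitation.

```python
import random
from collections import defaultdict

# Toy deterministic chain environment (an assumption for this demo):
# states 0..4, actions -1 (left) / +1 (right); reaching state 4 pays +10 and ends.
N_STATES, GOAL, ACTIONS = 5, 4, (-1, +1)

def step(s, a):
    s_next = min(max(s + a, 0), GOAL)
    reward = 10.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

alpha, gamma, epsilon = 0.1, 0.9, 0.1       # learning rate, discount, exploration
Q = defaultdict(float)                      # Q[(state, action)], implicitly 0.0

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy (exploration vs. exploitation)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # Bellman-based Q-learning update (cf. Eqs. (11)-(12))
        target = r + (0.0 if done else gamma * max(Q[(s_next, act)] for act in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

# The greedy policy should now choose +1 (move right) in every non-goal state.
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)])
```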


2.5. Generalized policy iteration

Generalized Policy Iteration is the idea of alternating between any method for Policy Evaluation and any method for Policy Improvement, including methods that are partial applications of Policy Evaluation or Policy Improvement. This generalized perspective unifies almost all algorithms that solve MDP control problems (Sutton and Barto, 2016). Generalized Policy Iteration says that we can evaluate the value function for a policy with any Policy Evaluation method, and we can improve a policy with any Policy Improvement method, not necessarily the classical Policy Iteration DP algorithm. We want to emphasize that neither Policy Evaluation nor Policy Improvement needs to go entirely towards the notion of consistency they are striving for. As a simple example, think of modifying Policy Evaluation (say for a policy 𝜋) not to go all the way to V^𝜋 but instead to perform, say, 3 Bellman Policy Evaluations. This means it would partially bridge the gap on the first notion of consistency (getting closer to V^𝜋 but not going all the way to V^𝜋), but it would also mean not slipping up too much on the second notion of consistency. As another example, think of updating just 5 of the states (say in a large state space) with the Greedy Policy Improvement function (rather than the normal Greedy Policy Improvement function that operates on all the states). This means it would partially bridge the gap on the second notion of consistency (getting closer to G(V^𝜋) but not going all the way to G(V^𝜋)), but it would also mean not slipping up too much on the first notion of consistency. A concrete example of Generalized Policy Iteration is Value Iteration. In Value Iteration, we apply the Bellman Policy Operator just once before moving on to Policy Improvement. Almost all control algorithms in Reinforcement Learning can be viewed as special cases of Generalized Policy Iteration. In some simple versions of Reinforcement Learning control algorithms, the Policy Evaluation step is done for just a single state (versus all states in usual Policy Iteration, or even in Value Iteration), and the Policy Improvement step is also done for just a single state.

So, these Reinforcement Learning control algorithms are essentially an alternating sequence of single-state policy evaluation and single-state policy improvement (where the single state is the state produced by sampling or the state encountered in a real-world environment interaction).

2.6. Reinforcement learning and deep reinforcement learning

In a reinforcement learning system, input and output pairs are not provided. Instead, the system's current state is given, along with a specific goal, a set of allowable actions, and environmental constraints on their outcomes. The agent interacts with the environment through trial and error and learns to maximize the cumulative reward. Reinforcement learning models have been applied successfully in closed-world environments such as games (Silver et al., 2018). Still, they are also relevant for multi-agent systems such as electronic markets. Popular RL algorithms use functions Q(s, a) or V(s) to estimate the sum of discounted rewards. As the function is defined by a tabular mapping of discrete inputs and outputs, this is limiting for continuous states or many states. However, a neural network can estimate the states (Chen et al., 2021). Deep reinforcement learning uses function approximation instead of tabular estimation of the state value (reference). Function approximation eliminates the need to store all state and value pairs in a table and enables the agent to generalize the value of states it has never seen before, or has only partial information about, by using the values of similar states (reference). So RL dynamically learns with trial-and-error methods to maximize the rewards, while deep reinforcement learning also learns from existing knowledge and applies it to a new data set.

3. Methodology

3.1. Research methodology

The review adopts the PRISMA standard to identify and review the RL and DRL methods used in finance. The PRISMA method includes four steps (Moher et al., 2015): identification, screening, eligibility, and inclusion. From Web of Science and Elsevier Scopus, 460 of the most relevant articles published between 2014 and November 2021 were identified. The screening step includes two sub-steps. First, duplicate articles are eliminated, resulting in 260 unique articles that move on to an examination of relevance based on title and abstract. As a result, we got 86 articles for further consideration. Then, as the next step in PRISMA, we checked eligibility by reading the full text of these articles, and finally we had 64 of them considered for final review in this study. As the final step of the PRISMA model, we created the study database, which we have used for the quantitative and qualitative analyses. Our research database comprises 64 articles, and all our analyses and findings are based on these articles. Fig. 4 shows the steps in creating our research database based on the PRISMA model.

[Fig. 4. Methodology for systematic selection of articles for the review.]

3.2. Taxonomy of widely applied RL and DRL in finance

Broadly, there are two main branching points in an RL algorithm – whether the agent has access to (or learns) a model of the environment, and what to learn. Based on these two questions, Figure 5 shows a taxonomy of algorithms in modern RL (Lillicrap et al., 2015; Mnih et al., 2016; Schulman et al., 2015; Silver et al., 2018; Fujimoto et al., 2018; Haarnoja et al., 2018; Ha & Schmidhuber, 2018).

[Fig. 5. Select taxonomy of dominant algorithms in modern RL widely used in finance applications (combining Lillicrap et al., 2015; Mnih et al., 2016; Schulman et al., 2017; Silver et al., 2017; Fujimoto et al., 2018; Ha & Schmidhuber, 2018; Haarnoja et al., 2018).]

4. Discussion based on reinforcement learning approaches and algorithms

4.1. RL approaches

4.1.1. Critic-only approach

The critic-only approach is the most frequent application of RL in finance (reference). This approach aims to learn a value function based on which the agent can compare ("criticize") the expected outcomes of different actions. Then, during the decision-making process, the agent senses the current state of the environment and selects the action with the best outcome according to the value function. The reward function in the critic-only approach does not need to be differentiable and is highly flexible, making it applicable to a comprehensive set of problems. Furthermore, this property allows modeling complex reward schemes with several if-else branches.

Moreover, the preference between immediate and future rewards can be carefully controlled due to the explicit use of a discount factor. However, the most noticeable limitation is the agent's discrete action space, closely related to Bellman's "curse of dimensionality" (Bellman, 1957). Some researchers, like Brown (2000), find the approach to be "brittle" in the presence of noise or not to converge under certain conditions. In the future, efforts to enrich the state with other data sources, and whether immediate or terminal rewards (one reward at the end) perform better and under what circumstances, should be explored.

4.1.2. Actor-only approach

In this approach, the agent senses the state of the environment and acts directly, i.e., without computing and comparing the expected outcomes of different actions. Hence, the agent learns a direct mapping (a policy) from states to actions. The main advantages are a continuous action space, faster convergence, and higher transparency. Having continuous actions, the agent can carefully interact with the environment, for example, to gradually increase an investment. Moreover, using multiple output neurons combined with a SoftMax activation function, a portfolio consisting of several assets can be managed simultaneously.

On the other hand, the most noticeable disadvantage of the actor-only approach is the need for a differentiable reward function, limiting the reward schemes that can be modeled. However, the impact of different network architectures (e.g., type and number of hidden layers) for deep RRL agents and the effect of varying reward functions for multi-security portfolios need to be further explored. Furthermore, the value of non-price-based information, e.g., sentiment data, could be analyzed.
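To make the actor-only idea concrete, below is a minimal sketch (our own, with made-up weights and inputs) of a policy whose SoftMax output layer maps a state vector directly to portfolio weights over several assets, as described above; no value function is computed. In an actor-only (recurrent-RL-style) trader, the parameters would be adjusted by gradient ascent on a differentiable reward such as the (differential) Sharpe ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Actor-only policy: state features -> portfolio weights (one output neuron per asset).
# The single linear layer and its random weights are assumptions for this sketch.
n_features, n_assets = 6, 4
W = rng.normal(scale=0.1, size=(n_assets, n_features))
b = np.zeros(n_assets)

def act(state):
    """Direct mapping pi(s): returns non-negative weights that sum to 1."""
    return softmax(W @ state + b)

state = rng.normal(size=n_features)    # e.g. recent returns, holdings, indicators
weights = act(state)
print(weights, weights.sum())          # allocation over 4 assets, sums to 1.0
```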


4.1.3. Actor-critic approach

Actor-critic RL combines actor-only and critic-only RL (Yang and Xie, 2019). As its name suggests, actor-critic RL comprises two agents, the actor and the critic. The actor determines the actions and forms the policy of the system: it receives the current state as input at every step and computes the agent's action as output. The critic evaluates these actions: it gets the current state and the actor's action as input and computes the discounted future reward as output. The key idea is to gradually adjust the actor's policy parameters to maximize the reward predicted by the critic. Despite the ambition to combine the advantages of both agents, only a few studies employ actor-critic RL in financial markets. These works include an RL agent that improves the forecast of stock returns obtained by an Elman network and a combination of actor-critic RL with fuzzy logic. Unfortunately, neither of these agents is comparable to the critic-only and actor-only agents discussed above. Future research could develop an actor-critic agent whose actions resemble trading decisions. Based on that, it should be analyzed whether the ambition of combining the advantages of actor-only and critic-only RL can be realized. However, in the absence of a large-scale comparative study, the actor-only approach has been the most widely used RL approach in finance (Levy, Platt, and Saenko, 2019; Roder et al., 2020). The main reasons are, first, its continuous actions; second, its usually small number of parameters (which makes it less prone to overfitting); and third, its good convergence behavior (which results in faster training) and its recent improvements with deep learning techniques.
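For readers who want to see the mechanics, here is a compact, self-contained sketch (our illustration on a toy two-state problem, not a financial agent) of the actor-critic loop described above: the critic learns a state value via a TD error, and the same TD error nudges the actor's action preferences.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-state, 2-action environment (a pure assumption for the demo):
# action 1 pays about +1 on average, action 0 pays 0; each step is one transition.
def env_step(state, action):
    reward = 1.0 + 0.1 * rng.standard_normal() if action == 1 else 0.0
    next_state = int(rng.integers(2))
    return next_state, reward

n_states, n_actions = 2, 2
V = np.zeros(n_states)                    # critic: state values
prefs = np.zeros((n_states, n_actions))   # actor: action preferences (logits)
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.9

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

state = 0
for t in range(2000):
    probs = softmax(prefs[state])
    action = int(rng.choice(n_actions, p=probs))
    next_state, reward = env_step(state, action)

    # TD error: how much better/worse the outcome was than the critic expected.
    td_error = reward + gamma * V[next_state] - V[state]

    V[state] += alpha_v * td_error                 # critic update
    grad = -probs
    grad[action] += 1.0                            # d log pi(a|s) / d prefs[state]
    prefs[state] += alpha_pi * td_error * grad     # actor update

    state = next_state

print(softmax(prefs[0]), softmax(prefs[1]))  # both states should now favour action 1
```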


4.2. RL algorithms

Dynamic programming and approximate dynamic programming algorithms are widely used to solve the problems of prediction and control (Puterman, 2014) under the assumption that the agent has access to a model of the MDP environment; however, in real-world scenarios this does not hold. We often need to access the actual MDP environment directly, if not always. One can note that the real MDP environment does not give agents transition probabilities; it simply serves up a new state and reward when the agent acts in a specific state. In other words, it gives the agent individual experiences of the next state and reward, rather than the actual probabilities of occurrence of the next states and rewards. So, what remains to be answered is whether we can infer the optimal value function/optimal policy without access to a model. This can be achieved with reinforcement learning algorithms. RL overcomes complexity, specifically the curse of dimensionality and the curse of modeling (Avella et al., 2014).

Most RL algorithms are founded on the Bellman equations, and all RL control algorithms are based on the fundamental idea of Generalized Policy Iteration (Liu et al., 2015). But the exact ways in which the Bellman equations and the Generalized Policy Iteration idea are utilized differ from one RL algorithm to another, and they vary significantly from the Bellman equations/Generalized Policy Iteration idea used in DP algorithms.

4.2.1. Monte Carlo

In the context of RL, Monte Carlo refers to developing estimates of values without invoking the Bellman equations for policy evaluation or optimization, where both sides of the equations contain unknowns; rather, it uses direct estimates (Gobillon & Magnac, 2016). As a result, the Monte Carlo method's accuracy is entirely independent of the size of the state space. Hence, it forms the basis for policy validation and hyperparameter tuning in state-of-the-art empirical reinforcement learning research. However, Monte Carlo policy evaluation places requirements on the data collection protocol: essentially, the policy used to collect data and the policy being evaluated should be the same, implying the algorithm is on-policy (Mavrotas & Makryvelios, 2021). When such assumptions fail, especially when the data collection policy is different from the evaluated policy, we face an off-policy evaluation problem. The extension of Monte Carlo methods still benefits from the independence of the state space size, but the dependence on the horizon is exponential (Li et al., 2020).

4.2.2. Temporal difference

The temporal difference (TD) algorithm estimates and updates the value of the current state using the values of its adjacent states and a reward function (Ng et al., 2018). Therefore, it is a model-free algorithm that effectively combines the strengths of the Monte Carlo Tree Search and dynamic programming reinforcement learning algorithms. In TD reinforcement learning, an agent is placed in an interactive environment where each action generates a new state. The environment responds by returning a reward value based on its reward mechanisms. Like all other RL algorithms, the TD algorithm's goal is to maximize the cumulative reward. It achieves this by making continuous adjustments to the strategy; these adjustments are based on the returned reward values (Li et al., 2019). However, unlike other reinforcement learning approaches, the TD algorithm uses a sampling-learning process, i.e., it uses the value of the current action and its immediately adjacent states to estimate and update the value of the current state. Hence, the current model is updated immediately after a sample is obtained. This iterative process continues until the model converges.

4.2.3. Policy gradient algorithms

PG algorithms are practical in large, high-dimensional, or continuous action spaces, because in such spaces selecting an action by deriving an improved policy from an updating Q-value function is intractable. Furthermore, a key advantage of PG is that it naturally explores, because the policy function approximation is configured as a stochastic policy. Moreover, PG finds the best stochastic policy. This is not a factor for MDPs, since we know an optimal deterministic policy exists for any MDP; still, we often deal with partially observable MDPs in the real world, for which the set of optimal policies might all be stochastic policies. We have an advantage in the case of MDPs, since PG algorithms naturally converge to the deterministic policy (the variance in the policy distribution will automatically converge to 0). In contrast, we must reduce the ε of the ε-greedy policy by hand in value function-based algorithms, and the appropriate declining trajectory of ε is typically hard to figure out by manual tuning. In situations where the policy function is simpler than the value function, we naturally benefit from pursuing policy-based rather than value function-based algorithms. Perhaps the most significant advantage of PG algorithms is that prior knowledge of the functional form of the optimal policy enables us to structure the known functional form in the function approximation for the policy. Lastly, PG offers numerical benefits, as small changes in θ yield small changes in 𝜋, and consequently small changes in the distribution of occurrences of states. This results in stronger convergence guarantees for PG algorithms relative to value function-based algorithms.

The main disadvantage of PG algorithms is that they typically converge to a local optimum because they are based on gradient ascent, whereas value function-based algorithms converge to a global optimum. Furthermore, the policy evaluation of PG is typically inefficient and can have high variance. Lastly, the policy improvements of PG happen in small steps, so PG algorithms are slow to converge.

5. Deep reinforcement learning

5.1. Basics of neural networks

A typical neural network is a nonlinear function represented by a collection of neurons. These are typically arranged as several layers connected by operators such as filters, poolings, and gates, that map an input variable in ℝ^n to an output variable in ℝ^m for some n, m ∈ ℤ+ (Lim et al., 2021).

The basic architecture of the perceptron is shown in Figure 6. Consider a situation where each training instance is of the form (X, y), where X = [x1, …, xd] contains d feature variables and y ∈ {−1, +1} contains the observed value of the binary class variable. By "observed value" we refer to the fact that it is given to us as a part of the training data, and our goal is to predict the class variable for cases in which it is not observed (Li et al., 2017).

[Fig. 6. Basic architecture of the perceptron.]

The input layer contains d nodes that transmit the d features X = [x1, …, xd] with edges of weight W = [w1, …, wd] to an output node. The input layer does not perform any computation. The linear function W · X = Σ_{i=1}^{d} w_i x_i is computed at the output node. Subsequently, the sign of this real value is used to predict the dependent variable of X. Therefore, the prediction ŷ is computed as follows:

ŷ = sign{W · X} = sign{Σ_{j=1}^{d} w_j x_j} (13)

The three most popular neural network architectures are fully connected neural networks (FNN), convolutional neural networks (CNN), and recurrent neural networks (RNN).

A fully connected neural network (FNN) is the simplest neural network architecture, where any given neuron is connected to all neurons in the previous layer (Chen et al., 2021). To describe the setup, we fix the number of layers I ∈ ℤ+ in the neural network and the width of the i-th layer n_i ∈ ℤ+ for i = 1, 2, …, I. Then, for an input variable z ∈ ℝ^n, the functional form of the FNN is

F(z; W, b) = W_I · σ(W_{I−1} ⋯ σ(W_1 z + b_1) ⋯ + b_{I−1}) + b_I (14)

where (W, b) represents all the parameters in the neural network, with W = (W_1, W_2, W_3, …, W_I) and b = (b_1, b_2, b_3, …, b_I). In the neural network literature, the W_i's are often called the weight matrices, the b_i's are called bias vectors, and σ is referred to as the activation function. Several popular choices for the activation function include ReLU with σ(u) = max(u, 0), Leaky ReLU with σ(u) = a_1 max(u, 0) − a_2 max(−u, 0) where a_1, a_2 > 0, and smooth functions such as σ(·) = tanh(·).


Convolutional neural networks (CNNs) are another type of feed-forward neural network that is especially popular for image processing. In the finance setting, CNNs have been successfully applied to price prediction based on inputs which are images containing visualizations of price dynamics and trading volumes (Jiang et al., 2020). CNNs have two main building blocks – convolutional layers and pooling layers. Convolutional layers are used to capture local patterns in the images, and pooling layers are used to reduce the dimension of the problem and improve computational efficiency. A simple CNN can be represented as:

F(z; H, W, b) = W · σ(z ∗ H) + b (15)

Recurrent neural networks (RNNs) are a family of neural networks that are widely used in processing sequential data, including speech, text, and financial time series data. Unlike feed-forward neural networks, RNNs are a class of artificial neural networks where connections between units form a directed cycle. RNNs can use their internal memory to process arbitrary sequences of inputs and hence are applicable to tasks such as sequential data processing.

For RNNs, the input is a sequence of data Z_1, Z_2, …, Z_T. An RNN models the internal state h_t by a recursive relation

h_t = F(h_{t−1}, Z_t; θ) (16)

where F is a neural network with parameter θ (for example, θ = (W, b)). Then the output is given by

ŝ_t = G(h_{t−1}, Z_t; θ) (17)

where G is another neural network with parameter θ. There are two important variants of the vanilla RNN introduced above – the long short-term memory (LSTM) and the gated recurrent units (GRUs) (Zhang et al., 2022). Compared to vanilla RNNs, LSTMs and GRUs are shown to have better performance in handling sequential data with long-term dependence due to their flexibility in propagating information flows (Heryadi et al., 2017).

5.2. Deep value-based methods

In reinforcement learning, the distribution of the reward function r and the transition function P are unknown to the agent. However, the agent can observe samples (s, a, r, s′) by interacting with the environment, without having any prior knowledge of r or P, and can learn the value function using these samples (Sutton and Barto, 2014). This is the basic concept used in classical temporal-difference learning, Q-learning, and SARSA (State-Action-Reward-State-Action).

5.2.1. Temporal difference in prediction learning

We have already introduced TD in the previous section; here we discuss the learning aspect of TD. TD learning is an unsupervised technique in which the learning agent learns to predict the expected value of a variable occurring at the end of a sequence of states (Heryadi et al., 2017). Reinforcement learning (RL) extends this technique by allowing the learned state-values to guide actions that change the environment state. Thus, RL is concerned with the more holistic problem of an agent learning effective interaction with its environment. TD algorithms are often used in reinforcement learning to predict the total amount of reward expected over the future, but they can also be used to predict other quantities. Continuous-time TD algorithms have also been developed (Sutton and Barto, 2014). Given some samples (s, a, r, s′) obtained by following a policy 𝜋, the agent can update her estimate of the value function V^𝜋 at the (n + 1)-th iteration by:

V^{𝜋,(n+1)}(s) ← (1 − β_n(s, a)) V^{𝜋,(n)}(s) + β_n(s, a) [r + γ V^{𝜋,(n)}(s′)] (18)

5.2.2. Least-squares temporal difference learning

The least-squares TD(λ) algorithm, or LSTD(λ), converges to the same coefficients β_λ that TD(λ) does. However, instead of performing gradient descent, LSTD(λ) builds explicit estimates of the C matrix and d vector (actually, estimates of a constant multiple of C and d) and solves d + Cβ_λ = 0 directly (Bradtke & Barto, 1996; Boyan, 2002; Xu et al., 2002). LSTD explicitly solves for the value function parameters that result in a zero-mean TD update over all observed state transitions. The resulting algorithm is considerably more data-efficient than TD(0) but less computationally efficient: LSTD requires O(n²) computation per time step, even with incremental inverse computations. Hence, the improved data efficiency is obtained at a substantial computational price. LSTD can be seen as immediately solving for the value function parameters for which the summed TD update over all the observed data is zero.

5.3. Deep policy-based methods

Deep policy-based methods are extensions of policy-based methods using neural network approximations. Using neural networks to parametrize the policy and/or value functions in the vanilla version of policy-based methods leads to neural Actor-Critic algorithms (Zhang et al., 2020) and deep DPG (DDPG). In addition, since introducing an entropy term in the objective function encourages policy exploration and speeds up the learning process, there have been some recent developments in (off-policy) soft Actor-Critic algorithms (Dai et al., 2022) using neural networks, which solve the RL problem with entropy regularization.

Below we introduce the DDPG algorithm, one of the most popular deep policy-based methods, which has been applied to many financial problems. DDPG is a model-free, off-policy Actor-Critic algorithm, first introduced in Lillicrap et al. (2015), that combines DQN and DPG: it extends DQN to continuous action spaces by incorporating DPG to learn a deterministic strategy. To encourage exploration, DDPG uses the following action:

a_t = 𝜋^D(s_t; θ_t) + ε (19)

where 𝜋^D is a deterministic policy and ε is a random variable sampled from some distribution N, which can be chosen according to the environment. One must note that the algorithm requires a small learning rate.
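The exploration rule in Eq. (19) and the small-learning-rate remark can be illustrated with the short sketch below (our own; the noise scale, the linear stand-in for the actor network, and the soft-update coefficient are assumptions, and a real DDPG agent would additionally use neural networks, a critic, and a replay buffer).

```python
import numpy as np

rng = np.random.default_rng(7)

def deterministic_policy(state, theta):
    """Stand-in for the actor pi^D(s; theta): here just a clipped linear map."""
    return np.clip(theta @ state, -1.0, 1.0)

theta = rng.normal(scale=0.1, size=(1, 4))   # actor parameters
theta_target = theta.copy()                  # slowly tracking target actor
tau, noise_scale = 0.005, 0.1                # small soft-update rate and noise scale

state = rng.normal(size=4)

# Eq. (19): exploratory action = deterministic action + sampled noise epsilon.
epsilon = rng.normal(scale=noise_scale, size=1)
action = deterministic_policy(state, theta) + epsilon

# After a (not shown) gradient step on theta, DDPG softly updates the target
# parameters; the small tau and learning rate help keep training stable.
theta_target = tau * theta + (1.0 - tau) * theta_target

print("action with exploration noise:", action)
```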


Table 1: Overview of studies that use reinforcement learning in financial prediction problems

Credit risk
- Consumer credit risk: general consumer default; credit card delinquency and default; bill payment in developing countries; credit card repayment patterns
- Corporate credit risk: firms' credit rating changes; corporate bankruptcy; fintech loan default; recovery rates of corporate bonds
- Real estate credit risk: mortgage loan risk; commercial real estate default

Financial policy and outcomes
- Financial outcomes: capital structure; earnings
- Fraud (accounting): accounting fraud from financial statement variables; accounting fraud from annual report topics
- Startup investment outcome: startup acquisitions; startup valuations and success probabilities

Trading and assets
- Equities: stock returns; stock volatility; stock covariance; equity risk premium
- Bonds: future excess returns
- Foreign exchange: direction of changes in exchange rates
- Derivatives: prices of options on index futures; prices of general derivatives
- Financial claims: stochastic discount factor
- Market microstructure: lifespan of trading orders; general microstructure variables
- Investors: mutual fund performance; retail investors' portfolio allocations and performance
- Algorithmic trading: buy and sell decisions
- Financial text mining: sentiment analysis; real-time streaming news/tweets; instant text-based information

6. Financial applications

In this section, we discuss an overview of various interesting uses of DL, RL, and DRL approaches in finance. We have categorized our review of financial applications into the categories shown in Table 1. We have taken our study to the category level, bringing some research and findings from each category, as below.

6.1. Information management for credit risk

Risk assessment measures the risk attached to any given asset, be it a firm, person, product, or bank. It has been one of RL researchers' most studied and researched areas. The problem has been studied under different categories, for example, credit scoring and evaluation, credit rating, bankruptcy prediction (Hosaka, 2019), loan/insurance underwriting, bond rating (Chi et al., 2021), loan application, consumer credit determination, corporate credit rating, mortgage choice decision, financial distress prediction, and business failure prediction (Chen et al., 2020). Asset pricing is highly dependent on risk assessment measures; therefore, identifying the risk status is critical. The majority of the risk assessment studies focus on credit scoring and bank distress classification. However, a few papers also cover crisis forecasting and detection of risky transactions (Serrano, 2018).

For credit score classification, Luo et al. (2017), in a first-of-its-kind implementation, used a deep belief network learning approach for corporate credit rating and the corresponding credit classification (A, B, or C). Out of all the tested models, the Deep Belief Network with Restricted Boltzmann Machine performed the best. Similarly, in Yu et al. (2018), a cascaded hybrid model of DBN, backpropagation, and SVM for credit classification was implemented, and good performance results (accuracy above 80-90%) were achieved. Finally, Li et al. (2017) performed credit risk classification using an ensemble of deep MLP networks, each using subspaces of the whole space obtained by k-means (using the minority class in each, but only a partial subspace of the majority class).

Multiple subspaces were used for each classifier to handle the data imbalance problem: each subspace contained all the positive (minority) instances but only a subsample of the negative (majority) instances, and an ensemble of deep MLPs finally combined each subspace model. Credit scoring was also performed using an SAE network and a GP model to create credit assessment rules that classify good or bad credit cases. In another study, Neagoe et al. (2018) classified credit scores using various DMLP and deep CNN networks. Zhu et al. (2018) approached consumer credit scoring classification with hybrid deep learning models; they transformed the 2-D consumer input data into a 2-D pixel matrix and used the resulting images to train and test a DRL model using a CNN.
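The subspace-ensemble idea for class imbalance described above can be sketched as follows (a generic illustration of ours, not the cited authors' code): every balanced training subset keeps all minority-class examples and only a random subsample of majority-class examples, and one classifier is trained per subset.

```python
import numpy as np

rng = np.random.default_rng(3)

def balanced_subsets(X, y, n_subsets=5, minority_label=1):
    """Yield (X_sub, y_sub) pairs: all minority rows + a same-sized majority subsample."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    for _ in range(n_subsets):
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        yield X[idx], y[idx]

# Dummy imbalanced data: 1000 "good credit" (0) vs. 50 "default" (1) cases.
X = rng.normal(size=(1050, 8))
y = np.array([0] * 1000 + [1] * 50)

# One model per balanced subset; here a trivial mean-difference scorer stands in
# for the deep MLPs used in the cited study.
models = []
for X_sub, y_sub in balanced_subsets(X, y):
    w = X_sub[y_sub == 1].mean(axis=0) - X_sub[y_sub == 0].mean(axis=0)
    models.append(w)

def ensemble_score(x):
    """Average the subset models' scores; a positive score leans towards 'default'."""
    return float(np.mean([w @ x for w in models]))

print(ensemble_score(X[0]))
```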


Table 2: Some of the major articles on credit risk (2014-2020)

Authors | Study period | Method | Feature set
Li et al. | 2017 | DRL, DNN | Personal finance variables
Tran et al. | 2016 | RL, DCNN | Personal finance variables
Rawte et al. | 2006-2017 | RL, LSTM | Weighted sentiment polarity and ROA
Malik et al. | 1976-2017 | DRL, LSTM, RL | Macro-economic variables and bank transactions
Hosaka | 2014-2019 | CNN, RL | Financial ratios
Ozbayoglu et al. | 2016-2019 | RL-LSTM, CNN | Order details and transaction details
Chatzis | 1996-2017 | DNN, Logit, RL, RL-LSTM | Index data, exchange rates

Table 3: Some of the major articles on fraud management (2014-2020)

Authors | Study period | Method | Feature set
Gomes | 2009-2017 | RL, Autoencoders | Party name, type of expense, state expenses
Jurgovsky et al. | 2015 | LSTM | Transaction and bank details
Heryadi and Warnars | 2016-2017 | CNN, RL-LSTM | Financial transactions over a period
Han et al. | 2018 | RL-LSTM | Pricing, interest rates

Zhu et al. (2018) approached the consumer credit scoring classification problem with hybrid deep learning models. They transformed the consumer input data into a 2-D pixel matrix and used the resulting images to train and test a CNN-based DRL model.

Financial distress prediction for banks and corporates has been studied extensively. Lanbouri and Achchab (2015) used a deep belief network and an SVM to predict financial distress, identifying whether a firm is in trouble or not. At the same time, Rawte, Gupta, and Zaki (2018) used RL-LSTM for bank risk classification. Rönnqvist and Sarlin (2015) explored word sequence learning to extract new semantics, with the associated events labeled as bank stress; the detected bank stress was then classified against a threshold on the resulting semantic vector representation, integrating prediction and semantic meaning extraction neatly.

Ozbayoglu et al. (2020) were able to classify, with high accuracy, whether a stock market transaction was risky or not using CNN and LSTM networks. Researchers have also developed several RL and DRL models for detecting events that cause stock market crashes (Zhang et al., 2018).

6.2. Financial policies and outcomes

6.2.1. Information management for fraud detection

Several types of financial fraud exist, including credit card fraud (Jurgovsky et al., 2018), money laundering, consumer credit fraud, tax evasion, bank fraud, and insurance claim fraud. Fraud detection is one of the most extensively studied areas of finance for AI and ML research; it is usually framed as anomaly detection and is generally a classification problem. Several survey papers have been published as the topic has attracted much attention from researchers. At different times, Kirkos et al. (2017), Ngai et al. (2011), and West and Bhattacharya (2016) reviewed accounting and financial fraud detection studies based on soft computing and data mining techniques (see Table 2).

Heryadi and Warnars (2017) focused their research on identifying credit card fraud and explored the effect of the imbalance between fraud and non-fraud data. Roy et al. (2018) used an LSTM model for credit card fraud detection, whereas Gomez et al. (2018) used MLP networks to classify whether a credit card transaction was fraudulent. Furthermore, many researchers have used ensembles of feed-forward neural networks for card fraud detection. In a similar study, Gomes et al. (2017) worked on parliamentary expenditure, proposing a deep autoencoder (AE) based model to identify anomalies in the spending of the Brazilian Chamber of Deputies. Wang and Xu (2018) used text mining and DNN models to detect automobile insurance fraud. Other similar studies have applied LSTMs to character sequences from financial transactions and the counterparty's responses to detect whether a transaction was fraudulent. Finally, Goumagias et al. (2018) used deep Q-learning (RL) to predict risk-averse firms. Table 3 summarizes some of the major articles on fraud management.
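Several of the fraud studies above, such as the deep autoencoder applied to parliamentary expenditures, share a common recipe: train an autoencoder on predominantly legitimate records and flag records with unusually high reconstruction error. The sketch below illustrates only that general recipe; the twelve generic features, the network sizes, and the 95th-percentile threshold are illustrative assumptions rather than details taken from any cited study.

```python
# Illustrative sketch: flag anomalous transactions by reconstruction error
# from a deep autoencoder. Features, sizes, and threshold are assumptions.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.tensor(np.random.rand(5000, 12), dtype=torch.float32)  # 12 scaled features (assumed)

class AutoEncoder(nn.Module):
    def __init__(self, n_features: int, bottleneck: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(),
                                     nn.Linear(8, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 8), nn.ReLU(),
                                     nn.Linear(8, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder(n_features=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):                      # train on (mostly legitimate) records
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

with torch.no_grad():                        # score records by reconstruction error
    errors = ((model(X) - X) ** 2).mean(dim=1)
threshold = torch.quantile(errors, 0.95)     # assumed cut-off; tuned in practice
flagged = (errors > threshold).nonzero().squeeze()
print(f"{flagged.numel()} transactions flagged for review")
```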
6.3. Trading and assets

6.3.1. Big data driven portfolio optimization

In portfolio optimization problems, a trader needs to select and trade the best portfolio of assets to maximize some objective function, which typically includes the expected return and some measure of risk. By investing in such portfolios (Lee and Soo, 2017), the trader diversifies the investments and achieves a higher return per unit of risk than investing in a single asset (Lee and Yoo, 2020). Both value-based methods such as Q-learning (Han et al., 2018; Tsantekidis et al., 2017), SARSA (Han et al., 2018), and DQN (Kumar and Ravi, 2016), and policy-based algorithms such as DPG and DDPG (Iwasaki & Chen, 2018) have been applied to solve portfolio optimization problems. The state variables are often the time, asset prices, past asset returns, current asset holdings, and the remaining balance. The control variables are typically the amount or proportion of wealth invested in each portfolio component. Examples of reward signals include the portfolio return (Han et al., 2018), the Sharpe ratio (Han et al., 2018; Tsantekidis et al., 2017), and profit (Tsantekidis et al., 2017). The constantly rebalanced portfolio (CRP) is used as a benchmark strategy, where the portfolio is rebalanced to the initial wealth distribution among the assets at each period; the buy-and-hold (or do-nothing) strategy takes no action and instead holds the initial portfolio until the end. The performance measures studied in these papers include the Sharpe ratio (Iwasaki & Chen, 2018), the Sortino ratio (Lim et al., 2021), portfolio returns (Wang, 2019; Xiong et al., 2018), portfolio values (Jiang & Liang, 2017; Xiong et al., 2018), and cumulative profits (Du et al., 2016). Some models incorporate transaction costs (Jiang & Liang, 2017; Liang et al., 2018) and investments in the risk-free asset (Du et al., 2016; Wang, 2019; Jiang & Liang, 2017).
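The MDP formulation just described translates almost directly into code. The following is a minimal sketch of a portfolio environment in that spirit, with the state built from current prices, portfolio weights, and wealth, the action given as target weights, and the reward defined as the one-period log return net of a proportional transaction cost. The random price process, the cost level, and the reward choice are our own illustrative assumptions, not a reproduction of any specific paper's setup.

```python
import numpy as np

class PortfolioEnv:
    """Minimal portfolio-optimization MDP: state = (prices, weights, wealth),
    action = target weights over n risky assets plus cash, reward = log return
    of total wealth net of proportional transaction costs (all assumed)."""

    def __init__(self, prices: np.ndarray, cost: float = 0.001):
        self.prices = prices                      # shape (T, n_assets)
        self.cost = cost                          # proportional transaction cost
        self.reset()

    def reset(self):
        self.t = 0
        self.wealth = 1.0
        self.weights = np.zeros(self.prices.shape[1] + 1)
        self.weights[-1] = 1.0                    # start fully in cash
        return self._state()

    def _state(self):
        return np.concatenate([self.prices[self.t], self.weights, [self.wealth]])

    def step(self, action: np.ndarray):
        """action: non-negative target weights over (n_assets + cash)."""
        target = action / action.sum()
        turnover = np.abs(target - self.weights).sum()
        asset_return = self.prices[self.t + 1] / self.prices[self.t] - 1.0
        growth = (1.0 - self.cost * turnover) * (1.0 + np.dot(target[:-1], asset_return))
        self.wealth *= growth                     # cash is assumed to earn zero
        reward = np.log(growth)                   # one-period log return, net of costs
        drifted = target * np.append(1.0 + asset_return, 1.0)
        self.weights = drifted / drifted.sum()    # weights drift with realized returns
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done

# Usage: random prices and a fixed equal-weight policy, purely to exercise the interface.
rng = np.random.default_rng(0)
prices = np.cumprod(1 + 0.001 + 0.01 * rng.standard_normal((250, 3)), axis=0)
env, done = PortfolioEnv(prices), False
state = env.reset()
while not done:
    state, reward, done = env.step(np.full(4, 0.25))
print(f"final wealth of the equal-weight policy: {env.wealth:.3f}")
```

A value-based or policy-based agent from the studies above would replace the fixed equal-weight action in this usage example.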
Du et al. (2016) considered the portfolio optimization problem of one risky asset and one risk-free asset for value-based algorithms. They compared the performance of the Q-learning algorithm and a recurrent RL (RRL) algorithm under three different value functions: the Sharpe ratio, the differential Sharpe ratio, and profit. The RRL algorithm is a policy-based method that uses the last action as an input. They concluded that the Q-learning algorithm is more sensitive to the choice of the value function and has less stable performance than the RRL algorithm. They also suggested that the (differential) Sharpe ratio is preferable to profit as the reward function.
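The differential Sharpe ratio mentioned above is an incremental, per-step approximation of the Sharpe ratio originally proposed by Moody and Saffell, which makes it convenient as an RL reward because it can be computed online from exponential moving estimates of the first and second moments of the returns. A small sketch of that formulation follows; the adaptation rate eta is an assumed hyperparameter.

```python
import numpy as np

def differential_sharpe_ratio(returns, eta: float = 0.01):
    """Per-step differential Sharpe ratio rewards (Moody & Saffell style):
    A and B are exponential moving estimates of the first and second moments
    of the return; D measures how the latest return moves the Sharpe ratio
    and can be used directly as an RL reward."""
    A, B = 0.0, 0.0
    rewards = []
    for r in returns:
        dA, dB = r - A, r**2 - B
        denom = (B - A**2) ** 1.5
        D = (B * dA - 0.5 * A * dB) / denom if denom > 1e-12 else 0.0
        rewards.append(D)
        A, B = A + eta * dA, B + eta * dB
    return np.array(rewards)

# Example: reward stream for a noisy return series with a small positive drift.
rng = np.random.default_rng(1)
step_rewards = differential_sharpe_ratio(0.0005 + 0.01 * rng.standard_normal(500))
print(step_rewards[:5])
```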
For policy-based algorithms, Jiang and Liang (2017) proposed a framework combining neural networks with DPG. They used a so-called ensemble of identical independent evaluators (EIIE) topology to predict the potential growth of the assets in the immediate future using historical data, which includes the highest, lowest, and closing prices of the portfolio components.


Table 4: Some of the major articles on portfolio management (2014–2020)

Authors | Study period | Method | Feature set
Zhou Bo | 1993–2017 | LSTM, MLP, RL | Firm characteristics, price
Lee and Yoo | 1997–2016 | RNN, LSTM, GRU | Price data, OCHLV
Heaton | 2014–2017 | CNN, RNN, LSTM | Price data
Jiang and Liang | 2015–2016 | CNN, RL | Price data
Almahdi and Yang | 2017 | RL | Transaction costs
Liang et al. | 2015–2018 | DRL, PPO, RL | Fundamental data, OCHLV
Iwasaki and Chen | 2016–2018 | LSTM, CNN | Text

In experiments on actual cryptocurrency market data, Jiang and Liang (2017) showed that their framework achieves a higher Sharpe ratio and higher cumulative portfolio values than three benchmarks, including CRP and several published RL models. Singh et al. (2022) explored the DDPG algorithm for the portfolio selection of 30 stocks, where at each time point the agent can choose to buy, sell, or hold each stock. Using historical daily prices of the 30 stocks, the DDPG algorithm was shown to outperform two classical strategies, including the min-variance portfolio allocation method, in terms of several performance measures, including final portfolio values, annualized return, and the Sharpe ratio. One can note that the classical stochastic control approach to portfolio optimization across multiple assets requires both a realistic representation of the temporal dynamics of the individual assets and an adequate representation of their co-movements, which is challenging when the assets belong to different classes (stocks, options, futures, interest rates, and derivatives). The model-free RL approach, on the other hand, does not rely on a specification of the joint dynamics across assets. Table 4 summarizes these studies with their intended purposes.
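DDPG, used in the stock-trading study above, is an actor-critic method for continuous actions: a deterministic actor maps the state to an allocation, a critic estimates Q(s, a), and both are trained from replayed transitions with slowly updated target networks. The sketch below shows a single, generic update step in PyTorch; the network sizes, the soft-update rate, and the dummy batch are illustrative assumptions and not the settings of Singh et al. (2022) or Xiong et al. (2018).

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA, TAU = 8, 4, 0.99, 0.005   # assumed sizes and constants

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# Actor outputs portfolio weights via softmax; critic scores (state, action) pairs.
actor = mlp(STATE_DIM, ACTION_DIM, nn.Softmax(dim=-1))
critic = mlp(STATE_DIM + ACTION_DIM, 1)
actor_target = mlp(STATE_DIM, ACTION_DIM, nn.Softmax(dim=-1))
critic_target = mlp(STATE_DIM + ACTION_DIM, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    """One DDPG step on a replay batch of (s, a, r, s', done) tensors."""
    s, a, r, s2, done = batch
    with torch.no_grad():                                  # bootstrapped TD target
        q_next = critic_target(torch.cat([s2, actor_target(s2)], dim=1))
        target = r + GAMMA * (1 - done) * q_next
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # deterministic PG
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, tp in zip(net.parameters(), tgt.parameters()):      # soft target update
            tp.data.mul_(1 - TAU).add_(TAU * p.data)

# Dummy batch, only to show the expected shapes.
B = 32
batch = (torch.randn(B, STATE_DIM), torch.softmax(torch.randn(B, ACTION_DIM), dim=-1),
         torch.randn(B, 1), torch.randn(B, STATE_DIM), torch.zeros(B, 1))
ddpg_update(batch)
```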
6.3.2. Information management for asset and derivatives market

Accurate pricing and valuation of assets is a fundamental research and study area in finance. Many AI/ML models have been developed for banks, corporates, real estate, derivative products, and so on; however, RL and DRL have not been widely applied to this field yet. There are several possible implementation areas where RL models can assist asset pricing researchers or valuation experts (Hsu et al., 2019). Unfortunately, we could find only a handful of studies within the RL and finance community during our research. In their sentiment analysis for stock price prediction, Iwasaki and Chen (2018) used a DFNN model and analyst reports; based on the forecasts, different portfolio selection approaches were implemented. Culkin et al. (2017) proposed a novel method that used a feedforward DNN model to predict option prices, comparing their results with the Black–Scholes option pricing formula. Similarly, Hsu et al. (2018) proposed a novel method using bid-ask spreads and Black–Scholes option price model parameters with a 3-layer DMLP to predict TAIEX option prices. Finally, Feng et al. (2019) explored characteristic features such as asset growth, industry momentum, market equity, and market beta to predict US equity returns. Meanwhile, derivative-product-based financial models are widely popular and common; the development of RL models could benefit option pricing, hedging strategy development, and financial engineering with options, futures, and forward contracts. Some recent studies show that the topic has caught the attention of researchers, who have started showing interest in RL models that can provide solutions to this complex and challenging field. Table 5 summarizes these studies with their intended purposes.
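The option-pricing studies above share a common pattern: assemble option prices together with the pricing inputs (moneyness, maturity, rate, volatility) and train a feedforward network to approximate the pricing function. The sketch below reproduces only that general pattern on synthetic Black–Scholes call prices; the sampling ranges and the network size are our assumptions rather than the configurations used by Culkin et al. (2017) or Hsu et al. (2018).

```python
import numpy as np
from scipy.stats import norm
from sklearn.neural_network import MLPRegressor

def bs_call(S, K, T, r, sigma):
    """Black–Scholes price of a European call."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

rng = np.random.default_rng(0)
n = 20000
S = rng.uniform(0.6, 1.4, n)          # spot, with the strike normalized to 1 (assumed range)
T = rng.uniform(0.05, 2.0, n)         # maturity in years
r = rng.uniform(0.0, 0.05, n)         # risk-free rate
sigma = rng.uniform(0.1, 0.5, n)      # volatility
X = np.column_stack([S, T, r, sigma])
y = bs_call(S, 1.0, T, r, sigma)

# Fit a small feedforward network to approximate the pricing function.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X[:15000], y[:15000])
pred = net.predict(X[15000:])
print("mean absolute pricing error:", np.abs(pred - y[15000:]).mean())
```

With market option prices instead of synthetic ones, the same fit becomes a data-driven pricing model that can be benchmarked against the closed-form Black–Scholes values, which is essentially the comparison reported in these studies.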
6.3.3. Algorithmic trading

Algorithmic trading is defined as buy-sell decision-making based solely on algorithmic models, whether simple rules, mathematical models, or complex functions. With the advent of online trading platforms and guidelines, algorithmic trading has become a hugely successful application in the last couple of years. Most algo-trading applications are coupled with price prediction models for market timing purposes (Du et al., 2016); therefore, most price or trend forecasting models that trigger buy-sell signals based on their predictions are also considered algo-trading systems (Mahmoudi et al., 2018). However, some studies propose stand-alone algo-trading models that focus on the dynamics of the transaction itself, optimizing trading parameters such as the bid-ask spread, limit order book analysis, and position sizing. Predicting stock or index prices was the most preferred type of DL model implementation (Lei et al., 2020). Arpaci et al. (2017) studied market microstructure-based trade indicators used as input to an RNN with Graves LSTM to perform price prediction for algorithmic stock trading. Bao et al. (2017) used technical indicators as input to wavelet transforms (WT), stacked autoencoders (SAEs), and LSTM for forecasting stock prices. Zhang et al. (2022) implemented CNN and LSTM models together, with the CNN used for stock selection and the LSTM for price prediction. Deng et al. (2017) used Fuzzy Deep Direct Reinforcement Learning (FDDR) for stock price prediction and trading signal generation. For index prediction, the following studies are noteworthy. Saleh (2020) used LSTM for price prediction of the S&P 500 index. Si et al. (2017) implemented a Chinese intraday futures market trading model with DRL and LSTM (Boukas et al., 2021). Brzeszczyński and Ibrahim (2019) used a feedforward DNN and the open, close, high, and low (OCHL) values of the time series to predict Singapore stock market index data. In addition, forex or cryptocurrency trading was implemented in some studies. Finally, Jeong and Kim (2019) combined deep Q-learning and DNN to implement price forecasting. They intended to solve three different problems: increasing profit in a market (Bari and Agah, 2020), predicting the number of shares to trade, and preventing overfitting with insufficient financial data. A summary of major works on algo-trading text mining is presented in Table 6.
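Deep Q-learning treats trading as a discrete-action problem: a network maps market features to Q-values for actions such as sell, hold, and buy, the agent acts epsilon-greedily, and training minimizes the temporal-difference error. The fragment below is a generic sketch of that idea, not a reconstruction of Jeong and Kim's architecture; the feature dimension, epsilon, and discount factor are assumptions.

```python
import random
import torch
import torch.nn as nn

N_FEATURES, ACTIONS, GAMMA, EPS = 16, ("sell", "hold", "buy"), 0.99, 0.1  # assumed

q_net = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, len(ACTIONS)))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(state: torch.Tensor) -> int:
    """Epsilon-greedy action over {sell, hold, buy}."""
    if random.random() < EPS:
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def td_update(state, action, reward, next_state, done):
    """One-step Q-learning update on a single transition."""
    q_value = q_net(state)[action]
    with torch.no_grad():
        target = reward + GAMMA * (0.0 if done else q_net(next_state).max().item())
    loss = nn.functional.mse_loss(q_value, torch.tensor(target))
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Example transition: features could be returns, technical indicators, current position, etc.
s, s2 = torch.randn(N_FEATURES), torch.randn(N_FEATURES)
a = select_action(s)
td_update(s, a, reward=0.002, next_state=s2, done=False)
print("chose:", ACTIONS[a])
```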
6.3.4. Financial big data analytics using text mining

The evolution of financial models has been hugely influenced by the broad reach of social media, instant text-based information, and real-time streaming, which in turn has made financial text mining studies popular. Though many of these studies are directly linked to sentiment analysis through crowdsourcing, there are many implementations on financial statements, content retrieval of news, disclosures, and so on. A few surveys focus on text mining and financial news analytics (Kushwaha et al., 2021). For example, Mitra et al. (2012) edited a book on news analytics in finance, while Kumar et al. (2021), Loughran and McDonald (2016), and Kumar and Ravi (2016) have published surveys on textual analysis of financial documents, corporate disclosures, and news.

Ying et al. (2017) used individual social security payment types (paid, unpaid, repaid, transferred) to classify and predict payment behavior using LSTM, HMM, and SVM. Sohangir and Wang (2018), in their study, classified the authors of StockTwits posts as expert or non-expert using neural network models. Costa and Silva (2016) applied RL-LSTM to character sequences from financial transactions and the counterparty's responses to detect whether a transaction was fraudulent. Wang and Xu (2018) used text mining and DNN models to detect automobile insurance fraud. At DeepAI, word sequence learning extracted the news semantics, and bank stress was determined and classified with the associated events. Day and Lee (2016) used financial sentiment analysis based on text mining and DNN for algorithmic stock trading. Cerchiello et al. (2018) used fundamental data and text mining of financial news (Reuters) to classify bank distress. Rönnqvist and Sarlin (2017) identified bank distress by extracting data from financial news through text mining; the proposed method used a DFNN on semantic sentence vectors to classify whether or not an event had occurred.


Table 5: Some of the major articles on asset pricing and derivatives (2014–2020)

Authors | Study period | Method | Feature set
Feng et al. | 2019 | RDL, DL | Firm characteristics
Iwasaki and Chen | 2016–2018 | RL-LSTM, Bi-LSTM | Text
Hsu et al. | 2018 | Black–Scholes, RL | Fundamental analysis, option price
Deng et al. | 2017 | RL | Stock market data, firm data
Martinez-Miranda et al. | 2016 | RL | Stock market data
Feuerriegel and Prendinger | 2016 | RL | Transaction fees, stock prices
Zhang and Maringer | 2015 | RL | Stock prices and trade data
Chen et al. | 2018 | RL | Trading data, market data
Chakraborty | 2019 | RL | Trading market data
Mosavi et al. | 2020 | DRL | –

Table 6: Some of the major articles on algo-trading text mining (2014–2020)

Authors | Study period | Method | Feature set
Dixon et al. | 2014 | DNN | Price data
Deng et al. | 2014–2017 | RL, DNN, FDDR | Profit, return, profit-loss
Bari and Agah | 2016–2018 | RL, DRL | Text and price data
Jeong and Kim | 2017 | DNN, DRL | Total profit, correlation
Zhang et al. | 2017 | CNN, LSTM | Annual return, maximum retracement
Sirignano and Rama | 2017–2019 | RL, DRL | Transactional data, price data
Serrano | 2016–2019 | RL, DNN | Asset price data
Boukas et al. | – | DRL | Market data
Lei et al. | – | DRL | Price data, asset price

Table 7: Some of the major articles on financial text mining (2014–2020)

Authors | Study period | Method | Feature set
Verma et al. | 2013–2017 | LSTM, RL | Index data, news
Ying et al. | 2008–2014 | RNN | Insurance details (e.g., ID, area code)
Zhang et al. | 2015 | RL, CNN | Price data, news and social media
Mahmoudi et al. | 2008–2016 | CNN, LSTM, DL | Sentences, StockTwits messages
Sohangir and Wang | 2015–2016 | CNN, Doc2vec | Sentences, StockTwits messages
Lee and Soo | 2017 | LSTM, RL, DL | Technical indicators, price data, news
Chi et al. | 2014–2020 | CNN, LSTM, RL | Input news, OCHLV, technical indicators

For financial sentiment analysis, Almahdi and Yang (2019) compared CNN-, LSTM-, and GRU-based DL models against an MLP. Rawte et al. (2018) tried to solve three problems using CNN, LSTM, SVM, and RF models: bank risk classification (Dixon et al., 2017), sentiment analysis, and Return on Assets (ROA) regression. Chang et al. (2016) estimated the information content polarity (negative/positive effect) of financial news from Reuters using text mining, word vectors, lexical and contextual inputs, and various LSTM models. Finally, it is worth mentioning that there are also some studies of text mining for financial prediction models (Khadjeh et al., 2014; Kumar and Ravi, 2016). Table 7 highlights some of the major articles on financial text mining.
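Most of the text-mining pipelines surveyed in this subsection reduce to two stages: converting documents into numerical features and training a classifier that maps those features to a sentiment or risk label. As a hedged illustration rather than any particular study's pipeline, the sketch below uses TF-IDF features and logistic regression on a handful of made-up headlines; the cited works replace both stages with word embeddings and LSTM or CNN models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus of labeled headlines (1 = positive tone, 0 = negative tone).
headlines = [
    "Bank beats earnings expectations and raises dividend",
    "Regulator fines lender over reporting failures",
    "Strong loan growth lifts quarterly profit",
    "Credit losses widen as defaults climb",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(headlines, labels)

print(model.predict(["profit climbs on loan growth",
                     "defaults climb as losses widen"]))   # likely [1, 0] given shared vocabulary
```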
7. Discussion

Our review establishes that DL and DRL algorithms have almost the same prediction quality, in terms of statistical error, when used for prediction purposes. In addition, reinforcement learning can simulate more efficient models with more realistic market constraints, while deep RL goes further and solves the scalability problem of RL algorithms, which is crucial with fast-growing user bases and financial growth. Moreover, DRL works efficiently in the high-dimensional settings that are highly desirable in financial markets (Chakraborty, 2019) and can notably help design more efficient forecasting algorithms and analyze the market with real-world parameters (as shown in Table 8). For example, the deep deterministic policy gradient (DDPG) method applied to stock trading (Xiong et al., 2018) demonstrated how the model can handle large settings with respect to stability, improve data use, and balance risk while optimizing the return with a high performance guarantee (Chatzis et al., 2018). Another example in the deep RL framework used the DQN scheme to improve news recommendation accuracy while dealing with a huge number of users simultaneously, again with a considerably high performance guarantee (Mosavi et al., 2020). Our review demonstrates outstanding potential for improving deep learning methods applied to RL problems so as to better analyze the relevant problem and find the best strategy in a wide range of finance domains.

7.1. Practical implication

Deep reinforcement learning architectures combine DL and RL approaches so that RL can resolve the scalability problem and be applied to the high-dimensional problems found in real-world market settings. We reviewed approaches such as CNN, RNN, LSTM, and a few others from both DL and deep RL in various applications of finance. The advanced models provide improved predictions, extract better information, and find optimal strategies, mostly in complicated and dynamic market conditions. Our work shows that the fundamental issues for all proposed approaches are mainly model complexity, robustness, accuracy, performance, computational cost, risk constraints, and profitability. Practitioners can employ a variety of DL and deep RL techniques, with their respective strengths and weaknesses, to serve finance problems and enable the machine to detect the optimal strategy associated with the market. Recent works showed that novel DNN techniques, and more recently their interaction with reinforcement learning (so-called deep RL), can considerably enhance model performance and accuracy while handling real-world finance problems. Finally, our review indicates that recent DL and deep RL approaches perform better than classical ML approaches. Therefore, more efficient novel algorithms can be designed by combining deep neural network architectures with reinforcement learning concepts to detect the optimal strategy, namely, to optimize profit and minimize loss while respecting the risk parameters of a highly competitive market.


Table 8: Comparative study of DL, RL, and DRL models for decision making in finance

Methods | Application | Dataset type | Complexity | Accuracy / Processing rate
AE | Risk management | Historical | High | High / Reasonable
AE & RBM | Credit card fraud | Historical | Reasonably high | High / Reasonable
CNNR | Profit market | Transactional | Reasonable | Reasonably high / High
DNN | Stock pricing | Historical | Reasonable | High / Reasonable
DDPG | Stock trading | Historical | Reasonable | Reasonably high / High
LSTM-SVR | Investment | Historical | Reasonable | Reasonably high / High
IMF&MB | Portfolio management | Historical | High | Reasonably high / High

7.2. Future research agenda

RL is a rapidly growing research area within DL. However, most of the well-known results in RL have been obtained in single-agent environments, where the cumulative reward involves a single state-action space. It would be interesting to consider multi-agent scenarios in RL, which comprise more than one agent and in which the cumulative reward can be affected by the actions of the other agents. Some recent research has applied multi-agent reinforcement learning (MARL) to a small set of agents but not to a large set of agents (Jaderberg et al., 2019). In financial markets, where a huge number of agents act simultaneously, the actions of agents concerning the optimization problem might be affected by the decisions of the other agents; as a result, MARL can produce an imprecise model and inaccurate predictions when the number of agents becomes very large. Although the performance of deep reinforcement learning algorithms for information-based decision-making in finance has been good under controlled conditions, their application to actual trading environments, such as investment banks, commercial banks, and exchanges, has been limited (Arjun et al., 2021). In the future, with the support of big data and AI platforms, deep reinforcement learning models will finally realize the deployment of intelligent scenario products, such as investment research, internal risk control, and wealth management, in the financial field.

8. Conclusion

This paper reviews how, and in which areas of finance, RL and DRL models can be used to make better-informed decisions. Secondly, we establish how the existing DL, RL, and DRL approaches can be classified for better understanding based on their applications. Finally, it addresses the open issues and challenges of current deep reinforcement learning models for information-based decision-making in finance. We reviewed how, in comparison with traditional statistical analysis methods in finance, applying deep reinforcement learning can efficiently extract valuable information from large amounts of nonlinear and complex financial data. Thus, RL and DRL can forecast and detect difficult market trends more effectively than traditional ML algorithms, with the significant advantages of high-level feature extraction and problem-solving proficiency. Furthermore, reinforcement learning enables us to construct a more efficient framework that considers crucial market constraints by integrating the prediction problem with the portfolio construction task. With the support of big data and AI platforms, deep reinforcement learning models will finally realize the deployment of intelligent scenario products.

Funding

"Not Applicable"

Ethics approval

"Not Applicable"

Consent to participate

"Not Applicable"

Consent for publication

"Not Applicable"

Availability of data and material

"Not Applicable"

Code availability

"Not Applicable"

Declarations

We hereby declare that the work described has not been published before in any form and is not under consideration for publication elsewhere; its submission to the Journal of Information Management Data Insights has been approved by all authors as well as the responsible authorities, tacitly or explicitly, at the institute where the work has been carried out. If accepted, it will not be published elsewhere in the same form, in English or any other language, including electronically, without the written consent of the copyright holder, and the Journal of Information Management Data Insights will not be held legally responsible should there be any claims for compensation or dispute on authorship.

Declaration of Competing Interest

The authors report no conflict of interest in the study undertaken. All contributors have been listed as authors, and all authors have contributed. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Almahdi, S., & Yang, S. Y. (2019). A constrained portfolio trading system using particle swarm algorithm and recurrent reinforcement learning. Expert Systems with Applications, 130, 145–156.
Arjun, R., Kuanr, A., & Suprabha, K. R. (2021). Developing banking intelligence in emerging markets: Systematic review and agenda. International Journal of Information Management Data Insights, 1(2), Article 100026.
Bao, W., Yue, J., & Rao, Y. (2017). A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PloS one, 12(7), Article e0180944.
Bari, O. A., & Agah, A. (2020). Ensembles of text and time-series models for automatic generation of financial trading signals from social media content. Journal of Intelligent Systems, 29(1), 753–772.
Bergemann, D., & Välimäki, J. (1996). Learning and strategic pricing. Econometrica, 64(5), 1125–1149.


Boukas, I., Ernst, D., Théate, T., Bolland, A., Huynen, A., Buchwald, M., et al., (2021). Lanbouri, Z., & Achchab, S. (2015). A hybrid Deep belief network approach for Financial
A deep reinforcement learning framework for continuous intraday market bidding. distress prediction. In 2015 10th International Conference on Intelligent Systems: Theories
Machine Learning, 110(9), 2335–2387. and Applications (SITA) (pp. 1–6).
Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal differ- Lee, C. Y., & Soo, V. W. (2017). Predict stock price with financial news based on recurrent
ence learning. Machine learning, 22(1), 33–57. convolutional neural networks. In 2017 conference on technologies and applications of
Brzeszczyński, J., & Ibrahim, B. M. (2019). A stock market trading system based on foreign artificial intelligence (TAAI) (pp. 160–165).
and domestic information. Expert Systems with Applications, 118, 381–399. Lee, S. I., & Yoo, S. J. (2020). Threshold-based portfolio: The role of the threshold and its
Cerchiello, P., Nicola, G., Rönnqvist, S., & Sarlin, P. (2018). Deep learning for assessing applications. The Journal of Supercomputing, 76(10), 8040–8057.
banks’ distress from news and numerical financial data. Michael J. Brennan Irish Fi- Lei, K., Zhang, B., Li, Y., Yang, M., & Shen, Y. (2020). Time-driven feature-aware jointly
nance Working Paper Series Research Paper, 18–115. deep reinforcement learning for financial signal representation and algorithmic trad-
Chakraborty, S. (2019). Capturing financial markets to apply deep reinforcement learning. ing. Expert Systems with Applications, 140, Article 112872.
arXiv preprint arXiv:1907.04373. Li, X., Huang, Y. H., Fang, S. C., & Zhang, Y. (2020). An alternative efficient representation
Chang, C. Y., Zhang, Y., Teng, Z., Bozanic, Z., & Ke, B. (2016). Measuring the informa- for the project portfolio selection problem. European Journal of Operational Research,
tion content of financial news. In Proceedings of COLING 2016, the 26th International 281(1), 100–113.
Conference on Computational Linguistics: Technical Papers (pp. 3216–3225). Li, X., Lv, Z., Wang, S., Wei, Z., & Wu, L. (2019). A reinforcement learning model based
Chatzis, S. P., Siakoulis, V., Petropoulos, A., Stavroulakis, E., & Vlachogian- on temporal difference algorithm. IEEE access : Practical innovations, open solutions, 7,
nakis, N. (2018). Forecasting stock market crisis events using deep and statistical 121922–121930.
machine learning techniques. Expert systems with applications, 112, 353–371. Li, Y., Lin, X., Wang, X., Shen, F., & Gong, Z. (2017). Credit risk assessment algorithm
Chen, S. S., Choubey, B., & Singh, V. (2021). A neural network-based price sensitive rec- using deep neural networks with clustering and merging. In 2017 13th International
ommender model to predict customer choices based on price effect. Journal of Retailing Conference on Computational Intelligence and Security (CIS) (pp. 73–176).
and Consumer Services, 61, Article 102573. Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforcement
Chen, Shiuann-Shuoh, & Singh, V. (2020). Developing a Cloud EBC System with 2P-Cloud learning in portfolio management. arXiv preprint arXiv:1808.09940.
Architecture. Journal of Applied Science and Engineering, 23(2), 185–195. Lillicrap, T. P., .Hunt, J. J., .Pritzel, A., Heess, N., Erez, T., & Tassa, Y. et al. (2015).Con-
Chi, M., Hongyan, S., Shaofan, W., Shengliang, L., & Jingyan, L. (2021). Bond default tinuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
prediction based on deep learning and knowledge graph technology. IEEE access : Lim, Q. Y. E., Cao, Q., & Quek, C. (2021). Dynamic portfolio rebalancing through rein-
Practical innovations, open solutions, 9, 12750–12761. forcement learning. Neural Comput & Applic, 34, 7125–7139.
Dai, P., Yu, W., Wang, H., & Baldi, S. (2022). Distributed Actor-Critic Algorithms for Mul- Liu, D., Wei, Q., & Yan, P. (2015). Generalized policy iteration adaptive dynamic pro-
tiagent Reinforcement Learning Over Directed Graphs. IEEE Transactions on Neural gramming for discrete-time nonlinear systems. IEEE Transactions on Systems, Man, and
Networks and Learning Systems. 10.1109/TNNLS.2021.3139138. Cybernetics: Systems, 45(12), 1577–1591.
Day, M. Y., & Lee, C. C. (2016). Deep learning for financial sentiment analysis on finance Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: A sur-
news providers. In 2016 IEEE/ACM International Conference on Advances in Social Net- vey. Journal of Accounting Research, 54(4), 1187–1230.
works Analysis and Mining (ASONAM) (pp. 1127–1134). IEEE. Luo, C., Wu, D., & Wu, D. (2017). A deep learning approach for credit scoring using credit
Dixon, M., Klabjan, D., & Bang, J. H. (2017). Classification-based financial markets pre- default swaps. Engineering Applications of Artificial Intelligence, 65, 465–470.
diction using deep neural networks. Algorithmic Finance, 6(3–4), 67–77. Mahmoudi, N., Docherty, P., & Moscato, P. (2018). Deep neural networks understand
Du, X., Zhai, J., & Lv, K. (2016). Algorithm trading using q-learning and recurrent rein- investors better. Decision Support Systems, 112, 23–34.
forcement learning. positions, 1(1). Mavrotas, G., & Makryvelios, E. (2021). Combining multiple criteria analysis, mathemat-
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in ical programming and Monte Carlo simulation to tackle uncertainty in Research and
actor-critic methods. International conference on machine learning, PMLR 1587-1596. Development project portfolio selection: A case study from Greece. European Journal
Gobillon, L., & Magnac, T. (2016). Regional policy evaluation: Interactive fixed effects of Operational Research, 291(2), 794–806.
and synthetic controls. Review of Economics and Statistics, 98(3), 535–551. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al., (2016). Asyn-
Gomes, T. A., Carvalho, R. N., & Carvalho, R. S. (2017). Identifying anomalies in parlia- chronous methods for deep reinforcement learning. In International conference on ma-
mentary expenditures of brazilian chamber of deputies with deep autoencoders. 2017 chine learning (pp. 1928–1937).
16th IEEE International Conference on Machine Learning and Applications (ICMLA), Moher, D., Shamseer, L., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., & Stew-
940-943). art, L. A. (2015). Preferred reporting items for systematic review and meta-analysis
Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122. protocols (PRISMA-P) 2015 statement. Systematic reviews, 4(1), 1–9.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy max- Mosavi, A., Faghan, Y., Ghamisi, P., Duan, P., Ardabili, S. F., Salwana, E., et al., (2020).
imum entropy deep reinforcement learning with a stochastic actor. In International Comprehensive review of deep reinforcement learning methods and applications in
conference on machine learning (pp. 1861–1870). economics. Mathematics, 8(10), 1640.
Hambly, B., Xu, R., & Yang, H. (2021). Recent Advances in Reinforcement Learning in Neagoe, V. E., Ciotec, A. D., & Cucu, G. S. (2018). Deep convolutional neural networks
Finance. arXiv preprint arXiv:2112.04553. versus multilayer perceptron for financial prediction. In 2018 International Conference
Han, J., Jentzen, A., & Weinan, E. (2018a). Solving high-dimensional partial differen- on Communications (COMM) (pp. 201–206).
tial equations using deep learning. Proceedings of the National Academy of Sciences, Ng, J. Y. H., & Davis, L. S. (2018). Temporal difference networks for video action recog-
115(34), 8505–8510. nition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
Heryadi, Y., & Warnars, H. L. H. S. (2017). Learning temporal representation of transac- (pp. 1587–1596).
tion amount for fraudulent transaction recognition using CNN, Stacked LSTM, and Ngai, E. W., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining
CNN-LSTM. In 2017 IEEE International Conference on Cybernetics and Computational techniques in financial fraud detection: A classification framework and an academic
Intelligence (CyberneticsCom) (pp. 84–89). review of literature. Decision support systems, 50(3), 559–569.
Hosaka, T. (2019). Bankruptcy prediction using imaged financial ratios and convolutional Otterlo, M. V., & Wiering, M. (2012). Reinforcement learning and markov decision pro-
neural networks. Expert systems with applications, 117, 287–299. cesses. In Reinforcement learning (pp. 3–42). Berlin, Heidelberg: Springer.
Hsu, P. Y., Chou, C., Huang, S. H., & Chen, A. P. (2018). A market making quotation Ozbayoglu, A. M., Gudelek, M. U., & Sezer, O. B. (2020). Deep learning for financial
strategy based on dual deep learning agents for option pricing and bid-ask spread applications: A survey. Applied Soft Computing, 93, Article 106384.
estimation. In 2018 IEEE international conference on agents (ICA) (pp. 99–104). Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming.
Iwasaki, H., & Chen, Y. (2018). Topic sentiment asset pricing with dnn supervised learning. John Wiley & Sons.
SSRN Electronic Journal. 10.2139/ssrn.3228485. Rawte, V., Gupta, A., & Zaki, M. J. (2018). Analysis of year-over-year changes in risk
Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., & Cas- factors disclosure in 10-k filings. In Proceedings of the Fourth International Workshop on
taneda, A. G. (2019). Human-level performance in 3D multiplayer games with popu- Data Science for Macro-Modeling with Financial and Economic Datasets (pp. 1–4).
lation-based reinforcement learning. Science (New York, N.Y.), 364(6443), 859–865. Rönnqvist, S., & Sarlin, P. (2015). Detect & describe: Deep learning of bank stress in the
Jeong, G., & Kim, H. Y. (2019). Improving financial trading decisions using deep Q-learn- news. In 2015 IEEE Symposium Series on Computational Intelligence (pp. 890–897). IEEE.
ing: Predicting the number of shares, action strategies, and transfer learning. Expert Rönnqvist, S., & Sarlin, P. (2017). Bank distress in the news: Describing events through
Systems with Applications, 117, 125–138. deep learning. Neurocomputing, 264, 57–70.
Jiang, J., Kelly, B. T., & Xiu, D. (2020). (Re-)Imag(in)ing Price Trends. SSRN Electronic Saleh, H. (2020). The the deep learning with pytorch workshop: Build deep neural networks
Journal. 10.2139/ssrn.3756587. and artificial intelligence applications with pytorch. Packt Publishing Ltd.
Jiang, Z., & Liang, J. (2017). Cryptocurrency portfolio management with deep reinforce- Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy
ment learning. In 2017 Intelligent Systems Conference (IntelliSys) (pp. 905–913). IEEE. optimization. In International conference on machine learning (pp. 1889–1897).
Jurgovsky, J., Granitzer, M., Ziegler, K., Calabretto, S., Portier, P. E., He-Guelton, L., et al., Serrano, W. (2018). Fintech model: The random neural network with genetic algorithm.
(2018). Sequence classification for credit-card fraud detection. Expert Systems with Procedia Computer Science, 126, 537–546.
Applications, 100, 234–245. Sharma, S., Rana, V., & Kumar, V. (2021). Deep learning based semantic personalized
Kumar, B. S., & Ravi, V. (2016). A survey of the applications of text mining in financial recommendation system. International Journal of Information Management Data Insights,
domain. Knowledge-Based Systems, 114, 128–147. 1(2), Article 100028.
Kumar, S., Kar, A. K., & Ilavarasan, P. V. (2021). Applications of text mining in services Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., & Hass-
management: A systematic literature review. International Journal of Information Man- abis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi,
agement Data Insights, 1(1), Article 100008. and Go through self-play. Science (New York, N.Y.), 362(6419), 1140–1144.
Kushwaha, A. K., Kar, A. K., & Dwivedi, Y. K. (2021). Applications of big data in emerging Singh, V., Chen, S.-. S., & Schneider, M. (2022). Anomaly detection in procure to pay business
management disciplines: A literature review using text mining. International Journal processes: A clustering and time series analysis-based approach (SSRN Scholarly Paper ID
of Information Management Data Insights, 1(2), Article 100017. 4012815). Social Science Research Network.


Singh, V., & Sharma, S. K. (2022). Application of blockchain technology in shaping the Yang, H., & Xie, X. (2019). An actor-critic deep reinforcement learning approach for trans-
future of food industry based on transparency and consumer trust. Journal of Food mission scheduling in cognitive internet of things systems. IEEE Systems Journal, 14(1),
Science and Technology, 1–18. 51–60.
Sohangir, S., & Wang, D. (2018). Finding expert authors in financial forum using deep Ying, J. J. C., Huang, P. Y., Chang, C. K., & Yang, D. L. (2017). A preliminary study on deep
learning methods. In 2018 second IEEE international conference on robotic computing learning for predicting social insurance payment behavior. In 2017 IEEE International
(IRC) (pp. 399–402). Conference on Big Data (Big Data) (pp. 1866–1875).
Théate, T., & Ernst, D. (2021). An application of deep reinforcement learning to algorith- Yu, L., Zhou, R., Tang, L., & Chen, R. (2018). A DBN-based resampling SVM ensemble
mic trading. Expert Systems with Applications, 173, Article 114632. learning paradigm for credit classification with imbalanced data. Applied Soft Com-
Tsantekidis, A., Passalis, N., Tefas, A., Kanniainen, J., Gabbouj, M., & Iosifidis, A. (2017). puting, 69, 192–202.
Forecasting stock prices from the limit order book using convolutional neural net- Zhang, K., Koppel, A., Zhu, H., & Basar, T. (2020). Global convergence of policy gradient
works. In 2017 IEEE 19th conference on business informatics (CBI) (pp. 7–12). 1. methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization,
Verma, S., Sharma, R., Deb, S., & Maitra, D. (2021). Artificial intelligence in marketing: 58(6), 3586–3612.
Systematic review and future research direction. International Journal of Information Zhang, W., Li, L., Zhu, Y., Yu, P., & Wen, J. (2022). CNN-LSTM neural network model
Management Data Insights, 1(1), Article 100002. for fine-grained negative emotion computing in emergencies. Alexandria Engineering
Wang, H. (2019). Large scale continuous-time mean-variance portfolio allocation via re- Journal, 61(9), 6755–6767.
inforcement learning. Available at SSRN 3428125. Zhang, X., Zhang, Y., Wang, S., Yao, Y., Fang, B., & Philip, S. Y. (2018). Improving stock
Wang, Y., & Xu, W. (2018). Leveraging deep learning with LDA-based text analytics to market prediction via heterogeneous information fusion. Knowledge-Based Systems,
detect automobile insurance fraud. Decision Support Systems, 105, 87–95. 143, 236–247.
West, J., & Bhattacharya, M. (2016). Intelligent financial fraud detection: A comprehensive Zhu, B., Yang, W., Wang, H., & Yuan, Y. (2018). A hybrid deep learning model for con-
review. Computers & security, 57, 47–66. sumer credit scoring. In 2018 international conference on artificial intelligence and big
Xiong, Z., Liu, X. Y., .Zhong, S., Yang, H., & Walid, A. (2018). Practical deep reinforcement data (ICAIBD) (pp. 205–208). IEEE.
learning approach for stock trading. arXiv preprint arXiv:1811.07522.

