ABSTRACT
Applying quantum computing techniques to machine learning has recently attracted widespread attention, and quantum machine learning has become an active research topic. Machine learning has three major categories: supervised, unsupervised, and reinforcement learning (RL). However, quantum RL has made the least progress compared with the other two areas. In this study, we implement the well-known RL algorithm Q learning with a quantum neural network and evaluate it in the grid world environment. RL is learning through interaction with the environment, with the aim of discovering a strategy that maximizes the expected cumulative reward. RL problems pose unique challenges because of their sequential nature, potentially long-delayed reward signals, and large or infinite state and action spaces. This study extends our previous work on solving the contextual bandit problem with a quantum neural network, where the reward signal is immediate after each action.
1. INTRODUCTION
Great progress has been made in artificial intelligence, machine learning, deep learning, and reinforcement learning (RL). Deep learning based on neural networks has demonstrated its power in supervised learning problems such as computer vision and machine translation. When applied to RL, it has produced remarkable results such as AlphaGo. The goal of RL is to teach an agent how to act from a given state in an unknown environment.

Unlike the one-step decisions of supervised learning, decision making in RL is sequential: the agent takes an action, receives a reward and the next state, and then acts upon that state. The purpose of RL is for the agent to learn a strategy, or policy, that maximizes the long-term cumulative reward. In general, a policy is a distribution over actions given states, which the agent uses to choose the next action based on the current state. The major
2. RELATED WORK
Because of the impressive performance of deep learning, creating neural networks on quantum computers has been a long-time effort [13, 20, 21]. The work in [22] proposed a general method to build a variational quantum circuit in a continuous-variable (CV) quantum architecture, composed of a layered structure of continuously parameterized photonic gates, which can function like any classical neural network, including convolutional, recurrent, and residual networks. Our current study uses the technique in [22] to design a quantum network that implements the Q learning algorithm.
The nonlinearity of classical neural networks, realized by a nonlinear activation function in each layer, plays a key role in their success. In photonic quantum networks, this nonlinearity is provided by non-Gaussian photonic gates such as the cubic phase gate and the Kerr gate. The common layer transformation $f(Wx + b)$ of classical networks, where $f$ is the nonlinear activation function, $W$ is the weight matrix, and $b$ is the bias, can be emulated by a layer of quantum gates $\Phi \circ D \circ U_2 \circ S \circ U_1 \, |x\rangle$ in the CV model, where $D$ is an array of displacement gates, the $U_i$ are interferometers, $S$ are squeezing gates, and $\Phi$ are non-Gaussian gates that provide the nonlinearity.
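As an illustration (a minimal sketch, not the authors' exact circuit), the code below builds one such CV layer on two photonic modes with the Strawberry Fields library; the gate parameters are arbitrary placeholders that would be trained in practice.

```python
# One CV layer, Phi . D . U2 . S . U1, on two modes; all parameters are placeholders.
import strawberryfields as sf
from strawberryfields.ops import BSgate, Dgate, Kgate, Rgate, Sgate

prog = sf.Program(2)
with prog.context as q:
    # U1: interferometer (a beamsplitter plus a phase rotation for two modes)
    BSgate(0.5, 0.1) | (q[0], q[1])
    Rgate(0.3) | q[0]
    # S: squeezing gates
    Sgate(0.4) | q[0]
    Sgate(0.4) | q[1]
    # U2: second interferometer
    BSgate(0.6, 0.2) | (q[0], q[1])
    Rgate(0.2) | q[1]
    # D: displacement gates (analogue of the bias term b)
    Dgate(0.2) | q[0]
    Dgate(0.2) | q[1]
    # Phi: Kerr gates, the non-Gaussian source of nonlinearity
    Kgate(0.1) | q[0]
    Kgate(0.1) | q[1]

eng = sf.Engine("fock", backend_options={"cutoff_dim": 6})
state = eng.run(prog).state
print(state.mean_photon(0))  # (mean, variance) of the photon number in mode 0
```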
One quantum advantage of this type of quantum network is that, for certain problems, a classical neural network would require exponentially many resources to approximate it. It is shown in [22] that when a Fourier transform is applied to the first and last layers of a quantum network, which changes the input states from position states $|x\rangle$ to momentum states $|p\rangle$ and the position
3. METHODS
3.1. Grid World Environment
RL is characterized by the agent, the environment, and their interaction; therefore, each RL algorithm needs to be tested in certain environments. In this report, we use the grid world environment, a commonly adopted benchmark in RL. Compared with the contextual bandit problem studied in [19], the grid world is harder because it delays the positive reward until the goal state is reached. A grid world is a 2D rectangular grid in which an agent starts at one grid square (the start state) and tries to move to another grid square (the goal state). The goal of an RL algorithm is for the agent to learn optimal paths on the grid, arriving at the goal state in the least number of steps. The agent can move up, down, left, or right by one grid square. Our grid world is similar to the Frozen Lake environment from gym [https://gym.openai.com/envs/FrozenLake-v0/] but with a smaller 2 × 3 size, whereas the standard size is 4 × 4. The 2 × 3 grid contains four possible types of squares: Start (S), Frozen (F), Hole (H), and Goal (G) (Figure 1). The agent moves around the grid until it reaches the goal or the hole. If it falls into the hole, it receives a reward of 0 and has to start from the beginning. The process continues until the agent eventually learns how to reach the goal. A visual description of this grid world is given in Figure 1.
The agent in this environment has four possible moves: up, down, left, and right. The standard Frozen Lake environment has an option to make moves slippery; when slippery, each action can result in a random move, since the agent may slip in different directions on a frozen surface. An episode ends when the agent reaches the goal or falls into the hole. It receives a reward of 1 if it reaches the goal and zero otherwise. This is in clear contrast with the contextual bandit problem [19], where the reward is immediate after each action. In the grid world problem, the agent gets one reward signal at the end of a long episode of interaction with the environment, which makes it difficult to identify exactly which actions were the good ones.
Figure 1. The grid world of size 2 × 3 used in this study. Each grid square (state) is labeled with a letter with the following meaning: Start (S), Frozen (F), Hole (H), and Goal (G). The state G has a reward of one and all other states have a reward of zero. Each grid square (state) is also represented as an integer from 0 to 5, with the top row: 0, 1, 2 (left to right) and the bottom row: 3, 4, 5 (left to right).
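For concreteness, a minimal sketch of such a 2 × 3 grid world is given below. The state numbering follows Figure 1, but the class, its method names, and the placement of the hole and goal squares are our own illustrative assumptions rather than the gym API or the exact layout used in the experiments.

```python
# Minimal 2 x 3 grid world in the spirit of Figure 1 (states 0-5).
# goal = 5 and hole = 4 are assumptions for illustration only.
import random

class GridWorld2x3:
    # actions: 0 = up, 1 = down, 2 = left, 3 = right
    def __init__(self, n_rows=2, n_cols=3, start=0, goal=5, holes=(4,), slippery=False):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.start, self.goal, self.holes = start, goal, set(holes)
        self.slippery = slippery
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        if self.slippery and random.random() < 1.0 / 3.0:
            # with some probability the agent slips in a random direction
            action = random.randrange(4)
        row, col = divmod(self.state, self.n_cols)
        if action == 0:
            row = max(row - 1, 0)
        elif action == 1:
            row = min(row + 1, self.n_rows - 1)
        elif action == 2:
            col = max(col - 1, 0)
        else:
            col = min(col + 1, self.n_cols - 1)
        self.state = row * self.n_cols + col
        done = self.state == self.goal or self.state in self.holes
        reward = 1.0 if self.state == self.goal else 0.0
        return self.state, reward, done
```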
3.3. Q Learning
In general, there are two main approaches to RL: 1) learn a model of the world from experience (or use a given one) and then plan with that model to discover a policy (model-based), and 2) learn a policy directly, or learn a value function from experience and define a policy from it indirectly (model-free). The policy gradient approach tries to learn a good policy directly, while Q learning first learns the value $Q(s, a)$; from this Q function, an optimal policy can then be deduced. Q learning can learn the optimal policy even when actions are selected by a different policy, including a random policy.
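As a small illustration of how a policy is deduced from a learned Q function, the greedy policy simply takes the action with the largest Q value in each state. The sketch below assumes a tabular Q stored as a NumPy array; the epsilon-greedy variant is a common behaviour policy used during learning.

```python
import numpy as np

def greedy_policy(Q):
    # Q has shape (n_states, n_actions); the greedy policy picks argmax_a Q(s, a)
    return np.argmax(Q, axis=1)

def epsilon_greedy_action(Q, state, epsilon=0.1):
    # behaviour policy used while learning: mostly greedy, occasionally random
    if np.random.random() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))
```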
Q learning [24, 25] is a model-free RL method that estimates the long-term expected return of executing an action a in state s. Without a model, or in an unknown environment, the agent has to learn from experience through an iterative process. The estimated returns, known as Q values, are learned iteratively with the following update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{2}$$

where the max is taken over all possible actions in state $s_{t+1}$, $\gamma \in [0, 1)$ is the discount factor, and $\alpha \in (0, 1]$ is the learning rate. The update in Equation (2) builds the Q function by following a greedy target, since the TD target uses the maximum Q value of the next state. To address the challenge of delayed rewards in many RL problems, Equation (2) gives credit to past actions by propagating the final reward backward with a discount factor. Equation (2) is based on the Bellman equation, a key tool in RL. In a simple problem where the state space is small and every state is known, the Q function can be represented as a table with states as rows and actions as columns. However, in a problem with a large state space or unknown states, deep neural networks are used instead, with the loss function being the mean squared error between the predicted Q value and the TD-target Q value.
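To make the update in Equation (2) concrete, the sketch below runs tabular Q learning on a small grid world. It uses the illustrative GridWorld2x3 class sketched in Section 3.1 and our own choice of hyperparameters; in this paper, the table is replaced by a quantum neural network that predicts the Q values.

```python
import numpy as np

def q_learning(env, n_states=6, n_actions=4, episodes=2000,
               alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=100):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):          # cap episode length for safety
            # epsilon-greedy behaviour policy (Q learning is off-policy)
            if np.random.random() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Equation (2): move Q(s, a) toward the TD target
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
            if done:
                break
    return Q

# Example usage with the illustrative environment from Section 3.1:
# Q = q_learning(GridWorld2x3())
# print(np.argmax(Q, axis=1))   # greedy action per state
```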
Figure 2. On the left is the logical representation of the network that computes the Q function (assuming there are 4 actions), and on the right is the physical representation of the actual parametrized circuit structure for a CV quantum neural network made of photonic gates: interferometer, displacement, rotation, squeezing, and Kerr (non-Gaussian) gates. The output is the Fock space measurements. More details of this quantum network can be found in [19, 20].
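One possible way to read Q values out of the Fock-space measurements is sketched below. This is an illustrative assumption only (the exact measurement scheme is described in [19, 20]): the mean photon number of each output mode is interpreted as the Q value of one of the four actions.

```python
# Illustrative readout only, not necessarily the scheme used in [19, 20]:
# treat the mean photon number of each output mode as one action's Q value.
def q_values_from_state(state, n_actions=4):
    # state is a Strawberry Fields Fock-backend state object
    return [state.mean_photon(mode)[0] for mode in range(n_actions)]
```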
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this paper.