
Algorithmic Pricing and Liquidity in Securities Markets∗

Jean-Edouard Colliard, Thierry Foucault, and Stefano Lovo

October 18, 2022

Abstract

We let “Algorithmic Market-Makers” (AMMs), using Q-learning algorithms, choose prices for

a risky asset when their clients are privately informed about the asset payoff. We find that

AMMs learn to cope with adverse selection and to update their prices after observing trades,

as predicted by economic theory. However, in contrast to theory, AMMs charge a mark-up over

the competitive price, which declines with the number of AMMs. Interestingly, markups tend

to decrease with AMMs’ exposure to adverse selection. Accordingly, the sensitivity of quotes

to trades is stronger than that predicted by theory and AMMs’ quotes become less competitive

over time as asymmetric information declines.

Keywords: Algorithmic pricing, Market Making, Adverse Selection, Market Power, Reinforce-

ment learning.

JEL classification: D43, G10, G14.


Correspondence: colliard@hec.fr, foucault@hec.fr, lovo@hec.fr. All authors are at HEC Paris, Department of
Finance, 1 rue de la Libération, 78351 Jouy-en-Josas, France. We are grateful to participants in “The Microstructure
Exchange”, the Microstructure Asia Pacific Online Seminar, and seminars at the University of Copenhagen, University
Paris 1, and HEC Paris for helpful comments and suggestions. We thank Olena Bogdan, Amine Chiboub, Chhavi Ras-
togi and Andrea Ricciardi for excellent research assistance. This work was supported by the French National Research
Agency (F-STAR ANR-17-CE26-0007-01, ANR EFAR AAP Tremplin-ERC (7) 2019), the Investissements d’Avenir
Labex (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047), the Chair ACPR/Risk Foundation “Regulation
and Systemic Risk”, the Natixis Chair “Business Analytics for Future Banking”.



Introduction

Firms (e.g., retailers, airlines, hotels, energy providers, etc.) increasingly rely on algorithms to set the price of their products.1 This evolution reflects the efficiency and predictive gains of artificial intel-

ligence but it generates new concerns, in particular about price discrimination and tacit collusion

among algorithms (see MacKay and Weinstein (2022), CMA (2018), OECD (2017)).2 Surprisingly,

this worry has not been expressed for market makers in securities markets even though propri-

etary trading firms (market makers such as Citadel, Virtu, Jane Street, etc.) began using pricing

algorithms at least two decades ago and now dominate liquidity provision in exchanges.3

Is this lack of concern justified? Is tacit collusion among pricing algorithms more difficult in

securities markets? To study these questions, we consider a framework in which “algorithmic market

makers” (AMMs) compete in prices and are at risk of trading with better informed investors. Our

setting is, by design, very similar to standard models of market making with asymmetric information

(in the spirit of Glosten and Milgrom (1985) and Kyle (1985)). However, in contrast to these models,

we assume that quotes are posted by AMMs that set their quotes using Q-learning algorithms, a

special type of reinforcement learning algorithm (often mentioned as the type of algorithms used for

pricing decisions; see CMA (2018)). We focus on whether Q-learning algorithms cope with adverse

selection, whether they learn to account for the information contained in trades when choosing prices, and whether

their prices are competitive. To our knowledge, our paper is the first to analyze how Q-learning

algorithms behave in the presence of asymmetric information (an important feature of trading in

securities markets).

In our framework, AMMs simultaneously post offers in response to clients’ requests to buy

one share of a risky asset. Clients’ valuation for the asset is the sum of the payoff of the asset (a

common value component) and a component specific to each client (a private valuation component).
1. For instance, Chen et al. (2016) find that more than 500 third-party sellers on the Amazon marketplace were using algorithms to price their products.
2. For instance, MacKay and Weinstein (2022) write: “The explosion in the use of pricing algorithms over the past decade has sparked concerns about the effect on competition and consumers [...].”
3. These firms are often referred to as “high-frequency market makers” because their algorithms (and hardware equipment) generate very frequent new orders (quotes, cancellations, etc.). Menkveld (2013) finds that, in 2007-2008, a single high-frequency market maker accounted for about 15% of total trading volume in Dutch stocks (and more than 60% on one of the trading platforms for these stocks). Brogaard et al. (2015) find that fast traders on the Stockholm Stock Exchange are primarily market makers, who account for 83% of all limit orders on this exchange (and 44% of trading volume).



Clients arrive sequentially and each one trades with the dealers posting the best offer in response

to her request, provided that this offer is less than her valuation.4 As clients’ demand for the asset

increases with the common value, dealers are exposed to adverse selection (they are more likely to

sell the asset when its payoff is high than when it is low). In the baseline case, a new realization of

the asset payoff is drawn after each client’s arrival.

AMMs behave as follows. In each trading round, each AMM starts with an assessment of the

expected profit associated with each possible price, picks a price (on a grid) based on this assessment,

and updates its assessment of the expected profit associated with this price by taking a weighted

average (with pre-specified weights) of its realized profit at the end of the trading round and its

prior assessment of its expected profit. The assessment of the expected profit associated with other

possible prices is unchanged. In each round t, the AMM picks the price that generates the largest

expected profit according to its assessment with a given probability of “exploitation”. Otherwise,

it “explores” by picking at random (with equal probability for each possible price) another price.

Exploration enables the AMM to receive feedback about the profit generated by a price and therefore

to “learn” the expected profit associated with this price.

This iterative process is repeated over a large number of “episodes” (each made of one trading

round), which collectively constitute one “experiment”. In a given experiment, the set of parameters

(e.g., the number of AMMs, the distribution of the asset payoff, and the distribution of each client’s

private valuations) is constant across episodes and forms the “environment”. For each environment

considered in our analysis (i.e., for a fixed set of parameters), we run 10,000 experiments, each made of 200,000 episodes. In each experiment we record each AMM’s quote, the transaction price, and the trading volume (0 or 1) in each episode.

In early episodes of a given experiment, AMMs “learn” the expected profit associated with

each possible price, which leads to significant volatility in prices. After a large number of episodes,

their pricing strategy eventually “converges” in most experiments, in the sense that AMMs keep

playing the same price over a large number of episodes. However, this “long run” price can vary
4. In electronic securities markets (e.g., the electronic limit order book markets used in most of the major stock markets in the world), market makers compete in prices (“à la Bertrand”) with no room for product differentiation. Price priority is strictly enforced, which guarantees that clients’ orders are filled at the best price, as assumed in our analysis.



from one experiment to another. Thus, our analysis focuses on the empirical distribution of final

outcomes (e.g., transaction prices and dealers’ profits) across experiments (holding the environment

constant). We study how this distribution varies with parameters of the environment (in particular

the intensity of adverse selection) and we systematically compare final outcomes to those predicted

by economic theory. When there are multiple dealers, the outcomes predicted by theory (e.g.,

transaction prices and profits) are those corresponding to the Bertrand-Nash equilibrium of the

environment considered in our simulations (accounting for the fact that market makers must post

their quotes on a grid).5 When there is a single dealer, predicted outcomes are those corresponding

to the equilibrium of the environment in which the dealer behaves as a monopolist.

We observe several interesting regularities. First, when there is a single AMM, it does not

necessarily learn the monopoly price and its average quote is smaller than the monopoly price. In

contrast, when there are two AMMs, they charge a price above the competitive price on average

(even the smallest price in the distribution of observed prices is well above the competitive price).

This markup decreases with the number of AMMs and becomes close to zero only with 10 AMMs.

Second, in all environments, AMMs learn how not to be adversely selected. That is, they

charge prices that (more than) cover adverse selection costs. However, when their exposure to

adverse selection increases (either because the volatility of the asset payoff increases or the variance

of clients’ private valuations decreases), AMMs tend to choose prices that are more competitive

(in particular, their realized bid-ask spreads are smaller on average). This is particularly striking

when the variance of clients’ private valuations increases. In this case, AMMs’ offers (and therefore

transaction prices) increase, even though the competitive (Nash-Bertrand) price decreases because

adverse selection costs decline. Overall these findings suggest that adverse selection interacts with

the way Q-learning algorithms learn in non-intuitive ways.

In existing models (Kyle (1985) or Glosten and Milgrom (1985)), market makers are assumed

to learn the asset payoff from the trading history (the “order flow”) in a Bayesian way. For this

reason, holding the asset payoff constant, these models predict that dealers eventually discover asset

payoffs. For instance, dealers’ pricing errors (the average squared difference between the asset payoff
5. In particular, as is well-known, price discreteness can generate multiple Bertrand-Nash equilibria. We take this into account in our analysis.



and the transaction price) decrease over time (trades) on average (see, for instance, Glosten and

Milgrom (1985)). To study whether AMMs can also discover asset values, we extend our baseline

setting to allow for two trading rounds per episode. We find that AMMs behave qualitatively like

a Bayesian learner would. That is, after a buy (no trade), they increase (reduce) their offer in the

second trading round. Thus, price discovery takes place, even though AMMs have no knowledge

of the data generating process and their learning process is not Bayesian. However, observing the

outcome of the first period brings new information to the algorithms, which then face less adverse

selection in the second period. As adverse selection curbs algorithms’ rent-seeking behavior, prices

become less competitive on average in the second period. Moreover, this effect is stronger after

observing a trade than after observing no trade in the first period. In this sense, AMMs seem to

overreact to trades relative to a Bayesian learner.

In summary, our findings have two main implications. First, algorithmic market makers settle on

non competitive prices, even though they operate in an environment where economic theory predicts

competitive outcomes. This echoes findings in Hendershott et al. (2011) and Brogaard and Garriott

(2019). The former find empirically that algorithmic trading (AT) increases dealers’ expected profits

net of adverse selection costs (realized bid-ask spreads). Commenting on this result, they write (on

p.4): “This is surprising because we initially expected that if AT improved liquidity, the mechanism

would be competition between liquidity providers.” Brogaard and Garriott (2019) study the effect

of high frequency market makers’ entry on one trading platform for Canadian stocks. They find

that this entry triggers a decrease in bid-ask spreads. However, the entry of two competitors is

not sufficient to obtain the competitive outcome, in contrast to what standard models of market

making predict. This pattern is exactly what we find when we compare average bid-ask spreads

across environments that only differ by the number of AMMs. Our second main implication is that

adverse selection induces AMMs to post quotes that are more competitive. In a cross-section of

assets, this means that realized bid-ask spreads for AMMs (a measure of dealers’ expected profits net

of adverse selection costs) should be smaller in assets that are more exposed to informed trading.

For instance, they should be smaller for stocks than for Treasuries (high frequency market makers

are active in both types of assets), even though adverse selection costs are larger for stocks. This



implication also holds dynamically: as adverse selection is resolved over time we expect AMMs to

quote less competitively, contrary to the predictions of standard asymmetric information models.

It is worth stressing that we do not claim that market-making algorithms necessarily behave

as our AMMs do.6 This does not mean, however, that the patterns uncovered by our analysis are

unlikely to hold in reality.7 Our approach is to make stylized assumptions on pricing algorithms to

develop predictions about their effects on securities markets. In particular, our assumptions capture

that (some) algorithms used in practice rely on experimentation (“trial and error”) but eventually

experiment less and less, as experimentation is costly. We believe these properties are reasonable in

a financial context. As explained above, this approach delivers predictions that are quite different

from those of standard economic models in the same environment. To decide which approach has

more explanatory power, the next step (which is beyond the scope of this paper) would be to test

these predictions empirically.

The rest of the paper proceeds as follows. In the next section, we position our contribution in the

literature. In Section 2, we present the economic environment analyzed in our paper. In Section 3, we study the case in which each episode has a single trading round. In this case, our analysis focuses on how AMMs’ behavior differs from two benchmarks: (a) competitive behavior (Nash-Bertrand

equilibrium) and (b) monopolistic behavior (monopoly prices). In Section 4, we study whether

AMMs can discover asset fair values by considering the case in which each episode has two trading

rounds. Section 5 concludes.

1 Contribution to the literature

Our paper is related to the emerging literature on algorithmic pricing and the possibility for algo-

rithms to sustain non competitive outcomes. Calvano et al. (2020) show that Q-learning algorithms
6. There is not much guidance on the actual design of market-making algorithms in reality because market-making firms do not disclose information on their algorithms. Securities trading firms strongly pushed back against a regulator’s attempt to require disclosure of their computer code for surveillance purposes. See “US regulator declares ‘dead’ moves to seize HFT code”, Financial Times, October 14, 2017. In the EU, proprietary trading firms must take steps to ensure that their algorithms will not lead to disorderly markets. However, they do not have to disclose their algorithms to regulators.
7. The behavior of market makers in existing economic models is also highly stylised and, in contrast to our approach here, they are assumed to have complete knowledge of their environment (e.g., the distribution of asset payoffs, traders’ valuations, etc.). Yet, these models have explanatory power for the behavior of security prices at high frequency (see, for instance, Glosten and Harris (1988) and the subsequent literature using price impact regressions).



can learn dynamic collusive strategies in a repeated differentiated Bertrand game. Asker et al.

(2021) and Abada et al. (2022) show that supra-competitive prices can be reached in this type

of environment even if dynamic strategies are ruled out, through what Abada et al. (2022) call

“collusion by mistake”.8 Cartea et al. (2022a) and Cartea et al. (2022b) study different families

of reinforcement learning algorithms and develop new methods to study which ones may lead to

non Nash behavior in a market-making environment.9 Banchio and Skrzypacz (2022) find that Q-

learning algorithms post less competitive bids in first price auctions than in second price auctions.

In contrast to our setting, bidders and sellers in their model have a fixed valuation for the auctioned good and bidders are not exposed to adverse selection (they consider private value auctions).

In sum, in line with other papers, we find that pricing algorithms relying on Q-learning can lead

to non competitive outcomes even when dynamic strategies are ruled out and when price setters

compete in prices. However, new to the literature, we find that adverse selection tends to mitigate

this issue.10 Moreover, to our knowledge, we are the first to study price discovery with Q-learning

algorithms (an issue specific to securities markets).

Our paper also contributes to the literature on algorithmic trading in securities markets. The

theoretical literature on this issue (e.g., Biais et al. (2015), Budish et al. (2015), Menkveld and

Zoican (2017), Baldauf and Mollner (2020), etc.) has mainly focused on how the increase in the

speed with which algorithms can respond to information increases or reduces liquidity suppliers’

exposure to adverse selection, using traditional workhorse models (Glosten and Milgrom (1985)

or Kyle (1985)). Yet, O’Hara (2015) calls for the development of new methodologies to study the

effects of algorithms in financial markets, writing that as a result of algorithmic trading: “the data

that emerge from the trading process are consequently altered [...] For microstructure researchers, I

believe these changes call for a new research agenda, one that recognizes how the learning models used

in the past are lacking [...].” Our paper responds to this call. Instead of modeling algorithmic traders

as Bayesian learners with omniscient knowledge of the environment in which they operate, we
8. This idea is in line with an earlier literature in machine learning showing that games between Q-learning algorithms do not necessarily converge to a Nash equilibrium (Wunder et al., 2010).
9. In particular, Cartea et al. (2022b) show that using a finer pricing grid (a lower “tick size”) reduces the scope for collusion.
10. Another distinctive feature of our setting is that the demand faced by pricing algorithms is stochastic. See also Hansen et al. (2021) and Cartea et al. (2022b) for other settings in which selling algorithms face a stochastic demand elasticity, but without adverse selection.



model them as Q-learning algorithms. These algorithms learn by trial and error with almost no prior

knowledge of the environment, which represents the polar opposite of standard Bayesian learning.

Moreover, Q-learning is relatively simple and transparent, which makes it a good candidate for

a workhorse model of algorithmic interaction, much like the Glosten-Milgrom environment is a

workhorse model of market-making. This approach generates strikingly different implications from

those of canonical Bayesian-learning models. In particular, price competition does not guarantee

a competitive outcome and, maybe even more surprisingly, increased adverse selection can reduce

dealers’ rents.

2 The economic environment

In this section, we provide a general description of the economic environment considered in our

experiments (Section 3.3). We consider the market for one risky asset with t = 1, 2, ..., T episodes

(one can think of them as “trading days”). Quotes in this market are posted by N dealers who

trade with clients. Each episode has τ̄ trading rounds and the asset payoff ṽ is realized at the end of

the last trading round in a given episode. This payoff has a binary distribution, ṽ ∈ {vL , vH }, with

vL ≤ vH and µ := Pr(ṽ = vH ) = 1/2. We denote ∆v = vH − vL . Realizations of the asset payoffs are

independent across episodes. In the rest of this section, we describe traders’ actions and realized

profits in a given episode.

In each trading round τ , a new trader (the “client”) arrives to buy one share of the asset. The

buyer’s valuation for the asset is v^C_τ = ṽ + L̃τ , where L̃τ is i.i.d. across trading rounds. Clients’ private valuations are assumed to be normally distributed with mean zero and variance σ². The

buyer observes her valuation for the asset and requests quotes from the dealers, who simultaneously

respond by posting their offers a(τ ) = {a1τ , ..., aN τ }. The ask price anτ is the price at which dealer

n is ready to sell at most one share in trading round τ . We denote by (i) amin
τ = min{anτ } the
n

smallest ask price, (ii) Dτ the set of dealers posting this price and (iii) zτ the number of dealers in

Dτ . The client buys the asset if the best offer is less than her valuation for the asset (amin
τ ≤ vτC )

and, in this case, she splits her demand among the zτ dealers posting this price. Otherwise she does

not trade.



Let us denote by V (a(τ ), L̃τ , ṽ) the volume of trade in round τ . It equals 1 if the client buys the asset, i.e., if the client’s valuation ṽ + L̃τ is not smaller than the lowest price a^min_τ , and it is zero otherwise. Let us denote by Z(a_{nτ} , a(τ )) the fraction of the τ-th round trade executed by dealer n, that is, Z(a_{nτ} , a(τ )) = 1/zτ if a_{nτ} = a^min_τ and zero otherwise. In trading round τ , dealer n’s realized trading volume is:

I(a_{nτ} , a(τ ), L̃τ , ṽ) := V (a(τ ), L̃τ , ṽ) Z(a_{nτ} , a(τ )),   (1)

and his realized profit is:

Π(a_{nτ} , a(τ ), L̃τ , ṽ) := I(a_{nτ} , a(τ ), L̃τ , ṽ)(a^min_τ − ṽ).   (2)

Hence, dealer n’s total realized profit in a given episode is:

Σ_{τ=1}^{τ̄} Π(a_{nτ} , a(τ ), L̃τ , ṽ).   (3)

In our setting, holding prices constant, dealers are more likely to sell the asset when the asset

payoff is high than when the asset payoff is low. Indeed, conditional on a realization of v, the

likelihood that a trade occurs in trading round τ is:

D(a^min_τ , v) := Pr(a^min_τ ≤ v + L̃τ ) = 1 − G(a^min_τ − v),   (4)

where G is the cumulative distribution function of L̃τ . This probability increases with v, as D(a^min_τ , vH ) > D(a^min_τ , vL ). Thus, dealers are exposed to adverse selection: they are more likely to sell the asset when its payoff is high than when it is low.

Finally, we denote by Π̄(a, µτ ) = N E[Π(a, a, L̃, ṽ)] the dealers’ expected aggregate profit when they all post the same price a and attach probability µτ to the event ṽ = vH . This gives:

Π̄(a, µτ ) := µτ D(a, vH )(a − vH ) + (1 − µτ )D(a, vL )(a − vL ).   (5)
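To fix ideas, the trading probability D(a, v) and the expected aggregate profit Π̄(a, µ) can be evaluated as in the short Python sketch below. This is our own illustration under the normal assumption for L̃τ ; the function names and the printed price grid are ours, not the authors'.

```python
# Minimal sketch (ours): D(a, v) and Pi_bar(a, mu) from equations (4)-(5),
# assuming clients' private valuations L ~ N(0, sigma^2).
from scipy.stats import norm

def trade_prob(a, v, sigma):
    """D(a, v) = Pr(v + L >= a) = 1 - G(a - v), with G the CDF of L."""
    return 1.0 - norm.cdf(a - v, loc=0.0, scale=sigma)

def expected_profit(a, mu, v_low, v_high, sigma):
    """Pi_bar(a, mu) = mu*D(a, vH)*(a - vH) + (1 - mu)*D(a, vL)*(a - vL)."""
    return (mu * trade_prob(a, v_high, sigma) * (a - v_high)
            + (1.0 - mu) * trade_prob(a, v_low, sigma) * (a - v_low))

if __name__ == "__main__":
    # Baseline parameters used later in the paper: vH = 4, vL = 0, sigma = 5.
    for a in range(1, 16):
        print(a, round(expected_profit(a, 0.5, 0.0, 4.0, 5.0), 4))
```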

In the rest of the paper, we study how AMMs using Q-learning algorithms set their prices in

such an environment. We consider two cases. In the first case, analyzed in Section 3, we consider



an environment in which τ̄ = 1 (a single trading round per episode). Our focus in this case

is on whether and how outcomes when prices are set by AMMs differ from those obtained in two

benchmarks: (i) the Nash-Bertrand equilibrium with multiple dealers (the competitive case) and (ii)

the case in which dealers set their price to maximize their aggregate expected profit (the monopoly

case). This comparison will help us to analyze how AMMs exert market power and cope with

adverse selection relative to rational Bayesian dealers. In the second case, analyzed in Section 4,

we consider a dynamic environment in which τ̄ = 2 (two trading rounds per episode) and we focus

on price discovery (i.e., on how AMMs adjust their quotes over time).

3 The Static Case (τ̄ = 1)

In this section, we compare the pricing policies chosen by AMMs using a Q-learning algorithm

to equilibrium outcomes predicted by standard economic analysis in the environment described in

Section 2 when there is a single trading round per episode. We refer to this case as the static case

since, when we solve for dealers’ equilibrium pricing policies in this case, they behave as if they

were facing a static one-shot problem. We proceed in three steps. First, in Section 3.1, we derive

the equilibrium outcomes in two benchmark cases (the monopolist and competitive cases). Then,

in Section 3.2, we describe the Q-learning algorithms used by AMMs to choose their pricing policy

when τ̄ = 1. Third, in Section 3.3, we compare the pricing policies chosen by AMMs to those

obtained in the benchmarks.

3.1 Benchmarks

Monopolist Case. In the monopolist case, in each episode, each dealer chooses her price, denoted

a^m , to maximize dealers’ aggregate expected profit. Recalling that Pr(ṽ = vH ) = µ = 1/2, a^m solves:

a^m ∈ arg max_a Π̄(a, 1/2).   (6)

Economic theory predicts that this price should be the equilibrium outcome when N = 1.

Competitive Case. In the competitive case, dealers choose a price ac such that each dealer’s



expected profit is nil. That is,

a^c s.t. Π̄(a^c , 1/2) = 0.   (7)

When the set of prices is continuous, ac is the Bertrand-Nash equilibrium of the game played by

dealers in each trading round. This is the outcome predicted by economic theory when N ≥ 2.

We explain how to obtain ac and am in Appendices A.4 and A.3 and we find (numerically) that

(i) the competitive price, ac , increases with ∆v and decreases with σ while (ii) the monopoly price

increases with both ∆v and σ. We provide a numerical example in Table 1 where we report am and ac

when ∆v = 4 and σ = 5 (vH = 4 and vL = 0, so that E(ṽ) = 2), the baseline values of the parameters

in our experiments. The table also reports the expected half-quoted spread, E(a − E(ṽ)) = a − E(ṽ),

(the difference between the ask price posted by each dealer and the unconditional expected payoff

of the asset) and the expected half realized spread, E(a − ṽ | V = 1). In contrast to the average half

quoted spread, the expected half realized spread measures the expected profit of a dealer conditional

on a trade by the client. As this trade is more likely when v is high, this measure accounts for the

adverse selection cost borne by the dealer. In fact the difference between average half quoted spread

and average half realized spread is often used as a measure of adverse selection costs in empirical

studies.11 Last, observe that the total expected profit of a dealer is the expected half realized spread times the probability that the client trades (Π̄(a, µ) = Pr(V = 1) E(a − ṽ | V = 1)).
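As an illustration, the two benchmark prices can be recovered numerically along the following lines. This is a sketch of equations (6) and (7) under our own choice of solver calls, not the authors' procedure (which is described in Appendices A.3 and A.4).

```python
# Sketch (ours): numerical benchmark prices for Delta_v = 4, sigma = 5.
from scipy.optimize import brentq, minimize_scalar
from scipy.stats import norm

V_L, V_H, SIGMA, MU = 0.0, 4.0, 5.0, 0.5

def D(a, v):
    # Probability that the client buys at price a when the payoff is v.
    return 1.0 - norm.cdf(a - v, scale=SIGMA)

def pi_bar(a):
    # Expected aggregate dealer profit when all dealers quote a (equation (5)).
    return MU * D(a, V_H) * (a - V_H) + (1.0 - MU) * D(a, V_L) * (a - V_L)

# Competitive (Bertrand-Nash) price: zero expected profit, equation (7).
a_c = brentq(pi_bar, (V_H + V_L) / 2.0, 15.0)

# Monopoly price: maximizes expected aggregate profit, equation (6).
a_m = minimize_scalar(lambda a: -pi_bar(a), bounds=(V_L, 15.0), method="bounded").x

# Best price for a monopolist restricted to the integer grid A = {1, ..., 15}.
a_m_grid = max(range(1, 16), key=pi_bar)
print(a_c, a_m, a_m_grid)
```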

[Insert Table 1 about here]

When the dispersion of clients’ private valuations (σ) increases or the volatility of the asset payoff

(∆v) decreases, dealers’ ask prices become lower in the competitive case because dealers’ adverse

selection costs decline. In contrast, when the dispersion of clients’ private valuations increases, the

monopolist’s offer becomes larger, despite the fact that her adverse selection costs decline. This

reflects an increase in dealers’ rents, as shown by the increase in the expected realized spread. The

reason is that as σ increases, clients’ demand becomes more inelastic, which as usual enables a

monopolistic dealer to extract larger rents. In contrast, when ∆v increases, the client’s demand

becomes more elastic and the adverse selection cost increases. As a result, the monopolist dealer
11. See Foucault et al. (2013), ch. 2, for a description of various measures of bid-ask spreads in securities markets and their interpretation.



charges a larger price but she obtains smaller rents (the realized bid-ask spread declines).

3.2 Q-Learning Algorithms

3.2.1 Description of the Algorithms

We now describe the functioning of Q-learning algorithms in the environment described in Section

2 when τ̄ = 1.12 We use the same notation as in Section 2, unless otherwise stated. In contrast to the benchmark case, we restrict AMMs to choose their quotes in a discrete and finite action set A = {a1 , a2 , ..., aM }, where each am is a possible ask price.13

To each dealer n and episode t, we associate a so-called Q-Matrix Qn,t ∈ RM ×1 . In this section,

Qn,t is simply a column vector of size M . The m-th entry of the matrix is denoted qm,n,t where

qm,n,t represents the estimate in episode t of the payoff that AMM n expects from playing price am .

The Q-learning algorithm is meant to refine the payoff estimates in Qn,t over time, and to end up

playing the action associated with the highest estimate.

More formally, the algorithms (AMMs) play the game according to the following process. We

first initialize the matrices Qn,0 with random values: each q_{m,n,0} for 1 ≤ m ≤ M and 1 ≤ n ≤ N is i.i.d. and follows a uniform distribution over [q, q̄]. Then, in each episode t, we do the following:

1. For each dealer n, we define m*_{n,t} = arg max_m q_{m,n,t−1} , the index associated with the highest value in matrix Q_{n,t−1} , and we denote a*_{n,t} = a_{m*_{n,t}} the greedy price of AMM n in episode t. This is the price that seems to maximize the AMM’s static profit, according to the estimates available in episode t.

2. For each dealer n, with probability ϵt = e^{−βt} the AMM “explores” by playing a_{n,t} = a_{m̃_{n,t}} , where β > 0 and m̃_{n,t} is a random integer between 1 and M , all values being equiprobable. The price a_{m̃_{n,t}} is thus a price taken randomly in A. With probability 1 − ϵt , the dealer “exploits” and plays a_{n,t} = a*_{n,t} , the greedy price. The random draws leading to exploring or exploiting are i.i.d.

across all dealers in a given episode.


12. See Calvano et al. (2020) for an introduction to Q-learning algorithms in the more complex case of infinite horizon problems. See also Sutton and Barto (2018) for an introductory textbook on this topic.
13. This constraint is necessary because the algorithm must evaluate the payoff associated with each possible price.



3. We compute a^min_t = min_n a_{n,t} , the best ask in episode t, and z_t , the number of AMMs with a_{n,t} = a^min_t , and we draw ṽt and L̃t . Each dealer n then receives a profit equal to π_{n,t} = Π(a_{n,t} , a(t), L̃t , ṽt ), as given by (2).14

4. We update the Q-matrix of each dealer as follows, with 0 < α < 1: for all 1 ≤ n ≤ N,

q_{m,n,t} = α π_{n,t} + (1 − α) q_{m,n,t−1}   if a_{n,t} = a_m ,
q_{m,n,t} = q_{m,n,t−1}                        if a_{n,t} ̸= a_m .   (8)

5. We then repeat starting from stage 1, until the last episode T .

Intuitively, each Q-learning algorithm alternates between experimenting with random prices, and

playing the price that seems to lead to the highest payoff based on past plays. As the number of

past episodes grows, information accumulates and there should be less value in experimenting. For

this reason, the probability of experimenting decays over time (here exponentially, at rate β). This

important parameter of the algorithm governs the trade-off between experimenting and exploiting.

A second trade-off is how much one should react to a particular observation π_{n,t} , knowing that

payoffs are stochastic. This is governed by the parameter α: if α is large the algorithm reacts

quickly to new observations, but the estimates generated in the Q-matrix are unstable (consider

the extreme case α → 1). Conversely, if α is small the estimates are stable but it will take a lot

of experimentation to move the values of the Q-matrix towards accurate estimates of the expected

payoffs associated with each price.
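To make these mechanics concrete, here is a minimal Python sketch of one experiment in the static case. It is our own illustration (the array layout, function name, and random-number handling are ours), not the authors' implementation.

```python
# Sketch (ours) of steps 1-5 above for the static case (tau_bar = 1).
import numpy as np

def run_experiment(N=2, prices=np.arange(1, 16), v_low=0.0, v_high=4.0,
                   sigma=5.0, alpha=0.01, beta=0.0008, T=200_000,
                   q_low=3.0, q_high=6.0, seed=None):
    rng = np.random.default_rng(seed)
    M = len(prices)
    Q = rng.uniform(q_low, q_high, size=(N, M))   # one Q-vector per AMM
    greedy_path = np.empty((T, N), dtype=int)

    for t in range(1, T + 1):
        # Step 1: greedy action = price with the highest current estimate.
        greedy = Q.argmax(axis=1)
        # Step 2: explore with probability eps_t = exp(-beta * t).
        explore = rng.random(N) < np.exp(-beta * t)
        actions = np.where(explore, rng.integers(0, M, size=N), greedy)
        asks = prices[actions]

        # Step 3: draw the payoff and the private valuation, determine whether a
        # trade occurs, and split it equally among the AMMs posting the best ask.
        v = v_high if rng.random() < 0.5 else v_low
        a_min = asks.min()
        winners = asks == a_min
        traded = v + rng.normal(0.0, sigma) >= a_min
        profits = np.where(winners & traded, (a_min - v) / winners.sum(), 0.0)

        # Step 4: update only the cell of the price actually played (equation (8)).
        idx = np.arange(N)
        Q[idx, actions] = alpha * profits + (1 - alpha) * Q[idx, actions]
        greedy_path[t - 1] = Q.argmax(axis=1)

    return prices[greedy_path]      # greedy price of each AMM in every episode
```

One call to this function corresponds to one “experiment” in the paper's terminology; repeating it K times and keeping the last row of the returned array gives the cross-experiment distribution of final greedy prices studied in Section 3.3.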

3.2.2 Convergence

There are many variants of the Q-learning algorithm, with different specifications for the experi-

mentation probability ϵt and the updating rule (8). The one described in the previous section is

common in practical applications and is also the one used in recent papers in the economic lit-

erature (e.g., Calvano et al. (2020)). We choose it for comparability with prior literature. This

version of Q-learning does not satisfy the assumptions given in, e.g., Watkins and Dayan (1992),
14. We index all variables by the episode counter and omit the trading round index τ within an episode, since τ = 1 in each episode here.



Jaakkola et al. (1994), or Tsitsiklis (1994) to guarantee convergence. In fact, given the design of

these algorithms and the environment in which they operate, Lemma 1 below shows that, for any t (that is, even when T becomes very large), with a probability that is bounded away from 0, the

Q-matrix changes by an amount that is bounded away from 0. Thus, entries in the Q-matrix never

converge.

Lemma 1. (Impossibility of convergence of the Q-matrix) For any given t and a_m ∈ A, if a_{n,t} = a_m = a^min_t , then

Pr (|q_{m,n,t} − q_{m,n,t+1} | ≥ ∆*_m ) ≥ P*_m ,

where

∆*_m := (α/2) (vH − vL + a_m − (vH − vL )/2),

and

P*_m := min ( (1/(2N)) D(a_m , vL ), 1 − (1/2)(D(a_m , vL ) + D(a_m , vH )) ).

For instance, consider the case N = 1 and suppose that the AMM plays the same price a_m for T consecutive periods. Then, as T goes to infinity, the value of the Q-matrix q_{m,t} does not converge in probability to Π̄(a_m , µ), that is, the actual monopolist dealer’s expected profit when she sets a price of a_m (the metric a monopolist would use to set his price in economic theory). Because the Q-matrix does not converge, the price that maximizes the Q-matrix need not stay the same and will vary with probability 1 after sufficiently many episodes. Thus, one cannot expect the

greedy price to remain stable, even when T becomes very large. This also implies that the greedy

price observed in the final episode will vary across experiments.

In actual simulations, when α is small, most experiments give the impression of “converging” in

the sense that, after a sufficiently large number of episodes, the price chosen by each AMM stays

constant for many periods (this is because ∆∗m is linear in α). This is the meaning of “convergence”

in many papers in the literature (e.g., Calvano et al. (2020)). Thus, following the literature, we

say that an experiment has “converged” if all algorithms’ actions have been constant over the last

κT periods (e.g., κ = 0.05). Moreover, to describe the outcome of the interaction between AMMs



we look at the distribution of prices after a large number T of episodes and across a large number

K of experiments, focusing in particular on the mean of the distribution. When needed, we use a

superscript k to denote the outcome of the k-th experiment. For instance, a^{*k}_{n,t} is the greedy price of dealer n in episode t of experiment k.

Keeping these observations in mind, we set the parameters of our baseline simulations as follows.

The parameters of the economic environment are the same as in Table 1: ∆v = 4, σ = 5, vH = 4,

and vL = 0. In addition, the set of available prices A is all integers between 1 and 15 included.

We initialize the Q-matrices with values between q = 3 and q = 6 so that all values of the initial

Q-matrix are above the maximal payoff a dealer can get in a given period.15 There are K = 10, 000

experiments, T = 200, 000 episodes per experiment, and in all experiments we set β = 0.0008 and

α = 0.01. This means that the algorithm chooses to experiment 1249.5 times in expectation, and

hence “tries” each price about 100 times on average. As α = 0.01, this frequency of experimentation

is enough (in expectation) to override the initial values of the Q-matrix.16 Finally, we set κ = 0.05,

so that an experiment is said to “converge” if algorithms’ actions have been unchanged for the last

10,000 episodes.
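The two bookkeeping computations used in this paragraph (the expected number of exploration draws and the κT convergence test) can be sketched as follows; the helper names are ours.

```python
# Sketch (ours) of the exploration count and of the kappa*T convergence criterion.
import numpy as np

def expected_explorations(beta=0.0008, T=200_000):
    """Expected number of exploration draws per AMM: sum_{t=1}^{T} exp(-beta*t)."""
    return np.exp(-beta * np.arange(1, T + 1)).sum()   # about 1249.5 at baseline

def has_converged(greedy_path, kappa=0.05):
    """True if every AMM's greedy price is constant over the last kappa*T episodes."""
    window = int(kappa * len(greedy_path))
    tail = greedy_path[-window:]
    return bool((tail == tail[0]).all())

print(round(expected_explorations(), 1))
```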

3.3 Results

In this section, we report the main results of our experiments. We first consider the monopoly case

(N = 1) and duopoly case (N = 2), holding other parameters to their baseline values (Sections

3.3.1 and 3.3.2). In particular, we compare the distribution of final prices in these cases to their

equilibrium values when N = 1 (monopoly price) and N = 2 (duopoly case), accounting for the fact

that dealers must position their prices on a grid. Given this constraint, the theoretical monopoly

price is am = 7 and there are two possible Nash-Bertrand equilibria (ac = 3 or ac = 4).
15. This specification is common in the literature on Q-learning to guarantee that all actions are chosen sufficiently often to overcome the initial values of the Q-matrix. Indeed, as long as q_{m,n,t} is larger than the maximal payoff the agent can realize, action m will necessarily be picked again, because all the cells associated with actions that are played eventually fall below the maximal payoff.
16. Note that Q-learning algorithms are meant for situations in which agents have no prior knowledge of the environment. Hence, there is no basis on which one could optimize the algorithm, e.g., by picking the “best” values of α and β. Rather, these values and the rules used by the algorithm must be seen as parameters.



3.3.1 A single AMM (N = 1) behaves more competitively than in theory

Consider first the case in which N = 1. Panel (a) of Figure 1 reports the evolution of the greedy price a^{*k}_{1,t} over episodes, averaged over the 10,000 experiments, while Panel (b) reports the distribution over the K experiments of the final greedy price, a^{*k}_{1,T} (whether convergence takes place or not). Panel (a) suggests that, on average, the greedy price converges as the number of episodes becomes large. However, only 73.64% of the experiments converge (as defined in Section 3.2.2) and the final greedy price is heterogeneous across experiments, as shown in Figure 1. The average final greedy price, a^{*k}_{1,T} , across all experiments is 6.16. However, there is substantial heterogeneity across experiments. As

panel (b) shows, while most experiments ultimately reach a greedy price of 6, a substantial number

reach 5 or 7, and in a few cases even 8 or 9. This dispersion in final outcomes across experiments

is due to the environment being stochastic. Even though the values of the Q-matrix are not very

sensitive to individual observations (remember that α = 0.01), there is still a significant probability of obtaining sufficiently many “bad draws” (resulting in zero demand at prices of 6 or 7) to end up with a

Q-matrix with a greedy price of 5 or 8, even though the monopolist’s payoff is maximized at 7.

[Insert Fig. 1 here.]

A striking feature of this experiment is that, most of the time (in more than 75% of all exper-

iments), the algorithm fails to learn the theoretical (optimal) monopoly price 7 even though T is

large. Moreover, this failure is not random: On average, the final price posted by the algorithm is

below 7. The modal price is 6, and the algorithm is more likely to set a price of 5 than a price of

8, even though playing 8 would give a higher expected profit than playing 5. The reason is that

the updating rule (8) is biased against actions giving a high payoff with a low probability, such as

choosing a high price. This effect is more pronounced when α is larger, but still significant even

with the low value of α we are using (in unreported results, we checked that the average final greedy

price indeed decreases in the parameter α).17


17. To understand this point, imagine there are only two actions a1 and a2 . Action a1 gives a sure payoff π1 , whereas a2 gives a payoff π+ > π1 with probability p, and π− < π1 with probability 1 − p, with π2 = pπ+ + (1 − p)π− > π1 . If both actions are played many times, the expectation of q_{2,t} associated with a2 will converge to π2 . However, as noted in Lemma 1, the random variable q_{2,t} itself does not converge pointwise. Instead, the distribution of q_{2,t} converges to a non-degenerate distribution. A simple example is the case α = 1: then q_{2,t} will be equal to π+ > π1 with probability p and π− < π1 with probability 1 − p. Hence, the Q-learning algorithm will mistakenly pick action 1 as the greedy action with probability 1 − p.
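The bias described in footnote 17 is easy to reproduce numerically. The sketch below (ours, with illustrative parameter values) estimates how often the risky action's estimate ends up below the safe payoff π1 when α is small but positive, in which case the greedy choice would be the inferior safe action.

```python
# Sketch (ours) of the two-action example of footnote 17: how often does the
# Q-value of the risky action end below the sure payoff of the safe action?
import numpy as np

def prob_below_safe(pi1=1.0, pi_plus=4.0, pi_minus=0.0, p=0.3,
                    alpha=0.01, n_updates=5_000, n_sims=10_000, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.uniform(3.0, 6.0, size=n_sims)        # optimistic initial estimates
    for _ in range(n_updates):
        payoff = np.where(rng.random(n_sims) < p, pi_plus, pi_minus)
        q = alpha * payoff + (1 - alpha) * q      # updating rule (8), one action
    # The risky action has the higher expected payoff (1.2 > 1.0), yet a
    # non-trivial fraction of runs ends with an estimate below pi1.
    return (q < pi1).mean()

print(prob_below_safe())
```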



One might be tempted to interpret this failure to learn the optimal price with probability 1 as a

deficiency of the algorithm. However, this class of algorithms is not explicitly designed to learn the

optimal price. Rather they seek to reach a certain balance between “exploring” and “exploiting”.

For instance, one could reach a final outcome closer to the monopoly price by choosing smaller

values of α and β. However, doing so would be at the cost of playing suboptimal prices for more

periods (so that the average profit of the AMMs over all episodes might be smaller).

In any case, an important conclusion from the single dealer case is that the Q-learning algorithm

used by the AMMs in our experiments is not by itself biased towards high prices. If anything, the

single dealer case shows that the opposite happens. This makes the non competitive final outcomes

observed in the duopoly case (see next section) more striking.

3.3.2 Two AMMs do not suffice to obtain Bertrand-Nash outcomes

Now consider the duopoly case (N = 2). The starting values of the Q-matrices, Q1,0 and Q2,0 , for

each AMM as well as all the subsequent random draws for the two AMMs, are drawn independently

of each other (except the client’s demand). Panel (a) of Figure 2 reports the evolution of the greedy

price a^{*k}_{n,t} for each AMM over T episodes, averaged over the K experiments. Panel (b) reports the distribution of the final greedy price a^{*k}_{n,T} for each AMM over the K experiments.

[Insert Fig. 2 here.]

As can be seen in Figure 2, the AMMs’ quotes converge more quickly in the duopoly case

than in the monopoly case. Convergence is also more frequent: In 94.18% of the experiments, the

quote posted by each AMM has converged after 200,000 episodes (vs. only 73.64% when N = 1). Moreover, in all experiments with convergence, the AMMs end up posting the same price (a^{*k}_{1,T} = a^{*k}_{2,T} ). However, this price is rarely one of the two Bertrand-Nash equilibrium prices (3 or 4, due to price discreteness). Indeed, we observe a^{*k}_{1,T} = a^{*k}_{2,T} = 4 in only 5.57% of experiments, and we never observe a^{*k}_{1,T} = a^{*k}_{2,T} = 3. As Panel (b) of Figure 2 shows, a majority of experiments (more than

60%) converge to a price of 5, about 20% converge to a price of 6, and some to 7 (the monopoly

price) or 8. Thus, on average, the prices posted by the two competing AMMs are far above the

Bertrand-Nash equilibrium price.



The reason for this seemingly collusive outcome is different from the one in Calvano et al. (2020),

because our setup precludes dynamic strategies (quotes cannot be contingent on past competitors’

quotes in our set-up). Its origin seems closer to that in Asker et al. (2022) who also find that prices set

by Bertrand competitors using Q-learning are above competitive prices (in an environment without

adverse selection). In the first episodes, both AMMs are experimenting with a high probability.

AMM 1 for instance is gradually learning how to best respond to AMM 2. However, most of

the time, AMM 2 chooses a random price since the likelihood of experimentation is high in early

episodes. The best response to AMM 2 is actually for AMM 1 to play a = 6. As AMM 1 plays 6

more and more often (since the likelihood of experimentation declines over time), AMM 2 should

in principle learn that her best response is then to play a = 5 (in an undercutting process typical

of Bertrand competition). However, because both AMMs experiment less and less often over time,

this undercutting process will typically not last long enough to reach the Bertrand outcome. For

instance, both AMMs may have reached a price of only 5 when the probability of experimenting

ever again becomes very small. If for both AMMs playing 3 or 4 did not prove profitable in the

past (when the other AMM was playing differently), then the AMMs appear “stuck” with supra-

competitive prices.18 Our next step is to study how the probability that this happens depends on

the parameters of the model, and in particular on the degree of adverse selection.

3.3.3 Adverse selection tends to make AMMs’ quotes more competitive

In this section we study how the outcomes of the simulations vary when we change the parameters of

the economic environment, in particular the degree of adverse selection. For each set of parameters,

in each experiment k and episode t we compute the following variables (which correspond to

empirically observable quantities):

1. The trading volume V^k_t , which is equal to 1 if a trade happens and 0 otherwise.

2. The quoted spread QS^k_t , which is the best ask minus the asset’s ex ante expected value:

QS^k_t = a^{min,k}_t − E[ṽ].   (9)
18. See Abada et al. (2022) for a comprehensive analysis and discussion of this issue. Wunder et al. (2010) show that even in a simple prisoner’s dilemma Q-learning algorithms may not reach the Nash equilibrium.



3. The realized spread RS^k_t , which is:

RS^k_t = a^{min,k}_t − v^k_t .   (10)

The realized spread is computed only when there is a trade. It measures the profit actually realized by the AMM with the best quote, given the actual value v^k_t of the asset. Its average

value over trades is a standard measure of dealers’ expected profits per share in the literature

(see Section 3.1).

We then compute the average across the K experiments of these three quantities in the last episode.

That is, we compute:

V̄ = ( Σ_{k=1}^{K} V^k_T ) / K,   (11)

Q̄S = ( Σ_{k=1}^{K} QS^k_T ) / K,   (12)

R̄S = ( Σ_{k=1}^{K} V^k_T RS^k_T ) / ( Σ_{k=1}^{K} V^k_T ).   (13)
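Given the final-episode records of the K experiments, these averages are immediate to compute; the sketch below (with our own variable names) mirrors equations (11)-(13).

```python
# Sketch (ours) of equations (11)-(13): cross-experiment averages in the last episode.
import numpy as np

def final_averages(volume_T, quoted_spread_T, realized_spread_T):
    """Arrays of length K holding V_T^k, QS_T^k and RS_T^k for each experiment k."""
    V_bar = volume_T.mean()                          # equation (11)
    QS_bar = quoted_spread_T.mean()                  # equation (12)
    traded = volume_T.astype(bool)
    RS_bar = realized_spread_T[traded].mean()        # equation (13): average over trades
    return V_bar, QS_bar, RS_bar
```

Because V^k_T is either 0 or 1, the trade-weighted ratio in equation (13) is simply the average realized spread over the experiments in which a trade occurred in the last episode.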

[Insert Fig. 3 here.]

Panels (a) and (b) in Figure 3 show the effect of a change in σ (the dispersion of clients’ private valuations) and ∆v (the volatility of the asset payoff) on the average trading volume, the average

quoted spread, and the average realized spread in the case with a single AMM (dashed line), two

AMMs (solid line), and in the Bertrand-Nash equilibria (dotted lines).

As explained previously, an increase in σ reduces dealers’ exposure to adverse selection and

the elasticity of clients’ demand to dealers’ price. In the benchmark case (see Table 1), the first

effect reduces adverse selection costs. For this reason, the quoted spread in the Bertrand-Nash

equilibria decreases (weakly due to price discreteness) with σ. However, surprisingly, the opposite

pattern is observed for the quoted spread posted by AMMs: As σ increases, the two AMMs post

less competitive quotes. In fact, the effect of σ on AMMs’ quotes is similar to its effect on the monopoly price (red dashed line in Figure 3). In this case, like a monopolist, the competing AMMs seem



to take advantage of the decrease in the client’s demand elasticity to charge larger markups and

thereby obtain larger expected profits (as shown by the evolution of the average realized spread).19

Thus, surprisingly, a decrease in adverse selection makes the quotes posted by AMMs less

competitive. The effect of ∆v on AMMs’ quotes (Panel b) conveys a similar message. As ∆v

decreases from 8 to 4, AMMs’ exposure to adverse selection decreases. However, as shown by the

evolution of AMMs’ realized spread, their rents increase, exactly as in the monopolist case. When

∆v keeps decreasing (from 4 to 1), AMMs’ rents decrease, but in a way similar to what is observed

in theory for the monopolist.20 In sum, competing AMMs react to a decline in adverse selection

(an increase in σ or a decrease in ∆v) in a way qualitatively similar to a monopolist rather than

Bertrand competitors.

Panel (c) of Figure 3 shows the effect of an increase in the number of AMMs (from 1 to 10). As the

number of AMMs increases, AMMs’ quotes become closer to the Bertrand-Nash equilibria. Thus,

AMMs’ rents (realized bid-ask spreads) decline. This pattern may seem intuitive. However, in

theory it takes only two dealers to obtain the Bertrand-Nash equilibrium. Thus, economic theory

predicts that bid-ask spreads and dealers’ rents should decline when N increases from 1 to 2 but

that a further increase in N should have no effect. Empirical findings regarding the effects of

high frequency market makers’ entry on bid-ask spreads, reported in Brogaard and Garriott (2019)

(discussed in the introduction), are more consistent with the patterns obtained for AMMs than

those predicted by the Bertrand-Nash equilibrium.

3.3.4 Welfare implications of algorithmic market-making

Spread measures do not immediately translate into welfare measures. In particular, the realized

spread RS measures a market-maker’s realized profit (and hence, the cost for the client) conditional on

a trade, but does not take into account the probability that this trade occurs. To further investigate

the consequences of AMMs for total welfare in the economy and its distribution between market-

makers and buyers, we compute the levels of welfare, consumer surplus, and firm profits achieved
19. The decline in the client’s demand elasticity explains why trading volume increases with σ in the experiments, despite the fact that AMMs charge a larger price to their client.
20. AMMs’ rents also evolve in a way similar to that observed in one of the two Nash-Bertrand equilibria (dotted purple line) but opposite to that in the other one (yellow dashed line). We think that the pattern observed in the first case is due to price discreteness and will therefore not be robust to a finer grid, in contrast to the other patterns.



with AMMs and compare them to their counterparts in the competitive benchmark.

For a given best ask a, total welfare can be computed as:

W (a) = Pr(ṽ + L̃ ≥ a)E[L̃|ṽ + L̃ ≥ a]. (15)

In words, welfare in this model is driven by the liquidity shocks L̃, which create gains from trade

between buyers and market-makers. Welfare is always lower when the ask price increases, and as a

result, even in the competitive case, an increase in adverse selection lowers welfare. Welfare can be

further decomposed into consumer surplus CS and producer surplus P S:

CS(a) = Pr(ṽ + L̃ ≥ a)E[L̃ + ṽ − a|ṽ + L̃ ≥ a], (16)

PS(a) = Pr(ṽ + L̃ ≥ a)E[a − ṽ|ṽ + L̃ ≥ a].   (17)
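Under the normal assumption for L̃, these three objects have simple closed forms (using E[L̃·1{L̃ ≥ c}] = σφ(c/σ) for a N(0, σ²) variable); the sketch below, with our own helper names, evaluates them for a given best ask a.

```python
# Sketch (ours): closed-form evaluation of W(a), CS(a), PS(a) in equations (15)-(17)
# when v is binary with probability 1/2 and L ~ N(0, sigma^2).
from scipy.stats import norm

def welfare_components(a, v_low=0.0, v_high=4.0, sigma=5.0):
    def trade_prob(v):                 # D(a, v) = Pr(v + L >= a)
        return 1.0 - norm.cdf((a - v) / sigma)
    def partial_mean(v):               # E[L * 1{L >= a - v}] = sigma * phi((a - v)/sigma)
        return sigma * norm.pdf((a - v) / sigma)
    W = 0.5 * (partial_mean(v_high) + partial_mean(v_low))             # equation (15)
    PS = 0.5 * (trade_prob(v_high) * (a - v_high)
                + trade_prob(v_low) * (a - v_low))                     # equation (17)
    CS = W - PS                                                        # equation (16)
    return W, CS, PS

print(welfare_components(5.0))
```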

Based on the results of the experiments, we compute the average realized values of W, CS, and

PS, and show in Fig. 4 how they vary with ∆v and σ.

[Insert Fig. 4 here.]

We observe that an increase in σ leads to an increase in profits, due to both the AMMs behaving

less competitively (realized spreads increase) and demand elasticity being lower. However, because

this elasticity is low, high prices have a lower impact on the probability that a trade is realized, and

conditional on a trade the gains are also higher. As a result, consumer surplus and total welfare

also increase with σ. An increase in ∆v has a somewhat ambiguous impact on realized spreads but

it reduces profits, consumer surplus, and hence total welfare.

Overall, the comparative statics of welfare and profit with respect to σ and ∆v are the same in

the two benchmarks and with a duopoly of AMMs, the levels reached with AMMs being in between

the monopoly benchmark and the competitive benchmark.



4 Price Discovery (τ̄ = 2)

Models of trading with asymmetric information in financial markets are often used to study the

process by which market participants discover asset fundamental values (“price discovery”). In

these models, trades convey information about an asset payoff (because some trades come from

informed investors). Using this information, uninformed traders (e.g., dealers) update their beliefs

about this payoff in a Bayesian way. Via this dynamic learning process, over time, prices get closer

to the asset value (see, for instance, Glosten and Milgrom (1985) or Easley and O’Hara (1992)).

In this section, we study whether AMMs can also discover asset fundamental values (ṽ in our

setting). To do so, we consider the case with two trading rounds (τ̄ = 2), following the same steps

as when τ̄ = 1. That is, in Section 4.1, we first explain how to derive equilibrium prices in our

two benchmarks (the monopoly case and the Bertrand-Nash equilibrium). Then, we explain how

Q-learning algorithms work in this environment (Section 4.2). Finally we present the results in

Section 4.3.

4.1 Benchmarks: Learning the Fundamental Value

When τ̄ = 2, dealers can learn information about ṽ from the trading outcome at date 1. Thus,

their beliefs regarding the payoff of the asset evolve over time. As is standard in models of trading

with asymmetric information, in the benchmark monopoly and competitive cases, we assume that

dealers update their beliefs in a Bayesian way. At the end of the first trading round in a given

episode, there are two possible trading histories (H1 ): (i) a trade at price a^min_1 (H1 = {1, a^min_1 }) or (ii) no trade (H1 = {0, a^min_1 }). In the first case, dealers’ Bayesian belief about the likelihood that v = vH is (remember that dealers’ prior belief about this event is 1/2):

µ2 (1, a^min_1 ) := Pr(v = vH | H1 = {1, a^min_1 }) = D(a^min_1 , vH ) / ( D(a^min_1 , vH ) + D(a^min_1 , vL ) ),   (18)


where D(a, v), given by (4), is the probability that the client buys the asset at price a when the asset value is v. In the second case, dealers’ Bayesian belief about the likelihood that v = vH is:

µ2 (0, a^min_1 ) := Pr(v = vH | H1 = {0, a^min_1 }) = ( 1 − D(a^min_1 , vH ) ) / ( 2 − (D(a^min_1 , vH ) + D(a^min_1 , vL )) ).   (19)

It is easily checked that µ2 (1, a^min_1 ) > µ2 (0, a^min_1 ) if (and only if) ∆v > 0. That is, Bayesian dealers
1 ) > µ2 (0, a1 ) if (and only if) ∆v > 0. That is, Bayesian dealers

should revise their beliefs about the expected payoff of the asset upward after a trade (buy) at date

1 and downward after no trade at date 1.
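For reference, the posteriors (18)-(19) are straightforward to evaluate under the normal assumption for L̃; the sketch below (ours) does so for an arbitrary first-round best ask.

```python
# Sketch (ours) of the Bayesian updates (18)-(19), with D(a, v) = 1 - Phi((a - v)/sigma).
from scipy.stats import norm

def posterior_high(a_min, trade, v_low=0.0, v_high=4.0, sigma=5.0):
    d_high = 1.0 - norm.cdf((a_min - v_high) / sigma)   # D(a, vH)
    d_low = 1.0 - norm.cdf((a_min - v_low) / sigma)     # D(a, vL)
    if trade:
        return d_high / (d_high + d_low)                # equation (18)
    return (1.0 - d_high) / (2.0 - (d_high + d_low))    # equation (19)

# A buy pushes the belief above 1/2; no trade pushes it below 1/2 (for Delta_v > 0).
print(posterior_high(3.0, True), posterior_high(3.0, False))
```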

Given these observations, one expects the monopoly price and the competitive price (the Nash-

Bertrand equilibrium price) to be larger (smaller) in the second trading round than in the first if

there is a trade (no trade) in the first trading round. Table 2 shows that this is the case for the

parameters of our experiments. In addition, in the competitive case, the difference between dealers’

ask prices when there is a trade and when there is no trade increases with the informativeness of

the order flow in the first period (i.e., it increases with ∆v and decreases with σ). Table 2 also shows that, in the competitive case, the ask price posted by dealers in the second period is smaller on average than in the first period (that is, E[a^c_2 ] − a^c_1 ≤ 0). This reflects the fact that

as time passes, the informational asymmetry between dealers and their clients declines, since dealers

learn information about the asset payoff. Thus, they face less adverse selection and therefore across

all possible realizations of v and the trading history at date 1, their ask price should be closer to

the asset unconditional value in the second period than in the first period. In Section 4.3, we study

whether AMMs’ quotes satisfy these properties or not. This is a way to study whether AMMs learn

to discover the asset payoff, even though they are not programmed to be Bayesian, as competitive

dealers do in the benchmark case.

[Insert Table 2 here.]

Table 2 also shows that, as in the case with one trading round (and for the same reasons), (i)

the competitive and the monopoly prices increase with the volatility of the asset (∆v) in each

trading round, (ii) the competitive prices in each trading round decrease with the dispersion of

clients’ private valuations (σ) and (iii) the monopoly prices in each trading round increase with this



dispersion.

Last, observe that, in the competitive case, the quotes posted by dealers in the first trading

round are identical to those obtained when there is a single trading round (compare Tables 1 and

2). In contrast, the monopolist price in the first trading round differs from that obtained when there

is a single trading round. This reflects the fact that, in choosing her price in the first trading

round, a monopolist accounts for the effect of this price on her expected trading profit in the first

trading round and her expected trading profit in the second trading round via the effect of her

choice on her belief about the asset payoff given the first period outcome (trade/no trade).21

4.2 Q-Learning Algorithms

In this section, we explain how we adapt the Q-learning algorithms described in Section 3.2 to the

case in which episodes have two trading rounds. The algorithms will keep track in each episode of the

“state” they are in, and will play an action depending on the state. More specifically, for each AMM

n, we define $(N+3)$ states, denoted $s_n$, as follows: (i) $s_n = \emptyset$ in the first trading round; (ii) $s_n = NT$ in the second trading round if no trade takes place in the first; (iii) $s_n \in S = \{0, \tfrac{1}{N}, \tfrac{1}{N-1}, \ldots, \tfrac{1}{2}, 1\}$ is the number of shares sold by AMM $n$ if a trade took place in period 1 (depending on how many

AMMs shared the market). Each AMM then relies on a Q-matrix $Q_{n,t} \in \mathbb{R}^{M \times (N+3)}$, in which each row corresponds to a different price and each column to a state, ordered as in the previous paragraph. We denote $q_{m,s,n,t}$ the $(m, s)$ entry of matrix $Q_{n,t}$.

We then modify the process described in Section 3.2.1 as follows. For any experiment k, we

initialize the matrices $Q_{n,0}$ with random values: each $q_{m,s,n,0}$ (for $1 \le m \le M$, $1 \le n \le N$, and each state $s$) is i.i.d. and follows a uniform distribution over $[\underline{q}, \overline{q}]$. Then, in each episode t, we do the

following:

Period 1:

1. For each AMM $n$, we define $m^{1,*}_{n,t} = \arg\max_m q_{m,\emptyset,n,t-1}$ the index associated with the highest value in matrix $Q_{n,t-1}$ in state $s = \emptyset$ (the first period), and we denote $a^{1,*}_{n,t} = a_{m^{1,*}_{n,t}}$ the corresponding greedy price.

21 One can show that $\mu_2(1, a^{\min}_1)$ and $\mu_2(0, a^{\min}_1)$ increase with $a^{\min}_1$. Thus, by choosing a high $a^{\min}_1$, the monopolist improves the informational content of a trade at date 1 but it reduces the informational content of observing no trade.

2. For each AMM $n$, with probability $\epsilon_t = e^{-\beta t}$ the AMM “explores” by playing $a^1_{n,t} = a_{\tilde m^1_{n,t}}$, where $\beta > 0$ and $\tilde m^1_{n,t}$ is a random integer between 1 and $M$, all values being equiprobable. With probability $1 - \epsilon_t$, the AMM “exploits” and plays the greedy price $a^1_{n,t} = a^{1,*}_{n,t}$. The

random draws leading to exploring or exploiting are i.i.d. across all AMMs in a given trading

round of a given episode.

3. We compute $a^{1,\min}_t = \min_n a^1_{n,t}$, and draw $\tilde v_t$ and $\tilde L_{1,t}$. This determines the position $I^1_{n,t}$ taken by each AMM in period 1 and the state $s_{n,t}$ it will be in when period 2 starts. Formally, denote $D^1_t$ the set of AMMs who quote $a^{1,\min}_t$ and $z^1_t$ the size of this set. Then, if $\tilde v_t + \tilde L_{1,t} \ge a^{1,\min}_t$ we have $I^1_{n,t} = s_{n,t} = \frac{1}{z^1_t}$ for every $n \in D^1_t$, and $I^1_{n,t} = s_{n,t} = 0$ for $n \notin D^1_t$. If $\tilde v_t + \tilde L_{1,t} < a^{1,\min}_t$ then $I^1_{n,t} = 0$ and $s_{n,t} = NT$ for every $n$.

4. We update the first column of the Q-matrix of each AMM as follows:


$$\forall\, 1 \le n \le N, \quad q_{m,\emptyset,n,t} = \begin{cases} \alpha\left[a^1_{n,t} I^1_{n,t} + \max_{m'} q_{m',s_{n,t},n,t-1}\right] + (1-\alpha)\, q_{m,\emptyset,n,t-1} & \text{if } a^1_{n,t} = a_m \\ q_{m,\emptyset,n,t-1} & \text{if } a^1_{n,t} \neq a_m \end{cases} \qquad (20)$$

Period 2:

1. At the beginning of period 2 we know the state sn,t in which AMM n finds itself. We define

$m^{2,*}_{n,t} = \arg\max_m q_{m,s_{n,t},n,t-1}$ the index associated with the highest value in matrix $Q_{n,t-1}$ in state $s = s_{n,t}$, and we denote $a^{2,*}_{n,t} = a_{m^{2,*}_{n,t}}$ the corresponding greedy price.

2. With probability $\epsilon_t$ the AMM plays a random price $a^2_{n,t}$, following the same process as in period 1. With probability $1 - \epsilon_t$, the AMM plays $a^2_{n,t} = a^{2,*}_{n,t}$.

3. We compute $a^{2,\min}_t = \min_n a^2_{n,t}$ and draw $\tilde L_{2,t}$. This determines the position $I^2_{n,t}$ taken by each AMM in period 2, following the same rules as in period 1.



4. For each AMM $n$, we only update the column corresponding to state $s_{n,t}$, as follows:

$$\forall\, 1 \le n \le N, \quad q_{m,s_{n,t},n,t} = \begin{cases} \alpha\left[a^2_{n,t} I^2_{n,t} - \tilde v_t\,(I^1_{n,t} + I^2_{n,t})\right] + (1-\alpha)\, q_{m,s_{n,t},n,t-1} & \text{if } a^2_{n,t} = a_m \\ q_{m,s_{n,t},n,t-1} & \text{if } a^2_{n,t} \neq a_m \end{cases} \qquad (21)$$

Q-learning algorithms were initially designed to solve dynamic stochastic optimization problems

(both finite and infinite horizon), and are thus in principle well suited to optimizing prices in this

environment. The Q-matrix is defined in such a way that each algorithm can in principle learn to

play a different price in period 2 depending on the “state”, that is, depending on whether there

was a trade in period 1. Note that, in addition, the state needs to include the amount sold by the

AMM in period 1. Indeed, as vt is only revealed in period 2, the Q-matrix can record only at the

end of period 2 what was the actual cost of selling some units of the asset in period 1.22
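To make the timing of the two updates concrete, the following is a minimal, self-contained sketch of one experiment run along the lines of the steps above. It is a sketch of the update logic rather than the code used to generate the results reported below: all numerical values (the price grid, M, α, β, the number of episodes, and the initialization range) are illustrative placeholders, and the trade rule ṽ + L̃ ≥ a with a normal L̃ stands in for the formal definition of D(a, v).

```python
# Minimal sketch of one experiment with two trading rounds (equations (20)-(21)).
# Parameter values below are illustrative placeholders, not those of Section 3.3.
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 2, 11, 100_000               # number of AMMs, price grid size, episodes
vH, vL, sigma = 4.0, 0.0, 5.0          # asset payoffs and dispersion of valuations
alpha, beta = 0.01, 2e-5               # learning rate and exploration decay
prices = np.linspace(2.0, 7.0, M)      # price grid a_1 < ... < a_M

# Columns of the Q-matrix: 0 = "empty" (period 1), 1 = "NT", 2.. = shares sold in
# period 1, i.e. 0, 1/N, 1/(N-1), ..., 1 (N+1 values), hence N+3 columns in total.
shares = np.concatenate(([0.0], 1.0 / np.arange(N, 0, -1)))
Q = rng.uniform(-2.0, 2.0, size=(N, M, N + 3))   # random initialization

def share_column(share):
    return 2 + int(np.argmin(np.abs(shares - share)))

for t in range(T):
    eps = np.exp(-beta * t)
    v = vH if rng.random() < 0.5 else vL
    # Period 1: explore with probability eps, otherwise play the greedy price (state "empty").
    m1 = np.where(rng.random(N) < eps, rng.integers(0, M, N), Q[:, :, 0].argmax(axis=1))
    a1 = prices[m1]
    winners = np.flatnonzero(a1 == a1.min())
    I1 = np.zeros(N)
    if v + sigma * rng.standard_normal() >= a1.min():          # the client buys
        I1[winners] = 1.0 / len(winners)
        state = np.array([share_column(I1[n]) for n in range(N)])
    else:
        state = np.full(N, 1)                                   # state "NT"
    # Equation (20): period-1 revenue plus the continuation value of the realized state.
    for n in range(N):
        target = a1[n] * I1[n] + Q[n, :, state[n]].max()
        Q[n, m1[n], 0] = alpha * target + (1 - alpha) * Q[n, m1[n], 0]
    # Period 2: explore or play the greedy price of the realized state.
    m2 = np.array([rng.integers(0, M) if rng.random() < eps else Q[n, :, state[n]].argmax()
                   for n in range(N)])
    a2 = prices[m2]
    winners2 = np.flatnonzero(a2 == a2.min())
    I2 = np.zeros(N)
    if v + sigma * rng.standard_normal() >= a2.min():
        I2[winners2] = 1.0 / len(winners2)
    # Equation (21): v is now revealed, so the cost of all units sold is charged here.
    for n in range(N):
        reward = a2[n] * I2[n] - v * (I1[n] + I2[n])
        Q[n, m2[n], state[n]] = alpha * reward + (1 - alpha) * Q[n, m2[n], state[n]]
```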

4.3 Results

To study price discovery with AMMs using the Q-learning algorithms described in the previous

section, we proceed exactly as in Section 3.3. In particular, we use the same parameter values for

K (number of experiments), T (number of episodes per experiment), α and β. For brevity, we only

focus on the case with two AMMs (N = 2). We measure price discovery by AMMs (i.e., whether

AMMs’ quotes reflect information about the asset payoff contained in the first period trade) by

computing the magnitude of the average price reaction to the observation of a trade vs. no trade

(across experiments with the same environment). Formally, defining Vtτ,k the total trading volume

in trading round τ of episode t in experiment k, we compute:

1,k 2,min,k
a1,min,k − VT1,k )[a2,min,k − a1,min,k
PK PK
k=1 VT [aT − T ] k=1 (1 T T ]
Discovery = PK 1,k
− PK 1,k
. (22)
k=1 VT k=1 (1 − VT )
22
Using inventory levels as the state variable is common in other applications of Q-learning, in particular in dynamic
pricing and revenue management. See, e.g., Rana and Oliveira (2014) for a recent example. The list of states used by
the algorithms is an important parameter of the model. The list could be even richer (e.g., conditioning on prices in
period 1 as well), or coarser (not distinguishing states N T and 0).



The variable Discovery is the empirical counterpart, in our experiments, of the difference between

the ask price in the second period when there is a trade and when there is no trade in the benchmark

cases. In these cases, this difference is always positive because dealers become more optimistic about

the asset payoff after observing a buy in the first trading round than after observing no buy (see

Table 2).

We also want to study whether price discovery induces dealers to charge lower markups relative

to their expectation of the asset payoff because it reduces informational asymmetries, as is observed

when dealers are competitive in the benchmark case (E(ac2 ) < ac1 ; see Table 2). To this end, we

compute the average difference (denoted Difference) between the ask price posted in the second

trading round and the ask price posted in the first trading round across experiments:

$$Difference = \frac{\sum_{k=1}^K \left[a_T^{2,\min,k} - a_T^{1,\min,k}\right]}{K}. \qquad (23)$$

If AMMs behave as in the competitive benchmark, Difference should be negative. If it is not and

Discovery > 0, this indicates that (i) price discovery takes place but (ii) AMMs take advantage

of the reduction in informational asymmetries to charge less competitive prices, in line with our

observations in the static case.
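As a minimal sketch, both measures can be computed from the final-episode outcomes of the K experiments as follows, assuming three arrays of length K that hold $V_T^{1,k}$, $a_T^{1,\min,k}$ and $a_T^{2,\min,k}$:

```python
# Sketch of the measures (22) and (23), computed across the K experiments.
import numpy as np

def discovery_and_difference(v1, a1_min, a2_min):
    """v1[k], a1_min[k], a2_min[k]: period-1 volume and best quotes in experiment k."""
    price_move = a2_min - a1_min
    discovery = (np.sum(v1 * price_move) / np.sum(v1)
                 - np.sum((1 - v1) * price_move) / np.sum(1 - v1))    # equation (22)
    difference = np.mean(price_move)                                   # equation (23)
    return discovery, difference
```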

Figure 5 plots Discovery and Difference for different values of σ and ∆v. In addition, we

plot the highest and lowest values these quantities can take across the several Nash equilibria of the

game, the monopoly benchmark, and the competitive benchmark with continuous prices.

[Insert Fig. 5 here.]

First, we observe that for all values of the parameters, Discovery is positive. Thus, AMMs learn

to quote higher prices when a trade occurred in period 1 than when a trade did not occur. Hence,

Q-learning algorithms are able to learn from past trades and contribute to price discovery. However,

the algorithms seem to significantly “overshoot”. That is, the difference in the prices posted by

AMMs following a buy or no buy in the first trading round is always larger than that predicted in

the most competitive Nash-Bertrand equilibrium (the dashed dotted line), given price discreteness.

This indicates that the difference in posted prices following a trade or no trade in the first trading



round is in part driven by deviations from competitive prices.

Second, we observe that Difference is always positive, that is, on average the algorithms use a

higher price in the second trading round than in the first. This is in stark contrast to the competitive

benchmark in which, at least if the tick size were zero, Difference should be negative (as shown in Table 2 and by the dashed green line in Figure 5 in the case of "Difference").

A mechanism that explains both results is, as in the static case, that adverse selection curbs

the market power of algorithms. In our set-up, observing the trading outcome in the first trading

round always reduces informational asymmetries between dealers and clients in the benchmark case.

Thus, dealers’ adverse selection cost is smaller in the second trading round. This decrease in adverse

selection leads the AMMs to settle on using less competitive prices, as we already observed in the

static case. In addition, adverse selection is reduced more after a trade than after no trade (observing

a trade is less likely ex ante, hence is more informative when it happens). Thus, AMMs tend to

charge larger markups after a trade than after no trade, explaining why on average Difference is positive instead of negative as in the competitive case.

These results give interesting insights into how competition between algorithms can be spotted

in the data. The first result implies that quotes will tend to over-react to order flow, potentially

generating more long-term reversal. The second result implies that spreads tend to widen as adverse

selection is resolved over time, whereas in competitive environments the opposite should occur (see,

e.g., Glosten and Putnins (2020)).

5 Conclusion

We study the interaction of market-makers using Q-learning algorithms in a standard microstructure

environment à la Glosten and Milgrom (1985). We show that this provides a natural workhorse

model to study the role of algorithms in securities markets, and how their behavior may differ from

what is predicted by standard theory. We find that, despite their simplicity and the challenge of an

environment with adverse selection, algorithms behave in a realistic way: their quotes reflect adverse

selection costs and they update their quotes in response to the observed order flow. However, their

behavior is markedly different from what standard theory predicts. In particular, their quotes tend



to be above the competitive level, and to become less competitive over time as adverse selection gets

resolved. More generally, our analysis shows that the interaction between algorithms is significantly

affected by the presence and extent of adverse selection, suggesting that securities markets are a

quite specific and particularly interesting application of recent research on competition between

algorithms.



References
Abada, I., Lambin, X. and Tchakarov, N. (2022). Collusion by Mistake: Does Algorithmic Sophistication Drive Supra-Competitive Profits? Working paper.

Asker, J., Fershtman, C. and Pakes, A. (2021). Artificial intelligence and pricing: The impact of algorithm design. Tech. rep., National Bureau of Economic Research.

—, — and — (2022). Artificial intelligence, algorithm design, and pricing. AEA Papers and Proceedings, 112, 452–56.

Baldauf, M. and Mollner, J. (2020). High-frequency trading and market performance. The Journal of Finance, 75 (3), 1495–1526.

Banchio, M. and Skrzypacz, A. (2022). Artificial intelligence and auction design. Available at SSRN 4033000.

Biais, B., Foucault, T. and Moinas, S. (2015). Equilibrium fast trading. Journal of Financial Economics, 116 (2), 292–313.

Brogaard, J. and Garriott, C. (2019). High-frequency trading competition. Journal of Financial and Quantitative Analysis, 54 (4), 1469–1497.

—, Hagströmer, B., Nordén, L. and Riordan, R. (2015). Trading fast and slow: Colocation and liquidity. The Review of Financial Studies, 28 (12), 3407–3443.

Budish, E., Cramton, P. and Shim, J. (2015). The High-Frequency Trading Arms Race: Frequent Batch Auctions as a Market Design Response. The Quarterly Journal of Economics, 130 (4), 1547–1621.

Calvano, E., Calzolari, G., Denicolo, V. and Pastorello, S. (2020). Artificial intelligence, algorithmic pricing, and collusion. American Economic Review, 110 (10), 3267–97.

Cartea, Á., Chang, P., Mroczka, M. and Oomen, R. C. (2022a). AI driven liquidity provision in OTC financial markets. Working paper.

—, — and Penalva, J. (2022b). Algorithmic Collusion in Electronic Markets: The Impact of Tick Size. Working paper.

Chen, L., Mislove, A. and Wilson, C. (2016). An empirical analysis of algorithmic pricing on amazon marketplace. In Proceedings of the 25th international conference on World Wide Web, pp. 1339–1349.

CMA (2018). Pricing algorithms. pp. 3–62.

Easley, D. and O'Hara, M. (1992). Time and the process of security price adjustment. The Journal of Finance, 47 (2), 577–605.

Foucault, T., Pagano, M. and Röell, A. (2013). Market Liquidity: Theory, Evidence, and Policy. Oxford: Oxford University Press.

Glosten, L. and Putnins, T. (2020). Welfare Costs of Informed Trade. Working paper.

Glosten, L. R. and Harris, L. E. (1988). Estimating the components of the bid/ask spread. Journal of Financial Economics, 21 (1), 123–142.

— and Milgrom, P. R. (1985). Bid, ask and transaction prices in a specialist market with heterogeneously informed traders. Journal of Financial Economics, 14 (1), 71–100.

Hansen, K. T., Misra, K. and Pai, M. M. (2021). Frontiers: Algorithmic collusion: Supra-competitive prices via independent algorithms. Marketing Science, 40 (1), 1–12.



Hendershott, T., Jones, C. M. and Menkveld, A. J. (2011). Does algorithmic trading improve liquidity? The Journal of Finance, 66 (1), 1–33.

Jaakkola, T., Jordan, M. I. and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6 (6), 1185–1201.

Kyle, A. S. (1985). Continuous auctions and insider trading. Econometrica, 53 (6), 1315–1335.

MacKay, A. and Weinstein, S. (2022). Dynamic Pricing Algorithms, Consumer Harm, and Regulatory Response. Working paper.

Menkveld, A. and Zoican, M. (2017). Need for speed? Exchange latency and liquidity. Review of Financial Studies, 30 (4), 1188–1228.

Menkveld, A. J. (2013). High frequency trading and the new market makers. Journal of Financial Markets, 16 (4), 712–740.

OECD (2017). Algorithms and collusion: Competition policy in the digital age. pp. 1–72.

O'Hara, M. (2015). High frequency market microstructure. Journal of Financial Economics, 116 (2), 257–270.

Rana, R. and Oliveira, F. S. (2014). Real-time dynamic pricing in a non-stationary environment using model-free reinforcement learning. Omega, 47, 116–126.

Sutton, R. and Barto, A. (2018). Reinforcement Learning: An Introduction. Cambridge (Mass.): MIT Press.

Tsitsiklis, J. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.

Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.

Wunder, M., Littman, M. L. and Babes, M. (2010). Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In ICML, pp. 1167–1174.



A Appendix

A.1 Tables

Panel A
σ 0.5 1 3 5 7
Competitive Case
ac 4.00 4.00 3.24 2.68 2.47
Quo. Spread 2.00 2.00 1.24 0.68 0.47
Real. Spread 0 0 0 0 0
Monopoly
am 4.37 4.69 5.68 6.54 7.03
Quo. Spread 2.37 2.69 3.68 4.54 5.03
Real. Spread 0.03 0.09 0.32 0.68 1.03
Panel B
∆v 0 2 4 6 8
Competitive Case
ac 2 2.16 2.68 3.65 5.02
Quo. Spread 0 0.16 0.68 1.65 3.02
Real. Spread 0 0 0 0 0
Monopoly
am 5.75 5.94 6.54 7.66 9.11
Quo. Spread 3.75 3.94 4.54 5.66 7.11
Real. Spread 0.82 0.78 0.69 0.57 0.47

Table 1: Predicted Outcomes in the Benchmark Cases, τ̄ = 1. Prices are continuous. Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, E(v) = 2 and µ = 1/2 (vH = 4 and vL = 0). Panel A: ∆v = 4. Quotes have been rounded to two decimals (which explains why they are equal when σ = 0.5 and σ = 1). Panel B: σ = 5.



Panel A
σ                 0.5    1      3      5      7
Competitive Case
ac1               4.00   4.00   3.24   2.68   2.47
ac2, V1 = 1       4.00   4.00   3.82   3.26   2.92
ac2, V1 = 0       4.00   4.00   2.44   2.08   2.02
E(ac2)            4.00   4.00   2.96   2.62   2.45
Monopoly
am1               4.38   4.75   5.65   6.53   7.8
am2, V1 = 1       4.38   4.75   6.2    7.33   8.47
am2, V1 = 0       4.38   4.75   5.45   6.28   7.59
E(am2)            4.38   4.75   5.65   6.53   7.8
Panel B
∆v                0      2      4      6      8
Competitive Case
ac1               2      2.16   2.68   3.65   5.03
ac2, V1 = 1       2      2.5    3.26   4.6    5.87
ac2, V1 = 0       2      1.8    2.08   2.45   3.67
E(ac2)            2      2.09   2.62   3.42   4.61
Monopoly
am1               5.76   5.94   6.53   7.61   9.09
am2, V1 = 1       5.76   6.2    7.33   8.61   9.73
am2, V1 = 0       5.76   5.87   6.28   7.26   8.86
E(am2)            5.76   5.93   6.49   7.53   9.01

Table 2: Predicted Outcomes in the Benchmark Cases, τ̄ = 2. Prices are continuous. Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, E(v) = 2 and µ = 1/2 (vH = 4 and vL = 0). Panel A: ∆v = 4. Quotes have been rounded to two decimals (which explains why they are equal when σ = 0.5 and σ = 1). Panel B: σ = 5. In each case, V1 = 1 if a trade takes place at date 1 and V1 = 0 otherwise.



A.2 Figures

(a) Average greedy price as a function of time. (b) Distribution of the final greedy price.

Figure 1: A single AMM - Baseline Parameters.

Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, σ = 5, E(v) = 2 and µ = 1/2 (vH = 4, vL = 0 and ∆v = 4).

(a) Average greedy price of both AMMs as a function of time. (b) Distribution of the final greedy price of both AMMs.

Figure 2: Duopoly of AMMs - Baseline Parameters.

Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, σ = 5, E(v) = 2 and µ = 1/2 (vH = 4, vL = 0 and ∆v = 4).



(a) Dispersion of Clients' Private Valuations (σ). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, E(v) = 2 and µ = 1/2 (vH = 4, vL = 0 and ∆v = 4).

(b) Volatility of the Asset Payoff (∆v). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, σ = 5, E(v) = 2 and µ = 1/2.

(c) Number of AMMs (N). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, σ = 5, E(v) = 2 and µ = 1/2 (vH = 4, vL = 0 and ∆v = 4).

Figure 3: Comparative statics



(a) Dispersion of Clients' Private Valuations (σ). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, E(v) = 2 and µ = 1/2 (vH = 4, vL = 0 and ∆v = 4).

(b) Volatility of the Asset Payoff (∆v). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, σ = 5, E(v) = 2 and µ = 1/2.

(c) Number of AMMs (N). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, σ = 5, E(v) = 2 and µ = 1/2 (vH = 4, vL = 0 and ∆v = 4).

Figure 4: Comparative statics



(a) Dispersion of Clients' Private Valuations (σ). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, E(v) = 2 and µ = 1/2 (vH = 4, vL = 0 and ∆v = 4).

(b) Volatility of the Asset Payoff (∆v). Clients' private valuations are normally distributed with mean zero and variance σ². Moreover, σ = 5, E(v) = 2 and µ = 1/2.

Figure 5: Comparative statics



A.3 Derivation of the Competitive Price

In this section, we explain how to compute the competitive price in a given trading round for given

dealers’ beliefs about the distribution of the asset payoff. We do so in the general case (for any τ )

so that our results apply in particular when τ = 1 and τ = 2.

Let $V_\tau = I(a(\tau), \tilde L_\tau, \tilde v) \in \{0, 1\}$ denote the realized trade in period $\tau$ and let $I_{n\tau} = V_\tau Z(a_{n\tau}, a(\tau)) \in [0, 1]$ denote the trade executed by dealer $n$ in round $\tau$. Let $H_\tau$ denote the trading history (the observation of clients' trading decisions and the best quotes until trading round $\tau$). That is, $H_\tau = \{(V_i, a^{\min}_i)\}_{i=1,2,\ldots,\tau}$ for $\tau \ge 0$ and $H(0) = \emptyset$. The trading history contains information about the

asset payoff. Indeed, holding the best quote constant, a client is more likely to buy the asset when

ṽ is large than when ṽ is low. Let µ(Hτ −1 ) be dealers’ estimate of the probability that ṽ = vH at

the beginning of trading round τ (with µ(0) = µ = 1/2), given the trading history.

In a Nash-Bertrand equilibrium, in the τ-th trading round, all dealers post the same price $a^c_\tau$ such that their expected profit is zero. This happens only if the dealer posting the lowest price earns zero expected profit. Thus, $a^c_\tau$ solves:

$$\Pi(a^c_\tau, \mu(H_{\tau-1})) = \mu(H_{\tau-1})\, D(a^c_\tau, v_H)(a^c_\tau - v_H) + (1 - \mu(H_{\tau-1}))\, D(a^c_\tau, v_L)(a^c_\tau - v_L) = 0. \qquad (A.1)$$

We deduce that:

$$a^c_\tau = E(\tilde v \mid H_{\tau-1}) + \frac{\mu(H_{\tau-1})\,(1 - \mu(H_{\tau-1}))\,(v_H - v_L)\left(D(a^c_\tau, v_H) - D(a^c_\tau, v_L)\right)}{\mu(H_{\tau-1})\, D(a^c_\tau, v_H) + (1 - \mu(H_{\tau-1}))\, D(a^c_\tau, v_L)}. \qquad (A.2)$$

The competitive price is the smallest solution to this equation. Observe that it is equal to dealers’

expectation of the asset payoff conditional on their information at the beginning of trading round τ

plus a markup (since $D(a^c_\tau, v_H) - D(a^c_\tau, v_L) = G(a^c_\tau - v_H) - G(a^c_\tau - v_L) > 0$). This markup increases

with dealers’ uncertainty about the asset payoff at the beginning of trading round τ (measured by

µ(Hτ −1 )(1 − µ(Hτ −1 ))(vH − vL )).

There is no analytical solution to (A.2). However, one can easily solve it numerically for specific

parameter values. To solve (numerically) for the competitive price in the first trading round, we

just replace µ(Hτ −1 ) by µ = 1/2 in (A.2) (dealers’ prior at the beginning of an episode). To solve



for the competitive price in the second trading round after a trade in the first trading round, we

replace µ(H1 ) by µ2 (1, ac1 ) (given in (18) in the text) in (A.2). To solve for the competitive price

in the second trading round after no trade in the first trading round, we replace µ(H1 ) by µ2 (0, ac1 )

(given in (19) in the text) in (A.2). Also, note that the probability that a trade occurs in trading

round τ is P r(Vτ = 1) = µ(Hτ −1 )D(acτ , vH ) + (1 − µ(Hτ −1 ))D(acτ , vL ). Hence, one also gets that:

acτ = E(v | Hτ −1 , Vτ = 1). (A.3)

That is, the competitive price is the expected payoff of the asset conditional on the trading history up to trading round τ and the occurrence of a trade in trading round τ.
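A minimal numerical sketch of this computation iterates the map a ↦ E[v | trade at price a] starting from the prior mean, under the assumption (not restated in this appendix) that D(a, v) = 1 − Φ((a − v)/σ), consistent with the normal private valuations used in the experiments:

```python
# Sketch of the numerical solution of (A.2)/(A.3) by fixed-point iteration,
# assuming D(a, v) = 1 - Phi((a - v)/sigma).
from scipy.stats import norm

def D(a, v, sigma):
    return 1.0 - norm.cdf((a - v) / sigma)

def competitive_price(mu, vH=4.0, vL=0.0, sigma=5.0, tol=1e-10, max_iter=10_000):
    """Candidate competitive price: fixed point of a -> E[v | trade at price a]."""
    a = mu * vH + (1 - mu) * vL                 # start at the unconditional expectation
    for _ in range(max_iter):
        dH, dL = D(a, vH, sigma), D(a, vL, sigma)
        a_new = (mu * dH * vH + (1 - mu) * dL * vL) / (mu * dH + (1 - mu) * dL)
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a

# First-round competitive price with the prior mu = 1/2 (Tables 1-2, sigma = 5): about 2.68.
print(round(competitive_price(0.5), 2))
```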

A.4 Derivation of the Monopolist’s Prices

For any given belief µ ∈ [0, 1] that a monopolist might have about ṽ in a given round τ , if the

monopolist sets a price of a, his expected payoff from that trading round is equal to $\bar\Pi(a, \mu)$.

Let am (µ) solve:

$$a^m(\mu) \in \arg\max_a \bar\Pi(a, \mu). \qquad (A.4)$$

That is, am (µ) is the price that maximizes the monopolist dealer’s expected payoff in round τ , given

his belief µ. If the monopolist plays price am (µ), his round τ expected payoff is equal to

$\Pi^*(\mu) := \bar\Pi(a^m(\mu), \mu)$.

Monopolist case with one trading round (τ̄ = 1). Because the initial belief is µ = 1/2, when there is a single trading round, the monopolist sets a price of $a^m(1/2)$.

Monopolist case with two trading rounds (τ̄ = 2). We now consider the optimal pricing

policy of a monopolist dealer when there are two trading rounds. To do so, we proceed by backward

induction.

In the second and last round, if the monopolist's belief at the beginning of the second period is $\mu_2$, the price he chooses must be equal to $a^m(\mu_2)$, leading to a second-period payoff of $\Pi^*(\mu_2)$.



Let us now consider the monopolist's total payoff from the perspective of period 1. If he sets a first-period price of a, then his posterior belief µ2 is equal to µ2(1, a) (given in (18) in the text) if the first-period client buys, whereas µ2 = µ2(0, a) (given in (19) in the text) if the first-period client does not buy. Hence the monopolist will set his first-period price $a^m_1$ equal to the price a maximizing his total payoff

 
$$\underbrace{\bar\Pi\!\left(a, \tfrac{1}{2}\right)}_{\text{First round payoff}} + \underbrace{Pr(a)\,\Pi^*(\mu_2(1, a)) + (1 - Pr(a))\,\Pi^*(\mu_2(0, a))}_{\text{Second round payoff}}, \qquad (A.5)$$

where $Pr(a) := \tfrac{1}{2}\left(D(a, v_H) + D(a, v_L)\right)$ is the probability that a trade takes place at date 1 if the monopolist chooses price a at this date. Thus, in choosing her price at date 1, the monopolist

accounts for the effect of this price on her expected profit on the trade at date 1 and her continuation

value.

When τ̄ = 2, we obtain the benchmark price at date 2 in the monopoly case by solving numeri-

cally (A.4) (both when there is a trade at date 1 and when there is no trade) and the benchmark

price at date 1 by maximizing (A.5).
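A minimal sketch of this backward induction, again under the assumption D(a, v) = 1 − Φ((a − v)/σ) and with an illustrative price grid, is:

```python
# Sketch of the monopolist benchmark with two rounds: solve (A.4) by grid search
# for any belief mu, then maximize the two-round objective (A.5) over the first price.
import numpy as np
from scipy.stats import norm

vH, vL, sigma = 4.0, 0.0, 5.0
grid = np.linspace(0.0, 12.0, 601)            # illustrative price grid

def D(a, v):
    return 1.0 - norm.cdf((a - v) / sigma)

def profit(a, mu):                             # one-round expected profit, Pi_bar(a, mu)
    return mu * D(a, vH) * (a - vH) + (1 - mu) * D(a, vL) * (a - vL)

def pi_star(mu):                               # Pi*(mu) = max_a Pi_bar(a, mu)
    return profit(grid, mu).max()

def mu2(trade, a):                             # posteriors (18) and (19)
    dH, dL = D(a, vH), D(a, vL)
    return dH / (dH + dL) if trade else (1 - dH) / (2 - (dH + dL))

def total_payoff(a):                           # two-round objective (A.5), prior mu = 1/2
    pr = 0.5 * (D(a, vH) + D(a, vL))
    return profit(a, 0.5) + pr * pi_star(mu2(True, a)) + (1 - pr) * pi_star(mu2(False, a))

a1_m = grid[np.argmax([total_payoff(a) for a in grid])]
print(round(a1_m, 2))                          # first-round monopoly price (about 6.53 in Table 2)
```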

A.5 Proof of Lemma 1

Fix a price am and a dealer n. Suppose that at episode t the dealer’s price is an,t = am and

it is the lowest price among dealers, i.e., $a_{n,t} = a_m = a^{\min}_t$. Then three outcomes are possible:
t . Then three outcomes are possible:

either the dealer does not trade, the dealer sells the asset worth vH , or the dealer sells the asset

worth vL . In all cases the Q-matrix is updated. If the dealer does not trade then πn,t = 0 and

$q_{m,n,t+1} = (1 - \alpha)\, q_{m,n,t}$, implying

|qm,n,t − qm,n,t+1 | = α|qm,n,t |

If the dealer trades then $q_{m,n,t+1} = \alpha(a_m - \tilde v) + (1 - \alpha)\, q_{m,n,t}$, and thus

|qm,n,t − qm,n,t+1 | = α|am − vH − qm,n,t |



if ṽ = vH , and

|qm,n,t − qm,n,t+1 | = α|am − vL − qm,n,t |

if $\tilde v = v_L$. Denote $\Delta_m(q) := \alpha \max\{|q|, |a_m - v_H - q|, |a_m - v_L - q|\}$ the maximum possible value that $|q_{m,n,t} - q_{m,n,t+1}|$ can take given that $q_{m,n,t} = q$. Note that

   
$$\min_q \Delta_m(q) = \alpha \max\left\{\frac{a_m - v_L}{2},\; \frac{v_H - a_m}{2},\; \frac{v_H - v_L}{2}\right\} =: \Delta^*_m.$$

In words, no matter the value of qm,n,t , at least one of the three possible outcomes mentioned above

leads to |qm,n,t − qm,n,t+1 | ≥ ∆∗m . Thus the probability that |qm,n,t − qm,n,t+1 | ≥ ∆∗m cannot be

smaller than the smallest of the probabilities of these three events.

Now, given $a_{n,t} = a_m = a^{\min}_t$, the probability that the dealer sells the asset worth $v_H$ is at least $\tfrac{1}{2N} D(a_m, v_H)$. The probability that the dealer sells the asset worth $v_L$ is at least $\tfrac{1}{2N} D(a_m, v_L) < \tfrac{1}{2N} D(a_m, v_H)$. The probability that the dealer does not trade is $1 - \tfrac{1}{2}\left(D(a_m, v_L) + D(a_m, v_H)\right)$, hence the expression for $P^*_m$. Q.E.D.
