
STRATEGIC AND ADAPTIVE EXECUTION

A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

Beomsoo Park
July 2012
© 2012 by Beomsoo Park. All Rights Reserved.
Re-distributed by Stanford University under license with the author.

This dissertation is online at: http://purl.stanford.edu/qh657ph4274

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Benjamin Van Roy, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Kay Giesecke

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Ramesh Johari

Approved for the Stanford University Committee on Graduate Studies.


Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.

Abstract

First, we consider a trader who aims to liquidate a large position in the presence of an
arbitrageur who hopes to profit from the trader’s activity. The arbitrageur is uncer-
tain about the trader’s position and learns from observed price fluctuations. This is a
dynamic game with asymmetric information. We present an algorithm for computing
perfect Bayesian equilibrium behavior and conduct numerical experiments. Our re-
sults demonstrate that the trader’s strategy differs significantly from one that would
be optimal in the absence of the arbitrageur. In particular, the trader must balance
the conflicting desires of minimizing price impact and minimizing information that is
signaled through trading. Accounting for information signaling and the presence of
strategic adversaries can greatly reduce execution costs.
Second, we consider a model in which a trader seeks to maximize expected risk-
adjusted profit while trading a single security. In our second model, there is no arbi-
trageur, and each price change is a linear combination of observed factors, impact re-
sulting from the trader’s current and prior activity, and unpredictable random effects.
The trader must learn coefficients of a price impact model while trading. We propose
a new method for simultaneous execution and learning – the confidence-triggered
regularized adaptive certainty equivalent (CTRACE) policy – and establish a poly-
logarithmic finite-time expected regret bound. This bound implies that CTRACE
is efficient in the sense that the (, δ)-convergence time is bounded by a polynomial
function of 1/ and log(1/δ) with high probability. In addition, we demonstrate via
Monte Carlo simulation that CTRACE outperforms the certainty equivalent policy
and a recently proposed reinforcement learning algorithm that is designed to explore
efficiently in linear-quadratic control problems.

To my family

Acknowledgements

First and foremost, I am deeply grateful to my advisor, Prof. Benjamin Van Roy,
for his guidance and earnest advice. Not only has he taught me how to formulate
interesting problems and carry out in-depth research on them, but he has constantly
encouraged me to improve myself in many other respects. Interactions with him have
always given me fresh ideas and inspiration for challenging problems that I have faced
during my Ph.D program, and his way of thinking has had a great impact on mine.
Next, I would like to express my sincere thanks to Prof. Ramesh Johari and
Prof. Kay Giesecke who are my thesis reading committee members. Their insightful
comments and advice have been crucial to enhancing the quality of this thesis sub-
stantially. Also, I am honored to have had Prof. J. Michael Harrison as the chair of my oral
examination committee.
I feel indebted to Prof. Ciamac C. Moallemi, who as a co-author helped me greatly to
finish the paper on which Part I of this thesis is based, and who gave me detailed
answers to my questions at all times. His knowledge of many different areas has
profoundly impressed me, and I have benefited from it significantly.
I am very fortunate to have had Edward Kao, Michael Padilla, Waraporn Tongprasit,
Zheng Wen and Robbie Yan as friends and officemates. I have learned enormously
from them and I am heartily thankful for their kindness. Active discussions
with them on a variety of topics have deepened my understanding and sharpened my
thoughts, and remain among my most precious memories of Stanford.
Finally, I could not have finished the long academic journey at Stanford without
the wholehearted support and encouragement of my family. They have always meant
everything to me, and I dedicate this thesis to my beloved family, especially to my
parents.

The author of this thesis was supported by a Samsung Scholarship.

Notation

A ≜ B indicates “define A by B.” The set of all real numbers is denoted by R and
the set of all natural numbers is denoted by N. X ∼ N(µ, σ²) represents that a
random variable X has a Gaussian distribution with mean µ and variance σ². ‖·‖ and
‖·‖F denote the ℓ2-norm and the Frobenius norm of a matrix, respectively. a ∨ b and
a ∧ b denote max{a, b} and min{a, b}, respectively. For a symmetric matrix A, A ≻ 0
means that A is positive definite and A ⪰ 0 means that A is positive semidefinite.
λmin(A) indicates the smallest eigenvalue of A. (A)ij of a matrix A indicates the entry
of A in the ith row and the jth column. (v)i of a vector v indicates the ith entry
of v. diag(v) of a vector v denotes a diagonal matrix whose ith diagonal entry is (v)i.
A∗,j denotes the jth column of A and Ai:j,k indicates the segment of the kth column
of A from the ith entry to the jth entry. 1 denotes the all-one vector of appropriate
dimension, whereas 1{B} denotes the indicator function of the event B.

Contents

Abstract

Acknowledgements

Notation

1 Introduction
1.1 Optimal Execution Strategies
1.2 Strategic Execution
1.3 Adaptive Execution
1.4 Organization

2 Part I: Strategic Execution
2.1 Problem Formulation
2.1.1 Game Structure
2.1.2 Price Dynamics
2.1.3 Information Structure
2.1.4 Policies
2.1.5 Objectives
2.1.6 Equilibrium Concept
2.2 Dynamic Programming Analysis
2.2.1 Stage-Wise Decomposition
2.2.2 Linear Policies
2.2.3 Quadratic Value Functions
2.2.4 Simplified Conditions for Equilibrium
2.3 Algorithm
2.3.1 Representation of Policies
2.3.2 Searching for Equilibrium Variances
2.4 Computational Results
2.4.1 Alternative Policies
2.4.2 Relative Volume
2.4.3 Policy Performance
2.4.4 Signaling
2.4.5 Adaptive Trading
2.4.6 Does an Arbitrageur Benefit the Market?
2.5 Extensions
2.5.1 Time Horizon
2.5.2 Risk Aversion
2.5.3 Price Impact and Price Dynamics

3 Part II: Adaptive Execution
3.1 Problem Formulation
3.1.1 Model Description
3.1.2 Existence of Optimal Solution
3.1.3 Closed-Form Solution: A Single Factor and Permanent Impact Only
3.1.4 Performance Measure: Regret
3.2 CTRACE
3.3 Computational Analysis

4 Conclusion

A Proofs for Chapter 2

B Proofs for Chapter 3

List of Tables

3.1 Monte Carlo Simulation Setting

List of Figures

2.1 The normalized expected profit of trading strategies for the time horizon T = 20
2.2 The spill-over of the system for the time horizon T = 20
2.3 The evolution of relative uncertainty of the trader’s position for the time horizon T = 20
2.4 The deterministic components of trading strategies for the time horizon T = 20
2.5 Sample paths of the evolution of the trader’s actual and expected positions, and the arbitrageur’s mean belief, when T = 20, x0 = σ0 = 10^5, µ0 = y0 = 0, σε = 0.125, λ = 10^−5
2.6 The volatility of the PBE policy and the equipartitioning policy for the time horizon T = 20 when σε = 10^−2

3.1 Relative errors for PT and L(θt)
3.2 Relative regret with varying κ and Cv
3.3 Comparisons of CTRACE to CE in terms of relative regret and realized profit
3.4 Comparisons of CTRACE to the AS policy in terms of relative regret and realized profit

Chapter 1

Introduction

1.1 Optimal Execution Strategies


In financial markets, execution services refer to the implementation of investment
decisions: how, where, and when to buy or sell particular financial instruments.
Execution of large block orders often entails high execution costs that can be severely
detrimental to the return of investment portfolios. In order to enhance investment
performance, it is crucial to design execution strategies that minimize execution costs.
Growing recognition of the importance of execution has fueled academic literature on
the topic [3, 4, 5, 7, 6, 11, 18, 20, 31, 29, 33, 36, 39, 43, 45] as well as the formation
of specialized groups at investment banks and other organizations to offer execution
services.
By execution costs, we mean all costs associated with executing orders for finan-
cial instruments [11]. Some execution costs, such as exchange fees, commissions and
bid-ask spreads, are known in advance before executing orders. However, there is
an “invisible” component that often dominates other sources of execution costs: A
large block trade tends to “move the market” considerably during its execution by
disturbing the balance between supply and demand or adjusting other market partic-
ipants’ valuations. Such a trade is typically executed through a sequence of orders,
each of which pushes price in an adverse direction. This effect is called price impact.
Price impact can be dramatic when trading large blocks, when the security is thinly
traded, or when there is an urgent demand for liquidity. Because it is responsible for
a large fraction of execution costs, it is important to design execution strategies that
effectively manage price impact. Execution algorithms aim to reduce price impact by
partitioning the quantity to be traded and placing trades sequentially.
Optimal execution algorithms have been developed for a number of models. In the
base model of [11], a stock price nominally follows a discrete-time random walk and the
market impact of a trade is permanent and linear in trade size. The authors establish
that expected cost is minimized by an equipartitioning policy. This policy trades equal
amounts over time increments within the trading horizon. Further developments have
led to optimal execution algorithms for models that incorporate price predictions [11],
bid-ask spreads and resilience [3, 39], nonlinear price impact models [4, 5], and risk
aversion [7, 6, 18, 20, 29, 31, 36, 43, 45].
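As a concrete illustration of this result, the following Python sketch evaluates the expected shortfall of a sell schedule in a random-walk model with permanent linear impact of the kind just described, and compares the equipartitioning schedule with a front-loaded alternative. The snippet and its parameter values are illustrative only and are not taken from [11].

```python
import numpy as np

# Illustrative sketch: expected execution cost of selling x0 shares over T periods
# when each sale s_t is filled at p_t = p_{t-1} - lam * s_t + noise (permanent
# linear impact on a random walk).  Relative to the pre-trade value p_0 * x0, the
# expected shortfall is lam * sum_t s_t * (s_1 + ... + s_t), which equals
# lam/2 * (x0**2 + sum_t s_t**2) and is minimized by s_t = x0 / T.

def expected_shortfall(schedule, lam):
    cum = np.cumsum(schedule)
    return lam * np.sum(schedule * cum)

lam, x0, T = 1e-5, 1e5, 20              # illustrative values
equal = np.full(T, x0 / T)              # equipartitioning schedule
front = np.linspace(2 * x0 / T, 0, T)   # front-loaded alternative
front *= x0 / front.sum()               # rescale to the same total quantity

print(expected_shortfall(equal, lam))   # smallest expected cost
print(expected_shortfall(front, lam))   # strictly larger
```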
The aforementioned results offer insight into how one should partition a block and
sequence trades under various assumptions about market dynamics and objectives.
The resulting algorithms, however, are unrealistic in that they exhibit predictable
behavior. Such predictable behavior allows strategic adversaries, which we call arbi-
trageurs, to “front-run” trades and profit at the expense of increased execution costs.
For example, consider liquidating a large block by an equipartitioning policy which
sells an equal amount during each minute of a trading day. Trades early in the day
generate abnormal price movements. The resulting “information leakage” allows an
observing arbitrageur to anticipate further liquidation. If the arbitrageur sells short
and closes his position at the end of the day, he profits from expected price decrease.
The arbitrageur’s actions amplify price impact and therefore increase execution costs.
Concern about the increased cost of trading due to information leakage is not aca-
demic. Indeed, it is known that many high-frequency statistical arbitrage trading
strategies developed by banks and hedge funds profit by exploiting precisely this type
of signaling [19].
Another issue with price impact is that it is not known a priori in practice; traders
must learn it from their trading activity using some form of price impact model. In
fact, learning a price impact model poses a challenging problem. Price impact
represents an aggregation of numerous market participants’
interpretations of and reactions to executed trades. As such, learning requires “excitation”
of the market, which can be induced by regular trading activity or trades
deliberately designed to facilitate learning. The trader must balance the short term
costs of accelerated learning against the long term benefits of an accurate model.
Further, given the continual evolution of trading venues and population of market
participants, price impact models require retuning over time.
In this thesis, we address two important issues regarding price impact when de-
signing optimal execution strategies: information leakage through price impact, and
simultaneous execution and learning of price impact. In the first part of this thesis,
we formulate and study a dynamic game with asymmetric information in which a
trader balances the trade-off between minimizing price impact and concealing private
information about his stock position in the presence of a strategic arbitrageur who is
willing to profit from front-running the trader. In the second part of this thesis, we
develop an algorithm that learns a price impact model while guiding trading decisions
using the model being learned. We will make detailed introductory remarks on each
problem in the following sections.

1.2 Strategic Execution in the Presence of an Uninformed Arbitrageur
Several recent papers study game-theoretic models of execution in the presence of
strategic arbitrageurs [12, 15, 17, 40, 44]. However, these models involve games with
symmetric information, in which arbitrageurs know the position to be liquidated.
In more realistic scenarios, this information would be the private knowledge of the
trader, and the arbitrageurs would make inferences as to the trader’s position based
on observed market activity.
This type of information asymmetry is central to effective execution. The fact
that his position is unknown to others allows the trader to greatly reduce execution
costs. But to do so requires the deliberate management of information leakage, or
the signals that are transmitted via trading activity. Further, the desire to minimize
information signaling may be at odds with the desire to minimize price impact. A
model through which such signaling can be studied must account for uncertainty
among arbitrageurs and their ability to learn from observed price fluctuations. In
Part I of this thesis, we formulate and study a model which we believe to be the first
that meets this requirement.
More precisely, we formulate the optimal execution problem as a dynamic game
with asymmetric information involving a trader and a single arbitrageur. Both agents
are risk-neutral, and we assume that price impact is permanent and linear in order
size. The trader aims to liquidate his position in a finite time horizon. Meanwhile,
the arbitrageur attempts to infer the trader’s position from the impact of the trader’s
activity on prices, and profit from this information. We develop an algorithm that
computes a perfect Bayesian equilibrium of this game, and demonstrate that the
associated equilibrium strategies have several interesting structural properties and
adapt effectively to price fluctuations. Indeed, the trader’s adaptive behavior is
crucial in alleviating the loss from front-running by misleading the arbitrageur into
incorrect estimates of the trader’s position. Note that strategies proposed based on
prior models in the literature are deterministic and independent of price movements.
Solving for perfect Bayesian equilibrium in dynamic games with asymmetric infor-
mation is notoriously difficult. What facilitates effective computation in our model is
that, in equilibrium, each agent solves a tractable linear-quadratic Gaussian control
problem. Similar approaches based on linear-quadratic Gaussian control have previ-
ously been used to analyze equilibrium behavior of traders with private information.
Here, the private signal typically takes the form of information on the fundamental
value of the traded asset. This line of work on “insider trading” or “strategic trad-
ing” begins with the seminal paper of [34], and includes many subsequent papers [e.g.,
9, 10, 14, 21, 22, 26, 27, 28, 48, 47]. Among these contributions, [21] come closest to
the model and method we propose. In the model of that paper, there are two strate-
gic traders, many “noise” traders, and a market maker. The strategic traders possess
information that is not initially reflected in market prices. One trader knows more
than the other. The more informed trader adapts trades to maximize his expected
payoff, and this entails controlling how his private information is revealed through
price fluctuations. This model parallels ours if we think of the arbitrageur as the less
informed trader. However, in our model there is no private information about future
dividends but instead uncertainty about the size of the position to be liquidated. Fur-
ther, in the model of [21], trades influence prices because the market maker tries to
infer the traders’ private information. In our setting, there is an exogenously specified
price impact model. The algorithm we develop bears some similarity to that of [21],
but requires new features designed to address differences in our model.
It is also worth discussing how our model differs from that of [48]. Both models
consider the inventory of a large trader as asymmetric information. The details and
the goals of the models are significantly different, however. In particular, [48] seeks a
structural model to provide intuition for behavior of a large trader seeking to maximize
utility through trade and consumption decisions. We, on the other hand, specialize
to the context of minimizing execution costs in a short time horizon (e.g., one day)
trade execution problem. We seek to provide specific policy recommendations for
this problem, which is known to be of significant practical interest. [48] assumes an
implicit price impact that arises endogenously through a continuum of competitive
market makers. We assume an exogenous and explicit price impact. This is important
in the context of trade execution problems, since such forms of explicit price impact
can be directly estimated. Moreover, the competitive market makers of [48] do not
directly trade. Instead, they manipulate prices in anticipation of the large trader, and
their effect is diluted through competition. Our model, with a strategic arbitrageur,
more directly captures the idea of “front-running.”
The contributions of Part I of this thesis are summarized as follows:
1. We formulate the optimal execution problem as a dynamic game with asym-
metric information. This game involves a trader and a single arbitrageur. Both
agents are risk-neutral, and market dynamics evolve according to a linear per-
manent price impact model. The trader seeks to liquidate his position in a
finite time horizon. The arbitrageur attempts to infer the position of the trader
by observing market price movements, and seeks to exploit this information for
profit.
2. We develop an algorithm that computes perfect Bayesian equilibrium behavior.
3. We demonstrate that the associated equilibrium strategies take on a simple
structure: Trades placed by the trader are linear in the trader’s position, the
arbitrageur’s position and the arbitrageur’s expectation of the trader’s position.
Trades placed by the arbitrageur are linear in the arbitrageur’s position and his
expectation of the trader’s position. Equilibrium policies depend on the time
horizon and a parameter that we call the “relative volume”. This parameter
captures the magnitude of the per-period activity of the trader relative to the
exogenous fluctuations of the market.
4. We present computational results that make several points about perfect Bayesi-
an equilibrium in our model:

(a) In the presence of adversaries, there are significant potential benefits to


employing perfect Bayesian equilibrium strategies.
(b) Unlike strategies proposed based on prior models in the literature, which
exhibit deterministic sequences of trades, trades in a perfect Bayesian equi-
librium adaptively respond to price fluctuations; the trader leverages these
random outcomes to conceal his activity.
(c) When the relative volume of the trader’s activity is low, in equilibrium,
the trader can ignore the presence of the arbitrageur and will equipartition
to minimize price impact. Alternatively, when the relative volume is high,
the trader will concentrate his trading activity in a short time interval so
as to minimize signaling.
(d) The presence of the arbitrageur can benefit and/or harm the market in
the following sense: The arbitrageur encourages informed traders to make
the price informative more rapidly. However, the arbitrageur encourages
liquidity traders to generate greater volatility conveying no information
about the fundamental value of the stock.
(e) The presence of the arbitrageur leads to a spill-over effect. That is, the
trader’s expected loss due to the arbitrageur’s presence is larger than the
expected profit of the arbitrageur. Hence, other market participants ben-
efit from the arbitrageur’s activity.

5. We discuss how the basic model presented can be extended to incorporate a


number of additional features, such as transient price impact and risk aversion.

1.3 Adaptive Execution – Exploration and Learning of Price Impact
In Part I of this thesis, we assume that both the trader and arbitrageur have
complete knowledge of a price impact model. However, this is typically not the case
in practice: Price impact is what traders should learn from their trading activity. A
price impact model can be learned only through executed trades and requires retuning
over time due to the continual evolution of trading venues and population of market
participants. In the second part of this thesis, we consider a single trader who aims to
minimize price impact and learn unknown price impact simultaneously. Note that no
arbitrageur is present in our second model. The problem that the trader faces can be
viewed as a special case of reinforcement learning. This topic more broadly addresses
sequential decision problems in which unknown properties of an environment must be
learned in the course of operation (see, e.g., [46]). Research in this area has established
how judicious investments in decisions that explore the environment at the expense of
suboptimal short-term behavior can greatly improve longer-term performance. What
we develop in Part II can be viewed as a reinforcement learning algorithm: the
workings of price impact are unknown, and exploration facilitates learning.
In reinforcement learning, one seeks to optimize the balance between exploration
and exploitation – the use of what has already been learned to maximize rewards
without regard to further learning. Certainty equivalent control (CE) represents
one extreme where at any time, current point estimates are assumed to be correct
and actions are made accordingly. This is an instance of pure exploitation; though
learning does progress with observations made as the system evolves, decisions are
not deliberately oriented to enhance learning.
An important question is how aggressively a trader should explore to learn a
price impact model. Unlike many other reinforcement learning problems, in ours a
considerable degree of exploration is naturally induced by exploitative decisions. This
is because a trader excites the market through regular trading activity regardless of
whether or not she aims to learn a price impact model. This activity could, for
example, be triggered by return-predictive factors, and given sufficiently large factor
variability, the induced exploration might adequately resolve uncertainties about price
impact. Results of the Part II demonstrate that executing trades to explore beyond
what would naturally occur through exploitation can yield significant benefit.
Our work is constructive: we propose the confidence-triggered regularized adaptive
certainty equivalent policy (CTRACE), pronounced “see-trace,” a new method that
explores and learns a price impact model alongside trading. CTRACE can be viewed
as a generalization of CE, which at each point in time estimates coefficients of a price
impact model via least-squares regression using available data and makes decisions
that optimize trading under an assumption that the estimated model is correct and
will be used to guide all future decisions. CTRACE deviates in two ways: (1) ℓ2
regularization is applied in least-squares regression and (2) coefficients are only up-
dated when a certain measure of confidence exceeds a pre-specified threshold and a
minimum inter-update time has elapsed. Note that CTRACE reduces to CE as the
regularization penalty, the threshold, and the minimum inter-update time vanish.
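The following Python sketch illustrates these two deviations in stylized form. It is not the exact CTRACE specification developed in Chapter 3: the choice of confidence measure (here the smallest eigenvalue of the regularized Gram matrix), the threshold, the minimum inter-update time, and the regressors are illustrative placeholders.

```python
import numpy as np

# Hypothetical sketch of the two ingredients described above:
# (1) ell_2-regularized least squares for the price-impact coefficients, and
# (2) updating the working estimate only when a confidence measure exceeds a
#     threshold and a minimum inter-update time has elapsed.
class ConfidenceTriggeredEstimator:
    def __init__(self, dim, reg=1.0, conf_threshold=10.0, min_gap=5):
        self.V = reg * np.eye(dim)      # regularized Gram matrix
        self.b = np.zeros(dim)
        self.theta = np.zeros(dim)      # working coefficient estimate
        self.conf_threshold, self.min_gap = conf_threshold, min_gap
        self.last_update = -min_gap

    def observe(self, t, z, dp):
        """z: regressor vector for period t, dp: observed price change."""
        self.V += np.outer(z, z)
        self.b += dp * z
        confident = np.linalg.eigvalsh(self.V)[0] >= self.conf_threshold
        if confident and t - self.last_update >= self.min_gap:
            self.theta = np.linalg.solve(self.V, self.b)   # ridge estimate
            self.last_update = t
        return self.theta
```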
We demonstrate through Monte Carlo simulation that CTRACE outperforms CE.
Further, we establish a finite-time regret bound for CTRACE; no such bound is
available for CE. Regret is defined here to be the difference between realized risk-
adjusted profit of a policy in question and one that is optimal with respect to the
true price impact model. Our bound exhibits a poly-logarithmic dependence on time.
Among other things, this regret bound implies that CTRACE is efficient in the sense
that the (ε, δ)-convergence time is bounded by a polynomial function of 1/ε and
log(1/δ) with high probability. We define the (ε, δ)-convergence time to be the first
time when an estimate and all the future estimates following it are within an ε-
neighborhood of a true value with probability at least 1 − δ. Let us provide here
some intuition for why CTRACE outperforms CE. First, regularization enhances
exploration in a critical manner. Without regularization, we are more likely to obtain
overestimates of price impact. Such an outcome abates trading and thus exploration,
making it difficult to escape from the predicament. Regularization reduces the chances
of obtaining overestimates, and further, tends to yield underestimates that encourage
active exploration. Second, requiring a high degree of confidence reduces the chances
of occasionally producing erratic estimates, which regularly arise with application of
CE. Such estimates can result in undesirable trades and/or reductions in the degree
of exploration.
It is also worth comparing CTRACE to a reinforcement learning algorithm re-
cently proposed in [2] which appears well-suited for our problem. This algorithm was
designed to explore efficiently in a broader class of linear-quadratic control problems,
and is based on the principle of optimism in the face of uncertainty. [2] establish
an O(√(T log(1/δ))) regret bound that holds with probability at least 1 − δ, where T
denotes time and some logarithmic terms are hidden. Our bound for CTRACE is on
expected regret and exhibits a dependence on T of O(log² T ). We also demonstrate
via Monte Carlo simulation that CTRACE dramatically outperforms this algorithm.
To summarize, the primary contributions of Part II of this thesis include:

1. We propose a new method for simultaneous execution and learning – the confi-
dence-triggered regularized adaptive certainty equivalent (CTRACE) policy.

2. We establish a finite-time expected regret bound for CTRACE that exhibits


a poly-logarithmic dependence on time. This bound implies that CTRACE is
efficient in the sense that, with probability 1 − δ, the (ε, δ)-convergence time is
bounded by a polynomial function of 1/ε and log(1/δ).

3. We demonstrate via Monte Carlo simulation that CTRACE outperforms the


certainty equivalent policy and a reinforcement learning algorithm recently pro-
posed by [2] which is designed to explore efficiently in linear-quadratic control
problems.

1.4 Organization
The rest of this thesis is organized as follows: Chapter 2 focuses on strategic liqui-
dation of a large position in the presence of an arbitrageur who seeks to profit
from the trader’s activity; this part comprises Part I of this thesis and is based on
[37]. Section 2.1 presents our problem formulation. Section 2.2 discusses how per-
fect Bayesian equilibrium in this model is characterized by a dynamic program. A
practical algorithm for computing perfect Bayesian equilibrium behavior is developed
in Section 2.3. This algorithm is applied in computational studies, for which results
are presented and interpreted in Section 2.4. Several extensions of this model are
discussed in Section 2.5. Proofs of all theoretical results in this chapter are presented
in Appendix A.
In Chapter 3, we discuss simultaneous execution and learning of unknown price
impact via active exploration. This chapter comprises Part II of this thesis and is
based on [41]. Section 3.1 presents our problem formulation, establishes existence and
uniqueness of an optimal solution to our problem, and defines performance measures
that can be used to evaluate policies. In Section 3.2, we propose CTRACE and derive
a finite-time expected regret bound for CTRACE along with two properties: inter-
temporal consistency and efficiency. Section 3.3 is devoted to Monte Carlo simulation
in which the performance of CTRACE is compared to that of two benchmark policies.
Proofs of all theoretical results in this chapter are provided in Appendix B.
Finally, Chapter 4 makes some closing remarks and suggests directions for future
work.
Chapter 2

Part I: Strategic Execution in the


Presence of an Uninformed
Arbitrageur

2.1 Problem Formulation


In this section, the optimal execution problem is formulated as a dynamic game with
asymmetric information. Our formulation makes a number of simplifying assump-
tions and we omit several factors that are important in the practical implementation
of execution strategies, for example, transient price impact and risk aversion. Our
goal here is to highlight the strategic and informational aspects of execution in a
streamlined fashion. However, these assumptions are discussed in more detail and a
number of extensions of this basic model are presented in Section 2.5. Indeed, many
such extensions to the price dynamics are applied in Part II.

2.1.1 Game Structure


Consider a game that evolves over a finite horizon in discrete time steps t = 0, . . . , T +
1. There are two players: a trader and an arbitrageur. The trader begins with a
position x0 ∈ R in a stock, which he must liquidate by time T . Denote his position at
each time t by xt , and thus require that xt = 0 for t ≥ T . The arbitrageur begins with
a position y0 . Denote his position at each time t by yt . In general, the arbitrageur


has additional flexibility and will not be limited to the same time horizon as the
trader. For simplicity, this flexibility is modeled by assuming that the arbitrageur
has one additional period of trading activity. In other words, though we do require
that yT +1 = 0, we do not require that yT = 0. This assumption will be revisited in
Section 2.5.1.

2.1.2 Price Dynamics


Denote the price of the stock at time t by pt . This price evolves according to the
permanent linear price impact model given by

pt = pt−1 + ∆pt = pt−1 + λ(ut + vt ) + εt . (2.1)

Here, λ > 0 is a parameter that reflects the sensitivity of prices to trade size, and
ut and vt are, respectively, the quantities of stock purchased by the trader and arbi-
trageur at time t. Note that, given the horizon of the trader, uT+1 ≜ 0. The positions
evolve according to xt = xt−1 + ut and yt = yt−1 + vt .
The sequence {εt } is a normally distributed IID process with εt ∼ N (0, σε2 ), for
some σε > 0. This noise sequence represents the random and exogenous fluctuations
of market prices. We assume that the trading decisions ut and vt are made at time
t − 1, and executed at the price pt at time t. Note that there is no drift term in the
price evolution equation (2.1). In the intraday horizon of typical optimal execution
problems, this is usually a reasonable assumption. This assumption will be revisited
in Section 2.5.3. Further, the price impact in (2.1) is permanent in the sense that it
is long-lived relative to the length of the time horizon T . It is stationary in the sense
that the sensitivity λ is constant. In Section 2.5.3, we will allow for transient price
impact as well as non-stationary price dynamics.
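For concreteness, the following Python sketch simulates the price dynamics (2.1) with an equipartitioning trader and an inactive arbitrageur; the numerical values of λ, σε, and the initial price are illustrative only.

```python
import numpy as np

# Simulation sketch of the permanent linear impact model (2.1):
# p_t = p_{t-1} + lam * (u_t + v_t) + eps_t, with positions updated by
# x_t = x_{t-1} + u_t and y_t = y_{t-1} + v_t.  Parameters are illustrative.
rng = np.random.default_rng(0)
lam, sigma_eps, T = 1e-5, 0.125, 20
p, x, y = 50.0, 1e5, 0.0

for t in range(1, T + 1):
    u = -x / (T - t + 1)          # trader equipartitions the remaining position
    v = 0.0                       # arbitrageur inactive in this illustration
    p += lam * (u + v) + sigma_eps * rng.standard_normal()
    x += u
    y += v

print(x)  # 0.0: the position is fully liquidated by time T
```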

2.1.3 Information Structure


The information structure of the game is as follows. The dynamics of the game (in
particular, the parameters λ and σε ) and the time horizon T are mutually known.
From the perspective of the arbitrageur, the initial position x0 of the trader is un-
known. Further, the trader’s actions ut are not directly observed. However, the
arbitrageur begins with a prior distribution φ0 on the trader’s initial position x0 . As
the game evolves over time, the arbitrageur observes the price change ∆pt at each
time t. The arbitrageur updates his beliefs based on these price movements, at any
time t maintaining a posterior distribution φt of the trader’s current position xt , based
on his observation of the history of the game up to and including time t.
From the trader’s perspective, it is assumed that everything is known. This is mo-
tivated by the fact that the arbitrageur’s initial position y0 will typically be zero and
the trader can go through the same inference process as the arbitrageur to arrive at
the prior distribution φ0 . Given a prescribed policy for the arbitrageur (for example,
in equilibrium), the trader can subsequently reconstruct the arbitrageur’s positions
and beliefs over time, given the public observations of market price movements. We
do make the assumption, however, that any deviations on the part of the arbitrageur
from his prescribed policy will not mislead the trader. In our context, this assump-
tion is important for tractability. We discuss the situation where this assumption is
relaxed, and the trader does not have perfect knowledge of the arbitrageur’s positions
and beliefs, in Chapter 4.

2.1.4 Policies
The trader’s purchases are governed by a policy, which is a sequence of functions
π = {π1 , . . . , πT }. Each function πt+1 maps xt , yt , and φt , to a decision ut+1 at time
t. Similarly, the arbitrageur follows a policy ψ = {ψ1 , . . . , ψT +1 }. Each function ψt+1
maps yt and φt to a decision vt+1 made at time t. Since policies for the trader and
arbitrageur must result in liquidation, we require that πT (xT −1 , yT −1 , φT −1 ) = −xT −1
and ψT +1 (yT , φT ) = −yT . Denote the set of trader policies by Π and the set of
arbitrageur policies by Ψ.
Note that implicit in the above description is the restriction to policies that are
Markovian in the following sense: the state of the game at time t is summarized for
the trader and arbitrageur by the tuples (xt , yt , φt ) and (yt , φt ), respectively, and each
player’s action is only a function of his state. Further, the policies are pure strategies
in the sense that, as a function of the player’s state, the actions are deterministic.
In general, one may wish to consider policies which determine actions as a function
of the entire history of the game up to a given time, and allow randomization over
the choice of action. Our assumptions will exclude equilibria from this more general
class. However, it will be the case that for the equilibria that we do find, arbitrary
deviations that are history dependent and/or randomized will not be profitable.
If the arbitrageur applies an action vt and assumes the trader uses a policy π̂ ∈ Π,
then upon observation of ∆pt at time t, the arbitrageur’s beliefs are updated in a
Bayesian fashion according to
 
φt (S) = Pr( xt ∈ S | φt−1 , yt−1 , λ(π̂t (xt−1 , yt−1 , φt−1 ) + vt ) + εt = ∆pt ), (2.2)

for all measurable sets S ⊂ R. Note that ∆pt here is an observed numerical value
which could have resulted from a trader action ut 6= π̂t (xt−1 , yt−1 , φt−1 ). As such, the
trader is capable of misleading the arbitrageur to distort his posterior distribution φt .
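The following Python sketch illustrates such a belief update in the Gaussian case, assuming the arbitrageur conjectures a linear trader policy of the form introduced later in Section 2.2.2. The conjugate (Kalman-style) update algebra is standard; the coefficient values a trader would actually face are determined in equilibrium and are not computed here.

```python
import numpy as np

# Sketch of the Bayesian update (2.2) when the arbitrageur assumes a linear
# trader policy u_t = ax*x_{t-1} + ay*y_{t-1} + amu*mu_{t-1}.  With a Gaussian
# prior N(mu, sigma**2) on x_{t-1}, the price change dp = lam*(u_t + v_t) + eps_t
# is a linear-Gaussian observation of x_{t-1}; the belief over x_t then follows
# from x_t = (1 + ax)*x_{t-1} + ay*y_{t-1} + amu*mu.  Coefficients are illustrative.
def belief_update(mu, sigma, y_prev, dp, v, ax, ay, amu, lam, sigma_eps):
    # innovation: the part of dp not explained by the belief mean
    predicted = lam * (ax * mu + ay * y_prev + amu * mu + v)
    innovation = dp - predicted
    # conjugate update of the belief about x_{t-1}
    obs_gain = lam * ax
    post_var = 1.0 / (1.0 / sigma**2 + obs_gain**2 / sigma_eps**2)
    post_mu = mu + post_var * obs_gain / sigma_eps**2 * innovation
    # push the belief through the assumed trader action to get the belief on x_t
    mu_t = (1 + ax) * post_mu + ay * y_prev + amu * mu
    sigma_t = abs(1 + ax) * np.sqrt(post_var)
    return mu_t, sigma_t
```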

2.1.5 Objectives
Assume that both the trader and arbitrageur are risk-neutral and seek to maximize
their expected profits (this assumption will be revisited in Section 2.5.2). Profit is
computed according to the change of book value, which is the sum of a player’s
cash position and asset position, valued at the prevailing market price. Hence, the
profits generated by the trader and arbitrageur between time t and time t + 1 are,
respectively,

pt+1 xt+1 − pt+1 ut+1 − pt xt = ∆pt+1 xt , and pt+1 yt+1 − pt+1 vt+1 − pt yt = ∆pt+1 yt .

If the trader uses policy π, and the arbitrageur uses policy ψ and assumes the
trader uses policy π̂, the trader expects profits

\[
U_t^{\pi,(\psi,\hat{\pi})}(x_t, y_t, \phi_t) \triangleq \mathrm{E}^{\pi,(\psi,\hat{\pi})}\left[\sum_{\tau=t}^{T-1} \Delta p_{\tau+1}\, x_\tau \,\middle|\, x_t, y_t, \phi_t\right],
\]

over times τ = t + 1, . . . , T . Here, the superscripts indicate that trades are executed
based on π and ψ, while beliefs are updated based on π̂. Similarly, the arbitrageur
expects profits

\[
V_t^{(\psi,\hat{\pi}),\pi}(y_t, \phi_t) \triangleq \mathrm{E}^{\pi,(\psi,\hat{\pi})}\left[\sum_{\tau=t}^{T} \Delta p_{\tau+1}\, y_\tau \,\middle|\, y_t, \phi_t\right],
\]

over times τ = t + 1, . . . , T + 1. Here, the conditioning in the expectation implicitly
assumes that xt is distributed according to φt .
Note that −U0^{π,(ψ,π̂)}(x0 , y0 , φ0 ) is the trader’s expected execution cost. For practical
choices of π, ψ, and π̂, we expect this quantity to be positive since the trader is
likely to sell his shares for less than the initial price. To compress notation, for any
π, ψ, and t, let Ut^{π,ψ} ≜ Ut^{π,(ψ,π)} and Vt^{ψ,π} ≜ Vt^{(ψ,π),π}.
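As a quick check of the book-value accounting above, the following SymPy snippet verifies the per-period profit identity for the trader; the identity for the arbitrageur is analogous.

```python
import sympy as sp

# With x_{t+1} = x_t + u_{t+1}, the change in book value (asset marked at p_{t+1},
# minus the cash paid p_{t+1}*u_{t+1}) reduces to dp_{t+1} * x_t, so the profit in
# each period depends only on the position carried into that period.
p_t, p_next, x_t, u_next = sp.symbols('p_t p_next x_t u_next')
x_next = x_t + u_next
profit = p_next * x_next - p_next * u_next - p_t * x_t
print(sp.simplify(profit - (p_next - p_t) * x_t))  # prints 0
```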

2.1.6 Equilibrium Concept

As a solution concept, we consider perfect Bayesian equilibrium [23]. This is a refine-


ment of Nash equilibrium that rules out implausible outcomes by requiring subgame
perfection and consistency with Bayesian belief updates. In particular, a policy π ∈ Π
is a best response to (ψ, π̂) ∈ Ψ × Π if

\[
U_t^{\pi,(\psi,\hat{\pi})}(x_t, y_t, \phi_t) = \max_{\pi' \in \Pi} U_t^{\pi',(\psi,\hat{\pi})}(x_t, y_t, \phi_t), \tag{2.3}
\]

for all t, xt , yt , and φt . Similarly, a policy ψ ∈ Ψ is a best response to π ∈ Π if


\[
V_t^{\psi,\pi}(y_t, \phi_t) = \max_{\psi' \in \Psi} V_t^{\psi',\pi}(y_t, \phi_t), \tag{2.4}
\]

for all t, yt , and φt . We define perfect Bayesian equilibrium, specialized to our context,
as follows:

Definition 2.1. A perfect Bayesian equilibrium (PBE) is a pair of policies (π ∗ , ψ ∗ )


∈ Π × Ψ such that (1) π ∗ is a best response to (ψ ∗ , π ∗ ) and (2) ψ ∗ is a best response
to π ∗ .

In a PBE, each player’s action at time t depends on positions xt and/or yt and the
belief distribution φt . These arguments, especially the distribution, make computation
and representation of a PBE challenging. We will settle for a more modest goal.
We compute policy actions only for cases where φt is Gaussian. When the initial
distribution φ0 is Gaussian and players employ these PBE policies, we require that
subsequent belief distributions φt determined by Bayes’ rule (2.2) also be Gaussian.
As such, computation of PBE policies over the restricted domain of Gaussian distri-
butions is sufficient to characterize equilibrium behavior given any initial conditions
involving a Gaussian prior. To formalize our approach, we now define a solution
concept.

Definition 2.2. A policy π ∈ Π (or ψ ∈ Ψ) is a Gaussian best response to (ψ, π̂) ∈


Ψ × Π (or π ∈ Π) if (2.3) (or (2.4)) holds for all t, xt , yt , and Gaussian φt . A
Gaussian perfect Bayesian equilibrium is a pair (π ∗ , ψ ∗ ) ∈ Π × Ψ of policies
such that (1) π ∗ is a Gaussian best response to (ψ ∗ , π ∗ ), (2) ψ ∗ is a Gaussian best
response to π ∗ , and if φ0 is Gaussian and the arbitrageur assumes the trader uses
π ∗ then, independent of the true actions of the trader, the beliefs φ1 , . . . , φT −1 are
Gaussian.

Note that when Gaussian PBE policies are used and the prior φ0 is Gaussian, the
system behavior is indistinguishable from that of a PBE since the policies produce
actions that concur with PBE policies at all states that are visited.
Given a belief distribution φt , define the quantities
µt ≜ E[xt | φt ], σt2 ≜ E[(xt − µt )2 | φt ], and ρt ≜ λσt /σε .

Since λ and σ are constants, ρt is simply a scaled version of the standard deviation σt .
The ratio λ/σ acts as a normalizing constant that accounts for the informativeness
of observations. The reason we consider this scaling is that it highlights certain
invariants across problem instances. In Section 2.4.2, we will interpret the value of ρ0
as the relative volume of the trader’s activity in the marketplace. For the moment,
it is sufficient to observe that if the distribution φt is Gaussian, it is characterized by
(µt , ρt ).
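As a concrete illustration of this scaling, the parameter values listed in the caption of Figure 2.5 give the following initial value ρ0 (its interpretation as relative volume is taken up in Section 2.4.2):

```python
# rho_t = lam * sigma_t / sigma_eps.  With the values from the caption of
# Figure 2.5 (lam = 1e-5, sigma_0 = 1e5, sigma_eps = 0.125):
lam, sigma0, sigma_eps = 1e-5, 1e5, 0.125
rho0 = lam * sigma0 / sigma_eps
print(rho0)  # 8.0
```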
2.2 Dynamic Programming Analysis


In this section, we develop abstract dynamic programming algorithms for computing
PBE and Gaussian PBE. We also discuss structural properties of associated value
functions. The dynamic programming recursion relies on the computation of equilib-
ria for single-stage games, and we also discuss the existence of such equilibria. The
algorithms of this section are not implementable, but their treatment motivates the
design of a practical algorithm that will be presented in the next section.

2.2.1 Stage-Wise Decomposition


The process of computing a PBE and the corresponding value functions can be decom-
posed into a series of single-stage equilibrium problems via a dynamic programming
backward recursion. We begin by defining some notation. For each πt , ψt , and ut ,
define a dynamic programming operator F_{u_t}^{(ψ_t, π̂_t)} by

\[
\big(F_{u_t}^{(\psi_t,\hat{\pi}_t)} U\big)(x_{t-1}, y_{t-1}, \phi_{t-1}) \triangleq \mathrm{E}_{u_t}^{(\psi_t,\hat{\pi}_t)}\big[\lambda(u_t + v_t)\,x_{t-1} + U(x_t, y_t, \phi_t) \,\big|\, x_{t-1}, y_{t-1}, \phi_{t-1}\big],
\]

for all functions U , where xt = xt−1 + ut , yt = yt−1 + vt , vt = ψt (yt−1 , φt−1 ), and φt


results from the Bayesian update (2.2) given that the arbitrageur assumes the trader
trades π̂t (xt−1 , yt−1 , φt−1 ) while the trader actually trades ut . Similarly, for each πt
and vt , define a dynamic programming operator G_{v_t}^{π_t} by

\[
\big(G_{v_t}^{\pi_t} V\big)(y_{t-1}, \phi_{t-1}) \triangleq \mathrm{E}_{v_t}^{\pi_t}\big[\lambda(u_t + v_t)\,y_{t-1} + V(y_t, \phi_t) \,\big|\, y_{t-1}, \phi_{t-1}\big],
\]

for all functions V , where yt = yt−1 + vt , ut = πt (xt−1 , yt−1 , φt−1 ), xt−1 is distributed
according to the belief φt−1 , and φt results from the Bayesian update (2.2) given that
the arbitrageur correctly assumes the trader trades ut .
Consider Algorithm 1 for computing a PBE. In Step 1, the algorithm begins
by initializing the terminal value functions UT∗ −1 and VT∗−1 . These terminal value
functions have a simple closed form in equilibrium. This is because, at time T , the
trader must liquidate his position, hence πT∗ (xT −1 , yT −1 , φT −1 ) = −xT −1 . Similarly,
the arbitrageur must liquidate his position over times T and T + 1. In equilibrium, he
Algorithm 1 PBE Solver
1: Initialize the terminal value functions U∗_{T−1} and V∗_{T−1} according to (2.5)–(2.6)
2: for t = T − 1, T − 2, . . . , 1 do
3:   Compute (π∗_t , ψ∗_t ) such that for all x_{t−1}, y_{t−1}, and φ_{t−1},
        π∗_t(x_{t−1}, y_{t−1}, φ_{t−1}) ∈ argmax_{u_t} (F_{u_t}^{(ψ∗_t, π∗_t)} U∗_t)(x_{t−1}, y_{t−1}, φ_{t−1})
        ψ∗_t(y_{t−1}, φ_{t−1}) ∈ argmax_{v_t} (G_{v_t}^{π∗_t} V∗_t)(y_{t−1}, φ_{t−1})
4:   Compute the value functions at the previous time step by setting, for all x_{t−1}, y_{t−1}, and φ_{t−1},
        U∗_{t−1}(x_{t−1}, y_{t−1}, φ_{t−1}) ← (F_{π∗_t}^{(ψ∗_t, π∗_t)} U∗_t)(x_{t−1}, y_{t−1}, φ_{t−1})
        V∗_{t−1}(y_{t−1}, φ_{t−1}) ← (G_{ψ∗_t}^{π∗_t} V∗_t)(y_{t−1}, φ_{t−1})
5: end for

will do so optimally, thus his value function takes the form

\[
V^*_{T-1}(y_{T-1}, \phi_{T-1}) = \max_{v_T} \mathrm{E}\big[\lambda(-x_{T-1} + v_T)\,y_{T-1} - \lambda(y_{T-1} + v_T)^2 \,\big|\, y_{T-1}, \phi_{T-1}\big] = -\lambda\big(\mu_{T-1} + \tfrac{3}{4}\, y_{T-1}\big)\, y_{T-1}, \tag{2.5}
\]

where the optimizing decision is ψ∗_T (yT−1 , φT−1 ) = −½ yT−1 . It is straightforward to derive the corresponding expression for the trader’s value function,

\[
U^*_{T-1}(x_{T-1}, y_{T-1}, \phi_{T-1}) = \mathrm{E}\big[\lambda\big(-x_{T-1} - \tfrac{1}{2}\, y_{T-1}\big)\,x_{T-1} \,\big|\, x_{T-1}, y_{T-1}, \phi_{T-1}\big] = -\lambda\big(x_{T-1} + \tfrac{1}{2}\, y_{T-1}\big)\, x_{T-1}. \tag{2.6}
\]
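The following SymPy snippet verifies the terminal-stage optimization behind (2.5): replacing xT−1 by its conditional mean µT−1, the first order condition in vT yields vT = −yT−1/2 and the quadratic value stated above.

```python
import sympy as sp

# Symbolic check of the terminal-stage problem behind (2.5): the arbitrageur
# chooses v to maximize lam*(-mu + v)*y - lam*(y + v)**2, where x_{T-1} has been
# replaced by its conditional mean mu.
lam, mu, y, v = sp.symbols('lam mu y v')
obj = lam * (-mu + v) * y - lam * (y + v)**2
v_star = sp.solve(sp.diff(obj, v), v)[0]
print(v_star)                              # -y/2
print(sp.expand(obj.subs(v, v_star)))      # -lam*mu*y - 3*lam*y**2/4
```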

At each time t < T , equilibrium policies must satisfy the best-response conditions
(2.3)–(2.4). Given the value functions Ut∗ and Vt∗ , these conditions decompose recur-
sively according to Step 3. Given such a pair (π∗_t , ψ∗_t ), the value functions U∗_{t−1} and
V∗_{t−1} for the prior time period are, in turn, computed in Step 4.
It is easy to see that, so long as Step 3 is carried out successfully each time it
is invoked, the algorithm produces a PBE (π∗ , ψ∗ ) along with value functions U∗_t =
U_t^{π∗,ψ∗} and V∗_t = V_t^{ψ∗,π∗}. However, the algorithm is not implementable. For starters,
the functions π∗_t , ψ∗_t , U∗_{t−1} , and V∗_{t−1} , which must be computed and stored, have infinite
domains.
2.2.2 Linear Policies


Consider the following class of policies:
Definition 2.3. A function πt is linear if there are coefficients a_{x,t}^{ρ_{t−1}}, a_{y,t}^{ρ_{t−1}} and a_{µ,t}^{ρ_{t−1}},
which are functions of ρt−1 , such that

\[
\pi_t(x_{t-1}, y_{t-1}, \phi_{t-1}) = a_{x,t}^{\rho_{t-1}}\, x_{t-1} + a_{y,t}^{\rho_{t-1}}\, y_{t-1} + a_{\mu,t}^{\rho_{t-1}}\, \mu_{t-1}, \tag{2.7}
\]

for all xt−1 , yt−1 , and φt−1 . Similarly, a function ψt is linear if there are coefficients
b_{y,t}^{ρ_{t−1}} and b_{µ,t}^{ρ_{t−1}}, which are functions of ρt−1 , such that

\[
\psi_t(y_{t-1}, \phi_{t-1}) = b_{y,t}^{\rho_{t-1}}\, y_{t-1} + b_{\mu,t}^{\rho_{t-1}}\, \mu_{t-1}, \tag{2.8}
\]

for all yt−1 and φt−1 . A policy is linear if the component functions associated with
times 1, . . . , T − 1 are linear.

By restricting attention to linear policies and Gaussian beliefs, we can apply an


algorithm similar to that presented in the previous section to compute a Gaussian
PBE. In particular, consider Algorithm 2. This algorithm aims to compute a single-
stage equilibrium that is linear. Further, actions and values are only computed and
stored for elements of the domain for which φt−1 is Gaussian. This is only viable if
the iterates Ut∗ and Vt∗ , which are computed only for Gaussian φt , provide sufficient
information for subsequent computations. This is indeed the case, as a consequence
of the following result.

Theorem 2.1. If the belief distribution φt−1 is Gaussian, and the arbitrageur assumes
that the trader’s policy π̂t is linear with π̂t (xt−1 , yt−1 , φt−1 ) = â_{x,t}^{ρ_{t−1}} xt−1 + â_{y,t}^{ρ_{t−1}} yt−1 +
â_{µ,t}^{ρ_{t−1}} µt−1 , then the belief distribution φt is also Gaussian. The mean µt is a linear
function of yt−1 , µt−1 , and the observed price change ∆pt , with coefficients that are
deterministic functions of the scaled variance ρt−1 . The scaled variance ρt evolves
according to

\[
\rho_t^2 = \big(1 + \hat{a}_{x,t}^{\rho_{t-1}}\big)^2 \left(\frac{1}{\rho_{t-1}^2} + \big(\hat{a}_{x,t}^{\rho_{t-1}}\big)^2\right)^{-1}. \tag{2.9}
\]
In particular, ρt is a deterministic function of ρt−1 .
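The following sketch iterates the recursion (2.9); since ρt depends only on ρt−1 and the policy coefficient, fixing ρ0 pins down the entire sequence. The coefficient values used here are illustrative, not equilibrium values.

```python
# Sketch of the scaled-variance recursion (2.9): rho_t is a deterministic
# function of rho_{t-1} and the assumed trader coefficient a_x,t.
def rho_next(rho_prev, ax):
    return abs(1 + ax) * (1.0 / rho_prev**2 + ax**2) ** -0.5

rho, ax_path = 8.0, [-0.1, -0.2, -0.3]   # illustrative values only
for ax in ax_path:
    rho = rho_next(rho, ax)
    print(rho)
```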
Algorithm 2 Linear-Gaussian PBE Solver
1: Initialize the terminal value functions U∗_{T−1} and V∗_{T−1} according to (2.5)–(2.6)
2: for t = T − 1, T − 2, . . . , 1 do
3:   Compute linear (π∗_t , ψ∗_t ) such that for all x_{t−1}, y_{t−1}, and Gaussian φ_{t−1},
        π∗_t(x_{t−1}, y_{t−1}, φ_{t−1}) ∈ argmax_{u_t} (F_{u_t}^{(ψ∗_t, π∗_t)} U∗_t)(x_{t−1}, y_{t−1}, φ_{t−1})
        ψ∗_t(y_{t−1}, φ_{t−1}) ∈ argmax_{v_t} (G_{v_t}^{π∗_t} V∗_t)(y_{t−1}, φ_{t−1})
4:   Compute the value functions at the previous time step by setting, for all x_{t−1}, y_{t−1}, and Gaussian φ_{t−1},
        U∗_{t−1}(x_{t−1}, y_{t−1}, φ_{t−1}) ← (F_{π∗_t}^{(ψ∗_t, π∗_t)} U∗_t)(x_{t−1}, y_{t−1}, φ_{t−1})
        V∗_{t−1}(y_{t−1}, φ_{t−1}) ← (G_{ψ∗_t}^{π∗_t} V∗_t)(y_{t−1}, φ_{t−1})
5: end for

It follows from this result that if π∗ is linear then, for Gaussian φt−1 , F_{u_t}^{(ψ∗_t, π∗_t)} U∗_t only
depends on values of U∗_t evaluated at Gaussian φt . Similarly, if π∗ is linear then, for
Gaussian φt−1 , G_{v_t}^{π∗_t} V∗_t only depends on values of V∗_t evaluated at Gaussian φt . It also
follows from this theorem that Algorithm 2, which only computes actions and values
for Gaussian beliefs, results in a Gaussian PBE (π ∗ , ψ ∗ ). We should mention, though,
that Algorithm 2 is still not implementable since the restricted domains of Ut∗ and
Vt∗ remain infinite.
Motivated by these observations, for the remainder of this chapter, we will focus
on computing equilibria of the following form:

Definition 2.4. A pair of policies (π ∗ , ψ ∗ ) ∈ Π × Ψ is a linear-Gaussian perfect


Bayesian equilibrium if it is a Gaussian PBE and each policy is linear.

2.2.3 Quadratic Value Functions


Closely associated with linear policies are the following class of value functions:

Definition 2.5. A function Ut is trader-quadratic-decomposable (TQD) if there
are coefficients c_{xx,t}^{ρt}, c_{yy,t}^{ρt}, c_{µµ,t}^{ρt}, c_{xy,t}^{ρt}, c_{xµ,t}^{ρt}, c_{yµ,t}^{ρt} and c_{0,t}^{ρt}, which are functions of ρt ,
such that

\[
U_t(x_t, y_t, \phi_t) = -\lambda\Big(\tfrac{1}{2} c_{xx,t}^{\rho_t} x_t^2 + \tfrac{1}{2} c_{yy,t}^{\rho_t} y_t^2 + \tfrac{1}{2} c_{\mu\mu,t}^{\rho_t} \mu_t^2 + c_{xy,t}^{\rho_t} x_t y_t + c_{x\mu,t}^{\rho_t} x_t \mu_t + c_{y\mu,t}^{\rho_t} y_t \mu_t - \frac{\sigma_\epsilon^2}{\lambda^2}\, c_{0,t}^{\rho_t}\Big), \tag{2.10}
\]

for all xt , yt , and φt . A function Vt is arbitrageur-quadratic-decomposable
(AQD) if there are coefficients d_{yy,t}^{ρt}, d_{µµ,t}^{ρt}, d_{yµ,t}^{ρt} and d_{0,t}^{ρt}, which are functions of ρt ,
such that

\[
V_t(y_t, \phi_t) = -\lambda\Big(\tfrac{1}{2} d_{yy,t}^{\rho_t} y_t^2 + \tfrac{1}{2} d_{\mu\mu,t}^{\rho_t} \mu_t^2 + d_{y\mu,t}^{\rho_t} y_t \mu_t - \frac{\sigma_\epsilon^2}{\lambda^2}\, d_{0,t}^{\rho_t}\Big), \tag{2.11}
\]

for all yt and φt .

In equilibrium, UT∗ −1 and VT∗−1 are given by Step 1 of Algorithm 2, and hence
are TQD/AQD. The following theorem captures how TQD and AQD structure are
preserved in the dynamic programming recursion given linear policies.

Theorem 2.2. If Ut∗ is TQD and Vt∗ is AQD, and Step 3 of Algorithm 2 produces a

linear pair (π∗_t , ψ∗_t ), then U∗_{t−1} and V∗_{t−1} , defined by Step 4 of Algorithm 2, are TQD
and AQD, respectively.

Hence, each pair of value functions generated by Algorithm 2 is TQD/AQD. A great


benefit of this property comes from the fact that, for a fixed value of ρt , each associated
value function can be encoded using just a few parameters.

2.2.4 Simplified Conditions for Equilibrium


Algorithm 2 relies for each t on existence of a pair (πt∗ , ψt∗ ) of linear functions that
satisfy single-stage equilibrium conditions. In general, this would require verifying
that each policy function is the Gaussian best response for all possible states. The
following theorem provides a much simpler set of conditions. In Section 2.3, we will
exploit these conditions in order to compute equilibrium policies.

Theorem 2.3. Suppose that Ut∗ and Vt∗ are TQD/AQD value functions specified by
(2.10)–(2.11), and (πt∗ , ψt∗ ) are linear policies specified by (2.7)–(2.8). Assume that,
for all ρt−1 , the policy coefficients satisfy the first order conditions
\[
0 = \big(\rho_t^2 c_{\mu\mu,t}^{\rho_t} + 2\rho_t c_{x\mu,t}^{\rho_t} + c_{xx,t}^{\rho_t}\big)\big(a_{x,t}^{\rho_{t-1}}\big)^3 + \big(3 c_{xx,t}^{\rho_t} + 3\rho_t c_{x\mu,t}^{\rho_t} - 1\big)\big(a_{x,t}^{\rho_{t-1}}\big)^2 + \big(3 c_{xx,t}^{\rho_t} + \rho_t c_{x\mu,t}^{\rho_t} - 2\big)\, a_{x,t}^{\rho_{t-1}} + c_{xx,t}^{\rho_t} - 1, \tag{2.12}
\]
  
\[
a_{y,t}^{\rho_{t-1}} = -\frac{\big(b_{y,t}^{\rho_{t-1}} + 1\big)\, c_{xy,t}^{\rho_t} + \alpha_t c_{y\mu,t}^{\rho_t}}{c_{xx,t}^{\rho_t} + (\alpha_t + 1)\, c_{x\mu,t}^{\rho_t} + \alpha_t c_{\mu\mu,t}^{\rho_t}}, \tag{2.13}
\]

   
\[
a_{\mu,t}^{\rho_{t-1}} = -\frac{a_{x,t}^{\rho_{t-1}}\, b_{\mu,t}^{\rho_{t-1}} \big(c_{xy,t}^{\rho_t} + \alpha_t c_{y\mu,t}^{\rho_t}\big) + \alpha_t \big(c_{x\mu,t}^{\rho_t} + \alpha_t c_{\mu\mu,t}^{\rho_t}\big)/\rho_{t-1}^2}{a_{x,t}^{\rho_{t-1}} \big(c_{xx,t}^{\rho_t} + (\alpha_t + 1)\, c_{x\mu,t}^{\rho_t} + \alpha_t c_{\mu\mu,t}^{\rho_t}\big)}, \tag{2.14}
\]

\[
b_{y,t}^{\rho_{t-1}} = \frac{1 - d_{y\mu,t}^{\rho_t}\, a_{y,t}^{\rho_{t-1}}}{d_{yy,t}^{\rho_t}} - 1, \qquad b_{\mu,t}^{\rho_{t-1}} = -\frac{\big(1 + a_{\mu,t}^{\rho_{t-1}} + a_{x,t}^{\rho_{t-1}}\big)\, d_{y\mu,t}^{\rho_t}}{d_{yy,t}^{\rho_t}}, \tag{2.15}
\]

and the second order conditions

\[
c_{xx,t}^{\rho_t} + (\alpha_t + 1)\, c_{x\mu,t}^{\rho_t} + \alpha_t c_{\mu\mu,t}^{\rho_t} > 0, \qquad d_{yy,t}^{\rho_t} > 0, \tag{2.16}
\]

where the quantities αt and ρt satisfy


 
\[
\alpha_t = \frac{a_{x,t}^{\rho_{t-1}}\big(1 + a_{x,t}^{\rho_{t-1}}\big)}{1/\rho_{t-1}^2 + \big(a_{x,t}^{\rho_{t-1}}\big)^2}, \qquad \rho_t^2 = \big(1 + a_{x,t}^{\rho_{t-1}}\big)^2 \left(\frac{1}{\rho_{t-1}^2} + \big(a_{x,t}^{\rho_{t-1}}\big)^2\right)^{-1}. \tag{2.17}
\]
Then, (πt∗ , ψt∗ ) satisfy the single-stage equilibrium conditions

\[
\pi_t^*(x_{t-1}, y_{t-1}, \phi_{t-1}) \in \operatorname*{argmax}_{u_t} \big(F_{u_t}^{(\psi_t^*,\pi_t^*)} U_t^*\big)(x_{t-1}, y_{t-1}, \phi_{t-1}),
\]
\[
\psi_t^*(y_{t-1}, \phi_{t-1}) \in \operatorname*{argmax}_{v_t} \big(G_{v_t}^{\pi_t^*} V_t^*\big)(y_{t-1}, \phi_{t-1}),
\]

for all xt−1 , yt−1 , and Gaussian φt−1 .

Note that, while this theorem provides sufficient conditions for linear policies
satisfying equilibrium conditions, it does not guarantee the existence or uniqueness
of such policies. These remain open issues. However, we support the plausibility of
existence through the following result on Gaussian best responses to linear policies.
It asserts that, if ψt and π̂t are linear, then there is a linear best-response πt for
the trader in the single-stage game. Similarly, if πt is linear then there is a linear
best-response ψt for the arbitrageur in the single-stage game.
Theorem 2.4. If Ut is TQD, ψt is linear, and π̂t is linear, then there exists a linear
πt such that
 
πt (xt−1 , yt−1 , φt−1 ) ∈ argmax Fu(ψt t ,π̂t ) Ut (xt−1 , yt−1 , φt−1 ),
ut

for all xt−1 , yt−1 , and Gaussian φt−1 , so long as the optimization problem is bounded.
Similarly, if Vt is AQD and πt is linear then there exists a linear ψt such that
 
ψt (yt−1 , φt−1 ) ∈ argmax Gπvtt Vt (yt−1 , φt−1 ),
vt

for all yt−1 and Gaussian φt−1 , so long as the optimization problem is bounded.

Based on these results, if the trader (arbitrageur) assumes that the arbitrageur
(trader) uses a linear policy then it suffices for the trader (arbitrageur) to restrict
himself to linear policies. Though not a proof of existence, this observation that the
set of linear policies is closed under the operation of best response motivates an aim
to compute linear-Gaussian PBE.

2.3 Algorithm
The previous section presented abstract algorithms and results that lay the ground-
work for the development of a practical algorithm which we will present in this section.
We begin by discussing a parsimonious representation of policies.

2.3.1 Representation of Policies


Algorithm 2 takes as input three values that parameterize our model: (λ, σ , T ). The
ρ ρ ρ ρ ρ
algorithm output can be encoded in terms of coefficients {ax,tt−1 , ay,tt−1 , aµ,t
t−1
, by,tt−1 , bµ,t
t−1
},
for every ρt−1 > 0 and each time step1 t = 1, . . . , T − 1. These coefficients param-
eterize linear-Gaussian PBE policies. Note that the output depends on λ and σ
only through ρt . Hence, given any λ and σ with the same ρt , the algorithm obtains
1
Recall, from the discussion in Section 2.2.1, that aρx,T = −1, aρy,T = aρµ,T = 0, bρy,T = −1/2,
bρµ,T +1 = bρµ,T = 0, and bρy,T +1 = −1, for all ρ > 0.
24 CHAPTER 2. PART I: STRATEGIC EXECUTION

the same coefficients. This means that the algorithm need only be executed once to
obtain solutions for all choices of λ and σ .
Now, for each t, the policy coefficients are deterministic functions of ρt−1 . For a
fixed value of ρt−1 , the coefficients can be stored as five numerical values. However, it
is not feasible to simultaneously store coefficients associated with all possible values
of ρt−1 . Fortunately, given a linear policy for the trader, Theorem 2.1 establishes ρt is
a deterministic function of ρt−1 . Thus, the initial value ρ0 determines all subsequent
values of ρt . It follows that, for a fixed value of ρ0 , over the relevant portion of its
domain, a linear-Gaussian PBE can be encoded in terms of 5(T −1) numerical values.
We will design an algorithm that aims to compute these 5(T − 1) parameters, which
we will denote by {ax,t , ay,t , aµ,t , by,t , bµ,t }, for t = 1, . . . , T − 1. These parameters
allow us to determine PBE actions at all visited states, so long as the initial value of
ρ0 is fixed.

2.3.2 Searching for Equilibrium Variances


The parameters {ax,t , ay,t , aµ,t , by,t , bµ,t } characterize linear-Gaussian PBE policies re-
stricted to the sequence ρ0 , . . . , ρT −1 generated in the linear-Gaussian PBE. We do
not know in advance what this sequence will be, and as such, we seek simultaneously
compute this sequence alongside the policy parameters.
One way to proceed, reminiscent of the bisection method employed by [34] and
[21] would be to conjecture a value for ρT −1 . Given a candidate value ρ̂T −1 , the
preceding values ρ̂T −2 , . . . , ρ̂0 , along with policy parameters for times T − 1, . . . , 1,
can be computed by sequentially solving the equations (2.12)–(2.17) for single-stage
equilibria. The resulting policies form a linear-Gaussian PBE, restricted to the se-
quence ρ̂0 , . . . , ρ̂T −1 that they would generate if ρ0 = ρ̂0 . One can then seek a value
of ρ̂T −1 such that the resulting ρ̂0 is indeed equal to ρ0 . This can be accomplished,
for example, via bisection search.
The bisection method can be numerically unstable, however. This is because, the
belief update equation (2.9) is used to sequentially compute the values ρ̂T −2 , . . . , ρ̂0
backwards in time. When the target value of ρ0 is very large, small changes in ρ̂T −1
can result in very large changes in ρ̂0 , making it difficult to match the precisely value
2.3. ALGORITHM 25

of ρ0 .

To avoid this numerical instability, consider Algorithm 3. This algorithm main-


tains a guess π̂ of the equilibrium policy of the trader, and, along with the initial value
ρ0 , this is used to generate the sequence ρ̂1 , . . . , ρ̂T −1 by applying the belief update
equation (2.9) forward in time. This sequence of values is then used in the single-stage
equilibrium conditions to solve for policies (π ∗ , ψ ∗ ). A sequence of values ρ̂1 , . . . , ρ̂T −1
is then computed forward in time using the policy π ∗ . If this sequence matches the
sequence generated by the guess π̂, then the algorithm has converged. Otherwise, the
algorithm is repeated with a new guess policy that is a convex combination of π̂ and
π ∗ . Since this algorithm only ever applies the belief equation (2.9) forward in time,
it does not suffer from the numerical instabilities of the bisection method.

Note that Step 6 of the algorithm treats ρt−1 as a free variable that is solved along-
side the policy parameters {ax,t , ay,t , aµ,t , by,t , bµ,t }. These variables are computed by

Algorithm 3 Linear-Gaussian PBE Solver with Variance Search


1: Initialize π̂ to an equipartitioning policy
2: for k = 1, 2, . . . do
3: Compute ρ̂1 , . . . , ρ̂T −1 according to the initial value ρ0 and the policy π̂ by (2.9)

4: Initialize the terminal value functions UT∗ −1 and VT∗−1 according to (2.5)–(2.6)
5: for t = T − 1, T − 2, . . . , 1 do
6: Compute single-stage equilibrium (πt∗ , ψt∗ ) and ρt−1 according to (2.12)–(2.17),
with ρt = ρ̂t
∗ ∗
7: Compute the value functions Ut−1 and Vt−1 at the previous time step given
∗ ∗
(πt , ψt )
8: end for
9: Compute ρ̃1 , . . . , ρ̃T −1 according to the initial value ρ0 and the policy π ∗ by (2.9)

10: if ρ̂ = ρ̃ then
11: return
12: else
13: Set π̂ ← γk π̂ + (1 − γk )π ∗ , where γk ∈ [0, 1) is a step-size
14: end if
15: end for
26 CHAPTER 2. PART I: STRATEGIC EXECUTION

simultaneously solving the system of equations (2.12)–(2.17) for single-stage equilib-


rium. To be precise, ax,t is obtained by solving the cubic polynomial equation (2.12)
numerically. Given a value for ax,t , the remaining parameters {ay,t , aµ,t , by,t , bµ,t } can
be obtained by solving the linear system of equations (2.13)–(2.15), while ρt−1 is ob-
tained through (2.17) . It can then be verified that the second order condition (2.16)
holds. Algorithm 3 is implementable and we use it in computational studies presented
in the next section.

2.4 Computational Results


In this section, we present computational results generated using Algorithm 3. In
Section 2.4.1, we introduce some alternative, intuitive policies which will serve as a
basis of comparison to the linear-Gaussian PBE policy. In Section 2.4.2, we discuss
the importance of the parameter ρ0 , λσ0 /σ in the qualitative behavior of the Gaus-
sian PBE policy and interpret ρ0 as a measure of the “relative volume” of the trader’s
activity in the marketplace. In Section 2.4.3, we discuss the relative performance of
the policies from the perspective of the execution cost of the trader. Here, we demon-
strate experimentally that the Gaussian PBE policy can offer substantial benefits. In
Section 2.4.4, we examine the signaling that occurs through price movements. Finally,
in Section 2.4.5, we highlight the fact that the PBE policy is adaptive and dynamic,
and seeks to exploit exogenous market fluctuations in order to minimize execution
costs.

2.4.1 Alternative Policies

In order to understand the behavior of linear-Gaussian PBE policies, we first define


two alternative policies for the trader for the purpose of comparison. In the absence of
an arbitrageur, it is optimal for the trader to minimize execution costs by partitioning
his position into T equally sized blocks and liquidating them sequentially over the
T time periods, as established by [11]. We refer to the resulting policy π EQ as an
equipartitioning policy. For all t, xt−1 , yt−1 , and φt−1 , it is defined by
2.4. COMPUTATIONAL RESULTS 27

1
πtEQ (xt−1 , yt−1 , φt−1 ) , − xt−1 .
T −t+1

Alternatively, the trader may wish to liquidate his position in a way so as to reveal
as little information as possible to the arbitrageur. Trading during the final two time
periods T − 1 and T does not reveal information to the arbitrageur in a fashion that
can be exploited. This is because, as discussed in Section 2.2.1, the arbitrageur’s
optimal trades at time T and T + 1 are vT = −yT −1 /2 and vT +1 = −yT , respectively,
and these are independent of any belief of the arbitrageur with respect to the trader’s
position. Given that the trader is free to trade over these two time periods without
any information leakage, it is natural to minimize execution cost by equipartitioning
over these two time periods. Hence, define the minimum revelation policy π MR to
be a policy that liquidates the trader’s position evenly across only the last two time
periods. That is, for all t, xt−1 , yt−1 , and φt−1 ,




0 if t < T − 1,


πtMR (xt−1 , yt−1 , φt−1 ) , 1
− 2 xt−1 if t = T − 1,




−xt−1 if t = T .

2.4.2 Relative Volume


Observed in Section 2.3.1, linear-Gaussian PBE policies are determined as a func-
tion of the composite parameter ρ0 , λσ0 /σ . In order to interpret this parameter,
consider the dynamics of price changes, ∆pt = λ(ut + vt ) + t , where t ∼ N (0, σ2 ).
Here, t is interpreted as the exogenous, random component of price changes. Alter-
natively, one can imagine the random component of price changes is arising from the
price impact of “noise traders”. Denote by zt the total order flow from noise traders
at time t, and consider a model where ∆pt = λ(ut + vt + zt ), with zt ∼ N (0, σz2 ). If
σ = λσz , these two models are equivalent. In that case, ρ0 , λσ0 /σ = σ0 /σz . In
other words, ρ0 can be interpreted as the ratio of the uncertainty of the total volume
of the trader’s activity to the per period volume of noise trading. As such, we refer
to ρ0 as the relative volume.
28 CHAPTER 2. PART I: STRATEGIC EXECUTION

We shall see in the following sections that, qualitatively, the performance and
behavior of Gaussian PBE policies are determined by the magnitude of ρ0 . In the
high relative volume regime, when ρ0 is large, either the initial position uncertainty
σ0 is very large or the volatility σz of the noise traders is very small. In these cases,
from the perspective of the arbitrageur, the trader’s activity contributes a significant
informative signal which can be decoded in the context of less significant exogenous
random noise. Hence, the trader’s activity early in the time horizon reveals significant
information which can be exploited by the arbitrageur. Thus, it may be better for
the trader to defer his liquidation until the end of the time horizon.
Alternatively, in the low relative regime, when ρ0 is small, the arbitrageur cannot
effectively distinguish the activity of the trader from the noise traders in the market.
Hence, the trader is free to distribute his trades across the time horizon so as to
minimize price impact, without fear of front-running by the arbitrageur.

2.4.3 Policy Performance


Consider a pair of policies (π, ψ), and assume that the arbitrageur begins with a
position y0 = 0 and an initial belief φ0 = N (0, σ02 ). Given an initial position x0 ,
the trader’s expected profit is U0π,ψ (x0 , 0, φ0 ). One might imagine, however, that the
initial position x0 represents one of many different trials where the trader liquidates
positions. It makes sense for this distribution of x0 over trials to be consistent with
the arbitrageur’s belief φ0 , since this belief could be based on past trials. Given this
distribution, averaging over trials results in expected profit E[U0π,ψ (x0 , 0, φ0 ) | φ0 ].
Alternatively, if the trader liquidates his entire position immediately, the expected
profit becomes E[−λx20 | φ0 ] = −λσ02 . We define the trader’s normalized expected profit
Ū (π, ψ) to be the ratio of these two quantities. When the trader’s value function is
TQD, this takes the form
h i
E U0π,ψ (x0 , 0, φ0 ) φ0 1 0 1 0
Ū (π, ψ) , = − cρxx,0 + 2 cρ0,0 ,
λσ02 2 ρ0

where cρxx,0
0
and cρ0,0
0
are the trader’s appropriate value function coefficients at time
t = 0.
2.4. COMPUTATIONAL RESULTS 29

Analogously, the arbitrageur’s normalized expected profit V̄ (π, ψ) is defined to be


the expected profit of the arbitrageur normalized by the expected immediate liqui-
dating cost of the trader. When the arbitrageur’s value function is AQD, this takes
the form h i
E V0ψ,π (x0 , 0, φ0 ) φ0 1 ρ0
V̄ (π, ψ) , = d ,
λσ02 ρ20 0,0
Now, let (π ∗ , ψ ∗ ) denote a linear-Gaussian PBE. Since the corresponding value
functions are TQD/AQD, the normalized expected profits depend on the parameters
{σ0 , λ, σ } only through the relative volume parameter ρ0 , λσ0 /σ .
Similarly, given the equipartitioning policy π EQ , define ψ EQ to be the optimal
response of the arbitrageur to the trader’s policy π EQ . This best response policy can
be computed by solving the linear-quadratic control problem corresponding to (2.4),
via dynamic programming. The policy takes the form

 −1 (T −t)(T −t+3)

y
T +2−t t−1
− µ
2(T +1−t)(T +2−t) t−1
if 1 ≤ t ≤ T ,
ψtEQ (yt−1 , µt−1 ) =
−y otherwise.
T

Using a similar argument as above, it is easy to see that Ū (π EQ , ψ EQ ) and V̄ (π EQ , ψ EQ )


are also functions of the parameter ρ0 .
Finally, given the minimum revelation policy π MR , define ψ MR to be the optimal
response of the arbitrageur to the trader’s policy π MR . It can be shown that, when
y0 = 0 and µ0 = 0, the best response of the arbitrageur to the minimum revelation
policy is to do nothing – since no information is revealed by the trader in a useful
fashion, there is no opportunity to front-run. Hence,
h i
E − 21 λx20 − 14 λx20 φ0 3
Ū (π MR , ψ MR ) = =− , V̄ (π MR , ψ MR ) = 0.
λσ02 4

In Figure 2.1, the normalized expected profits of various policies are plotted as
functions of the relative volume ρ0 , for a time horizon T = 20. In all scenarios, as one
might expect, the trader’s profit is negative while the arbitrageur’s profit is positive.
In all cases, the trader’s profit under the Gaussian PBE policy dominates that under
either the equipartitioning policy or the minimum revelation policy. This difference
30 CHAPTER 2. PART I: STRATEGIC EXECUTION

1.0

0.0 V̄ (π MR , ψ MR )

Ū (π EQ , ψ 0 )
V̄ (π EQ , ψ EQ ) Ū (π MR , ψ MR )
−1.0 V̄ (π ∗ , ψ ∗ )

V̄ (π VT , ψ V T )

Ū (π EQ , ψ EQ )

Ū (π ∗ , ψ ∗ )
−2.0 Ū (π VT , ψ V T )

10−2 10−1 100 101 102 103


ρ0
Figure 2.1: The normalized expected profit of trading strategies for the time horizon
T = 20.

is significant in moderate to high relative volume regimes.


In the high relative volume regime, the equipartitioning policy fares particularly
badly from the perspective of the trader, performing up to a factor of 2 worse than
the Gaussian PBE policy. This effect becomes more pronounced over longer time
horizons. The minimum revelation policy performs about as well as the PBE policy.
Asymptotically as ρ0 ↑ ∞, these policies offer equivalent performance in the sense
that Ū (π ∗ , ψ ∗ ) ↑ Ū (π MR , ψ MR ) = 3/4.
On the other hand, in the low relative volume regime, the equipartitioning policy
and the PBE policy perform comparably. Indeed, define ψ 0 by ψt0 , 0 for all t (that
is, no trading by the arbitrageur). In the absence of the arbitrageur, equipartitioning
is the optimal policy for the trader, and backward recursion can be used to show that

T +1 1
Ū (π EQ , ψ 0 ) = ≈ .
2T 2

Asymptotically as ρ0 ↓ 0, Ū (π EQ , ψ EQ ) ↓ Ū (π EQ , ψ 0 ) and Ū (π ∗ , ψ ∗ ) ↓ Ū (π EQ , ψ 0 ).
2.4. COMPUTATIONAL RESULTS 31

Thus, when the relative volume is low, the effect of the arbitrageur becomes negligible
when ρ0 is sufficiently small.
From the perspective of the arbitrageur in equilibrium, V̄ (π ∗ , ψ ∗ ) → 0 as ρ0 → 0
or ρ0 → ∞. In the low relative volume regime, the arbitrageur cannot distinguish the
past activity of the trader from noise, and hence is not able to profitably predict and
exploit the trader’s future activity. In the high relative volume regime, as we shall
see in Section 2.4.5, the trader conceals his position from the arbitrageur by deferring
trading until the end of the horizon. Here, as with the minimum revelation policy,
the arbitrageur is not able to profitably exploit the trader. Since the arbitrageur can
choose not to trade at each period, his best response to any trading strategy should
lead to non-negative expected profit. In light of these observations, we can easily
infer that in equilibrium the arbitrageur’s profit curve should have at least one local
maximum.
Both the equipartitioning and minimum revelation policies trade at a constant
rate, but over different, extremal time intervals: the equipartitioning policy uses the
entire time horizon, while the minimum revelation policy uses only the last two time
periods. A fairer benchmark policy might consider optimizing the choice of time
interval. Define the variable time policy π VT as follows: given the value ρ0 , select
the τ such that trading at a constant rate ut = − xτ0 over the last τ time periods
results in the highest expected profit for the trader, assuming that the arbitrageur
uses a best response policy. Define ψ VT to be the best response of the arbitrageur to
π VT . The variable time policy partially accounts for the presence of the arbitrageur,
and the expected profit with the variable time strategy will always be better that of
equipartitioning or minimum revelation. This is demonstrated by the Ū (π VT , ψ VT )
curve in Figure 2.1. However, the trader still fares better with an equilibrium policy,
particularly in the intermediate relative volume range, where the difference is close
to 20%.2
Examining Figure 2.1, it is clear that, in equilibrium, the sum of the normalized
profits of the trader and the arbitrageur is negative, and the magnitude of sum is
larger than the magnitude of the loss incurred by the trader in the absence of the
2
In practice, improvements of as low as 0.01% are considered significant.
32 CHAPTER 2. PART I: STRATEGIC EXECUTION

arbitrageur. Define the spill-over to be the quantity


 
Ū (π EQ , ψ 0 ) − Ū (π ∗ , ψ ∗ ) + V̄ (π ∗ , ψ ∗ ) .

This is the difference between the normalized expected profit of the trader in the
absence of the arbitrageur, under the optimal equipartitioning policy, and the com-
bined normalized expected profits of the trader and arbitrageur in equilibrium. The
spill-over measures the benefit of the arbitrageur’s presence to the other participants
of the system. Note that this benefit is positive, and it is most significant in the high
relative volume regime.

0.3

0.2  
Ū (π EQ , ψ 0 ) − Ū (π ∗ , ψ ∗ ) + V̄ (π ∗ , ψ ∗ )
0.1

0.0

−0.1
10−2 10−1 100 101 102 103
ρ0
Figure 2.2: The spill-over of the system for the time horizon T = 20.

2.4.4 Signaling
An important aspect of the linear-Gaussian PBE policy is that it accounts for in-
formation conveyed through price movements. In order to understand this feature,
define the relative uncertainty to be the standard deviation of the arbitrageur’s belief
about the trader’s position at time t, relative to that of the belief at time 0; i.e., the
ratio σt /σ0 . By considering the evolution of relative uncertainty over time for the
Gaussian PBE policy versus the equipartitioning and minimum revelation policies,
we can study the comparative signaling behavior.
Under any linear policy, the evolution of the relative uncertainty σt /σ0 over time
is deterministic and depends only on the parameter ρ0 . This is because of the fact
that σt /σ0 = ρt /ρ0 and the results in Section 2.3.1. In Figure 2.3, the evolution of
2.4. COMPUTATIONAL RESULTS 33

1.0
π MR , ρ0 = 1
π MR , ρ0 = 10
π MR , ρ0 = 100
π ∗ , ρ0 = 1
σt
π ∗ , ρ0 = 10
σ0 0.5
π ∗ , ρ0 = 100
π EQ , ρ0 = 1
π EQ , ρ0 = 10
π EQ , ρ0 = 100
0.0
0 5 10 15 20
t
Figure 2.3: The evolution of relative uncertainty of the trader’s position for the time
horizon T = 20.

the relative uncertainty of the PBE policy is illustrated, for different values of ρ0 , as
compared to the equipartitioning and the minimum revelation policies. In the low
relative volume regime, the relative uncertainty of the PBE policy evolves similarly
to that of the equipartitioning policy. In the high relative volume regime, very little
information is revealed until close to the end of trading period under the PBE policy.
Indeed, the relative uncertainty between the equilibrium and the minimum revelation
policies are indistinguishable on the scale of Figure 2.3, when ρ0 = 10 or ρ0 = 100.
These observations are consistent with our results from Section 2.4.3.

2.4.5 Adaptive Trading

One important feature of the linear-Gaussian PBE policy is that it is adaptive in


the sense that the trades executed are random quantities that are dependent on the
exogenous, stochastic fluctuations of the market. This is in contrast to the policies
developed in most of the optimal execution literature. For example, the baseline
equipartitioning policy of [11] specifies a deterministic sequence of trades. Static
policies have also been derived under more complicated models [e.g., 6, 31, 39, 4].
However, this behavior is in contrast to what is observed amongst institutional traders
34 CHAPTER 2. PART I: STRATEGIC EXECUTION

and trading algorithms that are implemented by practitioners. One justification for
adaptive, price-responsive trading strategies is risk aversion. It has been observed
that optimal policies for certain risk averse objectives require dynamic trading [29, 7].
Our model provides another justification: in the presence of asymmetric information
and a strategic adversary, a trader should seek to exploit price fluctuations so as
disguise trading activity. In order to understand the behavior of linear policies, it is
helpful to decompose them into deterministic and stochastic components. Suppose
that (π, ψ) are a pair of linear policies, and that y0 = µ0 = 0. Given Definition 2.3
and Theorem 2.1, it is easy to see that, for each 1 ≤ t ≤ T , there exist vectors
α,t , β,t , γ,t ∈ Rt and scalars αx0 ,t , βx0 ,t , γx0 ,t ∈ R, each of which depends on the
parameters {σ0 , λ, σ } only through the ρ0 , such that
1 > t 1 > t 1 > t
xt = αx0 ,t x0 + α,t , yt = βx0 ,t x0 + β,t , µt = γx0 ,t x0 + γ,t . (2.18)
λ λ λ

Here, t = (1 , . . . , t ) is the vector of exogenous disturbances up to time t. The first


terms in (2.18) represent deterministic components of the policy and the second terms
represent zero-mean stochastic components that depend on market price fluctuations.
For the equipartitioning and minimum revelation policies, the stochastic components
are zero. On the other hand, the Gaussian PBE policy does have non-zero stochastic
components.

Figure 2.4 shows the deterministic component of the linear-Gaussian PBE versus
those of the equipartitioning and minimum revelation policies. As ρ0 → 0, the trader
ignores the presence of the arbitrageur and the PBE policy approaches the equipar-
titioning policy. At the other extreme, as ρ0 → ∞, in equilibrium the trader seeks to
conceal his activity as much as possible, and hence the PBE policy approaches the
minimum revelation policy.

Figure 2.5 illustrates sample paths of the trader’s position under the linear-
Gaussian PBE policy. Along each path, the trader deviates from the deterministic
schedule based on the random fluctuations of the market and how they influence the
arbitrageur’s beliefs. In general, if the arbitrageur’s estimate of the trader’s position
becomes more accurate, the trader accelerates his selling to avoid front-running. On
2.4. COMPUTATIONAL RESULTS 35

1.0

π MR
π ∗ , ρ0 = 1
αx0 ,t 0.5 π ∗ , ρ0 = 10
π ∗ , ρ0 = 100
π EQ

0.0
0 5 10 15 20
t
Figure 2.4: The deterministic components of trading strategies for the time horizon
T = 20.

the other hand, if the arbitrageur is misled as to the trader’s position, the trader
delays his selling relative to deterministic schedule.

2.4.6 Does an Arbitrageur Benefit the Market?

An important question regarding the arbitrageur is whether or not his strategic activ-
ity against the trader benefits the market as a whole so that it should be encouraged
by market operators. One way of measuring the arbitrageur’s impact on the market
could be to compare social welfare, which is defined as the sum of utility functions of
all market participants, in a market with the arbitrageur to social welfare in a market
without the arbitrageur. In our model, we have implicitly assumed that there are
other market participants that consume the net order flow from the trader and arbi-
trageur, that is, liquidity suppliers who clear the market. One difficulty with applying
this social welfare approach to our model is that these liquidity suppliers do not have
a well-defined utility function and thus do not act strategically by maximizing their
utility function. For this reason, the spill-over discussed in Section 2.4.3 is not a good
measure for the value of the arbitrageur to the market. Since social welfare is not
well defined in our model, we will take an alternative approach to this issue in terms
36 CHAPTER 2. PART I: STRATEGIC EXECUTION

1.0
xt /x0
0.5 αx0 ,t

0.0 µt /x0

0 5 10 15 20
t
1.0
xt /x0
0.5 αx0 ,t

0.0 µt /x0

0 5 10 15 20
t
1.0
xt /x0
0.5 αx0 ,t

0.0 µt /x0

0 5 10 15 20
t
Figure 2.5: Sample paths of the evolution of the trader’s actual and expected positions,
and the arbitrageur’s mean belief, when T = 20, x0 = σ0 = 105 , µ0 = y0 = 0,
σ = 0.125, λ = 10−5 .

of price informativeness and volatility.


On one hand, the arbitrageur encourages informed traders to make the price
informative more rapidly in the high relative volume regime. By informed traders,
we mean those who trade due to arrival of new information about the fundamental
value of the stock. In the absence of the arbitrageur, this new information tends
to be gradually incorporated into the price through a sequence of orders from these
informed traders, and they carefully control the rate of revealing new information
so as to maximize their profit out of informational advantage [34]. As discussed
in Section 2.4.5, however, the trader’s PBE policy requires executing most of his
position at the end of the horizon in response to the arbitrageur’s front-running. In
2.4. COMPUTATIONAL RESULTS 37

fact, concentration of trading volume is a key feature of the trader’s PBE policy in the
high relative volume regime, and the fact that the trader’s execution is concentrated
at the end of the horizon is just an artifact of our modeling assumption xT = 0.
Indeed, it is easy to see that the trader can achieve almost the same performance
as using the PBE policy if he trades his entire position over the first few periods
and does nothing afterwards. Therefore, the presence of the arbitrageur could induce
concentration of informed traders’ trading volume at early periods that expedites
revelation of new information into the price.
On the other hand, the arbitrageur encourages liquidity traders to generate greater
volatility of the stock price without delivering any information about the fundamental
value of the stock. By liquidity traders, we mean those who liquidate their position
for some liquidity reasons, not based on information about the fundamental value of
the stock. To be more precise on this notion, we first define the volatility of the stock
price as follows: v "
T
u #
tE 1
u
Varπt [∆p
X
Vol(π) , vt t |yt−1 , φt−1 ]
T t=1

This quantity corresponds to expected time-average of the variance of the per-period


price increment under the trader’s policy π from the arbitrageur’s perspective. It can
be easily shown that for any linear policy π
  2 
ρ
Varπvtt [∆pt |yt−1 , φt−1 ] = σ2 1 + ρ2t−1 ax,tt−1

and therefore v
u T
u 1X 
ρ
2
Vol(π) = σ t1 + ρ2t−1 ax,tt−1
T t=1
ρ
which can readily be computed using σ , ρ0 and {ax,tt−1 }Tt=1 . It is worth noting that
ρ ρ
there is a trade-off between |ax,tt−1 | and ρt : The larger |ax,tt−1 |, the smaller ρt because
the trader’s aggressive trading accelerates revelation of private information about his
ρ
position. If |ax,tt−1 | is small, ρt does not decrease much since the trader’s passive trading
helps conceal the private information. We will compare the volatility Vol(π ∗ ) in a
market with the arbitrageur where PBE policies are used, and the volatility Vol(π EQ )
in a market without the arbitrageur where the trader uses the equipartitioning policy.
38 CHAPTER 2. PART I: STRATEGIC EXECUTION

101

Vol(π EQ , ψ 0 )
Vol(π ∗ , ψ ∗ )
100

10−1

10−2
10−2 10−1 100 101 102 103
ρ0
Figure 2.6: The volatility of PBE policy and the equipartitioning policy for the time
horizon T = 20 when σ = 10−2 .

Figure 2.6 shows the volatility curves of PBE policy and the equipartitioning
policy over a broad range of relative volume ρ0 , respectively. It is clear that the
volatility in a market without the arbitrageur is at least as low as that in a market
with the arbitrageur for all values of ρ0 , and the difference is significant in the high
relative volume regime. The reason is that the trader’s PBE policy guides the trader
to concentrate his volume near the end of the horizon, and the concentrated trading
volume increases volatility substantially. Thus, we may conclude that the presence of
the arbitrageur could result in rise of volatility with no informational contents that
could be unfavorable to large risk-averse long-term investors outside our model.

2.5 Extensions
In this section, we revisit some of the assumptions in the problem formulation of
Section 2.1. At a high level, the main feature of our model that enables tractability is
that, in equilibrium, each agent solves a linear-quadratic Gaussian control problem.
This requires that the evolution of the model over time be described by a linear
system and that the objectives of the trader and arbitrageur be quadratic functions
that decompose additively over time. As we shall see shortly, there are a number
of extensions of the model one may consider, incorporating important phenomena
2.5. EXTENSIONS 39

such as risk aversion and transient price impact, that maintain this structure. Such
extensions remain tractable and can be addressed using straightforward adaptations
of the techniques we have developed.

2.5.1 Time Horizon

Our model assumes that the trader begins his liquidation at time 1 and completes
it by time T , and that this time interval is common knowledge. In some instances,
public knowledge of the beginning and end of the liquidation interval might be rea-
sonable since, for example, this interval will often correspond to a single trading day.
More generally, however, it may be desirable to impose uncertainty on the part of
arbitrageur as to the beginning and end of the liquidation. Unfortunately, it is not
clear how to allow for this in a tractable fashion in our current framework.
The model further assumes that the arbitrageur must liquidate his position by
time T + 1. Then, the value function of the arbitrageur at time T with position yT ,
is given by VT∗ (yT ) = −λyT2 . This was used in (2.5)–(2.6) to determine the value
functions UT∗ −1 and VT∗−1 , which form the base case of the backward induction. This
assumption can easily be relaxed. For example, suppose that the arbitrageur has
Ta additional trading periods. It is easy to see that, after time T , the arbitrageur
will optimally equipartition over the remaining Ta periods. Therefore the value of a
position yT at time T will take the form VT∗ (yT ) = −λ T2T
a +1 2
a
yT , following the analysis
in [11]. So long as VT∗ is a quadratic function, our discussion in Sections 2.2 and 2.3
carries through, with a different choice of terminal value functions.

2.5.2 Risk Aversion

Our model assumes that both the trader and arbitrageur are risk-neutral. One way to
account for risk aversion is to follow the approach suggested by [29]. In particular, we
could assume that, for example, the trader seeks to optimize the objective function
" T −1  #
η 2 
∆pτ +1 xτ − ∆pτ +1 xτ − E[∆pτ +1 xτ | xτ , yτ , φτ ] − ζx2τ
X
E x0 , y0 , φ0 ,
τ =0 2
40 CHAPTER 2. PART I: STRATEGIC EXECUTION

The second term in the sum penalizes for variance in revenue in each time period, with
η ≥ 0 capturing the degree of risk aversion. This final term represents a per stage
holding cost, with the parameter ζ ≥ 0 expressing the degree to which the trader
would prefer to execute sooner rather than later. The risk-neutral case previously
considered corresponds to the choice of η = ζ = 0. For any nonnegative parameter
choices, the objective remains a time separable positive definite quadratic function.
Hence, the methods of Sections 2.2 and 2.3 can be suitably adapted.

2.5.3 Price Impact and Price Dynamics


Our model assumes permanent and linear price impact. Empirically, it has been
observed that transient price impact is a significant component of price dynamics,
and it is important to account for this in the design of execution strategies.
Also, our model assumes a price impact coefficient that is constant. Empirically,
the non-stationarity of price impact may be a significant phenomenon, varying, for
example, according to the time-of-day. Further, one theoretical justification for a
permanent, linear price impact is the model of [34]. For that model, however, the
price sensitivity is time dependent.
More generally, our analysis applies when there is some collection of state variables
(for example, {xt , yt , µt }) that evolve as a linear dynamical system with Gaussian
disturbances, and where changes in price are linear in the state variables. In order
to incorporate transient price impact and non-stationarity, assume that prices evolve
according to

t t
αt−τ γτ (uτ + vτ + zτ ) .
X X
pt = p0 + λτ (uτ + vτ + zτ ) +
τ =1 τ =1 (2.19)
| {z } | {z }
permanent price impact transient price impact

Here, uτ and vτ are the trades of the trader and arbitrageur, respectively, as time τ .
In place of the exogenous noise term in the original price dynamics (2.1), zτ is an IID
N (0, σz2 ) random variable representing the quantity of noise trades at time τ . The
second term in (2.19) captures permanent, linear price impact that is non-stationary
with sensitivity λτ ≥ 0 at time τ . The final term represents transient, linear price
2.5. EXTENSIONS 41

impact that is non-stationary with sensitivity γτ ≥ 0 at time τ and recovery rate


α ∈ [0, 1).
These price dynamics can be rewritten as

pt = pt−1 + (λt + γt )(ut + vt + zt ) − (1 − α)st−1 ,

where st is defined to be geometrically weighted total order flow

t
αt−τ γτ (uτ + vτ + zτ ) = αst−1 + γt (ut + vt + zt ).
X
st ,
τ =0

Now, suppose that the trader’s decision ut is a linear function of {xt−1 , yt−1 , µt−1 , st−1 },
and the arbitrageur’s decision vt is a linear function of {yt−1 , µt−1 , st−1 }. Then, it will
be the case that {xt , yt , µt , st } evolve as a linear dynamical system, and that the price
changes are linear in these state variables. Therefore, the analysis in Sections 2.2
and 2.3 can be suitably modified and repeated, with an augmented state space. Note
that, since st is a function only of the total quantities traded at times up to t, it is
reasonable to assume that this is public knowledge known to both the trader and
arbitrageur.
Other aspects of more complicated price dynamics can also be incorporated via
such state augmentation. For example, one may consider linear factor models or
otherwise add exogenous explanatory variables to the evolution of prices, so long as
the dependencies are linear. Similarly, models that incorporate drift in the price
process, such as short term momentum or mean reversion, can be considered.
Chapter 3

Part II: Adaptive Execution –


Exploration and Learning of Price
Impact

3.1 Problem Formulation


In the Part II of this thesis, we consider extended price dynamics of a single security
that include transient price impact and a linear factor model with observable return-
predictive factors as well as permanent price impact. Another difference from the
Part I is that there is no arbitrageur in the market any more.

3.1.1 Model Description


Decision Variable and Security Position We consider a trader who trades a
single security over an infinite time horizon. She submits a market buy or sell order
at the beginning of each period of equal length. ut ∈ R represents the number of
shares of the security to buy or sell at period t and a positive (negative) value of
ut denotes a buy (sell) order. Let xt−1 ∈ R denote the trader’s pre-trade security
position before placing an order ut at period t as in the Part I. Therefore, xt evolves
over time according to xt = xt−1 + ut for all t ≥ 1.

42
3.1. PROBLEM FORMULATION 43

Price Dynamics The absolute return of the security is given by


M
∆pt = pt − pt−1 = g > ft−1 + λ∗ ut + ∗
X
γm (dm,t − dm,t−1 ) + t
m=1
t
t−i
dt , [d1,t · · · dM,t ]> .
X
dm,t , rm ui = rm dm,t−1 + ut , (3.1)
i=1

We will explain each term in detail as we progress. This can be viewed as a first-order
Taylor expansion of a geometric model
M
!
pt
= g̃ > ft−1 + λ̃∗ ut + ∗
X
log γ̃m (dm,t − dm,t−1 ) + ˜t
pt−1 m=1

over a certain period of time, say, a few weeks in calendar time, which makes this
approximation reasonably accurate for practical purposes. Although it is unrealistic
that the security price can be negative with positive probability, our model neverthe-
less serves its practical purpose for the following reasons: Our numerical experiments
conducted in Section 3.3 show that price changes after a few weeks from now have
ignorable impacts on a current optimal action. In other words, optimal actions for our
infinite-horizon control problem appear to be quite close to those for a finite-horizon
counterpart on a few week time scale. Furthermore, it turns out that in simulation
we could learn a unknown price impact model fast enough to take actions that are
close to optimal actions within a few weeks. Thus, learning based on our price dy-
namics model could also be justified. We will give concrete numerical examples later
to support these notions.
Price Impact The term λ∗ ut represents “permanent price impact” on the security
price of a current trade. The permanent price impact is endogenously derived in [34]
from informational asymmetry between an informed trader and uninformed compet-
itive market makers, and in [42] from equilibrium of a limit order market where fully
strategic liquidity traders dynamically choose limit and market orders. [30] prove that
the linearity of a time-independent permanent price impact function is a necessary
and sufficient condition for the absence of “price manipulation” and “quasi-aribtrage”
under some regularity conditions.
PM ∗
The term m=1 γm dm,t indicates “transient price impact” that models other traders’
44 CHAPTER 3. PART II: ADAPTIVE EXECUTION

responses to non-informative orders. For example, suppose that a large market buy
order has arrived and other traders monitoring the market somehow realize that there
is no definitive evidence for abrupt change in the fundamental value of the security.
Then, they naturally infer that the large buy order came merely for some liquidity
reason, and gradually “correct” the perturbed price into what they believe it is sup-
posed to be by submitting counteracting selling orders. The dynamics of dm,t in (3.1)
indicate that the impact of a current trade on the security price decays exponentially
over time, which is considered in [39] that incorporate the dynamics of supply and de-
mand in a limit order market to optimal execution strategies. In [25], it is shown that
the exponentially decaying transient price impact is compatible only with a linear
instantaneous price impact function in the absence of “dynamic arbitrage.”
Observable Return-Predictive Factors We assume that there are multiple ob-
servable return-predictive factors that affect the absolute return of the security as in
[24]. Those factors could be macroeconomic factors such as gross domestic products
(GDP), inflation rates and unemployment rates, security-specific factors such as P/B
ratio, P/E ratio and lagged returns, or prices of other securities that are correlated
with the security price. In our price dynamics model, ft ∈ RK denotes these factors
and g ∈ RK denotes factor loadings. The term g > ft−1 represents predictable excess
return or “alpha.” We assume that ft is a first-order vector autoregressive process
ft = Φft−1 + ωt where Φ ∈ RK×K is a stable matrix that has all eigenvalues inside a
unit disk and ωt ∈ RK is a martingale difference sequence adapted to the filtration
{Ft , σ({x0 , d0 , f0 , ω1 , . . . , ωt , 1 , . . . , t })}. We further assume that ωt is bounded
almost surely, i.e. kωt k ≤ Cω a.s. for all t ≥ 1 for some deterministic constant Cω ,
and Cov[ωt |Ft−1 ] = Ω ∈ RK×K being positive definite and independent of t.
Unpredictable Noise The term t represents random fluctuations that cannot
be accounted for by price impact and observable return-predictive factors. We as-
sume that t is a martingale difference sequence adapted to the filtration {Ft }, and
independent of x0 , d0 , f0 and ωτ for any τ ≥ 1. Also, E[2t |Ft−1 ] = Σ ∈ R being inde-
pendent of t. Finally, each t is assumed to be sub-Gaussian, i.e., E[exp(at )|Ft−1 ] ≤
exp(C2 a2 /2), ∀t ≥ 1, ∀a ∈ R for some C > 0.
Policy A policy is defined as a sequence π = {π1 , π2 , . . .} of functions where πt
3.1. PROBLEM FORMULATION 45

maps the trader’s information set at the beginning of period t into an action ut . The
trader observes ft−1 and pt−1 at the end of period t − 1 and thus her information
set at the beginning of period t is given by It−1 = {x0 , d0 , f0 , . . . , ft−1 , p0 , . . . , pt−1 }.
A policy π is admissible if zt , [xt d> > >
t ft ] generated by ut = πt (It−1 ) satisfies
limT →∞ kzT k2 /T = 0. The set of admissible policies is denoted by Π.
Objective Function The trader’s objective is to maximize expected average “risk-
adjusted” profit defined as

T 
" #
1X 
lim inf E ∆pt xt−1 − ρΣ x2t
T →∞ T t=1

where the first term ∆pt xt−1 indicates change in book value and the second term
ρΣ x2t a quadratic penalty for her non-zero security position in the next period that
reflects her risk aversion. ρ is a risk-aversion coefficient that quantifies the extent to
which the trader is risk-averse.
Assumptions The following is a list of assumptions on which our analysis is based
∗ >
throughout this chapter. Let θ∗ , [λ∗ γ1∗ . . . γM ] ∈ RM +1 . We will make two more
assumptions as we progress.

Assumption 3.1. (a) The price impact coefficients θ∗ are unknown to the trader.
Note that they can be learned only through executed trades.
(b) The factor loadings g are known to the trader. This is a reasonable assumption
since they can be learned by observing prices without any transaction.
(c) The decaying rates r , [r1 , . . . , rM ]> ∈ [0, 1)M of the transient price impact
are known to the trader and all the elements are distinct. In practice, they are
definitely not known a priori. However, it can be handled effectively for practical
purposes by using a sufficiently dense r with a large M so that potential bias
induced by modeling mismatch can be greatly reduced at the expense of increased
variance, which can be reduced by regularization.
(d) θ∗ ∈ Θ , {θ ∈ RM +1 : 0 ≤ θ ≤ θmax , 1> θ ≥ β} for some θmax > 0 component-
wise and some β > 0. The constraint 1> θ ≥ β is imposed to capture non-zero
execution costs in practice. Note that Θ is compact and convex.
46 CHAPTER 3. PART II: ADAPTIVE EXECUTION

3.1.2 Existence of Optimal Solution

Now, we will show that there exists an optimal policy among admissible policies that
maximizes expected average risk-adjusted profit. For convenience, we will consider
the following minimization problem that is equivalent to maximize expected average
risk-adjusted profit.

T 
" #
1X 
min lim sup E ρΣ x2t − ∆pt xt−1
π∈Π T →∞ T t=1

We call the negative of average risk-adjusted profit “average cost.” This problem can
be expressed as a discrete-time linear quadratic control problem
   
T h
1X Q S z
  t−1  s.t. zt = Azt−1 + But + Wt
i
min lim sup E  >
zt−1 ut  >
π∈Π T →∞ T t=1 S R ut

where zt = [xt d> > >


t ft ] , v = [0 γ
∗>
(diag(r) − I) g > ]> , γ ∗ = [γ1∗ · · · γM
∗ >
] , e1 =
[1 0 · · · 0]> , ut = πt (It−1 ),

1 > 1 ∗
Q = ρΣ e1 e> >
1 − (ve1 + e1 v ), S = ρΣ e1 − (λ + γ
∗>
1)e1 , R = ρΣ ,
2 2
       
1 0 0 1 0 0 0 0
       
A =  0 diag(r) 0  , B =  1  , Wt =  0  , Ω̃ , Cov[Wt ] =  0 0 0  .
       
       
0 0 Φ 0 ωt 0 0 Ω
Note that R is strictly positive but Q is not necessarily positive semidefinite. There-
fore, special care should be taken in order to prove the existence of an optimal policy.
We start with a well-known Bellman equation for average-cost linear quadratic control
problems
h i
H(zt−1 ) + h = min
u
E ρΣ (xt−1 + ut )2 − ∆pt xt−1 + H(zt ) (3.2)
t

where H(·) denotes a differential value function and h denotes minimum average cost.
It is natural to conjecture H(zt ) = zt> P zt . Plugging it into (3.2), we can obtain a
3.1. PROBLEM FORMULATION 47

discrete-time Riccati algebraic equation

P = A> P A + Q − (S > + B > P A)> (R + B > P B)−1 (S > + B > P A) (3.3)

with a second-order optimality condition R + B > P B > 0. The following theorem


characterizes an optimal policy among admissible policies that minimizes expected
average cost, and proves existence and uniqueness of such an optimal policy.

Theorem 3.1. For any θ∗ ∈ Θ, there exists a unique symmetric solution P to (3.3)
that satisfies R + B > P B > 0 and ρsr (A + BL) < 1 where L = −(R + B > P B)−1 (S > +
B > P A) and ρsr (·) denotes a spectral radius. Moreover, a policy π = (π1 , π2 , . . .) with
πt (It−1 ) = Lzt−1 is an optimal policy among admissible policies that attains minimum
expected average cost tr(P Ω̃).

For ease of exposition, we define some notations: P (θ) denotes a unique symmet-
ric stabilizing solution to (3.3) with θ∗ = θ. L(θ) , −(R + B > P (θ)B)−1 (S(θ)> +
B > P (θ)A) denotes a gain matrix for an optimal policy with θ∗ = θ, G(θ) , A+BL(θ)
denotes a closed-loop system matrix with θ∗ = θ, and U (θ) , 1L(θ) + [A − I O] de-
notes a linear mapping from zt−1 to a regressor ψt used in least-squares regression for
learning price impact, i.e. ψt = U (θ)zt−1 . Having these notations, we make two as-
sumptions about L(θ) as follows. Indeed, we can verify through closed-form solutions
that these assumptions hold in a special case which will be discussed in Section 3.1.3.

Assumption 3.2. (a) There exists a deterministic constant CL > 0 such that for any
θ1 , θ2 ∈ Θ, kL(θ1 ) − L(θ2 )k ≤ CL kθ1 − θ2 k.
(b) (L(θ))1 6= 0 and (L(θ))M +2 6= 0 for any θ ∈ Θ.

Using Assumption 3.2, we can obatin an upper bound on kzt k uniformly over
θ ∈ Θ and t ≥ 0.

Lemma 3.1. For any 0 < ξ < 1, there exists N ∈ N being independent of θ such
that kGN (θ)k ≤ ξ for all θ ∈ Θ. Thus, max0≤i≤N −1 supθ∈Θ kGi (θ)k , Cg is finite.
For any fixed θ ∈ Θ, kzt k ≤ Cg kz0 k + Cg Cω /(ξ(1 − ξ 1/N )) , Cz , ∀t ≥ 0 a.s. where
zt = G(θ)zt−1 + Wt . Moreover, supθ∈Θ kU (θ)k ≤ Cg + 1.
48 CHAPTER 3. PART II: ADAPTIVE EXECUTION

Note that Lemma 3.1 can be applied only when θ is fixed over time. From now
on, we assume kz0 k ≤ 2Cg Cω /(ξ(1 − ξ 1/N )) without loss of generality otherwise we
can always set Cg to be greater than kz0 kξ(1 − ξ 1/N )/(2Cω ).
Finally, we present concrete numerical examples that support the validity of our
price model as an approximation of the geometric model for practical purposes. As we
discussed earlier, our numerical experiments conducted in Section 3.3 show that our
infinite-horizon control problem could be approximated accurately by a finite-time
control problem with a time horizon on a few week time scale. To be more precise,
(T ) (T ) (T )
we define relative error for P0 as kP0 − P k/kP k where Pt denotes a coefficient
matrix of a quadratic value function at period t for a finite-horizon control problem
with a terminal period T , and P denotes a coefficient matrix of a quadratic value
function for our infinite-horizon control problem. As shown in Figure 3.1, the relative
(T ) (300)
error for P0 appears to decrease exponentially in T and the relative error for P0
is almost 10−7 where T = 300 corresponds to 3.8 trading days.
Furthermore, we could learn the unknown θ∗ fast enough to take actions that are
close to optimal actions on a required time scale. An action from a current estimate
could be quite close to an optimal action even if estimation error for the current
estimate is large, especially in the case where a few “principal components” of L(θ)
with large directional derivatives with respect to θ are learned accurately. To be more

0
10 0.08

0.06
Relative Error for L(θt)
Relative Error for P(T)

−2
0

10
0.04

−4
10 0.02

0
−6
10
−0.02

−8
10 −0.04
0 50 100 150 200 250 300 1000 1500 2000 2500 3000
T Period

Figure 3.1: (Left) Relative error for PT : T = 300 corresponds to 3.8 trading days.
(Right) Relative error for L(θt ) from CTRACE: Period 3000 corresponds to 38 trading
days. The verical bars represent two standard errors. In both figures, the simulation
setting in Section 3.3 is used.
3.1. PROBLEM FORMULATION 49

precise, we define relative error for L(θt ) as


E[(L(θt )zt−1 − L(θ∗ )zt−1

)2 ] (L(θt ) − L(θ∗ ))Πzz (θ∗ )(L(θt ) − L(θ∗ ))>

=
E[(L(θ∗ )zt−1 )2 ] L(θ∗ )Πzz (θ∗ )L(θ∗ )>

where zt∗ is a stationary process generated by u∗t = L(θ∗ )zt−1



and Πzz (θ∗ ) = E[zt∗ zt∗> ].
The relative error for L(θt ) indicates how different an action from an estimate θt is
than an optimal action from the true value θ∗ . Figure 3.1 shows how the relative error
for L(θt ) evolves over time with two-standard-error bars when θt ’s are obtained from
a new policy that we will propose in Section 3.2. As you can see, all the approximate
95%-confidence intervals lie within ±3% range after Period 2500 that corresponds to
32 trading days. It implies that actions from estimates learned over a few weeks could
be sufficiently close to optimal actions.

3.1.3 Closed-Form Solution: A Single Factor and Permanent


Impact Only

When we consider only the permanent price impact and a single observable factor,
we can derive an exact closed-form P and L as follows.
q
λ∗ − ρΣ + 2λ∗ ρΣ + (ρΣ )2
Pxx = ,
2

−gλ∗
Pxf = q ,
(1 − Φ)λ∗ − ΦρΣ + Φ 2λ∗ ρΣ + (ρΣ )2

−g 2 Φ2
Pf f =  q ,
2(1 − Φ2 ) (1 − Φ)2 λ∗ + (1 + Φ2 )ρΣ + (1 − Φ2 ) 2λ∗ ρΣ + (ρΣ )2
−2ρΣ
Lx = q ,
ρΣ + 2λ∗ ρΣ + (ρΣ )2

Lf = q .
(1 − Φ)λ∗ + ρΣ + 2λ∗ ρΣ + (ρΣ )2
Although this is a special case of our general setting, we can get useful insights into
50 CHAPTER 3. PART II: ADAPTIVE EXECUTION

the effect of the permanent price impact coefficient λ∗ on various quantities. Here are
some examples:

(a) |Lx | and |Lf | are strictly decreasing in λ∗ .


(b) limλ∗ →0 Lx = −1, limλ∗ →∞ Lx = 0.

(c) limλ∗ →0 Lf = 2ρΣ
, limλ∗ →∞ Lf = 0.
(d) The expected average risk-adjusted profit −Pf f Ω is strictly decreasing in λ∗ .
g 2 Φ2 Ω
(e) limλ∗ →0 (−Pf f Ω) = 4(1−Φ2 )ρΣ
, limλ∗ →∞ (−Pf f Ω) = 0.

3.1.4 Performance Measure: Regret


In this section, we define a performance measure that can be used to evaluate policies.
For notational simplicity, let L∗ = L(θ∗ ), G∗ = G(θ∗ ) and P ∗ = P (θ∗ ). Using (3.3),
we can show that for any policy π

T n o
JTπ (z0 |FT ) , ρΣ (xt−1 + πt (It−1 ))2 − ∆pt xt−1
X

t=1
T T T
= z0> P ∗ z0 − zT> P ∗ zT + 2 (Azt−1 + Bπt (It−1 ))> P ∗ Wt + Wt> P ∗ Wt −
X X X
xt−1 t
t=1 t=1 t=1
T
(πt (It ) − L∗ zt−1 )> (R + B > P ∗ B)(πt (It−1 ) − L∗ zt−1 ).
X
+
t=1

First, we define pathwise regret RTπ (z0 |FT ) of a policy π at period T as JTπ (z0 |FT ) −

JTπ (z0 |FT ) where πt∗ (It−1 ) = L∗ zt−1

and zt∗ = G∗ zt−1

+ Wt with z0∗ = z0 . That is,
the pathwise regret of a policy π at period T amounts to excess costs accumulated
over T periods when applying π relative to when applying the optimal policy π ∗ . By
definition of π ∗ , the pathwise regret of a policy π at period T can be expressed as

RTπ (z0 |FT ) = zT∗> P ∗ zT∗ − zT> P ∗ zT


T
(πt (It−1 ) − L∗ zt−1 )> (R + B > P ∗ B)(πt (It−1 ) − L∗ zt−1 )
X
+
t=1
T T
∗ ∗
)> P ∗ Wt (x∗t−1 − xt−1 )t .
X X
+2 ((Azt−1 + Bπt (It−1 )) − (A + BL )zt−1 +
t=1 t=1
3.2. CTRACE 51

Second, we define expected regret R̄Tπ (z0 ) of a policy π at period T as E[RTπ (z0 |FT )].
Taking expectation of pathwise regret, we can obtain a more concise expression for
expected regret because the last two terms vanish by the law of total expectation.
Hence, we have

R̄Tπ (z0 ) = E[zT∗> P ∗ zT∗ − zT> P ∗ zT ]


" T #
(πt (It−1 ) − L∗ zt−1 )> (R + B > P ∗ B)(πt (It−1 ) − L∗ zt−1 ) .
X
+E
t=1

Finally, we define relative regret R̃Tπ (z0 ) of a policy π at period T as R̄Tπ (z0 )/|tr(P ∗ Ω̃)|
where tr(P ∗ Ω̃) is minimum expected average cost for θ∗ . Our choice of performance
measure will be either expected regret or relative regret in the rest of this chapter.

3.2 Confidence-Triggered Regularized Adaptive Certainty


Equivalent Policy (CTRACE)

Our problem can be viewed as a special case of reinforcement learning, which focuses
on sequential decision-making problems in which unknown properties of an environ-
ment must be learned in the course of taking actions. It is often emphasized in rein-
forcement learning that longer-term performance can be greatly improved by making
decisions that explore the environment efficiently at the expense of suboptimal short-
term behavior. In our problem, a price impact model is unknown, and submission of
large orders can be considered exploratory actions that facilitate learning.
Certainty equivalent control (CE) represents one extreme where at any time, cur-
rent point estimates are assumed to be correct and actions are made accordingly.
Although learning is carried out with observations made as the system evolves, no
decisions are designed to enhance learning. Thus, this is an instance of pure ex-
ploitation of current knowledge. In our problem, CE estimates the unknown price
impact coefficients θ∗ at each period via least-squares regression using available data,
and makes decisions that maximize expected average risk-adjusted profit under an
assumption that the estimated model is correct. That is, an action ut for CE is given
Pt−1  2
by ut = L(θ̃t−1 )zt−1 where θ̃t−1 = argminθ∈Θ i=1 (∆pi − g > fi−1 ) − ψi> θ with a
52 CHAPTER 3. PART II: ADAPTIVE EXECUTION

regressor ψi = [ui (di − di−1 )> ]> .


An important question is how aggressively the trader should explore to learn
θ∗ . Unlike many other reinforcement learning problems, a fairly large amount of
exploration is naturally induced by exploitative decisions in our problem. That is,
regular trading activity triggered by the return-predictive factors ft excites the mar-
ket regardless of whether or not she aims to learn price impact. Given sufficiently
large factor variability, the induced exploration might adequately resolve uncertainties
about price impact. However, we will demonstrate by proposing a new exploratory
policy that executing trades to explore beyond what would naturally occur through
the factor-driven exploitation can result in significant benefit.
Now, let us formally state that exploitative actions triggered by the return-
predictive factors induce a large degree of exploration that could yield strong con-
sistency of least-squares estimates. It is worth noting that pure exploitation is not
sufficient for strong consistency in other problems such as [35] and [16].

Lemma 3.2. For any θ ∈ Θ, let ut = L(θ)zt−1 , zt = G(θ)zt−1 + Wt and ψt> =


h i
ut (dt − dt−1 )> = (U (θ)zt−1 )> . Also, let Πzz (θ) denote a unique solution to Πzz (θ) =
G(θ)Πzz (θ)G(θ)> + Ω̃. Then,

T
1X
lim ψt ψt> = U (θ)Πzz (θ)U (θ)>  0 a.s. (3.4)
T →∞ T
t=1

Moreover, we can show that Πzz (θ) is continuous on Θ by proving uniform con-
h PT i
1 >
vergence of E T t=1 zt−1 zt−1 to Πzz (θ) on Θ. The continuity leads to

 
λ∗ψψ , inf λmin U (θ)Πzz (θ)U (θ)> > 0
θ∈Θ

which will be used later.

Corollary 3.1. (a) Πzz (θ) is continuous on Θ.


 
(b) λ∗ψψ , inf θ∈Θ λmin U (θ)Πzz (θ)U (θ)> > 0.
P 
T
Lemma 3.2 implies that λmin t=1 ψt ψt> increases linearly in time T a.s. asymp-
totically. In addition, we can obtain a similar result for a finite-sample case: There
3.2. CTRACE 53

P 
T
exists a finite, deterministic constant T1 (θ, δ) such that λmin t=1 ψt ψt> grows lin-
early in time T for all T ≥ T1 (θ, δ) with probability at least 1 − δ. This is a crucial
result that will be used for bounding above “(, δ)-convergence time” later. It is
formally stated in the following lemma.

Lemma 3.3. For any θ ∈ Θ, let $u_t = L(\theta)z_{t-1}$, $z_t = G(\theta)z_{t-1} + W_t$ and $\psi_t^\top = [u_t \ (d_t - d_{t-1})^\top] = (U(\theta)z_{t-1})^\top$. Then, there exists an event $B(\delta)$ with $\Pr(B(\delta)) \geq 1-\delta$ such that on $B(\delta)$
$$\frac{7}{8}\, U(\theta)\Pi_{zz}(\theta)U(\theta)^\top \preceq \frac{1}{T}\sum_{t=1}^{T}\psi_t\psi_t^\top \preceq \frac{17}{16}\, U(\theta)\Pi_{zz}(\theta)U(\theta)^\top \quad \forall T \geq T_1(\theta,\delta), \ \text{where}$$
$$T_1(\theta,\delta) = 4\left(\frac{32(C_z C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{\min}(\Pi_{zz}(\theta))}\right)^2 \log\!\left(\frac{(M+K+2)^4}{432\,\delta^2}\right) \vee\, 8\left(\frac{32(C_z C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{\min}(\Pi_{zz}(\theta))}\right)^3 \vee\, 216.$$
Furthermore, we can extend Lemma 3.2 in such a way that $\lambda_{\min}\!\left(\sum_{t=1}^{T}\psi_t\psi_t^\top\right)$ still increases to infinity linearly in time T for time-varying $\{\theta_t\}$ adapted to $\{\sigma(I_t)\}$, as long as $\theta_t$ remains sufficiently close to a fixed θ ∈ Θ for all t ≥ 0. Here, $\sigma(I_t)$ denotes the σ-algebra generated by $I_t$, and $\theta_t$ is $\sigma(I_t)$-measurable for each t.

Lemma 3.4. Consider any θ ∈ Θ and $\{\theta_t \in \Theta\}$ adapted to $\{\sigma(I_t)\}$ such that $\|\theta_t - \theta\| \leq \frac{\eta}{\sqrt{M+1}\,C_L}$ a.s. for all t ≥ 0 and any $\nu \in (\xi, 1)$, where
$$\eta = \frac{\nu^3(1-\nu^{1/N})^3\lambda_{\min}(\Pi_{zz}(\theta))}{42N C_g^{N+1} C_\omega^2} \;\wedge\; \frac{\nu^3(1-\nu^{1/N})^3\lambda_{\min}(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top)}{42N C_g^{N+1} C_\omega^2 (1+\|U(\theta)\|)^2} \;\wedge\; \frac{\nu-\xi}{N C_g^{N-1}}.$$
Let $u_t = L(\theta_{t-1})z_{t-1}$, $z_t = G(\theta_{t-1})z_{t-1} + W_t$ and $\psi_t^\top = [u_t \ (d_t - d_{t-1})^\top] = (U(\theta_{t-1})z_{t-1})^\top$. Then,
$$\liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T}\psi_t\psi_t^\top \succeq \frac{\lambda_{\min}(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top)}{2}\, I \quad \text{a.s.}$$

Similarly to Lemma 3.3, we can obtain a finite-sample result for Lemma 3.4. This result will provide useful insight into how our new exploratory policy operates in the long term.

Lemma 3.5. Consider $\{\theta_t \in \Theta\}$ defined in Lemma 3.4. Let $u_t = L(\theta_{t-1})z_{t-1}$, $z_t = G(\theta_{t-1})z_{t-1} + W_t$ and $\psi_t^\top = [u_t \ (d_t - d_{t-1})^\top] = (U(\theta_{t-1})z_{t-1})^\top$. Then, for any $0 < \delta < 1$, on the event $B(\delta)$ in Lemma 3.3 with $\Pr(B(\delta)) \geq 1-\delta$,
$$\lambda_{\min}\!\left(\frac{1}{T}\sum_{t=1}^{T}\psi_t\psi_t^\top\right) \geq \frac{3}{8}\lambda_{\min}\!\left(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top\right), \quad \forall T \geq T_1(\theta,\delta) \vee \frac{3\|z_0\|(2C_\omega + \|z_0\|)}{C_\omega^2}.$$

It is challenging to guarantee that all estimates generated by CE are sufficiently


close to one another uniformly over time so that Lemma 3.4 and Lemma 3.5 can
be applied to CE. In particular, CE is subject to overestimation of price impact
that could be considerably detrimental to trading performance. The reason is that
overestimated price impact discourages submission of large orders and thus it might
take a while for the trader to realize that price impact is overestimated due to reduced
“signal-to-noise ratio.” To address this issue, we propose the confidence-triggered
regularized adaptive certainty equivalent policy (CTRACE) as presented in Algorithm
4. CTRACE can be viewed as a generalization of CE and deviates from CE in two
ways: (1) ℓ2 regularization is applied in the least-squares regression, and (2) coefficients are only updated when a certain measure of confidence exceeds a pre-specified threshold and a minimum inter-update time has elapsed.

Algorithm 4 CTRACE
Input: $\theta_0$, $x_0$, $d_0$, $r$, $g$, $\kappa$, $C_v$, $\tau$, $L(\cdot)$, $\theta_{\max}$, $\{p_t\}_{t=0}^{\infty}$, $\{f_t\}_{t=0}^{\infty}$
Output: $\{u_t\}_{t=1}^{\infty}$
1: $V_0 \leftarrow \kappa I$, $t_0 \leftarrow 0$, $i \leftarrow 1$
2: for $t = 1, 2, \ldots$ do
3:   $u_t \leftarrow L(\theta_{t-1})z_{t-1}$, $x_t \leftarrow x_{t-1} + u_t$, $d_t \leftarrow \mathrm{diag}(r)d_{t-1} + \mathbf{1}u_t$
4:   $\psi_t \leftarrow [u_t \ (d_t - d_{t-1})^\top]^\top$, $V_t \leftarrow V_{t-1} + \psi_t\psi_t^\top$
5:   if $\lambda_{\min}(V_t) \geq \kappa + C_v t$ and $t \geq t_{i-1} + \tau$ then
6:     $\theta_t \leftarrow \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^{t} \left((\Delta p_i - g^\top f_{i-1}) - \psi_i^\top\theta\right)^2 + \kappa\|\theta\|^2$, $t_i \leftarrow t$, $i \leftarrow i + 1$
7:   else
8:     $\theta_t \leftarrow \theta_{t-1}$
9:   end if
10: end for

Note that CTRACE reduces to CE as the regularization penalty κ and the threshold $C_v$ tend to zero and the minimum inter-update time τ tends to one.
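For concreteness, a minimal sketch of the update logic in Algorithm 4 follows (Python, illustrative only). The box projection onto Θ and the helper arguments are assumptions of this sketch rather than the thesis' implementation; in particular, Θ is treated here as a coordinate-wise box [0, θmax].

    import numpy as np

    def ctrace_update(V, Psi, y, theta_prev, t, t_last, i, kappa, Cv, tau, theta_max):
        """Confidence-triggered, l2-regularized re-estimation (cf. Algorithm 4).
        V is kappa*I + sum_j psi_j psi_j'; Psi, y hold all observations up to time t."""
        triggered = np.linalg.eigvalsh(V)[0] >= kappa + Cv * t and t >= t_last + tau
        if not triggered:
            return theta_prev, t_last, i          # keep the previous estimate
        d = Psi.shape[1]
        # ridge regression: argmin ||y - Psi theta||^2 + kappa ||theta||^2
        theta = np.linalg.solve(Psi.T @ Psi + kappa * np.eye(d), Psi.T @ y)
        theta = np.clip(theta, 0.0, theta_max)    # crude stand-in for projection onto Theta
        return theta, t, i + 1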
Regularization induces active exploration in our problem by penalizing the ℓ2-norm of the price impact coefficients, and it also reduces the variance of the estimator.
Without regularization, we are more likely to obtain overestimates of price impact.
Such an outcome attenuates trading intensity and thereby makes it difficult to escape
from the misjudged perspective on price impact. Regularization decreases the chances
of obtaining overestimates by reducing the variance of an estimator and furthermore
tends to yield underestimates that encourage active exploration.
Another source of improvement of CTRACE relative to CE is that updates are
made based on a certain measure of confidence for estimates whereas CE updates at
every period regardless of confidence. To be more precise on this confidence measure,
we first present a high-probability confidence region for least-squares estimates from
[1].

Proposition 3.1 (Corollary 10 of [1]). $\Pr\left(\theta^* \in S_t(\delta),\ \forall t \geq 1\right) \geq 1 - \delta$ where
$$S_t(\delta) \triangleq \left\{\theta \in \mathbb{R}^{M+1} : (\theta - \hat{\theta}_t)^\top V_t (\theta - \hat{\theta}_t) \leq \left( C\sqrt{2\log\!\left(\frac{\det(V_t)^{1/2}\det(\kappa I)^{-1/2}}{\delta}\right)} + \kappa^{1/2}\|\theta_{\max}\| \right)^{2} \right\},$$
$$V_t = \kappa I + \sum_{i=1}^{t}\psi_i\psi_i^\top \quad \text{and} \quad \hat{\theta}_t = V_t^{-1}\left(\sum_{i=1}^{t}\psi_i\psi_i^\top\theta^* + \sum_{i=1}^{t}\psi_i\epsilon_i\right).$$

This implies that for any $\theta \in S_t(\delta)$,
$$\|\theta - \hat{\theta}_t\|^2 \leq \frac{1}{\lambda_{\min}(V_t)}\left( C\sqrt{2\log\!\left(\frac{\det(V_t)^{1/2}\det(\kappa I)^{-1/2}}{\delta}\right)} + \kappa^{1/2}\|\theta_{\max}\| \right)^{2}.$$
By definition, CTRACE updates only when $\lambda_{\min}(V_t) \geq \kappa + C_v t$. $\lambda_{\min}(V_t)$ typically dominates $\log(\det(V_t))$ for large t because it increases linearly in t, and it is inversely proportional to the squared estimation error $\|\hat{\theta}_t - \theta^*\|^2$. That is, CTRACE updates only when the confidence represented by $\lambda_{\min}(V_t)$ exceeds the specified level $\kappa + C_v t$. From

now on, we refer to this updating scheme as confidence-triggered update. Confidence-


triggered update makes a significant contribution to reducing the chances of obtaining
overestimates of price impact by updating “carefully” only at the moments when an
upper bound on the estimation error is guaranteed to decrease.
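As an illustration of this confidence measure, the helper below evaluates the error bound implied by Proposition 3.1 together with CTRACE's trigger λmin(Vt) ≥ κ + Cv t. Treating the sub-Gaussian constant C as a user-supplied input is an assumption of this sketch, which is not part of the thesis.

    import numpy as np

    def estimation_error_bound(V, kappa, delta, C, theta_max_norm):
        """High-probability bound on ||theta - theta_hat_t|| implied by Proposition 3.1."""
        lam_min = np.linalg.eigvalsh(V)[0]           # smallest eigenvalue of V_t
        _, logdet_V = np.linalg.slogdet(V)
        logdet_kI = V.shape[0] * np.log(kappa)       # log det(kappa * I)
        radius = C * np.sqrt(2 * (0.5 * logdet_V - 0.5 * logdet_kI + np.log(1.0 / delta)))
        radius += np.sqrt(kappa) * theta_max_norm
        return radius / np.sqrt(lam_min)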
The minimum inter-update time $\tau \in \mathbb{N}$ in Algorithm 4 guarantees that the closed-loop system $\{z_t\}$ under CTRACE is stable as long as τ is sufficiently large; there is no such stability guarantee for CE. The following lemma provides a specific uniform bound on $\|z_t\|$.

Lemma 3.6. Under CTRACE with $\tau \geq \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$, for all t ≥ 0,
$$\|z_t\| \leq \frac{(2C_g+1)C_g C_\omega}{\xi(1-\xi^{1/N})} \triangleq C_z^* \quad \text{a.s.}, \qquad \|\psi_t\| \leq \frac{(C_g+1)(2C_g+1)C_g C_\omega}{\xi(1-\xi^{1/N})} \triangleq C_\psi \quad \text{a.s.}$$

Confidence-triggered update yields a good property of CTRACE that CE lacks: CTRACE is inter-temporally consistent in the sense that the estimation errors $\|\theta_t - \theta^*\|$ are bounded with high probability by monotonically nonincreasing upper bounds that converge to zero almost surely as time tends to infinity. The following theorem formally states this property.

Theorem 3.2 (Inter-temporal Consistency of CTRACE). Let $\{\theta_t\}$ be estimates generated by CTRACE with $M \geq 2$, $\tau \geq \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$ and $C_v < \lambda_{\psi\psi}^*$. Then, the $i$th update time $t_i$ in Algorithm 4 is finite a.s. Moreover, $\|\theta_t - \theta^*\| \leq b_t$, $\forall t \geq 0$ on the event $\{\theta^* \in S_t(\delta)\ \forall t \geq 1\}$, where $b_0 = \|\theta_0 - \theta^*\|$,
$$b_t = \begin{cases} \dfrac{2C\sqrt{(M+1)\log\!\left(C_\psi^2 t/\kappa + M + 1\right) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v t}} & \text{if } t = t_i \text{ for some } i, \\[2mm] b_{t-1} & \text{otherwise,} \end{cases}$$
and $\{b_t\}$ is monotonically nonincreasing for all t ≥ 1 with $\lim_{t\to\infty} b_t = 0$ a.s.
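To see how these bounds behave, the snippet below evaluates $b_t$ at a few hypothetical update times; all numerical constants here are placeholders chosen for illustration, not the thesis' calibration.

    import numpy as np

    def b_at_update(t, C, M, C_psi, kappa, delta, theta_max_norm, Cv):
        """b_t at an update time t = t_i (Theorem 3.2); nonincreasing in t when M >= 2."""
        inside = (M + 1) * np.log(C_psi**2 * t / kappa + M + 1) + 2 * np.log(1.0 / delta)
        return (2 * C * np.sqrt(inside) + 2 * np.sqrt(kappa) * theta_max_norm) / np.sqrt(Cv * t)

    for t in [10, 100, 1000, 10000]:   # placeholder update times
        print(t, b_at_update(t, C=1.0, M=6, C_psi=5.0, kappa=1e3,
                             delta=0.05, theta_max_norm=5e-7, Cv=1.0))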

Moreover, we can show that CTRACE is efficient in the sense that its (ε, δ)-convergence time is bounded above by a polynomial in 1/ε, log(1/δ) and log(1/δ′) with probability at least 1 − δ′. We define the (ε, δ)-convergence time to be the first time at which an estimate and all future estimates following it are within an ε-neighborhood of θ∗ with probability at least 1 − δ. If ε is sufficiently small, we can apply Lemma 3.4 and Lemma 3.5 to guarantee that $\lambda_{\min}(V_t)$ increases linearly in t with high probability after the (ε, δ)-convergence time, so that confidence-triggered updates occur every τ periods. This is a critical property that will be used to derive a poly-logarithmic finite-time expected regret bound for CTRACE. By Theorem 3.2, it is easy to see that the (ε, δ)-convergence time of CTRACE is bounded above by $t_{N(\epsilon,\delta,C_v)}$ where $N(\epsilon,\delta,C_v)$ is defined as
$$N(\epsilon,\delta,C_v) \triangleq \inf\left\{ i \in \mathbb{N} : \frac{2C\sqrt{(M+1)\log\!\left(C_\psi^2 t_i/\kappa + M + 1\right) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v t_i}} \leq \epsilon \right\}.$$
The following theorem presents the polynomial bound on the (ε, δ)-convergence time of CTRACE.

Theorem 3.3 (Efficiency of CTRACE). For any ε > 0, 0 < δ, δ′ < 1, $\tau \geq \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$ and $C_v < \frac{7}{8}\lambda_{\psi\psi}^*$, on the event $B(\delta')$ defined in Lemma 3.3,
$$t_{N(\epsilon,\delta,C_v)} \leq T_1^*(\delta') \vee \tau + T_2(\epsilon,\delta,C_v), \quad \text{where}$$
$$T_1^*(\delta') = 4\left(\frac{32(C_z^* C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{zz}^*}\right)^2 \log\!\left(\frac{(M+K+2)^4}{432\,\delta'^2}\right) \vee\, 8\left(\frac{32(C_z^* C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{zz}^*}\right)^3 \vee\, 216,$$
$$T_2(\epsilon,\delta,C_v) = \left(\frac{8C^2 C_\psi(M+1) + 4\sqrt{4C^4 C_\psi^2(M+1)^2 + \kappa C^2 C_v \epsilon^2\left((M+1)^{3/2} + 2\log(1/\delta)\right)}}{\sqrt{\kappa}\, C_v\, \epsilon^2}\right)^2 \vee \frac{(4\kappa\|\theta_{\max}\|)^2}{(C_v\epsilon)^2}.$$

Finally, we derive a finite-time expected regret bound for CTRACE that is quadratic in the logarithm of elapsed time, using the efficiency of CTRACE and Lemma 3.5.

Theorem 3.4 (Finite-Time Expected Regret Bound of CTRACE). If π is CTRACE with $M \geq 2$, $\tau \geq \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$ and $C_v < \frac{7}{8}\lambda_{\psi\psi}^*$, then for any $\nu \in (\xi,1)$ and all $T \geq 2$,
$$\bar{R}_T^\pi(z_0) \leq 2\|P^*\|C_z^{*2} + (R + B^\top P^* B)C_z^{*2}C_L^2\Bigg[ (\tau_1^*(T) + \tau_2^*(T) + 1)\|\theta_{\max}\|^2 + \tau_3^*(T)\epsilon^2$$
$$+ \frac{\tau}{\tilde{C}}\left(2C\sqrt{(M+1)\log\!\left(C_\psi^2 T/\kappa + M + 1\right) + 2\log(2T)} + 2\kappa^{1/2}\|\theta_{\max}\|\right)^2$$
$$\times \log\!\left(\frac{\kappa + \tilde{C}(T-1) - (\tilde{C}-C_v)^+(\tau_1^*(T)+\tau_2^*(T))}{\kappa + \tilde{C}(\tau^*(T)-1) - (\tilde{C}-C_v)^+(\tau_1^*(T)+\tau_2^*(T))}\right)\mathbf{1}\{T > \tau^*(T)\}\Bigg]$$
where $\tilde{C} \triangleq \frac{3}{8}\lambda_{\min}\!\left(U(\theta^*)\Pi_{zz}(\theta^*)U(\theta^*)^\top\right)$, $\tau^*(T) = \tau_1^*(T) + \tau_2^*(T) + \tau_3^*(T)$,
$$\tau_1^*(T) = 8\left(\frac{32(C_z^* C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{zz}^*}\right)^2 \log\!\left(\frac{(M+K+2)^2 T}{6\sqrt{3}}\right) \vee\, 8\left(\frac{32(C_z^* C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{zz}^*}\right)^3 \vee\, 216 \vee\, \tau,$$
$$\tau_2^*(T) = \left(\frac{8C^2 C_\psi(M+1) + 4\sqrt{4C^4 C_\psi^2(M+1)^2 + \kappa C^2 C_v \epsilon^2\left((M+1)^{3/2} + 2\log(2T)\right)}}{\sqrt{\kappa}\, C_v\, \epsilon^2}\right)^2 \vee \frac{(4\kappa\|\theta_{\max}\|)^2}{(C_v\epsilon)^2},$$
$$\tau_3^*(T) = 8\left(\frac{32(C_z^* C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{\min}(\Pi_{zz}(\theta^*))}\right)^2 \log\!\left(\frac{(M+K+2)^2 T}{6\sqrt{3}}\right) \vee\, 8\left(\frac{32(C_z^* C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{\min}(\Pi_{zz}(\theta^*))}\right)^3 \vee\, 216 \vee\, \frac{3C_z^*(2C_\omega + C_z^*)}{C_\omega^2},$$
$$\epsilon = \frac{1}{\sqrt{M+1}\,C_L}\left(\frac{\nu^3(1-\nu^{1/N})^3\lambda_{\min}(\Pi_{zz}(\theta^*))}{42N C_g^{N+1} C_\omega^2} \wedge \frac{\nu^3(1-\nu^{1/N})^3\lambda_{\min}(U(\theta^*)\Pi_{zz}(\theta^*)U(\theta^*)^\top)}{42N C_g^{N+1} C_\omega^2(1+\|U(\theta^*)\|)^2} \wedge \frac{\nu-\xi}{N C_g^{N-1}}\right).$$
Note that τ1∗ (T ), τ2∗ (T ) and τ3∗ (T ) are all O(log T ). Therefore, it is not difficult to
see that the expected regret bound for CTRACE is O(log2 T ).

3.3 Computational Analysis


In this section, we will compare through Monte Carlo simulation the performance
of CTRACE to that of two benchmark policies: CE and a reinforcement learning

Table 3.1: Monte Carlo Simulation Setting (1 trading day = 6.5 hours)

  M                   6                                        K                      2
  Trading interval    5 mins                                   Initial asset price    $50
  Half-life of r      [5, 7.5, 10, 15, 30, 45] mins            Half-life of factor    [10, 40] mins
  r                   [0.50, 0.63, 0.71, 0.79, 0.89, 0.93]     Φ                      diag([0.71, 0.92])
  γ ($/share)         [0, 6, 0, 3, 7, 5] × 10^-8               λ ($/share)            2 × 10^-8
  Σ                   0.0013 (annualized vol. = 10%)           Ω                      diag([1, 1])
  ρ                   1 × 10^-6                                θmax                   (5 × 10^-7)1
  β                   5 × 10^-9                                g                      [0.006, 0.002]
  T                   3000 (≈ 38 trading days)                 Sample paths           600

algorithm recently proposed in [2], which is referred to as AS policy from now on. AS
policy was designed to explore efficiently in a broader class of linear-quadratic control
problems and appears well-suited for our problem. It updates an estimate only when
the determinant of Vt is at least twice as large as the determinant evaluated at the
last update, and selects an element from a high-probability confidence region that
yields maximum average reward. In our problem, AS policy can translate to update
an estimate with $\theta_t = \operatorname*{argmin}_{\theta \in S_t(\delta)\cap\Theta} \operatorname{tr}(P(\theta)\tilde{\Omega})$ at each update time t. Intuitively, the smaller the price impact, the larger the average profit and, equivalently, the smaller $\operatorname{tr}(P(\theta)\tilde{\Omega})$, which is the negative of the average profit. In light of this, we restrict our attention to solutions to $\min_{\theta \in S_t(\delta)\cap\Theta} \operatorname{tr}(P(\theta)\tilde{\Omega})$ of the form $\{\alpha_t\hat{\theta}_{\mathrm{con},t} \in S_t(\delta)\cap\Theta : 0 \leq \alpha_t \leq 1\}$, where $\hat{\theta}_{\mathrm{con},t}$ denotes the least-squares estimate with ℓ2 regularization constrained to Θ. The motivation is to reduce the amount of computation needed for AS policy, which would otherwise be prohibitive. Indeed, the minimum appears always to be attained at the smallest $\alpha_t$ such that $\alpha_t\hat{\theta}_{\mathrm{con},t} \in S_t(\delta)\cap\Theta$, which is provable in the special case considered in Section 3.1.3. Note that $\alpha_t$ can be viewed as a measure of the aggressiveness of exploration: $\alpha_t = 1$ means no extra exploration, and smaller $\alpha_t$ implies more active exploration.
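The restriction to candidates of the form $\alpha_t\hat{\theta}_{\mathrm{con},t}$ reduces the minimization to a one-dimensional search. The sketch below (an assumption of this illustration, not the authors' code) scans α downward from 1 and returns the smallest α keeping αθ̂ inside the confidence ellipsoid; the additional intersection with Θ is omitted for brevity.

    import numpy as np

    def in_confidence_set(theta, theta_hat, V, radius_sq):
        """Check (theta - theta_hat)' V (theta - theta_hat) <= radius_sq."""
        diff = theta - theta_hat
        return diff @ V @ diff <= radius_sq

    def smallest_feasible_alpha(theta_hat, V, radius_sq, grid=1000):
        """Smallest alpha in [0, 1] with alpha*theta_hat still in the ellipsoid (grid search)."""
        best = 1.0
        for alpha in np.linspace(1.0, 0.0, grid):
            if in_confidence_set(alpha * theta_hat, theta_hat, V, radius_sq):
                best = alpha
            else:
                break  # ellipsoid is convex and contains theta_hat, so feasibility is lost once
        return best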
Table 3.1 summarizes the numerical values used in our simulation. The signal-to-noise ratio (SNR), defined as $E[(\lambda u_t + \sum_{m=1}^{M}\gamma_m(d_{m,t}-d_{m,t-1}))^2]/E[\epsilon_t^2]$ under $u_t = L(\theta^*)z_{t-1}$, is 0.058 and the optimal average profit is $765.19 per period. $\epsilon_t$ and $\omega_t$ are sampled independently from Gaussian distributions even though $\omega_t$ is assumed to be bounded almost surely for the theoretical analysis; in fact, the use of a Gaussian distribution for $\omega_t$ makes no noticeable difference from a bounded case. The regularization coefficient κ, the confidence-triggered update threshold $C_v$, the minimum inter-update time τ and the significance level δ are chosen via cross-validation on realized profit: for CTRACE, κ = 1 × 10^11, $C_v$ = 600 and τ = 1; for AS policy, κ = 1 × 10^8 and δ = 0.99. The smaller κ and larger δ for AS policy are chosen to keep the radius of the confidence regions small, because the exploration performed by AS policy tends to be more than necessary and thus costly.

[Figure 3.2: Relative regret with varying κ and Cv. (Left) Varying κ ∈ {0, 2 × 10^10, 1 × 10^11} with fixed Cv = 0. (Right) Varying Cv ∈ {0, 20, 600} with fixed κ = 1 × 10^11.]

The left plot in Figure 3.2 illustrates the improvement in relative regret due to regularization. It shows the relative regret of CTRACE with varying κ and fixed Cv = 0, i.e., no confidence-triggered update. The vertical bars indicate two standard errors in each direction, that is, approximate 95% confidence intervals. The relative regret clearly decreases as CTRACE regularizes more, and the improvement from no regularization to κ = 1 × 10^11 is statistically significant at the approximate 95% confidence level. The right plot in Figure 3.2 shows the improvement achieved by confidence-triggered update with varying Cv and fixed κ = 1 × 10^11: updating based on confidence makes a substantial further contribution to reducing relative regret. The improvement from Cv = 0 to Cv = 600 is statistically significant at the approximate 95% confidence level.

[Figure 3.3: (Left) Relative regret of CTRACE and CE. (Right) Distribution of the difference in realized profit between CTRACE and CE over 600 sample paths; the red dotted line marks zero difference.]
As shown on the left of Figure 3.3, CTRACE clearly outperforms CE in terms of relative regret, and the difference is statistically significant at the approximate 95% confidence level. The dominance stems from both regularization and confidence-triggered update, as shown in Figure 3.2. The plot on the right shows an empirical distribution of the difference between the realized profit of CTRACE and that of CE over 600 sample paths. Many more realizations lie to the right of zero, which indicates that CTRACE earns more profit than CE on most sample paths.

[Figure 3.4: (Left) Relative regret of CTRACE and AS policy. (Right) Distribution of the difference in realized profit between CTRACE and AS policy over 600 sample paths; the red dotted line marks zero difference.]

Finally, we compare the performance of CTRACE to that of AS policy in Figure 3.4. The left plot shows that CTRACE outperforms AS policy even more drastically than CE in terms of relative regret, and the superiority is statistically significant at the approximate 95% confidence level. On the right is an empirical distribution of the difference between the realized profit of CTRACE and that of AS policy over 600 sample paths. CTRACE is more profitable than AS policy on most of the sample paths, which illustrates that the aggressive exploration performed by AS policy is too costly. The reason is that AS policy is designed to explore actively in situations where pure exploitation by CE is unable to identify the true model. In our problem, however, a great degree of exploration is naturally induced by the observable return-predictive factors, and thus the aggressiveness of exploration suggested by AS policy turns out to be more than necessary. Meanwhile, CTRACE strikes the desired balance between exploration and exploitation by taking factor-driven natural exploration into account.
Chapter 4

Conclusion

In the first part of this thesis, we have considered a model that captures strategic
interactions between a trader aiming to liquidate a position and an arbitrageur trying
to detect and profit from the trader’s activity. The algorithm we have developed
computes Gaussian perfect Bayesian equilibrium behavior. It is interesting that the
resulting trader policy takes on such a simple form: the number of shares to liquidate
at time t is linear in the trader’s position xt−1 , the arbitrageur’s position yt−1 and the
arbitrageur’s estimate µt−1 of xt−1 . The coefficients of the policy depend only on the
relative volume parameter ρ0 , which quantifies the magnitude of the trader’s position
relative to the typical market activity, and the time horizon T . This policy offers useful
guidance beyond what has been derived in models that do not account for arbitrageur
behavior. In the absence of an arbitrageur, it is optimal to trade equal amounts over
each time period, which corresponds to a policy that is linear in xt−1 . The difference
in the PBE policy stems from its accounting of the arbitrageur’s inference process.
In particular, the policy reduces information revealed to the arbitrageur by delaying
trades, and takes advantage of situations where the arbitrageur has been misled by
unusual market activity.
Our model represents a starting point for the study of game theoretic behavior
in trade execution. It has an admittedly simple structure, and this allows for a
tractable analysis that highlights the importance of information signaling. There are
a number of extensions to this model that are possible, however, and that warrant


further discussion:

1. (Flexible Time Horizon) We assume finite time horizons T and T + 1 for the
trader and arbitrageur, respectively. The choice of time horizon has an impact
on the resulting equilibrium policies, and there are clearly end-of-horizon effects
in the policies computed in Section 2.4. To some extent it seems artificial to
impose a fixed time horizon as an exogenous restriction on behavior. Fixed
horizon models preclude the trader from delaying liquidation beyond the horizon
even if this can yield significant benefits, for example. A better model would be
to consider an infinite horizon game, where risk aversion provides the motivation
for liquidating a position sooner rather than later.
2. (Uncertain Trader) In our model, we assume that the arbitrageur is uncertain
of the trader’s position, but that the trader knows everything. A more realistic
model would allow for uncertainty on the part of the trader as well, and would
allow for the arbitrageur to mislead the trader.
3. (Multi-player Games) Our model restricts to a single trader and arbitrageur.
A natural extension would be to consider multiple traders and arbitrageurs that
are uncertain about each others’ positions and must compete in the marketplace
as they unwind. Such a generalized model could be useful for analysis of im-
portant liquidity issues such as those arising from the credit crunch of 2007.

Also of interest are the potential empirical implications of the model. If we make
the assumption that the trade execution horizon is a single day, the observations in
Section 2.4 suggest particular patterns for intraday volume. For example, if ρ0 is
large, the volume traded should be much higher near the end of the day than at other
times. Similarly, the structure of the equilibrium trading policies for the trader and
arbitrageur will generate specific, time-varying auto-correlation in the increments of
the price process. Formulating tests of such empirical predictions would be an inter-
esting direction for future research.

In the second part of this thesis, we have considered a dynamic trading problem
where a trader maximizes expected average risk-adjusted profit while trading a single

security in the presence of unknown price impact. Note that an arbitrageur is ex-
cluded in the second model. Our second problem can be viewed as a special case of
reinforcement learning: The trader can improve longer-term performance significantly
by making decisions that explore efficiently to learn price impact at the expense of
suboptimal short-term behavior such as execution of larger orders than appearing
optimal with respect to current information. Like other reinforcement learning prob-
lems, it is crucial to strike a balance between exploration and exploitation. To this
end, we have proposed the confidence-triggered regularized adaptive certainty equiva-
lent policy (CTRACE) that improves purely exploitative certainty equivalent control
(CE) in our problem. The enhancement is attributed to two properties of CTRACE:
regularization and confidence-triggered update. Regularization encourages active ex-
ploration that accelerates learning as well as reduces the variance of an estimator.
It helps keep CTRACE from being a passive learner due to overestimation of price
impact that abates trading. Confidence-triggered update allows CTRACE to have
monotonically nonincreasing upper bounds on estimation errors so that it reduces
the frequency of overestimation. Using these two properties, we derived a finite-
time expected regret bound for CTRACE of the form O(log2 T ). Finally, we have
demonstrated through Monte Carlo simulation that CTRACE outperforms CE and
a reinforcement learning policy recently proposed in [2].
As an extension to our current model, it would be interesting to develop an efficient
reinforcement learning algorithm for a portfolio of securities. Another interesting
direction is to incorporate prior knowledge of particular structures of the price impact coefficients, e.g., sparsity, into the estimation problem. It would also be worth considering other regularization schemes such as the LASSO.
Appendix A

Proofs for Chapter 2

Theorem 2.1. If the belief distribution $\phi_{t-1}$ is Gaussian, and the arbitrageur assumes that the trader's policy $\hat{\pi}_t$ is linear with $\hat{\pi}_t(x_{t-1},y_{t-1},\phi_{t-1}) = \hat{a}_{x,t}^{\rho_{t-1}}x_{t-1} + \hat{a}_{y,t}^{\rho_{t-1}}y_{t-1} + \hat{a}_{\mu,t}^{\rho_{t-1}}\mu_{t-1}$, then the belief distribution $\phi_t$ is also Gaussian. The mean $\mu_t$ is a linear function of $y_{t-1}$, $\mu_{t-1}$, and the observed price change $\Delta p_t$, with coefficients that are deterministic functions of the scaled variance $\rho_{t-1}$. The scaled variance $\rho_t$ evolves according to
$$\rho_t^2 = \left(1 + \hat{a}_{x,t}^{\rho_{t-1}}\right)^2\left(\frac{1}{\rho_{t-1}^2} + \left(\hat{a}_{x,t}^{\rho_{t-1}}\right)^2\right)^{-1}.$$
In particular, $\rho_t$ is a deterministic function of $\rho_{t-1}$.

Proof. Set $\{K_{t-1}, h_{t-1}\}$ to be the information-form parameters for the Gaussian distribution $\phi_{t-1}$, so that
$$K_{t-1} \triangleq 1/\sigma_{t-1}^2, \qquad h_{t-1} \triangleq \mu_{t-1}/\sigma_{t-1}^2.$$
Define $\phi_{t-1}^{+}$ to be the distribution of $x_{t-1}$ conditioned on all information seen by the arbitrageur at times up to and including t. That is,
$$\phi_{t-1}^{+}(S) \triangleq \Pr\left(x_{t-1} \in S \;\middle|\; \phi_{t-1},\, y_{t-1},\, \lambda(\hat{\pi}_t(x_{t-1},y_{t-1},\phi_{t-1}) + v_t) + \epsilon_t = \Delta p_t\right),$$
where ∆pt is the price change observed at time t. By Bayes’ rule, this distribution


has density
$$\phi_{t-1}^{+}(dx) \propto \phi_{t-1}(dx)\exp\!\left(-\frac{\left(\Delta p_t - \lambda(\hat{\pi}_t(x,y_{t-1},\phi_{t-1}) + \psi_t(y_{t-1},\phi_{t-1}))\right)^2}{2\sigma_\epsilon^2}\right)$$
$$\propto \exp\!\left( -\frac{1}{2}\left(K_{t-1} + \frac{\lambda^2(\hat{a}_{x,t}^{\rho_{t-1}})^2}{\sigma_\epsilon^2}\right)x^2 + \left(h_{t-1} + \frac{\lambda\left(\Delta p_t - \lambda(\hat{a}_{y,t}^{\rho_{t-1}}y_{t-1} + \hat{a}_{\mu,t}^{\rho_{t-1}}\mu_{t-1} + \psi_t)\right)\hat{a}_{x,t}^{\rho_{t-1}}}{\sigma_\epsilon^2}\right)x \right)dx.$$

Thus, $\phi_{t-1}^{+}$ is a Gaussian distribution, with variance
$$\left(K_{t-1} + \frac{\lambda^2(\hat{a}_{x,t}^{\rho_{t-1}})^2}{\sigma_\epsilon^2}\right)^{-1},$$
and mean
$$\left(K_{t-1} + \frac{\lambda^2(\hat{a}_{x,t}^{\rho_{t-1}})^2}{\sigma_\epsilon^2}\right)^{-1}\left(h_{t-1} + \frac{\lambda\left(\Delta p_t - \lambda(\hat{a}_{y,t}^{\rho_{t-1}}y_{t-1} + \hat{a}_{\mu,t}^{\rho_{t-1}}\mu_{t-1} + \psi_t)\right)\hat{a}_{x,t}^{\rho_{t-1}}}{\sigma_\epsilon^2}\right).$$

Now, note that
$$x_t = x_{t-1} + \hat{\pi}_t(x_{t-1},y_{t-1},\phi_{t-1}) = (1+\hat{a}_{x,t}^{\rho_{t-1}})x_{t-1} + \hat{a}_{y,t}^{\rho_{t-1}}y_{t-1} + \hat{a}_{\mu,t}^{\rho_{t-1}}\mu_{t-1}.$$
Then, $\phi_t$ is also a Gaussian distribution, with variance
$$\sigma_t^2 = (1+\hat{a}_{x,t}^{\rho_{t-1}})^2\left(K_{t-1} + \frac{\lambda^2(\hat{a}_{x,t}^{\rho_{t-1}})^2}{\sigma_\epsilon^2}\right)^{-1} = (1+\hat{a}_{x,t}^{\rho_{t-1}})^2\left(\frac{1}{\sigma_{t-1}^2} + \frac{\lambda^2(\hat{a}_{x,t}^{\rho_{t-1}})^2}{\sigma_\epsilon^2}\right)^{-1}, \tag{A.1}$$
and mean
$$\mu_t = \hat{a}_{y,t}^{\rho_{t-1}}y_{t-1} + \hat{a}_{\mu,t}^{\rho_{t-1}}\mu_{t-1} + (1+\hat{a}_{x,t}^{\rho_{t-1}})\,\frac{\mu_{t-1}/\rho_{t-1}^2 + \left(\Delta p_t/\lambda - \hat{a}_{y,t}^{\rho_{t-1}}y_{t-1} - \hat{a}_{\mu,t}^{\rho_{t-1}}\mu_{t-1} - \psi_t\right)\hat{a}_{x,t}^{\rho_{t-1}}}{1/\rho_{t-1}^2 + (\hat{a}_{x,t}^{\rho_{t-1}})^2}. \tag{A.2}$$
The conclusions of the theorem follow immediately. □
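For concreteness, the sketch below (illustrative Python, not part of the dissertation) implements the one-period belief update (A.1)-(A.2) in scaled form, given the arbitrageur's assumed linear policy coefficients.

    def belief_update(mu_prev, rho_prev, y_prev, dp, lam, psi_t, a_x, a_y, a_mu):
        """One step of the arbitrageur's Gaussian belief update (Theorem 2.1).
        Returns the new mean mu_t and scaled standard deviation rho_t."""
        # scaled-variance recursion, cf. (A.1)
        rho_sq = (1.0 + a_x) ** 2 / (1.0 / rho_prev**2 + a_x**2)
        # mean recursion, cf. (A.2)
        innovation = dp / lam - a_y * y_prev - a_mu * mu_prev - psi_t
        mu = (a_y * y_prev + a_mu * mu_prev
              + (1.0 + a_x) * (mu_prev / rho_prev**2 + a_x * innovation)
              / (1.0 / rho_prev**2 + a_x**2))
        return mu, rho_sq ** 0.5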



In order to prove Theorems 2.2–2.4, it is necessary to explicitly evaluate the operator $F_{u_t}^{(\psi_t,\pi_t)}$ applied to quadratic functions of $\{x_t, y_t, \mu_t\}$ and the operator $G_{v_t}^{\pi_t}$ applied to quadratic functions of $\{y_t, \mu_t\}$. The following lemma is helpful for this purpose, as it provides expressions for the expectation of $\mu_t$ and $\mu_t^2$ under a Gaussian distribution.

Lemma A.1. Assume that the policies $\psi_t$ and $\pi_t$ are linear with
$$\pi_t(x_{t-1},y_{t-1},\phi_{t-1}) = a_{x,t}^{\rho_{t-1}}x_{t-1} + a_{y,t}^{\rho_{t-1}}y_{t-1} + a_{\mu,t}^{\rho_{t-1}}\mu_{t-1}, \qquad \psi_t(y_{t-1},\phi_{t-1}) = b_{y,t}^{\rho_{t-1}}y_{t-1} + b_{\mu,t}^{\rho_{t-1}}\mu_{t-1}.$$
Define
$$\gamma_t^{\rho_{t-1}} \triangleq \frac{1 + a_{x,t}^{\rho_{t-1}}}{1/\rho_{t-1}^2 + (a_{x,t}^{\rho_{t-1}})^2}.$$
Then,
$$E_{u_t}^{(\psi_t,\pi_t)}[\,\mu_t \mid x_{t-1},y_{t-1},\phi_{t-1}\,] = a_{y,t}^{\rho_{t-1}}y_{t-1} + a_{\mu,t}^{\rho_{t-1}}\mu_{t-1} + \gamma_t^{\rho_{t-1}}\mu_{t-1}/\rho_{t-1}^2 + \gamma_t^{\rho_{t-1}}a_{x,t}^{\rho_{t-1}}\left(u_t - a_{y,t}^{\rho_{t-1}}y_{t-1} - a_{\mu,t}^{\rho_{t-1}}\mu_{t-1}\right), \tag{A.3a}$$
$$\mathrm{Var}_{u_t}^{(\psi_t,\pi_t)}[\,\mu_t \mid x_{t-1},y_{t-1},\phi_{t-1}\,] = \left(\gamma_t^{\rho_{t-1}}a_{x,t}^{\rho_{t-1}}\sigma_\epsilon/\lambda\right)^2, \tag{A.3b}$$
$$E_{u_t}^{(\psi_t,\pi_t)}[\,\mu_t^2 \mid x_{t-1},y_{t-1},\phi_{t-1}\,] = \mathrm{Var}_{u_t}^{(\psi_t,\pi_t)}[\,\mu_t \mid x_{t-1},y_{t-1},\phi_{t-1}\,] + \left(E_{u_t}^{(\psi_t,\pi_t)}[\,\mu_t \mid x_{t-1},y_{t-1},\phi_{t-1}\,]\right)^2, \tag{A.3c}$$
$$E_{v_t}^{\pi_t}[\,\mu_t \mid y_{t-1},\phi_{t-1}\,] = a_{y,t}^{\rho_{t-1}}y_{t-1} + \left(1 + a_{x,t}^{\rho_{t-1}} + a_{\mu,t}^{\rho_{t-1}}\right)\mu_{t-1}, \tag{A.3d}$$
$$\mathrm{Var}_{v_t}^{\pi_t}[\,\mu_t \mid y_{t-1},\phi_{t-1}\,] = \left(\gamma_t^{\rho_{t-1}}a_{x,t}^{\rho_{t-1}}\sigma_\epsilon/\lambda\right)^2\left(1 + \left(a_{x,t}^{\rho_{t-1}}\right)^2\rho_{t-1}^2\right), \tag{A.3e}$$
$$E_{v_t}^{\pi_t}[\,\mu_t^2 \mid y_{t-1},\phi_{t-1}\,] = \mathrm{Var}_{v_t}^{\pi_t}[\,\mu_t \mid y_{t-1},\phi_{t-1}\,] + \left(E_{v_t}^{\pi_t}[\,\mu_t \mid y_{t-1},\phi_{t-1}\,]\right)^2. \tag{A.3f}$$

Proof. The lemma follows directly from taking expectations of the mean update equation (A.2). □

Theorem 2.2. If $U_t^*$ is TQD and $V_t^*$ is AQD, and Step 3 of Algorithm 2 produces a linear pair $(\pi_t^*, \psi_t^*)$, then $U_{t-1}^*$ and $V_{t-1}^*$, defined by Step 4 of Algorithm 2, are TQD and AQD, respectively.

Proof. Suppose that


!
1 ρt 1 ρt σ2
Vt∗ (yt , φt ) = −λ d y2
2 yy,t t
+ d µ2
2 µµ,t t
+ dρyµ,t
t
yt µt − 2 dρ0,tt ,
λ
ρ ρ ρ
πt∗ (xt−1 , yt−1 , φt−1 ) = ax,tt−1 xt−1 + ay,tt−1 yt−1 + aµ,t
t−1
µt−1 ,
ρ ρ
ψt∗ (yt−1 , φt−1 ) = by,tt−1 yt−1 + bµ,t
t−1
µt−1 .

If the trader uses the policy πt∗ and the arbitrageur uses the policy ψt∗ , we have

ρ ρ ρ
ut = ax,tt−1 xt−1 + ay,tt−1 yt−1 + aµ,t
t−1
µt−1 ,
ρ ρ
vt = by,tt−1 yt−1 + bµ,t
t−1
µt−1 ,
ρ ρ
yt = yt−1 + by,tt−1 yt−1 + bµ,t
t−1
µt−1 .

Using these facts, Theorem 2.1, and (A.3d)–(A.3f) from Lemma A.1, we can explicitly
compute
 π∗ 

Vt−1 (yt−1 , φt−1 ) = Gψtt∗ V (yt−1 , φt−1 )
 
π∗
= Eψtt∗ λ(ut + vt )yt−1 + Vt∗ (yt , φt ) yt−1 , φt−1
!
1 ρt−1 1 ρt−1 ρt−1 σ 2 ρt−1
= −λ d y2
2 yy,t−1 t
+ d µ2
2 µµ,t−1 t
+ dyµ,t−1 yt µt − 2 d0,t−1 ,
λ

where
!−1

ρ
2 1 ρ
ρ2t = 1 + âx,tt−1 + (âx,tt−1 )2
,
ρ2t−1
(dρyµ,t )2 dρyµ,t
t
! t
!
ρ ρt 1
t−1
dyy,t−1 = dyy,t − ρt (ay,t ) + 2 ρt − 1 aρy,tt − ρt + 2,
ρt 2
dyy,t dyy,t dyy,t
ρt ρt
(dyµ,t )2 ρt
! !
ρ d
t−1
dyµ,t−1 = −aρµ,t
t
− aρx,tt + yµ,t ρt
ρt + dµµ,t − ρt ay,t (1 + aρx,tt + aρy,tt ),
dyy,t dyy,t

(dρyµ,t )2
t
!
ρt−1
dµµ,t−1 = dρµµ,t
− ρt
t
(1 + aρx,tt + aρy,tt )2 ,
dyy,t
dρt σ 2 
  
ρ ρ
t−1
d0,t−1 = dρ0,tt + µµ,t aρx,tt γt t−1 1 + (ρt−1 aρx,tt )2 .
2 λ
Therefore, $V_{t-1}^*$ is AQD. Similarly, we can check that $U_{t-1}^*$ is TQD. □

Theorem 2.3. Suppose that Ut∗ and Vt∗ are TQD/AQD value functions specified by
(2.10)–(2.11), and (πt∗ , ψt∗ ) are linear policies specified by (2.7)–(2.8). Assume that,
for all ρt−1 , the policy coefficients satisfy the first order conditions
  3   2
ρ ρ
0 = ρ2t cρµµ,t
t
+ 2ρt cρxµ,t
t
+ cρxx,t
t
ax,tt−1 + 3cρxx,t
t
+ 3ρt cρxµ,t
t
− 1 ax,tt−1
  (A.4)
ρ
+ 3cρxx,t
t
+ ρt cρxµ,t
t
− 2 ax,tt−1 + cρxx,t
t
− 1,
  
ρ
ρ
by,tt−1 + 1 cρxy,t
t
+ αt cρyµ,t
t

ay,tt−1 = − , (A.5)
cρxx,t
t
+ (αt + 1)cρxµ,t
t
+ αt cρµµ,t
t

   
ρ ρ
ρ
t−1
ax,tt−1 bµ,t
t−1
cρxy,t
t
+ αt cρyµ,t
t
+ αt cρxµ,t
t
+ αt cρµµ,t
t
/ρ2t−1
aµ,t =− ρ
  , (A.6)
ax,tt−1 cρxx,t
t
+ (αt + 1)cρxµ,t
t
+ αt cρµµ,t
t

ρ ρ ρ
ρ 1 − dρyµ,t
t
ay,tt−1 ρt−1
t−1
(1 + aµ,t + ax,tt−1 )dρyµ,t
t

by,tt−1 = − 1, bµ,t =− , (A.7)


dρyy,t
t
dρyy,t
t

and the second order conditions

cρxx,t
t
+ (αt + 1)cρxµ,t
t
+ αt cρµµ,t
t
> 0, dρyy,t
t
> 0, (A.8)

where the quantities αt and ρt satisfy


 
ρ ρ !−1
ax,tt−1 1 + ax,tt−1 
ρ
2 1 
ρ
2
αt = 2 , ρ2t = 1+ ax,tt−1 + ax,tt−1 . (A.9)
ρ2t−1

ρ
1/ρ2t−1 + ax,tt−1

Then, (πt∗ , ψt∗ ) satisfy the single-stage equilibrium conditions

∗ ∗
 
πt∗ (xt−1 , yt−1 , φt−1 ) ∈ argmax Fu(ψt t ,πt ) Ut∗ (xt−1 , yt−1 , φt−1 ), (A.10)
ut

 
ψt∗ (yt−1 , φt−1 ) ∈ argmax Gπvtt Vt∗ (yt−1 , φt−1 ), (A.11)
vt

for all xt−1 , yt−1 , and Gaussian φt−1 .

Proof. As we will discuss in the proof of Theorem 2.4, the optimizing value u∗t
in (A.10) is a linear function of xt−1 , yt−1 and zt−1 , whose coefficients depend on
ρ ρ ρ ρ ρ
{ax,tt−1 , ay,tt−1 , aµ,t
t−1
, by,tt−1 , bµ,t
t−1
}. By equating the coefficients of {xt−1 , yt−1 , zt−1 } with
ρ ρ ρ
{ax,tt−1 , ay,tt−1 , aµ,t
t−1
}, respectively, we can obtain (A.4), (A.5) and (A.6). (A.7) can be
derived by considering (A.11) in the same way. (A.8) corresponds to the second order
conditions for the two maximization problems. 

Theorem 2.4. If Ut is TQD, ψt is linear, and π̂t is linear, then there exists a linear
πt such that
 
πt (xt−1 , yt−1 , φt−1 ) ∈ argmax Fu(ψt t ,π̂t ) Ut (xt−1 , yt−1 , φt−1 ),
ut

for all xt−1 , yt−1 , and Gaussian φt−1 , so long as the optimization problem is bounded.
Similarly, if Vt is AQD and πt is linear then there exists a linear ψt such that
 
ψt (yt−1 , φt−1 ) ∈ argmax Gπvtt Vt (yt−1 , φt−1 ),
vt

for all yt−1 and Gaussian φt−1 , so long as the optimization problem is bounded.

Proof. Suppose that

1 ρt
Ut (xt , yt , φt ) = −λ c x2
2 xx,t t
+ 12 cρyy,t
t
yt2 + 12 cρµµ,t
t
µ2t
!
σ2
+ cρxy,t
t
xt yt + cρxµ,t
t
x t µt + cρyµ,t
t
yt µt − 2 cρ0,tt ,
λ
ρ ρ ρ
π̂t (xt−1 , yt−1 , φt−1 ) = âx,tt−1 xt−1 + ây,tt−1 yt−1 + âµ,t
t−1
µt−1 ,
ρ ρ
ψt (yt−1 , φt−1 ) = by,tt−1 yt−1 + bµ,t
t−1
µt−1 .

If the trader takes the action ut , while the arbitrageur uses the policy ψt∗ and assumes
that the trader uses the policy π̂t , we have

ρ ρ
vt = by,tt−1 yt−1 + bµ,t
t−1
µt−1 ,
xt = xt−1 + ut ,

ρ ρ
yt = yt−1 + by,tt−1 yt−1 + bµ,t
t−1
µt−1 .

Using these facts, Theorem 2.1, and (A.3a)–(A.3c) from Lemma A.1, we can explicitly
compute
 
Fu(ψt t ,π̂t ) Ut (xt−1 , yt−1 , φt−1 )
t ,π̂t )
= E(ψ
ut [ λ(ut + vt )xt−1 + Ut (xt , yt , φt ) | xt−1 , yt−1 , φt−1 ] .
 
It is easy to see that Fu(ψt t ,π̂t ) Ut (xt−1 , yt−1 , φt−1 ) is quadratic in ut . Moreover,
the coefficient of u2t is independent of {xt−1 , yt−1 , µt−1 } while the coefficient of ut
is linear in {xt−1 , yt−1 , µt−1 }. Therefore, the optimizing u∗t is a linear function of
{xt−1 , yt−1 , µt−1 }, whose coefficients can be computed by substitution and rearrange-
ment of the associated terms.
Similarly, suppose that
!
1 ρt 1 ρt σ2
Vt (yt , φt ) = −λ d y2
2 yy,t t
+ d µ2
2 µµ,t t
+ dρyµ,t
t
y t µt − 2 dρ0,tt ,
λ
ρ ρ ρ
πt (xt−1 , yt−1 , φt−1 ) = ax,tt−1 xt−1 + ay,tt−1 yt−1 + aµ,t
t−1
µt−1 .

If the arbitrageur takes the action vt and assumes that the trader uses the policy πt ,
we have
ρ ρ ρ
ut = ax,tt−1 xt−1 + ay,tt−1 yt−1 + aµ,t
t−1
µt−1 ,
yt = yt−1 + vt .

Using these facts, Theorem 2.1, and (A.3d)–(A.3f) from Lemma A.1, we can explicitly
compute
 
Gπvtt Vt (yt−1 , φt−1 ) = Eπvtt [ λ(πt + vt )yt−1 + Vt (yt , φt ) | yt−1 , φt−1 ] .
 
It can be easily checked that Gπvtt Vt (yt−1 , φt−1 ) is quadratic in vt . Moreover, the
coefficient of vt2 is independent of {yt−1 , µt−1 } while the coefficient of vt is linear in
{yt−1 , µt−1 }. Therefore, the optimizing vt∗ is a linear of {yt−1 , µt−1 }, whose coefficients
can be computed by substitution and rearrangement of the associated terms. 
Appendix B

Proofs for Chapter 3


Theorem 3.1. For any θ∗ ∈ Θ, there exists a unique symmetric solution P to (3.3)
that satisfies R + B > P B > 0 and ρsr (A + BL) < 1 where L = −(R + B > P B)−1 (S > +
B > P A) and ρsr (·) denotes a spectral radius. Moreover, a policy π = (π1 , π2 , . . .) with
πt (It−1 ) = Lzt−1 is an optimal policy among admissible policies that attains minimum
expected average cost tr(P Ω̃).

Proof. First, we will show the existence of a solution P to (3.3) that satisfies R +
B > P B > 0 and ρsr (A + BL) < 1. One challenge is that the matrix pair (A, B) is
not controllable because return-predictive factors ft evolve over time independently
of actions ut . This autonomous factor dynamics lead us into a useful observation:
When we rewrite (3.3) in terms of
 
1 > 1 >
Pxx P
2 xd
P
2 xf
 
 21 Pxd
P = Pdd 1
P ,

 2 df 
1 1 >
P
2 xf
P
2 df
Pf f

we obtain
(2ρΣ − λ∗ − γ ∗> 1 + 2Pxx + Pxd>
1)2
Pxx = ρΣ + Pxx − >
(B.1)
4(ρΣ + Pxx + Pxd 1 + 1> Pdd 1)

>
Pxd = γ ∗> (I − diag(r)) + Pxd
>
diag(r)
(2ρΣ − λ∗ − γ ∗> 1 + 2Pxx + Pxd
>
1)(21> Pdd diag(r) + Pxd
>
diag(r)) (B.2)
− > >
2(ρΣ + Pxx + Pxd 1 + 1 Pdd 1)


Pdd = diag(r)> Pdd diag(r)


(2diag(r)> Pdd
>
1 + diag(r)> Pxd )(21> Pdd diag(r) + Pxd
>
diag(r)) (B.3)
− > >
4(ρΣ + Pxx + Pxd 1 + 1 Pdd 1)

> > (2ρΣ − λ∗ − γ ∗> 1 + 2Pxx + Pxd


>
1)(1> Pdf Φ + Pxf
>
Φ)
Pxf = Pxf Φ − g> − > >
(B.4)
2(ρΣ + Pxx + Pxd 1 + 1 Pdd 1)

(2diag(r)> Pdd
>
1 + diag(r)> Pxd )(1> Pdf Φ + Pxf
>
Φ)
Pdf = diag(r)> Pdf Φ − >
(B.5)
2(ρΣ + Pxx + Pxd 1 + 1> Pdd 1)

(Φ> Pdf> 1 + Φ> Pxf )(1> Pdf Φ + Pxf


>
Φ)
Pf f = Φ> Pf f Φ − > >
(B.6)
4(ρΣ + Pxx + Pxd 1 + 1 Pdd 1)
and the associated optimal policy coefficients L = [Lx Ld Lf ] are given by

ρΣ − 21 (λ + 1> γ) + Pxx + 12 1> Pxd


Lx = − (B.7)
ρΣ + Pxx + 1> Pxd + 1> Pdd 1

>
( 21 Pxd + 1> Pdd )diag(r)
Ld = − (B.8)
ρΣ + Pxx + 1> Pxd + 1> Pdd 1

>
+ 1> Pdf> )Φ
1
2
(Pxf
Lf = − . (B.9)
ρΣ + Pxx + 1> Pxd + 1> Pdd 1
Note that (B.1), (B.2) and (B.3) involve only (Pxx , Pxd , Pdd ) which can be completely
determined by these three equations. Furthermore, (B.7) and (B.8) depend only on
(Pxx , Pxd , Pdd ). It implies that we can obtain the same (Pxx , Pxd , Pdd , Lx , Ld ) when
setting g = 0. This is not surprising because evolution of security positions xt and
exponential averages dt is affected by actions ut but evolution of return-predictive
factors ft is not. This observation makes it sufficient to work on a reduced problem
  
T h
X
>
i Q̃ S̃ zt−1
min lim sup zt−1 ut    s.t. zt = Ãzt−1 + B̃ut , ut = πt (It−1 )
π∈Π T →∞ t=1 S̃ > R̃ ut

where

   
h i 1 0 1 h i
zt> = xt d>
t , Ã =  , B̃ =  , v> = 0 γ ∗> (diag(r) − I) ,
0 diag(r) 1

1 > 1 ∗
Q̃ = ρΣ e1 e> >
1 − (ve1 + e1 v ), S̃ = ρΣ e1 − (λ + γ
∗>
1)e1 , R̃ = ρΣ .
2 2
Now, (Ã, B̃) is controllable and this problem is a special case of the problem considered
in [38]. By Theorem 1 in [38], there exists a desired P̃ if Ψ(z) > 0 for all z on the
unit circle where
  
h i Q̃ S̃ (Iz − Ã)−1 B̃
Ψ(z) , B̃ > (Iz −1 − Ã> )−1 I   .
S̃ > R̃ I

In our problem, it is not difficult to check that for any φ ∈ (0, 2π), λ ≥ 0 and γi ≥ 0,
M
ρΣ λ X 2γm (1 − rm cos φ)
Ψ(eiφ ) = + + >0
2(1 − cos φ) 2 m=1 1 + rm
2 − 2r cos φ
m

and limφ→0 Ψ(eiφ ) = ∞ > 0. Therefore, there exists a desired (Pxx , Pxd , Pdd ) that
satisfies the second-order optimality condition, and stability of the closed-loop system
matrix Ã+ B̃[Lx Ld ]. Noting an upper block diagonal structure of the original closed-
loop system matrix A+BL, we can easily see that stability carries over to our original
problem with g 6= 0. Given (Pxx , Pxd , Pdd ), we can compute (Pxf , Pdf , Pf f ) as follows:

1. Compute a and b using (Pxx , Pxd , Pdd ).

2ρΣ − λ∗ − γ ∗> 1 + 2Pxx + Pxd


>
1 2diag(r)> Pdd
>
1 + diag(r)> Pxd
a, > >
, b, >
2(ρΣ + Pxx + Pxd 1 + 1 Pdd 1) 2(ρΣ + Pxx + Pxd 1 + 1> Pdd 1)

2. Solve for Pdf using vectorization and Kronecker product. We assume that the
associated linear equation is nonsingular.

Pdf = (diag(r)> −b1> )Pdf Φ+b(g > +a1> Pdf Φ)(I −(1−a)Φ)−1 Φ (linear in Pdf )

3. Solve for Pxf and Pf f using Pdf .

>
Pxf = −(g > + a1> Pdf Φ)(I − (1 − a)Φ)−1


X (Φ> )i (g + (I − Φ> )Pxf )(g > + Pxf
>
(I − Φ))Φi
Pf f = −
i=0 4ρΣ

Finally, uniqueness of a stabilizing solution follows from [32].


Next, we will show that a policy π with π(It−1 ) = Lzt−1 is an optimal policy
among admissible policies. It is not difficult to see that for any sequence of controls
{ui } generated by an admissible policy and any t ≥ 1

zt P zt − zt−1 P zt−1 = (ut − Lzt−1 )> (R + B > P B)(ut − Lzt−1 ) + 2(Azt−1 + But )> P Wt
 
Wt> P Wt − zt−1
> >
Qzt−1 + zt−1 Sut + u> > >
t S zt−1 + ut Rut

using (3.3) and definition of L. Although ut is a scalar, the above equality holds for
vector-valued ut and thus we treat it as a vector for generality. Taking the sum of
both sides over t = 1, . . . , T , we can obtain

T  
> >
Sut + u> > >
X
zt−1 Qzt−1 + zt−1 t S zt−1 + ut Rut
t=1
T T
= z0> P z0 − zT> P zT + 2(Azt−1 + But )> P Wt + Wt> P Wt
X X

t=1 t=1
T
(ut − Lzt−1 )> (R + B > P B)(ut − Lzt−1 )
X
+
t=1
T T
≥ z0> P z0 − zT> P zT + 2(Azt−1 + But )> P Wt + Wt> P Wt
X X

t=1 t=1

The inequality follows from R + B > P B > 0. By the law of total expectation and the
fact that zt−1 and ut are Ft−1 -measurable,
" T # T h i
>
E 2(Azt−1 + But )> P E [Wt |Ft−1 ] = 0
X X
E 2(Azt−1 + But ) P Wt =
t=1 t=1

since E [Wt |Ft−1 ] = 0 for all t ≥ 1. Finally, we have

T 
" #
1 X > >

E zt−1 Qzt−1 + zt−1 Sut + u> > >
t S zt−1 + ut Rut
T t=1

T
" #
1 1 > 1 X
 
≥ z0> P z0 − E zT P zT + E W > P Wt → tr(P Ω̃) as T → ∞
T T T t=1 t

since z0> P z0 /T → 0 and |zT> P zT /T | ≤ |P |kzT k2 /T → 0 a.s. as T → ∞ by ad-


missibility of the policy that generates ut . The equality holds if ut = Lzt−1 , zt =
(A + BL)zt−1 + Wt , ∀t ≥ 1. 

Lemma 3.1. For any 0 < ξ < 1, there exists N ∈ N being independent of θ such
that kGN (θ)k ≤ ξ for all θ ∈ Θ. Thus, max0≤i≤N −1 supθ∈Θ kGi (θ)k , Cg is finite.
For any fixed θ ∈ Θ, kzt k ≤ Cg kz0 k + Cg Cω /(ξ(1 − ξ 1/N )) , Cz , ∀t ≥ 0 a.s. where
zt = G(θ)zt−1 + Wt . Moreover, supθ∈Θ kU (θ)k ≤ Cg + 1.

Proof. By Theorem 3.1, ρsr (G(θ)) < 1 for all θ ∈ Θ. Since Θ is a compact set and As-
sumption 2-(a) implies the continuity of L(θ) and G(θ), it follows that supθ∈Θ kG(θ)k
is finite and supθ∈Θ ρsr (G(θ)) < 1. Therefore, by Theorem in [13], {Gn (θ)} uni-
formly converges to zero matrix. That is, for any 0 < ξ < 1, there exists N ∈
N being independent of θ such that kGN (θ)k ≤ ξ for all θ ∈ Θ. Also, Cg ,
max0≤i≤N −1 supθ∈Θ kGi (θ)k is finite by continuity of G(θ) and compactness of Θ.
For any t ≥ 0, it is easy to see that kGt (θ)k ≤ Cg ξ bt/N c . Since zt = Gt (θ)z0 +
Pt
i=1 Gt−i (θ)Wt ,

t t
kzt k ≤ kGt (θ)kkz0 k + kGt−i (θ)kkWt k ≤ Cg ξ bt/N c kz0 k + Cg ξ b(t−i)/N c Cω
X X

i=1 i=1
t
Cg Cω
ξ (t−i)/N −1 ≤ Cg kz0 k +
X
≤ Cg kz0 k + Cg Cω = Cz a.s.
i=1 ξ(1 − ξ 1/N )

Since U (θ) = (G(θ))1:M +1,∗ − [I 0], it follows that kU (θ)k ≤ k(G(θ))1:M +1,∗ k +
k[I 0]k ≤ Cg + 1. 

Lemma 3.2. For any θ ∈ Θ, let $u_t = L(\theta)z_{t-1}$, $z_t = G(\theta)z_{t-1} + W_t$ and $\psi_t^\top = [u_t \ (d_t - d_{t-1})^\top] = (U(\theta)z_{t-1})^\top$. Also, let $\Pi_{zz}(\theta)$ denote the unique solution to $\Pi_{zz}(\theta) = G(\theta)\Pi_{zz}(\theta)G(\theta)^\top + \tilde{\Omega}$. Then,
$$\lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} \psi_t \psi_t^\top = U(\theta)\Pi_{zz}(\theta)U(\theta)^\top \succ 0 \quad \text{a.s.} \tag{B.10}$$

Proof. For notational simplicity, let G = G(θ), L = L(θ), U = U (θ) and Πzz =
PT
Πzz (θ). The almost-sure convergence of 1
T t=1 ψt ψt> follows from Lemma 2 in [8].
For completeness, we present a proof.

T T
1X 1X
Wt Wt> = (zt − Gzt−1 )(zt − Gzt−1 )>
T t=1 T t=1
T T T T
!
1
zt zt> zt−1 zt> >
G> >
G>
X X X X
= −G − zt zt−1 +G zt−1 zt−1
T t=1 t=1 t=1 t=1
T T
!
1 > 1 >
G>
X X
= zt−1 zt−1 −G zt−1 zt−1
T t=1 T t=1
T T
! !
1 1X 1 1
zt−1 Wt> >
G> − z0 z0> + zT zT>
X
−G − Wt zt−1
T t=1 T t=1 T T

1 PT
Note that each entry in the third and fourth term is of the form T t=1 (zt−1 )i (Wt )j .
PT 1
It is easy to see that t=1 t (zt−1 )i (Wt )j is a martingale with variance bounded by
PT 1 2 P∞ 1 2 PT 1
t=1 t2 Cz (Ω̃)jj . Since t=1 t2 Cz (Ω̃)jj < ∞, it follows that t=1 t (zt−1 )i (Wt )j con-
1 PT
verges a.s. and, by Kronecker’s lemma, T t=1 (zt−1 )i (Wt )j converges to zero a.s. Also,
1
z z > and T1 zT zT> converges to zero a.s. because kzT k ≤ Cz a.s. Therefore,
T 0 0

T T
! !
1X > 1X >
lim zt−1 zt−1 −G lim zt−1 zt−1 G> = Ω̃ a.s.
T →∞ T T →∞ T
t=1 t=1

1 PT >
and it follows that limT →∞ T t=1 zt−1 zt−1 = Πzz a.s. Finally,

T T
1X 1X
lim ψt ψt> = lim U >
zt−1 zt−1 U > = U Πzz U > a.s. 
T →∞ T T →∞ T
t=1 t=1

It is easy to see that U is full-rank since (L)1 6= 0. Therefore, it is sufficient to show


that Πzz is positive definite. Since G is a stable matrix and Ω̃  0,

∞ MX
+K
Gi Ω̃(G> )i  Gi Ω̃(G> )i = HH >
X
Πzz =
i=0 i=0

where
h i 2
H= Ω̃1/2 GΩ̃1/2 . . . GM +K Ω̃1/2 ∈ R(M +K+1)×(M +K+1)

If H is full-rank, then HH > is positive definite and so is Πzz . Now, we show that H
is full-rank. In particular, we will show that {(G)1:M +1,M +2 , . . . , (GM +1 )1:M +1,M +2 }
is linearly independent where (G)m:n,k indicates a column vector consisting of the
entries Gm,k , Gm+1,k . . . , Gn,k . To this end, we claim that for each i = 1, . . . , M + 1
there exists a polynomial gi of degree i − 1 and a constant hi independent of rj ’s such
that  
 gi (1) 
 gi (r1 ) 
 
 
..
(Gi )∗,M +2
 
=
 . 

 

 gi (rM ) 

 
hi

where (Gi )∗,M +2 indicates the (M + 2)nd column vector of Gi . For i = 1, it is clear
that g1 (r) = (L)M +2 and h1 = (Φ)∗:1 . Suppose that the induction hypothesis holds
for i ≥ 1. Then, it can be shown that
 
 gi (1) + (L)M +2 (hi )1 
r1 gi (r1 ) + (L)M +2 (hi )1
 
 
 
 
..
(Gi+1 )∗,M +2 = (A + BL)Gi
 
=  . 
∗,M +2 




 rM gi (rM ) + (L)M +2 (hi )1 

 
Φhi

that is, gi+1 (r) = rgi (r) + (L)M +2 (hi )1 and hi+1 = Φhi , which concludes the induction
step. In fact, it is easy to see that

i−1
(Φm )1,1 ri−1−m , hi = (Φi )∗,1 ,
X
gi (r) = (L)M +2 i = 1, . . . , M + 1.
m=0

Since each gi (r) is a polynomial of degree i − 1 and its leading coefficient is all
(L)M +2 6= 0, we can transform [(G)1:M +1,M +2 , . . . (GM +1 )1:M +1,M +2 ] into a well-
known Vandermonde matrix

 
 1 1 1 ··· 1 
r12 · · · r1M 
 
 1 r1
 
 . .. .. . . . .. 
 .
 . . . . 

 
2 M
1 rM rM ··· rM

through elementary row operations and thus [(G)1:M +1,M +2 , . . . (GM +1 )1:M +1,M +2 ] is
nonsingluar.
Now, suppose α> H = 0 for some α ∈ RM +K+1 . By definition of H, it is equivalent
to α> Gi Ω̃1/2 = 0, i = 0, 1, . . . , M + K. By the sparsity pattern of Ω̃ and positive
definiteness of Ω, α> Ω̃1/2 = 0 implies (α)M +2:M +K+1 = 0. Also, for i ≥ 1
h i h h i i
>
α1:M +1 0 Gi Ω̃1/2 = 0 0 >
α1:M i
+1 (G )1:M +1,M +2 ∗ Ω
1/2 =0

> i
implies that α1:M +1 (G )1:M +1,M +2 = 0, i = 1, . . . , M + K. Finally, we have

h i
>
α1:M +1 (G)1:M +1,M +2 · · · (GM )1:M +1,M +2 = 0.

>
It follows from nonsingularity of [(G)1:M +1,M +2 , . . . (GM +1 )1:M +1,M +2 ] that α1:M +1 =
0. Therefore, α = 0 and we may conclude that H is full-rank. 

Corollary 3.1. (a) $\Pi_{zz}(\theta)$ is continuous on Θ. (b) $\lambda_{\psi\psi}^* \triangleq \inf_{\theta\in\Theta}\lambda_{\min}\!\left(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top\right) > 0$.
Proof. First, we show that $E\big[\frac{1}{T}\sum_{t=1}^{T} z_{t-1}z_{t-1}^\top\big]$ uniformly converges to $\Pi_{zz}(\theta)$ on Θ.
1 2T /N
Since T
and ξ are monotonically decreasing to zero as T tends to infinity, for any
kz0 k2 Cg2 ξ 2/N ξ 2(T −1)/N
 
1
 > 0 there exists T̃ > 0 such that T
+ Ω̃ ξ2 T (1−ξ 2/N )2
+ 1−ξ 2/N
≤ , ∀T ≥ T̃ .
Therefore,
T −1 X ∞
1 TX t−1
" #
1X > 1
z0 z0> + Gi Ω̃(G> )i − Gt−1 Ω̃(G> )t−1
X
E zt−1 zt−1 − Πzz (θ) =
T t=1 T T t=1 i=0 t=1
TX−1 ∞
1 t−1
 
z0 z0> + Gt−1 Ω̃(G> )t−1 − Gt−1 Ω̃(G> )t−1
X
= 1−
T t=1 T t=1
TX−1 ∞
1 > t − 1 t−1 > t−1
Gt−1 Ω̃(G> )t−1
X
= z0 z0 + G Ω̃(G ) +
T t=1 T t=T

−1 ∞
kz0 k2 TX t−1
Gt−1 Ω̃(G> )t−1 + Gt−1 Ω̃(G> )t−1
X
≤ +
T t=1 T t=T
−1 ∞
kz0 k2 TX t−1
Gt−1 (G> )t−1 + Gt−1 (G> )t−1
X
≤ + Ω̃ Ω̃
T t=1 T t=T
−1 ∞
TX
1 Cg2 2(t−1)/N Cg2 2(t−1)/N
!
2
kz0 k t− X
≤ + Ω̃ ξ + ξ
T t=1 T ξ2 t=T ξ2
Cg2 ξ 2(T −1)/N
!
kz0 k2 1 ξ 2/N
≤ + Ω̃ + ≤ , ∀T ≥ T̃ .
T ξ2 T (1 − ξ 2/N )2 1 − ξ 2/N

Uniform convergence follows from the fact that T̃ does not depend on θ. Since
h PT i PT −1 Pt−1
>
E 1
T t=1 zt−1 zt−1 =
1
z z>
T 0 0
+ 1
T t=1 i=0 Gi Ω̃(G> )i is continuous in θ ∈ Θ for all
T ≥ 1, the limiting matrix Πzz (θ) is continuous in θ ∈ Θ component-wise. Therefore,
so is U (θ)Πzz (θ)U (θ)> .
By Assumption 2-(a), L(θ) is continuous on Θ. Hence, G(θ) and U (θ), which
are an affine function of L(θ), are continuous on Θ and so is U (θ)Πzz (θ)U (θ)> .
 
Finally, λmin U (θ)Πzz (θ)U (θ)> is continuous on Θ. Since Θ is a compact set
 
and λmin U (θ)Πzz (θ)U (θ)> > 0 for all θ ∈ Θ, it follows from its continuity that
 
inf θ∈Θ λmin U (θ)Πzz (θ)U (θ)> > 0. 

Lemma 3.3. For any θ ∈ Θ, let $u_t = L(\theta)z_{t-1}$, $z_t = G(\theta)z_{t-1} + W_t$ and $\psi_t^\top = [u_t \ (d_t - d_{t-1})^\top] = (U(\theta)z_{t-1})^\top$. Then, there exists an event $B(\delta)$ with $\Pr(B(\delta)) \geq 1-\delta$ such that on $B(\delta)$
$$\frac{7}{8}\, U(\theta)\Pi_{zz}(\theta)U(\theta)^\top \preceq \frac{1}{T}\sum_{t=1}^{T}\psi_t\psi_t^\top \preceq \frac{17}{16}\, U(\theta)\Pi_{zz}(\theta)U(\theta)^\top \quad \forall T \geq T_1(\theta,\delta), \ \text{where}$$
$$T_1(\theta,\delta) = 4\left(\frac{32(C_z C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{\min}(\Pi_{zz}(\theta))}\right)^2 \log\!\left(\frac{(M+K+2)^4}{432\,\delta^2}\right) \vee\, 8\left(\frac{32(C_z C_g)^2(M+K+1)}{\xi^2(1-\xi^{2/N})\lambda_{\min}(\Pi_{zz}(\theta))}\right)^3 \vee\, 216.$$

Proof. We use a special case of Corollary 1 of [1] that can be obtained by setting
mk = 1 for all k.

Corollary 1 of [1] Let {ηk } be a real-valued martingale difference process adapted


to the filtration {Fk }. Assume that ηk is conditionally sub-Gaussian in the sense
that there exists some R > 0 such that for any γ ∈ R, k ≥ 1, E[exp(γηk )|Fk−1 ] ≤
exp (γ 2 R2 /2) a.s. Then, for any a > 0
 s 
t
a+t
X  
Pr  ηk ≤ R (a + t) log , ∀t ≥ 1 ≥ 1 − δ.
k=1 aδ 2

Let ei ∈ RM +K+1 denote an elementary vector whose entries are all zero except for
ith entry being one and ηij,k = e> > > >
i zk zk ej − ei E[zk zk |Fk−1 ]ej , 1 ≤ i, j, ≤ M + K + 1.
Since |ηij,k | ≤ 2Cz2 a.s., {ηij,k } is an almost-surely bounded martingale difference
process adapted to {Fk } and thus it is conditionally sub-Gaussian. In particular,
E[exp(γηij,k )|Fk−1 ] ≤ exp (γ 2 (2Cz2 )2 /2) a.s. for all 1 ≤ i, j ≤ M + K + 1. Hence, for
all 1 ≤ i, j ≤ M + K + 1 and any a > 0
 s 
t
(M + K + 2)4 (a + t) 2δ
X  
Pr  ηij,k ≤ 2Cz2 (a + t) log 2
, ∀t ≥ 1 ≥ 1 − .
k=1
4aδ (M + K + 2)2

It follows from union bound and the fact that ηij,k = ηji,k that
 s 
t
(M + K + 2)4 (a + t)
X  
Pr  ηij,k ≤ 2Cz2 (a + t) log , 1 ≤ i, j ≤ M + K + 1, ∀t ≥ 1
k=1
4aδ 2

≥ 1 − δ.

Since E[zk zk> |Fk−1 ] = G(θ)zk−1 zk−1


>
G(θ)> + Ω̃,

t t
! !
1X 1X
Pr zk zk> − G(θ) >
zk−1 zk−1 G(θ)> − Ω̃
t k=1 t k=1
ij
s !
2Cz2 (M + K + 2)4 (a + t)
 
≤ (a + t) log , 1 ≤ i, j ≤ M + K + 1, ∀t ≥ 1 ≥ 1 − δ.
t 4aδ 2

Noting that a + t ≤ 2t for all t ≥ a, we obtain that for an arbitrary  > 0


s s
2Cz2 (M + K + 2)4 (a + t) 2 (M + K + 2)4 t
   
(a + t) log ≤ 2Cz2 log ≤ (B.11)
t 4aδ 2 t 2aδ 2

if ! 2 2
2 (M + K + 2)4 1  2 log t 1 
 
log ≤ and ≤ .
t 2aδ 2 2 2Cz2 t 2 2Cz2

2Cz2 2 (M +K+2)4
   
The first inequality holds if t ≥ 4 
log 2aδ 2
. Using log t ≤ t1/3 for all
2Cz2 3
 
t ≥ 216, we can show that the second inequality holds if t ≥ 8 
∨216. Therefore,
2 2 4 2 3
     
2Cz (M +K+2)
(B.11) holds for t > 4 
log 2aδ 2
∨ 8 2C z ∨ a ∨ 216 , t∗ (δ, , a). For
 P 
1 Pt > 1 t > >
notational convenience, let Yt = t k=1 zk zk − G(θ) t k=1 zk−1 zk−1 G(θ) − Ω̃.
Then,
 

Pr| (Yt )ij | ≤ , 1 ≤ i, j ≤ M + K + 1, ∀t ≥ t∗ (δ, , a) ≥ 1 − δ.

On the above event, kYt k ≤ kYt kF ≤ (M + K + 1) and thus with probability at least
1−δ
−(M + K + 1)I  Yt  (M + K + 1)I, ∀t ≥ t∗ (δ, , a).

First, we focus on −(M + K + 1)I  Yt . We can rewrite this as

t t
!
1X > 1X >
zk−1 zk−1  G(θ) zk−1 zk−1 G(θ)> + Ω̃
t k=1 t k=1
1 > 1 >
 
− (M + K + 1)I + z0 z − zt zt .
t 0 t

Repeating n times a process of left-multiplying both sides with G(θ), right-multiplying


with G(θ)> and adding the resulting inequality into the original one side-by-side, we
obtain
t t n
!
1X > 1X >
 Gn+1 (θ) Gn+1 (θ)> + Gi (θ)Ω̃Gi (θ)>
X
zk−1 zk−1 zk−1 zk−1
t k=1 t k=1 i=0
n n
(B.12)
1 1
 
>
i
G (θ) z0 z0> − zt zt> Gi (θ)>
i i
X X
− (M + K + 1) G (θ)G (θ) +
i=0 i=0 t t

Note that n
X
i i >
n
X
i Cg2 2
G (θ)G (θ) ≤ kG (θ)k ≤ 2 and
i=0 i=0 ξ (1 − ξ 2/N )

n
1 > 1 > n
2 2 i 1 2Cz2 Cg2
 
G (θ) z0 z0 − zt zt Gi (θ)> ≤
i
Cz kG (θ)k2 ≤
X X

i=0 t t i=0 t t ξ 2 (1 − ξ 2/N )


Taking limit over n in (B.12) and using the above two inequalities, we have with
probability at least 1 − δ

t
Cg2 (M + K + 1) 1 2Cz2 Cg2
!
1X >
zk−1 zk−1  Πzz (θ) −  + I, ∀t ≥ t∗ (δ, , a)
t k=1 ξ 2 (1 − ξ 2/N ) t ξ 2 (1 − ξ 2/N )

ξ 2 (1−ξ 2/N )λmin (Πzz (θ)) 32C 2 C 2


Setting  = 16Cg2 (M +K+1)
and a = 216, for any t ≥ t∗ (δ, , a)∨ ξ2 (1−ξ2/N )λzming (Πzz (θ))

t
1X > λmin (Πzz (θ))
zk−1 zk−1  Πzz (θ) − I.
t k=1 8

It is easy to see that


!3
∗ 32Cz2 Cg2 (M + K + 1) 32Cz2 Cg2
t (δ, , a) ≥ 8 2 ∨ 216 ≥ 2 .
ξ (1 − ξ 2/N )λmin (Πzz (θ)) ξ (1 − ξ 2/N )λmin (Πzz (θ))

Similarly, from Yt  (M + K + 1)I, we can obtain for all t ≥ t∗ (δ, , a)

t
(M + K + 1)Cg2 1 2Cz2 Cg2
!
1X >
zk−1 zk−1  Πzz (θ) + − 2 I
t k=1 ξ 2 (1 − ξ 2/N ) t ξ (1 − ξ 2/N )
λmin (Πzz (θ))
 Πzz (θ) + I
16

Therefore, we may conclude that with probability at least 1 − δ

t
λmin (Πzz (θ)) 1X > λmin (Πzz (θ))
Πzz (θ) − I zk−1 zk−1  Πzz (θ) + I, ∀t ≥ T1 (θ, δ)
8 t k=1 16

where
!2 !
32(Cz Cg )2 (M + K + 1) (M + K + 2)4
T1 (θ, δ) = 4 2 log
ξ (1 − ξ 2/N )λmin (Πzz (θ)) 432δ 2
!3
32(Cz Cg )2 (M + K + 1)
∨8 2 ∨ 216.
ξ (1 − ξ 2/N )λmin (Πzz (θ))

Pt >
Since λmin (Πzz (θ))I  Πzz (θ), it follows that 78 Πzz (θ)  1
t
17
k=1 zk−1 zk−1  16 Πzz (θ)
Pt
and thus 78 U (θ)Πzz (θ)U (θ)>  U (θ) 1t >
k=1 zk−1 zk−1 U (θ)
>
 17
16
U (θ)Πzz (θ)U (θ)> . 

Lemma 3.4. Consider any θ ∈ Θ and $\{\theta_t \in \Theta\}$ adapted to $\{\sigma(I_t)\}$ such that $\|\theta_t - \theta\| \leq \frac{\eta}{\sqrt{M+1}\,C_L}$ a.s. for all t ≥ 0 and any $\nu \in (\xi, 1)$, where
$$\eta = \frac{\nu^3(1-\nu^{1/N})^3\lambda_{\min}(\Pi_{zz}(\theta))}{42N C_g^{N+1} C_\omega^2} \;\wedge\; \frac{\nu^3(1-\nu^{1/N})^3\lambda_{\min}(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top)}{42N C_g^{N+1} C_\omega^2 (1+\|U(\theta)\|)^2} \;\wedge\; \frac{\nu-\xi}{N C_g^{N-1}}.$$
Let $u_t = L(\theta_{t-1})z_{t-1}$, $z_t = G(\theta_{t-1})z_{t-1} + W_t$ and $\psi_t^\top = [u_t \ (d_t - d_{t-1})^\top] = (U(\theta_{t-1})z_{t-1})^\top$. Then,
$$\liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T}\psi_t\psi_t^\top \succeq \frac{\lambda_{\min}(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top)}{2}\, I \quad \text{a.s.}$$

Proof. For notational convenience, let G = G(θ), Gt = G(θt ), U = U (θ), Ut = U (θt ),


Πzz = Πzz (θ), Π(i, j) = Gi · · · Gj , Π(i, i − 1) = I and Π(i, i − 2) = 0. By definition of
G,

kGt − Gk = kB(L(θt ) − L(θ))k ≤ kBkkL(θt ) − L(θ)k


√ √
= M + 1 kL(θt ) − L(θ)k ≤ M + 1 CL kθt − θk ≤ η, ∀t ≥ 0.

By induction, we can show the following inequalities:

• kΠ(t + 1, t + i) − Gi k ≤ ηiCgi−1 for i = 1, . . . , N

• kΠ(t + 1, t + N )k ≤ kGN k + ηN CgN −1 ≤ ξ + ηN CgN −1 ≤ ν

• kΠ(t + 1, t + i)k ≤ kGi k + ηiCgi−1 ≤ 2Cg for i = 2, . . . , N − 1

• kΠ(i, j)k ≤ 2Cg ν b(j−i+1)/N c ≤ 2Cg ν (j−i+1)/N −1

• kΠ(i, i + kN − 1) − GkN k ≤ kηN CgN −1 ν k−1

3ηN CgN
• kΠ(i, j) − Gj−i+1 k ≤ ν2
(j − i + 1)ν (j−i)/N

Pt
Since zt can be expressed as zt = Π(0, t − 1)z0 + i=1 Π(i, t − 1)Wi , we have

T
1X >
zt−1 zt−1
T t=1
! >
T t−1 t−1
1X X X
= Π(0, t − 2)z0 + Π(i, t − 2)Wi Π(0, t − 2)z0 + Π(j, t − 2)Wj 
T t=1 i=1 j=1
! >
T t−1 t−1
1X
Gt−1 z0 + Gt−i−1 Wi Gt−1 z0 + Gt−j−1 Wj 
X X
=
T t=1 i=1 j=1
T 
1X 
+ Π(0, t − 2)z0 z0> Π(0, t − 2)> − Gt−1 z0 z0> Gt−1> · · · (a)
T t=1
T Xt−1 
1X 
+ Π(0, t − 2)z0 Wj> Π(j, t − 2)> − Gt−1 z0 Wj> Gt−j−1> · · · (b)
T t=2 j=1
T Xt−1 
1X 
+ Π(j, t − 2)Wj z0> Π(0, t − 2)> − Gt−j−1 Wj z0> Gt−1> · · · (c)
T t=2 j=1
T Xt−1 X
t−1 
1X 
+ Π(i, t − 2)Wi Wj> Π(j, t − 2)> − Gt−i−1 Wi Wj> Gt−j−1> · · · (d)
T t=2 i=1 j=1

We show that the matrix norm of the sum (a) + (b) + (c) + (d) can be made arbitrarily
small if T is sufficiently large. To this end, we will frequently use the inequality
Pt−1 1
i=1 iν̃ i−1 ≤ (1−ν̃)2
for any 0 < ν̃ < 1 and any t ≥ 2.

First, we bound k(d)k above. Since

t−1 X
t−1
kΠ(i, t − 2)Wi Wj> Π(j, t − 2)> − Gt−i−1 Wi Wj> Gt−j−1> k
X

i=1 j=1
t−1 X
t−1
k(Π(i, t − 2) − Gt−i−1 )Wi Wj> Π(j, t − 2)>
X
=
i=1 j=1

+ Gt−i−1 Wi Wj> (Π(j, t − 2) − Gt−j−1 )> k



t−1 X
t−1 
2Cg t−j−1
k(Π(i, t − 2) − Gt−i−1 )kCω2
X
≤ ν N
i=1 j=1
 ν

Cg t−i−1 
+ kΠ(j, t − 2) − Gt−j−1 kCω2 ν N
ν 


t−1 X
t−1 
3ηN CgN t−i−2 2Cg t−j−1
Cω2
X
≤ (t − i − 1)ν N ν N
i=1 j=1
 ν2 ν

3ηN CgN t−j−2
2 Cg t−i−1

+ (t − j − 1)ν N C ω ν N
ν2 ν 

18ηN CgN +1 Cω2 X


t−1 X
t−1
t−i−2 t−j−1
= 3
(t − i − 1)ν N ν N
ν i=1 j=1

18ηN CgN +1 Cω2


≤ 1 ,
ν 3 (1 − ν N )3

18ηN CgN +1 Cω 2
we obtain k(d)k ≤ ν 3 (1−ν 1/N )3
. Likewise, we bound k(b)k and k(c)k above.

T Xt−1
1X
k(b)k ≤ kΠ(0, t − 2)z0 Wj> Π(j, t − 2)> − Gt−1 z0 Wj> Gt−j−1> k
T t=2 j=1
T Xt−1
1X
= kΠ(0, t − 2)z0 Wj> (Π(j, t − 2) − Gt−j−1 )>
T t=2 j=1
+ (Π(0, t − 2) − Gt−1 )z0 Wj> Gt−j−1> k

1X T Xt−1 
2Cg t−1 3ηN CgN t−j−2
≤ ν N kz0 kCω 2
(t − j − 1)ν N
T t=2 j=1  ν ν

3ηN CgN t−2 Cg t−j−1 
+ (t − 1)ν N kz kC
0 ω ν N
ν2 ν 

1 3ηN CgN +1 Cω kz0 k XT Xt−1


t−1 t−j−2 t−2 t−j−1
= {2ν N (t − j − 1)ν N + (t − 1)ν N ν N }
T ν3 t=2 j=1
 
t−1 t−2
1 3ηN CgN +1 Cω kz0 k XT 
2ν N (t − 1)ν N 
≤  (1 − ν N1 )2
+ 1
T ν3 t=2 1 − νN 
 1

1 3ηN CgN +1 Cω kz0 k  2ν N 1
≤ 3 1 + 1

T ν (1 − ν N )3 (1 − ν N )3
1 9ηN CgN +1 Cω kz0 k 1
≤ 3 1
T ν (1 − ν N )3
N +1
1 9ηN Cg Cω kz0 k 1
and also k(c)k ≤ T ν3 (1−ν 1/N )3
. Finally, we bound (a) above

T
1X
k(a)k ≤ kΠ(0, t − 2)z0 z0> Π(0, t − 2)> − Gt−1 z0 z0> Gt−1> k
T t=1
T
1X
= k(Π(0, t − 2) − Gt−1 )z0 z0> Π(0, t − 2)>
T t=1
+ Gt−1 z0 z0> (Π(0, t − 2) − Gt−1 )> k

T 
1 2Cg t−1
kΠ(0, t − 2) − Gt−1 kkz0 k2
X
≤ ν N
T t=1
 ν

Cg t−1 
+ ν N kz0 k2 kΠ(0, t − 2) − Gt−1 k
ν 

1 9ηN CgN +1 kz0 k2 XT


t−2 t−1
≤ 3
(t − 1)ν N ν N
T ν t=1
1 9ηN CgN +1 kz0 k2

T ν 3 (1 − ν N2 )2

Therefore,

k(a) + (b) + (c) + (d)k ≤ k(a)k + k(b)k + k(c)k + k(d)k


18ηN CgN +1 Cω2 1 18ηN CgN +1 Cω kz0 k 1 9ηN CgN +1 kz0 k2
≤ 1 + 1 +
ν 3 (1 − ν N )3 T ν 3 (1 − ν N )3 T ν 3 (1 − ν N2 )2
18ηN CgN +1 Cω2 1 9ηN CgN +1 kz0 k(2Cω + kz0 k)
≤ 1 + 1
ν 3 (1 − ν N )3 T ν 3 (1 − ν N )3
21ηN CgN +1 Cω2 λmin (Πzz ) 3kz0 k(2Cω + kz0 k)
≤ 1 ≤ (by definition of η), ∀T ≥ .
ν (1 − ν N )
3 3 2 Cω2

3kz0 k(2Cω +kz0 k)


It follows that for any T ≥ 2

! >
T T t−1 t−1
1X > 1X
Gt−1 z0 + Gt−i−1 Wi Gt−1 z0 + Gt−j−1 Wj 
X X
zt−1 zt−1 
T t=1 T t=1 i=1 j=1

λmin (Πzz )
− I.
2

Taking lim inf on both sides, we can obtain



T
1X > λmin (Πzz ) λmin (Πzz )
lim inf zt−1 zt−1  Πzz − I I a.s.
T →∞ T t=1 2 2

Similarly, we can show lim inf T →∞ T1 Tt=1 ψt ψt>  0 a.s. Using kU − Ut k = k1(L −
P

Lt )k ≤ M + 1 kL − Lt k ≤ η, ∀t ≥ 0, we can show by induction that for any
0≤i≤j

3ηN CgN j−i


kUj+1 Π(i, j) − U Gj−i+1 k ≤ (1 + kU k) 2
(j − i + 1)ν N .
ν

Likewise,
T
1X
ψt ψt>
T t=1
! >
T t−1 t−1
1X X X
>
= Ut−1 Π(0, t − 2)z0 + Π(i, t − 2)Wi Π(0, t − 2)z0 + Π(j, t − 2)Wj  Ut−1
T t=1 i=1 j=1
! >
T t−1 t−1
1X X X
= U Gt−1 z0 + Gt−i−1 Wi Gt−1 z0 + Gt−j−1 Wj  U >
T t=1 i=1 j=1
T
1X
Ut−1 Π(0, t − 2)z0 z0> Π(0, t − 2)> Ut−1
>
− U Gt−1 z0 z0> Gt−1> U > · · · (a)0

+
T t=1
T t−1
1 XX
Ut−1 Π(0, t − 2)z0 Wj> Π(j, t − 2)> Ut−1
>
− U Gt−1 z0 Wj> Gt−j−1> U > · · · (b)0

+
T t=2 j=1
T t−1
1 XX
Ut−1 Π(j, t − 2)Wj z0> Π(0, t − 2)> Ut−1
>
− U Gt−j−1 Wj z0> Gt−1> U > · · · (c)0

+
T t=2 j=1
T t−1 t−1
1 XXX
Ut−1 Π(i, t − 2)Wi Wj> Π(j, t − 2)> Ut−1
>
− U Gt−i−1 Wi Wj> Gt−j−1> U > · · · (d)0

+
T t=2 i=1 j=1

and we can show that

0 18ηN CgN +1 Cω2 (1 + kU k)2 0 1 9ηN CgN +1 kz0 k2 (1 + kU k)2


k(d) k ≤ 1 , k(a) k ≤ 2 ,
ν 3 (1 − ν N )3 T ν 3 (1 − ν N )2

1 9ηN CgN +1 Cω kz0 k(1 + kU k)2


k(b)0 k, k(c)0 k ≤ 1 .
T ν 3 (1 − ν N )3
3kz0 k(2Cω +kz0 k)
It follows that for any T ≥ 2


k(a)0 + (b)0 + (c)0 + (d)0 k ≤ k(a)0 k + k(b)0 k + k(c)0 k + k(d)0 k


21ηN CgN +1 Cω2 (1 + kU k)2 λmin (U Πzz U > )
≤ 1 ≤
ν 3 (1 − ν N )3 2

and thus
! >
T T t−1 t−1
1X 1X
ψt ψt>  Gt−j−1 Wj  U >
X X
U Gt−1 z0 + Gt−i−1 Wi Gt−1 z0 +
T T
t=1 t=1 i=1 j=1 (B.13)
λmin (U Πzz U >)
− I.
2

Taking lim inf on both sides, we can obtain


T
1X λmin (U Πzz U > ) λmin (U Πzz U > )
lim inf ψt ψt>  U Πzz U > − I I a.s.
T →∞ T t=1 2 2

Lemma 3.5. Consider $\{\theta_t \in \Theta\}$ defined in Lemma 3.4. Let $u_t = L(\theta_{t-1})z_{t-1}$, $z_t = G(\theta_{t-1})z_{t-1} + W_t$ and $\psi_t^\top = [u_t \ (d_t - d_{t-1})^\top] = (U(\theta_{t-1})z_{t-1})^\top$. Then, for any $0 < \delta < 1$, on the event $B(\delta)$ in Lemma 3.3 with $\Pr(B(\delta)) \geq 1-\delta$,
$$\lambda_{\min}\!\left(\frac{1}{T}\sum_{t=1}^{T}\psi_t\psi_t^\top\right) \geq \frac{3}{8}\lambda_{\min}\!\left(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top\right), \quad \forall T \geq T_1(\theta,\delta) \vee \frac{3\|z_0\|(2C_\omega + \|z_0\|)}{C_\omega^2}.$$

Proof. By (B.13) and Lemma 3.3, on the event B(δ) with Pr(B(δ)) ≥ 1 − δ,
T
1X
ψt ψt>
T t=1
! >
T t−1 t−1
1X
G(θ)t−1 z0 + G(θ)t−i−1 Wi G(θ)t−1 z0 + G(θ)t−j−1 Wj  U (θ)>
X X
 U (θ)
T t=1 i=1 j=1

λmin (U (θ)Πzz (θ)U (θ)> )


− I (by (B.13))
2
λmin (U (θ)Πzz (θ)U (θ)> )
!
λmin (Πzz (θ))
 U (θ) Πzz (θ) − I U (θ)> − I
8 2
(by Lemma 3.3)
3 3kz0 k(2Cω + kz0 k)
 λmin (U (θ)Πzz (θ)U (θ)> ), ∀T ≥ T1 (θ, δ) ∨ .
8 Cω2


Lemma 3.6. Under CTRACE with $\tau \geq \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$, for all t ≥ 0,
$$\|z_t\| \leq \frac{(2C_g+1)C_g C_\omega}{\xi(1-\xi^{1/N})} \triangleq C_z^* \quad \text{a.s.}, \qquad \|\psi_t\| \leq \frac{(C_g+1)(2C_g+1)C_g C_\omega}{\xi(1-\xi^{1/N})} \triangleq C_\psi \quad \text{a.s.}$$

(2Cg +1)Cg Cω (Cg +1)(2Cg +1)Cg Cω


Proof. We show that kzt k ≤ ξ(1−ξ 1/N )
a.s. and kψt k ≤ ξ(1−ξ 1/N )
a.s. for all
j/N −1
t ≥ 0 under CTRACE. As seen in Proof of Lemma 3.1, kzti +j k ≤ Cg ξ kzti k +
Cg Cω g /ξ)
ξ(1−ξ 1/N )
a.s. as long as ti + j ≤ ti+1 . Since τ ≥ N log(2C log(1/ξ)
, we have Cg ξ τ /N −1 ≤ 12 .
2Cg Cω j/N −1 Cg Cω 2Cg Cω
Thus, if kzti k ≤ ξ(1−ξ 1/N ) a.s., then kzti +j k ≤ Cg ξ kzti k+ ξ(1−ξ 1/N ) ≤ ξ(1−ξ 1/N ) a.s.
2Cg Cω
for all τ ≤ j ≤ ti+1 − ti . Since kzt0 k = kz0 k ≤ ξ(1−ξ 1/N ) and ti+1 − ti ≥ τ for all i ≥ 1,
2Cg Cω
it follows by induction that kzti k ≤ ξ(1−ξ1/N ) a.s. for all i ≥ 1. For any ti < t < ti+1 ,
Cg Cω (2Cg +1)Cg Cω
kzt k ≤ Cg ξ (t−ti )/N −1 kzti k + ξ(1−ξ 1/N ) ≤ ξ(1−ξ 1/N )
a.s. Note that the above results
still hold if ti = ∞ for some i ≥ 1. Finally, it follows from supθ∈Θ kU (θ)k ≤ Cg + 1
(Cg +1)(2Cg +1)Cg Cω
that kψt k ≤ kU (θt−1 )kkzt−1 k ≤ ξ(1−ξ 1/N )
a.s. 

Theorem 3.2 (Inter-temporal Consistency of CTRACE). Let $\{\theta_t\}$ be estimates generated by CTRACE with $M \ge 2$, $\tau \ge \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$ and $C_v < \lambda^{*}_{\psi\psi}$. Then, the $i$th update time $t_i$ in Algorithm 1 is finite a.s. Moreover, $\|\theta_t - \theta^{*}\| \le b_t$, $\forall t \ge 0$ on the event $\{\theta^{*} \in S_t(\delta),\ \forall t \ge 1\}$, where $b_0 = \|\theta_0 - \theta^{*}\|$,

\[
b_t =
\begin{cases}
\dfrac{\sqrt{2C(M+1)\log\bigl(C_\psi^{2}\,t/\kappa + M + 1\bigr) + 2\log(1/\delta)} \,+\, 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v\,t}} & \text{if } t = t_i \text{ for some } i,\\[2ex]
b_{t-1} & \text{otherwise},
\end{cases}
\]

and $\{b_t\}$ is monotonically nonincreasing for all $t \ge 1$ with $\lim_{t\to\infty} b_t = 0$ a.s.

Proof. Let $t_0 = 0$. Conditioned on the event $\{t_{i-1} < \infty\}$, for $t > t_{i-1}$,

\[
\lambda_{\min}(V_t) \;\ge\; \lambda_{\min}(V_{t_{i-1}}) + \lambda_{\min}\!\left(\sum_{j=t_{i-1}+1}^{t}\psi_j\psi_j^{\top}\right)
\;\ge\; \kappa + C_v\,t_{i-1} + \lambda_{\min}\!\left(\sum_{j=t_{i-1}+1}^{t}\psi_j\psi_j^{\top}\right).
\]

By Lemma 3.2, $\lambda_{\min}\bigl(\frac{1}{t-t_{i-1}}\sum_{j=t_{i-1}+1}^{t}\psi_j\psi_j^{\top}\bigr) \to \lambda_{\min}\bigl(U(\theta)\Pi_{zz}(\theta)U(\theta)^{\top}\bigr) \ge \lambda^{*}_{\psi\psi} > C_v$ a.s. as $t \to \infty$, so long as $\theta$ is fixed after $t_{i-1}$. It implies that there exists $t_{i-1}+\tau \le t_i < \infty$ a.s. such that $\lambda_{\min}\bigl(\sum_{j=t_{i-1}+1}^{t_i}\psi_j\psi_j^{\top}\bigr) \ge C_v(t_i - t_{i-1})$ a.s. Since $\lambda_{\min}(V_{t_i}) \ge \kappa + C_v\,t_{i-1} + C_v(t_i - t_{i-1}) = \kappa + C_v\,t_i$, $t_i$ is indeed a qualified update time. That is, $\Pr(t_i < \infty \mid t_{i-1} < \infty) = 1$. If $\Pr(t_{i-1} < \infty) = 1$, then $\Pr(t_i < \infty,\ t_{i-1} < \infty) = \Pr(t_{i-1} < \infty)\,\Pr(t_i < \infty \mid t_{i-1} < \infty) = 1$ and thus $\Pr(t_i < \infty) = 1$. Since $\Pr(t_0 < \infty) = 1$, it follows that $\Pr(t_i < \infty) = 1$ for all $i \ge 0$ by induction. Hence, $\Pr(t_i < \infty,\ \forall i \ge 0) = \Pr\bigl(\cap_{i=0}^{\infty}\{t_i < \infty\}\bigr) = 1$.
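The first display above (and several later ones) relies on the superadditivity of the smallest eigenvalue, $\lambda_{\min}(A+B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$ for symmetric matrices, applied to $V_t = V_{t_{i-1}} + \sum_j \psi_j\psi_j^{\top}$. A small randomized check of this fact (the matrices below are arbitrary and purely illustrative):

```python
import numpy as np

# Check the Weyl-type bound lambda_min(A + B) >= lambda_min(A) + lambda_min(B)
# for symmetric PSD matrices, as used with V_t = V_{t_{i-1}} + sum_j psi_j psi_j^T.
rng = np.random.default_rng(2)
for _ in range(1000):
    X, Y = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    A, B = X @ X.T, Y @ Y.T            # symmetric PSD stand-ins
    assert np.linalg.eigvalsh(A + B)[0] >= np.linalg.eigvalsh(A)[0] + np.linalg.eigvalsh(B)[0] - 1e-9
print("lambda_min superadditivity verified on random PSD matrices")
```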

By Proposition 1, on the event $\{\theta^{*}\in S_t(\delta),\ \forall t\ge 1\}$, for any $i \ge 1$,

\[
\|\theta_{t_i}-\theta^{*}\| \;\le\; \|\theta_{t_i}-\hat{\theta}_{t_i}\| + \|\hat{\theta}_{t_i}-\theta^{*}\|
\;\le\; \frac{2}{\sqrt{\lambda_{\min}(V_{t_i})}}\left(\sqrt{2C\log\!\left(\frac{\det(V_{t_i})^{1/2}\det(\kappa I)^{-1/2}}{\delta}\right)} + \kappa^{1/2}\|\theta_{\max}\|\right)
\]
\[
\;\le\; \frac{\sqrt{2C(M+1)\log\bigl(C_\psi^{2}\,t_i/\kappa + M+1\bigr) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v\,t_i}} \;=\; b_{t_i}.
\]

The second inequality follows from the fact that

\[
\theta_{t_i} = \operatorname*{argmin}_{\theta\in\Theta}\ \sum_{j=1}^{t_i}\bigl(\Delta p_j - g^{\top}f_{j-1} - \psi_j^{\top}\theta\bigr)^{2} + \kappa\|\theta\|^{2}
= \operatorname*{argmin}_{\theta\in\Theta}\ (\theta-\hat{\theta}_{t_i})^{\top}V_{t_i}(\theta-\hat{\theta}_{t_i})
\]

and thus, on the event $\{\theta^{*}\in S_t(\delta),\ \forall t \ge 1\}$,

\[
\lambda_{\min}(V_{t_i})\,\|\theta_{t_i}-\hat{\theta}_{t_i}\|^{2} \le (\theta_{t_i}-\hat{\theta}_{t_i})^{\top}V_{t_i}(\theta_{t_i}-\hat{\theta}_{t_i}) \le (\theta^{*}-\hat{\theta}_{t_i})^{\top}V_{t_i}(\theta^{*}-\hat{\theta}_{t_i})
\le \left(\sqrt{2C\log\!\left(\frac{\det(V_{t_i})^{1/2}\det(\kappa I)^{-1/2}}{\delta}\right)} + \kappa^{1/2}\|\theta_{\max}\|\right)^{2}.
\]

In the third inequality, we use $\lambda_{\min}(V_{t_i}) \ge \kappa + C_v\,t_i \ge C_v\,t_i$ by definition of $t_i$ and

\[
\det(V_{t_i}) \le \lambda_{\max}(V_{t_i})^{M+1} \le \operatorname{tr}(V_{t_i})^{M+1} = \left(\kappa(M+1) + \sum_{j=1}^{t_i}\|\psi_j\|^{2}\right)^{M+1} \le \bigl(\kappa(M+1) + C_\psi^{2}\,t_i\bigr)^{M+1}.
\]

For any $t_i < t < t_{i+1}$, $\|\theta_t-\theta^{*}\| = \|\theta_{t_i}-\theta^{*}\| \le b_{t_i} = b_t$. Now, we show the monotonicity of $b_t$. A key observation is that for any $C > 0$, $C_\psi > 0$, $\kappa > 0$, $0 < \delta < 1$ and any $\|\theta_{\max}\|$,

\[
h(t) = \frac{\sqrt{C(M+1)\log\bigl(C_\psi^{2}\,t/\kappa + M+1\bigr) + 2\log(1/\delta)} + \kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v\,t}}
\]

is strictly decreasing in $t \ge 1$ if $M \ge 2$; this can be verified through elementary calculus. Since $b_t$ is monotonically nonincreasing and bounded below by 0, it converges almost surely. It follows from $\lim_{i\to\infty} b_{t_i} = 0$ a.s. that $\lim_{t\to\infty} b_t = 0$ a.s. $\square$
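For intuition about the objects in Theorem 3.2, the update-time rule and the error-bound sequence $b_t$ can be sketched in a few lines. The snippet below is an illustrative reconstruction under placeholder constants and a stand-in regressor stream; it is not the thesis's implementation of CTRACE, but it shows the two ingredients the theorem analyzes: an update is triggered only when $\lambda_{\min}(V_t)$ has grown linearly (at least $\kappa + C_v t$) and at least $\tau$ periods have elapsed since the last update, and $b_t$ is evaluated at update times.

```python
import numpy as np

# Illustrative sketch of the CTRACE update-time rule analyzed in Theorem 3.2.
# All numerical constants below are placeholder assumptions, not values from the thesis.
M, kappa, C_v, tau = 2, 1.0, 0.05, 10
C, C_psi, theta_max_norm, delta = 1.0, 2.0, 1.0, 0.1

def b(t):
    """Error-bound sequence b_t evaluated at an update time t (cf. Theorem 3.2)."""
    num = np.sqrt(2 * C * (M + 1) * np.log(C_psi**2 * t / kappa + M + 1)
                  + 2 * np.log(1 / delta)) + 2 * np.sqrt(kappa) * theta_max_norm
    return num / np.sqrt(C_v * t)

rng = np.random.default_rng(0)
V = kappa * np.eye(M + 1)            # V_t = kappa*I + sum_j psi_j psi_j^T
t_last, update_times = 0, []
for t in range(1, 2001):
    psi = rng.normal(size=M + 1)     # stand-in regressor; in CTRACE psi_t = U(theta_{t-1}) z_{t-1}
    V += np.outer(psi, psi)
    # update only if the minimum eigenvalue has grown linearly and tau periods have elapsed
    if np.linalg.eigvalsh(V)[0] >= kappa + C_v * t and t - t_last >= tau:
        update_times.append(t)
        t_last = t

print(update_times[:5], [round(b(t), 3) for t in update_times[:5]])
```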

Theorem 3.3 (Efficiency of CTRACE). For any $\epsilon > 0$, $0 < \delta, \delta' < 1$, $\tau \ge \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$ and $C_v < \frac{7}{8}\lambda^{*}_{\psi\psi}$, on the event $B(\delta')$ defined in Lemma 3.3,

\[
t_{N(\epsilon,\delta,C_v)} \;\le\; T_1^{*}(\delta')\vee\tau \;+\; T_2(\epsilon,\delta,C_v)\qquad\text{where}
\]
\[
T_1^{*}(\delta') = 4\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda^{*}_{zz}}\right)^{\!2}\log\!\left(\frac{(M+K+2)^{4}}{432\,\delta'^{2}}\right)
\;\vee\; 8\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda^{*}_{zz}}\right)^{\!3} \;\vee\; 216,
\]
\[
T_2(\epsilon,\delta,C_v) = \left(\frac{8C^{2}C_\psi(M+1) + 4\sqrt{4C^{4}C_\psi^{2}(M+1)^{2} + \kappa C^{2}C_v\epsilon^{2}\bigl((M+1)^{3/2} + 2\log(1/\delta)\bigr)}}{\sqrt{\kappa}\,C_v\epsilon^{2}}\right)^{\!2}
\;\vee\; \frac{(4\kappa\|\theta_{\max}\|)^{2}}{C_v\epsilon^{2}}.
\]
Proof. Using $\log(t+M+1) \le \sqrt{t+M+1}$ for all $t \ge 0$,

\[
\frac{\sqrt{2C(M+1)\log\bigl(C_\psi^{2}\,t/\kappa+M+1\bigr) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v\,t}}
\;\le\; \frac{\sqrt{2C(M+1)\sqrt{C_\psi^{2}\,t/\kappa+M+1} + 2\log(1/\delta)}}{\sqrt{C_v\,t}} + \frac{2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v\,t}}.
\]

It is easy to show that if

\[
t \;\ge\; \left(\frac{8C^{2}C_\psi(M+1) + 4\sqrt{4C^{4}C_\psi^{2}(M+1)^{2} + \kappa C^{2}C_v\epsilon^{2}\bigl((M+1)^{3/2}+2\log(1/\delta)\bigr)}}{\sqrt{\kappa}\,C_v\epsilon^{2}}\right)^{\!2},
\]

then

\[
\frac{\sqrt{2C(M+1)\sqrt{C_\psi^{2}\,t/\kappa+M+1} + 2\log(1/\delta)}}{\sqrt{C_v\,t}} \;\le\; \frac{\epsilon}{2}.
\]

Also, if $t \ge (4\kappa\|\theta_{\max}\|)^{2}/(C_v\epsilon^{2})$, then $2\kappa^{1/2}\|\theta_{\max}\|/\sqrt{C_v\,t} \le \epsilon/2$. Therefore,

\[
\frac{\sqrt{2C(M+1)\log\bigl(C_\psi^{2}\,t/\kappa+M+1\bigr) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{C_v\,t}} \;\le\; \epsilon \qquad\text{if}
\]
\[
t \;\ge\; \left(\frac{8C^{2}C_\psi(M+1) + 4\sqrt{4C^{4}C_\psi^{2}(M+1)^{2} + \kappa C^{2}C_v\epsilon^{2}\bigl((M+1)^{3/2}+2\log(1/\delta)\bigr)}}{\sqrt{\kappa}\,C_v\epsilon^{2}}\right)^{\!2}
\;\vee\; \frac{(4\kappa\|\theta_{\max}\|)^{2}}{C_v\epsilon^{2}} \;=\; T_2(\epsilon,\delta,C_v).
\]

Suppose for contradiction that $t_{N(\epsilon,\delta,C_v)} > T_1^{*}(\delta')\vee\tau + T_2(\epsilon,\delta,C_v) \triangleq \tilde T^{*}$. Let $t_i$ be the last update time less than $T_2(\epsilon,\delta,C_v)$; $t_i$ is zero if there is no update time before $T_2(\epsilon,\delta,C_v)$. Then, there is no update time in the interval $[t_i+1,\,\tilde T^{*}]$ by definition of $t_{N(\epsilon,\delta,C_v)}$ and $T_2(\epsilon,\delta,C_v)$. Thus,

\[
\lambda_{\min}(V_{\tilde T^{*}}) \;\ge\; \lambda_{\min}(V_{t_i}) + \lambda_{\min}\!\left(\sum_{t=t_i+1}^{\tilde T^{*}}\psi_t\psi_t^{\top}\right)
\;\ge\; \kappa + C_v\,t_i + \lambda_{\min}\!\left(\sum_{t=t_i+1}^{\tilde T^{*}}\psi_t\psi_t^{\top}\right)\quad\text{(by definition of } t_i)
\]
\[
\;\ge\; \kappa + C_v\,t_i + \frac{7}{8}\lambda^{*}_{\psi\psi}\bigl(\tilde T^{*}-t_i\bigr)\quad\text{(by Lemma 3.3)}
\;\ge\; \kappa + C_v\,\tilde T^{*}\qquad\Bigl(\because\ \tfrac{7}{8}\lambda^{*}_{\psi\psi} > C_v\Bigr).
\]

It is clear that $\tilde T^{*} - t_i \ge \tau$. Consequently, $\tilde T^{*}$ is eligible as a next update time after $t_i$. It implies that $t_{N(\epsilon,\delta,C_v)} \le \tilde T^{*}$, which is a contradiction. $\square$
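To see how the efficiency bound of Theorem 3.3 scales, the sketch below evaluates $T_1^{*}(\delta')$ and $T_2(\epsilon,\delta,C_v)$ as reconstructed above for placeholder constants; every numerical value (and the stand-in for $\lambda^{*}_{zz}$) is an assumption chosen purely for illustration.

```python
import numpy as np

# Illustrative evaluation of the sample-size bounds in Theorem 3.3 under placeholder constants.
M, K, N = 2, 3, 5
C, C_psi, C_z, C_g, xi = 1.0, 2.0, 5.0, 1.5, 0.9
kappa, C_v, theta_max_norm = 1.0, 0.05, 1.0
lam_zz = 0.5                           # stands in for lambda*_zz

def T1_star(delta_p):
    a = 32 * (C_z * C_g) ** 2 * (M + K + 1) / (xi ** 2 * (1 - xi ** (2 / N)) * lam_zz)
    return max(4 * a ** 2 * np.log((M + K + 2) ** 4 / (432 * delta_p ** 2)),
               8 * a ** 3, 216)

def T2(eps, delta):
    inner = 4 * C ** 4 * C_psi ** 2 * (M + 1) ** 2 \
            + kappa * C ** 2 * C_v * eps ** 2 * ((M + 1) ** 1.5 + 2 * np.log(1 / delta))
    first = ((8 * C ** 2 * C_psi * (M + 1) + 4 * np.sqrt(inner))
             / (np.sqrt(kappa) * C_v * eps ** 2)) ** 2
    return max(first, (4 * kappa * theta_max_norm) ** 2 / (C_v * eps ** 2))

print(T1_star(0.05), T2(0.1, 0.05))
```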

Theorem 3.4 (Finite-Time Expected Regret Bound of CTRACE). If $\pi$ is CTRACE with $M \ge 2$, $\tau \ge \frac{N\log(2C_g/\xi)}{\log(1/\xi)}$ and $C_v < \frac{7}{8}\lambda^{*}_{\psi\psi}$, then for any $\nu\in(\xi,1)$ and all $T \ge 2$,

\[
\bar R_T^{\pi}(z_0) \le 2\|P^{*}\|C_z^{*2} + (R+B^{\top}P^{*}B)\,C_z^{*2}C_L^{2}\Biggl[\bigl(\tau_1^{*}(T)+\tau_2^{*}(T)+1\bigr)\|\theta_{\max}\|^{2} + \tau_3^{*}(T)\,\epsilon^{2}
\]
\[
\quad + \frac{\tau}{\tilde C}\Bigl(\sqrt{2C(M+1)\log\bigl(C_\psi^{2}\,T/\kappa+M+1\bigr)+2\log(2T)}+2\kappa^{1/2}\|\theta_{\max}\|\Bigr)^{2}
\log\!\left(\frac{\kappa+\tilde C(T-1)-(\tilde C-C_v)^{+}\bigl(\tau_1^{*}(T)+\tau_2^{*}(T)\bigr)}{\kappa+\tilde C(\tau^{*}(T)-1)-(\tilde C-C_v)^{+}\bigl(\tau_1^{*}(T)+\tau_2^{*}(T)\bigr)}\right)\mathbf{1}\{T>\tau^{*}(T)\}\Biggr]
\]

where $\tilde C \triangleq \frac{3}{8}\lambda_{\min}\bigl(U(\theta^{*})\Pi_{zz}(\theta^{*})U(\theta^{*})^{\top}\bigr)$, $\tau^{*}(T) = \tau_1^{*}(T)+\tau_2^{*}(T)+\tau_3^{*}(T)$,

\[
\tau_1^{*}(T) = 8\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda^{*}_{zz}}\right)^{\!2}\log\!\left(\frac{(M+K+2)^{2}\,T}{6\sqrt{3}}\right)
\;\vee\; 8\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda^{*}_{zz}}\right)^{\!3} \;\vee\; 216 \;\vee\; \tau,
\]
\[
\tau_2^{*}(T) = \left(\frac{8C^{2}C_\psi(M+1)+4\sqrt{4C^{4}C_\psi^{2}(M+1)^{2}+\kappa C^{2}C_v\epsilon^{2}\bigl((M+1)^{3/2}+2\log(2T)\bigr)}}{\sqrt{\kappa}\,C_v\epsilon^{2}}\right)^{\!2}
\;\vee\; \frac{(4\kappa\|\theta_{\max}\|)^{2}}{C_v\epsilon^{2}},
\]
\[
\tau_3^{*}(T) = 8\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda_{\min}(\Pi_{zz}(\theta^{*}))}\right)^{\!2}\log\!\left(\frac{(M+K+2)^{2}\,T}{6\sqrt{3}}\right)
\;\vee\; 8\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda_{\min}(\Pi_{zz}(\theta^{*}))}\right)^{\!3} \;\vee\; 216 \;\vee\; \frac{3C_z^{*}(2C_\omega+C_z^{*})}{C_\omega^{2}},
\]
\[
\epsilon = \frac{1}{\sqrt{M+1}\,C_L}\left(\frac{\nu^{3}\bigl(1-\nu^{\frac{1}{N}}\bigr)^{3}\lambda_{\min}(\Pi_{zz}(\theta^{*}))}{42N C_g^{N+1}C_\omega^{2}}
\;\wedge\; \frac{\nu^{3}\bigl(1-\nu^{\frac{1}{N}}\bigr)^{3}\lambda_{\min}\bigl(U(\theta^{*})\Pi_{zz}(\theta^{*})U(\theta^{*})^{\top}\bigr)}{42N C_g^{N+1}C_\omega^{2}\bigl(1+\|U(\theta^{*})\|\bigr)^{2}}
\;\wedge\; \frac{\nu-\xi}{N C_g^{N-1}}\right).
\]

Proof. Let

\[
T_1(\delta/2) = 4\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda^{*}_{zz}}\right)^{\!2}\log\!\left(\frac{(M+K+2)^{4}}{432(\delta/2)^{2}}\right)
\;\vee\; 8\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda^{*}_{zz}}\right)^{\!3} \;\vee\; 216 \;\vee\; \tau,
\]
\[
T_2(\delta/2) = \left(\frac{8C^{2}C_\psi(M+1)+4\sqrt{4C^{4}C_\psi^{2}(M+1)^{2}+\kappa C^{2}C_v\epsilon^{2}\bigl((M+1)^{3/2}+2\log(2/\delta)\bigr)}}{\sqrt{\kappa}\,C_v\epsilon^{2}}\right)^{\!2}
\;\vee\; \frac{(4\kappa\|\theta_{\max}\|)^{2}}{C_v\epsilon^{2}},
\]
\[
T_3(\delta/2) = 4\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda_{\min}(\Pi_{zz}(\theta^{*}))}\right)^{\!2}\log\!\left(\frac{(M+K+2)^{4}}{432(\delta/2)^{2}}\right)
\;\vee\; 8\left(\frac{32(C_z^{*}C_g)^{2}(M+K+1)}{\xi^{2}\bigl(1-\xi^{\frac{2}{N}}\bigr)\lambda_{\min}(\Pi_{zz}(\theta^{*}))}\right)^{\!3} \;\vee\; 216 \;\vee\; \frac{3C_z^{*}(2C_\omega+C_z^{*})}{C_\omega^{2}},
\]
\[
T_s(\delta/2) = T_1(\delta/2)+T_2(\delta/2)+T_3(\delta/2).
\]

Then,

\[
\sum_{t=1}^{T}(u_t - L^{*}z_{t-1})^{\top}(R+B^{\top}P^{*}B)(u_t-L^{*}z_{t-1})
= (R+B^{\top}P^{*}B)\sum_{t=1}^{T}(u_t-L^{*}z_{t-1})^{2}
= (R+B^{\top}P^{*}B)\sum_{t=1}^{T}\bigl((L(\theta_{t-1})-L(\theta^{*}))z_{t-1}\bigr)^{2}
\]
\[
\le (R+B^{\top}P^{*}B)\sum_{t=1}^{T}\|L(\theta_{t-1})-L(\theta^{*})\|^{2}\,\|z_{t-1}\|^{2}
\;\le\; (R+B^{\top}P^{*}B)\,C_z^{*2}C_L^{2}\sum_{t=1}^{T}\|\theta_{t-1}-\theta^{*}\|^{2}
\]
\[
= (R+B^{\top}P^{*}B)\,C_z^{*2}C_L^{2}\left(\sum_{t=1}^{T_1(\delta/2)+T_2(\delta/2)}\|\theta_{t-1}-\theta^{*}\|^{2}\,\mathbf{1}\{1\le t\le T_1(\delta/2)+T_2(\delta/2)\}
+ \sum_{t=T_1(\delta/2)+T_2(\delta/2)+1}^{T_s(\delta/2)}\|\theta_{t-1}-\theta^{*}\|^{2}\,\mathbf{1}\{T_1(\delta/2)+T_2(\delta/2)<t\le T_s(\delta/2)\}
+ \sum_{t=T_s(\delta/2)+1}^{T}\|\theta_{t-1}-\theta^{*}\|^{2}\,\mathbf{1}\{T_s(\delta/2)<t\le T\}\right)
\]
\[
\le (R+B^{\top}P^{*}B)\,C_z^{*2}C_L^{2}\left(\sum_{t=1}^{T_1(\delta/2)+T_2(\delta/2)}\|\theta_{t-1}-\theta^{*}\|^{2}
+ \sum_{t=T_1(\delta/2)+T_2(\delta/2)+1}^{T_s(\delta/2)}\|\theta_{t-1}-\theta^{*}\|^{2}
+ \sum_{t=T_s(\delta/2)+1}^{T}\|\theta_{t-1}-\theta^{*}\|^{2}\,\mathbf{1}\{T>T_s(\delta/2)\}\right).
\]

First, on the event $\{\theta^{*}\in S_t(\delta/2),\ \forall t\ge 1\}\cap B(\delta/2)$ with $\Pr\bigl(\{\theta^{*}\in S_t(\delta/2),\ \forall t\ge 1\}\cap B(\delta/2)\bigr) \ge 1-\delta$,

\[
\sum_{t=1}^{T_1(\delta/2)+T_2(\delta/2)}\|\theta_{t-1}-\theta^{*}\|^{2} \;\le\; \bigl(T_1(\delta/2)+T_2(\delta/2)\bigr)\|\theta_{\max}\|^{2}
\qquad\text{and}\qquad
\sum_{t=T_1(\delta/2)+T_2(\delta/2)+1}^{T_s(\delta/2)}\|\theta_{t-1}-\theta^{*}\|^{2} \;\le\; T_3(\delta/2)\,\epsilon^{2}.
\]

Also, for $t \ge T_s(\delta/2)+1$,

\[
\lambda_{\min}(V_{t-1}) \;\ge\; \lambda_{\min}\bigl(V_{t_{N(\epsilon,\delta/2,C_v)}}\bigr) + \lambda_{\min}\!\left(\sum_{i=t_{N(\epsilon,\delta/2,C_v)}+1}^{t-1}\psi_i\psi_i^{\top}\right)
\;\ge\; \kappa + C_v\,t_{N(\epsilon,\delta/2,C_v)} + \tilde C\bigl(t-1-t_{N(\epsilon,\delta/2,C_v)}\bigr)
\]

(by definition of $t_{N(\epsilon,\delta/2,C_v)}$ and $t-1-t_{N(\epsilon,\delta/2,C_v)} \ge T_3(\delta/2)$)

\[
= \kappa + \tilde C(t-1) - (\tilde C - C_v)\,t_{N(\epsilon,\delta/2,C_v)}
\;\ge\; \kappa + \tilde C(t-1) - (\tilde C - C_v)^{+}\bigl(T_1(\delta/2)+T_2(\delta/2)\bigr)\quad\text{(by Lemma 3.3)}.
\]

Therefore,

\[
\sum_{t=T_s(\delta/2)+1}^{T}\|\theta_{t-1}-\theta^{*}\|^{2}
\;\le\; \sum_{t=T_s(\delta/2)+1}^{T}\frac{\tau}{\lambda_{\min}(V_{t-1})}\left(\sqrt{2C\log\!\left(\frac{\det(V_{t-1})^{1/2}\det(\kappa I)^{-1/2}}{\delta/2}\right)}+2\kappa^{1/2}\|\theta_{\max}\|\right)^{2}
\]
\[
\;\le\; \sum_{t=T_s(\delta/2)+1}^{T}\frac{\tau\left(\sqrt{2C(M+1)\log\bigl(C_\psi^{2}(t-1)/\kappa+M+1\bigr)+2\log(2/\delta)}+2\kappa^{1/2}\|\theta_{\max}\|\right)^{2}}{\kappa+\tilde C(t-1)-(\tilde C-C_v)^{+}\bigl(T_1(\delta/2)+T_2(\delta/2)\bigr)}
\]
\[
\;\le\; \sum_{t=T_s(\delta/2)+1}^{T}\frac{\tau\left(\sqrt{2C(M+1)\log\bigl(C_\psi^{2}(T-1)/\kappa+M+1\bigr)+2\log(2/\delta)}+2\kappa^{1/2}\|\theta_{\max}\|\right)^{2}}{\kappa+\tilde C(t-1)-(\tilde C-C_v)^{+}\bigl(T_1(\delta/2)+T_2(\delta/2)\bigr)}
\]
\[
\;\le\; \frac{\tau}{\tilde C}\left(\sqrt{2C(M+1)\log\bigl(C_\psi^{2}\,T/\kappa+M+1\bigr)+2\log(2/\delta)}+2\kappa^{1/2}\|\theta_{\max}\|\right)^{2}
\log\!\left(\frac{\kappa+\tilde C(T-1)-(\tilde C-C_v)^{+}(T_1(\delta/2)+T_2(\delta/2))}{\kappa+\tilde C(T_s(\delta/2)-1)-(\tilde C-C_v)^{+}(T_1(\delta/2)+T_2(\delta/2))}\right).
\]

Let $A = \{\theta^{*}\in S_t(\delta/2),\ \forall t\ge 1\}\cap B(\delta/2)$ and $q = \Pr(A)$. Then,

\[
\bar R_T^{\pi}(z_0) = \mathbb{E}\bigl[z_T^{*\top}P^{*}z_T^{*}\bigr] - \mathbb{E}\bigl[z_T^{\top}P^{*}z_T\bigr] + \mathbb{E}\left[\sum_{t=1}^{T}(u_t-L^{*}z_{t-1})^{\top}(R+B^{\top}P^{*}B)(u_t-L^{*}z_{t-1})\right]
\]
\[
\le 2\|P^{*}\|C_z^{*2} + q\,\mathbb{E}\left[\sum_{t=1}^{T}(u_t-L^{*}z_{t-1})^{\top}(R+B^{\top}P^{*}B)(u_t-L^{*}z_{t-1})\ \Big|\ A\right]
+ (1-q)\,\mathbb{E}\left[\sum_{t=1}^{T}(u_t-L^{*}z_{t-1})^{\top}(R+B^{\top}P^{*}B)(u_t-L^{*}z_{t-1})\ \Big|\ A^{c}\right]
\]
\[
\le 2\|P^{*}\|C_z^{*2} + \mathbb{E}\left[\sum_{t=1}^{T}(u_t-L^{*}z_{t-1})^{\top}(R+B^{\top}P^{*}B)(u_t-L^{*}z_{t-1})\ \Big|\ A\right]
+ \delta T\,(R+B^{\top}P^{*}B)\,C_z^{*2}C_L^{2}\|\theta_{\max}\|^{2}
\]
\[
\le 2\|P^{*}\|C_z^{*2} + (R+B^{\top}P^{*}B)\,C_z^{*2}C_L^{2}\Biggl[\bigl(T_1(\delta/2)+T_2(\delta/2)+\delta T\bigr)\|\theta_{\max}\|^{2} + T_3(\delta/2)\,\epsilon^{2}
\]
\[
\quad + \frac{\tau}{\tilde C}\Bigl(\sqrt{2C(M+1)\log\bigl(C_\psi^{2}\,T/\kappa+M+1\bigr)+2\log(2/\delta)}+2\kappa^{1/2}\|\theta_{\max}\|\Bigr)^{2}
\log\!\left(\frac{\kappa+\tilde C(T-1)-(\tilde C-C_v)^{+}(T_1(\delta/2)+T_2(\delta/2))}{\kappa+\tilde C(T_s(\delta/2)-1)-(\tilde C-C_v)^{+}(T_1(\delta/2)+T_2(\delta/2))}\right)\mathbf{1}\{T>T_s(\delta/2)\}\Biggr].
\]

A key observation is that the last inequality holds for any $0<\delta<1$, and $\delta$ is not an input to the CTRACE algorithm, i.e., the operation of CTRACE is independent of $\delta$. If we choose $\delta = 1/T$ for $T\ge 2$, then we obtain the desired result. $\square$
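One step left implicit between the last two displayed inequalities of the previous block is the integral-comparison bound $\sum_{t=T_s+1}^{T}\frac{1}{\kappa+\tilde C(t-1)-c} \le \frac{1}{\tilde C}\log\frac{\kappa+\tilde C(T-1)-c}{\kappa+\tilde C(T_s-1)-c}$, which is what produces the logarithmic factor in the regret bound. A quick numerical sanity check of this inequality, with arbitrary placeholder values for all constants:

```python
import numpy as np

# Check: sum_{t=Ts+1}^{T} 1/(kappa + Ctil*(t-1) - c)
#        <= (1/Ctil) * log( (kappa + Ctil*(T-1) - c) / (kappa + Ctil*(Ts-1) - c) ).
# kappa, Ctil, c, Ts, T below are placeholder values chosen so all denominators are positive.
kappa, Ctil, c, Ts, T = 1.0, 0.3, 2.0, 50, 5000

lhs = sum(1.0 / (kappa + Ctil * (t - 1) - c) for t in range(Ts + 1, T + 1))
rhs = (1.0 / Ctil) * np.log((kappa + Ctil * (T - 1) - c) / (kappa + Ctil * (Ts - 1) - c))
print(lhs <= rhs, lhs, rhs)
```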