Quantum Algorithms For Deep Convolutional Neural Networks
ABSTRACT
1 INTRODUCTION
The growing importance of deep learning in research, in industry and in our society will require extreme computational power, as dataset sizes and the complexity of these algorithms are expected to increase. Quantum computers are a good candidate to answer this challenge. The recent progress in the physical realization of quantum processors and the advances in quantum algorithms increase the importance of understanding their capabilities and limitations. In particular, the field of quantum machine learning has witnessed many innovative algorithms that offer speedups over their classical counterparts Kerenidis et al. (2019); Lloyd et al. (2013; 2014); Kerenidis & Prakash (2017b); Wiebe et al. (2014a).
Quantum deep learning refers to the problem of creating quantum circuits that mimic and enhance the operations of neural networks. It has been studied in several works Allcock et al. (2018); Rebentrost et al. (2018); Wiebe et al. (2014b) but remains challenging, as it is difficult to implement non linearities with quantum unitaries Schuld et al. (2014). In this work we propose a quantum algorithm for convolutional neural networks (CNN), a type of deep learning designed for visual recognition, signal processing and time series. We also provide results of numerical simulations to evaluate the running time and accuracy of the quantum convolutional neural network (QCNN). Note that our algorithm is theoretical and could be compiled on any type of quantum computer (trapped ions, superconducting qubits, cold atoms, photons, etc.).
CNNs were originally developed by LeCun et al. (1998) in the 1980's. They have achieved great practical success over the last decade Krizhevsky et al. (2012) and have been used in cutting-edge domains like autonomous cars Bojarski et al. (2016) and gravitational wave detection George & Huerta (2018). Despite these successes, CNNs suffer from computational bottlenecks due to the size of the optimization space and the complexity of the inner operations; these bottlenecks make deep CNNs resource expensive.
The growing interest in quantum machine learning has led researchers to develop different variants of Quantum Neural Networks (QNN). The quest for designing quantum analogs of neural networks is challenging due to the modular layer architecture of neural networks and the presence of non linearities, pooling, and other non unitary operations, as explained in Schuld et al. (2014). Several strategies have been tried in order to implement some features of neural networks Allcock et al. (2018); Wiebe et al. (2014b); Beer et al. (2019) in the quantum setting.
Variational quantum circuits provide another path to the design of QNNs; this approach has been developed in Farhi & Neven (2018); Henderson et al. (2019); Killoran et al. (2018). A quantum convolutional neural network architecture using variational circuits was recently proposed Cong et al. (2018). However, further work is required to provide evidence that such techniques can outperform classical neural networks in machine learning settings.
2 PRELIMINARIES
We briefly introduce the formalism and notation concerning the classical convolution product and its equivalence with matrix multiplication. More details can be found in the Appendix (Section C). A single layer $\ell$ of the classical CNN does the following operations: from an input image $X^{\ell} \in \mathbb{R}^{H^{\ell} \times W^{\ell} \times D^{\ell}}$, seen as a 3D tensor, and a kernel $K^{\ell} \in \mathbb{R}^{H \times W \times D^{\ell} \times D^{\ell+1}}$, seen as a 4D tensor, it performs a convolution product and outputs $X^{\ell+1} = X^{\ell} * K^{\ell}$, with $X^{\ell+1} \in \mathbb{R}^{H^{\ell+1} \times W^{\ell+1} \times D^{\ell+1}}$. This convolution operation is equivalent to the matrix multiplication $A^{\ell} F^{\ell} = Y^{\ell+1}$, where $A^{\ell}$, $F^{\ell}$ and $Y^{\ell+1}$ are suitably vectorized versions of $X^{\ell}$, $K^{\ell}$ and $X^{\ell+1}$ respectively. The output of layer $\ell$ of the CNN is $f(X^{\ell+1})$, where $f$ is a non linear function.
For a detailed introduction to quantum computing and its applications to machine learning in the
context of this work, we invite the reader to look at Appendix F. We also refer to Nielsen & Chuang
(2002b) for a more complete overview of quantum computing.
In this part we only briefly discuss the core notions of quantum computing. Like a classical bit, a quantum bit (qubit) can be in state $|0\rangle$ or $|1\rangle$, but can also be in a superposition state $\alpha|0\rangle + \beta|1\rangle$ with amplitudes $\alpha, \beta \in \mathbb{C}$ such that $|\alpha|^2 + |\beta|^2 = 1$. With $n$ qubits it is then possible to construct a superposition of the $2^n$ possible binary combinations, each with a specific amplitude. We will note the $i$-th combination (e.g. $|01\cdots110\rangle$) as $|i\rangle$. A vector $v \in \mathbb{R}^d$ can be encoded in a quantum state made of $\lceil \log(d) \rceil$ qubits. This encoding is a quantum superposition, where the components $(v_1, \cdots, v_d)$ of $v$ are used as the amplitudes of the $d$ binary combinations. We note this state $|v\rangle := \frac{1}{\|v\|} \sum_{i \in [d]} v_i |i\rangle$, where $|i\rangle$ is a register representing the $i$-th vector in the standard basis.
Quantum computation proceeds by applying quantum gates which are defined to be unitary matrices acting on 1 or 2 qubits, for example the Hadamard gate that maps $|0\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$ and $|1\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$. The output of the computation is a quantum state that can be measured to obtain
classical information. The measurement of a qubit α |0i + β |1i yields either 0 or 1, with probability
equal to the square of the respective amplitude. A detailed discussion of the results from quantum
machine learning and linear algebra used in this work can be found in Appendix (Section F).
3 MAIN RESULTS
In this paper, we design a quantum convolutional neural network (QCNN) algorithm with a modular
architecture that allows for any number of layers, any number and size of kernels, and that can
support a large variety of non linearities and pooling methods. Our main technical contributions
include a new notion of a quantum convolution product, the development of a quantum sampling
technique well suited for information recovery in the context of CNNs and a proposal for a quantum
backpropagation algorithm for efficient training of the QCNN.
The QCNN can be directly compared to the classical CNN as it has the same inputs and outputs.
We show that it offers a speedup compared to certain cases of classical CNN for both the forward
pass and for training using backpropagation. For each layer, on the forward pass (Algorithm 1),
the speedup is exponential in the size of the layer (number of kernels) and almost quadratic on the
spatial dimension of the input. We next state informally the speedup for the forward pass, the formal
version appears as Theorem D.1.
For each output pixel $X_j^{\ell+1}$, the algorithm guarantees:
$$\begin{cases} |\overline{X}_j^{\ell+1} - f(X_j^{\ell+1})| \le 2\epsilon & \text{if } f(\overline{X}_j^{\ell+1}) \ge \eta \\ \overline{X}_j^{\ell+1} = 0 & \text{if } f(\overline{X}_j^{\ell+1}) < \eta \end{cases} \tag{1}$$
The running time of the algorithm is $\widetilde{O}\!\left(\frac{1}{\eta^{2}} \cdot \frac{M\sqrt{C}}{\sqrt{\mathbb{E}(f(\overline{X}^{\ell+1}))}}\right)$, where $\mathbb{E}(\cdot)$ represents the average value, $\widetilde{O}$ hides factors poly-logarithmic in the size of $X^{\ell}$ and $K^{\ell}$, and the parameter $M = \max_{p,q} \|A_p\| \|F_q\|$ is the maximum product of norms from subregions of $X^{\ell}$ and $K^{\ell}$.
We see that the number of elements in the input and the kernels appears only with a poly-logarithmic contribution in the running time. This is one of the main advantages of our algorithm, and it allows us to use larger and even exponentially deeper kernels. The number of elements in the input is also hidden in the precision parameter $\eta$ in the running time. Indeed, a sufficiently large fraction of pixels must be sampled from the output of the quantum convolution to retrieve the meaningful information. In the Numerical Simulations (Section 6) we provide empirical estimates for $\eta$. For details about the QRAM, see Appendix F.2.
Following the forward pass, a loss function $\mathcal{L}$ is computed on the output, as for a classical CNN. The backpropagation algorithm is then used to calculate, layer by layer, the gradient of this loss with respect to the elements of the kernels $K^{\ell}$, in order to update them through gradient descent. We state our quantum backpropagation algorithm next; the formal version of this result appears as Theorem E.1.
For the quantum back-propagation algorithm, we introduce a quantum tomography algorithm with
`∞ norm guarantees, that could be of independent interest. It is exponentially faster than tomog-
raphy with `2 norm guarantees and is given as Theorem G.1 in Section G. Numerical simulations
on classifying the MNIST dataset show that our quantum CNN achieves a similar classification
accuracy as the classical CNN.
Each layer of the QCNN consists of a quantum convolution product, the application of a non linear function and pooling operations to prepare the next layer's input. We provide an overview of the main ideas of the algorithm here; the complete technical details are given in the Appendix (Section D).
We propose the first quantum algorithm for performing the convolution product. Our algorithm is based on the observation that the convolution product can be regarded as a matrix product between reshaped matrices. The reshaped input's rows $A^{\ell}_p$ and the reshaped kernel's columns $F^{\ell}_q$ are loaded as quantum states, in superposition. Then the entries of the convolution $\langle A^{\ell}_p | F^{\ell}_q \rangle$ are estimated using a simple quantum circuit for inner product estimation and stored in an auxiliary register, as in Step 1.1 of Algorithm 1.
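For intuition, the following classical sketch (our own illustration, not the quantum circuit itself) emulates what this step produces: every entry $\langle A_p, F_q \rangle$ of the convolution is returned with an additive error of order $\epsilon \|A_p\| \|F_q\|$, which is the kind of guarantee analysed in Appendix D. The function name and the bounded-noise model are assumptions of the sketch.

```python
import numpy as np

def noisy_inner_products(A, F, eps, rng=np.random.default_rng(0)):
    """Classically emulate the quantum inner product estimation.

    A: reshaped input of shape (H'W', HWD), F: reshaped kernels of shape (HWD, D').
    Each entry <A_p, F_q> is perturbed by at most ~2*eps*||A_p||*||F_q||,
    mimicking the amplitude estimation error bound."""
    Y = A @ F                                     # exact convolution as a matrix product
    row_norms = np.linalg.norm(A, axis=1)         # ||A_p||
    col_norms = np.linalg.norm(F, axis=0)         # ||F_q||
    scale = 2 * eps * np.outer(row_norms, col_norms)
    noise = rng.uniform(-1.0, 1.0, size=Y.shape)  # bounded error model
    return Y + scale * noise
```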
One of the difficulties in the design of quantum neural networks is that non linear functions are hard
to implement as unitary operations. We get around this difficulty by applying the non-linear function
f as a boolean circuit to the output of the quantum inner product estimation circuit in Step 1.2 of
Algorithm 1. Most of the non linear functions in the machine learning literature can be implemented
using small sized boolean circuits, our algorithm thus allows for many possible choices of the non-
linear function f (see Appendix F.1 for details on non linear boolean circuits in quantum circuits).
Step 2 of Algorithm 1 develops a quantum importance sampling procedure wherein the pixels with high values of $f(\overline{Y}^{\ell+1}_{pq})$ are read out with higher probability. This is done by encoding these values into the amplitudes of the quantum state using the well known Amplitude Amplification algorithm Brassard et al. (2002). This kind of importance sampling is a task that can be performed easily in the quantum setting and has no direct classical analog. Although it does not lead to asymptotic improvements in the algorithm's running time, it could lead to improvements that are significant in practice.
More precisely, during the measurement of a quantum register in superposition, only one of its values appears, with a probability corresponding to the square of its amplitude. It implies that the
output's pixels measured with higher probability are the ones with the highest value $f(\overline{Y}^{\ell+1}_{p,q})$. Once measured, we read directly from the registers the position $p, q$ and the value itself. Thus we claim that we measure only a fraction of the quantum convolution product output, and that the set of pixels measured collects most of the meaningful information for the CNN, the other pixels being set to 0.
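A minimal classical emulation of this measurement process, assuming non negative pixel values after the non linearity, is sketched below; the function name and the sampling-with-replacement model are our own assumptions.

```python
import numpy as np

def quantum_like_sampling(values, ratio, rng=np.random.default_rng(0)):
    """Emulate the measurement of the amplified state: draw a fraction `ratio`
    of the pixels with probability proportional to their (non negative) value
    and set to 0 everything that was never observed.
    Assumes at least one strictly positive value."""
    flat = values.ravel()
    probs = flat / flat.sum()                        # measurement probabilities
    n_samples = int(ratio * flat.size)
    drawn = rng.choice(flat.size, size=n_samples, replace=True, p=probs)
    out = np.zeros_like(flat)
    out[drawn] = flat[drawn]                         # observed pixels keep their exact value
    return out.reshape(values.shape)
```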
After being measured, each pixel's value and position are stored in a QRAM to be used as the quantum state for the next layer's input. During this phase, it is possible to discard or aggregate some values to perform pooling operations, as described in Step 3 of Algorithm 1. The forward pass for the QCNN thus includes the convolution product, the non linearity $f$ and the pooling operation, in time poly-logarithmic in the kernel's dimensions. In comparison, the classical CNN layer is linear in both kernel and input dimensions.
Note finally that the quantum importance sampling in Step 2 requires the non linear function $f$ to be bounded by a parameter $C > 1$. In our experiments we use the capReLu function, which is a modified ReLu function that becomes constant above $C$.
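As an illustration, such a capped activation can be written in one line with PyTorch (a sketch; the exact implementation used in our simulations may differ):

```python
import torch

def cap_relu(x: torch.Tensor, cap: float) -> torch.Tensor:
    """ReLu that saturates at `cap`: 0 for x < 0, x on [0, cap], cap above."""
    return torch.clamp(x, min=0.0, max=cap)
```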
1: Calculate the gradient for the last layer $L$ using the outputs and the true labels: $\frac{\partial \mathcal{L}}{\partial Y^{L}}$
2: for $\ell = L-1, \cdots, 0$ do
3: Step 1: Modify the gradient. With $\frac{\partial \mathcal{L}}{\partial Y^{\ell+1}}$ stored in the QRAM, set to 0 some of its values to take into account the pooling, tomography and non linearity that occurred in the forward pass of layer $\ell$. These values correspond to positions that have neither been sampled nor selected by pooling, since they have no impact on the final loss.
We now briefly describe the implementation of quantum backpropagation at layer $\ell$. The algorithm assumes that $\frac{\partial \mathcal{L}}{\partial Y^{\ell+1}}$ is known. First, the backpropagation of the quantum convolution product is equivalent to the classical one, and we use the matrix-matrix multiplication formulation to obtain the derivatives $\frac{\partial \mathcal{L}}{\partial F^{\ell}}$ and $\frac{\partial \mathcal{L}}{\partial Y^{\ell}}$. The first one is the desired result and the second one is needed for layer $\ell - 1$. This matrix-matrix multiplication can be implemented as a quantum circuit, by decomposing it into several matrix-vector multiplications, known to be efficient, with a running time depending on the ranks and Frobenius norms of the matrices. We obtain a quantum state corresponding to a
superposition of all derivatives. We again use the $\ell_\infty$ tomography to retrieve each derivative with precision $\delta > 0$ such that, for each kernel weight $F^{\ell}_{s,q}$, its loss derivative $\frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}}$ is approximated by $\overline{\frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}}}$ with an error bounded by $\left| \overline{\frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}}} - \frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}} \right| \le 2\delta \left\| \frac{\partial \mathcal{L}}{\partial F^{\ell}} \right\|_2$. This implies that the gradient descent rule is perturbed by at most $2\delta \left\| \frac{\partial \mathcal{L}}{\partial F^{\ell}} \right\|_2$, see Appendix (Section E.4).
We also take into account the effects of quantum non linearity, quantum measurement and pooling.
The quantum pooling operation is equivalent to the classical one, where pixels that were not selected
during pooling see their derivative set to 0. Quantum measurement is similar, since pixels that
haven’t been measured don’t contribute to the gradient. For the non linearity, as in the classical
case, pixels with negative values were set to zero, hence should have no contribution to the gradient.
Additionally, because we use the capReLu function, pixels larger than the cap $C$ must also have zero derivative. These two rules can be implemented by combining them with the measurement rule, which is an additional step compared to classical backpropagation; see Appendix (Section E.2.2) for details.
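The combined effect of these rules on the gradient can be pictured as a simple mask (an illustration of the rules, not the quantum procedure itself); `sampled_mask`, marking the positions kept by tomography and pooling, is an assumed input.

```python
import torch

def mask_gradient(grad, forward_values, sampled_mask, cap):
    """Zero the gradient where the capReLu derivative vanishes (negative or
    above the cap) or where the pixel was not sampled in the forward pass."""
    keep = (forward_values > 0) & (forward_values < cap) & sampled_mask
    return grad * keep.to(grad.dtype)
```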
6 NUMERICAL SIMULATIONS
As described above, the adaptation of the CNNs to the quantum setting implies some modifications
that could alter the efficiency of the learning or classifying phases. We now present some experiments to show that such modified CNNs can still converge correctly, as the original ones do.
The experiment, using the PyTorch library developed by Paszke et al. (2017), consists of training
classically a small convolutional neural network for which we have added a “quantum” sampling
after each convolution. Instead of parametrising it with the precision η, we have chosen to use the sampling ratio σ that represents the fraction of pixels drawn during tomography. These two definitions are equivalent, as shown in Appendix (Section D.1.5), but the second one is more intuitive regarding
the running time and the simulations.
We also add a noise simulating the amplitude estimation (parameter $\epsilon$), followed by a capReLu instead of the usual ReLu (parameter $C$), and a noise during the backpropagation (parameter $\delta$). In the following results, we observe that our quantum CNN is able to learn and classify visual data from the widely used MNIST dataset. This dataset is made of 60,000 training images and 10,000 testing images of handwritten digits. Each image consists of 28x28 grayscale pixels with values between 0 and 255 (8 bit encoding), before normalization.
Let’s first observe the “quantum” effects on an image of the dataset. In particular, the effect of the
capped non linearity, the introduction of noise and the quantum sampling.
We now present the full simulation of our quantum CNN. In the following, we use a simple network
made of 2 convolution layers, and compare our quantum CNN to the classical one. The first and
second layers are respectively made of 5 and 10 kernels, both of size 7x7. A three-layer fully
connected network is applied at the end and a softmax activation function is applied on the last
layer to detect the predicted outcome over 10 classes (the ten possible digits). Note that we did not introduce pooling, as it is equivalent between the quantum and classical algorithms and did not improve the results of our CNN. The objective of the learning phase is to minimize the loss function, defined
by the negative log likelihood of the classification on the training set. The optimizer used was a
built-in Stochastic Gradient Descent.
Using PyTorch, we have been able to implement the following quantum effects (the first three points
are shown in Figure 1):
- The addition of a noise, to simulate the approximation of amplitude estimation during the forward quantum convolution layer, by adding Gaussian noise centered on 0 and with standard deviation $2\epsilon M$, with $M = \max_{p,q} \|A_p\| \|F_q\|$.
- A modification of the non linearity: a ReLu function which is constant above the cap value $C$.
- A sampling procedure applied to a tensor with a probability distribution proportional to the tensor itself, reproducing the quantum sampling with ratio $\sigma$.
- The addition of a noise during the gradient descent, to simulate the quantum backpropagation, by adding a Gaussian noise centered on 0 with standard deviation $\delta$, multiplied by the norm of the gradient, as given by Equation (28). A minimal sketch of the first and last of these noise injections is given below.
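The sketch below illustrates the two noise injections, under the assumption that the reshaped input rows and kernel columns are available as tensors; the function names are ours, not PyTorch APIs.

```python
import torch

def add_forward_noise(y, A_rows, F_cols, eps):
    """Simulate the amplitude estimation error on the convolution output `y`:
    Gaussian noise with standard deviation 2*eps*M, M = max_p,q ||A_p|| ||F_q||."""
    M = A_rows.norm(dim=1).max() * F_cols.norm(dim=0).max()
    return y + torch.randn_like(y) * (2 * eps * M)

def add_backward_noise(grad, delta):
    """Simulate the quantum backpropagation error: Gaussian noise of standard
    deviation delta scaled by the norm of the gradient."""
    return grad + torch.randn_like(grad) * (delta * grad.norm())
```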
Figure 1: Effects of the QCNN on a 28x28 input image. From left to right: original image, image after applying a capReLu activation function with a cap $C$ at 2.0, introduction of a strong noise during amplitude estimation with $\epsilon = 0.5$, quantum sampling with ratio $\sigma = 0.4$ that samples the highest values in priority. The useful information tends to be conserved in this example. The side gray scale indicates the value of each pixel. Note that during the QCNN layer, a convolution is supposed to happen before the last image, but we chose not to perform it for visualisation purposes.
The CNN used for this simulation may seem "small" compared to standard architectures such as AlexNet, developed by Krizhevsky et al. (2012), or VGG-16 by Simonyan & Zisserman (2014), or those used in industry. However, simulating this small QCNN on a classical computer was already very computationally intensive and time consuming, due to the "quantum" sampling task, apparently not optimized for a classical implementation in PyTorch. Every single training curve shown in Figure 9 could last for 4 to 8 hours. Hence adding more convolutional layers was not practical. Similarly, we did not compute the loss on the whole testing set (10,000 images) during the training to plot the testing curve. However, we have computed the test losses and accuracies once the model was trained (see Table 4), in order to detect potential overfitting cases.
We now present the result of the training phase for a quantum version of this CNN, where partial quantum sampling is applied, for different sampling ratios (number of samples taken from the resulting convolution). Since the quantum sampling gives more probability to observe high value pixels, we expect to be able to learn correctly even with small ratios ($\sigma \le 0.5$). We compare these training curves to the classical one. The learning has been done over two epochs, meaning that the whole dataset
is used twice. The following plots show the evolution of the loss L during the iterations on batches.
This is the standard indicator of the good convergence of a neural network learning phase. We can
compare the evolution of the loss between a classical CNN and our QCNN for different parameters.
Most results are presented in Appendix (Section H).
Our simulations show that the QCNN is able to learn despite the introduction of noise, tensor sampling and other modifications. In particular, it shows that only a fraction of the information is meaningful for the neural network, and that the quantum algorithm captures this information preferentially.
This learning can be more or less efficient depending on the choice of the key parameters. For de-
cent values of these parameters, the QCNN is able to converge during the training phase. It can then
classify correctly on both training and testing set, indicating neither overfitting nor underfitting.
We notice that the learning curves sometimes present a late start before the convergence initializes, in particular for small sampling ratios. This late start can be due to the random initialization of the kernel weights, which performs a meaningless convolution, a case where the quantum sampling of the output is of no interest. However, it is very interesting to see that despite this late start, the kernels start converging once they have found a good combination.
Overall, it is possible that the QCNN presents some behaviors that have no classical equivalence.
Understanding their potential effects, positive or negative, is an open question, all the more so as
the effects of the classical CNN’s hyperparameters are already a topic of active research, see the
work of Samek et al. (2017) for details. Note also that the neural network used in this simulation is
Figure 2: Training curves comparison between the classical CNN and the Quantum CNN (QCNN) for $\epsilon = 0.01$, $C = 10$, $\delta = 0.01$ and the sampling ratio $\sigma$ from 0.1 to 0.5. We can observe a learning phase similar to the classical one, even for a weak sampling of 20% or 30% of each convolution output, which tends to show that the meaningful information is distributed only at certain locations of the images, consistently with the purpose of the convolution layer. Even for a very low sampling ratio of 10%, we observe a convergence despite a late start.
rather small. A natural follow-up experiment would be to simulate a quantum version of a standard deeper CNN (AlexNet or VGG-16), possibly on more complex datasets, such as CIFAR-10 developed by Krizhevsky & Hinton (2009) or Fashion-MNIST by Xiao et al. (2017).
7 CONCLUSIONS
We have presented a quantum algorithm for evaluating and training convolutional neural networks
(CNN). At the core of this algorithm, we have developed a novel quantum algorithm for computing
a convolution product between two tensors, with a substantial speed up. This technique could be
reused in other signal processing tasks that could benefit from an enhancement by a quantum computer. Layer by layer, convolutional neural networks process and extract meaningful information. Following this idea of learning the most important features first, we have proposed a new approach to quantum tomography where the most meaningful information is sampled with higher probability, hence reducing the complexity of our algorithm.
Our QCNN is complete in the sense that almost all classical architectures can be implemented in a
quantum fashion: any (non negative and upper bounded) non linearity, pooling, number of layers
and size of kernels are available. Our circuit is shallow and could be run on relatively small quantum computers. One could repeat the main loop many times on the same shallow circuit, since performing the convolution product is simple and is similar for all layers. The pooling and non linearity are included in the loop. Our building block approach, layer by layer, allows high modularity, and can be combined with the work on quantum feedforward neural networks developed by Allcock et al. (2018).
The running time presents a speedup compared to the classical algorithm, due to fast linear algebra when computing the convolution product, and by only sampling the important values from the resulting quantum state. This speedup can be highly significant in cases where the number of channels $D^{\ell}$ in the input tensor is high (high dimensional time series, video sequences, game playing) or when the number of kernels $D^{\ell+1}$ is large, allowing deep architectures for CNNs, which was the case in the recent breakthrough of DeepMind's AlphaGo algorithm of Silver et al. (2016). The quantum CNN also allows larger kernels, which could be used for larger input images, since the size of the kernels must be a constant fraction of the input in order to recognize patterns. However, despite our new techniques to reduce the complexity, applying a non linearity and reusing the result of a layer for the next layer make register encoding and state tomography mandatory, hence preventing an exponential speedup in the number of input parameters.
Finally, we have presented a backpropagation algorithm that can also be implemented as a quantum circuit. The numerical simulations on a small CNN show that despite the introduction of noise and sampling, the QCNN can efficiently learn to classify visual data from the MNIST dataset, achieving an accuracy similar to the classical CNN.
REFERENCES
Jonathan Allcock, Chang-Yu Hsieh, Iordanis Kerenidis, and Shengyu Zhang. Quantum algorithms
for feedforward neural networks. arXiv preprint arXiv:1812.03089, 2018.
Kerstin Beer, Dmytro Bondarenko, Terry Farrelly, Tobias J Osborne, Robert Salzmann, and Ramona
Wolf. Efficient learning for deep quantum neural networks. arXiv preprint arXiv:1902.10445,
2019.
Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Larry Jackel, Urs Muller, and Karol Zieba. VisualBackProp: efficient visualization of CNNs. arXiv preprint arXiv:1611.05418, 2016.
Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification
and estimation. Contemporary Mathematics, 305:53–74, 2002.
Shantanav Chakraborty, András Gilyén, and Stacey Jeffery. The power of block-encoded ma-
trix powers: improved regression techniques via faster Hamiltonian simulation. arXiv preprint
arXiv:1804.01973, 2018.
Iris Cong, Soonwon Choi, and Mikhail D Lukin. Quantum convolutional neural networks. arXiv
preprint arXiv:1810.03787, 2018.
Edward Farhi and Hartmut Neven. Classification with quantum neural networks on near term pro-
cessors. arXiv preprint arXiv:1802.06002, 2018.
Daniel George and EA Huerta. Deep learning for real-time gravitational wave detection and parameter estimation: Results with Advanced LIGO data. Physics Letters B, 778:64–70, 2018.
Maxwell Henderson, Samriddhi Shakya, Shashindra Pradhan, and Tristan Cook. Quanvolu-
tional neural networks: Powering image recognition with quantum circuits. arXiv preprint
arXiv:1904.04767, 2019.
Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. Proceedings of the
8th Innovations in Theoretical Computer Science Conference, 2017a.
Iordanis Kerenidis and Anupam Prakash. Quantum gradient descent for linear systems and least
squares. arXiv:1704.04992, 2017b.
Iordanis Kerenidis and Anupam Prakash. A quantum interior point method for LPs and SDPs.
arXiv:1808.09266, 2018.
Iordanis Kerenidis, Jonas Landman, Alessandro Luongo, and Anupam Prakash. q-means: A
quantum algorithm for unsupervised machine learning. Neural Information Processing systems
(NeurIPS), 2019.
Nathan Killoran, Thomas R Bromley, Juan Miguel Arrazola, Maria Schuld, Nicolás Quesada, and
Seth Lloyd. Continuous-variable quantum neural networks. arXiv preprint arXiv:1806.06871,
2018.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech-
nical report, Citeseer, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Yann LeCun, L Bottou, Yoshua Bengio, and P Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 1998.
Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and
unsupervised machine learning. arXiv, 1307.0411:1–11, 7 2013. URL http://arxiv.org/
abs/1307.0411.
Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis.
Nature Physics, 10(9):631, 2014.
Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002a.
Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002b.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
Patrick Rebentrost, Thomas R Bromley, Christian Weedbrook, and Seth Lloyd. Quantum Hopfield neural network. Physical Review A, 98(4):042308, 2018.
Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelli-
gence: Understanding, visualizing and interpreting deep learning models. arXiv preprint
arXiv:1708.08296, 2017.
Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. The quest for a quantum neural network.
Quantum Information Processing, 13(11):2567–2586, 2014.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Nathan Wiebe, Ashish Kapoor, and Krysta M Svore. Quantum Algorithms for Nearest-Neighbor
Methods for Supervised and Unsupervised Learning. arXiv:1401.2142v2, 2014a. URL https:
//arxiv.org/pdf/1401.2142.pdf.
Nathan Wiebe, Ashish Kapoor, and Krysta M Svore. Quantum deep learning. arXiv preprint
arXiv:1412.3489, 2014b.
J Wu. Introduction to convolutional neural networks. https://pdfs.semanticscholar.
org/450c/a19932fcef1ca6d0442cbf52fec38fb9d1e5.pdf, 2017.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
We recall the most important variables for layer `. They represent tensors, their approximations, and
their reshaped versions.
Table 1: Summary of input variables for the `th layer, along with their meaning, dimensions and
corresponding notations. These variables are common for both quantum and classical algorithms.
We have omitted indices for Y ` which don’t appear in our work.
Table 2: Summary of variables describing outputs of the layer `, with the quantum algorithm.
Table 3: Summary of variables describing outputs of the layer `, with the classical algorithm.
Classical and quantum algorithms can be compared with these two diagrams:
$$\begin{cases} \text{Quantum convolution layer:} & X^{\ell} \to |\overline{X}^{\ell+1}\rangle \to |f(\overline{X}^{\ell+1})\rangle \to \overline{X}^{\ell+1} \to \widetilde{X}^{\ell+1} \\ \text{Classical convolution layer:} & X^{\ell} \to X^{\ell+1} \to f(X^{\ell+1}) \to \widetilde{X}^{\ell+1} \end{cases} \tag{3}$$
We finally provide some remarks that could clarify some notations ambiguity:
- Formally, the output of the quantum algorithm is X̃ `+1 . It is used as input for the next layer ` + 1.
But we consider that all variables’ names are reset when starting a new layer: X `+1 ← X̃ `+1 .
- For simplicity, we have sometimes replaced the indices (i`+1 , j `+1 , d`+1 ) by n to index the ele-
ments of the output.
- In Section D.2.2, the input for layer $\ell + 1$ is stored as $A^{\ell+1}$, for which the elements are indexed by $(p', r')$.
We introduce a basic and broad-audience quantum information background necessary for this work.
For a more detailed introduction we recommend Nielsen & Chuang (2002a).
Quantum Bits and Quantum Registers: The bit is the most basic unit of classical information. It can be either in state 0 or 1. Similarly, a quantum bit or qubit is a quantum system that can be in state $|0\rangle$, $|1\rangle$ (the braket notation $|\cdot\rangle$ is a reminder that the bit considered is a quantum system) or in a superposition of both states $\alpha|0\rangle + \beta|1\rangle$ with coefficients $\alpha, \beta \in \mathbb{C}$ such that $|\alpha|^2 + |\beta|^2 = 1$. The amplitudes $\alpha$ and $\beta$ are linked to the probabilities of observing either 0 or 1 when measuring the qubit, since $P(0) = |\alpha|^2$ and $P(1) = |\beta|^2$.
Before the measurement, any superposition is possible, which gives quantum information special abilities in terms of computation. With $n$ qubits, the $2^n$ possible binary combinations can exist simultaneously, each with a specific amplitude. For instance we can consider the uniform superposition $\frac{1}{\sqrt{2^n}} \sum_{i=0}^{2^n - 1} |i\rangle$, where $|i\rangle$ represents the $i$-th binary combination (e.g. $|01\cdots1001\rangle$). Multiple qubits together are often called a quantum register.
In its most general formulation, a quantum state with $n$ qubits can be seen as a vector in a complex Hilbert space of dimension $2^n$. This vector must be normalized under the $\ell_2$-norm, to guarantee that the squared amplitudes sum to 1.
Quantum Computation: To process qubits and therefore quantum registers, we use quantum
gates. These gates are unitary operators in the Hilbert space as they should map unit-norm vectors
to unit-norm vectors. Formally, we can see a quantum gate acting on $n$ qubits as a matrix $U \in \mathbb{C}^{2^n \times 2^n}$ such that $UU^\dagger = U^\dagger U = I$, where $U^\dagger$ is the conjugate transpose of $U$. Some basic single qubit gates include the NOT gate $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ that inverts $|0\rangle$ and $|1\rangle$, or the Hadamard gate $\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ that maps $|0\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$ and $|1\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$, creating quantum superposition. Finally, multi-qubit gates exist, such as the Controlled-NOT that applies a NOT gate on a target qubit conditioned on the state of a control qubit.
The main advantage of quantum gates is their ability to be applied to a superposition of inputs.
Indeed, given a gate $U$ such that $U|x\rangle \mapsto |f(x)\rangle$, we can apply it to all possible combinations of $x$ at once: $U\left(\frac{1}{C}\sum_x |x\rangle\right) \mapsto \frac{1}{C}\sum_x |f(x)\rangle$.
We now state some primitive quantum circuits, which we will use in our algorithm. For two integers $i$ and $j$, we can check their equality with the mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|[i = j]\rangle$. For two real numbers $a > 0$ and $\delta > 0$, we can compare them using $|a\rangle|\delta\rangle|0\rangle \mapsto |a\rangle|\delta\rangle|[a \le \delta]\rangle$. Finally, for a real number $a > 0$, we can obtain its square $|a\rangle|0\rangle \mapsto |a\rangle|a^2\rangle$. Note that these circuits are basically reversible versions of the classical ones and are linear in the number of qubits used to encode the input values.
Any classical boolean function can be implemented as a quantum unitary, even though this seems at first contradictory with the requirements of unitaries (reversibility, linearity). Let $\sigma : \mathbb{R} \mapsto \mathbb{R}$ be a classical function; we define $U_\sigma$ as the unitary that acts as $U_\sigma |x\rangle|0\rangle \mapsto |x\rangle|\sigma(x)\rangle$. Using a second quantum register to encode the result of the function, the properties of quantum unitaries are respected.
Knowing some basic principles of quantum information, the next step is to understand how data can
be efficiently encoded using quantum states. While several approaches could exist, we present the
most common one called amplitude encoding, which leads to interesting and efficient applications.
Let $x \in \mathbb{R}^d$ be a vector with components $(x_1, \cdots, x_d)$. Using only $\lceil \log(d) \rceil$ qubits, we can form $|x\rangle$, the quantum state encoding $x$, given by $|x\rangle = \frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j |j\rangle$. We see that the $j$-th component $x_j$ becomes the amplitude of $|j\rangle$, the $j$-th binary combination (or equivalently the $j$-th vector in the standard basis). Each amplitude must be divided by $\|x\|$ to preserve the unit $\ell_2$-norm of $|x\rangle$. Similarly, for a matrix $A \in \mathbb{R}^{n \times d}$, or equivalently for $n$ vectors $A_i$ for $i \in [n]$, we can express each row of $A$ as $|A_i\rangle = \frac{1}{\|A_i\|}\sum_{j=0}^{d-1} A_{ij} |j\rangle$.
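As a purely classical illustration of amplitude encoding (our own sketch, not a quantum subroutine), the following normalizes a vector into amplitudes and samples basis states with the corresponding probabilities:

```python
import numpy as np

def amplitude_encode(x):
    """Return the amplitudes of |x> and the number of qubits needed."""
    amplitudes = x / np.linalg.norm(x)           # divide by ||x|| to get a unit vector
    n_qubits = int(np.ceil(np.log2(len(x))))     # ceil(log d) qubits
    return amplitudes, n_qubits

def measure(amplitudes, shots, rng=np.random.default_rng(0)):
    """Sampling |x> returns index j with probability |x_j|^2 / ||x||^2."""
    probs = amplitudes ** 2
    probs = probs / probs.sum()                  # guard against rounding error
    return rng.choice(len(amplitudes), size=shots, p=probs)
```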
We can now explain an important definition, the ability to have quantum access to a matrix. This will be a requirement for many algorithms.

By using appropriate data structures, the first mapping can be reduced to the ability to perform a mapping of the form $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|A_{ij}\rangle$. The second requirement can be replaced by the ability to perform $|i\rangle|0\rangle \mapsto |i\rangle|\|A_i\|\rangle$, or simply by the knowledge of each norm. Therefore, using matrices such that all rows $A_i$ have the same norm makes it simpler to obtain the quantum access.
The time or complexity T necessary for the quantum access can be reduced to polylogarithmic
dependence in n and d if we consider the access to a Quantum Memory or QRAM. The QRAM
Kerenidis & Prakash (2017a) is a specific data structure from which a quantum circuit can allow
quantum access to data in time O(log (nd)).
Theorem B.1 (QRAM data structure, see Kerenidis & Prakash (2017a)) Let $A \in \mathbb{R}^{n \times d}$; there is a data structure to store the rows of $A$ such that,
1. The time to insert, update or delete a single entry $A_{ij}$ is $O(\log^2(n))$.
2. A quantum algorithm with access to the data structure can perform the following unitaries in time $T = O(\log^2 n)$:
(a) $|i\rangle|0\rangle \to |i\rangle|A_i\rangle$ for $i \in [n]$.
(b) $|0\rangle \to \sum_{i \in [n]} \|A_i\| \, |i\rangle$.
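The data structure behind this theorem can be pictured as one binary tree per row of $A$: the leaves store the squared entries (together with their signs) and each internal node stores the sum of its subtree, so that the root holds $\|A_i\|^2$. The sketch below is a simplified classical illustration of the update rule only; it ignores the quantum addressing.

```python
import math

class QRAMRow:
    """Binary tree over one row A_i: leaf k holds A_ik^2 and its sign, each
    internal node holds the sum of its children, the root holds ||A_i||^2.
    An update touches one root-to-leaf path, i.e. O(log d) nodes."""
    def __init__(self, d):
        self.depth = max(1, math.ceil(math.log2(d)))
        self.tree = [0.0] * (2 ** (self.depth + 1))   # 1-indexed heap layout
        self.sign = [1.0] * (2 ** self.depth)         # sign kept to recover A_ik

    def update(self, k, value):
        node = 2 ** self.depth + k                    # leaf index of entry k
        delta = value ** 2 - self.tree[node]
        self.sign[k] = 1.0 if value >= 0 else -1.0
        while node >= 1:                              # propagate the change upward
            self.tree[node] += delta
            node //= 2

    def squared_norm(self):
        return self.tree[1]                           # root = ||A_i||^2
```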
We now state important methods for processing the quantum information. Their goal is to store some information either in the quantum state's amplitudes or in a quantum register as a bitstring.
Theorem B.3 (Conditional Rotation) Given the quantum state $|a\rangle$, with $a \in [-1, 1]$, it is possible to perform $|a\rangle|0\rangle \mapsto |a\rangle\left(a|0\rangle + \sqrt{1 - a^2}\,|1\rangle\right)$ with complexity $\widetilde{O}(1)$.
Using Theorem F.3 followed by Theorem F.2, it is then possible to transform the state $\frac{1}{\sqrt{d}}\sum_{j=0}^{d-1}|x_j\rangle$ into $\frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j |x_j\rangle$.
In addition to amplitude estimation, we will make use of a tool developed in Wiebe et al. (2014a) to boost the probability of getting a good estimate for the inner product required for the quantum convolution algorithm. At a high level, we take multiple copies of the estimator from the amplitude estimation procedure, compute the median, and reverse the circuit to get rid of the garbage. Here we provide a theorem with respect to time and not query complexity.
Theorem B.4 (Median Evaluation, see Wiebe et al. (2014a)) Let $U$ be a unitary operation that maps
$$U : |0^{\otimes n}\rangle \mapsto \sqrt{a}\,|x, 1\rangle + \sqrt{1 - a}\,|G, 0\rangle$$
for some $1/2 < a \le 1$ in time $T$. Then there exists a quantum algorithm that, for any $\Delta > 0$ and for any $1/2 < a_0 \le a$, produces a state $|\Psi\rangle$ such that $\| |\Psi\rangle - |0^{\otimes nL}\rangle|x\rangle \| \le \sqrt{2\Delta}$ for some integer
In recent years, as the field of quantum machine learning grew, its “toolkit” for linear alge-
bra algorithms has become important enough to allow the development of many quantum machine
learning algorithms. We introduce here the important subroutines for this work, without detailing
the circuits or the algorithms.
The next theorems allow us to compute the distance between vectors encoded as quantum states, and to use this idea to perform the k-means algorithm.
Theorem B.5 (Quantum Distance Estimation, Wiebe et al. (2014b); Kerenidis et al. (2019)) Given quantum access in time $T$ to two matrices $U$ and $V$ with rows $u_i$ and $v_j$ of dimension $d$, there is a quantum algorithm that, for any pair $(i, j)$, performs the following mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|\overline{d^2}(u_i, v_j)\rangle$, estimating the euclidean distance between $u_i$ and $v_j$ with precision $|\overline{d^2}(u_i, v_j) - d^2(u_i, v_j)| \le \epsilon$ for any $\epsilon > 0$. The algorithm has a running time given by $\widetilde{O}(T\eta/\epsilon)$, where $\eta = \max_{ij}(\|u_i\|\|v_j\|)$, assuming that $\min_i(\|u_i\|) = \min_i(\|v_i\|) = 1$.
In Theorem F.6, the other parameters in the running time can be interpreted as follows: $\delta$ is the precision in the estimation of the distances, but also in the estimation of the position of the centroids. $\kappa(V)$ is the condition number of $V$ and $\mu(V)$ is defined above (Definition 5). Finally, in the case of well clusterable datasets, which should be the case when we apply k-means during spectral clustering, the running time simplifies to $\widetilde{O}\!\left(T \times \left(k^2 d \frac{\eta(V)^{2.5}}{\delta^3} + k^{2.5}\frac{\eta(V)^2}{\delta^3}\right)\right)$.
Note that the dependence in n is hidden in the time T to load the data. This dependence becomes
polylogarithmic in n if we assume access to a QRAM.
Theorem B.7 (Quantum Matrix Operations, Chakraborty et al. (2018)) Let $M \in \mathbb{R}^{d \times d}$ and $x \in \mathbb{R}^d$. Let $\delta_1, \delta_2 > 0$. If $M$ is stored in appropriate QRAM data structures and the time to prepare $|x\rangle$ is $T_x$, then there exist quantum algorithms that with probability at least $1 - 1/\text{poly}(d)$ return the desired states.
The linear algebra procedures above can also be applied to any rectangular matrix $V \in \mathbb{R}^{n \times d}$ by considering instead the symmetric matrix $\overline{V} = \begin{pmatrix} 0 & V \\ V^T & 0 \end{pmatrix}$.
CNN is a specific type of neural network, designed in particular for image processing or time series.
It uses the Convolution Product as a main procedure for each layer. We will focus on image pro-
cessing with a tensor framework for all elements of the network. Our goal is to explicitly describe
the CNN procedures in a form that can be translated in the context of quantum algorithms.
As a regular neural network, a CNN should learn how to classify any input, in our case images. The
training consists of optimizing a series of parameters, learned on the inputs and their corresponding
labels.
Images, or more generally layers of the network, can be seen as tensors. A tensor is a generalization
of a matrix to higher dimensions. For instance an image of height $H$ and width $W$ can be seen as a matrix in $\mathbb{R}^{H \times W}$, where every pixel is a greyscale value between 0 and 255 (8 bits). However, the three channels of color (RGB: Red Green Blue) must be taken into account, by stacking three such matrices, one for each color. The whole image is then seen as a 3-dimensional tensor in $\mathbb{R}^{H \times W \times D}$
where D is the number of channels. We will see that the Convolution Product in the CNN can be
expressed between 3-tensors (input) and 4-tensors (convolution filters or kernels), the output being
a 3-tensor of different dimensions (spatial size and number of channels).
C.2 ARCHITECTURE
A CNN is composed of 4 main procedures, combined and repeated in any order: Convolution layers, most often followed by an Activation Function, Pooling Layers, and some Fully Connected layers at the end. We will note $\ell$ the current layer.
Convolution Layer : The `th layer is convolved by a set of filters called kernels. The output of this
operation is the (` + 1)th layer. A convolution by a single kernel can be seen as a feature detector,
that will screen over all regions of the input. If the feature represented by the kernel, for instance
a vertical edge, is present in some part of the input, there will be a high value at the corresponding
position of the output. The output is commonly called the feature map of this convolution.
Activation Function: As in a regular neural network, we insert some non linearities, also called activation functions. These are mandatory for a neural network to be able to learn any function. In the case of a CNN, each convolution is often followed by a Rectified Linear Unit function, or ReLu. This is a simple function that sets all negative values of the output to zero, and leaves the positive values as they are.
Pooling Layer : This downsampling technique reduces the dimensionality of the layer, in order
to improve the computation. Moreover, it gives to the CNN the ability to learn a representation
invariant to small translations. Most of the time, we apply a Maximum Pooling or an Average
Pooling. The first one consists of replacing a subregion of P × P elements only by the one with the
maximum value. The second does the same by averaging all values. Recall that the value of a pixel
corresponds to how much a particular feature was present in the previous convolution layer.
Fully Connected Layer: After a certain number of convolution layers, the input has been sufficiently processed so that we can apply a fully connected network. Weights connect each input to each output, where the inputs are all the elements of the previous layer. The last layer should have one node per possible label. Each node value can be interpreted as the probability of the initial image belonging to the corresponding class.
Most of the following mathematical formulations have been very well detailed by Wu (2017).
At layer $\ell$, we consider the convolution of a multi-channel image, seen as a 3-tensor $X^{\ell} \in \mathbb{R}^{H^{\ell} \times W^{\ell} \times D^{\ell}}$. Let's consider a single kernel in $\mathbb{R}^{H \times W \times D^{\ell}}$. Note that its third dimension must match the number of channels of the input, as in Figure 4. The kernel passes over all possible regions of the input and outputs a value for each region, stored in the corresponding element of the output. Therefore the output is 2-dimensional, in $\mathbb{R}^{H^{\ell+1} \times W^{\ell+1}}$.
Figure 4: Convolution of a 3-tensor input (Left) by one 3-tensor kernel (Center). The output (Right) is a matrix for which each entry is an inner product between the kernel and the corresponding overlapping region of the input.
In a CNN, the most general case is to apply several convolution products to the input, each one with a different 3-tensor kernel. Let's consider an input convolved by $D^{\ell+1}$ kernels. We can globally see this process as a whole, represented by one 4-tensor kernel $K^{\ell} \in \mathbb{R}^{H \times W \times D^{\ell} \times D^{\ell+1}}$. As $D^{\ell+1}$ convolutions are applied, there are $D^{\ell+1}$ outputs of 2 dimensions, equivalent to a 3-tensor $X^{\ell+1} \in \mathbb{R}^{H^{\ell+1} \times W^{\ell+1} \times D^{\ell+1}}$.
We can see on Figure 5 that the output's dimensions are modified according to the following rule:
$$H^{\ell+1} = H^{\ell} - H + 1, \qquad W^{\ell+1} = W^{\ell} - W + 1 \tag{4}$$
We omit the details of Padding and Stride, two parameters that control how the kernel moves through the input, but these can easily be incorporated in the algorithms.
An element of $X^{\ell}$ is determined by 3 indices $(i^{\ell}, j^{\ell}, d^{\ell})$, while an element of the kernel $K^{\ell}$ is determined by 4 indices $(i, j, d, d')$. For an element of $X^{\ell+1}$ we use 3 indices $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$. We can express the value of each element of the output $X^{\ell+1}$ with the relation
$$X^{\ell+1}_{i^{\ell+1}, j^{\ell+1}, d^{\ell+1}} = \sum_{i=0}^{H}\sum_{j=0}^{W}\sum_{d=0}^{D^{\ell}} K^{\ell}_{i, j, d, d^{\ell+1}} \, X^{\ell}_{i^{\ell+1}+i,\, j^{\ell+1}+j,\, d} \tag{5}$$
Figure 5: Convolutions of the 3-tensor input $X^{\ell}$ (Left) by one 4-tensor kernel $K^{\ell}$ (Center). Each channel of the output $X^{\ell+1}$ (Right) corresponds to the output matrix of the convolution with one of the 3-tensor kernels.
It is possible to reformulate Equation (5) as a matrix product. For this we have to reshape our objects. We expand the input $X^{\ell}$ into a matrix $A^{\ell} \in \mathbb{R}^{(H^{\ell+1} W^{\ell+1}) \times (H W D^{\ell})}$. Each row of $A^{\ell}$ is a vectorized version of a subregion of $X^{\ell}$. This subregion is a volume of the same size as a single kernel volume $H \times W \times D^{\ell}$. Hence each of the $H^{\ell+1} \times W^{\ell+1}$ rows of $A^{\ell}$ is used for creating one value in $X^{\ell+1}$. Given such a subregion of $X^{\ell}$, the rule for creating the row of $A^{\ell}$ is to stack, channel by channel, a column first vectorized form of each matrix. Then, we reshape the kernel tensor $K^{\ell}$ into a matrix $F^{\ell} \in \mathbb{R}^{(H W D^{\ell}) \times D^{\ell+1}}$, such that each column of $F^{\ell}$ is a column first vectorized version of one of the $D^{\ell+1}$ kernels.
The convolution operation $X^{\ell} * K^{\ell} = X^{\ell+1}$ is equivalent to the following matrix multiplication
$$A^{\ell} F^{\ell} = Y^{\ell+1}, \tag{6}$$
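This reshaping can be checked numerically with the short sketch below (our own illustration). Note that the paper's rule stacks each subregion channel by channel in column-first order; the sketch uses NumPy's default row-major ordering for both $A^{\ell}$ and $F^{\ell}$, a different but consistent ordering, which therefore yields the same product $A^{\ell} F^{\ell}$.

```python
import numpy as np

def im2col(X, H, W):
    """Expand X^l of shape (H_l, W_l, D_l) into A^l of shape (H_{l+1} W_{l+1}, H W D_l)."""
    Hl, Wl, Dl = X.shape
    Ho, Wo = Hl - H + 1, Wl - W + 1                # Equation (4)
    rows = [X[i:i + H, j:j + W, :].reshape(-1)     # one kernel-sized subregion per row
            for i in range(Ho) for j in range(Wo)]
    return np.stack(rows)

def conv_as_matmul(X, K):
    """Return Y^{l+1} = A^l F^l, with F^l the kernels flattened column-wise."""
    H, W, Dl, Do = K.shape
    A = im2col(X, H, W)
    F = K.reshape(H * W * Dl, Do)
    return A @ F                                   # shape (H_{l+1} W_{l+1}, D_{l+1})
```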
In order to develop a quantum algorithm to perform the convolution as described above, we will make use of quantum linear algebra procedures. We will use quantum states proportional to the rows of $A^{\ell}$, noted $|A_p\rangle$, and the columns of $F^{\ell}$, noted $|F_q\rangle$ (we omit the $\ell$ exponent in the quantum states to simplify the notation). These states are given by $|A_p\rangle = \frac{1}{\|A_p\|}\sum_{r=0}^{HWD^{\ell}-1} A_{pr}|r\rangle$ and $|F_q\rangle = \frac{1}{\|F_q\|}\sum_{s=0}^{HWD^{\ell}-1} F_{sq}|s\rangle$. We suppose we can load these vectors in quantum states by performing the following queries:
$$|p\rangle|0\rangle \mapsto |p\rangle|A_p\rangle, \qquad |q\rangle|0\rangle \mapsto |q\rangle|F_q\rangle \tag{8}$$
Such queries, in time poly-logarithmic in the dimension of the vector, can be implemented with
a Quantum Random Access Memory (QRAM). See Section D.2 for more details on the QRAM
update rules and its integration layer by layer.
$$|f(\overline{Y}^{\ell+1})\rangle = \frac{1}{\sqrt{H^{\ell+1} W^{\ell+1} D^{\ell+1}}} \sum_{p,q} |p\rangle|q\rangle|f(\overline{Y}^{\ell+1}_{pq})\rangle|g_{pq}\rangle \tag{9}$$
Because of the precision $\epsilon$ on $|\overline{P}_{pq}\rangle$, our estimation $\overline{Y}^{\ell+1}_{pq} = (2\overline{P}_{pq} - 1)\|A_p\|\|F_q\|$ is obtained with error such that $|\overline{Y}^{\ell+1}_{pq} - Y^{\ell+1}_{pq}| \le 2\epsilon \|A_p\|\|F_q\|$.
In superposition, we can bound this error by $|\overline{Y}^{\ell+1}_{pq} - Y^{\ell+1}_{pq}| \le 2\epsilon M$, where we define $M$ as the maximum product between the norm of one of the $D^{\ell+1}$ kernels and the norm of one of the regions of $X^{\ell}$ of size $HWD^{\ell}$. Finally, since the previous error estimation is valid for all pairs $(p, q)$, the overall error committed on the convolution product can be bounded by $\|\overline{Y}^{\ell+1} - Y^{\ell+1}\|_\infty \le 2\epsilon M$, where $\|\cdot\|_\infty$ denotes the $\ell_\infty$ norm. Recall that $Y^{\ell+1}$ is just a reshaped version of $X^{\ell+1}$. Since the non linearity adds no approximation, we can conclude on the final error committed for a layer of our QCNN:
$$\|f(\overline{X}^{\ell+1}) - f(X^{\ell+1})\|_\infty \le 2\epsilon M \tag{11}$$
At this point, we have established Theorem D.1 as we have created the quantum state (9), with given
precision guarantees, in time poly-logarithmic in ∆ and in the size of X ` and K ` .
We now aim to retrieve classical information from this quantum state. Note that $|\overline{Y}^{\ell+1}_{pq}\rangle$ represents a scalar encoded in as many qubits as needed for the precision, whereas $|A_p\rangle$ represented a vector as a quantum state in superposition, where each element $A_{p,r}$ is encoded in one amplitude (see Section F). The next step can be seen as a way to retrieve both encodings at the same time, which will allow an efficient tomography focused on the values of high magnitude.
The number of amplitude amplification queries is $O\!\left(\sqrt{\frac{\max_{p,q}(f(\overline{Y}_{pq}))}{\frac{1}{HWD}\sum_{p,q} f(\overline{Y}_{pq})}}\right) = O\!\left(\frac{\sqrt{\max_{p,q}(f(\overline{Y}_{pq}))}}{\sqrt{\mathbb{E}_{p,q}(f(\overline{Y}_{pq}))}}\right)$, where the notation $\mathbb{E}_{p,q}(f(\overline{Y}_{pq}))$ represents the average value of the matrix $f(\overline{Y})$. It can also be written $\mathbb{E}(f(\overline{X}))$ as in Result 1: $\mathbb{E}_{p,q}(f(\overline{Y}_{pq})) = \frac{1}{HWD}\sum_{p,q} f(\overline{Y}_{pq})$. At the end of these iterations, we have with high probability modified the state to the following:
$$|f(\overline{Y})\rangle = \frac{1}{\sqrt{HWD}}\sum_{p,q} \alpha'_{pq}\, |p\rangle|q\rangle|f(\overline{Y}_{pq})\rangle \tag{12}$$
where, to respect the normalization of the quantum state, $\alpha'_{pq} = \frac{\alpha_{pq}}{\sqrt{\sum_{p,q}\frac{\alpha_{pq}^2}{HWD}}}$. Eventually, the probability of measuring $(p, q, f(\overline{Y}_{pq}))$ is given by $p(p, q, f(\overline{Y}_{pq})) = \frac{(\alpha'_{pq})^2}{HWD} = \frac{f(\overline{Y}_{pq})}{\sum_{p,q} f(\overline{Y}_{pq})}$. Note that we have used the same name $|f(\overline{Y})\rangle$ for both state (9) and state (12). From now on, this state name will refer only to the latter (12).
We see here that $f(\overline{Y}^{\ell+1}_{pq})$, the value of each pixel, is encoded in both the last register and the amplitude. We will use this property to efficiently extract the exact values of the high magnitude pixels. For simplicity, we will instead use the notation $f(\overline{X}^{\ell+1}_n)$ to denote a pixel's value, with $n \in [H^{\ell+1} W^{\ell+1} D^{\ell+1}]$. Recall that $\overline{Y}^{\ell+1}$ and $\overline{X}^{\ell+1}$ are reshaped versions of the same object.
The pixels with high values will have more probability of being sampled. Specifically, we perform a tomography with $\ell_\infty$ guarantee and precision parameter $\eta > 0$. See Theorem G.1 and Section G for details. The $\ell_\infty$ guarantee allows us to obtain each pixel with error at most $\eta$, and requires $\widetilde{O}(1/\eta^2)$ samples from the state (13). Pixels with low values $f(\overline{X}^{\ell+1}_n) < \eta$ will probably not be sampled due to their low amplitude. Therefore the error committed would be significant, and we adopt the rule of setting them to 0. Pixels with higher values $f(\overline{X}^{\ell+1}_n) \ge \eta$ will be sampled with high probability, and only one appearance is enough to get the exact register value $f(\overline{X}^{\ell+1}_n)$ of the pixel, as it is also written in the last register.
To conclude, let's note $\overline{X}^{\ell+1}_n$ the resulting pixel values after the tomography, and compare them to the real classical outputs $f(X^{\ell+1}_n)$. Recall that the measured values $f(\overline{X}^{\ell+1}_n)$ are approximated with error at most $2\epsilon M$, with $M = \max_{p,q}\|A_p\|\|F_q\|$. The algorithm described above implements the following rules:
$$\begin{cases} |\overline{X}^{\ell+1}_n - f(X^{\ell+1}_n)| \le 2\epsilon M & \text{if } f(\overline{X}^{\ell+1}_n) \ge \eta \\ \overline{X}^{\ell+1}_n = 0 & \text{if } f(\overline{X}^{\ell+1}_n) < \eta \end{cases} \tag{14}$$
Concerning the running time, one could ask which values of $\eta$ are sufficient to obtain enough meaningful pixels. Obviously this highly depends on the output's size $H^{\ell+1} W^{\ell+1} D^{\ell+1}$ and on the output's content itself. But we can view this question from another perspective, by considering that we sample a constant fraction of pixels given by $\sigma \cdot (H^{\ell+1} W^{\ell+1} D^{\ell+1})$, where $\sigma \in [0, 1]$ is a sampling ratio. Because of the particular amplitudes of state (13), the high value pixels will be measured and known with higher probability. The points that are not sampled are set to 0. We see that this approach is equivalent to the $\ell_\infty$ tomography, therefore we have $\frac{1}{\eta^2} = \sigma \cdot H^{\ell+1} W^{\ell+1} D^{\ell+1}$.
We will use this analogy in the numerical simulations (Section 6) to estimate, for a particular QCNN
architecture and a particular dataset of images, which values of σ are enough to allow the neural
network to learn.
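As an illustration with the first layer used in Section 6: a 28x28 input convolved with 7x7 kernels gives $H^{\ell+1} = W^{\ell+1} = 22$ by Equation (4), so with $D^{\ell+1} = 5$ kernels the output contains $H^{\ell+1} W^{\ell+1} D^{\ell+1} = 2420$ pixels. A sampling ratio $\sigma = 0.3$ then corresponds to roughly 726 samples, i.e. an equivalent precision parameter $\eta = 1/\sqrt{\sigma \cdot H^{\ell+1} W^{\ell+1} D^{\ell+1}} \approx 0.037$.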
Figure 7: Activation functions: ReLu (Left) and capReLu (Right) with a cap C at 5.
We wish to detail the use of the QRAM between each quantum convolution layer, and present how the pooling operation can happen during this phase. General results about the QRAM are given as Theorem F.1. Implementation details can be found in the work of Kerenidis & Prakash (2017a). In this section, we will show how to store samples from the output of layer $\ell$ to create the input of layer $\ell + 1$.
Having stored pixels in this way, we can then query $|p'\rangle|0\rangle \mapsto |p'\rangle|A^{\ell+1}_{p'}\rangle$, using the quantum circuit developed by Kerenidis & Prakash (2017b), where we correctly have $|A^{\ell+1}_{p'}\rangle = \frac{1}{\|A^{\ell+1}_{p'}\|}\sum_{r'} A^{\ell+1}_{p'r'}|r'\rangle$. Note that each tree has a logarithmic depth in the number of leaves, hence the running time of writing the output of the quantum convolution layer in the QRAM gives a marginal multiplicative increase, poly-logarithmic in the number of points sampled from $|f(\overline{Y}^{\ell+1})\rangle$, namely $O(\log(1/\eta^2))$.
$$\tilde{d}^{\ell+1} = d^{\ell+1}, \qquad \tilde{j}^{\ell+1} = \left\lfloor \frac{j^{\ell+1}}{P} \right\rfloor, \qquad \tilde{i}^{\ell+1} = \left\lfloor \frac{i^{\ell+1}}{P} \right\rfloor \tag{15}$$
Figure 8: A 2×2 tensor pooling. A point in f (X `+1 ) (left) is given by its position (i`+1 , j `+1 , d`+1 ).
A point in X̃ `+1 (right) is given by its position (ĩ`+1 , j̃ `+1 , d˜`+1 ). Different pooling regions in
f (X `+1 ) have separate colours, and each one corresponds to a unique point in X̃ `+1 .
We now show how any kind of pooling can be efficiently integrated to our QCNN structure. In-
deed the pooling operation will occur during the QRAM update described above, at the end of a
convolution layer. At this moment we will store sampled values according to the pooling rules.
In the quantum setting, the output of layer $\ell$ after tomography is noted $X^{\ell+1}$. After pooling, we will describe it by $\widetilde{X}^{\ell+1}$, which has dimensions $\frac{H^{\ell+1}}{P} \times \frac{W^{\ell+1}}{P} \times D^{\ell+1}$. $\widetilde{X}^{\ell+1}$ will be effectively used as input for layer $\ell + 1$ and its values should be stored in the QRAM to form the trees $\widetilde{T}^{\ell+1}_{p'}$, related to the matrix expansion $\widetilde{A}^{\ell+1}$.
However $X^{\ell+1}$ is not known before the tomography is over. Therefore we have to modify the update rule of the QRAM to implement the pooling in an online fashion, each time a sample from $|f(\overline{X}^{\ell+1})\rangle$ is drawn. Since several sampled values of $|f(\overline{X}^{\ell+1})\rangle$ can correspond to the same leaf $\widetilde{A}^{\ell+1}_{p'r'}$ (points in the same pooling region), we need an overwrite rule, which will depend on the type of pooling. In the case of Maximum Pooling, we simply update the leaf and the parent nodes if the new sampled value is higher than the one already written. In the case of Average Pooling, we replace the actual value by the new averaged value.
In the end, any pooling can be included in the already existing QRAM update. In the worst case, the running time is increased by $\widetilde{O}(P/\eta^2)$, an overhead corresponding to the number of times we need to overwrite existing leaves, with $P$ being a small constant in most cases.
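A sketch of this online update for Maximum Pooling (ignoring the tree bookkeeping of the QRAM and using a plain dictionary as the store) is given below:

```python
def online_max_pool_update(store, i, j, d, value, P):
    """Write one sampled pixel (i, j, d) of f(X^{l+1}) into the pooled layer:
    the pooling cell is (i // P, j // P, d), and for Maximum Pooling we keep
    the largest value observed so far for that cell."""
    cell = (i // P, j // P, d)
    if value > store.get(cell, float("-inf")):
        store[cell] = value
    return store
```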
As we will see in Section E, the final positions $(p, q)$ that were sampled from $|f(\overline{X}^{\ell+1})\rangle$ and selected after pooling must be stored for further use during the backpropagation phase.
We will now summarise the running time for one forward pass of convolution layer $\ell$. With $\widetilde{O}$ we hide the polylogarithmic factors. We first write the running time of the classical CNN layer, which is given by $\widetilde{O}\!\left(H^{\ell+1} W^{\ell+1} D^{\ell+1} \cdot HWD^{\ell}\right)$. For the QCNN, the previous steps prove Result 1 and can be implemented in time $\widetilde{O}\!\left(\frac{1}{\eta^2} \cdot \frac{M\sqrt{C}}{\sqrt{\mathbb{E}(f(\overline{X}^{\ell+1}))}}\right)$. Note that, as explained in Section D.1.5, the quantum running time can also be written $\widetilde{O}\!\left(\sigma H^{\ell+1} W^{\ell+1} D^{\ell+1} \cdot \frac{M\sqrt{C}}{\sqrt{\mathbb{E}(f(\overline{X}^{\ell+1}))}}\right)$, with $\sigma \in [0, 1]$ being the fraction of sampled elements among the $H^{\ell+1} W^{\ell+1} D^{\ell+1}$ of them.
It is interesting to notice that one quantum convolution layer can also include the ReLu operation and the Pooling operation in the same circuit, for no significant increase in the running time, whereas in the classical CNN each operation must be done on the whole data again.
The entire QCNN is made of multiple layers. For the last layer's output, we expect only one possible outcome, or a few in the case of a classification task, which means that the dimension of the quantum output is very small. A full tomography can be performed on the last layer's output in order to calculate the outcome. The loss $L$ is then calculated, as a measure of correctness of the predictions compared to the ground truth. As in the classical CNN, our QCNN should be able to optimize its weights (the elements of the kernels) to minimize the loss by an iterative method.
to perform gradient descent such that $\forall(s,q), \left|\frac{\partial L}{\partial F^{\ell}_{s,q}} - \overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}}\right| \leq 2\delta \left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2$. Let $\frac{\partial L}{\partial Y^{\ell}}$ be the gradient with respect to the $\ell^{th}$ layer. The running time of a single layer $\ell$ for quantum backpropagation is given by
$$O\!\left(\left(\left(\mu(A^{\ell}) + \mu\!\left(\frac{\partial L}{\partial Y^{\ell+1}}\right)\right)\kappa\!\left(\frac{\partial L}{\partial F^{\ell}}\right) + \left(\mu\!\left(\frac{\partial L}{\partial Y^{\ell+1}}\right) + \mu(F^{\ell})\right)\kappa\!\left(\frac{\partial L}{\partial Y^{\ell}}\right)\right)\frac{\log(1/\delta)}{\delta^2}\right) \qquad (16)$$
where for a matrix V , κ(V ) is the condition number and µ(V ) is defined in Equation (5).
After each forward pass, the outcome is compared to the true labels to define a loss. We can update our weights by gradient descent to minimize this loss, and iterate. The main idea behind backpropagation is to compute the derivatives of the loss $L$, layer by layer, starting from the last one.
At layer $\ell$, the derivatives needed to perform the gradient descent are $\frac{\partial L}{\partial F^{\ell}}$ and $\frac{\partial L}{\partial Y^{\ell}}$. The first one represents the gradient of the final loss $L$ with respect to each kernel element, a matrix of values that we will use to update the kernel weights $F^{\ell}_{s,q}$. The second one is the gradient of $L$ with respect to the layer itself and is only needed to calculate the gradient $\frac{\partial L}{\partial F^{\ell-1}}$ at layer $\ell-1$.
Given the layer's input, we will show how to calculate $\frac{\partial L}{\partial F^{\ell}}$, the matrix of derivatives with respect to the elements of the previous kernel matrix $F^{\ell}$. This is the main goal in order to optimize the kernel's weights.
The details of the following calculations can be found in the work of Wu (2017). We will use the notation $vec(X)$ to represent the vectorized form of any tensor $X$. Recall that $A^{\ell}$ is the matrix expansion of the tensor $X^{\ell}$, whereas $Y^{\ell}$ is a matrix reshaping of $X^{\ell}$.
By applying the chain rule $\frac{\partial L}{\partial vec(F^{\ell})^T} = \frac{\partial L}{\partial vec(X^{\ell+1})^T}\frac{\partial vec(X^{\ell+1})}{\partial vec(F^{\ell})^T}$, we can obtain:
$$\frac{\partial L}{\partial F^{\ell}} = (A^{\ell})^T \frac{\partial L}{\partial Y^{\ell+1}} \qquad (17)$$
See the calculation details in the work of Wu (2017). Equation (17) shows that, to obtain the desired gradient, we can just perform a matrix-matrix multiplication between the transposed layer itself ($A^{\ell}$) and the gradient with respect to the previous layer ($\frac{\partial L}{\partial Y^{\ell+1}}$).
Equation (17) also explains why we will need to calculate $\frac{\partial L}{\partial Y^{\ell}}$ in order to backpropagate through layer $\ell-1$. To calculate it, we use the chain rule again for $\frac{\partial L}{\partial vec(X^{\ell})^T} = \frac{\partial L}{\partial vec(X^{\ell+1})^T}\frac{\partial vec(X^{\ell+1})}{\partial vec(X^{\ell})^T}$. Recall that a point in $A^{\ell}$, indexed by the pair $(p, r)$, can correspond to several triplets $(i^{\ell}, j^{\ell}, d^{\ell})$ in $X^{\ell}$. We will use the notation $(p, r) \leftrightarrow (i^{\ell}, j^{\ell}, d^{\ell})$ to express this relation formally. One can show that $\frac{\partial L}{\partial Y^{\ell+1}}(F^{\ell})^T$ is a matrix of the same shape as $A^{\ell}$, and that the chain rule leads to a simple relation to calculate $\frac{\partial L}{\partial Y^{\ell}}$:
$$\left[\frac{\partial L}{\partial X^{\ell}}\right]_{i^{\ell},j^{\ell},d^{\ell}} = \sum_{(p,r)\leftrightarrow(i^{\ell},j^{\ell},d^{\ell})} \left[\frac{\partial L}{\partial Y^{\ell+1}}(F^{\ell})^T\right]_{p,r} \qquad (18)$$
We have shown how to obtain the gradients with respect to the kernels $F^{\ell}$ and to the layer itself $Y^{\ell}$ (or equivalently $X^{\ell}$).
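To illustrate Equations (17) and (18) in the familiar classical setting, here is a minimal NumPy sketch for a toy single-channel convolution (stride 1, no padding, one output channel); the helper names `im2col` and `conv_backward` are ours and do not come from the paper's code.

```python
import numpy as np

def im2col(X, kh, kw):
    """Matrix expansion A_l: each row lists the pixels of one kh x kw patch of X,
    so that the forward pass is the matrix product A_l @ F_l."""
    H, W = X.shape
    rows = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            rows.append(X[i:i + kh, j:j + kw].ravel())
    return np.array(rows)                      # shape (H' * W', kh * kw)

def conv_backward(X, F, dL_dY):
    """Toy version of the gradients of Equations (17) and (18).
    X     : input image, shape (H, W)
    F     : flattened kernel, shape (kh * kw, 1)
    dL_dY : gradient w.r.t. the reshaped output, shape (H' * W', 1)
    """
    kh = kw = int(np.sqrt(F.shape[0]))
    A = im2col(X, kh, kw)

    # Equation (17): dL/dF = (A_l)^T @ dL/dY_{l+1}
    dL_dF = A.T @ dL_dY

    # Equation (18): scatter (dL/dY_{l+1}) (F_l)^T back onto X, summing over all
    # pairs (p, r) that map to the same pixel (i, j).
    dL_dA = dL_dY @ F.T                        # same shape as A
    dL_dX = np.zeros_like(X, dtype=float)
    H, W = X.shape
    p = 0
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            dL_dX[i:i + kh, j:j + kw] += dL_dA[p].reshape(kh, kw)
            p += 1
    return dL_dF, dL_dX
```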
E.1.2 Non Linearity
The activation function also has an impact on the gradient. In the case of the ReLu, we should only cancel the gradient for points with negative values. For points with positive values, the derivatives remain the same since the function is the identity. A formal relation is given by
$$\left[\frac{\partial L}{\partial X^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial f(X^{\ell+1})}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} & \text{if } X^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (19)$$
E.1.3 Pooling
If we take into account the pooling operation, we must change some of the gradients. Indeed, a pixel that hasn't been selected during pooling has no impact on the final loss, thus it should have a gradient equal to 0. We will focus on the case of Max Pooling (Average Pooling relies on a similar idea). To state a formal relation, we will use the notations of Section D.2.2: an element in the output of the layer, the tensor $f(X^{\ell+1})$, is located by the triplet $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$. The tensor after pooling is noted $\widetilde{X}^{\ell+1}$ and its points are located by the triplet $(\tilde{i}^{\ell+1}, \tilde{j}^{\ell+1}, \tilde{d}^{\ell+1})$. During backpropagation, after the calculation of $\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}$, some of the derivatives of $f(X^{\ell+1})$ should be set to zero with the following rule:
$$\left[\frac{\partial L}{\partial f(X^{\ell+1})}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}\right]_{\tilde{i}^{\ell+1},\tilde{j}^{\ell+1},\tilde{d}^{\ell+1}} & \text{if } (i^{\ell+1}, j^{\ell+1}, d^{\ell+1}) \text{ was selected during pooling} \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$
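As a concrete illustration of rules (19) and (20), here is a small NumPy sketch on a single 2D feature map (one fixed depth index) with non-overlapping $P \times P$ max pooling; the function names are ours.

```python
import numpy as np

def relu_backward(X_next, dL_df):
    """Rule (19): zero the gradient where the pre-activation was negative."""
    return np.where(X_next >= 0, dL_df, 0.0)

def maxpool_backward(f_X_next, dL_dXtilde, P):
    """Rule (20): route each pooled gradient to the argmax of its P x P region,
    and set every non-selected position to zero."""
    H, W = f_X_next.shape
    dL_df = np.zeros_like(f_X_next)
    for it in range(H // P):
        for jt in range(W // P):
            block = f_X_next[it*P:(it+1)*P, jt*P:(jt+1)*P]
            k = np.unravel_index(np.argmax(block), block.shape)   # selected position
            dL_df[it*P + k[0], jt*P + k[1]] = dL_dXtilde[it, jt]
    return dL_df
```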
In this section, we want to give a quantum algorithm to perform backpropagation on a layer $\ell$, and detail the impact on the derivatives, given by the following diagram:
$$\frac{\partial L}{\partial F^{\ell}},\ \frac{\partial L}{\partial X^{\ell}} \;\leftarrow\; \frac{\partial L}{\partial \overline{X}^{\ell+1}} \;\leftarrow\; \frac{\partial L}{\partial f(\overline{X}^{\ell+1})} \;\leftarrow\; \frac{\partial L}{\partial X^{\ell+1}} \;\leftarrow\; \frac{\partial L}{\partial \widetilde{X}^{\ell+1}} = \frac{\partial L}{\partial X^{\ell+1}} \qquad (21)$$
We assume that backpropagation has been done on layer $\ell+1$. This means in particular that $\frac{\partial L}{\partial X^{\ell+1}}$ is stored in the QRAM. However, as shown on Diagram (21), $\frac{\partial L}{\partial X^{\ell+1}}$ corresponds formally to $\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}$, and not to $\frac{\partial L}{\partial \overline{X}^{\ell+1}}$. Therefore, we will have to modify the values stored in the QRAM to take into account the non linearity, the tomography and the pooling. We will first consider how to implement $\frac{\partial L}{\partial X^{\ell}}$ and $\frac{\partial L}{\partial F^{\ell}}$ through backpropagation, considering only the convolution product, as if $\frac{\partial L}{\partial \overline{X}^{\ell+1}}$ and $\frac{\partial L}{\partial X^{\ell+1}}$ were the same. Then we will detail how to simply modify $\frac{\partial L}{\partial X^{\ell+1}}$ a priori, by setting some of its values to 0.
In this section we consider only the quantum convolution product, without non linearity, tomography or pooling, hence writing its output directly as $X^{\ell+1}$. Regarding derivatives, the quantum convolution product is equivalent to the classical one. Gradient relations (17) and (18) remain the same. Note that the $\epsilon$-approximation from Section D.1.2 plays no role in these gradient considerations.
The gradient relations being the same, we still have to specify the quantum algorithm that implements the backpropagation and outputs classical descriptions of $\frac{\partial L}{\partial X^{\ell}}$ and $\frac{\partial L}{\partial F^{\ell}}$. We have seen that the two main calculations (17) and (18) are in fact matrix-matrix multiplications, both involving $\frac{\partial L}{\partial Y^{\ell+1}}$, the reshaped form of $\frac{\partial L}{\partial X^{\ell+1}}$. For each, the classical running time is $O(H^{\ell+1}W^{\ell+1}D^{\ell+1} \cdot HWD^{\ell})$.
We know from Theorem F.7 and Theorem G.1 a quantum algorithm to efficiently perform a matrix-vector multiplication and return a classical state with $\ell_\infty$ norm guarantees. For a matrix $V$ and a vector $b$, both accessible from the QRAM, the running time to perform this operation is $O\!\left(\frac{\mu(V)\kappa(V)\log(1/\delta)}{\delta^2}\right)$, where $\kappa(V)$ is the condition number of the matrix and $\mu(V)$ is a matrix parameter defined in Equation (5). The precision parameter $\delta > 0$ is the error committed in the approximation for both Theorems F.7 and G.1.
We can therefore apply these theorems to perform matrix-matrix multiplications, by simply decomposing them into several matrix-vector multiplications. For instance, in Equation (17), the matrix could be $(A^{\ell})^T$ and the different vectors would be each column of $\frac{\partial L}{\partial Y^{\ell+1}}$. The global running time is then given by $\mu(A^{\ell}) + \mu(\frac{\partial L}{\partial Y^{\ell+1}})$ multiplied by $\kappa((A^{\ell})^T \cdot \frac{\partial L}{\partial Y^{\ell+1}})$. Likewise, for Equation (18), we have $\mu(\frac{\partial L}{\partial Y^{\ell+1}}) + \mu(F^{\ell})$ and $\kappa(\frac{\partial L}{\partial Y^{\ell+1}} \cdot (F^{\ell})^T)$.
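The decomposition itself is elementary; the following one-line NumPy sketch (ours) just makes explicit that each column of the right-hand matrix can be handled independently, which is where the quantum matrix-vector routine of Theorem F.7 followed by the $\ell_\infty$ tomography of Theorem G.1 would be plugged in.

```python
import numpy as np

# Sketch (ours): a matrix-matrix product M @ B computed column by column. In the
# quantum setting, `matvec` would be replaced by the Theorem F.7 multiplication
# followed by l_inf tomography, applied to each column of B.
def matmat_by_columns(M, B, matvec=lambda M, b: M @ b):
    return np.column_stack([matvec(M, B[:, j]) for j in range(B.shape[1])])
```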
Note that the dimension of the matrix doesn't appear in the running time, since we tolerate an $\ell_\infty$ norm guarantee for the error instead of an $\ell_2$ guarantee (see Section G for details). The reason why the $\ell_\infty$ tomography is the right approximation here is that the results of these linear algebra operations are rows of the gradient matrices, which are not vectors in a euclidean space but a series of numbers for which we want to be $\delta$-close to the exact values. See the next section for more details.
It is an open question whether one can apply the same sub-sampling technique as in the forward pass (Section D.1) and sample only the highest derivatives of $\frac{\partial L}{\partial X^{\ell}}$, to reduce the computation cost while maintaining a good optimization. We then have to understand which elements of $\frac{\partial L}{\partial X^{\ell+1}}$ must be set to zero to take into account the effects of the non linearity, the tomography and the pooling.
To include the impact of the non linearity, one could apply the same rule as in (19), and simply replace the ReLu by the capReLu. After the non linearity, we obtain $f(\overline{X}^{\ell+1})$, and the gradient relation would be given by
$$\left[\frac{\partial L}{\partial \overline{X}^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial f(\overline{X}^{\ell+1})}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} & \text{if } 0 \leq \overline{X}^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} \leq C \\ 0 & \text{otherwise} \end{cases} \qquad (22)$$
If an element of $\overline{X}^{\ell+1}$ was negative or bigger than the cap $C$, its derivative should be zero during the backpropagation. However, this operation was performed in quantum superposition. In the quantum algorithm, one cannot record at which positions $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$ the activation function was selective or not. The gradient relation (22) therefore cannot be implemented a posteriori. We provide a partial solution to this problem, using the fact that the quantum tomography must also be taken into account for some derivatives. Indeed, only the points $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$ that have been sampled should have an impact on the gradient of the loss. Therefore we replace the previous relation by
$$\left[\frac{\partial L}{\partial \overline{X}^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial X^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} & \text{if } (i^{\ell+1}, j^{\ell+1}, d^{\ell+1}) \text{ was sampled} \\ 0 & \text{otherwise} \end{cases} \qquad (23)$$
Nonetheless, we can argue that this approximation is tolerable. In the first case, where $\overline{X}^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} < 0$, the derivatives cannot be set to zero as they should. But in practice their values will be zero after the activation function, so such points would have no chance of being sampled; in conclusion their derivatives would be zero as required. In the other case, where $\overline{X}^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} > C$, the derivatives cannot be set to zero either, but the points have a high probability of being sampled. Therefore their derivative will remain unchanged, as if we were using a ReLu instead of a capReLu. In cases where the cap $C$ is high enough, this shouldn't be a source of disadvantage in practice.
Similarly, the pooling selection performed at the end of layer $\ell$ translates into the rule
$$\left[\frac{\partial L}{\partial X^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}\right]_{\tilde{i}^{\ell+1},\tilde{j}^{\ell+1},\tilde{d}^{\ell+1}} & \text{if } (i^{\ell+1}, j^{\ell+1}, d^{\ell+1}) \text{ was selected during pooling} \\ 0 & \text{otherwise} \end{cases} \qquad (24)$$
Note that we know $\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}$, as it is equal to $\frac{\partial L}{\partial X^{\ell+1}}$, the gradient with respect to the input of layer $\ell+1$, known by assumption and stored in the QRAM.
In conclusion, given $\frac{\partial L}{\partial Y^{\ell+1}}$ in the QRAM, the quantum backpropagation first consists in applying relation (24) followed by (23). The effective gradient now takes into account the non linearity, the tomography and the pooling that occurred during layer $\ell$. We can then apply the quantum algorithm for matrix-matrix multiplication that implements relations (18) and (17).
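The following NumPy sketch (ours) spells out this a-priori correction on a classical array: rule (24) routes each stored value back to the position selected during pooling, and rule (23) then keeps only the positions that were actually sampled; the dictionary and set recording those positions are illustrative stand-ins for the bookkeeping stored during the forward pass.

```python
import numpy as np

# Sketch (ours) of the corrections (24) then (23) applied to the stored gradient
# before the matrix products (17)-(18). `pooled_positions` maps a pooled index
# (i~, j~, d~) to the full-resolution index (i, j, d) selected during pooling, and
# `sampled_positions` lists the indices kept by the tomography.
def correct_gradient(dL_dXtilde, pooled_positions, sampled_positions, shape):
    grad = np.zeros(shape)
    # Rule (24): route each pooled gradient back to the selected position.
    for pos_out, pos_in in pooled_positions.items():
        grad[pos_in] = dL_dXtilde[pos_out]
    # Rule (23): zero every position that was not sampled during tomography.
    mask = np.zeros(shape, dtype=bool)
    for pos in sampled_positions:
        mask[pos] = True
    return np.where(mask, grad, 0.0)
```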
Note that the steps in Algorithm 2 could also be reversed: during the backpropagation of layer $\ell+1$, when storing the values of each element of $\frac{\partial L}{\partial Y^{\ell+1}}$ in the QRAM, one can already take into account (24) and (23) of layer $\ell$. In this case we directly store $\frac{\partial L}{\partial \overline{X}^{\ell+1}}$, at no supplementary cost.
Therefore, the running time of the quantum backpropagation for one layer $\ell$, given as Algorithm 2, corresponds to the sum of the running times of the circuits implementing relations (17) and (18). We finally obtain
$$O\!\left(\left(\left(\mu(A^{\ell}) + \mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right)\right)\kappa\!\left((A^{\ell})^T \cdot \tfrac{\partial L}{\partial Y^{\ell+1}}\right) + \left(\mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right) + \mu(F^{\ell})\right)\kappa\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}} \cdot (F^{\ell})^T\right)\right)\frac{\log(1/\delta)}{\delta^2}\right),$$
which can be rewritten as
$$O\!\left(\left(\left(\mu(A^{\ell}) + \mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right)\right)\kappa\!\left(\tfrac{\partial L}{\partial F^{\ell}}\right) + \left(\mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right) + \mu(F^{\ell})\right)\kappa\!\left(\tfrac{\partial L}{\partial Y^{\ell}}\right)\right)\frac{\log(1/\delta)}{\delta^2}\right) \qquad (25)$$
Besides storing $\frac{\partial L}{\partial X^{\ell}}$, the main output is a classical description of $\frac{\partial L}{\partial F^{\ell}}$, necessary to perform the gradient descent on the parameters of $F^{\ell}$. Section E.4 below details the impact of the quantum backpropagation compared to the classical case, which can be reduced to a simple noise addition during the gradient descent.
In this part we will see the impact of the quantum backpropagation compared to the classical case, which can be reduced to a simple noise addition during the gradient descent. Recall that the gradient descent, in our case, consists in applying the update rule $F^{\ell} \leftarrow F^{\ell} - \lambda \frac{\partial L}{\partial F^{\ell}}$ with learning rate $\lambda$.
Let's note $x = \frac{\partial L}{\partial F^{\ell}}$ and its elements $x_{s,q} = \frac{\partial L}{\partial F^{\ell}_{s,q}}$. From the first result of Theorem F.7 with error $\delta > 0$, and the tomography procedure from Theorem G.1 with the same error $\delta$, we can obtain a classical description $\overline{\frac{x}{\|x\|_2}}$ of $\frac{x}{\|x\|_2}$ with $\ell_\infty$ norm guarantee, such that:
$$\left\|\overline{\frac{x}{\|x\|_2}} - \frac{x}{\|x\|_2}\right\|_\infty \leq \delta$$
in time $\widetilde{O}\!\left(\frac{\kappa(V)\mu(V)\log(1/\delta)}{\delta^2}\right)$, where $V$ is the matrix stored in the QRAM that allows us to obtain $x$, as explained in Section E.2. The $\ell_\infty$ norm tomography is used so that the error $\delta$ is at most the same for each component:
$$\forall(s,q),\quad \left|\overline{\frac{x_{s,q}}{\|x\|_2}} - \frac{x_{s,q}}{\|x\|_2}\right| \leq \delta$$
From the second result of Theorem F.7 we can also obtain an estimate $\overline{\|x\|_2}$ of the norm, with the same error $\delta$, such that
$$\left|\overline{\|x\|_2} - \|x\|_2\right| \leq \delta\|x\|_2$$
in time $\widetilde{O}\!\left(\frac{\kappa(V)\mu(V)}{\delta}\log(1/\delta)\right)$ (which does not affect the overall asymptotic running time). Using both results we can obtain an unnormalized vector $\overline{x}$ close to $x$ such that, by the triangle inequality,
$$\begin{aligned} \|x - \overline{x}\|_\infty &= \left\|\|x\|_2\frac{x}{\|x\|_2} - \overline{\|x\|_2}\,\overline{\frac{x}{\|x\|_2}}\right\|_\infty \\ &\leq \left\|\|x\|_2\frac{x}{\|x\|_2} - \overline{\|x\|_2}\frac{x}{\|x\|_2}\right\|_\infty + \left\|\overline{\|x\|_2}\frac{x}{\|x\|_2} - \overline{\|x\|_2}\,\overline{\frac{x}{\|x\|_2}}\right\|_\infty \\ &\leq 1 \cdot \left|\|x\|_2 - \overline{\|x\|_2}\right| + \overline{\|x\|_2} \cdot \left\|\frac{x}{\|x\|_2} - \overline{\frac{x}{\|x\|_2}}\right\|_\infty \\ &\leq \delta\|x\|_2 + \|x\|_2\delta \leq 2\delta\|x\|_2 \end{aligned}$$
in time $\widetilde{O}\!\left(\frac{\kappa(V)\mu(V)\log(1/\delta)}{\delta^2}\right)$. In conclusion, with the $\ell_\infty$ norm guarantee, having also access to the norm of the result is costless.
Finally, the noisy gradient descent update rule, expressed as $F^{\ell}_{s,q} \leftarrow F^{\ell}_{s,q} - \lambda \overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}}$, can be written in the worst case as
$$\overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}} = \frac{\partial L}{\partial F^{\ell}_{s,q}} \pm 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2 \qquad (26)$$
To summarize, using the quantum linear algebra from Theorem F.7 with the $\ell_\infty$ norm tomography from Theorem G.1, both with error $\delta$, along with a norm estimation with relative error $\delta$ too, we can obtain classically the unnormalized values $\overline{\frac{\partial L}{\partial F^{\ell}}}$ such that $\left\|\overline{\frac{\partial L}{\partial F^{\ell}}} - \frac{\partial L}{\partial F^{\ell}}\right\|_\infty \leq 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2$, or equivalently
$$\forall(s,q),\quad \left|\overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}} - \frac{\partial L}{\partial F^{\ell}_{s,q}}\right| \leq 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2 \qquad (27)$$
Therefore the gradient descent update rule in the quantum case becomes $F^{\ell}_{s,q} \leftarrow F^{\ell}_{s,q} - \lambda \overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}}$, which in the worst case becomes
$$F^{\ell}_{s,q} \leftarrow F^{\ell}_{s,q} - \lambda\left(\frac{\partial L}{\partial F^{\ell}_{s,q}} \pm 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2\right) \qquad (28)$$
This proves Theorem E.1. This update rule can be simulated by the addition of a random relative noise, given as a Gaussian centered on 0 with standard deviation equal to $\delta$. This is how we will simulate the quantum backpropagation in the Numerical Simulations.
Compared to the classical update rule, this corresponds to the addition of noise during the optimization step. This noise scales with $\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2$, which is expected to decrease as training converges. Recall that the gradient descent is already a stochastic process. Therefore, we expect that such noise, with acceptable values of $\delta$, will not disturb the convergence of the gradient, as the following numerical simulations tend to confirm.
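For concreteness, here is one way to emulate the update rule (28) classically, in the spirit of the simulation choice described above; the exact noise scale (per-element relative noise with standard deviation $\delta$, versus the worst-case $2\delta\|\partial L/\partial F^{\ell}\|_2$ bound) is a modeling choice, and the plain-SGD wrapper below is ours.

```python
import numpy as np

# Sketch (ours) of a noisy gradient descent step emulating quantum backpropagation:
# each kernel gradient entry is perturbed by a zero-mean Gaussian whose scale follows
# the worst-case bound of Equation (28). The scaling convention is a modeling choice.
def noisy_update(F, dL_dF, lr, delta, rng=None):
    rng = rng or np.random.default_rng()
    scale = delta * np.linalg.norm(dL_dF)            # ~ delta * ||dL/dF||_2
    noise = rng.normal(0.0, scale, size=dL_dF.shape)
    return F - lr * (dL_dF + noise)
```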
Quantum Bits and Quantum Registers: The bit is the most basic unit of classical information. It can be either in state 0 or 1. Similarly, a quantum bit or qubit is a quantum system that can be in state $|0\rangle$, $|1\rangle$ (the braket notation $|\cdot\rangle$ is a reminder that the bit considered is a quantum system), or in a superposition of both states $\alpha|0\rangle + \beta|1\rangle$ with coefficients $\alpha, \beta \in \mathbb{C}$ such that $|\alpha|^2 + |\beta|^2 = 1$. The amplitudes $\alpha$ and $\beta$ are linked to the probabilities of observing either 0 or 1 when measuring the qubit, since $P(0) = |\alpha|^2$ and $P(1) = |\beta|^2$.
Before the measurement, any superposition is possible, which gives quantum information special abilities in terms of computation. With $n$ qubits, the $2^n$ possible binary combinations can exist simultaneously, each with a specific amplitude. For instance, we can consider the uniform distribution $\frac{1}{\sqrt{2^n}}\sum_{i=0}^{2^n-1}|i\rangle$, where $|i\rangle$ represents the $i^{th}$ binary combination (e.g. $|01\cdots1001\rangle$). Multiple qubits together are often called a quantum register.
In its most general formulation, a quantum state with $n$ qubits can be seen as a vector in a complex Hilbert space of dimension $2^n$. This vector must be normalized under the $\ell_2$-norm, to guarantee that the squared amplitudes sum to 1.
Quantum Computation: To process qubits, and therefore quantum registers, we use quantum gates. These gates are unitary operators in the Hilbert space, as they should map unit-norm vectors to unit-norm vectors. Formally, we can see a quantum gate acting on $n$ qubits as a matrix $U \in \mathbb{C}^{2^n \times 2^n}$ such that $UU^{\dagger} = U^{\dagger}U = I$, where $U^{\dagger}$ is the conjugate transpose of $U$. Some basic single qubit gates include the NOT gate $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ that swaps $|0\rangle$ and $|1\rangle$, or the Hadamard gate $\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ that maps $|0\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$ and $|1\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$, creating the quantum superposition. Finally, multiple-qubit gates exist, such as the Controlled-NOT that applies a NOT gate on a target qubit conditioned on the state of a control qubit.
The main advantage of quantum gates is their ability to be applied to a superposition of inputs. Indeed, given a gate $U$ such that $U|x\rangle \mapsto |f(x)\rangle$, we can apply it to all possible combinations of $x$ at once: $U\!\left(\frac{1}{C}\sum_x |x\rangle\right) \mapsto \frac{1}{C}\sum_x |f(x)\rangle$.
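As a sanity check of these definitions, the following tiny NumPy sketch (ours) simulates the NOT and Hadamard gates as matrices acting on amplitude vectors.

```python
import numpy as np

# Tiny classical simulation (ours) of single-qubit gates acting on amplitude vectors.
NOT = np.array([[0.0, 1.0], [1.0, 0.0]])
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

ket0 = np.array([1.0, 0.0])                 # |0>
plus = H @ ket0                             # (|0> + |1>)/sqrt(2): uniform superposition
print(plus, np.abs(plus) ** 2)              # amplitudes ~0.707, probabilities 0.5 each
print(NOT @ ket0)                           # |1>
```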
We now state some primitive quantum circuits, which we will use in our algorithm: for two integers $i$ and $j$, we can check their equality with the mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|[i=j]\rangle$. For two real numbers $a > 0$ and $\delta > 0$, we can compare them using $|a\rangle|\delta\rangle|0\rangle \mapsto |a\rangle|\delta\rangle|[a\leq\delta]\rangle$. Finally, for a real number $a > 0$, we can obtain its square: $|a\rangle|0\rangle \mapsto |a\rangle|a^2\rangle$.
Note that these circuits are basically a reversible version of the classical ones and are linear in the
number of qubits used to encode the input values.
Knowing some basic principles of quantum information, the next step is to understand how data can
be efficiently encoded using quantum states. While several approaches could exist, we present the
most common one called amplitude encoding, which leads to interesting and efficient applications.
Let $x \in \mathbb{R}^d$ be a vector with components $(x_1, \cdots, x_d)$. Using only $\lceil\log(d)\rceil$ qubits, we can form $|x\rangle$, the quantum state encoding $x$, given by $|x\rangle = \frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j |j\rangle$. We see that the $j^{th}$ component $x_j$ becomes the amplitude of $|j\rangle$, the $j^{th}$ binary combination (or equivalently the $j^{th}$ vector in the standard basis). Each amplitude must be divided by $\|x\|$ to preserve the unit $\ell_2$-norm of $|x\rangle$.
Similarly, for a matrix $A \in \mathbb{R}^{n\times d}$, or equivalently for $n$ vectors $A_i$ with $i \in [n]$, we can express each row of $A$ as $|A_i\rangle = \frac{1}{\|A_i\|}\sum_{j=0}^{d-1} A_{ij}|j\rangle$.
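A classical simulation of amplitude encoding is just an $\ell_2$ normalization (plus padding to a power of two); the short sketch below, with an illustrative function name of ours, makes the $\lceil\log(d)\rceil$-qubit bookkeeping explicit.

```python
import numpy as np

# Sketch (ours) of amplitude encoding as a classical simulation: the state |x> is
# the l2-normalized vector, padded to the next power of two so that it lives on
# ceil(log2(d)) qubits; measurement probabilities are the squared amplitudes.
def amplitude_encode(x):
    x = np.asarray(x, dtype=float)
    n_qubits = int(np.ceil(np.log2(len(x)))) if len(x) > 1 else 1
    amp = np.zeros(2 ** n_qubits)
    amp[:len(x)] = x / np.linalg.norm(x)       # amplitudes x_j / ||x||
    return amp

state = amplitude_encode([3.0, 4.0])           # -> [0.6, 0.8]: P(|0>) = 0.36, P(|1>) = 0.64
```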
We can now explain an important definition, the ability to have quantum access to a matrix. This will be a requirement for many algorithms. By using appropriate data structures, the first mapping can be reduced to the ability to perform a mapping of the form $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|A_{ij}\rangle$. The second requirement can be replaced by the ability to perform $|i\rangle|0\rangle \mapsto |i\rangle|\|A_i\|\rangle$, or simply by the knowledge of each norm. Therefore, using matrices such that all rows $A_i$ have the same norm makes it simpler to obtain quantum access.
The time or complexity $T$ necessary for quantum access can be reduced to a polylogarithmic dependence in $n$ and $d$ if we consider access to a Quantum Memory, or QRAM. The QRAM Kerenidis & Prakash (2017a) is a specific data structure from which a quantum circuit can allow quantum access to the data in time $O(\log(nd))$.
Theorem F.1 (QRAM data structure, see Kerenidis & Prakash (2017a)) Let $A \in \mathbb{R}^{n\times d}$; there is a data structure to store the rows of $A$ such that,
1. The time to insert, update or delete a single entry $A_{ij}$ is $O(\log^2(n))$.
2. A quantum algorithm with access to the data structure can perform the following unitaries in time $T = O(\log^2 n)$:
(a) $|i\rangle|0\rangle \to |i\rangle|A_i\rangle$ for $i \in [n]$.
(b) $|0\rangle \to \frac{1}{\|A\|_F}\sum_{i\in[n]} \|A_i\| |i\rangle$.
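The circuit of Theorem F.1 is quantum, but the underlying bookkeeping can be sketched classically: for each row, a binary tree stores squared entries at the leaves and partial sums at the internal nodes, so a single-entry update only touches one root-to-leaf path. The class below (ours) is a minimal illustration of that tree, not the paper's data structure.

```python
import numpy as np

# Classical sketch (ours) of the tree behind Theorem F.1 for a single row A_i:
# leaves hold squared entries (signs kept aside), internal nodes hold the sum of
# their children, so an update costs O(log d) node writes along one path.
class KPTree:
    def __init__(self, d):
        self.d = 1
        while self.d < d:
            self.d *= 2                    # pad to a power of two
        self.tree = np.zeros(2 * self.d)   # heap layout: node k has children 2k, 2k+1
        self.sign = np.ones(self.d)

    def update(self, j, value):
        """Insert/update entry A_ij with O(log d) node updates."""
        self.sign[j] = np.sign(value) if value != 0 else 1.0
        k = self.d + j
        self.tree[k] = value ** 2
        k //= 2
        while k >= 1:
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]
            k //= 2

    def squared_norm(self):
        return self.tree[1]                # root stores ||A_i||^2
```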
We now state important methods for processing the quantum information. Their goal is to store some information alternatively in the quantum state's amplitudes or in a quantum register as a bitstring.
Theorem F.3 [Conditional Rotation] Given the quantum state $|a\rangle$, with $a \in [-1, 1]$, it is possible to perform $|a\rangle|0\rangle \mapsto |a\rangle\left(a|0\rangle + \sqrt{1-a^2}|1\rangle\right)$ with complexity $\widetilde{O}(1)$.
Using Theorem F.3 followed by Theorem F.2, it is then possible to transform the state $\frac{1}{\sqrt{d}}\sum_{j=0}^{d-1}|x_j\rangle$ into $\frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j|x_j\rangle$.
In addition to amplitude estimation, we will make use of a tool developed in Wiebe et al. (2014a) to boost the probability of getting a good estimate of the inner product required for the quantum convolution algorithm. At a high level, we take multiple copies of the estimator from the amplitude estimation procedure, compute the median, and reverse the circuit to get rid of the garbage. Here we provide a theorem with respect to time and not query complexity.
Theorem F.4 (Median Evaluation, see Wiebe et al. (2014a)) Let $U$ be a unitary operation that maps
$$U : |0^{\otimes n}\rangle \mapsto \sqrt{a}|x, 1\rangle + \sqrt{1-a}|G, 0\rangle$$
for some $1/2 < a \leq 1$ in time $T$. Then there exists a quantum algorithm that, for any $\Delta > 0$ and for any $1/2 < a_0 \leq a$, produces a state $|\Psi\rangle$ such that $\||\Psi\rangle - |0^{\otimes nL}\rangle|x\rangle\| \leq \sqrt{2\Delta}$ for some integer $L$, in time
$$2T\left\lceil \frac{\ln(1/\Delta)}{2\left(|a_0| - \frac{1}{2}\right)^2} \right\rceil.$$
In recent years, as the field of quantum machine learning grew, its “toolkit” of linear algebra algorithms has become rich enough to allow the development of many quantum machine learning algorithms. We introduce here the important subroutines for this work, without detailing the circuits or the algorithms.
The next theorems allow us to compute the distance between vectors encoded as quantum states, and to use this idea to perform the k-means algorithm.
Theorem F.5 [Quantum Distance Estimation, Wiebe et al. (2014b); Kerenidis et al. (2019)] Given quantum access in time $T$ to two matrices $U$ and $V$ with rows $u_i$ and $v_j$ of dimension $d$, there is a quantum algorithm that, for any pair $(i, j)$, performs the following mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|\overline{d^2(u_i, v_j)}\rangle$, estimating the euclidean distance between $u_i$ and $v_j$ with precision $|\overline{d^2(u_i, v_j)} - d^2(u_i, v_j)| \leq \epsilon$ for any $\epsilon > 0$. The algorithm has a running time given by $\widetilde{O}(T\eta/\epsilon)$, where $\eta = \max_{ij}(\|u_i\|\|v_j\|)$, assuming that $\min_i(\|u_i\|) = \min_i(\|v_i\|) = 1$.
In Theorem F.6, the other parameters in the running time can be interpreted as follows: $\delta$ is the precision in the estimation of the distances, but also in the estimation of the position of the centroids. $\kappa(V)$ is the condition number of $V$ and $\mu(V)$ is defined above (Definition 5). Finally, in the case of well-clusterable datasets, which should be the case when we apply k-means during spectral clustering, the running time simplifies to $\widetilde{O}\!\left(T \times \left(k^2 d\,\frac{\eta(V)^{2.5}}{\delta^3} + k^{2.5}\frac{\eta(V)^2}{\delta^3}\right)\right)$.
Note that the dependence in n is hidden in the time T to load the data. This dependence becomes
polylogarithmic in n if we assume access to a QRAM.
Theorem F.7 (Quantum Matrix Operations, Chakraborty et al. (2018)) Let $M \in \mathbb{R}^{d\times d}$ and $x \in \mathbb{R}^d$. Let $\delta_1, \delta_2 > 0$. If $M$ is stored in appropriate QRAM data structures and the time to prepare $|x\rangle$ is $T_x$, then there exist quantum algorithms that, with probability at least $1 - 1/poly(d)$, return a state close to $|Mx\rangle$ (with error $\delta_1$) and an estimate of the norm $\|Mx\|$ (with relative error $\delta_2$); these are respectively the first and second results used in Section E.3.
The linear algebra procedures above can also be applied to any rectangular matrix $V \in \mathbb{R}^{n\times d}$ by considering instead the symmetric matrix $\overline{V} = \begin{pmatrix} 0 & V \\ V^T & 0 \end{pmatrix}$.
Finally, we present a logarithmic time algorithm for vector state tomography that will be used to recover classical information from the quantum states with an $\ell_\infty$ norm guarantee. Given a unitary $U$ that produces a quantum state $|x\rangle = \frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j|j\rangle$, by calling $U$ $O(\log d/\delta^2)$ times, the tomography algorithm is able to reconstruct a vector $\widetilde{X}$ that approximates $|x\rangle$ with $\ell_\infty$ norm guarantee, such that $\|\widetilde{X} - |x\rangle\|_\infty \leq \delta$, or equivalently that $\forall i \in [d], |x_i - \widetilde{X}_i| \leq \delta$. Such a tomography is of interest when the components $x_i$ of a quantum state are not the coordinates of a meaningful vector in some linear space, but just a series of values, so that we don't want an overall guarantee on the vector (which is the case with the usual $\ell_2$ tomography) but a similar error guarantee for each component of the estimation.
Theorem G.1 ($\ell_\infty$ Vector state tomography) Given access to a unitary $U$ such that $U|0\rangle = |x\rangle$ and its controlled version, in time $T(U)$, there is a tomography algorithm with time complexity $O\!\left(T(U)\frac{\log d}{\delta^2}\right)$ that produces a unit vector $\widetilde{X} \in \mathbb{R}^d$ such that $\|\widetilde{X} - x\|_\infty \leq \delta$ with probability at least $1 - 1/poly(d)$.
The proof of this theorem is similar to the proof of the $\ell_2$-norm tomography by Kerenidis & Prakash (2018). However, the $\ell_\infty$ norm tomography introduced in this paper depends only logarithmically, and not linearly, on the dimension $d$. Note that in our case, $T(U)$ will be logarithmic in the dimension.
Theorem G.2 [$\ell_2$ Vector state tomography, Kerenidis & Prakash (2018)] Given access to a unitary $U$ such that $U|0\rangle = |x\rangle$ and its controlled version, in time $T(U)$, there is an algorithm that outputs a classical vector $\widetilde{X} \in \mathbb{R}^d$ with $\ell_2$-norm guarantee $\|\widetilde{X} - x\|_2 \leq \delta$ for any $\delta > 0$, in time $O\!\left(T(U) \times \frac{d\log(d)}{\delta^2}\right)$.
The following version of the Chernoff Bound will be used for the analysis of Algorithm 3.
Theorem G.3 (Chernoff Bound) Let $X_j$, for $j \in [N]$, be independent random variables such that $X_j \in [0, 1]$, and let $X = \sum_{j\in[N]} X_j$. We have the three following inequalities:
1. For $0 < \beta < 1$, $P[X < (1-\beta)E[X]] \leq e^{-\beta^2 E[X]/2}$
2. For $\beta > 0$, $P[X > (1+\beta)E[X]] \leq e^{-\frac{\beta^2}{2+\beta}E[X]}$
3. For $0 < \beta < 1$, $P[|X - E[X]| \geq \beta E[X]] \leq e^{-\beta^2 E[X]/3}$, by combining 1. and 2.
Algorithm 3: $\ell_\infty$ norm vector state tomography.
1: Measure $N = \frac{36\ln d}{\delta^2}$ copies of $|x\rangle$ in the standard basis and count $n_i$, the number of times the outcome $i$ is observed. Store $\sqrt{p_i} = \sqrt{n_i/N}$ in a QRAM data structure.
2: Create $N = \frac{36\ln d}{\delta^2}$ copies of the state $\frac{1}{\sqrt{2}}|0\rangle\sum_{i\in[d]} x_i|i\rangle + \frac{1}{\sqrt{2}}|1\rangle\sum_{i\in[d]} \sqrt{p_i}|i\rangle$.
3: Apply a Hadamard gate on the first qubit of each copy to obtain
$$|\phi\rangle = \frac{1}{2}\sum_{i\in[d]}\left((x_i + \sqrt{p_i})|0, i\rangle + (x_i - \sqrt{p_i})|1, i\rangle\right)$$
4: Measure both registers of each copy in the standard basis, and count $n(0, i)$, the number of times the outcome $(0, i)$ is observed.
5: Set $\sigma(i) = +1$ if $n(0, i) > 0.4\,N p_i$ and $\sigma(i) = -1$ otherwise.
6: Output the unit vector $\widetilde{X}$ such that $\forall i \in [d], \widetilde{X}_i = \sigma(i)\sqrt{p_i}$.
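To see Algorithm 3 in action, here is a classical Monte-Carlo sketch (ours) that emulates the two measurement rounds on a known real amplitude vector and reconstructs $\sigma(i)\sqrt{p_i}$; it is a numerical illustration, not a quantum implementation.

```python
import numpy as np

# Classical Monte-Carlo sketch (ours) of Algorithm 3 on a known real vector x.
def linf_tomography(x, delta, rng=np.random.default_rng(0)):
    x = x / np.linalg.norm(x)
    d = len(x)
    N = int(np.ceil(36 * np.log(d) / delta ** 2))

    # Step 1: measure N copies of |x> in the standard basis.
    counts = rng.multinomial(N, x ** 2)
    p = counts / N

    # Steps 2-4: measure N copies of |phi>; outcome (b, i) has probability
    # (x_i + (-1)^b * sqrt(p_i))^2 / 4.
    probs = np.concatenate([(x + np.sqrt(p)) ** 2, (x - np.sqrt(p)) ** 2]) / 4.0
    counts_phi = rng.multinomial(N, probs / probs.sum())
    n0 = counts_phi[:d]

    # Steps 5-6: fix the signs and output the estimate.
    sigma = np.where(n0 > 0.4 * N * p, 1.0, -1.0)
    return sigma * np.sqrt(p)

x = np.array([0.5, -0.5, 0.5, -0.5])
print(np.max(np.abs(linf_tomography(x, delta=0.1) - x)))   # l_inf error, O(delta)
```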
Theorem G.4 Algorithm 3 produces an estimate $\widetilde{X} \in \mathbb{R}^d$ such that $\|\widetilde{X} - x\|_\infty \leq (1 + \sqrt{2})\delta$ with probability at least $1 - \frac{1}{d^{0.83}}$.
Proving $\|x - \widetilde{X}\|_\infty \leq O(\delta)$ is equivalent to showing that for all $i \in [d]$ we have $|x_i - \widetilde{X}_i| = |x_i - \sigma(i)\sqrt{p_i}| \leq O(\delta)$. Let $S$ be the set of indices defined by $S = \{i \in [d] : |x_i| > \delta\}$. We will separate the proof into the two cases $i \in S$ and $i \notin S$.
Case 1: $i \in S$.
We will show that if $i \in S$, we correctly have $\sigma(i) = \mathrm{sgn}(x_i)$ with high probability. Therefore we will only need to bound $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| - \sqrt{p_i}|$.
We suppose that $x_i > 0$. The value of $\sigma(i)$ correctly determines $\mathrm{sgn}(x_i)$ if the number of times we have measured $(0, i)$ at Step 4 is more than half of its expectation, i.e. $n(0, i) > \frac{1}{2}E[n(0, i)]$. If $x_i < 0$, the same argument holds for $n(1, i)$. We consider the random variable that represents the outcome of a measurement on the state $|\phi\rangle$. The Chernoff Bound, part 1, with $\beta = 1/2$ gives
$$P\left[n(0, i) \leq \frac{1}{2}E[n(0, i)]\right] \leq e^{-E[n(0,i)]/8} \qquad (29)$$
From the definition of $|\phi\rangle$ we have $E[n(0, i)] = \frac{N}{4}(x_i + \sqrt{p_i})^2$. We will lower bound this value with the following argument.
For the $k^{th}$ measurement of $|x\rangle$, with $k \in [N]$, let $X_k$ be a random variable such that $X_k = 1$ if the outcome is $i$, and 0 otherwise. We define $X = \sum_{k\in[N]} X_k$. Note that $X = n_i = N p_i$ and $E[X] = N x_i^2$. We can apply the Chernoff Bound, part 3, on $X$ with $\beta = 1/2$ to obtain
$$P[|x_i^2 - p_i| \geq x_i^2/2] \leq e^{-N x_i^2/12}$$
We have $N = \frac{36\ln d}{\delta^2}$ and by assumption $x_i^2 > \delta^2$ (since $i \in S$). Therefore, $e^{-N x_i^2/12} \leq e^{-3\ln d} = 1/d^3$.
This proves that the event $|x_i^2 - p_i| \leq x_i^2/2$ occurs with probability at least $1 - \frac{1}{d^3}$ if $i \in S$. This inequality is equivalent to $\sqrt{2p_i/3} \leq |x_i| \leq \sqrt{2p_i}$. Thus, with high probability we have $E[n(0, i)] = \frac{N}{4}(x_i + \sqrt{p_i})^2 \geq 0.82\,N p_i$, since $\sqrt{2p_i/3} \leq |x_i|$. Moreover, since $p_i \geq x_i^2/2$, $E[n(0, i)] \geq 0.82\,N x_i^2/2 \geq 14.7\ln d$. Therefore, equation (29) becomes $P\left[n(0, i) \leq \frac{1}{2}E[n(0, i)]\right] \leq e^{-14.7\ln d/8} \leq \frac{1}{d^{1.83}}$.
We conclude that for $i \in S$, if $n(0, i) > 0.41\,N p_i$, the sign of $x_i$ is determined correctly by $\sigma(i)$ with high probability $1 - \frac{1}{d^{1.83}}$, as indicated in Step 5.
We finally show that $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| - \sqrt{p_i}|$ is bounded. Again by the Chernoff Bound (3.) we have, for $0 < \beta < 1$:
$$P[|x_i^2 - p_i| \geq \beta x_i^2] \leq e^{-\beta^2 N x_i^2/3}$$
By the identity $|x_i^2 - p_i| = ||x_i| - \sqrt{p_i}|\,(|x_i| + \sqrt{p_i})$ we have
$$P\left[\left||x_i| - \sqrt{p_i}\right| \geq \beta\frac{x_i^2}{|x_i| + \sqrt{p_i}}\right] \leq e^{-\beta^2 N x_i^2/3}$$
Since $\sqrt{p_i} > 0$, we have $\beta\frac{x_i^2}{|x_i| + \sqrt{p_i}} \leq \beta\frac{x_i^2}{|x_i|} = \beta|x_i|$, therefore $P\left[\left||x_i| - \sqrt{p_i}\right| \geq \beta|x_i|\right] \leq e^{-\beta^2 N x_i^2/3}$. Finally, by choosing $\beta = \delta/|x_i| < 1$ we have
$$P\left[\left||x_i| - \sqrt{p_i}\right| \geq \delta\right] \leq e^{-36\ln d/3} = 1/d^{12}$$
Case 2: $i \notin S$.
If $i \notin S$, we need to distinguish two sub-cases. When the estimated sign is wrong, i.e. $\sigma(i) = -\mathrm{sgn}(x_i)$, we have to bound $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| + \sqrt{p_i}|$. On the contrary, if it is correct, i.e. $\sigma(i) = \mathrm{sgn}(x_i)$, we have to bound $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| - \sqrt{p_i}| \leq ||x_i| + \sqrt{p_i}|$. Therefore only one bound is necessary.
We use the Chernoff Bound (2.) on the random variable $X$ with $\beta > 0$ to obtain
$$P[p_i > (1+\beta)x_i^2] \leq e^{-\frac{\beta^2}{2+\beta}N x_i^2}$$
We choose $\beta = \delta^2/x_i^2$ and obtain $P[p_i > x_i^2 + \delta^2] \leq e^{-\frac{\delta^4}{3\delta^2}N} = 1/d^{12}$. Therefore, if $i \notin S$, with very high probability $1 - \frac{1}{d^{12}}$ we have $p_i \leq x_i^2 + \delta^2 \leq 2\delta^2$. We can conclude and bound the error:
$$|x_i - \widetilde{X}_i| \leq ||x_i| + \sqrt{p_i}| \leq \delta + \sqrt{2}\delta = (1 + \sqrt{2})\delta$$
Since there are at most $d$ indices $i \notin S$, the probability for this result to be true for all $i \notin S$ is at least $1 - \frac{1}{d^{11}}$. This follows from applying the Union Bound on the event $p_i > x_i^2 + \delta^2$.
Figure 9: Numerical simulations of the training of the QCNN. These training curves represent the evolution of the loss $L$ as we iterate through the MNIST dataset. For each graph, the amplitude estimation error $\epsilon$ (0.1, 0.01), the non linearity cap $C$ (2, 10), and the backpropagation error $\delta$ (0.1, 0.01) are fixed, whereas the quantum sampling ratio $\sigma$ varies from 0.1 to 0.5. We can compare each training curve to the classical learning curve (CNN). Note that these training curves are smoothed over windows of 12 steps for readability.
In the following we report the classification results of the QCNN when applied to the test set (10,000 images). We distinguish two use cases: in Table 4 the QCNN has been trained quantumly as described in this paper, whereas in Table 5 we first trained the classical CNN, then transferred the weights to the QCNN only for the classification. This second use case has a worse global running time than the first one, but we see it as another concrete application: quantum machine learning could be used only for faster classification from a classically generated model, which could be the case for high-rate classification tasks (e.g. for autonomous systems, or classification over many simultaneous inputs). We report the test loss and accuracy for different values of the sampling ratio $\sigma$, the amplitude estimation error $\epsilon$, and, in the first case, the backpropagation noise $\delta$. The cap $C$ is fixed at 10. These values must be compared to the classical CNN classification metrics, for which the loss is 0.129 and the accuracy is 96.1%. Note that we used a relatively small CNN and hence
the accuracy is just over 96%, lower than the best possible accuracy with a larger CNN.
Table 4: QCNN trained with quantum backpropagation on MNIST dataset. With C = 10 fixed.
Table 5: QCNN created from a classical CNN trained on MNIST dataset. With δ = 0.01 and
C = 10 fixed.