Quantum Algorithms For Deep Convolutional Neural Networks
ABSTRACT
1 INTRODUCTION
The growing importance of deep learning in research, in industry and in our society will require extreme computational power, as dataset sizes and the complexity of these algorithms are expected to increase. Quantum computers are a good candidate to answer this challenge. The recent progress in the physical realization of quantum processors and the advances in quantum algorithms increase the importance of understanding their capabilities and limitations. In particular, the field of quantum machine learning has witnessed many innovative algorithms that offer speedups over their classical counterparts Kerenidis et al. (2019); Lloyd et al. (2013; 2014); Kerenidis & Prakash (2017b); Wiebe et al. (2014a).
Quantum deep learning refers to the problem of creating quantum circuits that mimic and enhance the operations of neural networks. It has been studied in several works Allcock et al. (2018); Rebentrost et al. (2018); Wiebe et al. (2014b) but remains challenging, as it is difficult to implement non linearities with quantum unitaries Schuld et al. (2014). In this work we propose a quantum algorithm for convolutional neural networks (CNN), a type of deep learning designed for visual recognition, signal processing and time series. We also provide results of numerical simulations to evaluate the running time and accuracy of the quantum convolutional neural network (QCNN). Note that our algorithm is theoretical and could be compiled on any type of quantum computer (trapped ions, superconducting qubits, cold atoms, photons, etc.).
CNNs were originally developed by LeCun et al. (1998) in the 1980's. They have achieved great practical success over the last decade Krizhevsky et al. (2012) and have been used in cutting-edge domains like autonomous cars Bojarski et al. (2016) and gravitational wave detection George & Huerta (2018). Despite these successes, CNNs suffer from computational bottlenecks due to the size of the optimization space and the complexity of the inner operations; these bottlenecks make deep CNNs resource expensive.
The growing interest in quantum machine learning has led researchers to develop different variants of Quantum Neural Networks (QNN). The quest for designing quantum analogs of neural networks is challenging due to the modular layer architecture of neural networks and the presence of non linearities, pooling, and other non unitary operations, as explained in Schuld et al. (2014). Several strategies have been tried in order to implement some features of neural networks Allcock et al. (2018); Wiebe et al. (2014b); Beer et al. (2019) in the quantum setting.
Variational quantum circuits provide another path to the design of QNNs; this approach has been developed in Farhi & Neven (2018); Henderson et al. (2019); Killoran et al. (2018). A quantum convolutional neural network architecture using variational circuits was recently proposed Cong et al. (2018). However, further work is required to provide evidence that such techniques can outperform classical neural networks in machine learning settings.
2 PRELIMINARIES
We briefly introduce the formalism and notation concerning the classical convolution product and its equivalence with matrix multiplication. More details can be found in the Appendix (Section C). A single layer $\ell$ of the classical CNN does the following operations: from an input image $X^{\ell} \in \mathbb{R}^{H^{\ell} \times W^{\ell} \times D^{\ell}}$, seen as a 3D tensor, and a kernel $K^{\ell} \in \mathbb{R}^{H \times W \times D^{\ell} \times D^{\ell+1}}$, seen as a 4D tensor, it performs a convolution product and outputs $X^{\ell+1} = X^{\ell} * K^{\ell}$, with $X^{\ell+1} \in \mathbb{R}^{H^{\ell+1} \times W^{\ell+1} \times D^{\ell+1}}$. This convolution operation is equivalent to the matrix multiplication $A^{\ell} F^{\ell} = Y^{\ell+1}$, where $A^{\ell}$, $F^{\ell}$ and $Y^{\ell+1}$ are suitably vectorized versions of $X^{\ell}$, $K^{\ell}$ and $X^{\ell+1}$ respectively. The output of layer $\ell$ of the CNN is $f(X^{\ell+1})$, where $f$ is a non linear function.
For a detailed introduction to quantum computing and its applications to machine learning in the
context of this work, we invite the reader to look at Appendix F. We also refer to Nielsen & Chuang
(2002b) for a more complete overview of quantum computing.
In this part we only briefly discuss the core notions of quantum computing. Like a classical bit, a quantum bit (qubit) can be in state $|0\rangle$ or $|1\rangle$, but can also be in a superposition state $\alpha|0\rangle + \beta|1\rangle$ with amplitudes $\alpha, \beta \in \mathbb{C}$ such that $|\alpha|^2 + |\beta|^2 = 1$. With $n$ qubits it is then possible to construct a superposition of the $2^n$ possible binary combinations, each with a specific amplitude. We will note the $i$-th combination (e.g. $|01\cdots110\rangle$) as $|i\rangle$. A vector $v \in \mathbb{R}^d$ can be encoded in a quantum state made of $\lceil \log(d) \rceil$ qubits. This encoding is a quantum superposition, where the components $(v_1, \cdots, v_d)$ of $v$ are used as the amplitudes of the $d$ binary combinations. We note this state $|v\rangle := \frac{1}{\|v\|} \sum_{i \in [d]} v_i |i\rangle$, where $|i\rangle$ is a register representing the $i$-th vector in the standard basis.
Quantum computation proceeds by applying quantum gates which are defined to be unitary matrices acting on 1 or 2 qubits, for example the Hadamard gate that maps $|0\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$ and $|1\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$. The output of the computation is a quantum state that can be measured to obtain
classical information. The measurement of a qubit α |0i + β |1i yields either 0 or 1, with probability
equal to the square of the respective amplitude. A detailed discussion of the results from quantum
machine learning and linear algebra used in this work can be found in Appendix (Section F).
3 MAIN RESULTS
In this paper, we design a quantum convolutional neural network (QCNN) algorithm with a modular
architecture that allows for any number of layers, any number and size of kernels, and that can
support a large variety of non linearities and pooling methods. Our main technical contributions
include a new notion of a quantum convolution product, the development of a quantum sampling
technique well suited for information recovery in the context of CNNs and a proposal for a quantum
backpropagation algorithm for efficient training of the QCNN.
The QCNN can be directly compared to the classical CNN as it has the same inputs and outputs.
We show that it offers a speedup compared to certain cases of classical CNN for both the forward
pass and for training using backpropagation. For each layer, on the forward pass (Algorithm 1),
the speedup is exponential in the size of the layer (number of kernels) and almost quadratic on the
spatial dimension of the input. We next state informally the speedup for the forward pass, the formal
version appears as Theorem D.1.
For each output pixel $X_j^{\ell+1}$, the algorithm guarantees:
$$\begin{cases} |\overline{X}_j^{\ell+1} - f(X_j^{\ell+1})| \le 2\epsilon & \text{if } f(\overline{X}_j^{\ell+1}) \ge \eta \\ \overline{X}_j^{\ell+1} = 0 & \text{if } f(\overline{X}_j^{\ell+1}) < \eta \end{cases} \tag{1}$$
The running time of the algorithm is $\widetilde{O}\!\left(\frac{1}{\eta^{2}} \cdot \frac{M\sqrt{C}}{\sqrt{\mathbb{E}(f(\overline{X}^{\ell+1}))}}\right)$, where $\mathbb{E}(\cdot)$ represents the average value, $\widetilde{O}$ hides factors poly-logarithmic in the size of $X^{\ell}$ and $K^{\ell}$, and the parameter $M = \max_{p,q} \|A_p\| \|F_q\|$ is the maximum product of norms from subregions of $X^{\ell}$ and $K^{\ell}$.
We see that the number of elements in the input and the kernels appears only with a poly-logarithmic contribution in the running time. This is one of the main advantages of our algorithm, and it allows us to use larger and even exponentially deeper kernels. The number of elements in the input is also hidden in the precision parameter $\eta$ in the running time. Indeed, a sufficiently large fraction of pixels must be sampled from the output of the quantum convolution to retrieve the meaningful information. In the Numerical Simulations (Section 6) we provide empirical estimates for $\eta$. For details about the QRAM, see Appendix F.2.
Following the forward pass, a loss function $\mathcal{L}$ is computed on the output, as for a classical CNN. The backpropagation algorithm is then used to calculate, layer by layer, the gradient of this loss with respect to the elements of the kernels $K^{\ell}$, in order to update them through gradient descent. We state our quantum backpropagation algorithm next; the formal version of this result appears as Theorem E.1.
For the quantum back-propagation algorithm, we introduce a quantum tomography algorithm with
`∞ norm guarantees, that could be of independent interest. It is exponentially faster than tomog-
raphy with `2 norm guarantees and is given as Theorem G.1 in Section G. Numerical simulations
on classifying the MNIST dataset show that our quantum CNN achieves a similar classification
accuracy as the classical CNN.
Each layer of the QCNN consists of a quantum convolution product, the application of a non linear function and pooling operations to prepare the next layer's input. We provide an overview of the main ideas of the algorithm here; the complete technical details are given in the Appendix (Section D).
We propose the first quantum algorithm for performing the convolution product. Our algorithm is based on the observation that the convolution product can be regarded as a matrix product between reshaped matrices. The reshaped input's rows $A^{\ell}_p$ and the reshaped kernel's columns $F^{\ell}_q$ are loaded as quantum states, in superposition. Then the entries of the convolution $\langle A^{\ell}_p | F^{\ell}_q \rangle$ are estimated using a simple quantum circuit for inner product estimation and stored in an auxiliary register, as in Step 1.1 of Algorithm 1.
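For intuition, the following classical sketch (our own illustration, not the quantum circuit itself) emulates what this step produces: every entry $\langle A_p, F_q \rangle$ of the convolution is returned with an additive error of order $\epsilon \|A_p\| \|F_q\|$, which is the kind of guarantee analysed in Appendix D. The function name and the bounded-noise model are assumptions of the sketch.

```python
import numpy as np

def noisy_inner_products(A, F, eps, rng=np.random.default_rng(0)):
    """Classically emulate the quantum inner product estimation.

    A: reshaped input of shape (H'W', HWD), F: reshaped kernels of shape (HWD, D').
    Each entry <A_p, F_q> is perturbed by at most ~2*eps*||A_p||*||F_q||,
    mimicking the amplitude estimation error bound."""
    Y = A @ F                                     # exact convolution as a matrix product
    row_norms = np.linalg.norm(A, axis=1)         # ||A_p||
    col_norms = np.linalg.norm(F, axis=0)         # ||F_q||
    scale = 2 * eps * np.outer(row_norms, col_norms)
    noise = rng.uniform(-1.0, 1.0, size=Y.shape)  # bounded error model
    return Y + scale * noise
```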
One of the difficulties in the design of quantum neural networks is that non linear functions are hard
to implement as unitary operations. We get around this difficulty by applying the non-linear function
f as a boolean circuit to the output of the quantum inner product estimation circuit in Step 1.2 of
Algorithm 1. Most of the non linear functions in the machine learning literature can be implemented
using small sized boolean circuits, our algorithm thus allows for many possible choices of the non-
linear function f (see Appendix F.1 for details on non linear boolean circuits in quantum circuits).
Step 2 of Algorithm 1 develops a quantum importance sampling procedure wherein the pixels with high values of $f(\overline{Y}^{\ell+1}_{pq})$ are read out with higher probability. This is done by encoding these values into the amplitudes of the quantum state using the well known Amplitude Amplification algorithm Brassard et al. (2002). This kind of importance sampling is a task that can be performed easily in the quantum setting and has no direct classical analog. Although it does not lead to asymptotic improvements in the algorithm's running time, it could lead to improvements that are significant in practice.
More precisely, during the measurement of a quantum register in superposition, only one of its values appears, with a probability corresponding to the square of its amplitude. It implies that the
output's pixels measured with higher probability are the ones with the highest value $f(\overline{Y}^{\ell+1}_{p,q})$. Once measured, we read directly from the registers the position $p, q$ and the value itself. Thus we claim that we measure only a fraction of the quantum convolution product output, and that the set of pixels measured collects most of the meaningful information for the CNN, the other pixels being set to 0.
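A minimal classical emulation of this measurement process, assuming non negative pixel values after the non linearity, is sketched below; the function name and the sampling-with-replacement model are our own assumptions.

```python
import numpy as np

def quantum_like_sampling(values, ratio, rng=np.random.default_rng(0)):
    """Emulate the measurement of the amplified state: draw a fraction `ratio`
    of the pixels with probability proportional to their (non negative) value
    and set to 0 everything that was never observed.
    Assumes at least one strictly positive value."""
    flat = values.ravel()
    probs = flat / flat.sum()                        # measurement probabilities
    n_samples = int(ratio * flat.size)
    drawn = rng.choice(flat.size, size=n_samples, replace=True, p=probs)
    out = np.zeros_like(flat)
    out[drawn] = flat[drawn]                         # observed pixels keep their exact value
    return out.reshape(values.shape)
```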
After being measured, each pixel's value and position are stored in a QRAM to be used as the quantum state for the next layer's input. During this phase, it is possible to discard or aggregate some values to perform pooling operations, as described in Step 3 of Algorithm 1. The forward pass for the QCNN thus includes the convolution product, the non linearity $f$ and the pooling operation, in time poly-logarithmic in the kernel's dimensions. In comparison, the classical CNN layer is linear in both kernel and input dimensions.
Note finally that the quantum importance sampling in Step 2 requires the non linear function $f$ to be bounded by a parameter $C > 1$. In our experiments we use the capReLu function, which is a modified ReLu function that becomes constant above $C$.
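As an illustration, such a capped activation can be written in one line with PyTorch (a sketch; the exact implementation used in our simulations may differ):

```python
import torch

def cap_relu(x: torch.Tensor, cap: float) -> torch.Tensor:
    """ReLu that saturates at `cap`: 0 for x < 0, x on [0, cap], cap above."""
    return torch.clamp(x, min=0.0, max=cap)
```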
1: Calculate the gradient for the last layer $L$ using the outputs and the true labels: $\frac{\partial \mathcal{L}}{\partial Y^{L}}$
2: for $\ell = L-1, \cdots, 0$ do
3: Step 1: Modify the gradient. With $\frac{\partial \mathcal{L}}{\partial Y^{\ell+1}}$ stored in the QRAM, set to 0 some of its values to take into account the pooling, tomography and non linearity that occurred in the forward pass of layer $\ell$. These values correspond to positions that have neither been sampled nor selected by pooling, since they have no impact on the final loss.
We now briefly describe the implementation of quantum backpropagation at layer $\ell$. The algorithm assumes that $\frac{\partial \mathcal{L}}{\partial Y^{\ell+1}}$ is known. First, the backpropagation of the quantum convolution product is equivalent to the classical one, and we use the matrix-matrix multiplication formulation to obtain the derivatives $\frac{\partial \mathcal{L}}{\partial F^{\ell}}$ and $\frac{\partial \mathcal{L}}{\partial Y^{\ell}}$. The first one is the desired result and the second one is needed for layer $\ell - 1$. This matrix-matrix multiplication can be implemented as a quantum circuit, by decomposing it into several matrix-vector multiplications, known to be efficient, with a running time depending on the ranks and Frobenius norms of the matrices. We obtain a quantum state corresponding to a
superposition of all derivatives. We again use the $\ell_\infty$ tomography to retrieve each derivative with precision $\delta > 0$ such that, for each kernel weight $F^{\ell}_{s,q}$, its loss derivative $\frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}}$ is approximated by $\overline{\frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}}}$ with an error bounded by $\left| \overline{\frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}}} - \frac{\partial \mathcal{L}}{\partial F^{\ell}_{s,q}} \right| \le 2\delta \left\| \frac{\partial \mathcal{L}}{\partial F^{\ell}} \right\|_2$. This implies that the gradient descent rule is perturbed by at most $2\delta \left\| \frac{\partial \mathcal{L}}{\partial F^{\ell}} \right\|_2$, see Appendix (Section E.4).
We also take into account the effects of quantum non linearity, quantum measurement and pooling.
The quantum pooling operation is equivalent to the classical one, where pixels that were not selected
during pooling see their derivative set to 0. Quantum measurement is similar, since pixels that
haven’t been measured don’t contribute to the gradient. For the non linearity, as in the classical
case, pixels with negative values were set to zero, hence should have no contribution to the gradient.
Additionally, because we use the capReLu function, pixels larger than the cap $C$ must also have zero derivative. These two rules can be implemented by combining them with the measurement rule, which is an additional step compared to classical backpropagation; see Appendix (Section E.2.2) for details.
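The combined effect of these rules on the gradient can be pictured as a simple mask (an illustration of the rules, not the quantum procedure itself); `sampled_mask`, marking the positions kept by tomography and pooling, is an assumed input.

```python
import torch

def mask_gradient(grad, forward_values, sampled_mask, cap):
    """Zero the gradient where the capReLu derivative vanishes (negative or
    above the cap) or where the pixel was not sampled in the forward pass."""
    keep = (forward_values > 0) & (forward_values < cap) & sampled_mask
    return grad * keep.to(grad.dtype)
```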
6 NUMERICAL SIMULATIONS
As described above, the adaptation of the CNNs to the quantum setting implies some modifications
that could alter the efficiency of the learning or classifying phases. We now present some experiments to show that such modified CNNs can still converge correctly, as the original ones do.
The experiment, using the PyTorch library developed by Paszke et al. (2017), consists of training
classically a small convolutional neural network for which we have added a “quantum” sampling
after each convolution. Instead of parametrising it with the precision η, we have chosen to use the sampling ratio σ that represents the fraction of pixels drawn during tomography. These two definitions are equivalent, as shown in Appendix (Section D.1.5), but the second one is more intuitive regarding
the running time and the simulations.
We also add a noise simulating the amplitude estimation (parameter $\epsilon$), followed by a capReLu instead of the usual ReLu (parameter $C$), and a noise during the backpropagation (parameter $\delta$). In the following results, we observe that our quantum CNN is able to learn and classify visual data from the widely used MNIST dataset. This dataset is made of 60,000 training images and 10,000 testing images of handwritten digits. Each image consists of 28x28 grayscale pixels with values between 0 and 255 (8 bit encoding), before normalization.
Let’s first observe the “quantum” effects on an image of the dataset. In particular, the effect of the
capped non linearity, the introduction of noise and the quantum sampling.
We now present the full simulation of our quantum CNN. In the following, we use a simple network
made of 2 convolution layers, and compare our quantum CNN to the classical one. The first and
second layers are respectively made of 5 and 10 kernels, both of size 7x7. A three-layer fully
connected network is applied at the end and a softmax activation function is applied on the last
layer to detect the predicted outcome over 10 classes (the ten possible digits). Note that we did not introduce pooling, as it is equivalent between the quantum and classical algorithms and did not improve the results of our CNN. The objective of the learning phase is to minimize the loss function, defined
by the negative log likelihood of the classification on the training set. The optimizer used was a
built-in Stochastic Gradient Descent.
Using PyTorch, we have been able to implement the following quantum effects (the first three points
are shown in Figure 1):
- The addition of a noise, to simulate the approximation of amplitude estimation during the forward quantum convolution layer, by adding Gaussian noise centered on 0 and with standard deviation $2\epsilon M$, with $M = \max_{p,q} \|A_p\| \|F_q\|$.
- A modification of the non linearity: a ReLu function which is constant above the cap value $C$.
- A sampling procedure applied to a tensor with a probability distribution proportional to the tensor itself, reproducing the quantum sampling with ratio $\sigma$.
- The addition of a noise during the gradient descent, to simulate the quantum backpropagation, by adding a Gaussian noise centered on 0 with standard deviation $\delta$, multiplied by the norm of the gradient, as given by Equation (28). A minimal sketch of the first and last of these noise injections is given below.
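The sketch below illustrates the two noise injections, under the assumption that the reshaped input rows and kernel columns are available as tensors; the function names are ours, not PyTorch APIs.

```python
import torch

def add_forward_noise(y, A_rows, F_cols, eps):
    """Simulate the amplitude estimation error on the convolution output `y`:
    Gaussian noise with standard deviation 2*eps*M, M = max_p,q ||A_p|| ||F_q||."""
    M = A_rows.norm(dim=1).max() * F_cols.norm(dim=0).max()
    return y + torch.randn_like(y) * (2 * eps * M)

def add_backward_noise(grad, delta):
    """Simulate the quantum backpropagation error: Gaussian noise of standard
    deviation delta scaled by the norm of the gradient."""
    return grad + torch.randn_like(grad) * (delta * grad.norm())
```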
Figure 1: Effects of the QCNN on a 28x28 input image. From left to right: original image, image after applying a capReLu activation function with a cap $C$ at 2.0, introduction of a strong noise during amplitude estimation with $\epsilon = 0.5$, quantum sampling with ratio $\sigma = 0.4$ that samples the highest values in priority. The useful information tends to be conserved in this example. The side gray scale indicates the value of each pixel. Note that during the QCNN layer, a convolution is supposed to happen before the last image, but we chose not to perform it for visualisation purposes.
The CNN used for this simulation may seem "small" compared to standard architectures such as AlexNet, developed by Krizhevsky et al. (2012), or VGG-16 by Simonyan & Zisserman (2014), or those used in industry. However, simulating this small QCNN on a classical computer was already very computationally intensive and time consuming, due to the "quantum" sampling task, apparently not optimized for a classical implementation in PyTorch. Every single training curve shown in Figure 9 could last for 4 to 8 hours. Hence adding more convolutional layers was not practical. Similarly, we did not compute the loss on the whole testing set (10,000 images) during the training to plot the testing curve. However, we have computed the test losses and accuracies once the model was trained (see Table 4), in order to detect potential overfitting cases.
We now present the result of the training phase for a quantum version of this CNN, where partial quantum sampling is applied, for different sampling ratios (number of samples taken from the resulting convolution). Since the quantum sampling gives more probability to observe high value pixels, we expect to be able to learn correctly even with small ratios ($\sigma \le 0.5$). We compare these training curves to the classical one. The learning has been done over two epochs, meaning that the whole dataset
is used twice. The following plots show the evolution of the loss L during the iterations on batches.
This is the standard indicator of the good convergence of a neural network learning phase. We can
compare the evolution of the loss between a classical CNN and our QCNN for different parameters.
Most results are presented in Appendix (Section H).
Our simulations show that the QCNN is able to learn despite the introduction of noise, tensor sampling and other modifications. In particular, it shows that only a fraction of the information is meaningful for the neural network, and that the quantum algorithm captures this information preferentially.
This learning can be more or less efficient depending on the choice of the key parameters. For de-
cent values of these parameters, the QCNN is able to converge during the training phase. It can then
classify correctly on both training and testing set, indicating neither overfitting nor underfitting.
We notice that the learning curves sometimes present a late start before the convergence initializes, in particular for small sampling ratios. This late start can be due to the random initialization of the kernel weights, which performs a meaningless convolution, a case where the quantum sampling of the output is of no interest. However, it is very interesting to see that despite this late start, the kernels start converging once they have found a good combination.
Overall, it is possible that the QCNN presents some behaviors that have no classical equivalence.
Understanding their potential effects, positive or negative, is an open question, all the more so as
the effects of the classical CNN’s hyperparameters are already a topic of active research, see the
work of Samek et al. (2017) for details. Note also that the neural network used in this simulation is
Figure 2: Training curves comparison between the classical CNN and the Quantum CNN (QCNN) for $\epsilon = 0.01$, $C = 10$, $\delta = 0.01$ and the sampling ratio $\sigma$ from 0.1 to 0.5. We can observe a learning phase similar to the classical one, even for a weak sampling of 20% or 30% of each convolution output, which tends to show that the meaningful information is distributed only at certain locations of the images, consistently with the purpose of the convolution layer. Even for a very low sampling ratio of 10%, we observe a convergence despite a late start.
rather small. A natural follow-up experiment would be to simulate a quantum version of a standard deeper CNN (AlexNet or VGG-16), possibly on more complex datasets, such as CIFAR-10 developed by Krizhevsky & Hinton (2009) or Fashion-MNIST by Xiao et al. (2017).
7 CONCLUSIONS
We have presented a quantum algorithm for evaluating and training convolutional neural networks
(CNN). At the core of this algorithm, we have developed a novel quantum algorithm for computing
a convolution product between two tensors, with a substantial speed up. This technique could be
reused in other signal processing tasks that could benefit from an enhancement by a quantum computer. Layer by layer, convolutional neural networks process and extract meaningful information. Following this idea of learning the most important features first, we have proposed a new approach to quantum tomography where the most meaningful information is sampled with higher probability, hence reducing the complexity of our algorithm.
Our QCNN is complete in the sense that almost all classical architectures can be implemented in a
quantum fashion: any (non negative and upper bounded) non linearity, pooling, number of layers
and size of kernels are available. Our circuit is shallow and could be run on relatively small quantum computers. One could repeat the main loop many times on the same shallow circuit, since performing the convolution product is simple and is similar for all layers. The pooling and non linearity are included in the loop. Our building block approach, layer by layer, allows high modularity, and can be combined with the work on quantum feedforward neural networks developed by Allcock et al. (2018).
The running time presents a speedup compared to the classical algorithm, due to fast linear algebra when computing the convolution product, and by only sampling the important values from the resulting quantum state. This speedup can be highly significant in cases where the number of channels $D^{\ell}$ in the input tensor is high (high dimensional time series, video sequences, game playing) or when the number of kernels $D^{\ell+1}$ is large, allowing deep architectures for CNNs, which was the case in the recent breakthrough of DeepMind's AlphaGo algorithm of Silver et al. (2016). The quantum CNN also allows larger kernels, which could be used for larger input images, since the size of the kernels must be a constant fraction of the input in order to recognize patterns. However, despite our new techniques to reduce the complexity, applying a non linearity and reusing the result of a layer for the next layer make register encoding and state tomography mandatory, hence preventing an exponential speedup in the number of input parameters.
Finally, we have presented a backpropagation algorithm that can also be implemented as a quantum circuit. The numerical simulations on a small CNN show that despite the introduction of noise and sampling, the QCNN can efficiently learn to classify visual data from the MNIST dataset, achieving an accuracy similar to the classical CNN.
REFERENCES
Jonathan Allcock, Chang-Yu Hsieh, Iordanis Kerenidis, and Shengyu Zhang. Quantum algorithms
for feedforward neural networks. arXiv preprint arXiv:1812.03089, 2018.
Kerstin Beer, Dmytro Bondarenko, Terry Farrelly, Tobias J Osborne, Robert Salzmann, and Ramona
Wolf. Efficient learning for deep quantum neural networks. arXiv preprint arXiv:1902.10445,
2019.
Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Larry Jackel, Urs Muller, and Karol Zieba. VisualBackProp: efficient visualization of CNNs. arXiv preprint arXiv:1611.05418, 2016.
Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification
and estimation. Contemporary Mathematics, 305:53–74, 2002.
Shantanav Chakraborty, András Gilyén, and Stacey Jeffery. The power of block-encoded ma-
trix powers: improved regression techniques via faster Hamiltonian simulation. arXiv preprint
arXiv:1804.01973, 2018.
Iris Cong, Soonwon Choi, and Mikhail D Lukin. Quantum convolutional neural networks. arXiv
preprint arXiv:1810.03787, 2018.
Edward Farhi and Hartmut Neven. Classification with quantum neural networks on near term pro-
cessors. arXiv preprint arXiv:1802.06002, 2018.
Daniel George and EA Huerta. Deep learning for real-time gravitational wave detection and parameter estimation: Results with Advanced LIGO data. Physics Letters B, 778:64–70, 2018.
Maxwell Henderson, Samriddhi Shakya, Shashindra Pradhan, and Tristan Cook. Quanvolu-
tional neural networks: Powering image recognition with quantum circuits. arXiv preprint
arXiv:1904.04767, 2019.
Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. Proceedings of the
8th Innovations in Theoretical Computer Science Conference, 2017a.
Iordanis Kerenidis and Anupam Prakash. Quantum gradient descent for linear systems and least
squares. arXiv:1704.04992, 2017b.
Iordanis Kerenidis and Anupam Prakash. A quantum interior point method for LPs and SDPs.
arXiv:1808.09266, 2018.
Iordanis Kerenidis, Jonas Landman, Alessandro Luongo, and Anupam Prakash. q-means: A
quantum algorithm for unsupervised machine learning. Neural Information Processing systems
(NeurIPS), 2019.
Nathan Killoran, Thomas R Bromley, Juan Miguel Arrazola, Maria Schuld, Nicolás Quesada, and
Seth Lloyd. Continuous-variable quantum neural networks. arXiv preprint arXiv:1806.06871,
2018.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech-
nical report, Citeseer, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Yann LeCun, L Bottou, Yoshua Bengio, and P Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 1998.
Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and
unsupervised machine learning. arXiv, 1307.0411:1–11, 7 2013. URL http://arxiv.org/
abs/1307.0411.
Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis.
Nature Physics, 10(9):631, 2014.
Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002a.
Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002b.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
Patrick Rebentrost, Thomas R Bromley, Christian Weedbrook, and Seth Lloyd. Quantum Hopfield neural network. Physical Review A, 98(4):042308, 2018.
Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelli-
gence: Understanding, visualizing and interpreting deep learning models. arXiv preprint
arXiv:1708.08296, 2017.
Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. The quest for a quantum neural network.
Quantum Information Processing, 13(11):2567–2586, 2014.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Nathan Wiebe, Ashish Kapoor, and Krysta M Svore. Quantum Algorithms for Nearest-Neighbor
Methods for Supervised and Unsupervised Learning. arXiv:1401.2142v2, 2014a. URL https:
//arxiv.org/pdf/1401.2142.pdf.
Nathan Wiebe, Ashish Kapoor, and Krysta M Svore. Quantum deep learning. arXiv preprint
arXiv:1412.3489, 2014b.
J Wu. Introduction to convolutional neural networks. https://pdfs.semanticscholar.
org/450c/a19932fcef1ca6d0442cbf52fec38fb9d1e5.pdf, 2017.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
We recall the most important variables for layer `. They represent tensors, their approximations, and
their reshaped versions.
Table 1: Summary of input variables for the `th layer, along with their meaning, dimensions and
corresponding notations. These variables are common for both quantum and classical algorithms.
We have omitted indices for Y ` which don’t appear in our work.
Table 2: Summary of variables describing outputs of the layer `, with the quantum algorithm.
Table 3: Summary of variables describing outputs of the layer `, with the classical algorithm.
Classical and quantum algorithms can be compared with these two diagrams:
$$\begin{cases} \text{Quantum convolution layer:} & X^{\ell} \to |\overline{X}^{\ell+1}\rangle \to |f(\overline{X}^{\ell+1})\rangle \to \overline{X}^{\ell+1} \to \widetilde{X}^{\ell+1} \\ \text{Classical convolution layer:} & X^{\ell} \to X^{\ell+1} \to f(X^{\ell+1}) \to \widetilde{X}^{\ell+1} \end{cases} \tag{3}$$
We finally provide some remarks that could clarify some notations ambiguity:
- Formally, the output of the quantum algorithm is X̃ `+1 . It is used as input for the next layer ` + 1.
But we consider that all variables’ names are reset when starting a new layer: X `+1 ← X̃ `+1 .
- For simplicity, we have sometimes replaced the indices (i`+1 , j `+1 , d`+1 ) by n to index the ele-
ments of the output.
- In Section D.2.2, the input for layer $\ell + 1$ is stored as $A^{\ell+1}$, for which the elements are indexed by $(p', r')$.
We introduce a basic and broad-audience quantum information background necessary for this work.
For a more detailed introduction we recommend Nielsen & Chuang (2002a).
Quantum Bits and Quantum Registers: The bit is the most basic unit of classical information. It can be either in state 0 or 1. Similarly, a quantum bit or qubit is a quantum system that can be in state $|0\rangle$, $|1\rangle$ (the braket notation $|\cdot\rangle$ is a reminder that the bit considered is a quantum system) or in a superposition of both states $\alpha|0\rangle + \beta|1\rangle$ with coefficients $\alpha, \beta \in \mathbb{C}$ such that $|\alpha|^2 + |\beta|^2 = 1$. The amplitudes $\alpha$ and $\beta$ are linked to the probabilities of observing either 0 or 1 when measuring the qubit, since $P(0) = |\alpha|^2$ and $P(1) = |\beta|^2$.
Before the measurement, any superposition is possible, which gives quantum information special abilities in terms of computation. With $n$ qubits, the $2^n$ possible binary combinations can exist simultaneously, each with a specific amplitude. For instance we can consider the uniform superposition $\frac{1}{\sqrt{2^n}} \sum_{i=0}^{2^n - 1} |i\rangle$, where $|i\rangle$ represents the $i$-th binary combination (e.g. $|01\cdots1001\rangle$). Multiple qubits together are often called a quantum register.
In its most general formulation, a quantum state with $n$ qubits can be seen as a vector in a complex Hilbert space of dimension $2^n$. This vector must be normalized under the $\ell_2$-norm, to guarantee that the squared amplitudes sum to 1.
Quantum Computation: To process qubits and therefore quantum registers, we use quantum
gates. These gates are unitary operators in the Hilbert space as they should map unit-norm vectors
to unit-norm vectors. Formally, we can see a quantum gate acting on $n$ qubits as a matrix $U \in \mathbb{C}^{2^n \times 2^n}$ such that $UU^\dagger = U^\dagger U = I$, where $U^\dagger$ is the conjugate transpose of $U$. Some basic single qubit gates include the NOT gate $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ that inverts $|0\rangle$ and $|1\rangle$, or the Hadamard gate $\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ that maps $|0\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$ and $|1\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$, creating quantum superposition. Finally, multi-qubit gates exist, such as the Controlled-NOT that applies a NOT gate on a target qubit conditioned on the state of a control qubit.
The main advantage of quantum gates is their ability to be applied to a superposition of inputs.
Indeed, given a gate $U$ such that $U|x\rangle \mapsto |f(x)\rangle$, we can apply it to all possible combinations of $x$ at once: $U\left(\frac{1}{C}\sum_x |x\rangle\right) \mapsto \frac{1}{C}\sum_x |f(x)\rangle$.
We now state some primitive quantum circuits, which we will use in our algorithm. For two integers $i$ and $j$, we can check their equality with the mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|[i = j]\rangle$. For two real numbers $a > 0$ and $\delta > 0$, we can compare them using $|a\rangle|\delta\rangle|0\rangle \mapsto |a\rangle|\delta\rangle|[a \le \delta]\rangle$. Finally, for a real number $a > 0$, we can obtain its square $|a\rangle|0\rangle \mapsto |a\rangle|a^2\rangle$. Note that these circuits are basically reversible versions of the classical ones and are linear in the number of qubits used to encode the input values.
Any classical boolean function can be implemented as a quantum unitary, even though this seems at first contradictory with the requirements of unitaries (reversibility, linearity). Let $\sigma : \mathbb{R} \mapsto \mathbb{R}$ be a classical function; we define $U_\sigma$ as the unitary that acts as $U_\sigma |x\rangle|0\rangle \mapsto |x\rangle|\sigma(x)\rangle$. Using a second quantum register to encode the result of the function, the properties of quantum unitaries are respected.
Knowing some basic principles of quantum information, the next step is to understand how data can
be efficiently encoded using quantum states. While several approaches could exist, we present the
most common one called amplitude encoding, which leads to interesting and efficient applications.
Let $x \in \mathbb{R}^d$ be a vector with components $(x_1, \cdots, x_d)$. Using only $\lceil \log(d) \rceil$ qubits, we can form $|x\rangle$, the quantum state encoding $x$, given by $|x\rangle = \frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j |j\rangle$. We see that the $j$-th component $x_j$ becomes the amplitude of $|j\rangle$, the $j$-th binary combination (or equivalently the $j$-th vector in the standard basis). Each amplitude must be divided by $\|x\|$ to preserve the unit $\ell_2$-norm of $|x\rangle$. Similarly, for a matrix $A \in \mathbb{R}^{n \times d}$, or equivalently for $n$ vectors $A_i$ for $i \in [n]$, we can express each row of $A$ as $|A_i\rangle = \frac{1}{\|A_i\|}\sum_{j=0}^{d-1} A_{ij} |j\rangle$.
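As a purely classical illustration of amplitude encoding (our own sketch, not a quantum subroutine), the following normalizes a vector into amplitudes and samples basis states with the corresponding probabilities:

```python
import numpy as np

def amplitude_encode(x):
    """Return the amplitudes of |x> and the number of qubits needed."""
    amplitudes = x / np.linalg.norm(x)           # divide by ||x|| to get a unit vector
    n_qubits = int(np.ceil(np.log2(len(x))))     # ceil(log d) qubits
    return amplitudes, n_qubits

def measure(amplitudes, shots, rng=np.random.default_rng(0)):
    """Sampling |x> returns index j with probability |x_j|^2 / ||x||^2."""
    probs = amplitudes ** 2
    probs = probs / probs.sum()                  # guard against rounding error
    return rng.choice(len(amplitudes), size=shots, p=probs)
```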
We can now explain an important definition, the ability to have quantum access to a matrix. This will be a requirement for many algorithms.

By using appropriate data structures, the first mapping can be reduced to the ability to perform a mapping of the form $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|A_{ij}\rangle$. The second requirement can be replaced by the ability to perform $|i\rangle|0\rangle \mapsto |i\rangle|\|A_i\|\rangle$, or simply by the knowledge of each norm. Therefore, using matrices such that all rows $A_i$ have the same norm makes it simpler to obtain the quantum access.
The time or complexity T necessary for the quantum access can be reduced to polylogarithmic
dependence in n and d if we consider the access to a Quantum Memory or QRAM. The QRAM
Kerenidis & Prakash (2017a) is a specific data structure from which a quantum circuit can allow
quantum access to data in time O(log (nd)).
Theorem B.1 (QRAM data structure, see Kerenidis & Prakash (2017a)) Let $A \in \mathbb{R}^{n \times d}$; there is a data structure to store the rows of $A$ such that,
1. The time to insert, update or delete a single entry $A_{ij}$ is $O(\log^2(n))$.
2. A quantum algorithm with access to the data structure can perform the following unitaries in time $T = O(\log^2 n)$:
(a) $|i\rangle|0\rangle \to |i\rangle|A_i\rangle$ for $i \in [n]$.
(b) $|0\rangle \to \sum_{i \in [n]} \|A_i\| \, |i\rangle$.
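The data structure behind this theorem can be pictured as one binary tree per row of $A$: the leaves store the squared entries (together with their signs) and each internal node stores the sum of its subtree, so that the root holds $\|A_i\|^2$. The sketch below is a simplified classical illustration of the update rule only; it ignores the quantum addressing.

```python
import math

class QRAMRow:
    """Binary tree over one row A_i: leaf k holds A_ik^2 and its sign, each
    internal node holds the sum of its children, the root holds ||A_i||^2.
    An update touches one root-to-leaf path, i.e. O(log d) nodes."""
    def __init__(self, d):
        self.depth = max(1, math.ceil(math.log2(d)))
        self.tree = [0.0] * (2 ** (self.depth + 1))   # 1-indexed heap layout
        self.sign = [1.0] * (2 ** self.depth)         # sign kept to recover A_ik

    def update(self, k, value):
        node = 2 ** self.depth + k                    # leaf index of entry k
        delta = value ** 2 - self.tree[node]
        self.sign[k] = 1.0 if value >= 0 else -1.0
        while node >= 1:                              # propagate the change upward
            self.tree[node] += delta
            node //= 2

    def squared_norm(self):
        return self.tree[1]                           # root = ||A_i||^2
```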
We now state important methods for processing the quantum information. Their goal is to store some information either in the quantum state's amplitudes or in a quantum register as a bitstring.
Theorem B.3 (Conditional Rotation) Given the quantum state $|a\rangle$, with $a \in [-1, 1]$, it is possible to perform $|a\rangle|0\rangle \mapsto |a\rangle\left(a|0\rangle + \sqrt{1 - a^2}\,|1\rangle\right)$ with complexity $\widetilde{O}(1)$.
Using Theorem F.3 followed by Theorem F.2, it is then possible to transform the state $\frac{1}{\sqrt{d}}\sum_{j=0}^{d-1}|x_j\rangle$ into $\frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j |x_j\rangle$.
In addition to amplitude estimation, we will make use of a tool developed in Wiebe et al. (2014a) to boost the probability of getting a good estimate for the inner product required for the quantum convolution algorithm. At a high level, we take multiple copies of the estimator from the amplitude estimation procedure, compute the median, and reverse the circuit to get rid of the garbage. Here we provide a theorem with respect to time and not query complexity.
Theorem B.4 (Median Evaluation, see Wiebe et al. (2014a)) Let $U$ be a unitary operation that maps
$$U : |0^{\otimes n}\rangle \mapsto \sqrt{a}\,|x, 1\rangle + \sqrt{1 - a}\,|G, 0\rangle$$
for some $1/2 < a \le 1$ in time $T$. Then there exists a quantum algorithm that, for any $\Delta > 0$ and for any $1/2 < a_0 \le a$, produces a state $|\Psi\rangle$ such that $\| |\Psi\rangle - |0^{\otimes nL}\rangle|x\rangle \| \le \sqrt{2\Delta}$ for some integer
In recent years, as the field of quantum machine learning grew, its “toolkit” for linear alge-
bra algorithms has become important enough to allow the development of many quantum machine
learning algorithms. We introduce here the important subroutines for this work, without detailing
the circuits or the algorithms.
The next theorems allow us to compute the distance between vectors encoded as quantum states, and to use this idea to perform the k-means algorithm.
Theorem B.5 (Quantum Distance Estimation, Wiebe et al. (2014b); Kerenidis et al. (2019)) Given quantum access in time $T$ to two matrices $U$ and $V$ with rows $u_i$ and $v_j$ of dimension $d$, there is a quantum algorithm that, for any pair $(i, j)$, performs the following mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|\overline{d^2}(u_i, v_j)\rangle$, estimating the euclidean distance between $u_i$ and $v_j$ with precision $|\overline{d^2}(u_i, v_j) - d^2(u_i, v_j)| \le \epsilon$ for any $\epsilon > 0$. The algorithm has a running time given by $\widetilde{O}(T\eta/\epsilon)$, where $\eta = \max_{ij}(\|u_i\|\|v_j\|)$, assuming that $\min_i(\|u_i\|) = \min_i(\|v_i\|) = 1$.
In Theorem F.6, the other parameters in the running time can be interpreted as follows: $\delta$ is the precision in the estimation of the distances, but also in the estimation of the position of the centroids. $\kappa(V)$ is the condition number of $V$ and $\mu(V)$ is defined above (Definition 5). Finally, in the case of well clusterable datasets, which should be the case when we apply k-means during spectral clustering, the running time simplifies to $\widetilde{O}\!\left(T \times \left(k^2 d \frac{\eta(V)^{2.5}}{\delta^3} + k^{2.5}\frac{\eta(V)^2}{\delta^3}\right)\right)$.
Note that the dependence in n is hidden in the time T to load the data. This dependence becomes
polylogarithmic in n if we assume access to a QRAM.
Theorem B.7 (Quantum Matrix Operations, Chakraborty et al. (2018)) Let $M \in \mathbb{R}^{d \times d}$ and $x \in \mathbb{R}^d$. Let $\delta_1, \delta_2 > 0$. If $M$ is stored in appropriate QRAM data structures and the time to prepare $|x\rangle$ is $T_x$, then there exist quantum algorithms that with probability at least $1 - 1/\text{poly}(d)$ return the desired states.
The linear algebra procedures above can also be applied to any rectangular matrix $V \in \mathbb{R}^{n \times d}$ by considering instead the symmetric matrix $\overline{V} = \begin{pmatrix} 0 & V \\ V^T & 0 \end{pmatrix}$.
CNN is a specific type of neural network, designed in particular for image processing or time series.
It uses the Convolution Product as a main procedure for each layer. We will focus on image pro-
cessing with a tensor framework for all elements of the network. Our goal is to explicitly describe
the CNN procedures in a form that can be translated in the context of quantum algorithms.
As a regular neural network, a CNN should learn how to classify any input, in our case images. The
training consists of optimizing a series of parameters, learned on the inputs and their corresponding
labels.
Images, or more generally layers of the network, can be seen as tensors. A tensor is a generalization
of a matrix to higher dimensions. For instance an image of height $H$ and width $W$ can be seen as a matrix in $\mathbb{R}^{H \times W}$, where every pixel is a greyscale value between 0 and 255 (8 bits). However, the three channels of color (RGB: Red Green Blue) must be taken into account, by stacking three such matrices, one for each color. The whole image is then seen as a 3-dimensional tensor in $\mathbb{R}^{H \times W \times D}$
where D is the number of channels. We will see that the Convolution Product in the CNN can be
expressed between 3-tensors (input) and 4-tensors (convolution filters or kernels), the output being
a 3-tensor of different dimensions (spatial size and number of channels).
C.2 ARCHITECTURE
A CNN is composed of 4 main procedures, combined and repeated in any order: Convolution layers, most often followed by an Activation Function, Pooling Layers, and some Fully Connected layers at the end. We will note $\ell$ the current layer.
Convolution Layer : The `th layer is convolved by a set of filters called kernels. The output of this
operation is the (` + 1)th layer. A convolution by a single kernel can be seen as a feature detector,
that will screen over all regions of the input. If the feature represented by the kernel, for instance
a vertical edge, is present in some part of the input, there will be a high value at the corresponding
position of the output. The output is commonly called the feature map of this convolution.
Activation Function: As in a regular neural network, we insert some non linearities, also called activation functions. These are mandatory for a neural network to be able to learn any function. In the case of a CNN, each convolution is often followed by a Rectified Linear Unit function, or ReLu. This is a simple function that sets all negative values of the output to zero, and leaves the positive values as they are.
Pooling Layer : This downsampling technique reduces the dimensionality of the layer, in order
to improve the computation. Moreover, it gives to the CNN the ability to learn a representation
invariant to small translations. Most of the time, we apply a Maximum Pooling or an Average
Pooling. The first one consists of replacing a subregion of P × P elements only by the one with the
maximum value. The second does the same by averaging all values. Recall that the value of a pixel
corresponds to how much a particular feature was present in the previous convolution layer.
Fully Connected Layer: After a certain number of convolution layers, the input has been sufficiently processed so that we can apply a fully connected network. Weights connect each input to each output, where the inputs are all the elements of the previous layer. The last layer should have one node per possible label. Each node value can be interpreted as the probability of the initial image belonging to the corresponding class.
Most of the following mathematical formulations have been very well detailed by Wu (2017).
At layer $\ell$, we consider the convolution of a multi-channel image, seen as a 3-tensor $X^{\ell} \in \mathbb{R}^{H^{\ell} \times W^{\ell} \times D^{\ell}}$. Let's consider a single kernel in $\mathbb{R}^{H \times W \times D^{\ell}}$. Note that its third dimension must match the number of channels of the input, as in Figure 4. The kernel passes over all possible regions of the input and outputs a value for each region, stored in the corresponding element of the output. Therefore the output is 2-dimensional, in $\mathbb{R}^{H^{\ell+1} \times W^{\ell+1}}$.
Figure 4: Convolution of a 3-tensor input (Left) by one 3-tensor kernel (Center). The output (Right) is a matrix for which each entry is an inner product between the kernel and the corresponding overlapping region of the input.
In a CNN, the most general case is to apply several convolution products to the input, each one with a different 3-tensor kernel. Let's consider an input convolved by $D^{\ell+1}$ kernels. We can globally see this process as a whole, represented by one 4-tensor kernel $K^{\ell} \in \mathbb{R}^{H \times W \times D^{\ell} \times D^{\ell+1}}$. As $D^{\ell+1}$ convolutions are applied, there are $D^{\ell+1}$ outputs of 2 dimensions, equivalent to a 3-tensor $X^{\ell+1} \in \mathbb{R}^{H^{\ell+1} \times W^{\ell+1} \times D^{\ell+1}}$.
We can see on Figure 5 that the output's dimensions are modified according to the following rule:
$$H^{\ell+1} = H^{\ell} - H + 1, \qquad W^{\ell+1} = W^{\ell} - W + 1 \tag{4}$$
We omit the details of Padding and Stride, two parameters that control how the kernel moves through the input, but these can easily be incorporated in the algorithms.
An element of $X^{\ell}$ is determined by 3 indices $(i^{\ell}, j^{\ell}, d^{\ell})$, while an element of the kernel $K^{\ell}$ is determined by 4 indices $(i, j, d, d')$. For an element of $X^{\ell+1}$ we use 3 indices $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$. We can express the value of each element of the output $X^{\ell+1}$ with the relation
$$X^{\ell+1}_{i^{\ell+1}, j^{\ell+1}, d^{\ell+1}} = \sum_{i=0}^{H}\sum_{j=0}^{W}\sum_{d=0}^{D^{\ell}} K^{\ell}_{i, j, d, d^{\ell+1}} \, X^{\ell}_{i^{\ell+1}+i,\, j^{\ell+1}+j,\, d} \tag{5}$$
Figure 5: Convolutions of the 3-tensor input $X^{\ell}$ (Left) by one 4-tensor kernel $K^{\ell}$ (Center). Each channel of the output $X^{\ell+1}$ (Right) corresponds to the output matrix of the convolution with one of the 3-tensor kernels.
It is possible to reformulate Equation (5) as a matrix product. For this we have to reshape our objects. We expand the input $X^{\ell}$ into a matrix $A^{\ell} \in \mathbb{R}^{(H^{\ell+1} W^{\ell+1}) \times (H W D^{\ell})}$. Each row of $A^{\ell}$ is a vectorized version of a subregion of $X^{\ell}$. This subregion is a volume of the same size as a single kernel volume $H \times W \times D^{\ell}$. Hence each of the $H^{\ell+1} \times W^{\ell+1}$ rows of $A^{\ell}$ is used for creating one value in $X^{\ell+1}$. Given such a subregion of $X^{\ell}$, the rule for creating the row of $A^{\ell}$ is to stack, channel by channel, a column first vectorized form of each matrix. Then, we reshape the kernel tensor $K^{\ell}$ into a matrix $F^{\ell} \in \mathbb{R}^{(H W D^{\ell}) \times D^{\ell+1}}$, such that each column of $F^{\ell}$ is a column first vectorized version of one of the $D^{\ell+1}$ kernels.
The convolution operation $X^{\ell} * K^{\ell} = X^{\ell+1}$ is equivalent to the following matrix multiplication
$$A^{\ell} F^{\ell} = Y^{\ell+1}, \tag{6}$$
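This reshaping can be checked numerically with the short sketch below (our own illustration). Note that the paper's rule stacks each subregion channel by channel in column-first order; the sketch uses NumPy's default row-major ordering for both $A^{\ell}$ and $F^{\ell}$, a different but consistent ordering, which therefore yields the same product $A^{\ell} F^{\ell}$.

```python
import numpy as np

def im2col(X, H, W):
    """Expand X^l of shape (H_l, W_l, D_l) into A^l of shape (H_{l+1} W_{l+1}, H W D_l)."""
    Hl, Wl, Dl = X.shape
    Ho, Wo = Hl - H + 1, Wl - W + 1                # Equation (4)
    rows = [X[i:i + H, j:j + W, :].reshape(-1)     # one kernel-sized subregion per row
            for i in range(Ho) for j in range(Wo)]
    return np.stack(rows)

def conv_as_matmul(X, K):
    """Return Y^{l+1} = A^l F^l, with F^l the kernels flattened column-wise."""
    H, W, Dl, Do = K.shape
    A = im2col(X, H, W)
    F = K.reshape(H * W * Dl, Do)
    return A @ F                                   # shape (H_{l+1} W_{l+1}, D_{l+1})
```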
In order to develop a quantum algorithm to perform the convolution as described above, we will make use of quantum linear algebra procedures. We will use quantum states proportional to the rows of $A^{\ell}$, noted $|A_p\rangle$, and the columns of $F^{\ell}$, noted $|F_q\rangle$ (we omit the $\ell$ exponent in the quantum states to simplify the notation). These states are given by $|A_p\rangle = \frac{1}{\|A_p\|}\sum_{r=0}^{HWD^{\ell}-1} A_{pr}|r\rangle$ and $|F_q\rangle = \frac{1}{\|F_q\|}\sum_{s=0}^{HWD^{\ell}-1} F_{sq}|s\rangle$. We suppose we can load these vectors in quantum states by performing the following queries:
$$|p\rangle|0\rangle \mapsto |p\rangle|A_p\rangle, \qquad |q\rangle|0\rangle \mapsto |q\rangle|F_q\rangle \tag{8}$$
Such queries, in time poly-logarithmic in the dimension of the vector, can be implemented with
a Quantum Random Access Memory (QRAM). See Section D.2 for more details on the QRAM
update rules and its integration layer by layer.
$$|f(\overline{Y}^{\ell+1})\rangle = \frac{1}{\sqrt{H^{\ell+1} W^{\ell+1} D^{\ell+1}}} \sum_{p,q} |p\rangle|q\rangle|f(\overline{Y}^{\ell+1}_{pq})\rangle|g_{pq}\rangle \tag{9}$$
Because of the precision $\epsilon$ on $|\overline{P}_{pq}\rangle$, our estimation $\overline{Y}^{\ell+1}_{pq} = (2\overline{P}_{pq} - 1)\|A_p\|\|F_q\|$ is obtained with error such that $|\overline{Y}^{\ell+1}_{pq} - Y^{\ell+1}_{pq}| \le 2\epsilon \|A_p\|\|F_q\|$.
In superposition, we can bound this error by $|\overline{Y}^{\ell+1}_{pq} - Y^{\ell+1}_{pq}| \le 2\epsilon M$, where we define $M$ as the maximum product between the norm of one of the $D^{\ell+1}$ kernels and the norm of one of the regions of $X^{\ell}$ of size $HWD^{\ell}$. Finally, since the previous error estimation is valid for all pairs $(p, q)$, the overall error committed on the convolution product can be bounded by $\|\overline{Y}^{\ell+1} - Y^{\ell+1}\|_\infty \le 2\epsilon M$, where $\|\cdot\|_\infty$ denotes the $\ell_\infty$ norm. Recall that $Y^{\ell+1}$ is just a reshaped version of $X^{\ell+1}$. Since the non linearity adds no approximation, we can conclude on the final error committed for a layer of our QCNN:
$$\|f(\overline{X}^{\ell+1}) - f(X^{\ell+1})\|_\infty \le 2\epsilon M \tag{11}$$
At this point, we have established Theorem D.1 as we have created the quantum state (9), with given
precision guarantees, in time poly-logarithmic in ∆ and in the size of X ` and K ` .
We now aim to retrieve classical information from this quantum state. Note that $|\overline{Y}^{\ell+1}_{pq}\rangle$ represents a scalar encoded in as many qubits as needed for the precision, whereas $|A_p\rangle$ represented a vector as a quantum state in superposition, where each element $A_{p,r}$ is encoded in one amplitude (see Section F). The next step can be seen as a way to retrieve both encodings at the same time, which will allow an efficient tomography focused on the values of high magnitude.
The number of amplitude amplification queries is $O\!\left(\sqrt{\frac{\max_{p,q}(f(\overline{Y}_{pq}))}{\frac{1}{HWD}\sum_{p,q} f(\overline{Y}_{pq})}}\right) = O\!\left(\frac{\sqrt{\max_{p,q}(f(\overline{Y}_{pq}))}}{\sqrt{\mathbb{E}_{p,q}(f(\overline{Y}_{pq}))}}\right)$, where the notation $\mathbb{E}_{p,q}(f(\overline{Y}_{pq}))$ represents the average value of the matrix $f(\overline{Y})$. It can also be written $\mathbb{E}(f(\overline{X}))$ as in Result 1: $\mathbb{E}_{p,q}(f(\overline{Y}_{pq})) = \frac{1}{HWD}\sum_{p,q} f(\overline{Y}_{pq})$. At the end of these iterations, we have with high probability modified the state to the following:
$$|f(\overline{Y})\rangle = \frac{1}{\sqrt{HWD}}\sum_{p,q} \alpha'_{pq}\, |p\rangle|q\rangle|f(\overline{Y}_{pq})\rangle \tag{12}$$
where, to respect the normalization of the quantum state, $\alpha'_{pq} = \frac{\alpha_{pq}}{\sqrt{\sum_{p,q}\frac{\alpha_{pq}^2}{HWD}}}$. Eventually, the probability of measuring $(p, q, f(\overline{Y}_{pq}))$ is given by $p(p, q, f(\overline{Y}_{pq})) = \frac{(\alpha'_{pq})^2}{HWD} = \frac{f(\overline{Y}_{pq})}{\sum_{p,q} f(\overline{Y}_{pq})}$. Note that we have used the same name $|f(\overline{Y})\rangle$ for both state (9) and state (12). From now on, this state name will refer only to the latter (12).
We see here that $f(\overline{Y}^{\ell+1}_{pq})$, the value of each pixel, is encoded in both the last register and the amplitude. We will use this property to efficiently extract the exact values of the high magnitude pixels. For simplicity, we will instead use the notation $f(\overline{X}^{\ell+1}_n)$ to denote a pixel's value, with $n \in [H^{\ell+1} W^{\ell+1} D^{\ell+1}]$. Recall that $\overline{Y}^{\ell+1}$ and $\overline{X}^{\ell+1}$ are reshaped versions of the same object.
The pixels with high values will have more probability of being sampled. Specifically, we perform a tomography with $\ell_\infty$ guarantee and precision parameter $\eta > 0$. See Theorem G.1 and Section G for details. The $\ell_\infty$ guarantee allows us to obtain each pixel with error at most $\eta$, and requires $\widetilde{O}(1/\eta^2)$ samples from the state (13). Pixels with low values $f(\overline{X}^{\ell+1}_n) < \eta$ will probably not be sampled due to their low amplitude. Therefore the error committed would be significant, and we adopt the rule of setting them to 0. Pixels with higher values $f(\overline{X}^{\ell+1}_n) \ge \eta$ will be sampled with high probability, and only one appearance is enough to get the exact register value $f(\overline{X}^{\ell+1}_n)$ of the pixel, as it is also written in the last register.
To conclude, let's note $\overline{X}^{\ell+1}_n$ the resulting pixel values after the tomography, and compare them to the real classical outputs $f(X^{\ell+1}_n)$. Recall that the measured values $f(\overline{X}^{\ell+1}_n)$ are approximated with error at most $2\epsilon M$, with $M = \max_{p,q}\|A_p\|\|F_q\|$. The algorithm described above implements the following rules:
$$\begin{cases} |\overline{X}^{\ell+1}_n - f(X^{\ell+1}_n)| \le 2\epsilon M & \text{if } f(\overline{X}^{\ell+1}_n) \ge \eta \\ \overline{X}^{\ell+1}_n = 0 & \text{if } f(\overline{X}^{\ell+1}_n) < \eta \end{cases} \tag{14}$$
Concerning the running time, one could ask which values of $\eta$ are sufficient to obtain enough meaningful pixels. Obviously this highly depends on the output's size $H^{\ell+1} W^{\ell+1} D^{\ell+1}$ and on the output's content itself. But we can view this question from another perspective, by considering that we sample a constant fraction of pixels given by $\sigma \cdot (H^{\ell+1} W^{\ell+1} D^{\ell+1})$, where $\sigma \in [0, 1]$ is a sampling ratio. Because of the particular amplitudes of state (13), the high value pixels will be measured and known with higher probability. The points that are not sampled are set to 0. We see that this approach is equivalent to the $\ell_\infty$ tomography, therefore we have $\frac{1}{\eta^2} = \sigma \cdot H^{\ell+1} W^{\ell+1} D^{\ell+1}$.
We will use this analogy in the numerical simulations (Section 6) to estimate, for a particular QCNN
architecture and a particular dataset of images, which values of σ are enough to allow the neural
network to learn.
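As an illustration with the first layer used in Section 6: a 28x28 input convolved with 7x7 kernels gives $H^{\ell+1} = W^{\ell+1} = 22$ by Equation (4), so with $D^{\ell+1} = 5$ kernels the output contains $H^{\ell+1} W^{\ell+1} D^{\ell+1} = 2420$ pixels. A sampling ratio $\sigma = 0.3$ then corresponds to roughly 726 samples, i.e. an equivalent precision parameter $\eta = 1/\sqrt{\sigma \cdot H^{\ell+1} W^{\ell+1} D^{\ell+1}} \approx 0.037$.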
Figure 7: Activation functions: ReLu (Left) and capReLu (Right) with a cap C at 5.
We wish to detail the use of the QRAM between each quantum convolution layer, and present how the pooling operation can happen during this phase. General results about the QRAM are given as Theorem F.1. Implementation details can be found in the work of Kerenidis & Prakash (2017a). In this section, we will show how to store samples from the output of layer $\ell$ to create the input of layer $\ell + 1$.
Having stored pixels in this way, we can then query $|p'\rangle|0\rangle \mapsto |p'\rangle|A^{\ell+1}_{p'}\rangle$, using the quantum circuit developed by Kerenidis & Prakash (2017b), where we correctly have $|A^{\ell+1}_{p'}\rangle = \frac{1}{\|A^{\ell+1}_{p'}\|}\sum_{r'} A^{\ell+1}_{p'r'}|r'\rangle$. Note that each tree has a logarithmic depth in the number of leaves, hence the running time of writing the output of the quantum convolution layer in the QRAM gives a marginal multiplicative increase, poly-logarithmic in the number of points sampled from $|f(\overline{Y}^{\ell+1})\rangle$, namely $O(\log(1/\eta^2))$.
$$\tilde{d}^{\ell+1} = d^{\ell+1}, \qquad \tilde{j}^{\ell+1} = \left\lfloor \frac{j^{\ell+1}}{P} \right\rfloor, \qquad \tilde{i}^{\ell+1} = \left\lfloor \frac{i^{\ell+1}}{P} \right\rfloor \tag{15}$$
Figure 8: A 2×2 tensor pooling. A point in f (X `+1 ) (left) is given by its position (i`+1 , j `+1 , d`+1 ).
A point in X̃ `+1 (right) is given by its position (ĩ`+1 , j̃ `+1 , d˜`+1 ). Different pooling regions in
f (X `+1 ) have separate colours, and each one corresponds to a unique point in X̃ `+1 .
We now show how any kind of pooling can be efficiently integrated to our QCNN structure. In-
deed the pooling operation will occur during the QRAM update described above, at the end of a
convolution layer. At this moment we will store sampled values according to the pooling rules.
In the quantum setting, the output of layer $\ell$ after tomography is noted $X^{\ell+1}$. After pooling, we will describe it by $\widetilde{X}^{\ell+1}$, which has dimensions $\frac{H^{\ell+1}}{P} \times \frac{W^{\ell+1}}{P} \times D^{\ell+1}$. $\widetilde{X}^{\ell+1}$ will be effectively used as input for layer $\ell + 1$ and its values should be stored in the QRAM to form the trees $\widetilde{T}^{\ell+1}_{p'}$, related to the matrix expansion $\widetilde{A}^{\ell+1}$.
However $X^{\ell+1}$ is not known before the tomography is over. Therefore we have to modify the update rule of the QRAM to implement the pooling in an online fashion, each time a sample from $|f(\overline{X}^{\ell+1})\rangle$ is drawn. Since several sampled values of $|f(\overline{X}^{\ell+1})\rangle$ can correspond to the same leaf $\widetilde{A}^{\ell+1}_{p'r'}$ (points in the same pooling region), we need an overwrite rule, which will depend on the type of pooling. In the case of Maximum Pooling, we simply update the leaf and the parent nodes if the new sampled value is higher than the one already written. In the case of Average Pooling, we replace the actual value by the new averaged value.
In the end, any pooling can be included in the already existing QRAM update. In the worst case, the running time is increased by $\widetilde{O}(P/\eta^2)$, an overhead corresponding to the number of times we need to overwrite existing leaves, with $P$ being a small constant in most cases.
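A sketch of this online update for Maximum Pooling (ignoring the tree bookkeeping of the QRAM and using a plain dictionary as the store) is given below:

```python
def online_max_pool_update(store, i, j, d, value, P):
    """Write one sampled pixel (i, j, d) of f(X^{l+1}) into the pooled layer:
    the pooling cell is (i // P, j // P, d), and for Maximum Pooling we keep
    the largest value observed so far for that cell."""
    cell = (i // P, j // P, d)
    if value > store.get(cell, float("-inf")):
        store[cell] = value
    return store
```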
As we will see in Section E, the final positions $(p, q)$ that were sampled from $|f(\overline{X}^{\ell+1})\rangle$ and selected after pooling must be stored for further use during the backpropagation phase.
We will now summarise the running time for one forward pass of convolution layer $\ell$. With $\widetilde{O}$ we hide the polylogarithmic factors. We first write the running time of the classical CNN layer, which is given by $\widetilde{O}\!\left(H^{\ell+1} W^{\ell+1} D^{\ell+1} \cdot HWD^{\ell}\right)$. For the QCNN, the previous steps prove Result 1 and can be implemented in time $\widetilde{O}\!\left(\frac{1}{\eta^2} \cdot \frac{M\sqrt{C}}{\sqrt{\mathbb{E}(f(\overline{X}^{\ell+1}))}}\right)$. Note that, as explained in Section D.1.5, the quantum running time can also be written $\widetilde{O}\!\left(\sigma H^{\ell+1} W^{\ell+1} D^{\ell+1} \cdot \frac{M\sqrt{C}}{\sqrt{\mathbb{E}(f(\overline{X}^{\ell+1}))}}\right)$, with $\sigma \in [0, 1]$ being the fraction of sampled elements among the $H^{\ell+1} W^{\ell+1} D^{\ell+1}$ of them.
It is interesting to notice that one quantum convolution layer can also include the ReLu operation and the Pooling operation in the same circuit, for no significant increase in the running time, whereas in the classical CNN each operation must be done on the whole data again.
The entire QCNN is made of multiple layers. For the last layer's output, we expect only one possible outcome, or a few in the case of a classification task, which means that the dimension of the quantum output is very small. A full tomography can be performed on the last layer's output in order to calculate the outcome. The loss $L$ is then calculated, as a measure of correctness of the predictions compared to the ground truth. As in the classical CNN, our QCNN should be able to optimize its weights (the elements of the kernels) to minimize the loss by an iterative method.
to perform gradient descent such that $\forall(s,q), \left|\frac{\partial L}{\partial F^{\ell}_{s,q}} - \overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}}\right| \leq 2\delta \left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2$. Let $\frac{\partial L}{\partial Y^{\ell}}$ be the gradient with respect to the $\ell^{th}$ layer. The running time of a single layer $\ell$ for quantum backpropagation is given by
$$O\!\left(\left(\left(\mu(A^{\ell}) + \mu\!\left(\frac{\partial L}{\partial Y^{\ell+1}}\right)\right)\kappa\!\left(\frac{\partial L}{\partial F^{\ell}}\right) + \left(\mu\!\left(\frac{\partial L}{\partial Y^{\ell+1}}\right) + \mu(F^{\ell})\right)\kappa\!\left(\frac{\partial L}{\partial Y^{\ell}}\right)\right)\frac{\log(1/\delta)}{\delta^2}\right) \qquad (16)$$
where for a matrix V , κ(V ) is the condition number and µ(V ) is defined in Equation (5).
After each forward pass, the outcome is compared to the true labels to define a loss. We can update our weights by gradient descent to minimize this loss, and iterate. The main idea behind backpropagation is to compute the derivatives of the loss $L$, layer by layer, starting from the last one.
At layer $\ell$, the derivatives needed to perform the gradient descent are $\frac{\partial L}{\partial F^{\ell}}$ and $\frac{\partial L}{\partial Y^{\ell}}$. The first one represents the gradient of the final loss $L$ with respect to each kernel element, a matrix of values that we will use to update the kernel weights $F^{\ell}_{s,q}$. The second one is the gradient of $L$ with respect to the layer itself and is only needed to calculate the gradient $\frac{\partial L}{\partial F^{\ell-1}}$ at layer $\ell-1$.
Given the layer's input, we will show how to calculate $\frac{\partial L}{\partial F^{\ell}}$, the matrix of derivatives with respect to the elements of the previous kernel matrix $F^{\ell}$. This is the main goal in order to optimize the kernel's weights.
The details of the following calculations can be found in the work of Wu (2017). We will use the notation $vec(X)$ to represent the vectorized form of any tensor $X$. Recall that $A^{\ell}$ is the matrix expansion of the tensor $X^{\ell}$, whereas $Y^{\ell}$ is a matrix reshaping of $X^{\ell}$.
By applying the chain rule $\frac{\partial L}{\partial vec(F^{\ell})^T} = \frac{\partial L}{\partial vec(X^{\ell+1})^T}\frac{\partial vec(X^{\ell+1})}{\partial vec(F^{\ell})^T}$, we can obtain:
$$\frac{\partial L}{\partial F^{\ell}} = (A^{\ell})^T \frac{\partial L}{\partial Y^{\ell+1}} \qquad (17)$$
See the calculation details in the work of Wu (2017). Equation (17) shows that, to obtain the desired gradient, we can just perform a matrix-matrix multiplication between the transposed layer itself ($A^{\ell}$) and the gradient with respect to the previous layer ($\frac{\partial L}{\partial Y^{\ell+1}}$).
Equation (17) also explains why we will need to calculate $\frac{\partial L}{\partial Y^{\ell}}$ in order to backpropagate through layer $\ell-1$. To calculate it, we use the chain rule again for $\frac{\partial L}{\partial vec(X^{\ell})^T} = \frac{\partial L}{\partial vec(X^{\ell+1})^T}\frac{\partial vec(X^{\ell+1})}{\partial vec(X^{\ell})^T}$. Recall that a point in $A^{\ell}$, indexed by the pair $(p, r)$, can correspond to several triplets $(i^{\ell}, j^{\ell}, d^{\ell})$ in $X^{\ell}$. We will use the notation $(p, r) \leftrightarrow (i^{\ell}, j^{\ell}, d^{\ell})$ to express this relation formally. One can show that $\frac{\partial L}{\partial Y^{\ell+1}}(F^{\ell})^T$ is a matrix of the same shape as $A^{\ell}$, and that the chain rule leads to a simple relation to calculate $\frac{\partial L}{\partial Y^{\ell}}$:
$$\left[\frac{\partial L}{\partial X^{\ell}}\right]_{i^{\ell},j^{\ell},d^{\ell}} = \sum_{(p,r)\leftrightarrow(i^{\ell},j^{\ell},d^{\ell})} \left[\frac{\partial L}{\partial Y^{\ell+1}}(F^{\ell})^T\right]_{p,r} \qquad (18)$$
We have shown how to obtain the gradients with respect to the kernels $F^{\ell}$ and to the layer itself $Y^{\ell}$ (or equivalently $X^{\ell}$).
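To illustrate Equations (17) and (18) in the familiar classical setting, here is a minimal NumPy sketch for a toy single-channel convolution (stride 1, no padding, one output channel); the helper names `im2col` and `conv_backward` are ours and do not come from the paper's code.

```python
import numpy as np

def im2col(X, kh, kw):
    """Matrix expansion A_l: each row lists the pixels of one kh x kw patch of X,
    so that the forward pass is the matrix product A_l @ F_l."""
    H, W = X.shape
    rows = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            rows.append(X[i:i + kh, j:j + kw].ravel())
    return np.array(rows)                      # shape (H' * W', kh * kw)

def conv_backward(X, F, dL_dY):
    """Toy version of the gradients of Equations (17) and (18).
    X     : input image, shape (H, W)
    F     : flattened kernel, shape (kh * kw, 1)
    dL_dY : gradient w.r.t. the reshaped output, shape (H' * W', 1)
    """
    kh = kw = int(np.sqrt(F.shape[0]))
    A = im2col(X, kh, kw)

    # Equation (17): dL/dF = (A_l)^T @ dL/dY_{l+1}
    dL_dF = A.T @ dL_dY

    # Equation (18): scatter (dL/dY_{l+1}) (F_l)^T back onto X, summing over all
    # pairs (p, r) that map to the same pixel (i, j).
    dL_dA = dL_dY @ F.T                        # same shape as A
    dL_dX = np.zeros_like(X, dtype=float)
    H, W = X.shape
    p = 0
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            dL_dX[i:i + kh, j:j + kw] += dL_dA[p].reshape(kh, kw)
            p += 1
    return dL_dF, dL_dX
```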
E.1.2 Non Linearity
The activation function also has an impact on the gradient. In the case of the ReLu, we should only cancel the gradient for points with negative values. For points with positive values, the derivatives remain the same since the function is the identity. A formal relation is given by
$$\left[\frac{\partial L}{\partial X^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial f(X^{\ell+1})}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} & \text{if } X^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (19)$$
E.1.3 Pooling
If we take into account the pooling operation, we must change some of the gradients. Indeed, a pixel that hasn't been selected during pooling has no impact on the final loss, thus it should have a gradient equal to 0. We will focus on the case of Max Pooling (Average Pooling relies on a similar idea). To state a formal relation, we will use the notations of Section D.2.2: an element in the output of the layer, the tensor $f(X^{\ell+1})$, is located by the triplet $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$. The tensor after pooling is noted $\widetilde{X}^{\ell+1}$ and its points are located by the triplet $(\tilde{i}^{\ell+1}, \tilde{j}^{\ell+1}, \tilde{d}^{\ell+1})$. During backpropagation, after the calculation of $\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}$, some of the derivatives of $f(X^{\ell+1})$ should be set to zero with the following rule:
$$\left[\frac{\partial L}{\partial f(X^{\ell+1})}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}\right]_{\tilde{i}^{\ell+1},\tilde{j}^{\ell+1},\tilde{d}^{\ell+1}} & \text{if } (i^{\ell+1}, j^{\ell+1}, d^{\ell+1}) \text{ was selected during pooling} \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$
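As a concrete illustration of rules (19) and (20), here is a small NumPy sketch on a single 2D feature map (one fixed depth index) with non-overlapping $P \times P$ max pooling; the function names are ours.

```python
import numpy as np

def relu_backward(X_next, dL_df):
    """Rule (19): zero the gradient where the pre-activation was negative."""
    return np.where(X_next >= 0, dL_df, 0.0)

def maxpool_backward(f_X_next, dL_dXtilde, P):
    """Rule (20): route each pooled gradient to the argmax of its P x P region,
    and set every non-selected position to zero."""
    H, W = f_X_next.shape
    dL_df = np.zeros_like(f_X_next)
    for it in range(H // P):
        for jt in range(W // P):
            block = f_X_next[it*P:(it+1)*P, jt*P:(jt+1)*P]
            k = np.unravel_index(np.argmax(block), block.shape)   # selected position
            dL_df[it*P + k[0], jt*P + k[1]] = dL_dXtilde[it, jt]
    return dL_df
```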
In this section, we want to give a quantum algorithm to perform backpropagation on a layer $\ell$, and detail the impact on the derivatives, given by the following diagram:
$$\frac{\partial L}{\partial F^{\ell}},\ \frac{\partial L}{\partial X^{\ell}} \;\leftarrow\; \frac{\partial L}{\partial \overline{X}^{\ell+1}} \;\leftarrow\; \frac{\partial L}{\partial f(\overline{X}^{\ell+1})} \;\leftarrow\; \frac{\partial L}{\partial X^{\ell+1}} \;\leftarrow\; \frac{\partial L}{\partial \widetilde{X}^{\ell+1}} = \frac{\partial L}{\partial X^{\ell+1}} \qquad (21)$$
We assume that backpropagation has been done on layer $\ell+1$. This means in particular that $\frac{\partial L}{\partial X^{\ell+1}}$ is stored in the QRAM. However, as shown on Diagram (21), $\frac{\partial L}{\partial X^{\ell+1}}$ corresponds formally to $\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}$, and not to $\frac{\partial L}{\partial \overline{X}^{\ell+1}}$. Therefore, we will have to modify the values stored in the QRAM to take into account the non linearity, the tomography and the pooling. We will first consider how to implement $\frac{\partial L}{\partial X^{\ell}}$ and $\frac{\partial L}{\partial F^{\ell}}$ through backpropagation, considering only the convolution product, as if $\frac{\partial L}{\partial \overline{X}^{\ell+1}}$ and $\frac{\partial L}{\partial X^{\ell+1}}$ were the same. Then we will detail how to simply modify $\frac{\partial L}{\partial X^{\ell+1}}$ a priori, by setting some of its values to 0.
In this section we consider only the quantum convolution product, without non linearity, tomography or pooling, hence writing its output directly as $X^{\ell+1}$. Regarding derivatives, the quantum convolution product is equivalent to the classical one. Gradient relations (17) and (18) remain the same. Note that the $\epsilon$-approximation from Section D.1.2 plays no role in these gradient considerations.
The gradient relations being the same, we still have to specify the quantum algorithm that implements the backpropagation and outputs classical descriptions of $\frac{\partial L}{\partial X^{\ell}}$ and $\frac{\partial L}{\partial F^{\ell}}$. We have seen that the two main calculations (17) and (18) are in fact matrix-matrix multiplications, both involving $\frac{\partial L}{\partial Y^{\ell+1}}$, the reshaped form of $\frac{\partial L}{\partial X^{\ell+1}}$. For each, the classical running time is $O(H^{\ell+1}W^{\ell+1}D^{\ell+1} \cdot HWD^{\ell})$.
We know from Theorem F.7 and Theorem G.1 a quantum algorithm to efficiently perform a matrix-vector multiplication and return a classical state with $\ell_\infty$ norm guarantees. For a matrix $V$ and a vector $b$, both accessible from the QRAM, the running time to perform this operation is $O\!\left(\frac{\mu(V)\kappa(V)\log(1/\delta)}{\delta^2}\right)$, where $\kappa(V)$ is the condition number of the matrix and $\mu(V)$ is a matrix parameter defined in Equation (5). The precision parameter $\delta > 0$ is the error committed in the approximation for both Theorems F.7 and G.1.
We can therefore apply these theorems to perform matrix-matrix multiplications, by simply decomposing them into several matrix-vector multiplications. For instance, in Equation (17), the matrix could be $(A^{\ell})^T$ and the different vectors would be each column of $\frac{\partial L}{\partial Y^{\ell+1}}$. The global running time is then given by $\mu(A^{\ell}) + \mu(\frac{\partial L}{\partial Y^{\ell+1}})$ multiplied by $\kappa((A^{\ell})^T \cdot \frac{\partial L}{\partial Y^{\ell+1}})$. Likewise, for Equation (18), we have $\mu(\frac{\partial L}{\partial Y^{\ell+1}}) + \mu(F^{\ell})$ and $\kappa(\frac{\partial L}{\partial Y^{\ell+1}} \cdot (F^{\ell})^T)$.
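The decomposition itself is elementary; the following one-line NumPy sketch (ours) just makes explicit that each column of the right-hand matrix can be handled independently, which is where the quantum matrix-vector routine of Theorem F.7 followed by the $\ell_\infty$ tomography of Theorem G.1 would be plugged in.

```python
import numpy as np

# Sketch (ours): a matrix-matrix product M @ B computed column by column. In the
# quantum setting, `matvec` would be replaced by the Theorem F.7 multiplication
# followed by l_inf tomography, applied to each column of B.
def matmat_by_columns(M, B, matvec=lambda M, b: M @ b):
    return np.column_stack([matvec(M, B[:, j]) for j in range(B.shape[1])])
```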
Note that the dimension of the matrix doesn't appear in the running time, since we tolerate an $\ell_\infty$ norm guarantee for the error instead of an $\ell_2$ guarantee (see Section G for details). The reason why the $\ell_\infty$ tomography is the right approximation here is that the results of these linear algebra operations are rows of the gradient matrices, which are not vectors in a euclidean space but a series of numbers for which we want to be $\delta$-close to the exact values. See the next section for more details.
It is an open question whether one can apply the same sub-sampling technique as in the forward pass (Section D.1) and sample only the highest derivatives of $\frac{\partial L}{\partial X^{\ell}}$, to reduce the computation cost while maintaining a good optimization. We then have to understand which elements of $\frac{\partial L}{\partial X^{\ell+1}}$ must be set to zero to take into account the effects of the non linearity, the tomography and the pooling.
To include the impact of the non linearity, one could apply the same rule as in (19), and simply replace the ReLu by the capReLu. After the non linearity, we obtain $f(\overline{X}^{\ell+1})$, and the gradient relation would be given by
$$\left[\frac{\partial L}{\partial \overline{X}^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial f(\overline{X}^{\ell+1})}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} & \text{if } 0 \leq \overline{X}^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} \leq C \\ 0 & \text{otherwise} \end{cases} \qquad (22)$$
If an element of $\overline{X}^{\ell+1}$ was negative or bigger than the cap $C$, its derivative should be zero during the backpropagation. However, this operation was performed in quantum superposition. In the quantum algorithm, one cannot record at which positions $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$ the activation function was selective or not. The gradient relation (22) therefore cannot be implemented a posteriori. We provide a partial solution to this problem, using the fact that the quantum tomography must also be taken into account for some derivatives. Indeed, only the points $(i^{\ell+1}, j^{\ell+1}, d^{\ell+1})$ that have been sampled should have an impact on the gradient of the loss. Therefore we replace the previous relation by
$$\left[\frac{\partial L}{\partial \overline{X}^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial X^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} & \text{if } (i^{\ell+1}, j^{\ell+1}, d^{\ell+1}) \text{ was sampled} \\ 0 & \text{otherwise} \end{cases} \qquad (23)$$
Nonetheless, we can argue that this approximation is tolerable. In the first case, where $\overline{X}^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} < 0$, the derivatives cannot be set to zero as they should. But in practice their values will be zero after the activation function, so such points would have no chance of being sampled; in conclusion their derivatives would be zero as required. In the other case, where $\overline{X}^{\ell+1}_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} > C$, the derivatives cannot be set to zero either, but the points have a high probability of being sampled. Therefore their derivative will remain unchanged, as if we were using a ReLu instead of a capReLu. In cases where the cap $C$ is high enough, this shouldn't be a source of disadvantage in practice.
Similarly, the pooling selection performed at the end of layer $\ell$ translates into the rule
$$\left[\frac{\partial L}{\partial X^{\ell+1}}\right]_{i^{\ell+1},j^{\ell+1},d^{\ell+1}} = \begin{cases} \left[\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}\right]_{\tilde{i}^{\ell+1},\tilde{j}^{\ell+1},\tilde{d}^{\ell+1}} & \text{if } (i^{\ell+1}, j^{\ell+1}, d^{\ell+1}) \text{ was selected during pooling} \\ 0 & \text{otherwise} \end{cases} \qquad (24)$$
Note that we know $\frac{\partial L}{\partial \widetilde{X}^{\ell+1}}$, as it is equal to $\frac{\partial L}{\partial X^{\ell+1}}$, the gradient with respect to the input of layer $\ell+1$, known by assumption and stored in the QRAM.
In conclusion, given $\frac{\partial L}{\partial Y^{\ell+1}}$ in the QRAM, the quantum backpropagation first consists in applying relation (24) followed by (23). The effective gradient now takes into account the non linearity, the tomography and the pooling that occurred during layer $\ell$. We can then apply the quantum algorithm for matrix-matrix multiplication that implements relations (18) and (17).
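The following NumPy sketch (ours) spells out this a-priori correction on a classical array: rule (24) routes each stored value back to the position selected during pooling, and rule (23) then keeps only the positions that were actually sampled; the dictionary and set recording those positions are illustrative stand-ins for the bookkeeping stored during the forward pass.

```python
import numpy as np

# Sketch (ours) of the corrections (24) then (23) applied to the stored gradient
# before the matrix products (17)-(18). `pooled_positions` maps a pooled index
# (i~, j~, d~) to the full-resolution index (i, j, d) selected during pooling, and
# `sampled_positions` lists the indices kept by the tomography.
def correct_gradient(dL_dXtilde, pooled_positions, sampled_positions, shape):
    grad = np.zeros(shape)
    # Rule (24): route each pooled gradient back to the selected position.
    for pos_out, pos_in in pooled_positions.items():
        grad[pos_in] = dL_dXtilde[pos_out]
    # Rule (23): zero every position that was not sampled during tomography.
    mask = np.zeros(shape, dtype=bool)
    for pos in sampled_positions:
        mask[pos] = True
    return np.where(mask, grad, 0.0)
```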
Note that the steps in Algorithm 2 could also be reversed: during the backpropagation of layer $\ell+1$, when storing the values of each element of $\frac{\partial L}{\partial Y^{\ell+1}}$ in the QRAM, one can already take into account (24) and (23) of layer $\ell$. In this case we directly store $\frac{\partial L}{\partial \overline{X}^{\ell+1}}$, at no supplementary cost.
Therefore, the running time of the quantum backpropagation for one layer $\ell$, given as Algorithm 2, corresponds to the sum of the running times of the circuits implementing relations (17) and (18). We finally obtain
$$O\!\left(\left(\left(\mu(A^{\ell}) + \mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right)\right)\kappa\!\left((A^{\ell})^T \cdot \tfrac{\partial L}{\partial Y^{\ell+1}}\right) + \left(\mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right) + \mu(F^{\ell})\right)\kappa\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}} \cdot (F^{\ell})^T\right)\right)\frac{\log(1/\delta)}{\delta^2}\right),$$
which can be rewritten as
$$O\!\left(\left(\left(\mu(A^{\ell}) + \mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right)\right)\kappa\!\left(\tfrac{\partial L}{\partial F^{\ell}}\right) + \left(\mu\!\left(\tfrac{\partial L}{\partial Y^{\ell+1}}\right) + \mu(F^{\ell})\right)\kappa\!\left(\tfrac{\partial L}{\partial Y^{\ell}}\right)\right)\frac{\log(1/\delta)}{\delta^2}\right) \qquad (25)$$
Besides storing $\frac{\partial L}{\partial X^{\ell}}$, the main output is a classical description of $\frac{\partial L}{\partial F^{\ell}}$, necessary to perform the gradient descent on the parameters of $F^{\ell}$. Section E.4 below details the impact of the quantum backpropagation compared to the classical case, which can be reduced to a simple noise addition during the gradient descent.
In this part we will see the impact of the quantum backpropagation compared to the classical case, which can be reduced to a simple noise addition during the gradient descent. Recall that the gradient descent, in our case, consists in applying the update rule $F^{\ell} \leftarrow F^{\ell} - \lambda \frac{\partial L}{\partial F^{\ell}}$ with learning rate $\lambda$.
Let's note $x = \frac{\partial L}{\partial F^{\ell}}$ and its elements $x_{s,q} = \frac{\partial L}{\partial F^{\ell}_{s,q}}$. From the first result of Theorem F.7 with error $\delta > 0$, and the tomography procedure from Theorem G.1 with the same error $\delta$, we can obtain a classical description $\overline{\frac{x}{\|x\|_2}}$ of $\frac{x}{\|x\|_2}$ with $\ell_\infty$ norm guarantee, such that:
$$\left\|\overline{\frac{x}{\|x\|_2}} - \frac{x}{\|x\|_2}\right\|_\infty \leq \delta$$
in time $\widetilde{O}\!\left(\frac{\kappa(V)\mu(V)\log(1/\delta)}{\delta^2}\right)$, where $V$ is the matrix stored in the QRAM that allows us to obtain $x$, as explained in Section E.2. The $\ell_\infty$ norm tomography is used so that the error $\delta$ is at most the same for each component:
$$\forall(s,q),\quad \left|\overline{\frac{x_{s,q}}{\|x\|_2}} - \frac{x_{s,q}}{\|x\|_2}\right| \leq \delta$$
From the second result of Theorem F.7 we can also obtain an estimate $\overline{\|x\|_2}$ of the norm, with the same error $\delta$, such that
$$\left|\overline{\|x\|_2} - \|x\|_2\right| \leq \delta\|x\|_2$$
in time $\widetilde{O}\!\left(\frac{\kappa(V)\mu(V)}{\delta}\log(1/\delta)\right)$ (which does not affect the overall asymptotic running time). Using both results we can obtain an unnormalized vector $\overline{x}$ close to $x$ such that, by the triangle inequality,
$$\begin{aligned} \|x - \overline{x}\|_\infty &= \left\|\|x\|_2\frac{x}{\|x\|_2} - \overline{\|x\|_2}\,\overline{\frac{x}{\|x\|_2}}\right\|_\infty \\ &\leq \left\|\|x\|_2\frac{x}{\|x\|_2} - \overline{\|x\|_2}\frac{x}{\|x\|_2}\right\|_\infty + \left\|\overline{\|x\|_2}\frac{x}{\|x\|_2} - \overline{\|x\|_2}\,\overline{\frac{x}{\|x\|_2}}\right\|_\infty \\ &\leq 1 \cdot \left|\|x\|_2 - \overline{\|x\|_2}\right| + \overline{\|x\|_2} \cdot \left\|\frac{x}{\|x\|_2} - \overline{\frac{x}{\|x\|_2}}\right\|_\infty \\ &\leq \delta\|x\|_2 + \|x\|_2\delta \leq 2\delta\|x\|_2 \end{aligned}$$
in time $\widetilde{O}\!\left(\frac{\kappa(V)\mu(V)\log(1/\delta)}{\delta^2}\right)$. In conclusion, with the $\ell_\infty$ norm guarantee, having also access to the norm of the result is costless.
Finally, the noisy gradient descent update rule, expressed as $F^{\ell}_{s,q} \leftarrow F^{\ell}_{s,q} - \lambda \overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}}$, can be written in the worst case as
$$\overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}} = \frac{\partial L}{\partial F^{\ell}_{s,q}} \pm 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2 \qquad (26)$$
To summarize, using the quantum linear algebra from Theorem F.7 with the $\ell_\infty$ norm tomography from Theorem G.1, both with error $\delta$, along with a norm estimation with relative error $\delta$ too, we can obtain classically the unnormalized values $\overline{\frac{\partial L}{\partial F^{\ell}}}$ such that $\left\|\overline{\frac{\partial L}{\partial F^{\ell}}} - \frac{\partial L}{\partial F^{\ell}}\right\|_\infty \leq 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2$, or equivalently
$$\forall(s,q),\quad \left|\overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}} - \frac{\partial L}{\partial F^{\ell}_{s,q}}\right| \leq 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2 \qquad (27)$$
Therefore the gradient descent update rule in the quantum case becomes $F^{\ell}_{s,q} \leftarrow F^{\ell}_{s,q} - \lambda \overline{\frac{\partial L}{\partial F^{\ell}_{s,q}}}$, which in the worst case becomes
$$F^{\ell}_{s,q} \leftarrow F^{\ell}_{s,q} - \lambda\left(\frac{\partial L}{\partial F^{\ell}_{s,q}} \pm 2\delta\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2\right) \qquad (28)$$
This proves Theorem E.1. This update rule can be simulated by the addition of a random relative noise, given as a Gaussian centered on 0 with standard deviation equal to $\delta$. This is how we will simulate the quantum backpropagation in the Numerical Simulations.
Compared to the classical update rule, this corresponds to the addition of noise during the optimization step. This noise scales with $\left\|\frac{\partial L}{\partial F^{\ell}}\right\|_2$, which is expected to decrease as training converges. Recall that the gradient descent is already a stochastic process. Therefore, we expect that such noise, with acceptable values of $\delta$, will not disturb the convergence of the gradient, as the following numerical simulations tend to confirm.
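For concreteness, here is one way to emulate the update rule (28) classically, in the spirit of the simulation choice described above; the exact noise scale (per-element relative noise with standard deviation $\delta$, versus the worst-case $2\delta\|\partial L/\partial F^{\ell}\|_2$ bound) is a modeling choice, and the plain-SGD wrapper below is ours.

```python
import numpy as np

# Sketch (ours) of a noisy gradient descent step emulating quantum backpropagation:
# each kernel gradient entry is perturbed by a zero-mean Gaussian whose scale follows
# the worst-case bound of Equation (28). The scaling convention is a modeling choice.
def noisy_update(F, dL_dF, lr, delta, rng=None):
    rng = rng or np.random.default_rng()
    scale = delta * np.linalg.norm(dL_dF)            # ~ delta * ||dL/dF||_2
    noise = rng.normal(0.0, scale, size=dL_dF.shape)
    return F - lr * (dL_dF + noise)
```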
Quantum Bits and Quantum Registers: The bit is the most basic unit of classical information. It can be either in state 0 or 1. Similarly, a quantum bit or qubit is a quantum system that can be in state $|0\rangle$, $|1\rangle$ (the braket notation $|\cdot\rangle$ is a reminder that the bit considered is a quantum system), or in a superposition of both states $\alpha|0\rangle + \beta|1\rangle$ with coefficients $\alpha, \beta \in \mathbb{C}$ such that $|\alpha|^2 + |\beta|^2 = 1$. The amplitudes $\alpha$ and $\beta$ are linked to the probabilities of observing either 0 or 1 when measuring the qubit, since $P(0) = |\alpha|^2$ and $P(1) = |\beta|^2$.
Before the measurement, any superposition is possible, which gives quantum information special abilities in terms of computation. With $n$ qubits, the $2^n$ possible binary combinations can exist simultaneously, each with a specific amplitude. For instance, we can consider the uniform distribution $\frac{1}{\sqrt{2^n}}\sum_{i=0}^{2^n-1}|i\rangle$, where $|i\rangle$ represents the $i^{th}$ binary combination (e.g. $|01\cdots1001\rangle$). Multiple qubits together are often called a quantum register.
In its most general formulation, a quantum state with $n$ qubits can be seen as a vector in a complex Hilbert space of dimension $2^n$. This vector must be normalized under the $\ell_2$-norm, to guarantee that the squared amplitudes sum to 1.
Quantum Computation: To process qubits, and therefore quantum registers, we use quantum gates. These gates are unitary operators in the Hilbert space, as they should map unit-norm vectors to unit-norm vectors. Formally, we can see a quantum gate acting on $n$ qubits as a matrix $U \in \mathbb{C}^{2^n \times 2^n}$ such that $UU^{\dagger} = U^{\dagger}U = I$, where $U^{\dagger}$ is the conjugate transpose of $U$. Some basic single qubit gates include the NOT gate $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ that swaps $|0\rangle$ and $|1\rangle$, or the Hadamard gate $\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ that maps $|0\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$ and $|1\rangle \mapsto \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$, creating the quantum superposition. Finally, multiple-qubit gates exist, such as the Controlled-NOT that applies a NOT gate on a target qubit conditioned on the state of a control qubit.
The main advantage of quantum gates is their ability to be applied to a superposition of inputs. Indeed, given a gate $U$ such that $U|x\rangle \mapsto |f(x)\rangle$, we can apply it to all possible combinations of $x$ at once: $U\!\left(\frac{1}{C}\sum_x |x\rangle\right) \mapsto \frac{1}{C}\sum_x |f(x)\rangle$.
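As a sanity check of these definitions, the following tiny NumPy sketch (ours) simulates the NOT and Hadamard gates as matrices acting on amplitude vectors.

```python
import numpy as np

# Tiny classical simulation (ours) of single-qubit gates acting on amplitude vectors.
NOT = np.array([[0.0, 1.0], [1.0, 0.0]])
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

ket0 = np.array([1.0, 0.0])                 # |0>
plus = H @ ket0                             # (|0> + |1>)/sqrt(2): uniform superposition
print(plus, np.abs(plus) ** 2)              # amplitudes ~0.707, probabilities 0.5 each
print(NOT @ ket0)                           # |1>
```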
We now state some primitive quantum circuits, which we will use in our algorithm: for two integers $i$ and $j$, we can check their equality with the mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|[i=j]\rangle$. For two real numbers $a > 0$ and $\delta > 0$, we can compare them using $|a\rangle|\delta\rangle|0\rangle \mapsto |a\rangle|\delta\rangle|[a\leq\delta]\rangle$. Finally, for a real number $a > 0$, we can obtain its square: $|a\rangle|0\rangle \mapsto |a\rangle|a^2\rangle$.
Note that these circuits are basically a reversible version of the classical ones and are linear in the
number of qubits used to encode the input values.
Knowing some basic principles of quantum information, the next step is to understand how data can
be efficiently encoded using quantum states. While several approaches could exist, we present the
most common one called amplitude encoding, which leads to interesting and efficient applications.
Let $x \in \mathbb{R}^d$ be a vector with components $(x_1, \cdots, x_d)$. Using only $\lceil\log(d)\rceil$ qubits, we can form $|x\rangle$, the quantum state encoding $x$, given by $|x\rangle = \frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j |j\rangle$. We see that the $j^{th}$ component $x_j$ becomes the amplitude of $|j\rangle$, the $j^{th}$ binary combination (or equivalently the $j^{th}$ vector in the standard basis). Each amplitude must be divided by $\|x\|$ to preserve the unit $\ell_2$-norm of $|x\rangle$.
Similarly, for a matrix $A \in \mathbb{R}^{n\times d}$, or equivalently for $n$ vectors $A_i$ with $i \in [n]$, we can express each row of $A$ as $|A_i\rangle = \frac{1}{\|A_i\|}\sum_{j=0}^{d-1} A_{ij}|j\rangle$.
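A classical simulation of amplitude encoding is just an $\ell_2$ normalization (plus padding to a power of two); the short sketch below, with an illustrative function name of ours, makes the $\lceil\log(d)\rceil$-qubit bookkeeping explicit.

```python
import numpy as np

# Sketch (ours) of amplitude encoding as a classical simulation: the state |x> is
# the l2-normalized vector, padded to the next power of two so that it lives on
# ceil(log2(d)) qubits; measurement probabilities are the squared amplitudes.
def amplitude_encode(x):
    x = np.asarray(x, dtype=float)
    n_qubits = int(np.ceil(np.log2(len(x)))) if len(x) > 1 else 1
    amp = np.zeros(2 ** n_qubits)
    amp[:len(x)] = x / np.linalg.norm(x)       # amplitudes x_j / ||x||
    return amp

state = amplitude_encode([3.0, 4.0])           # -> [0.6, 0.8]: P(|0>) = 0.36, P(|1>) = 0.64
```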
We can now explain an important definition, the ability to have quantum access to a matrix. This will be a requirement for many algorithms. By using appropriate data structures, the first mapping can be reduced to the ability to perform a mapping of the form $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|A_{ij}\rangle$. The second requirement can be replaced by the ability to perform $|i\rangle|0\rangle \mapsto |i\rangle|\|A_i\|\rangle$, or simply by the knowledge of each norm. Therefore, using matrices such that all rows $A_i$ have the same norm makes it simpler to obtain quantum access.
The time or complexity $T$ necessary for quantum access can be reduced to a polylogarithmic dependence in $n$ and $d$ if we consider access to a Quantum Memory, or QRAM. The QRAM Kerenidis & Prakash (2017a) is a specific data structure from which a quantum circuit can allow quantum access to the data in time $O(\log(nd))$.
Theorem F.1 (QRAM data structure, see Kerenidis & Prakash (2017a)) Let $A \in \mathbb{R}^{n\times d}$; there is a data structure to store the rows of $A$ such that,
1. The time to insert, update or delete a single entry $A_{ij}$ is $O(\log^2(n))$.
2. A quantum algorithm with access to the data structure can perform the following unitaries in time $T = O(\log^2 n)$:
(a) $|i\rangle|0\rangle \to |i\rangle|A_i\rangle$ for $i \in [n]$.
(b) $|0\rangle \to \frac{1}{\|A\|_F}\sum_{i\in[n]} \|A_i\| |i\rangle$.
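The circuit of Theorem F.1 is quantum, but the underlying bookkeeping can be sketched classically: for each row, a binary tree stores squared entries at the leaves and partial sums at the internal nodes, so a single-entry update only touches one root-to-leaf path. The class below (ours) is a minimal illustration of that tree, not the paper's data structure.

```python
import numpy as np

# Classical sketch (ours) of the tree behind Theorem F.1 for a single row A_i:
# leaves hold squared entries (signs kept aside), internal nodes hold the sum of
# their children, so an update costs O(log d) node writes along one path.
class KPTree:
    def __init__(self, d):
        self.d = 1
        while self.d < d:
            self.d *= 2                    # pad to a power of two
        self.tree = np.zeros(2 * self.d)   # heap layout: node k has children 2k, 2k+1
        self.sign = np.ones(self.d)

    def update(self, j, value):
        """Insert/update entry A_ij with O(log d) node updates."""
        self.sign[j] = np.sign(value) if value != 0 else 1.0
        k = self.d + j
        self.tree[k] = value ** 2
        k //= 2
        while k >= 1:
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]
            k //= 2

    def squared_norm(self):
        return self.tree[1]                # root stores ||A_i||^2
```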
We now state important methods for processing the quantum information. Their goal is to store some information alternatively in the quantum state's amplitudes or in a quantum register as a bitstring.
Theorem F.3 [Conditional Rotation] Given the quantum state $|a\rangle$, with $a \in [-1, 1]$, it is possible to perform $|a\rangle|0\rangle \mapsto |a\rangle\left(a|0\rangle + \sqrt{1-a^2}|1\rangle\right)$ with complexity $\widetilde{O}(1)$.
Using Theorem F.3 followed by Theorem F.2, it is then possible to transform the state $\frac{1}{\sqrt{d}}\sum_{j=0}^{d-1}|x_j\rangle$ into $\frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j|x_j\rangle$.
In addition to amplitude estimation, we will make use of a tool developed in Wiebe et al. (2014a) to boost the probability of getting a good estimate of the inner product required for the quantum convolution algorithm. At a high level, we take multiple copies of the estimator from the amplitude estimation procedure, compute the median, and reverse the circuit to get rid of the garbage. Here we provide a theorem with respect to time and not query complexity.
Theorem F.4 (Median Evaluation, see Wiebe et al. (2014a)) Let $U$ be a unitary operation that maps
$$U : |0^{\otimes n}\rangle \mapsto \sqrt{a}|x, 1\rangle + \sqrt{1-a}|G, 0\rangle$$
for some $1/2 < a \leq 1$ in time $T$. Then there exists a quantum algorithm that, for any $\Delta > 0$ and for any $1/2 < a_0 \leq a$, produces a state $|\Psi\rangle$ such that $\||\Psi\rangle - |0^{\otimes nL}\rangle|x\rangle\| \leq \sqrt{2\Delta}$ for some integer $L$, in time
$$2T\left\lceil \frac{\ln(1/\Delta)}{2\left(|a_0| - \frac{1}{2}\right)^2} \right\rceil.$$
In recent years, as the field of quantum machine learning grew, its “toolkit” of linear algebra algorithms has become rich enough to allow the development of many quantum machine learning algorithms. We introduce here the important subroutines for this work, without detailing the circuits or the algorithms.
The next theorems allow us to compute the distance between vectors encoded as quantum states, and to use this idea to perform the k-means algorithm.
Theorem F.5 [Quantum Distance Estimation, Wiebe et al. (2014b); Kerenidis et al. (2019)] Given quantum access in time $T$ to two matrices $U$ and $V$ with rows $u_i$ and $v_j$ of dimension $d$, there is a quantum algorithm that, for any pair $(i, j)$, performs the following mapping $|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|\overline{d^2(u_i, v_j)}\rangle$, estimating the euclidean distance between $u_i$ and $v_j$ with precision $|\overline{d^2(u_i, v_j)} - d^2(u_i, v_j)| \leq \epsilon$ for any $\epsilon > 0$. The algorithm has a running time given by $\widetilde{O}(T\eta/\epsilon)$, where $\eta = \max_{ij}(\|u_i\|\|v_j\|)$, assuming that $\min_i(\|u_i\|) = \min_i(\|v_i\|) = 1$.
In Theorem F.6, the other parameters in the running time can be interpreted as follows: $\delta$ is the precision in the estimation of the distances, but also in the estimation of the position of the centroids. $\kappa(V)$ is the condition number of $V$ and $\mu(V)$ is defined above (Definition 5). Finally, in the case of well-clusterable datasets, which should be the case when we apply k-means during spectral clustering, the running time simplifies to $\widetilde{O}\!\left(T \times \left(k^2 d\,\frac{\eta(V)^{2.5}}{\delta^3} + k^{2.5}\frac{\eta(V)^2}{\delta^3}\right)\right)$.
Note that the dependence in n is hidden in the time T to load the data. This dependence becomes
polylogarithmic in n if we assume access to a QRAM.
Theorem F.7 (Quantum Matrix Operations, Chakraborty et al. (2018)) Let $M \in \mathbb{R}^{d\times d}$ and $x \in \mathbb{R}^d$. Let $\delta_1, \delta_2 > 0$. If $M$ is stored in appropriate QRAM data structures and the time to prepare $|x\rangle$ is $T_x$, then there exist quantum algorithms that, with probability at least $1 - 1/poly(d)$, return a state close to $|Mx\rangle$ (with error $\delta_1$) and an estimate of the norm $\|Mx\|$ (with relative error $\delta_2$); these are respectively the first and second results used in Section E.3.
The linear algebra procedures above can also be applied to any rectangular matrix $V \in \mathbb{R}^{n\times d}$ by considering instead the symmetric matrix $\overline{V} = \begin{pmatrix} 0 & V \\ V^T & 0 \end{pmatrix}$.
Finally, we present a logarithmic time algorithm for vector state tomography that will be used to recover classical information from the quantum states with an $\ell_\infty$ norm guarantee. Given a unitary $U$ that produces a quantum state $|x\rangle = \frac{1}{\|x\|}\sum_{j=0}^{d-1} x_j|j\rangle$, by calling $U$ $O(\log d/\delta^2)$ times, the tomography algorithm is able to reconstruct a vector $\widetilde{X}$ that approximates $|x\rangle$ with $\ell_\infty$ norm guarantee, such that $\|\widetilde{X} - |x\rangle\|_\infty \leq \delta$, or equivalently that $\forall i \in [d], |x_i - \widetilde{X}_i| \leq \delta$. Such a tomography is of interest when the components $x_i$ of a quantum state are not the coordinates of a meaningful vector in some linear space, but just a series of values, so that we don't want an overall guarantee on the vector (which is the case with the usual $\ell_2$ tomography) but a similar error guarantee for each component of the estimation.
Theorem G.1 ($\ell_\infty$ Vector state tomography) Given access to a unitary $U$ such that $U|0\rangle = |x\rangle$ and its controlled version, in time $T(U)$, there is a tomography algorithm with time complexity $O\!\left(T(U)\frac{\log d}{\delta^2}\right)$ that produces a unit vector $\widetilde{X} \in \mathbb{R}^d$ such that $\|\widetilde{X} - x\|_\infty \leq \delta$ with probability at least $1 - 1/poly(d)$.
The proof of this theorem is similar to the proof of the $\ell_2$-norm tomography by Kerenidis & Prakash (2018). However, the $\ell_\infty$ norm tomography introduced in this paper depends only logarithmically, and not linearly, on the dimension $d$. Note that in our case, $T(U)$ will be logarithmic in the dimension.
Theorem G.2 [$\ell_2$ Vector state tomography, Kerenidis & Prakash (2018)] Given access to a unitary $U$ such that $U|0\rangle = |x\rangle$ and its controlled version, in time $T(U)$, there is an algorithm that outputs a classical vector $\widetilde{X} \in \mathbb{R}^d$ with $\ell_2$-norm guarantee $\|\widetilde{X} - x\|_2 \leq \delta$ for any $\delta > 0$, in time $O\!\left(T(U) \times \frac{d\log(d)}{\delta^2}\right)$.
The following version of the Chernoff Bound will be used for the analysis of Algorithm 3.
Theorem G.3 (Chernoff Bound) Let $X_j$, for $j \in [N]$, be independent random variables such that $X_j \in [0, 1]$, and let $X = \sum_{j\in[N]} X_j$. We have the three following inequalities:
1. For $0 < \beta < 1$, $P[X < (1-\beta)E[X]] \leq e^{-\beta^2 E[X]/2}$
2. For $\beta > 0$, $P[X > (1+\beta)E[X]] \leq e^{-\frac{\beta^2}{2+\beta}E[X]}$
3. For $0 < \beta < 1$, $P[|X - E[X]| \geq \beta E[X]] \leq e^{-\beta^2 E[X]/3}$, by combining 1. and 2.
Algorithm 3: $\ell_\infty$ norm vector state tomography.
1: Measure $N = \frac{36\ln d}{\delta^2}$ copies of $|x\rangle$ in the standard basis and count $n_i$, the number of times the outcome $i$ is observed. Store $\sqrt{p_i} = \sqrt{n_i/N}$ in a QRAM data structure.
2: Create $N = \frac{36\ln d}{\delta^2}$ copies of the state $\frac{1}{\sqrt{2}}|0\rangle\sum_{i\in[d]} x_i|i\rangle + \frac{1}{\sqrt{2}}|1\rangle\sum_{i\in[d]} \sqrt{p_i}|i\rangle$.
3: Apply a Hadamard gate on the first qubit of each copy to obtain
$$|\phi\rangle = \frac{1}{2}\sum_{i\in[d]}\left((x_i + \sqrt{p_i})|0, i\rangle + (x_i - \sqrt{p_i})|1, i\rangle\right)$$
4: Measure both registers of each copy in the standard basis, and count $n(0, i)$, the number of times the outcome $(0, i)$ is observed.
5: Set $\sigma(i) = +1$ if $n(0, i) > 0.4\,N p_i$ and $\sigma(i) = -1$ otherwise.
6: Output the unit vector $\widetilde{X}$ such that $\forall i \in [d], \widetilde{X}_i = \sigma(i)\sqrt{p_i}$.
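To see Algorithm 3 in action, here is a classical Monte-Carlo sketch (ours) that emulates the two measurement rounds on a known real amplitude vector and reconstructs $\sigma(i)\sqrt{p_i}$; it is a numerical illustration, not a quantum implementation.

```python
import numpy as np

# Classical Monte-Carlo sketch (ours) of Algorithm 3 on a known real vector x.
def linf_tomography(x, delta, rng=np.random.default_rng(0)):
    x = x / np.linalg.norm(x)
    d = len(x)
    N = int(np.ceil(36 * np.log(d) / delta ** 2))

    # Step 1: measure N copies of |x> in the standard basis.
    counts = rng.multinomial(N, x ** 2)
    p = counts / N

    # Steps 2-4: measure N copies of |phi>; outcome (b, i) has probability
    # (x_i + (-1)^b * sqrt(p_i))^2 / 4.
    probs = np.concatenate([(x + np.sqrt(p)) ** 2, (x - np.sqrt(p)) ** 2]) / 4.0
    counts_phi = rng.multinomial(N, probs / probs.sum())
    n0 = counts_phi[:d]

    # Steps 5-6: fix the signs and output the estimate.
    sigma = np.where(n0 > 0.4 * N * p, 1.0, -1.0)
    return sigma * np.sqrt(p)

x = np.array([0.5, -0.5, 0.5, -0.5])
print(np.max(np.abs(linf_tomography(x, delta=0.1) - x)))   # l_inf error, O(delta)
```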
Theorem G.4 Algorithm 3 produces an estimate $\widetilde{X} \in \mathbb{R}^d$ such that $\|\widetilde{X} - x\|_\infty \leq (1 + \sqrt{2})\delta$ with probability at least $1 - \frac{1}{d^{0.83}}$.
Proving $\|x - \widetilde{X}\|_\infty \leq O(\delta)$ is equivalent to showing that for all $i \in [d]$ we have $|x_i - \widetilde{X}_i| = |x_i - \sigma(i)\sqrt{p_i}| \leq O(\delta)$. Let $S$ be the set of indices defined by $S = \{i \in [d] : |x_i| > \delta\}$. We will separate the proof into the two cases $i \in S$ and $i \notin S$.
Case 1: $i \in S$.
We will show that if $i \in S$, we correctly have $\sigma(i) = \mathrm{sgn}(x_i)$ with high probability. Therefore we will only need to bound $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| - \sqrt{p_i}|$.
We suppose that $x_i > 0$. The value of $\sigma(i)$ correctly determines $\mathrm{sgn}(x_i)$ if the number of times we have measured $(0, i)$ at Step 4 is more than half of its expectation, i.e. $n(0, i) > \frac{1}{2}E[n(0, i)]$. If $x_i < 0$, the same argument holds for $n(1, i)$. We consider the random variable that represents the outcome of a measurement on the state $|\phi\rangle$. The Chernoff Bound, part 1, with $\beta = 1/2$ gives
$$P\left[n(0, i) \leq \frac{1}{2}E[n(0, i)]\right] \leq e^{-E[n(0,i)]/8} \qquad (29)$$
From the definition of $|\phi\rangle$ we have $E[n(0, i)] = \frac{N}{4}(x_i + \sqrt{p_i})^2$. We will lower bound this value with the following argument.
For the $k^{th}$ measurement of $|x\rangle$, with $k \in [N]$, let $X_k$ be a random variable such that $X_k = 1$ if the outcome is $i$, and 0 otherwise. We define $X = \sum_{k\in[N]} X_k$. Note that $X = n_i = N p_i$ and $E[X] = N x_i^2$. We can apply the Chernoff Bound, part 3, on $X$ with $\beta = 1/2$ to obtain
$$P[|x_i^2 - p_i| \geq x_i^2/2] \leq e^{-N x_i^2/12}$$
We have $N = \frac{36\ln d}{\delta^2}$ and by assumption $x_i^2 > \delta^2$ (since $i \in S$). Therefore, $e^{-N x_i^2/12} \leq e^{-3\ln d} = 1/d^3$.
This proves that the event $|x_i^2 - p_i| \leq x_i^2/2$ occurs with probability at least $1 - \frac{1}{d^3}$ if $i \in S$. This inequality is equivalent to $\sqrt{2p_i/3} \leq |x_i| \leq \sqrt{2p_i}$. Thus, with high probability we have $E[n(0, i)] = \frac{N}{4}(x_i + \sqrt{p_i})^2 \geq 0.82\,N p_i$, since $\sqrt{2p_i/3} \leq |x_i|$. Moreover, since $p_i \geq x_i^2/2$, $E[n(0, i)] \geq 0.82\,N x_i^2/2 \geq 14.7\ln d$. Therefore, equation (29) becomes $P\left[n(0, i) \leq \frac{1}{2}E[n(0, i)]\right] \leq e^{-14.7\ln d/8} \leq \frac{1}{d^{1.83}}$.
We conclude that for $i \in S$, if $n(0, i) > 0.41\,N p_i$, the sign of $x_i$ is determined correctly by $\sigma(i)$ with high probability $1 - \frac{1}{d^{1.83}}$, as indicated in Step 5.
We finally show that $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| - \sqrt{p_i}|$ is bounded. Again by the Chernoff Bound (3.) we have, for $0 < \beta < 1$:
$$P[|x_i^2 - p_i| \geq \beta x_i^2] \leq e^{-\beta^2 N x_i^2/3}$$
By the identity $|x_i^2 - p_i| = ||x_i| - \sqrt{p_i}|\,(|x_i| + \sqrt{p_i})$ we have
$$P\left[\left||x_i| - \sqrt{p_i}\right| \geq \beta\frac{x_i^2}{|x_i| + \sqrt{p_i}}\right] \leq e^{-\beta^2 N x_i^2/3}$$
Since $\sqrt{p_i} > 0$, we have $\beta\frac{x_i^2}{|x_i| + \sqrt{p_i}} \leq \beta\frac{x_i^2}{|x_i|} = \beta|x_i|$, therefore $P\left[\left||x_i| - \sqrt{p_i}\right| \geq \beta|x_i|\right] \leq e^{-\beta^2 N x_i^2/3}$. Finally, by choosing $\beta = \delta/|x_i| < 1$ we have
$$P\left[\left||x_i| - \sqrt{p_i}\right| \geq \delta\right] \leq e^{-36\ln d/3} = 1/d^{12}$$
Case 2: $i \notin S$.
If $i \notin S$, we need to distinguish two sub-cases. When the estimated sign is wrong, i.e. $\sigma(i) = -\mathrm{sgn}(x_i)$, we have to bound $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| + \sqrt{p_i}|$. On the contrary, if it is correct, i.e. $\sigma(i) = \mathrm{sgn}(x_i)$, we have to bound $|x_i - \sigma(i)\sqrt{p_i}| = ||x_i| - \sqrt{p_i}| \leq ||x_i| + \sqrt{p_i}|$. Therefore only one bound is necessary.
We use the Chernoff Bound (2.) on the random variable $X$ with $\beta > 0$ to obtain
$$P[p_i > (1+\beta)x_i^2] \leq e^{-\frac{\beta^2}{2+\beta}N x_i^2}$$
We choose $\beta = \delta^2/x_i^2$ and obtain $P[p_i > x_i^2 + \delta^2] \leq e^{-\frac{\delta^4}{3\delta^2}N} = 1/d^{12}$. Therefore, if $i \notin S$, with very high probability $1 - \frac{1}{d^{12}}$ we have $p_i \leq x_i^2 + \delta^2 \leq 2\delta^2$. We can conclude and bound the error:
$$|x_i - \widetilde{X}_i| \leq ||x_i| + \sqrt{p_i}| \leq \delta + \sqrt{2}\delta = (1 + \sqrt{2})\delta$$
Since there are at most $d$ indices $i \notin S$, the probability for this result to be true for all $i \notin S$ is at least $1 - \frac{1}{d^{11}}$. This follows from applying the Union Bound on the event $p_i > x_i^2 + \delta^2$.
Figure 9: Numerical simulations of the training of the QCNN. These training curves represent the evolution of the loss $L$ as we iterate through the MNIST dataset. For each graph, the amplitude estimation error $\epsilon$ (0.1, 0.01), the non linearity cap $C$ (2, 10), and the backpropagation error $\delta$ (0.1, 0.01) are fixed, whereas the quantum sampling ratio $\sigma$ varies from 0.1 to 0.5. We can compare each training curve to the classical learning curve (CNN). Note that these training curves are smoothed over windows of 12 steps for readability.
In the following we report the classification results of the QCNN when applied to the test set (10,000 images). We distinguish two use cases: in Table 4 the QCNN has been trained quantumly as described in this paper, whereas in Table 5 we first trained the classical CNN, then transferred the weights to the QCNN only for the classification. This second use case has a worse global running time than the first one, but we see it as another concrete application: quantum machine learning could be used only for faster classification from a classically generated model, which could be the case for high-rate classification tasks (e.g. for autonomous systems, or classification over many simultaneous inputs). We report the test loss and accuracy for different values of the sampling ratio $\sigma$, the amplitude estimation error $\epsilon$, and, in the first case, the backpropagation noise $\delta$. The cap $C$ is fixed at 10. These values must be compared to the classical CNN classification metrics, for which the loss is 0.129 and the accuracy is 96.1%. Note that we used a relatively small CNN and hence
the accuracy is just over 96%, lower than the best possible accuracy with a larger CNN.
Table 4: QCNN trained with quantum backpropagation on MNIST dataset. With C = 10 fixed.
Table 5: QCNN created from a classical CNN trained on MNIST dataset. With δ = 0.01 and
C = 10 fixed.