
Articles

https://doi.org/10.1038/s43588-021-00084-1

The power of quantum neural networks


Amira Abbas1,2, David Sutter1, Christa Zoufal1,3, Aurelien Lucchi3, Alessio Figalli3 and Stefan Woerner1 ✉

1IBM Quantum, IBM Research—Zurich, Rueschlikon, Switzerland. 2University of KwaZulu-Natal, Durban, South Africa. 3ETH Zurich, Zurich, Switzerland. ✉e-mail: [email protected]

It is unknown whether near-term quantum computers are advantageous for machine learning tasks. In this work we address this question by trying to understand how powerful and trainable quantum machine learning models are in relation to popular classical neural networks. We propose the effective dimension—a measure that captures these qualities—and prove that it can be used to assess any statistical model's ability to generalize on new data. Crucially, the effective dimension is a data-dependent measure that depends on the Fisher information, which allows us to gauge the ability of a model to train. We demonstrate numerically that a class of quantum neural networks is able to achieve a considerably better effective dimension than comparable feedforward networks and train faster, suggesting an advantage for quantum machine learning, which we verify on real quantum hardware.

The power of a model lies in its ability to fit a variety of functions1. In machine learning, power is often referred to as a model's capacity to express different relationships between variables2. Deep neural networks have proven to be extremely powerful models, capable of capturing intricate relationships by learning from data3. Quantum neural networks serve as a newer class of machine learning models that are deployed on quantum computers and use quantum effects such as superposition, entanglement and interference to perform computation. Some proposals for quantum neural networks include4–11—and hint at—potential advantages such as speed-ups in training and faster processing. Although there has been much development in the growing field of quantum machine learning, a systematic study of the trade-offs between quantum and classical models has yet to be conducted12. In particular, the question of whether quantum neural networks are more powerful than classical neural networks is still open.

A common way to quantify the power of a model is by its complexity13. In statistical learning theory, the Vapnik–Chervonenkis dimension is an established complexity measure, where error bounds on how well a model generalizes (that is, performs on unseen data) can be derived14. Although the Vapnik–Chervonenkis dimension has attractive properties in theory, computing it in practice is notoriously difficult. Furthermore, using the Vapnik–Chervonenkis dimension to bound generalization error requires several unrealistic assumptions, including that the model has access to infinite data15,16. The measure also scales with the number of parameters in the model and ignores the distribution of data. As modern deep neural networks are heavily overparameterized, generalization bounds based on the Vapnik–Chervonenkis dimension—and other measures alike—are typically vacuous17,18.

In ref. 19, the authors analyzed the expressive power of parameterized quantum circuits using memory capacity and found that quantum neural networks had limited advantages over classical neural networks. Memory capacity is, however, closely related to the Vapnik–Chervonenkis dimension and is thus subject to similar criticisms. In ref. 20, a quantum neural network is presented that exhibits a higher expressibility than certain classical models, captured by the types of probability distributions it can generate. Another result from ref. 21 is based on strong heuristics and provides systematic examples of possible advantages for quantum neural networks.

We turn our attention to measures that are easy to estimate in practice and, importantly, incorporate the distribution of data. In particular, measures such as the effective dimension have been motivated from an information-theoretic standpoint and depend on the Fisher information, a quantity that describes the geometry of a model's parameter space and is essential in both statistics and machine learning22–24. We argue that the effective dimension is a robust capacity measure through proof of a generalization error bound and supporting numerical analyses, and thus use this measure to study the power of a popular class of neural networks in both classical and quantum regimes.

Despite a lack of quantitative statements on the power of quantum neural networks, another issue is rooted in the trainability of these models. A precise connection between expressibility and trainability for certain classes of quantum neural networks is outlined in refs. 25,26. Quantum neural networks often suffer from the barren plateau phenomenon, wherein the loss landscape is perilously flat and parameter optimization is therefore extremely difficult27. As shown in ref. 28, barren plateaus may be noise induced, where certain noise models are assumed on the hardware. In other words, the effect of hardware noise can make it very difficult to train a quantum model. Furthermore, barren plateaus can be circuit induced, which relates to the design of a model and random parameter initialization. Methods to avoid the latter have been explored in refs. 29–32, but noise-induced barren plateaus remain problematic.

A particular attempt to understand the loss landscape of quantum models uses the Hessian33, which quantifies the curvature of a model's loss function at a point in its parameter space34. Properties of the Hessian, such as its spectrum, provide useful diagnostic information on the trainability of a model35. It was discovered that the entries of the Hessian vanish exponentially in models suffering from a barren plateau36. For certain loss functions, the Fisher information matrix coincides with the Hessian of the loss function37. Consequently, we can examine the trainability of quantum and classical neural networks by analyzing the Fisher information matrix, which is incorporated by the effective dimension. In this way, we may explicitly relate the effective dimension to model trainability38.

We find that a class of quantum neural networks is able to achieve a considerably higher capacity and faster training ability numerically than comparable classical feedforward neural networks. A higher capacity is captured by a higher effective dimension, whereas faster training implies that a model will reach a lower training error than another comparable model for a fixed number of training iterations. More generally, trainability is assessed by leveraging the information-theoretic properties of the Fisher information, which we connect to the barren plateau phenomenon. Our experiments reveal that how data are encoded in a quantum neural network influences the likelihood of the model encountering a barren plateau. A quantum neural network with a data encoding strategy that is easy to simulate classically seems more likely to encounter a barren plateau, whereas a harder encoding strategy shows resilience to the phenomenon. Noise, however, remains problematic by inhibiting training in general.
Results
Quantum neural networks. Quantum neural networks are a subclass of variational quantum algorithms that comprise quantum circuits containing parameterized gate operations39. Information (usually in the form of classical data) is first encoded into a quantum state via a state-preparation routine called a quantum feature map40. The choice of feature map is geared towards enhancing the performance of the quantum neural network and is typically neither optimized nor trained, although this idea is discussed in ref. 41. Once data are encoded into a quantum state, a model called a variational model is applied, which contains parameterized gate operations that are optimized for a particular task, analogous to classical machine learning techniques5–7,42. The final output of the quantum neural network is extracted from measurements made to the quantum circuit after the variational model is applied. These measurements are often converted to labels or predictions through classical post-processing before being passed to a loss function, where the idea is to choose parameters for the variational model that minimize the loss function.

The quantum models we use can be summarized in Fig. 1, with details of the structure and implementation in the Methods. We create two model variants: one which we call a quantum neural network and the other an easy quantum model.

Fig. 1 | Overview of the quantum neural network used in this study. The input x ∈ R^{s_in} is encoded into an S-qubit Hilbert space by applying the feature map |ψ_x⟩ := U_x|0⟩^⊗S. This state is then evolved via a variational form |g_θ(x)⟩ := G_θ|ψ_x⟩, where G_θ is a parameterized unitary evolving the state after the feature map to a new state, and the parameters θ ∈ Θ are chosen to minimize a certain loss function. Finally a measurement is performed whose outcome z = (z_1, …, z_S) is post-processed to extract the output of the model y := f(z).

The Fisher information. A way to assess the information gained by a particular parameterization of a statistical model is epitomized by the Fisher information. By defining a neural network as a statistical model, we can describe the joint relationship between data pairs (x,y) as p(x,y;θ) = p(y∣x;θ)p(x) for all x ∈ X ⊂ R^{s_in}, y ∈ Y ⊂ R^{s_out} and θ ∈ Θ ⊂ [−1,1]^d (where θ is a vectorized parameter set, Θ is the full parameter space and d is the number of trainable parameters). This is achieved by applying an appropriate post-processing function in both classical and quantum networks. In the classical network, we apply a softmax function to the final layer, whereas in the quantum network we obtain probabilities based on the parity of the output bit strings. The input distribution p(x) is a prior distribution, whereas the conditional distribution p(y∣x;θ) describes the input–output relation of the model for a fixed θ ∈ Θ; Θ forms a Riemannian space, which gives rise to a Riemannian metric, namely, the Fisher information matrix

F(\theta) = \mathbb{E}_{(x,y)\sim p}\left[\frac{\partial}{\partial\theta}\log p(x,y;\theta)\,\frac{\partial}{\partial\theta}\log p(x,y;\theta)^{T}\right] \in \mathbb{R}^{d\times d},

which can be approximated by the empirical Fisher information matrix

\tilde{F}_k(\theta) = \frac{1}{k}\sum_{j=1}^{k}\frac{\partial}{\partial\theta}\log p(x_j,y_j;\theta)\,\frac{\partial}{\partial\theta}\log p(x_j,y_j;\theta)^{T},   (1)

where (x_j, y_j), j = 1, …, k, are independent and identically distributed, drawn from the distribution p(x,y;θ) (ref. 37). By definition, the Fisher information matrix is positive semidefinite and hence its eigenvalues are non-negative, real numbers.
The Fisher information conveniently helps capture the sensitivity of a neural network's output relative to movements in the parameter space43. In ref. 44, the authors leverage geometric invariances associated with the Fisher information to produce the Fisher–Rao norm, a robust norm-based capacity measure defined as the quadratic form ||θ||_fr^2 := θ^T F(θ)θ for θ. Notably, the Fisher–Rao norm acts as an umbrella for several other existing norm-based measures45–47 and has demonstrated desirable properties both theoretically and empirically.
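To make equation (1) and the Fisher–Rao norm concrete, the following is a minimal NumPy sketch, assuming a generic log_prob(theta, x, y) callable that stands in for log p(x, y; θ) and using central finite differences for the score vectors purely for illustration; the function names are illustrative and not taken from the paper's code.

```python
import numpy as np

def empirical_fisher(log_prob, theta, xs, ys, eps=1e-5):
    """Empirical Fisher information matrix of equation (1).

    theta is a NumPy array of the d trainable parameters; log_prob(theta, x, y)
    is a stand-in for the model's log-likelihood log p(x, y; theta), and its
    gradient (the score) is approximated by central finite differences here.
    """
    d = len(theta)
    fisher = np.zeros((d, d))
    for x, y in zip(xs, ys):
        score = np.zeros(d)
        for i in range(d):
            shift = np.zeros(d)
            shift[i] = eps
            score[i] = (log_prob(theta + shift, x, y)
                        - log_prob(theta - shift, x, y)) / (2 * eps)
        fisher += np.outer(score, score)   # rank-one contribution per sample
    return fisher / len(xs)                # average over the k samples

def fisher_rao_norm_sq(log_prob, theta, xs, ys):
    """Quadratic form theta^T F(theta) theta of ref. 44."""
    return theta @ empirical_fisher(log_prob, theta, xs, ys) @ theta
```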
The effective dimension. The effective dimension is a complexity measure motivated by information geometry, with useful qualities. The goal of the effective dimension is to estimate the size that a model occupies in model space—the space of all possible functions for a particular model class, where the Fisher information matrix serves as the metric. Although there are many ways to define the effective dimension, a useful definition is presented in ref. 22, which is designed to be operationally meaningful in settings where data are limited. More precisely, the number of data observations determines a natural scale or resolution used to observe model space. This is beneficial in practice where data are often scarce and can help in understanding how data availability influences the accurate capture of model complexity.

The effective dimension is motivated by the theory of minimum description length, which is a model selection principle favoring models with the shortest description of the given data. Based on this principle, it can be shown that the complexity at size n of a model is given by

\frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta}\sqrt{\det F(\theta)}\,\mathrm{d}\theta + o(1),

where o(1) vanishes as n → ∞ (ref. 48). The first term containing d is usually interpreted as the dimension of the model, whereas the second term is known as the geometric complexity. Information geometric manipulations allow us to combine both terms into a single expression, referred to as the effective dimension22.

Definition 1. The effective dimension of a statistical model M_Θ := {p(·,·;θ) : θ ∈ Θ} with respect to γ ∈ (0,1], a d-dimensional parameter space Θ ⊂ R^d and n ∈ N, n > 1 data samples is defined as

d_{\gamma,n}(\mathcal{M}_{\Theta}) := 2\,\frac{\log\left(\frac{1}{V_{\Theta}}\int_{\Theta}\sqrt{\det\left(\mathrm{id}_d + \frac{\gamma n}{2\pi\log n}\hat{F}(\theta)\right)}\,\mathrm{d}\theta\right)}{\log\frac{\gamma n}{2\pi\log n}},   (2)

where V_Θ := \int_{\Theta}\mathrm{d}\theta ∈ R_+ is the volume of the parameter space. The matrix F̂(θ) ∈ R^{d×d} is the normalized Fisher information matrix defined as

\hat{F}_{ij}(\theta) := d\,\frac{V_{\Theta}\,F_{ij}(\theta)}{\int_{\Theta}\mathrm{tr}(F(\theta))\,\mathrm{d}\theta}.

Remark 1 (properties of the effective dimension). In the limit n → ∞, the effective dimension converges to the maximal rank r̄ := max_{θ∈Θ} r_θ, where r_θ ≤ d denotes the rank of the Fisher information matrix F(θ). The proof of this result can be seen in Supplementary Section 2.1, but it is worthwhile to note that the effective dimension does not necessarily increase monotonically with n, as explained in Supplementary Section 2.2. The geometric operational meaning of the effective dimension only holds if n is sufficiently large. We conduct experiments over a wide range of n and ensure that conclusions are drawn from results where the choice of n is sufficient.

Another noteworthy point is that the effective dimension is easy to estimate. To see this, recall that we need to first estimate F(θ) and, second, calculate the integral over Θ given in equation (2). Both of these steps can be achieved via Monte Carlo integration which, in practice, does not depend on the model's dimension.
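As an illustration of this Monte Carlo route, the sketch below estimates equation (2) from an array of Fisher matrices evaluated at parameters drawn uniformly from Θ = [−1,1]^d; the uniform average replaces the integral, so V_Θ cancels. The function name and array layout are assumptions for illustration, not the repository's API.

```python
import numpy as np

def effective_dimension(fisher_samples, n, gamma=1.0):
    """Monte Carlo estimate of the effective dimension, equation (2).

    fisher_samples: array of shape (m, d, d) holding F(theta) evaluated at m
    parameter sets drawn uniformly from Theta = [-1, 1]^d.
    """
    m, d, _ = fisher_samples.shape
    # Normalized Fisher information F_hat: its trace averages to d over Theta.
    f_hat = d * fisher_samples / np.mean(np.trace(fisher_samples, axis1=1, axis2=2))
    kappa = gamma * n / (2 * np.pi * np.log(n))
    # log of sqrt(det(id_d + kappa * F_hat)) for each sample, via slogdet.
    log_sqrt_det = 0.5 * np.linalg.slogdet(np.eye(d) + kappa * f_hat)[1]
    # Numerically stable log of the Monte Carlo average over the m samples.
    shift = log_sqrt_det.max()
    log_avg = shift + np.log(np.mean(np.exp(log_sqrt_det - shift)))
    return 2 * log_avg / np.log(kappa)

# Dividing by d, e.g. effective_dimension(F, n=10**6) / d, gives the normalized
# effective dimension between 0 and 1 that is plotted in Fig. 3a.
```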
There are also two minor differences between equation (2) and the effective dimension from ref. 22: the presence of the constant γ ∈ (0,1], and the log n term. These modifications are helpful in proving a generalization bound such that the effective dimension may be interpreted as a bounded capacity measure, serving as a useful tool to analyze the power of statistical models. We demonstrate this in the Methods.

Fig. 2 | Average Fisher information spectrum distribution. Here, box plots are used to reveal the average distribution of eigenvalues of the Fisher information matrix for the classical feedforward neural network and the quantum neural network with two different feature maps. The dots in the box plots represent outlier values relative to the length of the whiskers. The lower whiskers are at the lowest data points above Q1 − 1.5 × (Q3 − Q1), whereas the upper whiskers are at the highest data points below Q3 + 1.5 × (Q3 − Q1), where Q1 and Q3 are the first and third quartiles, respectively. This is a standard method to compute these plots. The easy quantum model has a classically simulatable data encoding strategy, whereas the quantum neural network's encoding scheme is conjectured to be difficult. In each model, we compute the Fisher information matrix 100 times using parameters sampled uniformly at random and plot the resulting average distribution of the eigenvalues. We fix d = 40, input size at s_in = 4 and output size at s_out = 2. The top row contains the average distribution of all eigenvalues for each model, whereas the bottom row contains the average distribution of eigenvalues less than 1 for each model.

The Fisher information spectrum. Classically, the Fisher information spectrum reveals a lot about the optimization landscape of a model. The magnitude of the eigenvalues illustrates the curvature of a model for a particular parameterization. If there is a large concentration of eigenvalues near zero, the optimization landscape will be predominantly flat and parameters become difficult to train with gradient-based methods38. On the quantum side, we show in Supplementary Section 4 that if a model is in a barren plateau, the Fisher information spectrum will be concentrated around zero and training also becomes unfeasible. We can thus make connections to trainability via the spectrum of the Fisher information matrix by using the effective dimension. Looking closely at equation (2), we see that the effective dimension converges to its maximum fastest if the Fisher information spectrum is evenly distributed, on average.

We analyze the Fisher information spectra for the quantum neural network, the easy quantum model, and all possible configurations of the fully connected feedforward neural network—where all models share a specified triple (d, s_in, s_out). To be robust, we sample 100 sets of parameters uniformly on Θ = [−1,1]^d and compute the Fisher information matrix 100 times using data sampled from a standard Gaussian distribution. The resulting average distributions


of the eigenvalues of these 100 Fisher information matrices are plotted in the top row of Fig. 2 for d = 40, s_in = 4 and s_out = 2. A sensitivity analysis is included in Supplementary Section 3.1 to verify that 100 parameter samples are reasonable for the models we consider. In higher dimensions, this number will need to increase. The bottom row of Fig. 2 contains the distribution for eigenvalues less than 1.
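The spectrum analysis itself reduces to an eigenvalue computation over the sampled Fisher matrices. The sketch below assumes the matrices are already available; random positive-semidefinite placeholders are generated so it runs stand-alone, whereas in practice they would come from each model (for example via the empirical Fisher sketch above, with θ drawn uniformly from [−1,1]^d and standard-Gaussian inputs, as described in the text).

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_theta = 40, 100

# Placeholder PSD matrices standing in for F(theta) at the 100 sampled theta.
A = rng.standard_normal((num_theta, d, d))
fisher_samples = A @ A.transpose(0, 2, 1) / d

spectra = np.linalg.eigvalsh(fisher_samples)        # real, non-negative eigenvalues
print("average eigenvalue spectrum:", spectra.mean(axis=0))
print("fraction of eigenvalues below 1:", (spectra < 1.0).mean())  # cf. bottom row of Fig. 2
```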

The classical model depicted in Fig. 2 is the one with the highest average rank of Fisher information matrices. The majority of eigenvalues are negligible (of the order 10^−14), with a few very large values. This behavior is observed across all classical configurations that we consider and is consistent with results from literature, where the Fisher information matrix of non-linear feedforward neural networks is known to be highly degenerate, with a few large eigenvalues38. The concentration around zero becomes more evident in the bottom row of the plot, which depicts the eigenvalue distribution of just the eigenvalues less than 1.

The easy quantum model also has most of its eigenvalues close to zero, and although there are some large eigenvalues, their magnitudes are not as extreme as the classical model.

The quantum neural network, on the other hand, has a distribution of eigenvalues that is more uniform, with no outlying values. This can be seen from the range of the eigenvalues on the y-axis in Fig. 2. This distribution remains more or less constant as the number of qubits increases, even in the presence of hardware noise (see Supplementary Section 3.2); this has implications for capacity and trainability, which we examine next.

Fig. 3 | Normalized effective dimension and training loss. a, The normalized effective dimension plotted for the quantum neural network (green), the easy quantum model (blue) and the classical feedforward neural network (purple). We fix s_in = 4, s_out = 2 and d = 40. b, Using the first two classes of the Iris dataset55, we train all three models at d = 8, with a full batch size. The ADAM optimizer, with an initial learning rate of 0.1, is selected. For a fixed number of training iterations (100), we train all models over 100 trials and plot the average training loss along with ±1 s.d. We further verify the performance of the quantum neural network on real quantum hardware and train the model using the ibmq_montreal 27-qubit device. We plot the hardware results until they stabilize, at roughly 33 training iterations; thereafter we stop training and denote this final loss value with a dashed line. The actual hardware implementation contains fewer CNOT gates by using linear connectivity for the feature map and variational circuit instead of all-to-all connectivity to cope with limited resources, leading to the lower loss values.

Capacity analysis. In Fig. 3a, we plot the normalized effective dimension for all three model types. The normalization ensures that the effective dimension lies between 0 and 1 by simply dividing by d. The convergence speed of the effective dimension to its maximum is slowed down by smaller eigenvalues and uneven Fisher information spectra. As the classical models contain highly degenerate Fisher matrices, the effective dimension converges the slowest, followed by the easy quantum model. The quantum neural network has non-degenerate Fisher information matrices and more even spectra; it therefore consistently achieves the highest effective dimension over all ranges of finite data considered. Intuitively, we would expect the additional effects of quantum operations such as entanglement and superposition—if used effectively—to generate models with higher capacity. The quantum neural network with a strong feature map is thus expected to deliver the highest capacity, but recall that in the limit n → ∞, all models will converge to an effective dimension equal to the maximum rank of the Fisher information matrix (see Remark 1).

To support these observations, we calculate the capacity of each model using a different measure, the Fisher–Rao norm44. The average Fisher–Rao norm after training each model 100 times is roughly 250% higher in the quantum neural network than in the classical neural network, with the easy quantum model in between (see Supplementary Section 3.3).

Trainability. The observed Fisher information spectrum of the feedforward model is known to have undesirable optimization properties, where the outlying eigenvalues slow down training and loss convergence35. These large eigenvalues become even more pronounced in bigger models, as seen in Supplementary Fig. 5. On examining the easy quantum model over an increasing system size, the average Fisher spectrum becomes more concentrated around zero. This is characteristic of models encountering a barren plateau, presenting another unfavorable scenario for optimization. The quantum neural network, however, maintains its more even distribution of eigenvalues as the number of qubits and trainable parameters increase. Furthermore, a large proportion of the eigenvalues are not near zero. This highlights the importance of a feature map in a quantum model. The harder data encoding strategy used in the quantum neural network seems to structurally change the optimization landscape and remove the flatness usually associated with suboptimal optimization conditions such as barren plateaus.

We confirm the training statements for all three models with an experiment illustrated in Fig. 3b. Using a cross-entropy loss function, optimized with ADAM for a fixed number of training

iterations (100) and an initial learning rate of 0.1, the quantum neural network trains to a lower loss, faster than the other two models over an average of 100 trials. To support the promising training performance of the quantum neural network, we also train it once on real hardware using the ibmq_montreal 27-qubit device. We reduce the number of controlled NOT (CNOT) gates by only considering linear entanglement instead of all-to-all entanglement in the feature map and variational circuit. This is to cope with hardware limitations and could be the reason the hardware training performs even better than the simulated results, as too much entanglement has been shown to have negative effects on model trainability49. The full details of the experiment are contained in Supplementary Section 3.4. We find that the quantum neural network tangibly demonstrates faster training; however, the addition of hardware noise may still make training difficult, regardless of the optimization landscape (see Supplementary Section 3.2).
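For concreteness, the following is a minimal NumPy sketch of such a full-batch training loop with a cross-entropy loss and an ADAM-style update (initial learning rate 0.1, 100 iterations, matching the setup above). The model_probs(theta, x) callable is a stand-in for any of the three models, and the finite-difference gradient is used purely for illustration; this is not the paper's implementation.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class (label is 0 or 1)."""
    return -np.log(probs[label] + 1e-12)

def train(model_probs, theta0, xs, ys, iters=100, lr=0.1, eps=1e-3):
    """Full-batch ADAM-style training loop (sketch)."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)            # first-moment estimate
    v = np.zeros_like(theta)            # second-moment estimate
    b1, b2, delta = 0.9, 0.999, 1e-8    # standard ADAM constants
    losses = []

    def batch_loss(params):
        return np.mean([cross_entropy(model_probs(params, x), y)
                        for x, y in zip(xs, ys)])

    for t in range(1, iters + 1):
        losses.append(batch_loss(theta))
        grad = np.zeros_like(theta)
        for i in range(len(theta)):     # central finite differences, illustration only
            shift = np.zeros_like(theta)
            shift[i] = eps
            grad[i] = (batch_loss(theta + shift) - batch_loss(theta - shift)) / (2 * eps)
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad**2
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + delta)
    return theta, losses
```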
Discussion
In stark contrast to classical models, understanding the capacity of quantum neural networks is not well explored. Moreover, classical neural networks are known to produce highly degenerate Fisher information matrices, which can considerably slow down training. No such analysis has been performed for quantum neural networks. This work attempts to address this gap but leaves room for further research. The feature map in a quantum model plays a large role in determining both its capacity and trainability via the effective dimension and Fisher information spectrum. A deeper investigation needs to be conducted on why the particular higher-order feature map used in this study produces a desirable model landscape that induces both a high capacity and faster training ability. Different variational circuits could also influence the model's landscape, and the effects of non-unitary operations (for example, induced through intermediate measurements) should be investigated. The Fisher information spectra of certain quantum models seem robust against hardware noise, but trainability remains problematic and the possibility of noise-induced barren plateaus needs examination. Finally, understanding generalization performance on multiple datasets and larger models, with complexities of practical interest, might prove insightful.

Overall, we have shown that quantum neural networks can possess a desirable Fisher information spectrum that enables them to train faster and express more functions than comparable classical and quantum models—a promising reveal for quantum machine learning, which we hope leads to further studies on the power of quantum models.

Methods
Quantum models used in this study. The quantum models used in this study first encode classical data x ∈ R^{s_in} into an S-qubit Hilbert space using a feature map, U_x. For the quantum neural network, we use a feature map originally proposed in ref. 50, and in the easy quantum model we swap out this feature map for one that is easy to simulate classically. Supplementary Fig. 1 contains a circuit representation of the feature map from ref. 50, which we refer to as the hard feature map. Here the number of qubits in the model is chosen to equal the number of feature values of the data (that is, S := s_in). That way, we can associate the same index for each qubit with each feature value of a data point; for example, if we have data with three feature values (that is, x = (x_1, x_2, x_3)^T), we will have a three-qubit model with qubits = (q_1, q_2, q_3).

The operations in the hard feature map first apply Hadamard gates to each of the qubits, followed by a layer of RZ gates, whereby the angle of the Z rotation on qubit i depends on the ith feature of the data point x, normalized between [−1,1]. RZZ gates are then applied to every pair of qubits. This time, the value of the controlled Z rotations depends on a product of feature values. For example, if the RZZ gate is controlled by qubit i and targets qubit j, then the angle of the controlled rotation applied to qubit j is dependent on the product of feature values x_i x_j. The RZZ gates are implemented using a decomposition into two CNOT gates and one RZ gate; thereafter, the RZ and RZZ gates are repeated once. The classically simulatable feature map employed in the easy quantum model is simply the first sets of Hadamard and RZ gates with no entanglement between any qubits, as performed at the beginning of Supplementary Fig. 1. These operations are not repeated.

After the feature map circuit is applied, we apply another set of operations that depend on trainable parameters. We call this the variational circuit, G_θ. Supplementary Fig. 2 depicts the variational form deployed in both the easy quantum model and the quantum neural network. The circuit consists of S qubits, and parameterized RY gates are applied to every qubit. CNOT gates are thereafter applied between each pair of qubits in the circuit. Finally, another set of parameterized RY gates is applied to every qubit. This circuit has, by definition, 2S parameters. If the depth is increased, the entangling layers and second set of parameterized RY gates are repeated; d can be calculated as d = (D + 1)S, where S is equal to s_in due to the choice of both feature maps used in this study and D is called the depth of the circuit (that is, how many times the entanglement and second set of RY operations are repeated).

We finally measure all qubits in the σ_z basis and classically compute the parity of the output bit strings. For simplicity, we consider binary classification, where the probability of observing class 0 corresponds to the probability of seeing even-parity bit strings and, similarly, for class 1 with odd-parity bit strings.
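The circuits just described can be sketched with basic Qiskit primitives as below. The rotation-angle conventions (x_i for the RZ gates, x_i x_j for the RZZ decomposition) follow the description above; the exact prefactors of the actual implementation may differ, and the function names are illustrative rather than taken from the paper's repository.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def hard_feature_map(x, reps=2):
    """Hard feature map of ref. 50 as described above: Hadamards once, then
    RZ/RZZ layers (RZZ implemented as CNOT-RZ-CNOT), repeated once."""
    S = len(x)
    qc = QuantumCircuit(S)
    for q in range(S):
        qc.h(q)
    for _ in range(reps):
        for i in range(S):
            qc.rz(x[i], i)                 # angle from the ith feature
        for i in range(S):
            for j in range(i + 1, S):      # all-to-all qubit pairs
                qc.cx(i, j)
                qc.rz(x[i] * x[j], j)      # angle from the product x_i * x_j
                qc.cx(i, j)
    return qc

def variational_circuit(theta, S, depth=1):
    """RY layer, all-to-all CNOT entanglement, RY layer; repeating the last
    two blocks gives d = (depth + 1) * S trainable parameters."""
    theta = np.reshape(theta, (depth + 1, S))
    qc = QuantumCircuit(S)
    for q in range(S):
        qc.ry(theta[0, q], q)
    for layer in range(1, depth + 1):
        for i in range(S):
            for j in range(i + 1, S):
                qc.cx(i, j)
        for q in range(S):
            qc.ry(theta[layer, q], q)
    return qc

def parity_probabilities(x, theta, depth=1):
    """Class probabilities from the parity of the measured bit strings."""
    qc = hard_feature_map(x).compose(variational_circuit(theta, len(x), depth))
    probs = Statevector(qc).probabilities_dict()
    p_even = sum(p for bits, p in probs.items() if bits.count('1') % 2 == 0)
    return np.array([p_even, 1.0 - p_even])  # [class 0 (even parity), class 1 (odd)]
```

Dropping the CNOT–RZ–CNOT blocks and the repetition in hard_feature_map yields the easy quantum model's encoding, and parity_probabilities supplies the class probabilities that enter the cross-entropy loss.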
There are two reasons for the choice of these models' architecture: first, the hard feature map is motivated in ref. 50 to serve as a useful data embedding strategy that is believed to be difficult to simulate classically as the depth and width increase, and the easy feature map allows us to benchmark this; second, the variational design aims to create more expressive models for quantum algorithms51. We benchmark the quantum models against a class of classical models that forms part of the foundation of deep learning, namely, feedforward neural networks. We consider all possible topologies with full connectivity for a fixed number of trainable parameters. Networks with and without biases and different activation functions are explored.

Generalization error bounds for the effective dimension. Suppose we are given a hypothesis class, H, of functions mapping from X to Y and a training set S_n = {(x_1, y_1), …, (x_n, y_n)} ∈ (X × Y)^n, where the pairs (x_i, y_i) are drawn independent and identically distributed from some unknown joint distribution, p. Furthermore, let L : Y × Y → R be a loss function. The challenge is to find a particular hypothesis h ∈ H with the smallest possible expected risk, defined as R(h) := E_{(x,y)∼p}[L(h(x), y)]. As we only have access to a training set S_n, a good strategy to find the best hypothesis h ∈ H is to minimize the so-called empirical risk, defined as R_n(h) := (1/n) Σ_{i=1}^{n} L(h(x_i), y_i). The difference between the expected and the empirical risk is the generalization error—an important quantity in machine learning that dictates whether a hypothesis h ∈ H learned on a training set will perform well on unseen data, drawn from the unknown joint distribution p (ref. 17). Therefore, an upper bound on the quantity

\sup_{h\in H} |R(h) - R_n(h)|,   (3)

which vanishes as n grows large, is of considerable interest. Capacity measures help quantify the expressiveness and power of H. The generalization error in equation (3) is thus typically bounded by an expression that depends on a capacity measure, such as the Vapnik–Chervonenkis dimension3 or the Fisher–Rao norm44. Theorem 1 provides a bound based on the effective dimension, which we use to study the power of neural networks from here on.
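As a small, self-contained illustration of these definitions (not code from the paper), the empirical risk is simply an average loss over S_n, and the generalization gap for a fixed hypothesis can only be estimated, for example on held-out data:

```python
import numpy as np

def empirical_risk(loss, h, samples):
    """R_n(h): average loss of hypothesis h over a sample set of (x, y) pairs."""
    return np.mean([loss(h(x), y) for x, y in samples])

# The expected risk R(h) is the same average taken under the true distribution p,
# so |empirical_risk(loss, h, held_out) - empirical_risk(loss, h, train_set)|
# estimates the generalization error of equation (3) for this single hypothesis h.
```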
In this manuscript we consider neural networks as models described by stochastic maps, parameterized by some θ ∈ Θ. As a result, the variables h and H are replaced by θ and Θ, respectively. The corresponding loss functions are mappings L : P(Y) × P(Y) → R, where P(Y) denotes the set of distributions on Y. We assume the following regularity assumption on the model M_Θ := {p(·,·;θ) : θ ∈ Θ}:

Θ ∋ θ ↦ p(·,·;θ) is M_1-Lipschitz continuous w.r.t. the supremum norm.   (4)

Theorem 1 (generalization bound for the effective dimension). Let Θ = [−1,1]^d and consider a statistical model M_Θ := {p(·,·;θ) : θ ∈ Θ} that satisfies equation (4) such that F̂(θ) has full rank for all θ ∈ Θ, and ||∇_θ log F̂(θ)|| ≤ Λ for some Λ ≥ 0 and all θ ∈ Θ. Let d_{γ,n} denote the effective dimension of M_Θ as defined in equation (2). Furthermore, let L : P(Y) × P(Y) → [−B/2, B/2] for B > 0 be a loss function that is α-Hölder continuous with constant M_2 in the first argument with regards to the total variation distance for some α ∈ (0,1]. Then there exists a constant c_{d,Λ} such that for γ ∈ (0,1] and all n ∈ N, we have

P\left(\sup_{\theta\in\Theta}|R(\theta)-R_n(\theta)| \geq 4M\sqrt{\frac{2\pi\log n}{\gamma n}}\right) \leq c_{d,\Lambda}\left(\frac{\gamma n^{1/\alpha}}{2\pi\log n^{1/\alpha}}\right)^{\frac{d_{\gamma,n^{1/\alpha}}}{2}}\exp\left(-\frac{16M^{2}\,2\pi\log n}{B^{2}\gamma}\right),   (5)

where M = M_1^α M_2.

The proof is given in Supplementary Section 5.1. Note that the choice of the norm to bound the gradient of the Fisher information matrix is irrelevant due to the presence of the dimensional constant c_{d,Λ}. In the special case where the Fisher information matrix does not depend on θ, we have Λ = 0 and (5)


holds for c_{d,0} = 2^d. This may occur in scenarios where a neural network is already trained, that is, the parameters θ ∈ Θ are fixed. If we choose γ ∈ (0,1] to be sufficiently small, we can ensure that the right-hand side of equation (5) vanishes in the limit n → ∞. This is explained in Supplementary Section 5. To verify the ability of the effective dimension to capture generalization behavior, we conduct a numerical analysis similar to work presented in ref. 52. We find that the effective dimension for a model trained on confusion sets with increasing label corruption accurately captures generalization behavior. The details can be found in Supplementary Section 5.2.

The continuity assumptions of Theorem 1 are satisfied for a large class of classical and quantum statistical models53,54, as well as many popular loss functions. The full rank assumption on the Fisher information matrix, however, often does not hold in classical models. Non-linear feedforward neural networks, which we consider in this study, have particularly degenerate Fisher information matrices38. We thus further extend the generalization bound to account for a broad range of models that may not have a full rank Fisher information matrix.

Remark 2 (relaxing the rank constraint in Theorem 1). The generalization bound in equation (5) can be modified to hold for a statistical model without a full rank Fisher information matrix. By partitioning Θ, we discretize the statistical model and prove a generalization bound for the discretized version of M_Θ := {p(·,·;θ) : θ ∈ Θ}, denoted by M_Θ^{(κ)} := {p^{(κ)}(·,·;θ) : θ ∈ Θ}, where κ ∈ N is a discretization parameter. By choosing κ carefully, we can control the discretization error. We then proceed similarly as in the proof of Theorem 1, that is, first connecting the generalization error to the covering number and then relating the covering number to the effective dimension. This is explained in detail, along with the proof, in Supplementary Section 5.3.

Training the quantum neural network on real hardware. The hardware experiment is conducted on the ibmq_montreal 27-qubit device. We use four qubits with linear connectivity to train the quantum neural network on the first two classes of the Iris dataset55. We deploy the same training specifications as in Supplementary Section 3.3 and randomly initialize the parameters. Once the training loss stabilizes, that is, the change in the loss from one iteration to the next is small, we stop the hardware training. This occurs after roughly 33 training steps. The results are contained in Fig. 3b and the real hardware shows remarkable performance relative to all other models. Due to limited hardware availability, this experiment is only run once, and an analysis of the hardware noise and the spread of the training loss for differently sampled initial parameters would make these results more robust.

We plot the circuit that is implemented on the quantum device in Supplementary Fig. 8. As in the quantum neural network discussed in Supplementary Section 1, the circuit contains parameterized RZ and RZZ rotations that depend on the data, as well as parameterized RY gates with eight trainable parameters. Note the different entanglement structure presented here as opposed to the circuits in Supplementary Figs. 1 and 2. This is to reduce the number of CNOT gates required to incorporate current hardware constraints and could be the reason the actual hardware implementation trains so well, as too much entanglement has been shown to have a negative effect on model trainability49. The full circuit repeats the feature map encoding once before the variational form is applied.

Data availability
The data for the graphs and analyses in this study was generated using Python. Source data are provided with this paper. All other data can be accessed via the following Zenodo repository: https://doi.org/10.5281/zenodo.4732830 (ref. 56).

Code availability
All code to generate the data, figures and analyses in this study is publicly available with detailed information on the implementation via the following Zenodo repository: https://doi.org/10.5281/zenodo.4732830 (ref. 56).

Received: 20 November 2020; Accepted: 14 May 2021; Published online: 24 June 2021

References
1. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016); http://www.deeplearningbook.org
2. Baldi, P. & Vershynin, R. The capacity of feedforward neural networks. Neural Networks 116, 288–311 (2019).
3. Dziugaite, G. K. & Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proc. 33rd Conference on Uncertainty in Artificial Intelligence (UAI, 2017).
4. Schuld, M. Supervised Learning with Quantum Computers (Springer, 2018).
5. Zoufal, C., Lucchi, A. & Woerner, S. Quantum generative adversarial networks for learning and loading random distributions. npj Quant. Inf. 5, 1–9 (2019).
6. Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quant. Sci. Technol. 2, 045001 (2017).
7. Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Rep. Prog. Phys. 81, 074001 (2018).
8. Ciliberto, C. et al. Quantum machine learning: a classical perspective. Proc. Roy. Soc. A 474, 20170551 (2018).
9. Killoran, N. et al. Continuous-variable quantum neural networks. Phys. Rev. Res. 1, 033063 (2019).
10. Schuld, M., Sinayskiy, I. & Petruccione, F. The quest for a quantum neural network. Quant. Inf. Proc. 13, 2567–2586 (2014).
11. Farhi, E. & Neven, H. Classification with quantum neural networks on near term processors. Quant. Rev. Lett. 1, 2 (2020).
12. Aaronson, S. Read the fine print. Nat. Phys. 11, 291–293 (2015).
13. Vapnik, V. The Nature of Statistical Learning Theory Vol. 8, 1–15 (Springer, 2000).
14. Vapnik, V. N. & Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971).
15. Sontag, E. D. Neural Networks and Machine Learning 69–95 (Springer, 1998).
16. Vapnik, V., Levin, E. & Cun, Y. L. Measuring the VC-dimension of a learning machine. Neural Comput. 6, 851–876 (1994).
17. Neyshabur, B., Bhojanapalli, S., McAllester, D. & Srebro, N. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems 30, 5947–5956 (NIPS, 2017).
18. Arora, S., Ge, R., Neyshabur, B. & Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. In Proc. 35th International Conference on Machine Learning Vol. 80, 254–263 (PMLR, 2018); http://proceedings.mlr.press/v80/arora18b.html
19. Wright, L. G. & McMahon, P. L. The capacity of quantum neural networks. In Conference on Lasers and Electro-Optics JM4G.5 (Optical Society of America, 2020); http://www.osapublishing.org/abstract.cfm?URI=CLEO_QELS-2020-JM4G.5
20. Du, Y., Hsieh, M.-H., Liu, T. & Tao, D. Expressive power of parametrized quantum circuits. Phys. Rev. Res. 2, 033125 (2020).
21. Huang, H.-Y. et al. Power of data in quantum machine learning. Nat. Commun. 12, 2631 (2021).
22. Berezniuk, O., Figalli, A., Ghigliazza, R. & Musaelian, K. A scale-dependent notion of effective dimension. Preprint at https://arxiv.org/abs/2001.10872 (2020).
23. Rissanen, J. J. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory 42, 40–47 (1996).
24. Cover, T. M. & Thomas, J. A. Elements of Information Theory (Wiley, 2006).
25. Nakaji, K. & Yamamoto, N. Expressibility of the alternating layered ansatz for quantum computation. Quantum 5, 434 (2021).
26. Holmes, Z., Sharma, K., Cerezo, M. & Coles, P. J. Connecting ansatz expressibility to gradient magnitudes and barren plateaus. Preprint at https://arxiv.org/abs/2101.02138 (2021).
27. McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 1–6 (2018).
28. Wang, S. et al. Noise-induced barren plateaus in variational quantum algorithms. Preprint at https://arxiv.org/abs/2007.14384 (2020).
29. Cerezo, M., Sone, A., Volkoff, T., Cincio, L. & Coles, P. J. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat. Commun. 12, 1791 (2021).
30. Verdon, G. et al. Learning to learn with quantum neural networks via classical neural networks. Preprint at https://arxiv.org/abs/1907.05415 (2019).
31. Volkoff, T. & Coles, P. J. Large gradients via correlation in random parameterized quantum circuits. Quant. Sci. Technol. 6, 025008 (2021).
32. Skolik, A., McClean, J. R., Mohseni, M., van der Smagt, P. & Leib, M. Layerwise learning for quantum neural networks. Quant. Mach. Intell. 3, 5 (2021).
33. Huembeli, P. & Dauphin, A. Characterizing the loss landscape of variational quantum circuits. Quant. Sci. Technol. 6, 025011 (2021).
34. Bishop, C. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Comput. 4, 494–501 (1992).
35. LeCun, Y. A., Bottou, L., Orr, G. B. & Müller, K.-R. Efficient BackProp 9–48 (Springer, 2012); https://doi.org/10.1007/978-3-642-35289-8_3
36. Cerezo, M. & Coles, P. J. Higher order derivatives of quantum neural networks with barren plateaus. Quant. Sci. Technol. 6, 035006 (2021).
37. Kunstner, F., Hennig, P. & Balles, L. Limitations of the empirical Fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems 32, 4156–4167 (NIPS, 2019); http://papers.nips.cc/paper/limitations-of-fisher-approximation
38. Karakida, R., Akaho, S. & Amari, S.-I. Universal statistics of Fisher information in deep neural networks: mean field approach. In Proc. Machine Learning Research Vol. 89, 1032–1041 (PMLR, 2019); http://proceedings.mlr.press/v89/karakida19a.html

39. Schuld, M., Bocharov, A., Svore, K. M. & Wiebe, N. Circuit-centric quantum classifiers. Phys. Rev. A 101, 032308 (2020).
40. Schuld, M., Sweke, R. & Meyer, J. J. Effect of data encoding on the expressive power of variational quantum-machine-learning models. Phys. Rev. A 103, 032430 (2021).
41. Lloyd, S., Schuld, M., Ijaz, A., Izaac, J. & Killoran, N. Quantum embeddings for machine learning. Preprint at https://arxiv.org/abs/2001.03622 (2020).
42. Cong, I., Choi, S. & Lukin, M. D. Quantum convolutional neural networks. Nat. Phys. 15, 1273–1278 (2019).
43. Amari, S.-I. Natural gradient works efficiently in learning. Neural Comput. 10, 251–276 (1998).
44. Liang, T., Poggio, T., Rakhlin, A. & Stokes, J. Fisher–Rao metric, geometry, and complexity of neural networks. In Proc. Machine Learning Research Vol. 89, 888–896 (PMLR, 2019); http://proceedings.mlr.press/v89/liang19a.html
45. Neyshabur, B., Salakhutdinov, R. R. & Srebro, N. Path-SGD: path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems 28, 2422–2430 (NIPS, 2015).
46. Neyshabur, B., Tomioka, R. & Srebro, N. Norm-based capacity control in neural networks. In Proc. Machine Learning Research Vol. 40, 1376–1401 (PMLR, 2015); http://proceedings.mlr.press/v40/Neyshabur15.html
47. Bartlett, P. L., Foster, D. J. & Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30, 6240–6249 (NIPS, 2017); http://papers.nips.cc/paper/7204-spectrally-normalized
48. Rissanen, J. J. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory 42, 40–47 (1996).
49. Marrero, C. O., Kieferová, M. & Wiebe, N. Entanglement induced barren plateaus. Preprint at https://arxiv.org/abs/2010.15968 (2020).
50. Havlíček, V. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019).
51. Sim, S., Johnson, P. D. & Aspuru-Guzik, A. Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms. Adv. Quant. Technol. 2, 1900070 (2019).
52. Jia, Z. & Su, H. Information-theoretic local minima characterization and regularization. In Proc. 37th International Conference on Machine Learning Vol. 119, 4773–4783 (PMLR, 2020); http://proceedings.mlr.press/v119/jia20a.html
53. Virmaux, A. & Scaman, K. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems 31, 3835–3844 (NIPS, 2018); http://papers.nips.cc/paper/lipschitz-regularity-of-deep-neural-networks
54. Sweke, R. et al. Stochastic gradient descent for hybrid quantum-classical optimization. Quantum 4, 314 (2020).
55. Dua, D. & Graff, C. UCI Machine Learning Repository (2017); http://archive.ics.uci.edu/ml
56. Abbas, A. et al. amyami187/effective_dimension: The Effective Dimension Code (Zenodo, 2021); https://doi.org/10.5281/zenodo.4732830

Acknowledgements
We thank M. Schuld for insightful discussions on data embedding in quantum models. We also thank T. L. Scholten for constructive feedback on the manuscript. C.Z. acknowledges support from the National Centre of Competence in Research Quantum Science and Technology (QSIT).

Author contributions
The main ideas were developed by all of the authors. A.A. provided numerical simulations. D.S. and A.F. proved the technical claims. All authors contributed to the write-up.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s43588-021-00084-1.
Correspondence and requests for materials should be addressed to S.W.
Peer review information Nature Computational Science thanks Patrick Coles and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Jie Pan, in collaboration with the Nature Computational Science team.
Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© The Author(s), under exclusive licence to Springer Nature America, Inc. 2021
