The power of quantum neural networks (Abbas et al., 2021)
https://doi.org/10.1038/s43588-021-00084-1
It is unknown whether near-term quantum computers are advantageous for machine learning tasks. In this work we address this
question by trying to understand how powerful and trainable quantum machine learning models are in relation to popular clas-
sical neural networks. We propose the effective dimension—a measure that captures these qualities—and prove that it can be
used to assess any statistical model’s ability to generalize on new data. Crucially, the effective dimension is a data-dependent
measure that depends on the Fisher information, which allows us to gauge the ability of a model to train. We demonstrate
numerically that a class of quantum neural networks is able to achieve a considerably better effective dimension than compa-
rable feedforward networks and train faster, suggesting an advantage for quantum machine learning, which we verify on real
quantum hardware.
The power of a model lies in its ability to fit a variety of functions1. In machine learning, power is often referred to as a model's capacity to express different relationships between variables2. Deep neural networks have proven to be extremely powerful models, capable of capturing intricate relationships by learning from data3. Quantum neural networks serve as a newer class of machine learning models that are deployed on quantum computers and use quantum effects such as superposition, entanglement and interference to perform computation. Some proposals for quantum neural networks include4–11—and hint at—potential advantages such as speed-ups in training and faster processing. Although there has been much development in the growing field of quantum machine learning, a systematic study of the trade-offs between quantum and classical models has yet to be conducted12. In particular, the question of whether quantum neural networks are more powerful than classical neural networks is still open.

A common way to quantify the power of a model is by its complexity13. In statistical learning theory, the Vapnik–Chervonenkis dimension is an established complexity measure, from which error bounds on how well a model generalizes (that is, performs on unseen data) can be derived14. Although the Vapnik–Chervonenkis dimension has attractive properties in theory, computing it in practice is notoriously difficult. Furthermore, using the Vapnik–Chervonenkis dimension to bound generalization error requires several unrealistic assumptions, including that the model has access to infinite data15,16. The measure also scales with the number of parameters in the model and ignores the distribution of data. As modern deep neural networks are heavily overparameterized, generalization bounds based on the Vapnik–Chervonenkis dimension—and other measures alike—are typically vacuous17,18.

In ref. 19, the authors analyzed the expressive power of parameterized quantum circuits using memory capacity and found that quantum neural networks had limited advantages over classical neural networks. Memory capacity is, however, closely related to the Vapnik–Chervonenkis dimension and is thus subject to similar criticisms. In ref. 20, a quantum neural network is presented that exhibits a higher expressibility than certain classical models, captured by the types of probability distributions it can generate. Another result from ref. 21 is based on strong heuristics and provides systematic examples of possible advantages for quantum neural networks.

We turn our attention to measures that are easy to estimate in practice and, importantly, incorporate the distribution of data. In particular, measures such as the effective dimension have been motivated from an information-theoretic standpoint and depend on the Fisher information, a quantity that describes the geometry of a model's parameter space and is essential in both statistics and machine learning22–24. We argue that the effective dimension is a robust capacity measure through proof of a generalization error bound and supporting numerical analyses, and thus use this measure to study the power of a popular class of neural networks in both classical and quantum regimes.

Despite a lack of quantitative statements on the power of quantum neural networks, another issue is rooted in the trainability of these models. A precise connection between expressibility and trainability for certain classes of quantum neural networks is outlined in refs. 25,26. Quantum neural networks often suffer from the barren plateau phenomenon, wherein the loss landscape is perilously flat and parameter optimization is therefore extremely difficult27. As shown in ref. 28, barren plateaus may be noise induced, where certain noise models are assumed on the hardware. In other words, the effect of hardware noise can make it very difficult to train a quantum model. Furthermore, barren plateaus can be circuit induced, which relates to the design of a model and random parameter initialization. Methods to avoid the latter have been explored in refs. 29–32, but noise-induced barren plateaus remain problematic.

A particular attempt to understand the loss landscape of quantum models uses the Hessian33, which quantifies the curvature of a model's loss function at a point in its parameter space34. Properties of the Hessian, such as its spectrum, provide useful diagnostic information on the trainability of a model35. It was discovered that the entries of the Hessian vanish exponentially in models suffering from a barren plateau36. For certain loss functions, the Fisher information matrix coincides with the Hessian of the loss function37. Consequently, we can examine the trainability of quantum and classical neural networks by analyzing the Fisher information matrix, which is incorporated by the effective dimension. In this way, we may explicitly relate the effective dimension to model trainability38.

We find that a class of quantum neural networks is able to achieve a considerably higher capacity and faster training ability numerically than comparable classical feedforward neural networks. A higher capacity is captured by a higher effective dimension, whereas faster training is reflected in the Fisher information spectrum and confirmed by training experiments, including runs on real quantum hardware.
[Figure panels omitted: a circuit schematic (INPUT, ∣0〉, OUTPUT labels) and the Fig. 2 box plots of eigenvalue distributions; y axes show eigenvalues.]
Fig. 2 | Average Fisher information spectrum distribution. Here, box plots are used to reveal the average distribution of eigenvalues of the Fisher
information matrix for the classical feedforward neural network and the quantum neural network with two different feature maps. The dots in the box plots
represent outlier values relative to the length of the whiskers. The lower whiskers are at the lowest data points above Q1 – 1.5 × (Q3 – Q1), whereas the
upper whiskers are at the highest data points below Q3 + 1.5 × (Q3 – Q1), where Q1 and Q3 are the first and third quartiles, respectively. This is a standard
method to compute these plots. The easy quantum model has a classically simulatable data encoding strategy, whereas the quantum neural network’s
encoding scheme is conjectured to be difficult. In each model, we compute the Fisher information matrix 100 times using parameters sampled uniformly at
random and plot the resulting average distribution of the eigenvalues. We fix d = 40, input size at sin = 4 and output size at sout = 2. The top row contains the
average distribution of all eigenvalues for each model, whereas the bottom row contains the average distribution of eigenvalues less than 1 for each model.
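To make the sampling procedure in the caption concrete, the following is a minimal Python sketch of how (empirical) Fisher information matrices and their eigenvalue spectra might be collected for a generic parameterized model. The functions score_fn (the gradient of the log-likelihood with respect to the parameters) and sample_labels are hypothetical placeholders, not interfaces from the paper, and the empirical Fisher matrix is used as a stand-in for the exact one.

```python
# Hedged sketch: sampling Fisher information matrices and their eigenvalue
# spectra, mirroring the procedure described in the Fig. 2 caption.
# score_fn(theta, x, y) is a placeholder returning the gradient of
# log p_theta(y | x) with respect to theta, of shape (d,).
import numpy as np

def empirical_fisher(score_fn, theta, inputs, labels):
    """Average outer product of the score vector over the data samples."""
    d = theta.shape[0]
    fisher = np.zeros((d, d))
    for x, y in zip(inputs, labels):
        s = score_fn(theta, x, y)
        fisher += np.outer(s, s)
    return fisher / len(inputs)

def sample_fisher_spectra(score_fn, sample_labels, d=40, s_in=4,
                          n_param_sets=100, n_data=100, seed=0):
    """Eigenvalue spectra at parameters drawn uniformly from [-1, 1]^d,
    with inputs drawn from a standard Gaussian, as in Fig. 2."""
    rng = np.random.default_rng(seed)
    spectra = []
    for _ in range(n_param_sets):
        theta = rng.uniform(-1.0, 1.0, size=d)
        inputs = rng.standard_normal((n_data, s_in))
        labels = sample_labels(theta, inputs)  # e.g. drawn from the model itself
        f = empirical_fisher(score_fn, theta, inputs, labels)
        spectra.append(np.linalg.eigvalsh(f))
    return np.array(spectra)  # shape (n_param_sets, d), used for the box plots
```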
where V_Θ := ∫_Θ dθ ∈ R_+ is the volume of the parameter space. The matrix F̂(θ) ∈ R^{d×d} is the normalized Fisher information matrix defined as

\hat{F}_{ij}(\theta) := d \, \frac{V_\Theta}{\int_\Theta \mathrm{tr}(F(\theta))\,\mathrm{d}\theta} \, F_{ij}(\theta).

Remark 1 (properties of the effective dimension). In the limit n → ∞, the effective dimension converges to the maximal rank r̄ := max_{θ∈Θ} r_θ, where r_θ ≤ d denotes the rank of the Fisher information matrix F(θ). The proof of this result can be seen in Supplementary Section 2.1, but it is worthwhile to note that the effective dimension does not necessarily increase monotonically with n, as explained in Supplementary Section 2.2. The geometric operational meaning of the effective dimension only holds if n is sufficiently large. We conduct experiments over a wide range of n and ensure that conclusions are drawn from results where the choice of n is sufficient.

Another noteworthy point is that the effective dimension is easy to estimate. To see this, recall that we need to first estimate F(θ) and, second, calculate the integral over Θ given in equation (2). Both of these steps can be achieved via Monte Carlo integration which, in practice, does not depend on the model's dimension.

There are also two minor differences between equation (2) and the effective dimension from ref. 22: the presence of the constant γ ∈ (0,1], and the log n term. These modifications are helpful in proving a generalization bound such that the effective dimension may be interpreted as a bounded capacity measure, serving as a useful tool to analyze the power of statistical models. We demonstrate this in the Methods.

The Fisher information spectrum. Classically, the Fisher information spectrum reveals a lot about the optimization landscape of a model. The magnitude of the eigenvalues illustrates the curvature of a model for a particular parameterization. If there is a large concentration of eigenvalues near zero, the optimization landscape will be predominantly flat and parameters become difficult to train with gradient-based methods38. On the quantum side, we show in Supplementary Section 4 that if a model is in a barren plateau, the Fisher information spectrum will be concentrated around zero and training also becomes unfeasible. We can thus make connections to trainability via the spectrum of the Fisher information matrix by using the effective dimension. Looking closely at equation (2), we see that the effective dimension converges to its maximum fastest if the Fisher information spectrum is evenly distributed, on average.

We analyze the Fisher information spectra for the quantum neural network, the easy quantum model, and all possible configurations of the fully connected feedforward neural network—where all models share a specified triple (d, sin, sout). To be robust, we sample 100 sets of parameters uniformly on Θ = [−1, 1]^d and compute the Fisher information matrix 100 times using data sampled from a standard Gaussian distribution.
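As an illustration of the Monte Carlo estimation described above, here is a hedged Python sketch. The normalization of F̂ follows the definition given earlier; the outer formula assumes that equation (2) takes the form d_{γ,n} = 2 log( (1/V_Θ) ∫_Θ √det(id_d + κ_n F̂(θ)) dθ ) / log κ_n with κ_n = γn/(2π log n). That equation is not reproduced in this excerpt, so treat the sketch as illustrative rather than as the authors' reference implementation.

```python
# Hedged Monte Carlo sketch of the effective dimension, computed from Fisher
# information matrices sampled uniformly over the parameter space. The assumed
# functional form of equation (2) is noted in the lead-in above.
import numpy as np

def effective_dimension(fishers, n, gamma=1.0):
    """fishers: array of shape (k, d, d) holding Fisher matrices F(theta_k)."""
    k, d, _ = fishers.shape
    # Normalized Fisher matrices: F_hat = d * V_Theta / (int tr F dtheta) * F,
    # which under uniform sampling reduces to rescaling by d / (mean trace).
    f_hat = fishers * (d / np.mean(np.trace(fishers, axis1=1, axis2=2)))
    kappa = gamma * n / (2 * np.pi * np.log(n))
    # log of the Monte Carlo average of sqrt(det(I + kappa * F_hat(theta))),
    # evaluated with slogdet and a log-sum-exp for numerical stability.
    half_logdets = np.array(
        [0.5 * np.linalg.slogdet(np.eye(d) + kappa * f)[1] for f in f_hat])
    m = half_logdets.max()
    log_avg = m + np.log(np.mean(np.exp(half_logdets - m)))
    return 2.0 * log_avg / np.log(kappa)

# Dividing by d gives the normalized effective dimension plotted in Fig. 3a.
```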
[Fig. 3 panels omitted: a, normalized effective dimension versus number of data (×10^6); b, loss value versus number of training iterations, for the classical neural network, easy quantum model, quantum neural network and ibmq_montreal backend.]
Fig. 3 | Normalized effective dimension and training loss. a, The normalized effective dimension plotted for the quantum neural network (green), the easy
quantum model (blue) and the classical feedforward neural network (purple). We fix sin = 4, sout = 2 and d = 40. b, Using the first two classes of the Iris
dataset55, we train all three models at d = 8, with a full batch size. The ADAM optimizer, with an initial learning rate of 0.1, is selected. For a fixed number
of training iterations (100), we train all models over 100 trials and plot the average training loss along with ± 1 s.d. We further verify the performance
of the quantum neural network on real quantum hardware and train the model using the ibmq_montreal 27-qubit device. We plot the hardware results
until they stabilize, at roughly 33 training iterations; thereafter, we stop training and denote this final loss value with a dashed line. The actual hardware
implementation contains fewer CNOT gates by using linear connectivity for the feature map and variational circuit instead of all-to-all connectivity to cope
with limited resources, leading to the lower loss values.
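As a point of reference for the setup in Fig. 3b, the following is a hedged PyTorch sketch of the classical baseline only: the first two Iris classes, full-batch training with ADAM at an initial learning rate of 0.1 and a cross-entropy loss for 100 iterations. The specific architecture (a bias-free 4 × 2 linear layer, giving d = 8 trainable parameters) is an assumption chosen to match d = 8 and is not taken from the paper; the quantum models and hardware runs are not reproduced here.

```python
# Hedged sketch of the classical baseline in the Fig. 3b experiment. The exact
# architecture realizing d = 8 parameters is an assumption (4 * 2 weights, no bias).
import torch
from sklearn.datasets import load_iris

iris = load_iris()
mask = iris.target < 2                                    # first two Iris classes
X = torch.tensor(iris.data[mask], dtype=torch.float32)    # 4 features per sample
y = torch.tensor(iris.target[mask], dtype=torch.long)

model = torch.nn.Linear(4, 2, bias=False)                 # 8 trainable parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # initial learning rate 0.1
loss_fn = torch.nn.CrossEntropyLoss()

losses = []
for step in range(100):                                   # fixed number of iterations
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                           # full batch
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```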
The resulting average distributions of the eigenvalues of these 100 Fisher information matrices are plotted in the top row of Fig. 2 for d = 40, sin = 4 and sout = 2. A sensitivity analysis is included in Supplementary Section 3.1 to verify that 100 parameter samples are reasonable for the models we consider. In higher dimensions, this number will need to increase. The bottom row of Fig. 2 contains the distribution for eigenvalues less than 1.

The classical model depicted in Fig. 2 is the one with the highest average rank of Fisher information matrices. The majority of eigenvalues are negligible (of the order 10^−14), with a few very large values. This behavior is observed across all classical configurations that we consider and is consistent with results from the literature, where the Fisher information matrix of non-linear feedforward neural networks is known to be highly degenerate, with a few large eigenvalues38. The concentration around zero becomes more evident in the bottom row of the plot, which depicts the distribution of just the eigenvalues less than 1.

The easy quantum model also has most of its eigenvalues close to zero, and although there are some large eigenvalues, their magnitudes are not as extreme as those of the classical model.

The quantum neural network, on the other hand, has a distribution of eigenvalues that is more uniform, with no outlying values. This can be seen from the range of the eigenvalues on the y axis in Fig. 2. This distribution remains more or less constant as the number of qubits increases, even in the presence of hardware noise (see Supplementary Section 3.2); this has implications for capacity and trainability, which we examine next.

Capacity analysis. In Fig. 3a, we plot the normalized effective dimension for all three model types. The normalization ensures that the effective dimension lies between 0 and 1 by simply dividing by d. The convergence speed of the effective dimension to its maximum is slowed down by smaller eigenvalues and uneven Fisher information spectra. As the classical models contain highly degenerate Fisher matrices, the effective dimension converges the slowest, followed by the easy quantum model. The quantum neural network has non-degenerate Fisher information matrices and more even spectra, and it therefore consistently achieves the highest effective dimension over all ranges of finite data considered. Intuitively, we would expect the additional effects of quantum operations such as entanglement and superposition—if used effectively—to generate models with higher capacity. The quantum neural network with a strong feature map is thus expected to deliver the highest capacity, but recall that in the limit n → ∞, all models will converge to an effective dimension equal to the maximum rank of the Fisher information matrix (see Remark 1).

To support these observations, we calculate the capacity of each model using a different measure, the Fisher–Rao norm44. The average Fisher–Rao norm after training each model 100 times is roughly 250% higher in the quantum neural network than in the classical neural network, with the easy quantum model in between (see Supplementary Section 3.3).

Trainability. The observed Fisher information spectrum of the feedforward model is known to have undesirable optimization properties, where the outlying eigenvalues slow down training and loss convergence35. These large eigenvalues become even more pronounced in bigger models, as seen in Supplementary Fig. 5. On examining the easy quantum model over an increasing system size, the average Fisher spectrum becomes more concentrated around zero. This is characteristic of models encountering a barren plateau, presenting another unfavorable scenario for optimization. The quantum neural network, however, maintains its more even distribution of eigenvalues as the number of qubits and trainable parameters increase. Furthermore, a large proportion of the eigenvalues are not near zero. This highlights the importance of the feature map in a quantum model. The harder data encoding strategy used in the quantum neural network seems to structurally change the optimization landscape and remove the flatness usually associated with suboptimal optimization conditions such as barren plateaus.

We confirm the training statements for all three models with an experiment illustrated in Fig. 3b. Using a cross-entropy loss function, optimized with ADAM for a fixed number of training
qubit i depends on the ith feature of the data point x, normalized between [−1, 1]. RZZ gates are then applied to every pair of qubits. This time, the value of the controlled Z rotations depends on a product of feature values. For example, if the RZZ gate is controlled by qubit i and targets qubit j, then the angle of the controlled rotation applied to qubit j is dependent on the product of feature values x_i x_j. The RZZ gates are implemented using a decomposition into two CNOT gates and one RZ gate; thereafter, the RZ and RZZ gates are repeated once. The classically simulatable feature map employed in the easy quantum model is simply the first sets of Hadamard and RZ gates with no entanglement between

\leq c_{d,\Lambda} \left( \frac{\gamma n^{1/\alpha}}{2\pi \log n^{1/\alpha}} \right)^{d_{\gamma, n^{1/\alpha}}/2} \exp\!\left( - \frac{\gamma n}{16 M B^2 \pi \log n} \right), \qquad (5)

where M = M_1^{\alpha} M_2.

The proof is given in Supplementary Section 5.1. Note that the choice of the norm to bound the gradient of the Fisher information matrix is irrelevant due to the presence of the dimensional constant c_{d,Λ}. In the special case where the Fisher information matrix does not depend on θ, we have Λ = 0 and (5)
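Returning to the data-encoding circuit described at the start of this excerpt (rather than to inequality (5)), the following is a hedged Qiskit sketch of a feature map of that form: a layer of Hadamards, single-qubit RZ rotations with angles set by the individual features, and pairwise ZZ interactions with angles set by feature products, each realized with two CNOT gates and one RZ gate, with the RZ/RZZ block repeated once. The exact angle conventions (for example, factors of two) and the repetition pattern are assumptions of this sketch, not taken from the paper.

```python
# Hedged Qiskit sketch of the pairwise-entangling feature map described above.
# Angle conventions and the precise repetition pattern are assumptions.
from itertools import combinations
import numpy as np
from qiskit import QuantumCircuit

def feature_map(x, reps=2):
    """Encode a feature vector x (entries normalized to [-1, 1]) on len(x) qubits."""
    n = len(x)
    qc = QuantumCircuit(n)
    for i in range(n):
        qc.h(i)                        # initial Hadamard layer
    for _ in range(reps):              # RZ and RZZ block, applied twice
        for i in range(n):
            qc.rz(x[i], i)             # RZ angle set by the ith feature
        for i, j in combinations(range(n), 2):
            qc.cx(i, j)                # RZZ(x_i * x_j) decomposed into
            qc.rz(x[i] * x[j], j)      # two CNOTs and one RZ gate
            qc.cx(i, j)
    return qc

# Example: a 4-feature input, matching s_in = 4 in the experiments.
circuit = feature_map(np.array([0.1, -0.4, 0.7, 0.2]))
```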