Quantum-Classical Hybrid Machine Learning For Image Classification

Abstract—Image classification is a major application domain for conventional deep learning (DL). Quantum machine learning (QML) has the potential to revolutionize image classification. In any typical DL-based image classification, we use a convolutional neural network (CNN) to extract features from the image and a multi-layer perceptron (MLP) network to create the actual decision boundaries. QML models can be useful in both of these tasks. On one hand, convolution with parameterized quantum circuits (Quanvolution) can extract rich features from the images. On the other hand, quantum neural network (QNN) models can create complex decision boundaries. Therefore, Quanvolution and QNN can be used to create an end-to-end QML model for image classification. Alternatively, we can extract image features separately using classical dimension reduction techniques such as Principal Component Analysis (PCA) or a Convolutional Autoencoder (CAE) and use the extracted features to train a QNN. We review two proposals on quantum-classical hybrid ML models for image classification, namely, the Quanvolutional Neural Network and dimension reduction using a classical algorithm followed by a QNN. Particularly, we make a case for trainable filters in Quanvolution and for CAE-based feature extraction for image datasets (instead of dimension reduction using linear transformations such as PCA). We discuss various design choices, potential opportunities, and drawbacks of these models. We also release a Python-based framework to create and explore these hybrid models with a variety of design choices.

I. INTRODUCTION

Quantum computing is a new computing paradigm with tremendous future potential. Even though the technology is still in a nascent stage, the community is seeking computational advantage from quantum computers (i.e., quantum supremacy) for practical applications. Recently, Google claimed quantum supremacy can be achieved even with a 53-qubit quantum processor by completing a specific computation in 200 seconds that might take 10K years [1] (later rectified to 2.5 days [2]) on a state-of-the-art supercomputer.

The near-term quantum devices have a limited number of qubits. Moreover, they suffer from various types of noise (decoherence, gate errors, measurement errors, crosstalk, etc.). Due to these limitations, these machines are not yet well-suited to executing quantum algorithms that rely on high orders of error correction (e.g., Shor's factorization, Grover's search). Quantum machine learning (QML) promises to achieve quantum advantage with near-term machines because it is based on a variational principle (similar to other near-term algorithms such as the Quantum Approximate Optimization Algorithm or QAOA [3], the Variational Quantum Eigensolver or VQE [4], and so on) that does not necessitate error correction [5].

Image classification is one of the most useful ML tasks, with wide applications in autonomous driving [7], [8], medical diagnostics [9], [10], and biometric security [11], [12], to name a few. An image classification ML pipeline generally consists of two stages: (i) feature extraction, and (ii) classification based on the extracted features. Before the rise of convolutional neural networks (CNN), various statistical techniques dominated feature extraction from images (e.g., SIFT, SURF, FAST, etc.), commonly referred to as feature engineering [13]. Later, these extracted features are used as inputs to a classifier (e.g., KNN, SVM, Decision Tree, Naive Bayes, MLP, etc.) [14]. CNN can extract the features and learn the classification decision boundaries simultaneously, and thus eliminates the tedious step of feature engineering. As a result, CNN has become the ML algorithm of choice for image classification in recent years. It has also achieved human-level accuracy in many image recognition tasks [9], [10], [15].

Several QML models have been proposed for image classification to exploit quantum computers in practical use cases [6], [16]–[22]. In [6], the authors proposed Quanvolutional Neural Networks, where parametric quantum circuits are used as filters/kernels to extract features from images. These quantum filters take image segments as inputs and produce output feature maps by transforming the data in the quantum space. The output features are used as inputs to an MLP network. In [17], the authors extended the classical transfer learning approach to the quantum domain. Here, the trained convolutional layers in a classical deep neural network are used to extract image features. Later, a Quantum Neural Network (QNN) is trained separately to learn the classification decision boundaries from these features. Several works have used classical dimension reduction techniques (e.g., Principal Component Analysis or PCA) to extract image features and later used them as inputs to a QNN [23], [24]. In [16], the authors propose the Quantum Convolutional Neural Network (QCNN), which is motivated by CNN. Here, convolutions are multi-qubit operations performed on neighboring pairs of qubits. These convolutions are followed by pooling layers, which are implemented by measuring a subset of the qubits and using the results to control subsequent operations. The network ends with pairwise operations on the remaining qubits before measurement.

In this paper, we review two promising hybrid architectures for image classification: (i) the Quanvolutional Neural Network (Quanvolution + MLP), and (ii) classical dimension reduction + QNN. We discuss their design choices, characteristics, enhancements, and potential drawbacks. Particularly, we advocate for trainable quantum filters in Quanvolution, and a classical Convolutional Autoencoder (CAE) for image feature extraction in quantum-classical hybrid image classification models.
Fig. 1. (a) shows a classical convolution operation. (b) shows a toy Quanvolution operation proposed in [6]. In Quanvolution, a quantum circuit (also referred
to as quantum filter) encodes an image segment as a quantum state, and produces output features corresponding to that segment through state transformation
using a parameterized circuit and subsequent measurement operations. (c) shows the network diagram of a toy Multi-layer Perceptron (MLP) network. A
conventional Quantum Neural Network (QNN) is shown in (d). It consists of a data encoding circuit, a parameterized circuit, and measurement operations.
A QNN/quantum filter has a myriad of design choices in terms of encoding methods, parametric circuits, and measurement operations. However, in this work, we only use two configurations for demonstration. The accompanying Python-based framework supports a wide variety of QNN/quantum filter design choices (6 encoding circuits, 19 parametric circuits, and 6 measurement circuits). Interested readers can utilize/extend this framework to explore the design space.

In the remainder of the paper, we cover basics of quantum computing and QNN in Section II, discuss the hybrid architectures in Section III, present relevant results in Section IV, and draw the conclusions in Section V.

II. PRELIMINARIES

Qubits, Quantum Gates, State Vector, & Measurements: A qubit is analogous to a classical bit. However, unlike a classical bit, a qubit can be in a superposition state, i.e., a combination of |0⟩ and |1⟩ at the same time. A variety of qubit technologies exist, e.g., superconducting qubits, trapped ions, neutral atoms, and silicon spin qubits, to name a few [25]. Quantum gates, either single-qubit (e.g., the Pauli-X (σx) gate) or multi-qubit (e.g., the 2-qubit CNOT gate), modulate the state of qubits to perform computations. These gates can perform either a fixed or a tunable computation, e.g., an X gate flips a qubit state while the RY(θ) gate rotates the qubit along the Y-axis by θ. A two-qubit gate changes the state of one qubit (the target qubit) based on the current state of the other qubit (the control qubit). A quantum circuit can contain many gate operations. Qubits are measured in a desired basis to retrieve the final state of a quantum program. In physical quantum computers, measurements are generally restricted to a computational basis, e.g., the Z-basis in IBM quantum computers.

Expectation Value of an Operator: The expectation value is the average of the eigenvalues, weighted by the probabilities that the state is measured to be in the corresponding eigenstate. Mathematically, the expectation value of an operator σ is defined as ⟨ψ|σ|ψ⟩, where |ψ⟩ is the qubit state vector. It varies between the minimum and maximum eigenvalues of the operator. For example, the Pauli-Z (σz) operator has two eigenvalues: +1 and -1. Therefore, the Pauli-Z expectation value of a qubit will vary in the range [-1, 1] depending on the qubit state; for |ψ⟩ = α|0⟩ + β|1⟩, it is ⟨ψ|σz|ψ⟩ = |α|² − |β|².

Quantum Neural Network: QNN involves parameter optimization of a PQC to obtain a desired input-output relationship. A QNN generally consists of three segments: (i) a classical-to-quantum data encoding (or embedding) circuit, (ii) a parameterized circuit, and (iii) measurement operations. A variety of encoding methods are available in the literature [26]. For continuous variables, the most widely used encoding scheme is angle encoding, where a continuous-variable input classical feature is encoded as a rotation of a qubit along the desired axis (X/Y/Z) [26]–[29]. For 'n' classical features, we require 'n' qubits. For example, RZ(f1) on a qubit in superposition (a Hadamard (H) gate is used to put the qubit in superposition) encodes a classical feature 'f1' in Fig. 2(b). We can also encode multiple continuous variables in a single qubit using sequential rotations. For example, 'f1', 'f2', 'f3', and 'f4' are encoded using consecutive RZ(f1), RX(f2), RZ(f3), and RX(f4) rotations on a single qubit in Fig. 2(c). As the states produced by a qubit rotation along any axis repeat in 2π intervals (Fig. 2(a)), features are generally scaled within 0 to 2π (or -π to π) in a data pre-processing step.

The parametric circuit has two components: entangling operations and parameterized single-qubit rotations. The entangling operations are a set of multi-qubit operations between all the qubits to generate correlated states [29]. The parametric single-qubit operations that follow are used to search through the solution space. This combination of entangling and single-qubit rotation operations is referred to as a parametric layer in QNN. A widely used parametric layer architecture is shown in Fig. 2(d) [27], [30]. Here, CRZ(θ) gates between neighboring qubits create the entanglement, which is followed by rotations along the Y-axis using RY(θ). Normally, these layers are repeated multiple times to extend the search space [27], [28].

QNN Cost Functions: Qubits in a QNN circuit are measured in the computational basis to retrieve the output state. A cost function is derived from the measurements to train the network [27], [28], [31]. For example, in a binary classification problem, the authors of [27] measured all the qubits of the QNN model in the Pauli-Z basis and associated class 0 with the probability of obtaining even parity, and class 1 with odd parity. The model is then trained using binary cross-entropy loss. In [32], the authors used the Pauli-Z expectation value of a single qubit (-1 associated with class 1 and +1 associated with class 0) for a binary classifier and trained it using mean squared error (MSE) loss. In [23], the authors fed the outputs of the QNN to a fully-connected layer.
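As an illustration of the parity-based cost, the following toy sketch (our own Python/NumPy illustration, not code from [27]) scores class 0 by the total probability of even-parity basis states and computes binary cross-entropy on that probability:

```python
# Toy sketch of the parity-based cost of [27] (our illustration, assumptions:
# `probs` is the measured probability distribution over n-qubit basis states).
import numpy as np

def even_parity_probability(probs):
    """probs[i] = probability of computational basis state i."""
    return sum(p for i, p in enumerate(probs) if bin(i).count("1") % 2 == 0)

def bce_loss(p_class0, label):
    """Binary cross-entropy; label 0 -> even parity, label 1 -> odd parity."""
    p_label = p_class0 if label == 0 else 1.0 - p_class0
    return -np.log(p_label + 1e-12)

probs = np.array([0.4, 0.1, 0.2, 0.3])  # e.g., a 2-qubit output distribution
print(bce_loss(even_parity_probability(probs), label=0))  # P(even) = 0.7
```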
Fig. 2. (a) Bloch sphere representation of a qubit. At any given step, a qubit can be rotated along the X, Y, or Z axis by applying a gate. The states will
repeat in 2π intervals. (b) Angle encoding 1:1 (1 continuous variable encoded in a single qubit state using RZ rotation, n qubits are required to encode n
continuous variables as an n-qubit state). (c) Angle encoding 4:1 (4 continuous variables encoded in a single qubit state using alternating RZ and RX rotations,
4 qubits encode 16 continuous variables as a 4-qubit state). (d) Parametric layer used in this work. Parametric CRZ gates entangle the qubits; this is followed
by single qubit RY rotations. Each n-qubit parametric layer has 2n circuit parameters.
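For concreteness, a QNN combining the encoding of Fig. 2(b), the parametric layer of Fig. 2(d), and Pauli-Z expectation outputs can be sketched as follows. We use PennyLane purely for illustration (the accompanying framework's API may differ); the weight shape (layers, 2, qubits) mirrors the 2n parameters per n-qubit layer noted in the caption.

```python
# Minimal PennyLane sketch of the QNN of Fig. 2 (illustrative assumptions:
# 4 qubits, 3 layers, nearest-neighbor CRZ entanglement in a ring).
import pennylane as qml
from pennylane import numpy as np

n_qubits, n_layers = 4, 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qnn(features, weights):
    # (i) Angle encoding, 1 feature/qubit: H creates superposition, RZ(f)
    # encodes the (pre-scaled) classical feature f.
    for q in range(n_qubits):
        qml.Hadamard(wires=q)
        qml.RZ(features[q], wires=q)
    # (ii) Parametric layers: CRZ entanglement between neighboring qubits,
    # followed by RY rotations (2n parameters per n-qubit layer).
    for l in range(n_layers):
        for q in range(n_qubits):
            qml.CRZ(weights[l, 0, q], wires=[q, (q + 1) % n_qubits])
        for q in range(n_qubits):
            qml.RY(weights[l, 1, q], wires=q)
    # (iii) Measurement: Pauli-Z expectation of each qubit, each in [-1, 1].
    return [qml.expval(qml.PauliZ(q)) for q in range(n_qubits)]

weights = np.random.uniform(-np.pi, np.pi, (n_layers, 2, n_qubits))
features = np.random.uniform(0, 2 * np.pi, n_qubits)  # scaled to [0, 2*pi]
print(qnn(features, weights))
```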
will be worthwhile for the research community to explore them for possible quantum advantage [40].

Network Design: Similar to CNN, a Quanvolutional layer can have many filters, and multiple Quanvolutional layers can be stacked upon each other to develop a deep Quanvolutional Neural Network [6], [21]. The outputs from the final Quanvolutional layer can be fed to an MLP (or a QNN). One can also apply classical non-linear activation functions (for additional non-linearity) and maxpooling (downsampling) at the output of a Quanvolutional layer. One can create separate filters by initializing the same PQC with different random seeds. Alternatively, we can use different PQC architectures, encoding methods, and measurement operations altogether to create different filters. One can also stack classical convolutional layers with Quanvolutional layers. Fig. 3 shows the network diagram used in this work with a single Quanvolutional layer followed by two fully connected classical layers.

Number of Circuit Executions: The number of quantum circuit executions per sample during training/inference depends on the kernel size, the image size, and the stride (the amount of movement of the kernel in terms of pixels). For a 28x28 image and a 4x4 kernel, we need to execute a total of 7x7 quantum circuits when stride = 4 (a single non-trainable quantum filter). If this filter has 10 trainable parameters, the total number of circuit executions becomes 7x7 + 2x10x7x7, where the latter 2x10x7x7 circuit executions are necessary to compute the gradients using the parameter-shift rule (2 extra circuit executions for each parameter [36]). If we take 50 samples per batch, we will need 50x(7x7 + 2x10x7x7) filter executions for each batch during training. This is, in fact, a prohibitively large number for a single batch and a single filter. However, all these circuits are independent of each other. Hence, one can argue that all these computations can be done simultaneously if one has access to multiple quantum computing resources.
has access to multiple quantum computing resources. used to extract image features. A conventional QNN is trained
with these extracted features and image labels to perform final
B. Classical Dimension Reduction + QNN classification. When both of these networks are trained, the
Another popular hybrid QML model targeted for smaller encoder block and the QNN block is used together to classify
quantum devices uses classical algorithm (e.g., Principal Com- data samples. We refer to this architecture as CAE+QNN.
ponent Analysis or PCA, Linear Discriminant Analysis or QNN Design-Space: As mentioned earlier, numerous choices
LDA, etc.) to reduce data dimension to a level that is tractable exist for the encoding circuits, PQC, and measurement circuits
for a small QNN model [23], [24], [41]. Although PCA/LDA to build a QNN model. The accompanying Python framework
work quite well to extract most salient features from small to this work supports a wide variety of these choices which
tabular data, they are not suitable to extract features from will impact the learnability of the QNN [30]. However, in
large images. Autoencoders (AE), particularly Convolutional this work we only use the single feature/qubit encoding
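Below is a minimal PyTorch sketch of a CAE in this spirit. The convolutional channel counts and kernel sizes are our assumptions for illustration (Fig. 4 itself is not reproduced here); the 28x28 input, the innermost fully connected code of size d, and the final ConvTranspose2d + Sigmoid follow the description above.

```python
# Hedged CAE sketch (layer sizes are our assumptions; the paper's exact
# architecture may differ). 'd' is the latent dimension (5 or 10 here).
import torch
import torch.nn as nn

class CAE(nn.Module):
    def __init__(self, d=10):
        super().__init__()
        self.encoder = nn.Sequential(                              # 1x28x28 -> d
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),    # 8x14x14
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),   # 16x7x7
            nn.Flatten(),
            nn.Linear(16 * 7 * 7, d),       # innermost fully connected code
        )
        self.decoder = nn.Sequential(                              # d -> 1x28x28
            nn.Linear(d, 16 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (16, 7, 7)),
            nn.ConvTranspose2d(16, 8, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),       # 8x14x14
            nn.ConvTranspose2d(8, 1, 3, stride=2, padding=1,
                               output_padding=1), nn.Sigmoid(),    # 1x28x28
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Trained with MSE between input and reconstruction; afterwards, the encoder
# alone extracts d-dimensional features for the QNN.
model, loss_fn = CAE(d=10), nn.MSELoss()
x = torch.rand(50, 1, 28, 28)
print(loss_fn(model(x), x))
```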
Network Design: The hybrid network (Fig. 5) consists of two separate networks, a CAE and a QNN. The CAE is trained with the original image dataset to learn a lower-dimensional representation of the data. The trained encoder network is used to extract image features. A conventional QNN is trained with these extracted features and image labels to perform the final classification. When both of these networks are trained, the encoder block and the QNN block are used together to classify data samples. We refer to this architecture as CAE+QNN.

Fig. 5. The CAE + QNN network architecture. The trained CAE Encoder block creates a lower-dimensional representation of the image for the QNN. (Training: the CAE is trained on the image dataset; the trained Encoder extracts features; the QNN is trained separately with the extracted dataset. Inference: Data → Encoder → QNN → Class Assignment.)

QNN Design-Space: As mentioned earlier, numerous choices exist for the encoding circuits, PQC, and measurement circuits to build a QNN model. The accompanying Python framework of this work supports a wide variety of these choices, which will impact the learnability of the QNN [30]. However, in this work we only use the single feature/qubit encoding method (Fig. 2(b)), the PQC layer of Fig. 2(d), and Z-basis measurements of the qubits in the QNN. We also restrict the number of parametric layers to 3. Following [23], we feed the QNN outputs to a fully-connected layer. The number of output neurons is equal to the number of classes in the dataset. A sketch of the combined inference pipeline follows.
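The following hedged sketch (same illustrative assumptions as the earlier snippets; `encoder` stands for the trained CAE encoder) wraps a 10-qubit QNN as a PyTorch layer via PennyLane's qml.qnn.TorchLayer and feeds its Pauli-Z outputs to a fully-connected class-assignment layer:

```python
# Illustrative CAE+QNN composition (our assumptions: 10 latent features,
# 3-class problem; qml.qnn.TorchLayer requires the input argument to be
# named `inputs`).
import torch
import pennylane as qml

n_qubits, n_layers, n_classes = 10, 3, 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qnn(inputs, weights):
    for q in range(n_qubits):
        qml.Hadamard(wires=q)
    qml.AngleEmbedding(inputs, wires=range(n_qubits), rotation="Z")
    for l in range(n_layers):
        for q in range(n_qubits):
            qml.CRZ(weights[l, 0, q], wires=[q, (q + 1) % n_qubits])
        for q in range(n_qubits):
            qml.RY(weights[l, 1, q], wires=q)
    return [qml.expval(qml.PauliZ(q)) for q in range(n_qubits)]

qlayer = qml.qnn.TorchLayer(qnn, {"weights": (n_layers, 2, n_qubits)})
head = torch.nn.Linear(n_qubits, n_classes)  # fully-connected output layer

def classify(encoder, image):
    features = encoder(image)   # trained CAE encoder -> 10 latent features
    return head(qlayer(features)).argmax(dim=-1)
```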
CAE+QNN Vs. Transfer Learning: Although the CAE+QNN network has some similarities with transfer learning [17], there are some noteworthy differences as well. Both of these approaches extract image features using a classical network. In transfer learning, the convolutional layers of a classical CNN, trained on a different dataset (e.g., AlexNet trained to classify the ImageNet dataset), are used to extract features for a target dataset. In contrast, the CAE in the CAE+QNN network is trained separately to extract features from the target dataset. Therefore, the CAE-extracted features may capture more variance in the target dataset compared to transfer learning, and thus may provide better performance (lower training cost/higher accuracy). The features extracted through transfer learning can be more generic [17], and transfer learning eliminates the need to train the classical network separately (as it is already trained).

IV. EVALUATION

In this Section, we compare the performance differences between (a) Quanvolutional Neural Networks with trainable and non-trainable filters, and (b) the CAE + QNN and PCA-based approaches for a variety of datasets.

Datasets: We pick the MNIST and Fashion-MNIST datasets for this work, which are widely used in contemporary QML research (each pixel value scaled within 0-1) [44], [45]. Both of these datasets have 60,000 training samples and 10,000 test samples of 2D images (28x28 pixels) that belong to 10 different classes. For empirical evaluation of the Quanvolution approach, we have picked 1,200 samples from three different classes to create 6 smaller classification datasets: MNIST_179, MNIST_246, MNIST_358, Fashion_012, Fashion_345, and Fashion_678. MNIST_179 has ≈400 samples each of digits 1, 7, and 9. Similarly, Fashion_012 has ≈400 samples each of classes 0 (t-shirt/top), 1 (trouser/pant), and 2 (pullover shirt). We have reduced the dimension of the samples from 28x28 to 14x14 using maxpooling to lower the simulation time. To compare CAE and PCA, we have trained the corresponding CAE (with latent dimensions of 5 and 10) and PCA models with the entire MNIST and Fashion-MNIST datasets. Later, we have created 6 smaller classification datasets as before using the trained models with various latent dimensions (5/10) and principal components (10).

Metrics: We have divided the datasets into two equal sets for training and validation (600 samples/set). We use the average loss and accuracy over the entire training and validation datasets to measure the performance of the QML models [46].

Training Setup: We use the gradient-based Adagrad, SGD, and Adam optimizers to train these models [47], [48]. We use the same set of hyper-parameters across all the runs (learning rate = 0.5 for all the quantum/hybrid models).

Trainable Vs. Non-trainable Filters in Quanvolution: We trained Quanvolutional neural networks with a single Quanvolutional layer (Fig. 3) for six 3-class classification problems with a trainable and a non-trainable quantum filter (stride = 4). We used 4-qubit circuits as Quanvolutional filters. We used the 4 variables/qubit encoding method shown in Fig. 2(c) to encode 4x4 pixels into a 4-qubit state. The input pixels were scaled to 0-2π (originally 0-1). The PQC architecture in Fig. 2(d) was used with 3 parametric layers (3x2x4 parameters). In Quanvolution with trainable filters, these PQC parameters were trained alongside the other network parameters using gradient descent. For the non-trainable filter, we set the PQC parameters randomly (-π to π) at the beginning and kept them constant throughout the training. Pauli-Z expectation values of the qubits were used as the output features. The results are tabulated in Table I (performance after 10 training epochs). Each training epoch took ≈195 seconds with the trainable filter compared to ≈57 seconds with the non-trainable filter on a single Core i7-10750H machine with 16 GB RAM.

On average, Quanvolution with a trainable filter provided 15.98% lower training loss, 7.49% lower validation loss, 3.46% higher training accuracy, and 3.32% higher validation accuracy after 10 training epochs. In some cases, Quanvolution with a non-trainable filter performed at a similar level as its trainable counterpart. For example, MNIST_179 and MNIST_358 provided similar performance in both these approaches. However, in all other cases, there was a noticeable performance gap between these two approaches. We repeated the experiments 5 times with the MNIST_179 and MNIST_358 datasets with different random initializations. However, the performance remained at similar levels in both these approaches. In fact, both these models performed poorly on these two datasets compared to the others (average training loss of 0.535 against 0.266 overall with the trainable filter). The overall results indicate potential benefits of trainable filters, which can be worthwhile to explore in the future.

CAE + QNN: We trained the CAE in Fig. 4 with 60,000 training samples of the MNIST and Fashion-MNIST datasets with latent dimensions of 5 and 10 (optimizer: Adam, learning rate: 0.001, weight decay: e−5, epochs: 30, batch size: 50). The extracted datasets (5/10 features) were used to train a QNN. In the QNN, we used the 1 variable/qubit encoding method shown in Fig. 2(b). We used 5 and 10 qubits for the 5-feature and 10-feature datasets, respectively. The QNN shared the same PQC architecture and output measurements as the quantum filters (Fig. 2(d)). We restricted the parametric layers to 3 in the 10-qubit model (3x2x10 parameters). To match the number of trainable circuit parameters, we restricted the parametric layers to 6 in the 5-qubit models (6x2x5 parameters). The results are tabulated in Table II (performance after 20 epochs of training). All these models were trainable as evident
TABLE I
QUANVOLUTIONAL NEURAL NETWORK PERFORMANCE AFTER 10 EPOCHS OF TRAINING (OPTIMIZER: ADAGRAD, LEARNING RATE: 0.5)

              -------- Non-trainable filter --------    --------- Trainable filter ----------
Datasets      Train      Val       Train      Val       Train      Val       Train      Val
              Loss       Loss      Acc        Acc       Loss       Loss      Acc        Acc
MNIST_179     0.3717     0.5537    0.8416     0.7830    0.3881     0.5338    0.8500     0.7733
MNIST_246     0.3562     0.4607    0.8733     0.8333    0.2684     0.4879    0.9100     0.8583
MNIST_358     0.6577     0.8453    0.6916     0.6350    0.6825     0.9452    0.7083     0.6300
Fashion_012   0.1149     0.4213    0.9533     0.8833    0.0560     0.3813    0.9783     0.9200
Fashion_345   0.1416     0.2788    0.9433     0.8983    0.0827     0.1670    0.9666     0.9366
Fashion_678   0.2626     0.4434    0.8950     0.8450    0.1226     0.2631    0.9650     0.9216
TABLE II
CAE + QNN NETWORK PERFORMANCE AFTER 20 EPOCHS OF TRAINING OF THE QNN (OPTIMIZER: SGD, LEARNING RATE: 0.5)

              --- Latent dim = 5 (5-qubit QNN) ----    --- Latent dim = 10 (10-qubit QNN) --
Datasets      Train      Val       Train      Val       Train      Val       Train      Val
              Loss       Loss      Acc        Acc       Loss       Loss      Acc        Acc
MNIST_179     0.1479     0.1676    0.9633     0.9533    0.1385     0.1120    0.9633     0.9683
MNIST_246     0.0574     0.0793    0.9900     0.9883    0.0676     0.0791    0.9833     0.9816
MNIST_358     0.3630     0.3281    0.8917     0.9183    0.2000     0.1866    0.9350     0.9383
Fashion_012   0.3620     0.2888    0.9083     0.9200    0.2154     0.1758    0.9266     0.9483
Fashion_345   0.3876     0.2084    0.8500     0.7533    0.1793     0.1541    0.9300     0.9466
Fashion_678   0.2468     0.1962    0.9233     0.9483    0.2968     0.2662    0.9033     0.9550
TABLE III
CAE + QNN AND PCA + QNN NETWORK PERFORMANCE AFTER 20 EPOCHS OF TRAINING ON MNIST DATASET (4000 SAMPLES, 10 CLASSES)

Approach         Training Loss   Validation Loss   Training Accuracy   Validation Accuracy
PCA(10) + QNN    0.6496          0.6724            0.7875              0.7175
CAE(10) + QNN    0.3336          0.3463            0.8965              0.8980