
qecGPT: decoding Quantum Error-correcting Codes with Generative Pre-trained Transformers

Hanyan Cao,1,2 Feng Pan,1 Yijia Wang,1,2 and Pan Zhang1,3,4,∗

1 CAS Key Laboratory for Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
2 School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China
4 International Centre for Theoretical Physics Asia-Pacific, Beijing/Hangzhou, China

arXiv:2307.09025v1 [quant-ph] 18 Jul 2023
We propose a general framework for decoding quantum error-correcting codes with generative modeling. The model utilizes autoregressive neural networks, specifically Transformers, to learn the joint probability of logical operators and syndromes. The training is unsupervised, without the need for labeled training data, and is thus referred to as pre-training. After the pre-training, the model can efficiently compute the likelihood of logical operators for any given syndrome, enabling maximum likelihood decoding. It can directly generate the most likely logical operators with computational complexity O(2k) in the number of logical qubits k, which is significantly better than conventional maximum likelihood decoding algorithms that require O(4^k) computation. Based on the pre-trained model, we further propose a refinement that estimates the likelihood of logical operators for a given syndrome more accurately by directly sampling the stabilizer operators. We perform numerical experiments on stabilizer codes with small code distances, using both depolarizing error models and error models with correlated noise. The results show that our approach provides significantly better decoding accuracy than the minimum weight perfect matching and belief-propagation-based algorithms. Our framework is general and can be applied to any error model and to quantum codes with different topologies, such as surface codes and quantum LDPC codes. Furthermore, it leverages the parallelization capabilities of GPUs, enabling simultaneous decoding of a large number of syndromes. Our approach sheds light on the efficient and accurate decoding of quantum error-correcting codes using generative artificial intelligence and modern computational power.

Quantum computers can potentially solve practical problems which are intractable for classical computers. However, current implementations of quantum computers suffer from noise, which limits their power. An essential step towards fault-tolerant quantum computing is quantum error correction (QEC), which has become one of the key research frontiers in both theoretical studies and hardware developments [1, 2] of quantum computation [3]. In QEC, logical states with k logical qubits are encoded using n physical qubits with redundancy. The effects of continuous errors can be digitalized into a finite set of discrete errors, which can be detected by measuring the redundant ancilla qubits, giving an error syndrome. A decoding algorithm then infers the information of the errors based on the syndrome and determines an appropriate operation to correct the logical error. However, decoding is a hard problem: for classical error-correcting codes, for example, it belongs to the class of #P-hard problems. In quantum codes, decoding is considered to be even more challenging than in classical codes, because the errors are inherently degenerate and the corresponding factor graph of the code is more complex; for example, in CSS codes the factor graph always contains loops of various sizes due to the commutation relations, so standard decoding algorithms such as belief propagation do not work as well as in classical low-density parity check (LDPC) codes.

While a number of algorithms have been proposed for decoding quantum error-correcting codes, we lack general decoding algorithms that are efficient and accurate. The minimum weight perfect matching algorithm [4, 5] can decode the surface code efficiently; however, since a minimum-weight decoder ignores the degeneracy of quantum codes, its performance in principle has a gap to the theoretical limit. Moreover, it is less efficient on non-planar graphs and is challenging to apply to codes on hypergraphs, where the distances between two nodes are not well defined. As a prototype of the maximum likelihood decoder (MLD), tensor network methods (e.g. the boundary matrix product state method [6]) consider the degeneracies of quantum codes and work close to the theoretical limit for the surface code. However, for general codes not defined on lattices with open boundaries, the tensor network contractions are difficult to apply due to the large treewidth of the graph. Another issue for existing maximum-likelihood decoders is computing probabilities for the 4^k logical operators of k logical qubits, which is intractable for large k. Moreover, the contraction of tensor networks for each syndrome consumes significantly more computational resources than the minimum-weight decoding algorithms and is hence less efficient. Recently, a number of neural network decoders have been proposed to leverage the fast inference of neural networks on modern GPUs [7–14].
These methods are based on supervised learning, meaning that training the neural network model requires a large dataset prepared in advance, with labels computed using another, teacher decoding algorithm. Both the dataset size and the accuracy of the teacher decoding algorithm limit the performance of the supervised neural network decoders.

In this work, we propose a maximum likelihood decoding approach based on unsupervised generative modeling in machine learning. The proposed algorithm enjoys efficient decoding using the fast inference of autoregressive neural networks, especially on GPUs; the training directly uses the error model and thus does not require preparing labeled data. The autoregressive neural networks can be applied to quantum codes with arbitrary topology, e.g. general quantum low-density parity check (QLDPC) codes, and they support directly generating logical operators for an arbitrary number of logical qubits k, reducing the computational complexity from O(4^k) (in computing probabilities of all logical operators in conventional maximum-likelihood decoding) to O(2k). In the following we first introduce how to link the decoding of stabilizer codes to generative modeling, then introduce qecGPT, the pre-trained version of our approach using a specific autoregressive model, the causal transformer, and finally introduce the refinement of decoding accuracy based on the pre-trained model.

Maximum likelihood decoding— Consider a [[n, k, d]] quantum error correction code where a logical state |ϕ⟩ with k logical qubits is encoded using a code word |ψ⟩ with n physical qubits. The minimum distance between the code words is d. When an error occurs on the state |ψ⟩, it is considered as the effect of applying an error operator E belonging to the Pauli group P_n = ±i{I, X, Y, Z}^⊗n. In the stabilizer formalism [15, 16], encoded states are stabilized by some operators {S}, i.e. S|ψ⟩ = |ψ⟩. The operators form a stabilizer group S = ⟨g_1, g_2, · · · , g_m⟩, which is an Abelian sub-group of P_n, and is generated by m = n − k independent generators.

When an error E occurs, the encoded state |ψ⟩ may not be stabilized by the stabilizers anymore. This can be tested by measuring the ancilla qubits corresponding to the stabilizer generators, yielding the syndrome γ(E) = {γ_1(E), γ_2(E), · · · , γ_m(E)}, with γ_i(E) = 0 if g_i and E commute and γ_i(E) = 1 if they anti-commute. In other words, if the syndrome is not trivial, the error E must anti-commute with some of the stabilizer generators. If the syndrome is trivial, the error E commutes with all m stabilizer generators, and then E is either an element of the stabilizer group or belongs to the logical operators, which are generated by L = ⟨l_1^x, l_1^z, l_2^x, l_2^z, · · · , l_k^x, l_k^z⟩. Here l_i^x and l_i^z denote the logical X and logical Z operators of the i-th logical qubit, respectively. In addition to the 2^m stabilizer operators and 4^k logical operators, there are still 2^m operators that do not commute with the stabilizer generators; they belong to the pure error subgroup E, which is Abelian and satisfies the commutation relation e_i g_j = (−1)^{δ_ij} g_j e_i. The three subgroups introduced above reveal a structure of the Pauli group, given by the decomposition P_n = E ⊗ L ⊗ S.

Based on the {E, L, S} decomposition, we can map an error E to a configuration (α, β, γ), and vice versa. Here α ∈ {0, 1}^m is the configuration of the m stabilizer generators, and each value α_i is determined by the commutation relation between E and the pure error generator e_i; β ∈ {0, 1}^{2k} denotes the configuration of the logical X and logical Z operators; γ ∈ {0, 1}^m is the configuration of the m pure error generators, and each value γ_i is determined by the commutation relation between E and the stabilizer generators. Using this mapping, we can see the degeneracy of errors: given the logical configuration β, there are 2^m assignments of α that give the same syndrome. So computing the likelihood of a logical operator configuration β needs to consider all α configurations, with

p(β, γ) = Σ_α p(α, β, γ).    (1)

In this sense, we consider the total probability of a coset of the stabilizer sub-group S, rather than the probability of a single error.

However, there are several challenges for maximum likelihood decoding. The first one is that computing the coset probability is a #P problem: no general exact algorithm exists, and approximate algorithms, e.g. tensor network contractions, are usually time-consuming. The second challenge is that one needs to repeat the computation (summing over all α configurations) for each syndrome. The third challenge is the exponential computational complexity in the number of logical qubits k, because conventionally one needs to enumerate 4^k logical operators, compute their coset probabilities, and find the one with the largest probability.

Generative maximum likelihood decoding— We propose to solve the challenges of maximum likelihood decoding using a framework based on generative modeling. First we approximate the joint distribution p(α, β, γ) using a parameterized variational distribution qθ(α, β, γ) satisfying

qθ(α, β, γ) = q(α|β, γ) q(β|γ) q(γ).    (2)

Here q(α|β, γ) is the conditional probability distribution of the stabilizer configuration α given the logical operator β and the syndrome γ, and q(β|γ) is the conditional probability of β given the syndrome. θ denotes the parameters of the variational distribution qθ(α, β, γ). By learning θ, we make the variational distribution close to the true joint distribution p(α, β, γ) given by the error model, and make the variational conditional distribution q(β|γ) close to the true conditional distribution p(β|γ), which is intractable in general.
With an accurate estimate of the conditional probabilities, we can evaluate the likelihood of logical operators for all syndromes, and generate a configuration of logical operators by sampling q(β|γ). In other words, the learned variational joint distribution satisfies the condition that the configuration of the stabilizer generators α can be traced out automatically.

We further ask that the conditional probabilities satisfy the autoregressive property for each variable, i.e. q(β|γ) = q(β_1|γ) q(β_2|β_1, γ) · · · q(β_{2k}|β_{2k−1}, · · · , β_1, γ). In this way, all 2k logical variables can be generated one by one following the conditional probabilities. This is known as ancestral sampling, which is an unbiased sampling from the variational conditional distribution q(β|γ) [17]. A pictorial representation of the generative modeling is illustrated in Fig. 1, where we can see that all the variables are assigned an order and each variable only relies on the variables prior to it; i.e., the conditional probability of the configuration of a variable s_i is a function of the configurations of the variables before it, {s_1, s_2, · · · , s_{i−1}} = s_{<i}, with q(s_i|s_{<i}). This property of the parameterization is known as the autoregressive property, also known as the causal property if we regard the variables before a variable as its "history" and the variables behind it as its "future". Many neural network models satisfy this property and are known as autoregressive neural networks, especially models for natural language where the words are generated one by one.

[Figure 1: the decoding pipeline, with the syndrome fed through an embedding layer, position encoding, masked self-attention layers, a linear layer, and a sigmoid output; the legend distinguishes pure-error, logical-operator, and stabilizer variables.]

FIG. 1. Illustration of the structure of qecGPT.

The Generative Pre-trained Transformers— In this work, to parameterize qθ(α, β, γ) we adopt the Transformer, one of the most powerful autoregressive neural networks [18], which has been used in many applications including chatGPT [19–21]. We use the decoder layer of the Transformer, composed of an embedding layer and a positional encoding layer which map the input configuration to a higher-dimensional feature space, attention layers with a triangular mask (to ensure the autoregressive property), and a linear layer with a sigmoid output layer that outputs the joint probability distribution. The details of the Transformer can be found in the appendix.

The parameters are learned to minimize the distance between the true distribution p(α, β, γ) and the variational distribution parameterized by the Transformer. In this work, we assume that we have samples of the noise model p(E), and we choose the forward Kullback-Leibler divergence as the distance measure between the two probability distributions,

D_KL(p|q) = Σ_{α,β,γ} p(α, β, γ) log [p(α, β, γ) / qθ(α, β, γ)].

This yields a negative log-likelihood loss function,

θ̂ = argmin_θ D_KL = argmin_θ [ − Σ_{α,β,γ∼p} log qθ(α, β, γ) ],

and the parameters are updated using a gradient-based optimizer. After training, given a syndrome γ, we can generate a configuration of the logical operator β = {β_1, β_2, · · · , β_{2k}} one variable at a time using the learned conditional probabilities,

β̂_i = argmax_{β_i} q(β_i|β_1, β_2, · · · , β_{i−1}, γ_1, γ_2, · · · , γ_m).

So a logical configuration is generated variable-by-variable given a syndrome. This is analogous to the generation of text by chatGPT [19–21], where the text is generated word-by-word given a prompt. Notice that by training once, a single variational distribution gives the conditional probabilities for all 2^m syndromes, maximizing the likelihood of all syndromes; we therefore term this step pre-training. The advantage of the pre-training is that the conditional probability for any syndrome can be computed efficiently using a single pass of the neural network. We can further optimize the accuracy of the conditional probability, based on the pre-trained model, for a particular given syndrome, which we term refinement. There could be several approaches to the refinement; for example, we could minimize D_KL(qθ(α, β|γ)|p(α, β|γ)) using e.g. the method of variational autoregressive networks. In this work, we propose a straightforward way to do the refinement for a small number of logical qubits k, taking advantage of the generative modeling. In addition to generating β configurations, we can also efficiently generate the stabilizer configurations α using q(α, β, γ), and use them to evaluate an unbiased estimate of the joint probability of β and γ,

p(β, γ) = Σ_α qθ(α, β, γ) [p(α, β, γ) / qθ(α, β, γ)] ≈ (1/N) Σ_{α∼q} [p(α, β, γ) / qθ(α, β, γ)].
[Figure 2: logical error rate versus physical error rate for the three codes described in the caption, comparing Exact MLD, MWPM, BPOSD, qecGPT, and qecGPT+refinement; the insets show the difference between the approximate algorithms and the exact MLD.]

FIG. 2. Logical error rates of our algorithm (qecGPT and qecGPT+refinement) at different physical error rates, compared with the MWPM and BPOSD algorithms on the [[13, 1, 3]] surface code (left), the [[41, 1, 5]] surface code (middle), and the [[12, 1, 2]] 3D surface code (right). The error model is the depolarizing model. Each data point in the figures is averaged over 10000 error instances. The black lines are the optimal maximum-likelihood decoding algorithm, which exactly sums all stabilizer configurations. The insets show the difference between the approximate algorithms and the exact algorithm.

Here we use the samples of the variational distribution and the reweighting to compute an unbiased estimate of the joint distribution; N is the number of samples.

An advantage of our approach is its insensitivity to the topology of the code, i.e. the connectivity of the stabilizer generators, or in other words, the structure of the parity check matrix. The transformer representation of the variational distribution can be used for any code topology without modifying the structure of the transformer, thanks to the self-attention mechanism, which can automatically capture correlations among variables. Our approach is also insensitive to the true parameters used in the error model. For example, under the depolarizing noise model, we can train qecGPT at a particular physical error rate smaller than the threshold and use the model to decode errors generated by the depolarizing noise model at distinct error rates, without significantly increasing the logical error rate. This effect is quite common in inference problems with mismatched parameters, e.g. in [22]. We refer to the Appendices for details.

Numerical experiments— We evaluate our algorithm by comparing its logical error rate to the minimum weight perfect matching (MWPM) [5] and belief propagation augmented by ordered statistics decoding (BPOSD) [23] algorithms on the surface code. In Fig. 2 the surface codes have k = 1 logical qubit. In Fig. 2 (left) the distance d = 3 is small, and we see that the pre-trained model qecGPT performs very close to the exact maximum likelihood decoding, which computes the exact likelihood of all 4^k logical operators using exact tensor network contractions. In Fig. 2 (middle), the system is larger, with d = 5, and we see that the pre-trained model gives slightly worse results than the exact algorithm while still doing much better than MWPM and BPOSD. We also see that the refinement significantly improves the performance of qecGPT, making it very close to the optimal MLD decoder.

To demonstrate the generality of our approach with different code topologies, we also test the algorithm on a stabilizer code on a 3-dimensional lattice, usually termed the 3D surface code [24]. Note that for the 3D surface code the MWPM algorithm does not apply directly, so we only compare the logical error rates of qecGPT with BPOSD. We can see from Fig. 2 that the logical error rate of qecGPT coincides very well with the exact MLD algorithm and significantly outperforms BPOSD. Additional numerical results and performance comparisons with k > 1 logical qubits and under a noise model with correlated noise can be found in the Appendices.

Discussions— We have introduced a general framework for decoding quantum error correction codes with generative modeling. Our method approximates the joint distribution of errors using variational autoregressive neural networks. We propose a pre-trained model for the fast generation of the maximum-likelihood logical operators and a refinement to increase the accuracy for a given syndrome. The advantage of our generative modeling is that it solves the difficulties of maximum likelihood decoding in summing over the 2^m stabilizer configurations and in computing the probabilities of the 4^k logical configurations. Another advantage is its generality in code topologies; e.g. it can be applied to 2D codes and QLDPC codes without modifying the model or the algorithm. In our work, we have successfully trained qecGPT using a single GPU and conducted experiments on small codes with distances up to d = 7. Although the decoding process is fast and efficient, the training phase is slow and poses a challenge when it comes to applying it to a larger code.
However, we believe that this bottleneck can be resolved by using a larger model and exploiting more computational resources, such as multiple GPUs or even a supercomputer. This approach is similar to the behavior of chatGPT, which has shown remarkable performance when trained with a large amount of data and computational power, as reported in [19, 21]. We intend to explore this avenue in the future and see how it can further improve qecGPT's efficiency and scalability.

A Python implementation and a Jupyter Notebook tutorial of our algorithm are available at [25]. We thank Weilei Zeng, Lingling Lao, and Ying Li for their helpful discussions and Michael Vasmer for providing 3D surface code data.

∗ [email protected]
[1] P. Panteleev and G. Kalachev, Quantum LDPC codes with almost linear minimum distance, IEEE Transactions on Information Theory 68, 213 (2022).
[2] P. Panteleev and G. Kalachev, Asymptotically good quantum and locally testable classical LDPC codes, in Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022 (Association for Computing Machinery, New York, NY, USA, 2022) pp. 375–388.
[3] Google Quantum AI, Suppressing quantum errors by scaling a surface code logical qubit, Nature 614, 676 (2023).
[4] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, Topological quantum memory, Tech. Rep. 9 (2002).
[5] O. Higgott, PyMatching: A Python package for decoding quantum codes with minimum-weight perfect matching (2021), arXiv:2105.13082 [quant-ph].
[6] S. Bravyi, M. Suchara, and A. Vargo, Efficient algorithms for maximum likelihood decoding in the surface code, Physical Review A 90, 032326 (2014).
[7] G. Torlai and R. G. Melko, Neural decoder for topological codes, Phys. Rev. Lett. 119, 030501 (2017).
[8] S. Varsamopoulos, B. Criger, and K. Bertels, Decoding small surface codes with feedforward neural networks, Quantum Science and Technology 3 (2017).
[9] S. Krastanov and L. Jiang, Deep neural network probabilistic decoder for stabilizer codes, Scientific Reports 7 (2017).
[10] S. Varsamopoulos, K. Bertels, and C. G. Almudever, Comparing neural network based decoders for the surface code, IEEE Transactions on Computers 69, 300 (2020).
[11] R. W. Overwater, M. Babaie, and F. Sebastiano, Neural-network decoders for quantum error correction using surface codes: A space exploration of the hardware cost-performance tradeoffs, IEEE Transactions on Quantum Engineering 3, 1 (2022).
[12] P. Baireuther, T. E. O'Brien, B. Tarasinski, and C. W. Beenakker, Machine-learning-assisted correction of correlated qubit errors in a topological code, Quantum 2, 48 (2018).
[13] A. Davaasuren, Y. Suzuki, K. Fujii, and M. Koashi, General framework for constructing fast and near-optimal machine-learning-based decoder of the topological stabilizer codes, Physical Review Research 2 (2020).
[14] S. Gicev, L. C. Hollenberg, and M. Usman, A scalable and fast artificial neural network syndrome decoder for surface codes, arXiv preprint arXiv:2110.05854 (2021).
[15] D. Gottesman, Stabilizer codes and quantum error correction (California Institute of Technology, 1997).
[16] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information: 10th Anniversary Edition (Cambridge University Press, 2010).
[17] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer-Verlag, Berlin, Heidelberg, 2006).
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[19] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pre-training (OpenAI, 2018).
[20] https://openai.com/chatgpt.
[21] OpenAI, GPT-4 technical report (2023), arXiv:2303.08774 [cs.CL].
[22] P. Zhang and C. Moore, Scalable detection of statistically significant communities and hierarchies, using message passing for modularity, Proceedings of the National Academy of Sciences of the United States of America 111, 18144 (2014).
[23] J. Roffe, D. R. White, S. Burton, and E. Campbell, Decoding across the quantum low-density parity-check code landscape, Physical Review Research 2, 043423 (2020).
[24] M. Vasmer and D. E. Browne, Three-dimensional surface codes: Transversal gates and fault-tolerant architectures, Phys. Rev. A 100, 012312 (2019).
[25] https://github.com/CHY-i/qecGPT.
[26] V. Kolmogorov, Blossom V: A new implementation of a minimum cost perfect matching algorithm, Mathematical Programming Computation 1, 43 (2009).
[27] R. Fakoor, P. Chaudhari, J. Mueller, and A. J. Smola, TraDE: Transformers for density estimation (2020), arXiv:2004.02441 [cs.LG].
[28] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8, 229 (1992).
[29] D. Wu, L. Wang, and P. Zhang, Solving statistical mechanics using variational autoregressive networks, Phys. Rev. Lett. 122, 080602 (2019).
[30] H. Bombin, R. S. Andrist, M. Ohzeki, H. G. Katzgraber, and M. A. Martin-Delgado, Strong resilience of topological codes to depolarization, Physical Review X 2, 021004 (2012).
[31] F. Battistel, C. Chamberland, K. Johar, R. W. Overwater, F. Sebastiano, L. Skoric, Y. Ueno, and M. Usman, Real-time decoding for fault-tolerant quantum computing: Progress, challenges and outlook, arXiv preprint arXiv:2303.00054 (2023).
[32] H. Bombin and M. A. Martin-Delgado, Optimal resources for topological two-dimensional stabilizer codes: Comparative study, Physical Review A 76 (2007).

Stabilizer codes

The stabilizer code [15] is a very important class of quantum error-correcting codes. Here we first describe stabilizer codes and then introduce the decoding algorithms. Consider a [n, k, d] stabilizer code, where the states of k logical qubits are encoded into the states of n physical qubits. The states |ϕ⟩ of the n physical qubits form a 2^n-dimensional Hilbert space H_n; the encoded states |ψ⟩ of the k logical qubits form a subspace of H_n and can be represented as superpositions of |ϕ⟩. Bit-flip X and phase-flip Z errors may occur on a single qubit. For an n-qubit state |ψ⟩ ∈ H_n, all errors form the Pauli group P_n = P^⊗n. An element E ∈ P_n acting on |ψ⟩ may produce an error state |ψ′⟩. Quantum error correction aims to find a recovery operator E′ that corrects the state, E′|ψ′⟩ = |ψ⟩. A straightforward idea is to find which error E has occurred and set E′ = E, thanks to the self-inverse property of Pauli operators. However, a special encoding allows us to find only a collection of operators, any of which can recover the error. Such an encoding is called a stabilizer code because the construction is based on a subgroup S of P_n called the stabilizer group. This group satisfies the following properties:
(a) S is an Abelian group.
(b) −I ∉ S.
Then the encoding states can be chosen as follows:

{|ψ⟩ ∈ H_n | S|ψ⟩ = |ψ⟩, ∀S ∈ S}.    (3)

According to the properties of S, there are 2^m elements in S, where m denotes the number of generators ⟨g_1, · · · , g_m⟩. If an error E ∈ P_n occurs, one may observe a length-m vector γ(E), called the error syndrome,

γ(E)_i = 0 if [g_i, E] = 0,   γ(E)_i = 1 if {g_i, E} = 0.    (4)

The stabilizers are usually described by a parity check matrix H of size m × 2n, where m = n − k denotes the number of stabilizer generators.

F2 representation— The F2 representation is an isomorphism between the Pauli group (modulo overall phases) and the binary vector space F_2^{2n}. Under the F2 representation, single-qubit Pauli operators are represented by two binary numbers:

I → 00,   X → 01,   Z → 10,   Y → 11.    (5)

In this way, any n-qubit Pauli operator is represented by a binary vector of length 2n [15]. The group multiplication is just addition (mod 2) between vectors, and the commutation relation between two operators A and B can be represented by

A · Λ · B^T = 0 if [A, B] = 0,   A · Λ · B^T = 1 if {A, B} = 0.    (6)

In Eq. (6) the bold letters indicate the F2 representations of the operators, and the dot symbol · denotes matrix multiplication. The matrix Λ is the 2n × 2n block matrix

Λ = ( 0 I ; I 0 ),

where I is the n × n identity. Further, the generators of the stabilizer group form an m × 2n matrix H named the parity check matrix.
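
As a concrete illustration of Eqs. (5) and (6), the following minimal Python/NumPy snippet (our own sketch, not taken from the reference implementation [25]) builds Λ and checks commutation for single-qubit operators:

import numpy as np

def symplectic_form(n):
    """The 2n x 2n matrix Lambda of Eq. (6)."""
    I, O = np.eye(n, dtype=int), np.zeros((n, n), dtype=int)
    return np.block([[O, I], [I, O]])

def commutes(A, B, Lam):
    """Return 0 if the Pauli operators A and B (F2 representation) commute, 1 if they anticommute."""
    return int(A @ Lam @ B) % 2

# single-qubit example with the encoding of Eq. (5): X = 01, Z = 10
Lam = symplectic_form(1)
X, Z = np.array([0, 1]), np.array([1, 0])
assert commutes(X, Z, Lam) == 1   # X and Z anticommute
assert commutes(X, X, Lam) == 0   # X commutes with itself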

We have another two subgroups of P_n. One is the pure error group E. The group E is an Abelian group with 2^m elements, and its generators ⟨e_1, · · · , e_m⟩ satisfy

e_i g_j = (−1)^{δ_ij} g_j e_i,   e_i e_j = e_j e_i.    (7)

All generators of the pure error group can be stored as another matrix M_E of size m × 2n, satisfying

H · M_E^T = I_{m×m},   H · M_L^T = 0.    (8)

M_E can be determined from a matrix D′ computed by Gaussian elimination on the matrix D = (H | I_{m×m}). D′ can be organized as D′ = (A|B), where A is a row echelon matrix, and each row of M_E can be solved from the equations

A · (M_E)_i^T = B_i.    (9)

Since the number of rows of the matrix H is less than the number of columns, there are some free variables; for simplicity, we fix these variables to 0. Moreover, we want the pure error generators to commute with each other, as required in Eq. (7),

M_E · Λ · M_E^T = 0.    (10)

This requires multiplying some of these operators by stabilizer generators, which is equivalent to adding the corresponding rows of H to rows of M_E. The pseudo-code for the whole process is given as follows:

Algorithm 1: Find Pure Errors M_E

Input: parity check matrix H
Output: M_E
  m × 2n = |H|
  D = (H | I_{m×m})
  D′ = (A|B) = GE(D), where A is a row echelon matrix
  M_E ← each row of M_E is solved from Σ_j A_kj (M_E^T)_ji = B_ki; all free variables are set to 0
  for i ∈ [1, m] do
      for j > i do
          s_ij = (M_E)_i · Λ · (M_E)_j^T
          if s_ij ≠ 0 then
              (M_E)_i = (M_E)_i + H_j
          end if
      end for
  end for
  return M_E
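
A minimal NumPy sketch of Algorithm 1 follows. It is an illustration under stated assumptions, not the authors' implementation [25]: it assumes H has full rank m and follows the conventions of Eqs. (8)-(9).

import numpy as np

def gf2_row_reduce(M):
    """Gauss-Jordan elimination over GF(2); returns the reduced matrix and the pivot columns."""
    M = M.copy() % 2
    rows, cols = M.shape
    pivots, r = [], 0
    for c in range(cols):
        if r >= rows:
            break
        nz = np.nonzero(M[r:, c])[0]
        if nz.size == 0:
            continue
        M[[r, r + nz[0]]] = M[[r + nz[0], r]]      # swap a pivot row into place
        for rr in range(rows):                     # eliminate this column everywhere else
            if rr != r and M[rr, c]:
                M[rr] ^= M[r]
        pivots.append(c)
        r += 1
    return M, pivots

def find_pure_errors(H, Lam):
    """Algorithm 1: build the pure-error generator matrix M_E from the parity check matrix H."""
    m, two_n = H.shape
    D = np.hstack([H % 2, np.eye(m, dtype=int)])
    Dp, pivots = gf2_row_reduce(D)
    B = Dp[:, two_n:]
    ME = np.zeros((m, two_n), dtype=int)
    for i in range(m):                             # solve A x = B[:, i]; free variables set to 0
        for r, c in enumerate(pivots):
            ME[i, c] = B[r, i]
    for i in range(m):                             # enforce mutual commutation, Eq. (10)
        for j in range(i + 1, m):
            if (ME[i] @ Lam @ ME[j]) % 2:
                ME[i] = (ME[i] + H[j]) % 2
    return ME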

In addition to the stabilizer group and the pure error group, another subgroup is the logical-operator group L. It represents the logical errors of the logical qubits, is a non-Abelian group, and is generated by ⟨l_x1, l_z1, · · · , l_xk, l_zk⟩ satisfying

l_(x/z)i g_j = g_j l_(x/z)i,
l_(x/z)i e_j = e_j l_(x/z)i,    (11)
l_xi l_zj = (−1)^{δ_ij} l_zj l_xi.

This group has 4 × 2^{2k} elements, where the constant 4 comes from the overall phase {±1, ±i}; however, during the actual error correction process the overall phase is always ignored. The matrix of generators of the logical subgroup, M_L, can be determined given H and M_E. As defined in Eq. (11), M_L is the kernel of the matrix M = (H ; M_E) obtained by stacking H on top of M_E. Here M is a matrix of size 2m × 2n; Gaussian elimination of M gives M′, and there are 2k free variables. M_L is actually a set of basis vectors of the kernel space, and every two rows of M_L must be linearly independent. Therefore we choose the free variables as a one-hot vector (0, · · · , 1_i, · · · , 0) for the i-th row of M_L. To satisfy the condition of Eq. (11), we can perform the symplectic Gram-Schmidt orthogonalization procedure (SGSOP) on M_L, finding k pairs L_x and L_z. The pseudo-code is described in Algorithm 2.

Algorithm 2: Find Logical Operators M_L

Input: H, M_E
Output: M_L
  M = (H ; M_E)
  |M| = 2m × 2n
  M′ = GE(M), where M′ is a row echelon matrix
  M_L ← each row of M_L is solved from the equation M′ · (M_L^T)_i = 0; the free variables are set to (0_1, · · · , 1_i, · · · , 0_2k) for the i-th row of M_L
  for i ∈ [1, k] do
      for j ∈ [i + 1, 2k] do
          if (M_L)_i · Λ · (M_L)_j^T ≠ 0 then
              (M_L)_{i+1} ↔ (M_L)_j
              break
          end if
      end for
  end for
  for i ∈ [1, k] do
      for j ∈ [i + 2, 2k] do
          if (M_L)_i · Λ · (M_L)_j^T ≠ 0 then
              (M_L)_j = (M_L)_j + (M_L)_{i+1}
          else if (M_L)_{i+1} · Λ · (M_L)_j^T ≠ 0 then
              (M_L)_j = (M_L)_j + (M_L)_i
          end if
      end for
  end for
  return M_L
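
For completeness, here is a sketch of the kernel-basis step that provides the starting rows of M_L (the free variables chosen as one-hot vectors); it reuses gf2_row_reduce from the sketch after Algorithm 1, and the subsequent symplectic pairing follows Algorithm 2 directly and is omitted here.

import numpy as np

def gf2_kernel_basis(M):
    """Basis of the GF(2) null space of M: one one-hot assignment of the free variables per
    basis vector, with the pivot variables filled in by back-substitution."""
    R, pivots = gf2_row_reduce(M % 2)
    cols = M.shape[1]
    free = [c for c in range(cols) if c not in pivots]
    basis = []
    for f in free:
        v = np.zeros(cols, dtype=int)
        v[f] = 1
        for r, c in enumerate(pivots):
            v[c] = R[r, f]        # ensures R @ v = 0 (mod 2), hence M @ v = 0 (mod 2)
        basis.append(v)
    return np.array(basis)

# the 2k initial rows of M_L span the kernel of M = (H ; M_E), e.g.
# ML0 = gf2_kernel_basis(np.vstack([H, ME]))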

This is a general algorithm to find the logical operators for a given H. Note that one cannot distinguish logical X from logical Z operators through this algorithm alone. However, for CSS codes, whose parity check matrix can always be written as H = (H_z | H_x), the H_x and H_z parts can be treated separately.

The {ELS} decomposition

Note that under the F2 representation, any two operators are commutative with each other, and the information about anticommutation is stored in a special class of inner products, Eq. (6). This allows us to use a more efficient way to represent Pauli operators: under the F2 representation, the Pauli group becomes a self-inverse Abelian group. Thus any error operator can be generated by the generators ⟨e_1, · · · , e_m, l_x1, l_z1, · · · , l_xk, l_zk, g_1, · · · , g_m⟩ and their powers (γ, β, α),

E = ∏_{i,j,k} e_i^{γ_i} × l_j^{β_j} × g_k^{α_k},    (12)

with α = {α_k} ∈ {0, 1}^m, β = {β_j} ∈ {0, 1}^{2k}, γ = {γ_i} ∈ {0, 1}^m, and E ∈ P_n. This means that there is a correspondence between an error operator E and an (α, β, γ) configuration,

E ⟺ (α, β, γ).    (13)

We term the power configuration (γ, β, α) the ELS configuration of an operator; it forms a binary vector {α, β, γ} of length 2n. Given an ELS configuration, one only needs a series of vector additions (under the F2 representation) to generate the corresponding operator. Conversely, given an error operator, the corresponding ELS configuration can be determined using Eq. (6) and

α(E)_i = 0 if [e_i, E] = 0,   α(E)_i = 1 if {e_i, E} = 0,    (14)

β(E)_(x/z)i = 0 if [l_(z/x)i, E] = 0,   β(E)_(x/z)i = 1 if {l_(z/x)i, E} = 0,    (15)

γ(E)_i = 0 if [g_i, E] = 0,   γ(E)_i = 1 if {g_i, E} = 0.    (16)
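
The mapping of Eqs. (14)-(16) amounts to a handful of GF(2) matrix-vector products. A sketch follows, assuming H, M_E, and M_L store the F2 representations of the generators row-wise and that commutation is evaluated with the explicit symplectic form Λ of Eq. (6); if the generator matrices are stored in an already symplectically twisted convention, the Λ factor is absorbed into them.

import numpy as np

def els_configuration(E, H, ME, ML, Lam):
    """Map an error E (binary vector of length 2n, F2 representation) to its ELS configuration."""
    alpha = (ME @ Lam @ E) % 2   # Eq. (14): commutation with the pure-error generators
    beta  = (ML @ Lam @ E) % 2   # Eq. (15): commutation with the logical generators
    gamma = (H  @ Lam @ E) % 2   # Eq. (16): commutation with the stabilizer generators (the syndrome)
    return alpha, beta, gamma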

The minimum weight decoder

Decoding is to determine the recovery operator of the quantum error correction code given the syndrome. There are basically two kinds of decoding algorithms. The first kind is known as the minimum weight decoder: it determines the error with the maximum probability that satisfies all the syndrome constraints. In this sense, when a nontrivial syndrome γ has been measured, the minimum weight decoding algorithm finds an error operator Ê(γ) = argmax_E P(E(γ)), where P(E(γ)) is the generation probability in the error model of an error consistent with the syndrome. The most famous algorithm of this kind is the Minimum Weight Perfect Matching (MWPM) algorithm, which assigns a weight to each edge in the code graph using the probability of the error in the error model [5] and then finds the shortest error chain given the syndrome; this can be done by employing an efficient algorithm, e.g. the Blossom algorithm [26]. The MWPM algorithm can decode the surface code efficiently. However, it is challenging for the MWPM algorithm to decode when the code graph is a hypergraph, where each edge of the code graph links more than two nodes and the distance between two nodes is ill-defined [5]. The main limitation of the minimum weight decoding algorithm is that the error with the maximum probability may not be the right recovery operator, because of the degeneracy of the quantum code.

The maximum likelihood decoder

The second kind of decoding algorithm is known as the maximum likelihood decoder. Notice that in the ELS decomposition, the stabilizers and the logical operators do not modify the syndrome γ. This means that any element of the normalizer of the stabilizer group, N(S) = L ⊗ S, does not change the syndrome. One can imagine that, given an error that produces the syndrome, one can apply on top of the error any element from N(S) without modifying the syndrome. So in principle, instead of considering a single error that produces the syndrome, one should consider all possible errors that produce the syndrome, which form a coset of N(S). The maximum likelihood decoder determines a logical operator L by considering all operators in the same equivalence class C(L, γ) rather than a single error; equivalently, it sums the probabilities of the coset C(β(L), γ), where β(L) is the β configuration corresponding to the logical operator L. This respects the degeneracy of quantum codes and is the best decoding one could do. The coset probability can be computed by summing the probabilities of all elements of the stabilizer group combined with a particular syndrome and logical operator. The summation can be done by considering all possible configurations α ∈ {0, 1}^m,

L̂ = argmax_{L∈L} P(C(L, γ)) = argmax_{L∈L} Σ_{S∈S} P(E(γ) × L × S) = argmax_{L∈L} Σ_α P(e(γ) × L × (g_1^{α_1} × · · · × g_m^{α_m})),    (17)

where × denotes the multiplication of group elements and e(γ) is the pure error corresponding to the syndrome γ. Using the β-configuration representation of a logical operator, we have

β̂ = argmax_β Σ_α P(α, β, γ).    (18)

The computation of summing over all possible α configurations is analogous to the computation of the partition function of an Ising spin glass, but one needs to do the computation for all possible β configurations. This computation belongs to the class of #P problems, and there is no exact algorithm to solve it in general in polynomial time; the exact computation of the maximum-likelihood logical operator requires an exponential algorithm. For some special cases, e.g. codes on a 2D lattice such as the surface code, the summation can be approximately computed using tensor network contractions (e.g. with the boundary matrix product state method) [6] of a 2D tensor network constructed for a given syndrome, which can be time-consuming. In addition to the computational cost, the tensor network contraction method has several limitations. The first limitation is that it is difficult to generalize to codes on other topologies, such as a 3-dimensional lattice, or codes with long-range interactions as in qLDPC codes, due to the fast increase of the computational cost and the decrease in accuracy for topologies with large treewidth; the second limitation is that one needs to perform a tensor network contraction for each β configuration.
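
For very small codes, the coset sums of Eqs. (17)-(18) can also be evaluated by brute force, which is useful as a reference decoder. A sketch under the assumption that the error model supplies a function log_p returning log P(E) for an error vector E in the F2 representation:

import itertools
import numpy as np

def exact_mld(log_p, H, ME, ML, gamma, m, k):
    """Brute-force maximum likelihood decoding, Eq. (18): enumerate every beta and sum the
    error-model probability over all 2^m stabilizer configurations alpha.  Exponential in
    both m and k, so usable only for very small codes."""
    e_gamma = (gamma @ ME) % 2                       # pure error e(gamma), cf. Eq. (12)
    best_beta, best_logp = None, -np.inf
    for beta in itertools.product([0, 1], repeat=2 * k):
        logps = []
        for alpha in itertools.product([0, 1], repeat=m):
            # coset element e(gamma) x l^beta x g^alpha, i.e. a vector sum in the F2 representation
            E = (e_gamma + np.array(beta) @ ML + np.array(alpha) @ H) % 2
            logps.append(log_p(E))
        coset_logp = np.logaddexp.reduce(logps)      # log of the coset probability P(C(L, gamma))
        if coset_logp > best_logp:
            best_beta, best_logp = np.array(beta), coset_logp
    return best_beta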
[Figure 3: the decoding pipeline (syndrome → embedding → position encoding → masked self-attention layers → linear → sigmoid) together with the internal structure of the attention block (masked attention, scale, mask, softmax, matrix multiplications, and normalization); the legend distinguishes pure-error, logical-operator, and stabilizer variables.]
FIG. 3. A pictorial illustration of decoding of the quantum error correction code with a generative pretrained transformer.

The generative decoder with Transformers

As described in the main text, our generative decoder models the joint distribution P(α, β, γ) using autoregressive neural networks; it factorizes the joint distribution as a product of conditional distributions,

qθ(α, β, γ) = q(α|β, γ) q(β|γ) q(γ).    (19)

Note that changing the order of the α, β, γ variables does not change the joint probability distribution, due to the Abelian nature of the subgroups. In this work, we always put the logical variables β in the middle, as in Eq. (19). The benefit of this is that β is always determined prior to the stabilizer variables α in the joint distribution. In this way, decoding is performed by sampling the β variables from the marginal distribution

q(β, γ) = Σ_α qθ(α, β, γ).

Furthermore, we ask that the conditional probabilities be organized as p(β_i | β_j<i), which allows generating a β configuration, among all 4^k possible ones, variable by variable, since we have already stored all conditional probabilities for each β_i variable. In this way, we can reduce the computational complexity of generating a maximum-likelihood logical operator out of 4^k logical operators to O(2k).

The autoregressive model we use here for representing qθ(α, β, γ) is the encoder part of a Transformer [18, 27] with a mask to ensure the autoregressive property. It is also known as a causal transformer; the structure is shown in Fig. 1. The input of the Transformer is the configuration (α, β, γ). The embedding layer increases the dimension of the input to the model dimension, and positional information is added and learned through the position encoding. A triangular mask is added in the attention block before the Softmax layer to ensure that each conditional probability of variable i only depends on the variables before i in the input configuration. Multiple transformer encoder layers are stacked after the position encoding, and the final linear layer maps the data from the model dimension back to a vector of the same length as the input. The output is a vector (α̂, β̂, γ̂), which uses Sigmoid functions to represent the Bernoulli distributions of the conditional probabilities. For example, as illustrated in Fig. 1, we have

γ̂_1 = σ(F_γ1) = q(γ_1),
γ̂_2 = σ(F_γ2(γ_1)) = q(γ_2|γ_1),
β̂_1 = σ(F_β1(γ_1, γ_2)) = q(β_1|γ_1, γ_2),
β̂_2 = σ(F_β2(β_1, γ_1, γ_2)) = q(β_2|β_1, γ_1, γ_2),
α̂_1 = σ(F_α1(β_2, β_1, γ_1, γ_2)) = q(α_1|β_2, β_1, γ_1, γ_2),
α̂_2 = σ(F_α2(α_1, β_2, β_1, γ_1, γ_2)) = q(α_2|α_1, β_2, β_1, γ_1, γ_2),    (20)
where σ(·) is the Sigmoid function and F denotes the map of the neural network. We can see that the product of all the outputs of the transformer gives the joint distribution,

α̂_2 α̂_1 β̂_2 β̂_1 γ̂_2 γ̂_1 = q(α_2|α_1, β_2, β_1, γ_2, γ_1) q(α_1|β_2, β_1, γ_2, γ_1) q(β_2|β_1, γ_2, γ_1) q(β_1|γ_2, γ_1) q(γ_2|γ_1) q(γ_1)
                            = q(α_2, α_1, β_2, β_1, γ_2, γ_1).    (21)

Another important (but not obvious) property that we can obtain from the product of the outputs of the Transformer is the marginal distribution,

β̂_2 β̂_1 = q(β_2|β_1, γ_2, γ_1) q(β_1|γ_2, γ_1)
        = q(β_2, β_1|γ_2, γ_1),    (22)

which gives the normalized conditional probability of the logical variables given a syndrome and automatically traces out the stabilizer configurations α. Based on this conditional probability, one can evaluate the likelihood of the logical operators and perform the decoding after the neural network is well trained.
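
To make the construction concrete, here is a minimal PyTorch sketch of such a causal transformer over binary variables. It is our illustration rather than the authors' implementation [25]; the default layer sizes loosely follow Table I, and the right-shift with a virtual x_0 = 1 anticipates Eq. (29).

import torch
import torch.nn as nn

class CausalTransformer(nn.Module):
    """Autoregressive model over a binary configuration ordered as (gamma, beta, alpha);
    the i-th output is the Bernoulli probability q(x_i | x_<i)."""
    def __init__(self, seq_len, d_model=256, n_head=4, n_layer=2, d_ff=256):
        super().__init__()
        self.seq_len = seq_len
        self.embed = nn.Linear(1, d_model)                        # embedding of the binary input
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))    # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_head, d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layer)
        self.out = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, seq_len) in {0, 1}; shift right by one with a virtual x_0 = 1 so that
        # the i-th output depends only on x_1, ..., x_{i-1}
        x0 = torch.ones(x.shape[0], 1, device=x.device)
        shifted = torch.cat([x0, x[:, :-1]], dim=1).unsqueeze(-1)
        h = self.embed(shifted) + self.pos
        mask = torch.triu(torch.full((self.seq_len, self.seq_len), float('-inf'),
                                     device=x.device), diagonal=1)
        h = self.encoder(h, mask=mask)
        return torch.sigmoid(self.out(h)).squeeze(-1)             # (batch, seq_len) conditionals

    def log_prob(self, x):
        """log q_theta(x) as the sum of the log Bernoulli conditionals, as in Eq. (21)."""
        p = self.forward(x).clamp(1e-7, 1 - 1e-7)
        return (x * torch.log(p) + (1 - x) * torch.log(1 - p)).sum(dim=1)

Since the β outputs never attend to the α positions, evaluating the product of the β conditionals alone yields the conditional q(β|γ) of Eq. (22), whatever the α entries of the input are.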

Pre-training of the model

The training of the neural network is conducted by minimizing the distance between the error distribution (given by the error model) and the variational distribution qθ. Here we adopt the Kullback-Leibler divergence,

θ̂ = argmin_θ DKL[P(α, β, γ) || qθ(α, β, γ)].    (23)

We use the forward KL divergence because we always assume that we have N samples {α, β, γ} ∼ P(α, β, γ) of the errors, which can be obtained by sampling the error model or collected from experiments. Then the loss function can be evaluated as

θ̂ = argmin_θ DKL[P(α, β, γ) || qθ(α, β, γ)]
  = argmin_θ Σ_{α,β,γ} P(α, β, γ) [log P(α, β, γ) − log qθ(α, β, γ)]
  = argmin_θ (1/N) Σ_{{α,β,γ}∼P(α,β,γ)} [log P(α, β, γ) − log qθ(α, β, γ)]
  = argmin_θ − Σ_{{α,β,γ}∼P(α,β,γ)} log qθ(α, β, γ)
  = argmin_θ F_NLL,    (24)

where F_NLL = − Σ_{{α,β,γ}∼P(α,β,γ)} log qθ(α, β, γ) is the so-called negative log-likelihood loss function, or the cross-entropy loss.
In this work, we consider error models that are easy to sample. For example, for the depolarizing model, the errors on each qubit are generated independently. We also consider correlated noise, where the errors are generated according to some pairwise correlations; in this case, we can adopt the Metropolis-Hastings algorithm to sample the errors. Once an error operator is sampled, according to the isomorphic mappings between the Pauli group and its ELS representation (i.e. with Eq. (14), Eq. (15), and Eq. (16)), an {α, β, γ} configuration is generated as training data for training the model for all syndromes. We therefore call this step pre-training: it learns a joint distribution for all syndromes, and also keeps the conditional probabilities q(β|γ) for all syndromes.
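
A pre-training loop then only needs samples of the error model and the ELS mapping. The sketch below assumes the depolarizing model, the CausalTransformer sketched earlier, and the matrices H, M_E, M_L, Λ from Algorithms 1-2; the bit-ordering convention of the F2 vectors is an assumption.

import numpy as np
import torch

def sample_depolarizing_errors(n, p, batch):
    """Each qubit independently suffers X, Y or Z with probability p/3; returned in the
    F2 representation with the (assumed) ordering E = (E_x | E_z)."""
    hit = np.random.rand(batch, n) < p
    which = np.random.randint(0, 3, size=(batch, n))        # 0 -> X, 1 -> Z, 2 -> Y
    x_part = (hit & (which != 1)).astype(int)               # X and Y flip the X part
    z_part = (hit & (which != 0)).astype(int)               # Z and Y flip the Z part
    return np.hstack([x_part, z_part])

def pretrain(model, H, ME, ML, Lam, p=0.189, steps=10**5, batch=10**4, lr=1e-3):
    """Minimize the negative log-likelihood F_NLL of Eq. (24) on samples of the error model."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n = H.shape[1] // 2
    for _ in range(steps):
        E = sample_depolarizing_errors(n, p, batch)
        alpha = (E @ Lam @ ME.T) % 2                         # ELS configuration, Eqs. (14)-(16)
        beta  = (E @ Lam @ ML.T) % 2
        gamma = (E @ Lam @ H.T) % 2
        x = torch.tensor(np.hstack([gamma, beta, alpha]), dtype=torch.float32)
        loss = -model.log_prob(x).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model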
The decoding is then very fast, because one can evaluate the conditional probabilities of each β_i variable by forward passes of the neural network, especially using GPUs. Here we describe it in detail using the example of Fig. 3. First, a syndrome γ_1, γ_2 is sent as input to the Transformer; after one forward pass of the Transformer, F_β1(γ_2, γ_1), we compute the conditional probability q(β_1|γ_2, γ_1) using, e.g., Eq. (20), and select the configuration of β_1 that maximizes the conditional probability,

β̂_1 = argmax_{β_1} q(β_1|γ_2, γ_1).    (25)

Then, we send β_1 and γ as input to the transformer, compute the conditional probability q(β_2|β_1, γ_2, γ_1), and determine the β_2 configuration according to the conditional probability,

β̂_2 = argmax_{β_2} q(β_2|β_1, γ_2, γ_1).    (26)

We call this method of determining the β variables one by one generative MLD decoding. In Fig. 4, we compare the logical error rate given by the exact MLD decoding, which enumerates all possible 4^k logical operators, with that of the generative MLD on the surface code; the generative results are almost identical to the exact MLD results, while the computational complexity has decreased from 4^k coset probabilities to 2k conditional probabilities.
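
Decoding thus amounts to 2k forward passes of the network, each followed by an argmax over a single Bernoulli conditional, Eqs. (25)-(26), and a batch of syndromes can be processed simultaneously. A sketch, again assuming the CausalTransformer above and the (γ, β, α) input ordering:

import torch

@torch.no_grad()
def generative_mld_decode(model, gamma, m, k):
    """Given a batch of syndromes gamma with shape (batch, m), generate the logical
    configuration beta variable by variable by maximizing each conditional probability."""
    batch = gamma.shape[0]
    x = torch.zeros(batch, 2 * m + 2 * k)
    x[:, :m] = gamma                              # the input order is (gamma, beta, alpha)
    for i in range(2 * k):
        cond = model(x)[:, m + i]                 # q(beta_i | beta_<i, gamma), cf. Eq. (20)
        x[:, m + i] = (cond > 0.5).float()        # argmax of a Bernoulli conditional
    return x[:, m:m + 2 * k]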

[Figure 4: logical error rate versus physical error rate for k = 1, 3, 5, comparing generative MLD (stars) with exact MLD (solid lines).]
FIG. 4. Comparison between the exact maximum likelihood decoding (MLD, solid lines) and the generative MLD (stars). The coset probabilities p(β|γ) and the conditional coset probabilities p(β_i|β_j<i, γ) are calculated through contraction of tensor networks. We perform lattice surgeries on the d = 5 surface code; the numbers of logical qubits are 1, 3, and 5. Each data point is averaged over 10000 random syndrome instances.

We remark that the decoding for multiple syndromes can be done simultaneously because the Transformer can
accept a batch of syndromes as an input and process the batched forward pass efficiently, especially using GPUs.

Refinement

The pre-training learns a joint distribution qθ(α, β, γ) for all 2^m γ configurations and offers a fast forward pass for decoding. For a given syndrome, we can spend more computational cost to further enhance the accuracy of decoding, which we term refinement. A straightforward way to refine the transformer is to minimize the distance between the variational conditional distribution and the true conditional distribution. However, since it is not possible to obtain samples of the conditional distribution p(α, β|γ) for a given syndrome directly from the error model, due to the lack of the normalization factor, we cannot directly refine the variational distribution given the syndrome using the forward KL divergence DKL(p(α, β|γ) || q(α, β|γ)). Instead, we can do the refinement by minimizing the reverse KL divergence,

DKL(q(α, β|γ) || p(α, β|γ)),

because the variational distribution q(α, β|γ) can always be sampled. However, using the reverse KL divergence requires computing the gradients using samples of the variational distribution with the REINFORCE method [28] (also known as policy gradients), analogous to reinforcement learning. The procedure is quite similar to that of the variational autoregressive networks for minimizing the variational free energy in statistical mechanics problems [29]. However, minimizing the backward KL is computationally expensive. In this work, we propose another way to refine qecGPT, which we call the generative refinement. The idea is to explicitly compute the summation over the stabilizer configurations using α configurations sampled from the variational distribution (taking advantage of the generative model), given a syndrome configuration γ. That is, an unbiased estimate of the joint probability p(β, γ) can be computed using reweighted samples of stabilizer configurations,

p(β, γ) = Σ_α qθ(α, β, γ) [p(α, β, γ) / qθ(α, β, γ)] ≈ (1/N) Σ_{α∼q} [p(α, β, γ) / qθ(α, β, γ)].

On the R.H.S. of the last equation, we use the samples of the variational distribution and the reweighting to compute an unbiased estimate of the joint distribution. The samples of the α configurations can be computed using the conditional probabilities, but in a different way from how we obtained the β configurations. Again we use Fig. 3 as a simple example. Suppose we are given a syndrome configuration γ_1, γ_2 and we want to estimate the joint probability of a β configuration β_1, β_2 and the syndrome configuration, p(β_2, β_1, γ_2, γ_1). We can send β_2, β_1, γ_2, γ_1 as input to the transformer, compute the conditional probability, and sample the α_1 configuration according to this probability distribution,

α_1 ∼ q(α_1|β_2, β_1, γ_2, γ_1).

Notice that this is different from the sampling procedure used to determine the β configuration in the decoding of the pre-trained model: here we are sampling from the distribution, while in the decoding of the pre-trained model we maximize the conditional probabilities, as shown in Eq. (25). After determining the value of α_1, we send α_1, β_2, β_1, γ_2, γ_1 as input to the transformer, compute the conditional probability, and sample the α_2 configuration according to it,

α_2 ∼ q(α_2|α_1, β_2, β_1, γ_2, γ_1).

The advantage of the reweighting formula is the unbiased estimate, and it greatly improves the decoding accuracy over the pre-training alone, but the drawback is that we have to evaluate 4^k configurations of the β variables for the maximum-likelihood decoding.
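
The generative refinement is a small importance-sampling routine around the pre-trained model. The sketch below samples α from the model's conditionals and reweights; as the proposal density we use the conditional q(α|β, γ), i.e. the product of the sampled α conditionals, so that the average is an unbiased estimate of p(β, γ). The function log_p, returning log p(α, β, γ) under the error model, is assumed to be supplied by the user.

import torch

@torch.no_grad()
def refined_coset_probability(model, log_p, gamma, beta, m, k, num_samples=1000):
    """Estimate p(beta, gamma) for one syndrome/logical pair by sampling the stabilizer
    configuration alpha from the variational conditionals and reweighting."""
    x = torch.zeros(num_samples, 2 * m + 2 * k)
    x[:, :m] = gamma                                  # fixed syndrome
    x[:, m:m + 2 * k] = beta                          # fixed logical configuration under test
    log_q_alpha = torch.zeros(num_samples)
    for i in range(m):                                # ancestral sampling of alpha_1, ..., alpha_m
        cond = model(x)[:, m + 2 * k + i].clamp(1e-7, 1 - 1e-7)
        a = torch.bernoulli(cond)
        x[:, m + 2 * k + i] = a
        log_q_alpha += a * torch.log(cond) + (1 - a) * torch.log(1 - cond)
    w = torch.exp(log_p(x) - log_q_alpha)             # p(alpha, beta, gamma) / q(alpha | beta, gamma)
    return w.mean()                                   # unbiased estimate of p(beta, gamma)

Repeating this for every β and taking the argmax reproduces the refined decoder; as noted above, this costs 4^k such estimates.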

Fast decoding with training parameters mismatched to the error model

In maximum likelihood decoding, the computation of the likelihood of logical operators requires the parameters of the error model. In qecGPT, the parameters of the error model explicitly appear in the (slow) training process but do not appear in the (fast) decoding process. It would therefore be very efficient if a transformer trained with one set of noise-model parameters could be used for decoding under another noise-model parameter. In the area of statistical inference, this is known as inference with mismatched parameters, and it is well known that, although not optimal, the mismatched parameters sometimes already provide very accurate inference results. For example, in the community detection problem (which can be studied using the statistical inference of the stochastic block model), it has been shown that inference using the parameters at the phase transition always gives good results, and is even optimal in the sense of the range of detection [22]. Inspired by the results of [22], we can always train qecGPT using the parameters of the noise model at the theoretical phase transition point. To give a concrete example, for the surface code the phase transition happens, with n → ∞, at a physical error rate p ≈ 0.189 [30]. We can train qecGPT with p = 0.189 and decode codes with other physical error rate values.

In Fig. 5, we have tested the performance of mismatched decoding of the surface code with various code distances. The maximum likelihood decoding is performed using exact tensor network contractions with noise parameter p′. We have tested two kinds of p′ values. The first one is the matched parameter, where p′ is set to the true physical error rate p of the depolarizing noise model; the second one is fixed to p′ = 0.189, which is the threshold (phase transition) of the surface code with d = ∞. From the figure, we can see that the results of the MLD decoding with the threshold parameter are indistinguishable from those of the exact MLD decoding with the true parameters.

FIG. 5. The performance of exact maximum likelihood decoding with mismatched parameters on the d = [3, 7, 11] surface codes under the depolarizing error model. The decoding is implemented using exact tensor network contractions with noise parameter p′. The blue line shows the decoding results with p′ = p, i.e. with matched error model parameters. The red symbols show the decoding results with p′ = 0.189, which is fixed to the physical error rate at the threshold (phase transition). Each data point is averaged over 10000 random syndrome instances.

Compared with other neural network decoders

Recently, several decoding algorithms based on neural networks have been proposed. These include algorithms using Boltzmann machines [7], multilayer neural networks [8–11], Long Short-Term Memory (LSTM) neural networks [12], and convolutional neural networks (CNN) [13, 14]. We notice that all of the existing neural network decoders are based on supervised learning. This means that the training of the models requires a training dataset with labels prepared using another decoder. The labels are either the recovery operators computed using the minimum weight perfect matching algorithm, or the correct type of logical operators computed using a maximum likelihood decoding algorithm. For detailed introductions to the neural network decoders we refer to [31].

In this sense, our generative decoders are very different from the existing neural network decoders. Among many differences, the most crucial one is that our approach uses unsupervised learning rather than supervised learning: it models the joint distribution of errors using neural networks, rather than learning the probability of labels. In other words, for training the neural networks, we do not need to prepare any labeled data; instead, we directly draw unlabelled samples from the noise model as training data. Moreover, the existing neural decoders belong to discriminative learning, while our approach belongs to generative learning, which generates the most likely logical operators variable by variable, analogous to generating a sentence word by word. Another significant difference between our approach and the existing neural network decoders is that, because our generative decoder generates the most likely logical operators variable by variable, it is capable of decoding with a large number of logical qubits. In contrast, maximum likelihood decoders based on supervised learning require labeled data with 4^k different kinds of labels, which is intractable. For example, with k = 7, our generative decoder can generate logical operators with computational complexity O(14), while neural network decoders based on supervised learning would require 4^7 = 16384 different types of labels and hence are intractable.

Additional numerical results

In this section, we provide numerical results in addition to the results we have shown in the main text.
Rotated Surface Codes— Here we benchmark qecGPT on the rotated surface codes [32]. This type of code has the same threshold as the surface code but uses only d^2 physical qubits to encode a single logical qubit, rather than the n = d^2 + (d − 1)^2 qubits of the surface code. Recently this code has been frequently used in quantum hardware experiments [3]. We compare the performance of qecGPT with MWPM in Fig. 6 on the rotated surface code with different code distances. It can be seen that qecGPT always provides more accurate results (with lower logical error rates) than MWPM.

[Figure 6: three panels (a), (b), (c) of logical error rate versus physical error rate, with curves for Exact-MLD, MWPM, and qecGPT and insets showing the difference from the exact MLD.]
FIG. 6. Comparison of the decoding performance of various algorithms on the rotated surface code with d = 3 (a), d = 5 (b), and d = 7 (c). Each data point is averaged over 10,000 random instances. MWPM is the minimum weight perfect matching algorithm, and the Exact-MLD is implemented by summing all possible stabilizer configurations using exact tensor network contractions.

With k > 1 logical qubits— An advantage of qecGPT is that it can decode with k ≫ 1 logical qubits efficiently via its generative capability. In maximum likelihood decoding, one usually needs to compare the likelihoods of all 4^k logical operators of k logical qubits, which is intractable for a large k. Instead, the complexity of qecGPT is only O(2k), because it generates logical configurations variable by variable. To evaluate the performance of qecGPT with k > 1 logical qubits, we perform lattice surgery on the d = 3 surface code to increase the number of logical qubits k. In detail, we remove 2 (4, 6) stabilizers randomly and increase the number of logical qubits to 3 (5, 7), respectively. This increases the number of logical operators to 64 with k = 3, 1024 with k = 5, and 16384 with k = 7. During the training of the transformer, we always fix the physical error rate to p′ = 0.15 and use the model to decode at other physical error rates p. From Fig. 7 we can see that qecGPT outperforms minimum weight perfect matching on both the surface code and the toric code [4] with multiple logical qubits.

[Figure 7: logical error rate versus physical error rate; the left panel compares Exact-MLD, qecGPT, and MWPM, and the right panel compares Exact MLD, BPOSD, MWPM, and qecGPT.]
FIG. 7. Comparison of decoding performance with k > 1 logical qubits. (Left) Distance-3 surface codes with random lattice surgeries. The number of physical qubits is 26 and the numbers of logical qubits are k = 3 (black), k = 5 (blue), and k = 7 (red), respectively. (Right) Toric code with code distance d = 3 and k = 2 logical qubits. In the figures, MWPM is the minimum weight perfect matching algorithm [5], and BPOSD is the belief propagation augmented by ordered statistics decoding algorithm [23]. Exact MLD is the maximum likelihood decoding which sums all possible stabilizer configurations using exact tensor network contractions.

Correlated noise— Although simple error models, e.g. depolarizing models, assume that errors are independent on each qubit, in practical quantum hardware there inevitably exist correlations between errors on different qubits. In our approach, we only need samples from the error model for training qecGPT, so we can decode error-correcting codes with correlated noise in exactly the same way as we described and evaluated for the independent noise models. In this section, we propose a simple noise model to evaluate our approach for correlated noise. For a [n, k, d] quantum code, we generate an Ising model on a degree-4 regular random graph with n spins. The couplings J_ij of the Ising model are sampled from a uniform distribution U(0, 1), and, to break the Z2 symmetry, a small external field h = 0.3 is added to the Hamiltonian, as in Eq. (27),

H = −β Σ_{⟨ij⟩} J_ij s_i s_j − h Σ_i s_i.    (27)

We then draw samples from the Boltzmann distribution using the Metropolis-Hastings algorithm. The error that occurs is determined by a sample s: if s_i = 1, the identity I acts on the i-th physical qubit; otherwise, a Pauli error X, Y, or Z occurs with equal probabilities. In this way the error configurations are mapped from configurations of the Ising model, which are generated from a Boltzmann distribution with long-range correlations. With β = 0, all configurations are drawn randomly, each spin having probability 0.5 to take 1 or −1, which is analogous to the depolarizing error model with a high physical error rate. With β > 0, the correlations between the error configurations are long-range and non-trivial. With β = ∞, the ground state of the Ising model is a ferromagnetic configuration, so almost no Pauli error appears, corresponding to a low physical error rate. In Fig. 8 we compare the performance of qecGPT with the minimum weight perfect matching algorithm, which determines the weights using the marginal probabilities calculated from the samples. In the figure, with β = 0, the error model is purely random, so the logical error rates of both MWPM and qecGPT are high. With large β, the physical error rate is small and the decoding is very easy; we see that both MWPM and qecGPT give a very low logical error rate. In the middle, when the physical error rate is moderate, we can see that qecGPT significantly outperforms the minimum weight perfect matching algorithm.
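
A sketch of this correlated error sampler, using networkx for the random regular graph and a single-spin-flip Metropolis-Hastings chain over the Boltzmann weight exp(−H) of Eq. (27); the sweep count and seed handling are our own assumptions:

import numpy as np
import networkx as nx

def sample_correlated_error(n, beta, h=0.3, sweeps=1000, seed=0):
    """Return one Pauli error pattern (0 = I, 1/2/3 = X/Y/Z per qubit) drawn from the
    correlated noise model described in the text."""
    rng = np.random.default_rng(seed)
    G = nx.random_regular_graph(4, n, seed=seed)
    J = {tuple(sorted(e)): rng.uniform(0, 1) for e in G.edges}
    s = rng.choice([-1, 1], size=n)
    for _ in range(sweeps * n):
        i = rng.integers(n)
        # energy change of flipping s_i under H = -beta * sum_<ij> J_ij s_i s_j - h * sum_i s_i
        field = beta * sum(J[tuple(sorted((i, j)))] * s[j] for j in G.neighbors(i)) + h
        dE = 2 * s[i] * field
        if dE <= 0 or rng.random() < np.exp(-dE):
            s[i] = -s[i]
    return np.where(s == 1, 0, rng.integers(1, 4, size=n))   # spins with s_i = -1 become random Paulis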

[Figure 8: logical error rate of MWPM and qecGPT as a function of the inverse temperature β of the correlated error model.]
FIG. 8. Decoding of the surface code with d = 5 with correlated errors described in the text. The weights of MWPM are
determined using the marginal probabilities calculated using samples. Each data point is averaged over 10000 instances.

Parameters of the neural networks

In the numerical experiments, all neural networks were trained on a single NVIDIA A100 GPU. To ensure that the distribution represented by the Transformers has the autoregressive property, an upper-triangular mask is added to the attention block, as illustrated in Fig. 3. We note that in the Softmax layer of the Transformers, we add a mask matrix whose diagonal elements are set to 1,

[ 1  -∞  · · ·  -∞ ]
[ 1   1  · · ·  -∞ ]
[ ⋮   ⋱    ⋱    ⋮ ]    (28)
[ 1  · · ·  1    1 ]

With this mask alone, the autoregressive property is not satisfied, because the i-th output depends on the i-th input. The solution is to introduce a virtual variable x_0 and shift the order of the input variables as

X(x_1, · · · , x_n) → X(x_0, · · · , x_{n−1}),    (29)

where x_0 always equals 1. The hyperparameters for training the Transformers are detailed in Tab. I, where D is the dimension of the model, N_h is the number of heads of the multi-head attention, N_l is the number of encoder layers, and D_f is the dimension of the feed-forward layers.
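
In PyTorch this corresponds to an additive attention mask that permits attention to the diagonal, combined with the right-shift of Eq. (29). A small illustration (note that PyTorch uses a 0/−∞ additive mask rather than the 1/−∞ matrix shown above, which is an implementation detail):

import torch

seq_len = 6   # e.g. (gamma_1, gamma_2, beta_1, beta_2, alpha_1, alpha_2)

# additive attention mask: position i may attend to positions j <= i
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

def shift_right(x):
    """Prepend the virtual variable x_0 = 1 and drop the last input, Eq. (29), so that
    the i-th output depends only on x_1, ..., x_{i-1}."""
    x0 = torch.ones(x.shape[0], 1)
    return torch.cat([x0, x[:, :-1]], dim=1)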

         BATCH   EPOCH     LR      D    N_h  N_l  D_f
Sur3     10^4    10^5      10^-3   256  4    2    256
Sur5     10^4    2×10^5    10^-3   256  4    3    256
RSur3    10^4    10^5      10^-3   256  4    2    256
RSur5    10^4    2×10^5    10^-3   256  4    3    256
RSur7    10^4    3×10^5    10^-3   512  4    2    512
3DSur2   10^4    10^5      10^-3   256  4    2    256
Tor3     10^4    10^5      10^-3   256  4    2    256
N13k3    10^4    10^5      10^-3   256  4    2    256
N13k5    10^4    10^5      10^-3   256  4    2    256
N13k7    10^4    10^5      10^-3   256  4    2    256

TABLE I. Parameters of qecGPT and hyperparameters used in the training.
