
Your Dataset is a Multiset and You Should Compress it Like One

Daniel Severo*¹,²,³   James Townsend*⁴   Ashish Khisti²   Alireza Makhzani³,²   Karen Ullrich¹

¹Facebook AI Research   ²University of Toronto   ³Vector Institute for AI   ⁴University College London

Abstract
Neural Compressors (NCs) are codecs that leverage neural networks and entropy
coding to achieve competitive compression performance for images, audio, and
other data types. These compressors exploit parallel hardware, and are particularly
well suited to compressing i.i.d. batches of data. The average number of bits needed
to represent each example is at least the well-known cross-entropy. However, the
cross-entropy bound assumes the order of the compressed examples in a batch
is preserved, which in many applications is not necessary. The number of bits
used to implicitly store the order information is the logarithm of the number of
unique permutations of the dataset. In this work, we present a method that reduces
the bitrate of any codec by exactly the number of bits needed to store the order,
at the expense of shuffling the dataset in the process. Conceptually, our method
applies bits-back coding to a latent variable model with observed symbol counts
(i.e. multiset) and a latent permutation defining the ordering, and does not require
retraining any models. We present experiments with both lossy off-the-shelf codecs
(WebP) as well as lossless NCs. On Binarized MNIST, lossless NCs achieved
savings of up to 7.6%, while adding only 10% extra compute time.

1 Introduction
A data source is usually modeled as an ordered sequence of random variables with some joint
distribution. The objective of lossless data compression is to map each trajectory, i.e. the sequence
of instances, to a compact representation such as a string of bits. The average number of bits of
the smallest lossless representation for any source is completely characterized by it’s probability
distribution, and is known as the entropy. Compression algorithms such as Arithmetic Coding (AC)
and Asymmetric Numeral Systems (ANS) [3] can compress any sequence to a size very close to the
entropy, if the probability distribution is known.
In real-world applications, the source distribution is not known and must be estimated from data. The
better the estimate, the closer the size of the representation is to the entropy. Deep generative models
are good estimators for this task, and have recently been shown to reach competitive compression
performance when paired with AC and ANS [7, 15, 17, 8, 1, 2, 6]. The pairings of these models with
compression algorithms are increasingly being referred to as Neural Compressors (NCs) [9].

*Equal contribution.

NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications


In the compression literature, the elements of an ordered sequence are referred to as symbols. AC and ANS preserve the ordering of symbols in the input sequence. However, there are data types where order is not meaningful, such as collections of files, rows in a database, nodes in a graph, and, notably, datasets in machine learning applications. Formally, these are known as multisets: a generalization of a set that allows for repetition of elements. A multiset can be compressed with NCs by somehow ordering the elements and communicating the corresponding ordered sequence, but this is wasteful, because the order contains information, and therefore more bits are used to represent the source than are truly necessary.

Figure 1: Our entropy coder for datasets (multisets) is equivalent to bits-back coding on a latent variable model with observed multisets and latent permutations.
Current entropy coders tailored to multisets of i.i.d. symbols either have a sub-optimal compression
rate or compute time which scales linearly with data dimensionality, making them unsuitable for
multisets of images and other high-dimensional sources [13, 14, 12]. Leveraging the efficiency and
compression performance of existing NCs while simultaneously compressing a multiset optimally
requires a mechanism that can forget the ordering between symbols during encoding. In this
work, we present such a method based on bits-back coding, and the key observation that any model
over sequences can be seen as a latent variable model, where the observed variable is the unordered
multiset, and the latent variable is a permutation that defines the order between symbols [18]. A more
general, detailed exposition of the core multiset compression method is given in [11].

2 Background

Given access to some model P over sequences x^n = (x_1, ..., x_n) of discrete elements, entropy coders are functions that map x^n to bit-strings of length approximately −log P(x^n). Under the i.i.d. assumption x_i ∼ P_d, the average bit-length −(1/n) log P(x^n) = −(1/n) Σ_{i=1}^{n} log P(x_i) approaches the sequence-cross-entropy H(P_d) + D_KL(P_d ‖ P) for large n, where H(P_d) is the entropy of the data source. The sequence-cross-entropy is the smallest possible average bit-length that can be achieved for a given model P. We say an entropy coder is optimal if it can compress sequences close to the sequence-cross-entropy.
The entropy coder we develop in this paper depends on another well-known entropy coder called “asymmetric numeral systems” (ANS) [3]. In contrast to Arithmetic Coding, ANS is last-in-first-out (i.e. stack-like): the encoded sequence x^n is decoded in reverse order x_n, ..., x_1. ANS encodes sequences into a single, large natural number s. The output bit-string is just the binary representation of s, which is log s bits in size. Each encoded symbol x_i increases the length of the bit-string by approximately −log P(x_i), and decoding decreases it by the same amount.
A key property of ANS is that performing a decode operation using distribution P will produce
samples distributed according to P . Decoding reduces the ANS state, which can be viewed as a
random seed that is slowly consumed as samples are drawn. The random seed can be recovered
by performing an encode with the sampled symbol. Therefore, ANS can be used as an invertible
sampler. Invertible sampling is also possible with other entropy coders and is commonly known as
‘bits-back coding’ [4]. However, the stack-like nature of ANS allows for interleaving of sampling
(decoding) and encoding with different distributions. This observation was first made by [15], and
used to compress images with latent variable models. Follow-up works propose elaborations on the
original idea for specific classes of latent variable models [8, 17, 16].
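
To make the stack-like behaviour and the invertible-sampling property concrete, the following is a minimal integer rANS sketch (an illustration with simplified, hypothetical function names, not the Craystack codec used in our experiments); probabilities are quantized to integer counts summing to 2^PREC.

```python
# Toy integer rANS sketch. Probabilities are quantized to counts summing to 2**PREC.
PREC = 12
TOTAL = 1 << PREC

def encode(state, symbol, counts, cum_counts):
    """Push `symbol` onto the state; the state grows by about -log2(p) bits."""
    c, start = counts[symbol], cum_counts[symbol]
    return (state // c) * TOTAL + start + (state % c)

def decode(state, counts, cum_counts, symbols):
    """Pop a symbol; applied to a well-mixed state this acts like sampling from p."""
    r = state % TOTAL
    symbol = next(s for s in symbols if cum_counts[s] <= r < cum_counts[s] + counts[s])
    c, start = counts[symbol], cum_counts[symbol]
    return symbol, c * (state // TOTAL) + (r - start)

# Encode-then-decode recovers both the symbol and the previous state (LIFO).
symbols = ["a", "b"]
counts = {"a": 3 * TOTAL // 4, "b": TOTAL // 4}
cum_counts = {"a": 0, "b": counts["a"]}
s0 = 1 << 32                        # initial state: these ~32 bits are the initial-bits overhead
s1 = encode(s0, "a", counts, cum_counts)
symbol, s_back = decode(s1, counts, cum_counts, symbols)
assert symbol == "a" and s_back == s0
```

Because decoding exactly inverts encoding, decoding against a distribution both yields a sample and shrinks the state, and re-encoding the sample restores it; this is the invertible-sampling behaviour the method in Section 3 relies on.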

3 Method

In this section we present a method that converts an entropy coder for sequences x^n of i.i.d. symbols into one for multisets M of i.i.d. symbols. We describe only the compression scheme, as decompression follows from all steps being invertible. The multiset-cross-entropy achieved by compressing M is at most (1/n) log n! ≈ log n bits per symbol less than the usual sequence-cross-entropy resulting from compressing x^n, with equality when there are no repeated symbols. This is exactly the information required to represent the ordering. Our method is optimal in the sense that it recovers all of these bits, irrespective of how well the original sequence codec performs on x^n. The naive implementation discussed next can be seen as a McBits [10] entropy coder with an exact posterior, where the joint distribution is that of the entropy coder chosen for conversion.
Key to our method is that any model over sequences can be seen as a latent variable model over observed multisets (i.e. symbol counts) and latent permutations defining the ordering [18]. To see how, let [x^n] represent the set of all possible permutations of x^n. All sequences in this set have the same frequency count of symbols, and therefore [x^n] has a one-to-one correspondence with a unique multiset M. The number of permutations |[x^n]| ≤ n! is equal to the multinomial coefficient calculated from the symbol counts in M, with equality when there are no repeated symbols. If the symbols x_i are i.i.d. under model P, then all sequences in [x^n] have mass P(x^n), implying

P(M) = P([x^n]) = |[x^n]| P(x^n).    (1)
A source that generates sequences x^n of i.i.d. symbols can thus be seen as a compound source that first selects a multiset M and then picks an order uniformly at random from the corresponding set [x^n]. If σ represents a positive integer that indexes into [x^n], then the posterior and joint distributions under model P are

P(σ | M) = 1 / |[x^n]|,    P(M, σ) = P(x^n) = ∏_{i=1}^{n} P(x_i).    (2)
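
As a concrete illustration: for the multiset M = {a, a, b}, the set of orderings is [x^3] = {aab, aba, baa}, so |[x^3]| = 3!/2! = 3, P(M) = 3 P(a)² P(b), and P(σ | M) = 1/3 for each of the three orderings.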

The distribution P(M) does not factorize over symbols, making it inefficient for entropy coding. For example, we can compress M by always picking the x̃^n ∈ [x^n] such that x̃_1 ≤ x̃_2 ≤ · · · ≤ x̃_n, together with P(x̃_i | x̃^{i−1}). This requires updating the mass of all symbols a ≥ x̃_i in the alphabet at each step i. In contrast, P(x^n) factorizes over the x_i and hence does not require updates, but uses exactly log |[x^n]| surplus bits. By using invertible sampling to randomly pick any sequence in [x^n], we can compress efficiently with P(x^n) while removing exactly the surplus bits from the ANS state. Equivalently, we augment the multiset to a sequence by pairing it with an order [10], and then recover the bits using bits-back (Figure 1). First, an ANS state s is initialized and invertible sampling (decoding) is performed with P(σ | M) to sample an index σ. The sequence x̃^n ∈ [x^n] corresponding to σ is then encoded with P(x̃^n) = P(M, σ). Sampling reduces s, while encoding increases it. Overall, the bit-length of s increases by exactly the number of bits required to represent M:

log P(σ | M) − log P(M, σ) = − log P(M).    (3)

Figure 2: Amortizing over symbols allows us to compress a single multiset M while still achieving the information content −log P(M).

The scheme as described above would only be optimal when amortizing over a collection of multisets, due to the log s initial bits required to initialize the ANS state s. It is possible to compress a single multiset while still achieving −log P(M) via an incremental version of our method that amortizes over symbols. At each step, a symbol is sampled without replacement from the multiset and immediately encoded with P. Sampling is performed using ANS with distribution S_M, which assigns probabilities proportional to the frequency counts in M. This is shown in Figure 2 for a multiset of size 2. The sequence x̃^n generated from the invertible sampling steps (see Section 2) is a random permutation of x^n. Therefore, the order information between symbols in x^n has been destroyed, as only x̃^n is stored. This implies our method is optimal, since it removes the maximum amount of redundant order information (log |[x^n]|), and achieves the multiset-cross-entropy if P achieves the sequence-cross-entropy.
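
The incremental scheme can be summarized in a few lines. The sketch below assumes a hypothetical interface ans_decode(state, dist) / ans_encode(state, symbol, dist) (not the actual Craystack API) and only illustrates the sample-then-encode loop described above.

```python
from collections import Counter

def encode_multiset(state, multiset, model_dist, ans_encode, ans_decode):
    """Encode a multiset of i.i.d. symbols while discarding its order.

    Each iteration 'samples without replacement' by ANS-decoding under the
    empirical frequency distribution S_M of the remaining symbols, then
    encodes the sampled symbol under the model P. The decode step removes
    about log2(remaining / count) bits; summed over all steps this is exactly
    log2 |[x^n]|, the redundant order information.
    """
    counts = Counter(multiset)
    remaining = sum(counts.values())
    while remaining > 0:
        freq_dist = {s: c / remaining for s, c in counts.items()}  # S_M (idealized, non-quantized)
        symbol, state = ans_decode(state, freq_dist)               # invertible sampling
        state = ans_encode(state, symbol, model_dist)              # costs about -log2 P(symbol) bits
        counts[symbol] -= 1
        if counts[symbol] == 0:
            del counts[symbol]
        remaining -= 1
    return state
```

Decoding runs the loop in reverse: each symbol is ANS-decoded under P and then re-encoded under the frequency distribution of the symbols recovered so far (including the one just decoded), which restores the state bits consumed by sampling and rebuilds M.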
Sampling without replacement from the multiset can be done efficiently by using a binary search tree (BST), which we describe in detail in [11]. Traversing the BST requires comparing symbols under a fixed (usually lexicographic) ordering. In the extreme case where symbols in some alphabet A are represented by binary strings of length log |A|, a single comparison between symbols would require log |A| bit-wise operations. However, ‘short-circuit’ evaluations, where the next bits are compared only if all previous comparisons result in equality, make this exponentially unlikely. All operations required have worst-case and average complexity equivalent to that of a search on a balanced BST. Overall, our method adds O(n log m) average compute time to the original sequence codec, where n and m are the total and unique numbers of symbols in the multiset.
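
As one possible realization of this bookkeeping (an illustration using a Fenwick tree over ranked symbols, rather than the BST of [11]), the sampler needs two O(log m) operations: find which symbol covers a given cumulative-count position (the ANS decode lookup), and decrement that symbol's count (removal without replacement).

```python
class FenwickCounts:
    """Fenwick (binary indexed) tree over the counts of m ranked unique symbols."""

    def __init__(self, counts):            # counts: list of ints, one per ranked symbol
        self.n = len(counts)
        self.tree = [0] * (self.n + 1)
        for i, c in enumerate(counts):
            self.add(i, c)

    def add(self, i, delta):                # adjust the count of the symbol with rank i
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def find(self, r):                      # rank whose cumulative interval contains position r
        pos, bit = 0, 1 << self.n.bit_length()
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= r:
                pos = nxt
                r -= self.tree[nxt]
            bit >>= 1
        return pos, r                        # (rank, offset within that symbol's count)

# Example: multiset {a: 2, b: 1, c: 3} with symbols ranked a < b < c.
fw = FenwickCounts([2, 1, 3])
rank, offset = fw.find(4)   # positions 0,1 -> a; 2 -> b; 3,4,5 -> c, so rank 2 ('c')
fw.add(rank, -1)            # remove one 'c' once it has been encoded
```

Conceptually, an ANS decode step yields a position r in [0, remaining); find(r) returns the rank of the sampled symbol together with its offset, and add(rank, -1) then removes one occurrence.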

4 Experiments

In the experiments that follow, we used the ANS implementation in the Craystack library [17]. Details
of each experiment are discussed further in the appendix.

Toy multisets We compressed synthetically generated toy multisets to provide evidence of the computational complexity and optimal compression rate of the method. The alphabet A is always a subset of ℕ. For each run, we generate a multiset with m = 512 unique and |M| = n total symbols. The true data distribution is used for P; it is sampled from a Dirichlet prior with coefficients α_k = k for k = 1, ..., |A|. Results are shown in Figure 3, averaged over 20 runs, with shaded regions representing the 1%–99% confidence intervals. The new codec compresses a multiset close to its information content (−log P(M)) for varying alphabet sizes, as can be seen in the left plot. The total encode plus decode time is unaffected by the alphabet size |A|, and scales linearly with the multiset size |M| (right plot), as expected.
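
As a rough sketch of this setup (simplified: it fixes the sequence length rather than enforcing exactly m = 512 unique symbols, and is not the script used for the reported numbers), the information content −log_2 P(M) = −log_2 |[x^n]| − Σ_i log_2 P(x_i) targeted by the codec can be computed directly:

```python
import numpy as np
from collections import Counter
from math import lgamma, log

rng = np.random.default_rng(0)
A = 1 << 10                                  # alphabet size |A|
p = rng.dirichlet(np.arange(1, A + 1))       # source distribution P ~ Dirichlet(alpha_k = k)
xs = rng.choice(A, size=1024, p=p)           # a sequence x^n; dropping its order gives the multiset M

counts = Counter(xs.tolist())
# log2 |[x^n]| = log2( n! / prod_a c_a! ), the multinomial coefficient of the counts
log2_perms = (lgamma(len(xs) + 1) - sum(lgamma(c + 1) for c in counts.values())) / log(2)
info_content = -log2_perms - sum(np.log2(p[x]) for x in xs)
print(info_content)                          # -log2 P(M): the target compressed size in bits
```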


Figure 3: Left: Final compressed length is close to the information content −log P(M) for varying alphabet and multiset sizes. Right: Computational complexity does not scale with alphabet size |A|, and is linear in |M| = n.

Lossless Neural Compression on Binarized MNIST Here, a lossless compression algorithm called ‘Bits-back with ANS’ (BB-ANS) [15] was used to compress Binarized MNIST. BB-ANS is an entropy coder for latent variable models with factorized joint P(x, z) = P(z) P(x | z) and approximate posterior Q(z | x), over observed x and latent z. The observation model P(x | z) is a factorized Bernoulli distribution over pixels, while the approximate posterior Q is a factorized Gaussian. The average bit-length achieved by BB-ANS is an upper bound on the sequence-cross-entropy: it equals the negative evidence lower bound of the discretized model, and the bound is tight when the approximate posterior matches the true posterior P(z | x). We used the pre-trained model and code made publicly available by the authors of [15]². First, invertible sampling is performed to select a binarized MNIST image for compression. Then, BB-ANS is applied as described in the experimental section of [17]. This process is repeated until all 10,000 images are compressed. We compared the average bit-length with and without the invertible sampling step. Results are shown in Table 1. Since the images are all unique, the maximum theoretical savings is log(10,000!) bits ≈ 14 kB. This represents a potential savings of 7.6%, which is achieved by our method, at the cost of 10% extra compute time on average.

² https://github.com/bits-back/bits-back
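
As a quick sanity check (a calculation added here for illustration, not part of the released code), log_2(10,000!) can be evaluated with the log-gamma function:

```python
from math import lgamma, log

bits = lgamma(10_001) / log(2)    # log2(10000!) via lgamma(n + 1) = ln(n!)
print(round(bits))                # total order information, roughly 1.2e5 bits
print(round(bits / 8 / 1024, 1))  # about 14.5 KiB, in line with the ~14 kB quoted above
```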

Table 1: Savings incurred by our method on Binarized MNIST (BMNIST) with BB-ANS, and standard MNIST with lossy WebP. The maximum theoretical savings for both datasets is log(10,000!) bits ≈ 14 kB, which is reached by our method. The last column shows the added mean ± 3 std. compute time from using our method over 100 runs.

                Bits-per-pixel                  Relative savings
  Dataset     Ordered   Unordered (ours)    Theoretical   Actual    Extra compute time
  BMNIST       0.198        0.183               7.6%        7.6%        10% ± 4%
  MNIST        1.031        1.016               1.5%        1.5%        37% ± 8%

Compressing standard MNIST with lossy WebP We showcase our method with WebP, an off-
the-shelf lossy codec, on standard MNIST. WebP is a lossy compression algorithm that outputs a
variable-length sequence of bytes which we call a byte-array. To encode, we first perform invertible
sampling to select an image from the test set. WebP is applied to the image, and the bytes in the
byte-array are encoded sequentially into the ANS state using a uniform distribution. Finally, because
the byte-array is variable in length, the length itself must also be encoded. As in the BB-ANS
experiment, all 10, 000 images are compressed. Results are shown in Table 1. The byte-arrays are all
unique and therefore the maximum theoretical savings in raw bits is also log(10, 000!) bits ≈ 14 kB.
This represents a potential savings of 1.5%, which is achieved by our method, at the cost of 37%
extra compute time on average.

5 Acknowledgements
In preparing this research, authors affiliated with the Vector Institute were funded, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. We would like to thank Tim Vieira, whose blog post “Heaps for incremental computation” inspired the binary search tree data structure³. All plots were made using the SciencePlots package [5].

References
[1] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. “End-to-end optimized image com-
pression”. In: arXiv preprint arXiv:1611.01704 (2016).
[2] Johannes Ballé et al. Variational image compression with a scale hyperprior. 2018. arXiv:
1802.01436 [eess.IV].
[3] Jarek Duda. “Asymmetric Numeral Systems”. In: arXiv:0902.0271 [cs, math] (2009). arXiv:
0902.0271 [cs, math].
[4] Brendan J. Frey. “Bayesian Networks for Pattern Classification, Data Compression, and
Channel Coding”. PhD thesis. University of Toronto, 1997.
[5] John D. Garrett and Hsin-Hsiang Peng. “garrettj403/SciencePlots”. Version 1.0.7. In: (Feb. 2021). DOI: 10.5281/zenodo.4106649. URL: http://doi.org/10.5281/zenodo.4106649.
[6] Jonathan Ho, Evan Lohn, and Pieter Abbeel. Compression with Flows via Local Bits-Back
Coding. 2020. arXiv: 1905.08500 [cs.LG].
[7] Diederik P. Kingma et al. Variational Diffusion Models. 2021. arXiv: 2107.00630 [cs.LG].
[8] Friso H. Kingma, Pieter Abbeel, and Jonathan Ho. “Bit-Swap: Recursive Bits-Back Coding
for Lossless Compression with Hierarchical Latent Variables”. In: International Conference
on Machine Learning. Oct. 2019.
[9] Matthew Muckley et al. NeuralCompression. https://github.com/facebookresearch/NeuralCompression. 2021.
[10] Yangjun Ruan et al. “Improving Lossless Compression Rates via Monte Carlo Bits-Back
Coding”. In: International Conference on Machine Learning. 2021.
[11] Daniel Severo et al. Compressing Multisets with Large Alphabets. 2021. arXiv: 2107.09202
[cs.IT].
³ https://timvieira.github.io/blog/post/2016/11/21/heaps-for-incremental-computation/

[12] Christian Steinruecken. “Compressing Combinatorial Objects”. In: 2016 Data Compression
Conference (DCC). IEEE. 2016, pp. 389–396.
[13] Christian Steinruecken. “Compressing Sets and Multisets of Sequences”. In: IEEE Transactions
on Information Theory 61.3 (2015), pp. 1485–1490.
[14] Christian Steinruecken. “Lossless Data Compression”. PhD thesis. University of Cambridge,
2014.
[15] James Townsend, Thomas Bird, and David Barber. “Practical Lossless Compression with
Latent Variables Using Bits Back Coding”. In: International Conference on Learning Repre-
sentations (ICLR). 2019.
[16] James Townsend and Iain Murray. “Lossless Compression with State Space Models Using Bits
Back Coding”. In: Neural Compression: From Information Theory to Applications – Workshop
at ICLR. 2021.
[17] James Townsend et al. “HiLLoC: Lossless Image Compression with Hierarchical Latent
Variable Models”. In: International Conference on Learning Representations (ICLR). 2020.
[18] L.R. Varshney and V.K. Goyal. “Toward a Source Coding Theory for Sets”. In: 2006 Data
Compression Conference (DCC). IEEE. 2006, pp. 13–22.

6 Appendix

In this section we present experiments on synthetically generated multisets with known source
distribution, multisets of grayscale images with lossy codecs (MNIST), and collections of JSON maps
represented as a multiset of multisets. We used the ANS implementation in the Craystack library [17]
for all experiments.

6.1 Synthetic multisets

Here, synthetically generated multisets are compressed to provide evidence of the computational complexity and optimal compression rate of the method. We grow the alphabet size |A| while sampling from the source in a way that guarantees a fixed number of unique symbols m = 512. The alphabet A is always a subset of ℕ.
For each run, we generate a multiset with m = 512 unique symbols, and use a skewed distribution, sampled from a Dirichlet prior with coefficients α_k = k for k = 1, ..., |A|, as the true data distribution D.
The final compressed size of the multiset and the information content (i.e. Shannon lower bound) assuming the distribution D is used are shown in Figure 3 for different settings of |A|, alongside the total encode plus decode time. Results are averaged over 20 runs, with shaded regions representing the 1%–99% confidence intervals. In general, the new codec compresses a multiset close to its information content for varying alphabet sizes, as can be seen in the left plot.
The total encode plus decode time is unaffected by the alphabet size |A|. As discussed previously, the overall complexity depends on that of coding under the chosen codec with distribution D. Here, the codec does include a logarithmic-time binary search over |A|, but this is implemented efficiently and the alphabet size can be seen to have little effect on overall time. The total time scales linearly with the multiset size |M| (right plot), as expected.

6.2 MNIST with lossy WebP

We implemented compression of a multiset M of grayscale images using the lossy WebP codec. We tested on the MNIST test set, which is composed of |M| = 10,000 distinct grayscale images of handwritten digits, each 28×28 in size. To encode the multiset, we perform the sampling procedure to select an image to compress, as usual. The output of WebP is a prefix-free, variable-length sequence of bytes, which we encode into the ANS state via a sequence of encode steps with a uniform distribution. It is also possible (and faster) to move the WebP output directly into the lower-order bits of the ANS state.
We compared the final compressed length with and without the sampling step; in other words, we compare treating the dataset as a multiset with treating it as an ordered sequence. The savings achieved by using our method are shown in Figure 4. The theoretical limit shown in the left plot is log |M|!, while in the right plot this quantity is divided by the number of bits needed to compress the data sequentially. Note that the maximum savings per symbol, (1/|M|) log_2 |M|! ≈ log_2 |M|, depends only on the size of the multiset. Therefore, when the representation of a symbol requires a large number of bits, the percentage savings are marginal (roughly 1.5% for 10,000 images, in our case). To improve the percentage savings (right plot), one could use a better symbol codec or an adaptive codec which doesn't treat the symbols as independent. However, as mentioned, the savings in raw bits (left plot) would remain the same, as they depend only on the multiset size |M|.
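For concreteness, with |M| = 10,000 the per-symbol saving is (1/10,000) log_2(10,000!) ≈ 11.8 bits per image, regardless of how many bits each compressed image itself occupies.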

6.3 Collection of JSON maps as nested multisets

The method can be nested to compress a multiset of multisets, by performing an additional sampling step that first chooses which inner multiset to compress. In this section we show results for a collection of JSON maps M = {J_1, ..., J_|M|}, where each map J_i = {(k_1, v_1), ..., (k_|J_i|, v_|J_i|)} is itself a multiset of key-value pairs. To compress, a depth-first approach is taken. First, some J ∈ M is sampled without replacement. Key-value pairs are then sampled from J, also without replacement, and compressed to the ANS state until J is depleted. This procedure repeats until the outer multiset M is empty.
Assuming all maps are unique, the maximum number of savable bits is

log |M|! + Σ_{i=1}^{|M|} log |J_i|!.    (4)

Figure 4: Left: savings in raw bits. Right: percentage savings. Rate savings due to using our method to compress a multiset instead of treating it as an ordered sequence. Savings are close to the theoretical limit in both the MNIST and JSON experiments. The symbols are bytes output by lossy WebP for MNIST, and UTF-8 bytes for the JSON maps. A uniform distribution over bytes is used to encode with ANS.
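
As an illustration of eq. (4) (a small helper added here, not taken from the paper's code), the maximum savable bits can be computed from the map sizes alone:

```python
from math import lgamma, log

def log2_factorial(n: int) -> float:
    """log2(n!) computed via the log-gamma function."""
    return lgamma(n + 1) / log(2)

def max_savable_bits(map_sizes):
    """Eq. (4): log2 |M|! plus the sum over maps of log2 |J_i|!."""
    return log2_factorial(len(map_sizes)) + sum(log2_factorial(s) for s in map_sizes)

# e.g. a hypothetical collection of 1,000 JSON maps with 20 key-value pairs each
print(max_savable_bits([20] * 1000))
```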

The collection of JSON maps is composed of public GitHub user data taken from a release of the Zstandard project⁴. All key-value elements were cast to strings, for simplicity, and are encoded as UTF-8 bytes using a uniform distribution. Figure 4 shows the number and percentage of saved bits. The theoretical limit curve shows the maximum savable bits with nesting, i.e. eq. (4). Note that, without nesting, the theoretical limit would be the same as for MNIST (i.e. log |M|!). The method gets very close to the maximum possible savings for various numbers of JSON maps. The rate savings were small, but could be improved by using a better technique to encode the UTF-8 strings.
Assuming J represents the JSON map in M with the largest number of key-value pairs, and that the time complexity of comparing two JSON maps is O(|J|), the complexity of the four BST operations for the outer multiset is O(|M| |J| log |M|). The overall expected time complexity for both encoding and decoding is therefore O(|M| |J| (log |M| + log |J|)). We believe it may be possible to reduce this by performing the inner sampling steps in parallel, or by speeding up the JSON map comparisons.

⁴ https://github.com/facebook/zstd/releases/tag/v1.1.3
