Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it
speeds up data transfer across the network or to or from disk. When dealing with large volumes of
data, both of these savings can be significant, so it pays to consider carefully how to use
compression in Hadoop/MapReduce. There are many different compression formats, tools, and
algorithms, each with different characteristics; the most commonly used in Hadoop are gzip,
bzip2, LZO, LZ4, and Snappy.
All compression algorithms exhibit a space/time trade-off: faster compression and decompression
speeds usually come at the expense of smaller space savings. Most of the tools give some control
over this trade-off at compression time by offering nine options: -1 means optimize for speed,
and -9 means optimize for space. For example, the following command creates a compressed file
file.gz using the fastest compression method:
gzip -1 file
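Conversely, gzip -9 file optimizes for space at the expense of speed; the default level (-6) is a
compromise between the two.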
The different tools have very different compression characteristics. Gzip is a general-purpose
compressor and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively
than gzip, but is slower. Bzip2’s decompression speed is faster than its compression speed, but it
is still slower than the other formats. LZO, LZ4 and Snappy, on the other hand, all optimize for
speed and are around an order of magnitude faster than gzip, but compress less effectively.
Snappy and LZ4 are also significantly faster than LZO for decompression.
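In Hadoop, each of these formats is represented by a codec class implementing the
CompressionCodec interface, such as org.apache.hadoop.io.compress.GzipCodec or SnappyCodec. As a
minimal sketch of how a codec is used directly (the class name StreamCompressor is purely
illustrative), the following program compresses standard input to standard output with whichever
codec class is named on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        // Fully qualified codec class name, e.g. org.apache.hadoop.io.compress.GzipCodec
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec =
            (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Wrap stdout in a compressing stream and pump stdin through it.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish(); // flush the compressor without closing the underlying stream
    }
}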
The four most widely used compression formats in Hadoop are as follows:
1) GZIP
i. Provides a high compression ratio.
ii. Uses significant CPU resources to compress and decompress data.
iii. A good choice for cold data that is accessed infrequently.
iv. Compressed data is not splittable and hence is not suitable for MapReduce jobs.
2) BZIP2
i. Provides a high compression ratio (even higher than gzip).
ii. Takes a long time to compress and decompress data.
iii. A good choice for cold data that is accessed infrequently.
iv. Compressed data is splittable.
v. Even though the compressed data is splittable, it is generally not suited for MapReduce jobs
because of the high compression/decompression time.
3) LZO
i. Provides a low compression ratio.
ii. Very fast at compressing and decompressing data.
iii. Compressed data is splittable if it is indexed in a preprocessing step.
iv. Well suited to MapReduce jobs because of properties (ii) and (iii).
4) SNAPPY
i. Provides an average compression ratio.
ii. Aimed at very fast compression and decompression.
iii. Compressed data is not splittable when applied to a plain file such as a .txt file.
iv. Generally used to compress container file formats such as Avro data files and SequenceFiles,
because the blocks inside a compressed container file can still be split (see the sketch after
this list).
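Putting property (iv) into practice, here is a hedged sketch of a MapReduce driver that writes
block-compressed SequenceFile output using the Snappy codec; the class name SnappySeqFileJob and
the reliance on the identity map/reduce defaults are assumptions of this sketch, not a fixed
recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SnappySeqFileJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "snappy sequencefile output");
        job.setJarByClass(SnappySeqFileJob.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Write SequenceFile output; Snappy compresses each block of records,
        // and the container format keeps the output splittable.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}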
Data Serialization in Hadoop
Data serialization is the process of converting the data objects in complex data structures into
a byte stream for storage, transfer, and distribution on physical devices. Once the serialized
data has been transmitted, the reverse process of recreating objects from the byte sequence is
called deserialization.
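Hadoop's native serialization format is Writable. As a small sketch of this round trip (the class
WritableRoundTrip and its helper method names are illustrative), an IntWritable can be serialized
into a byte array and then deserialized back into an object:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableRoundTrip {
    // Serialization: write the object's fields to a byte stream.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    // Deserialization: rebuild the object from the byte sequence.
    public static void deserialize(Writable writable, byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
    }

    public static void main(String[] args) throws IOException {
        IntWritable original = new IntWritable(163);
        byte[] bytes = serialize(original); // an int serializes to 4 bytes
        IntWritable restored = new IntWritable();
        deserialize(restored, bytes);
        System.out.println(restored.get()); // prints 163
    }
}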