
Compression in Hadoop:

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to and from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to consider carefully how to use compression in Hadoop / MapReduce. There are many different compression formats, tools and algorithms, each with different characteristics. The table below lists some of the more common ones that can be used with Hadoop.
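
Format     Tool     Algorithm    Filename extension    Splittable?
DEFLATE    N/A      DEFLATE      .deflate              No
gzip       gzip     DEFLATE      .gz                   No
bzip2      bzip2    bzip2        .bz2                  Yes
LZO        lzop     LZO          .lzo                  No (yes if indexed)
LZ4        N/A      LZ4          .lz4                  No
Snappy     N/A      Snappy       .snappy               No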

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The tools listed in the above table typically give some control over this trade-off at compression time by offering nine compression levels: -1 means optimize for speed, and -9 means optimize for space. For example, the following command creates a compressed file file.gz using the fastest compression method:

gzip -1 file
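
Conversely, the following optimizes for space at the expense of speed:

gzip -9 file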

The different tools have very different compression characteristics. Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats. LZO, LZ4 and Snappy, on the other hand, all optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy and LZ4 are also significantly faster than LZO for decompression.
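
In MapReduce code, codecs are used through Hadoop's CompressionCodec interface. Below is a minimal sketch that compresses standard input to standard output with GzipCodec; the class name StreamCompressor is illustrative, and any codec on the classpath can be substituted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the codec reflectively, as Hadoop itself does.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Wrap stdout in a compressing stream and copy stdin through it.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish(); // flush the compressor without closing the underlying stream
    }
}

Assuming the class is on the classpath, it can be checked from the shell with something like: echo "Text" | hadoop StreamCompressor | gunzip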

The four most widely used compression formats in Hadoop are as follows:
1) GZIP
i. Provides a high compression ratio.
ii. Uses high CPU resources to compress and decompress data.
iii. Good choice for cold data, which is accessed infrequently.
iv. Compressed data is not splittable and hence not suitable for MapReduce jobs.
2) BZIP2
i. Provides a high compression ratio (even higher than GZIP).
ii. Takes a long time to compress and decompress data.
iii. Good choice for cold data, which is accessed infrequently.
iv. Compressed data is splittable.
v. Even though the compressed data is splittable, it is generally not suited for MR jobs
because of the high compression/decompression time.
3) LZO
i. Provides a low compression ratio.
ii. Very fast at compressing and decompressing data.
iii. Compressed data is splittable if an appropriate indexing algorithm is used.
iv. Best suited for MR jobs because of properties (ii) and (iii).
4) SNAPPY
i. Provides an average compression ratio.
ii. Aimed at very fast compression and decompression.
iii. Compressed data is not splittable if used with a normal file type such as .txt.
iv. Generally used to compress container file formats like Avro and SequenceFile, because
the files inside a compressed container file can be split (see the job configuration sketch below).
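
As a concrete illustration of point 4(iv), here is a minimal sketch of how a job might be configured to write Snappy-compressed, block-compressed SequenceFiles, and to compress intermediate map output as well; the class and job names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SnappyJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "snappy-sequencefile"); // illustrative job name
        // Block-compressed SequenceFiles remain splittable even with Snappy.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        return job;
    }
}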

Data serialization is the process of converting the data objects present in complex data structures into a byte stream for storage, transfer and distribution purposes on physical devices. Once the serialized data is transmitted, the reverse process of creating objects from the byte sequence is called deserialization.
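
Hadoop's own Writable types follow exactly this pattern. Below is a minimal sketch of a serialize/deserialize round trip with IntWritable; the class name and value are illustrative:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialization: write the object's fields to a byte stream.
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytesOut));
        byte[] serialized = bytesOut.toByteArray(); // 4 bytes for an int

        // Deserialization: rebuild an object from the byte sequence.
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(serialized)));
        System.out.println(restored.get()); // prints 163
    }
}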

Serialization formats in Hadoop:
• XML
• CSV
• YAML
• JSON
• BSON
• MessagePack
• Thrift
• Protocol Buffers
• Avro
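
Of these, Avro is widely used with Hadoop because its binary encoding is compact and its container files are splittable. Below is a minimal sketch of serializing one record with Avro's generic API; the schema and field values are illustrative:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative record schema with a single string field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\","
          + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "hemant");

        // Encode the record into a compact binary byte stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        System.out.println(out.size() + " bytes");
    }
}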
