Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it
speeds up data transfer across the network or to or from disk. When dealing with large volumes of
data, both of these savings can be significant, so it pays to consider carefully how to use
compression in Hadoop/MapReduce. There are many different compression formats, tools, and
algorithms, each with different characteristics; the most commonly used in Hadoop are gzip,
bzip2, LZO, LZ4, and Snappy.
All compression algorithms exhibit a space/time trade-off: faster compression and decompression
speeds usually come at the expense of smaller space savings. Most of the tools give some control
over this trade-off at compression time by offering nine options: -1 means optimize for speed,
and -9 means optimize for space. For example, the following command creates a compressed file
file.gz using the fastest compression method:
gzip -1 file
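Conversely, gzip -9 file optimizes for space at the expense of speed; the default level (-6) is a
compromise between the two.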
The different tools have very different compression characteristics. Gzip is a general-purpose
compressor and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively
than gzip, but is slower. Bzip2’s decompression speed is faster than its compression speed, but it
is still slower than the other formats. LZO, LZ4 and Snappy, on the other hand, all optimize for
speed and are around an order of magnitude faster than gzip, but compress less effectively.
Snappy and LZ4 are also significantly faster than LZO for decompression.
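In Hadoop, each of these formats is represented by a codec class implementing the
CompressionCodec interface, such as org.apache.hadoop.io.compress.GzipCodec or SnappyCodec. As a
minimal sketch of how a codec is used directly (the class name StreamCompressor is purely
illustrative), the following program compresses standard input to standard output with whichever
codec class is named on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        // Fully qualified codec class name, e.g. org.apache.hadoop.io.compress.GzipCodec
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec =
            (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Wrap stdout in a compressing stream and pump stdin through it.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish(); // flush the compressor without closing the underlying stream
    }
}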
The four most widely used compression formats in Hadoop are as follows:
1) GZIP
i. Provides a high compression ratio.
ii. Uses significant CPU resources to compress and decompress data.
iii. A good choice for cold data that is accessed infrequently.
iv. Compressed data is not splittable and hence is not suitable for MapReduce jobs.
2) BZIP2
i. Provides a high compression ratio (even higher than gzip).
ii. Takes a long time to compress and decompress data.
iii. A good choice for cold data that is accessed infrequently.
iv. Compressed data is splittable.
v. Even though the compressed data is splittable, it is generally not suited for MapReduce jobs
because of the high compression/decompression time.
3) LZO
i. Provides a low compression ratio.
ii. Very fast at compressing and decompressing data.
iii. Compressed data is splittable if it is indexed in a preprocessing step.
iv. Well suited to MapReduce jobs because of properties (ii) and (iii).
4) SNAPPY
i. Provides an average compression ratio.
ii. Aimed at very fast compression and decompression.
iii. Compressed data is not splittable when applied to a plain file such as a .txt file.
iv. Generally used to compress container file formats such as Avro data files and SequenceFiles,
because the blocks inside a compressed container file can still be split (see the sketch after
this list).
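Putting property (iv) into practice, here is a hedged sketch of a MapReduce driver that writes
block-compressed SequenceFile output using the Snappy codec; the class name SnappySeqFileJob and
the reliance on the identity map/reduce defaults are assumptions of this sketch, not a fixed
recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SnappySeqFileJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "snappy sequencefile output");
        job.setJarByClass(SnappySeqFileJob.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Write SequenceFile output; Snappy compresses each block of records,
        // and the container format keeps the output splittable.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}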
Data Serialization in Hadoop
Data serialization is the process of converting the data objects in complex data structures into
a byte stream for storage, transfer, and distribution on physical devices. Once the serialized
data has been transmitted, the reverse process of recreating objects from the byte sequence is
called deserialization.
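Hadoop's native serialization format is Writable. As a small sketch of this round trip (the class
WritableRoundTrip and its helper method names are illustrative), an IntWritable can be serialized
into a byte array and then deserialized back into an object:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableRoundTrip {
    // Serialization: write the object's fields to a byte stream.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    // Deserialization: rebuild the object from the byte sequence.
    public static void deserialize(Writable writable, byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
    }

    public static void main(String[] args) throws IOException {
        IntWritable original = new IntWritable(163);
        byte[] bytes = serialize(original); // an int serializes to 4 bytes
        IntWritable restored = new IntWritable();
        deserialize(restored, bytes);
        System.out.println(restored.get()); // prints 163
    }
}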