Hadoop I/O Explanation
Definition
Hadoop I/O refers to the framework and mechanisms used for data input and output operations in the
Hadoop ecosystem. Efficient I/O is critical in distributed systems like Hadoop to ensure scalable and reliable
data processing. Hadoop provides a set of libraries and utilities to handle various data formats, serialization, compression, and data integrity.
Data Integrity
Data integrity in Hadoop ensures that the data being read or written is accurate and uncorrupted. Hadoop
uses checksums to verify the correctness of data blocks. Each file in HDFS is divided into blocks, and for
every block, a checksum is calculated and stored separately. When the data is read, the checksum is
recalculated and compared to the stored value. If a mismatch occurs, the system attempts to read the block from another replica and reports the corrupt copy so it can be re-replicated.
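As a minimal sketch of how checksum verification surfaces through the FileSystem API: LocalFileSystem is the checksummed variant of the local file system, so reads recompute checksums against the hidden .crc files and fail when they disagree with the stored values. The file path below is hypothetical.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // LocalFileSystem is checksummed: writes create a hidden .crc file,
        // and reads recompute the checksum and compare it to the stored value.
        LocalFileSystem fs = FileSystem.getLocal(conf);
        Path file = new Path("/tmp/example.txt"); // hypothetical path

        try (InputStream in = fs.open(file)) {
            // A mismatch between the recomputed and stored checksum makes the
            // read fail with org.apache.hadoop.fs.ChecksumException.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}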
Local File System
The Hadoop Local File System is a non-distributed file system used primarily for storing temporary data on a
single machine (often intermediate job outputs). It is not suitable for large-scale distributed data storage. It is
typically used by MapReduce tasks to read input splits and write temporary output before transferring to
HDFS. Although it does not offer the replication and fault tolerance of HDFS, it provides fast local read/write access.
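The sketch below illustrates this pattern under stated assumptions: a job has staged output in the local file system and then copies it into HDFS with copyFromLocalFile. The paths and the fs.defaultFS setting are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalToHdfsExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster, e.g. hdfs://namenode:8020.
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // Hypothetical paths: locally staged temporary output and its HDFS destination.
        Path localTmp = new Path("file:///tmp/job-output.txt");
        Path hdfsDest = new Path("/data/job-output.txt");

        // Copy the local file into HDFS, where it gains replication and fault tolerance.
        hdfs.copyFromLocalFile(localTmp, hdfsDest);
    }
}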
Compression
Compression in Hadoop reduces the size of data stored and transmitted across the network, improving
performance and reducing disk I/O. Hadoop supports various compression codecs such as Gzip, Bzip2, LZO, and Snappy. Compression can be applied at several points in a job:
- Input Compression: Reduces storage and bandwidth when reading input files.
- Intermediate Compression: Shrinks the map output shuffled between map and reduce tasks.
- Output Compression: Reduces the size of the final job output written to HDFS.
Proper use of compression increases throughput but may add CPU overhead during
compression/decompression.
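As an illustration rather than a definitive recipe, the following sketch writes a Gzip-compressed file through Hadoop's CompressionCodec interface; the output path is hypothetical.

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the Gzip codec through Hadoop's reflection helper.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Hypothetical output path; the .gz extension matches the codec.
        Path out = new Path("/tmp/records.txt.gz");

        // Wrap the raw output stream so bytes are compressed as they are written.
        try (OutputStream os = codec.createOutputStream(fs.create(out))) {
            os.write("hello, compressed hadoop\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}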
Serialization
Serialization is the process of converting data structures or objects into a format that can be stored or
transmitted and reconstructed later. Hadoop relies on serialization for transferring data between nodes in a
MapReduce job. Writable is Hadoop's native serialization format, providing efficient, compact binary
representations. Serialization formats used in Hadoop should be fast, compact, and able to remain compatible as schemas evolve across versions. Commonly used serialization frameworks include:
- Writable (native)
- Avro
- Protocol Buffers
- Thrift
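To make the Writable idea concrete, here is a minimal round trip that serializes a Text key and an IntWritable value to a byte buffer and reads them back; fields must be read in the same order they were written.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialize a (key, value) pair into a compact binary buffer.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        new Text("page-views").write(out);
        new IntWritable(42).write(out);

        // Deserialize by reading the fields back in the same order they were written.
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        Text key = new Text();
        IntWritable value = new IntWritable();
        key.readFields(in);
        value.readFields(in);

        System.out.println(key + " = " + value.get());
    }
}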
Avro
Avro is a serialization framework developed within the Hadoop ecosystem, used for compact, fast, binary
data serialization. It uses JSON for defining schemas and supports schema evolution, making it highly suitable for long-lived data whose structure changes over time.
Features of Avro:
- Compact, fast, binary data format.
- Schemas are defined in JSON and stored alongside the data, so files are self-describing.
- Supports schema evolution, allowing readers and writers to use different schema versions.
- Facilitates big data exchange between systems using different programming languages.
Avro is often used with Kafka, Hive, and Pig, as well as for storing log data and in data lake solutions.
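As a small, self-contained sketch (the LogEvent schema and its field names are made up for illustration), the following defines an Avro schema in JSON and encodes a GenericRecord into Avro's compact binary form.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroEncodeExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical record schema, defined in JSON as Avro requires.
        String schemaJson = "{"
                + "\"type\":\"record\",\"name\":\"LogEvent\","
                + "\"fields\":["
                + "  {\"name\":\"level\",\"type\":\"string\"},"
                + "  {\"name\":\"timestamp\",\"type\":\"long\"}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord event = new GenericData.Record(schema);
        event.put("level", "INFO");
        event.put("timestamp", 1700000000000L);

        // Encode the record to Avro's compact binary representation.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(event, encoder);
        encoder.flush();
        System.out.println("Encoded " + out.size() + " bytes");
    }
}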