File Formats in Big Data
• Sequence files by default use Hadoop’s Writable interface to figure out how to serialize and deserialize
classes to and from the file.
• Typically, if you need to store complex data in a sequence file, you do so in the value part while encoding the
id in the key. The problem with this is that if you add or change fields in your Writable class, it will not be
backwards compatible with the data already stored in the sequence file.
• One benefit of sequence files is that they support block-level compression, so you can compress the contents
of the file while also maintaining the ability to split the file into segments for multiple map tasks.
• Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and I think they
represent the easiest next step away from text files. A short write/read sketch follows below.
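To make the above concrete, here is a minimal write/read sketch, assuming a standard Hadoop client on the classpath; the /tmp path and the IntWritable/Text key/value pairing are illustrative assumptions, not anything the format mandates.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq"); // hypothetical path

        // Write id -> value pairs. BLOCK compression compresses batches of
        // records at once while keeping the file splittable across map tasks.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }

        // Read the pairs back; each class's Writable implementation handles
        // the actual (de)serialization.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}

Swap Text for your own Writable class to store complex values, and the compatibility caveat above applies: readFields() has to match whatever write() produced when the file was written.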
Avro
• Avro is an opinionated format which understands that data stored in HDFS is usually not a
simple key/value combo like Int/String. The format encodes the schema of its contents
directly in the file which allows you to store complex objects natively.
• Honestly, Avro is not really just a file format; it’s a file format plus a serialization and
deserialization framework. With regular old sequence files you can store complex objects,
but you have to manage the process yourself. Avro handles this complexity whilst providing
other tools to help manage data over time.
• Avro is a well thought out format which defines file data schemas in JSON (for
interoperability), allows for schema evolution (removing a column, adding a column), and
supports multiple serialization/deserialization use cases. It also supports block-level
compression. For most Hadoop-based use cases Avro is a really good choice; a minimal
write/read sketch follows below.
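Below is a minimal sketch of the generic write/read path, assuming the avro library on the classpath; the User schema and file name are made-up examples. Note that the reader needs no schema up front, because the writer embeds the JSON schema in the file header.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
    // The schema is plain JSON, which is exactly what gets embedded in the
    // file header for interoperability.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
        "{\"name\":\"id\",\"type\":\"int\"}," +
        "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("users.avro"); // hypothetical path

        // Write records; setCodec enables block-level compression.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(6));
            writer.create(schema, file);
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1);
            user.put("name", "Ada");
            writer.append(user);
        }

        // Read records back; the reader recovers the writer's schema from
        // the file itself, which is what makes schema evolution workable.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}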
Columnar File Formats
• Instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other,
so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing
framework just needs access to a subset of the data stored on disk, as it can read all values of a single column
very quickly without scanning whole records.
• One huge benefit of column-oriented file formats is that data in the same column tends to be compressed
together, which can yield massive storage savings (since values within a column tend to be similar).
• If you’re chopping and cutting up datasets regularly, these formats can be very beneficial to the speed of
your application. But frankly, if your application usually needs entire rows of data, the columnar formats
may actually be a detriment to performance due to the increased network activity required.
• Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read
only segments of records rather than whole records (the more common pattern in MapReduce). A
column-projection sketch follows below.
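As a concrete illustration of reading a subset of columns, here is a sketch using Parquet’s Avro bindings, assuming parquet-avro and a Hadoop client on the classpath; the Event schema, its field names, and the path are all hypothetical. ORC offers the equivalent capability through its own reader options.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetProjectionDemo {
    public static void main(String[] args) throws Exception {
        Schema full = SchemaBuilder.record("Event").fields()
            .requiredInt("id")
            .requiredString("payload")
            .requiredLong("ts")
            .endRecord();
        Path path = new Path("/tmp/events.parquet"); // hypothetical path

        // Write a few full rows; Parquet lays the values out column by column
        // within each row group.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(path).withSchema(full).build()) {
            for (int i = 0; i < 3; i++) {
                GenericRecord rec = new GenericData.Record(full);
                rec.put("id", i);
                rec.put("payload", "row " + i);
                rec.put("ts", System.currentTimeMillis());
                writer.write(rec);
            }
        }

        // Read back only the "id" column: the projection schema tells Parquet
        // which column chunks to materialize, so "payload" and "ts" are
        // skipped entirely rather than read and discarded.
        Schema idOnly = SchemaBuilder.record("Event").fields()
            .requiredInt("id")
            .endRecord();
        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, idOnly);
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(path).withConf(conf).build()) {
            GenericRecord rec;
            while ((rec = reader.read()) != null) {
                System.out.println(rec.get("id"));
            }
        }
    }
}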
• If you are storing intermediate data between MapReduce jobs: sequence files
• If query performance against the data is most important: ORC (Hortonworks/Hive) or Parquet (Cloudera/Impala)
• If your schema is going to change over time: Avro
• If you are going to extract data from Hadoop to bulk-load into a database: CSV