File Formats in Big Data

The document discusses different file formats for storing data in Hadoop including text files, sequence files, Avro, and column-oriented formats like ORC and Parquet. Key factors for choosing a file format include the Hadoop distribution used, whether the data schema will evolve, processing and query requirements, and storage and compression needs. Different formats provide benefits like faster read/write times, splittable files, schema evolution support, and compression. The document recommends sequence files for intermediate MapReduce data, ORC for query performance with Hive, Avro if the schema will change, and CSV for extracting to databases.


File Formats

By: Dr Meghna Sharma


How to choose a file format?
There are three types of performance to consider:
• Write performance -- how fast the data can be written.
• Partial read performance -- how fast you can read individual columns within a file.
• Full read performance -- how fast you can read every data element in a file.
Key Factors
Each file format is optimized by purpose. Your choice of format is driven by your use case and environment. Here are the key factors to consider:
• Hadoop Distribution -- Cloudera and Hortonworks support/favor different formats
• Schema Evolution -- Will the structure of your data evolve?
• Processing Requirements -- Will you be crunching the data, and with what tools?
• Read/Query Requirements -- Will you be using SQL on Hadoop? Which engine?
• Extract Requirements -- Will you be extracting the data from Hadoop for import into an external database engine or other platform?
• Storage Requirements -- Is data volume a significant factor? Will you get significantly more bang for your storage buck through compression?
Why are storage formats important?
A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are exacerbated by the difficulties of managing large datasets, such as evolving schemas or storage constraints.
The various Hadoop file formats have evolved as a way to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
• Faster read times
• Faster write times
• Splittable files (so you don’t need to read the whole file, just a part of it)
• Schema evolution support (allowing you to change the fields in a dataset)
• Advanced compression support (compress the files with a compression codec without sacrificing these features)
Some file formats are designed for general use (like MapReduce or Spark), others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice.
Different File Formats
A storage format is just a way to define how information is stored in a file. This is usually indicated by the extension of the file.
• For example, images have several common storage formats: PNG, JPG, and GIF are widely used. All three of those formats can store the same image, but each is a distinct file format. JPG files, for instance, tend to be smaller, but store a compressed version of the image that is of lower quality.
• When dealing with Hadoop's filesystem, not only do you have all of these traditional storage formats available to you (you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
• Some common storage formats for Hadoop include:
1. Text/CSV Files
2. JSON Records
3. Avro Files
4. Sequence Files
5. RC Files
6. ORC Files
7. Parquet Files
Text-Based Files
• Simple text-based files are common in the non-Hadoop world, and they’re
super common in the Hadoop world too.
• Data is laid out in lines, with each line being a record. Lines are terminated
by a newline character \n in the typical unix fashion.
• Text files are inherently splittable (just split on \n characters!), but if you want to compress them you'll have to use a file-level compression codec that supports splitting, such as BZIP2.
• Because these files are just text files you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this can waste space with needlessly repeated field names, it is a simple way to start using structured data in HDFS (see the sketch below).
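
A minimal sketch of working with line-oriented text data from Spark, assuming a PySpark environment; the HDFS paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-formats").getOrCreate()

# Plain CSV: one record per line, split on \n, optional header row.
csv_df = spark.read.option("header", True).csv("hdfs:///data/events.csv")

# Newline-delimited JSON: each line is a complete JSON document.
json_df = spark.read.json("hdfs:///data/events.jsonl")

# Spark infers a schema from the JSON field names on read.
json_df.printSchema()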
Sequence Files
• Sequence files were originally designed for MapReduce, so the integration is smooth. They encode a key and
a value for each record and nothing more. Records are stored in a binary format that is smaller than a text-
based format would be. Like text files, the format does not encode the structure of the keys and values, so if
you make schema migrations they must be additive.

• Sequence files by default use Hadoop’s Writable interface in order to figure out how to serialize and
deserialize classes to the file.

• Typically if you need to store complex data in a sequence file you do so in the value part while encoding the
id in the key. The problem with this is that if you add or change fields in your Writable class it will not be
backwards compatible with the data stored in the sequence file.

• One benefit of sequence files is that they support block-level compression, so you can compress the contents
of the file while also maintaining the ability to split the file into segments for multiple map tasks.

• Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and I think represent the easiest next step away from text files (a small example follows).
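
A minimal sketch of writing and reading key/value pairs as a sequence file with PySpark's RDD API; the output path and the choice of BZip2 codec are illustrative assumptions:

from pyspark import SparkContext

sc = SparkContext(appName="sequence-file-demo")

# Each record is a (key, value) pair; Spark converts them to Hadoop
# Writables before writing the binary sequence file.
pairs = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")])

# Block-level compression keeps the file splittable across map tasks.
pairs.saveAsSequenceFile(
    "hdfs:///tmp/users-seq",
    compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec",
)

# Reading back yields the same (key, value) pairs.
print(sc.sequenceFile("hdfs:///tmp/users-seq").collect())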
Avro
• Avro is an opinionated format which understands that data stored in HDFS is usually not a
simple key/value combo like Int/String. The format encodes the schema of its contents
directly in the file which allows you to store complex objects natively.

• Honestly, Avro is not really a file format, it’s a file format plus a serialization and
deserialization framework. With regular old sequence files you can store complex objects
but you have to manage the process. Avro handles this complexity whilst providing other
tools to help manage data over time.

• Avro is a well thought out format which defines file data schemas in JSON (for interoperability), allows for schema evolution (remove a column, add a column), and supports multiple serialization/deserialization use cases. It also supports block-level compression. For most Hadoop-based use cases Avro is a really good choice (a small sketch follows).
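
A minimal sketch of the key idea that the JSON schema travels inside the file, using the fastavro library (an assumption; your stack's Avro tooling may differ) and a hypothetical User record:

from fastavro import parse_schema, reader, writer

# The schema is defined in JSON and embedded in the Avro file itself.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        # A default value makes this field safe to add later
        # (schema evolution: older data still reads cleanly).
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

records = [{"id": 1, "name": "alice", "email": None}]

with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Readers recover both the schema and the records from the file.
with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)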
Columnar File Formats
• Instead of just storing rows of data adjacent to one another you also store column values adjacent to each other.
So datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing
framework just needs access to a subset of data that is stored on disk as it can access all values of a single
column very quickly without reading whole records.

• One huge benefit of column-oriented file formats is that data in the same column tends to be compressed together, which can yield some massive storage optimizations (as data in the same column tends to be similar).

• If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of
your application, but frankly if you have an application that usually needs entire rows of data then the columnar
formats may actually be a detriment to performance due to the increased network activity required.

• Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read just segments of records rather than whole rows (full-row reads are more common in MapReduce). A short Parquet/ORC sketch follows.
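
A minimal sketch of the column-pruning benefit with PySpark; the file paths and the column names (region, amount) are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-demo").getOrCreate()

df = spark.read.option("header", True).csv("hdfs:///data/sales.csv")

# Rewrite the same data as Parquet: values are grouped and compressed
# per column rather than per row.
df.write.mode("overwrite").parquet("hdfs:///data/sales_parquet")

# Selecting two columns reads only those column chunks from disk,
# not the whole record.
subset = spark.read.parquet("hdfs:///data/sales_parquet").select("region", "amount")
subset.show()

# ORC works the same way through the DataFrame API.
df.write.mode("overwrite").orc("hdfs:///data/sales_orc")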
Which Format to Choose?
• If you are storing intermediate data between MapReduce jobs -- Sequence file
• If query performance against the data is most important -- ORC (HortonWorks/Hive) or Parquet (Cloudera/Impala)
• If your schema is going to change over time -- Avro
• If you are going to extract data from Hadoop to bulk load into a database -- CSV (a small export sketch follows)
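
A minimal sketch of the CSV-extract case with PySpark; the source path, export path, and use of coalesce(1) are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-export").getOrCreate()

df = spark.read.parquet("hdfs:///data/sales_parquet")

# coalesce(1) produces a single part file, which is easier to hand to an
# external bulk-load tool (at the cost of writing through one task).
(df.coalesce(1)
   .write.option("header", True)
   .mode("overwrite")
   .csv("hdfs:///export/sales_csv"))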
