File Formats in Big Data
• Sequence files by default use Hadoop’s Writable interface to figure out how to serialize and deserialize
classes to and from the file.
• Typically, if you need to store complex data in a sequence file, you do so in the value part while encoding the
id in the key. The problem with this is that if you add or change fields in your Writable class, it will not be
backwards compatible with the data already stored in the sequence file.
• One benefit of sequence files is that they support block-level compression, so you can compress the contents
of the file while also maintaining the ability to split the file into segments for multiple map tasks.
• Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and I think they
represent the easiest next step away from text files. A short write/read sketch follows below.
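To make the above concrete, here is a minimal write/read sketch, assuming a standard Hadoop client on the classpath; the /tmp path and the IntWritable/Text key/value pairing are illustrative assumptions, not anything the format mandates.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq"); // hypothetical path

        // Write id -> value pairs. BLOCK compression compresses batches of
        // records at once while keeping the file splittable across map tasks.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }

        // Read the pairs back; each class's Writable implementation handles
        // the actual (de)serialization.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}

Swap Text for your own Writable class to store complex values, and the compatibility caveat above applies: readFields() has to match whatever write() produced when the file was written.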
Avro
• Avro is an opinionated format which understands that data stored in HDFS is usually not a
simple key/value combo like Int/String. The format encodes the schema of its contents
directly in the file which allows you to store complex objects natively.
• Honestly, Avro is not really just a file format; it’s a file format plus a serialization and
deserialization framework. With regular old sequence files you can store complex objects,
but you have to manage the process yourself. Avro handles this complexity whilst providing
other tools to help manage data over time.
• Avro is a well thought out format which defines file data schemas in JSON (for
interoperability), allows for schema evolution (removing a column, adding a column), and
supports multiple serialization/deserialization use cases. It also supports block-level
compression. For most Hadoop-based use cases Avro is a really good choice; a minimal
write/read sketch follows below.
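Below is a minimal sketch of the generic write/read path, assuming the avro library on the classpath; the User schema and file name are made-up examples. Note that the reader needs no schema up front, because the writer embeds the JSON schema in the file header.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
    // The schema is plain JSON, which is exactly what gets embedded in the
    // file header for interoperability.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
        "{\"name\":\"id\",\"type\":\"int\"}," +
        "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("users.avro"); // hypothetical path

        // Write records; setCodec enables block-level compression.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(6));
            writer.create(schema, file);
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1);
            user.put("name", "Ada");
            writer.append(user);
        }

        // Read records back; the reader recovers the writer's schema from
        // the file itself, which is what makes schema evolution workable.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}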
Columnar File Formats
• Instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other,
so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing
framework just needs access to a subset of the data stored on disk, as it can read all values of a single column
very quickly without scanning whole records.
• One huge benefit of column-oriented file formats is that data in the same column tends to be compressed
together, which can yield massive storage savings (since values within a column tend to be similar).
• If you’re chopping and cutting up datasets regularly, these formats can be very beneficial to the speed of
your application. But frankly, if your application usually needs entire rows of data, the columnar formats
may actually be a detriment to performance due to the increased network activity required.
• Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read
only segments of records rather than whole records (the more common pattern in MapReduce). A
column-projection sketch follows below.
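As a concrete illustration of reading a subset of columns, here is a sketch using Parquet’s Avro bindings, assuming parquet-avro and a Hadoop client on the classpath; the Event schema, its field names, and the path are all hypothetical. ORC offers the equivalent capability through its own reader options.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetProjectionDemo {
    public static void main(String[] args) throws Exception {
        Schema full = SchemaBuilder.record("Event").fields()
            .requiredInt("id")
            .requiredString("payload")
            .requiredLong("ts")
            .endRecord();
        Path path = new Path("/tmp/events.parquet"); // hypothetical path

        // Write a few full rows; Parquet lays the values out column by column
        // within each row group.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(path).withSchema(full).build()) {
            for (int i = 0; i < 3; i++) {
                GenericRecord rec = new GenericData.Record(full);
                rec.put("id", i);
                rec.put("payload", "row " + i);
                rec.put("ts", System.currentTimeMillis());
                writer.write(rec);
            }
        }

        // Read back only the "id" column: the projection schema tells Parquet
        // which column chunks to materialize, so "payload" and "ts" are
        // skipped entirely rather than read and discarded.
        Schema idOnly = SchemaBuilder.record("Event").fields()
            .requiredInt("id")
            .endRecord();
        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, idOnly);
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(path).withConf(conf).build()) {
            GenericRecord rec;
            while ((rec = reader.read()) != null) {
                System.out.println(rec.get("id"));
            }
        }
    }
}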
• If you are storing intermediate data between MapReduce jobs: sequence files
• If query performance against the data is most important: ORC (Hortonworks/Hive) or Parquet (Cloudera/Impala)
• If your schema is going to change over time: Avro
• If you are going to extract data from Hadoop to bulk-load into a database: CSV