Hadoop storage
M. Sandhiya, M.Sc. (IT)
Department of CS & IT
Nadar Saraswathi College of Arts and Science
Theni
Apache Hadoop
• Open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
• Created by Doug Cutting and Mike Cafarella in 2005
• Cutting named the project after his son’s toy elephant
Uses for Hadoop
• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large scale social network analysis
Overview
• HDFS is responsible for storing data on the cluster
• Data files are split into blocks and distributed across the nodes in the cluster
• Each block is replicated multiple times
HDFS Basic Concepts
• HDFS is a file system written in Java, based on Google’s GFS (Google File System)
• Provides redundant storage for massive amounts of data
How are Files Stored
• Files are split into blocks
• Blocks are distributed across many machines at load time
– Different blocks from the same file will be stored on different machines
• Blocks are replicated across multiple machines
• The NameNode keeps track of which blocks make up a file and where they are stored (see the sketch below)
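To make the block and replica layout concrete, the sketch below (Python, assuming a configured Hadoop client and a hypothetical file path) simply wraps the hdfs fsck command, whose report lists each block of a file, its replication factor and the DataNodes holding the replicas.

    # Sketch: inspect how HDFS split a file into blocks and where the replicas live.
    # Assumes a configured Hadoop client; the path is hypothetical.
    import subprocess

    report = subprocess.run(
        ["hdfs", "fsck", "/user/data/events.avro", "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    )
    # The report lists every block of the file, its size, replication factor,
    # and the DataNodes holding each replica.
    print(report.stdout)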
Storage efficiency
• Storage efficiency – with Parquet or Kudu and Snappy compression, the total volume of the data can be reduced by a factor of 10 compared to an uncompressed simple serialization format (see the sketch after this list).
• Data ingestion speed – all tested file-based solutions provide faster ingestion rates (between 2x and 10x) than specialized storage engines or MapFiles (sorted SequenceFiles).
• Random data access time – using HBase or Kudu, a typical random data lookup takes less than 500 ms. With smart HDFS namespace partitioning, Parquet can deliver random lookups on the level of a second, but consumes more resources.
• Data analytics – with Parquet or Kudu it is possible to perform fast and scalable data aggregation, filtering and reporting (typically more than 300k records per second per CPU core).
• Support of in-place data mutation – HBase and Kudu can modify records (schema and values) in place, which is not possible with data stored directly in HDFS files.
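As a rough illustration of the storage-efficiency point above, the following sketch writes the same synthetic records once as uncompressed JSON lines and once as Parquet with Snappy, then compares file sizes; field names are illustrative and the actual reduction factor depends entirely on the data.

    # Sketch: compare on-disk size of an uncompressed text serialization vs Parquet+Snappy.
    # Synthetic records only; real reduction factors depend on the data and schema.
    import json, os
    import pyarrow as pa
    import pyarrow.parquet as pq

    records = [{"runnumber": i % 100, "eventnumber": i, "project": "data16"} for i in range(100_000)]

    with open("events.jsonl", "w") as f:            # simple uncompressed serialization
        for r in records:
            f.write(json.dumps(r) + "\n")

    table = pa.table({
        "runnumber":   [r["runnumber"] for r in records],
        "eventnumber": [r["eventnumber"] for r in records],
        "project":     [r["project"] for r in records],
    })
    pq.write_table(table, "events.parquet", compression="snappy")

    print("jsonl  :", os.path.getsize("events.jsonl"), "bytes")
    print("parquet:", os.path.getsize("events.parquet"), "bytes")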
Approaches for Core Storage
• The data access and ingestion tests were performed on a cluster composed of 14 physical machines, each equipped with 2 CPUs with 8 physical cores (clock speed 2.60 GHz), 64 GB of RAM and 48 SAS drives of 4 TB each.
• Hadoop was installed from the Cloudera Data Hub (CDH) distribution, version 5.7.0, which includes Hadoop core 2.6.0, Impala 2.5.0, Hive 1.1.0, HBase 1.2.0 (configured JVM heap size for region servers = 30 GB) and Kudu 1.0 (configured memory limit = 30 GB).
• Apache Impala (incubating) was used as the data ingestion and data access framework in all the tests presented later in this report.
Evaluated formats and technologies
• Apache Avro is a data serialization standard for a compact binary format, widely used for storing persistent data in HDFS as well as for communication protocols. One of the advantages of using Avro is its lightweight and fast data serialization and deserialization, which can deliver very good ingestion performance.
• Even though it does not have an internal index (as MapFiles do), the HDFS directory-based partitioning technique can be applied to quickly navigate to the collections of interest when fast random data access is needed. In the tests, a tuple of (runnumber, project, streamname) was used as the partitioning key. This gave a good balance between the number of partitions (a few thousand) and the average partition size (hundreds of megabytes).
• Two compression algorithms supported by Apache Avro were used in the tests: Snappy and DEFLATE (a minimal writing sketch follows below).
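A minimal sketch of such an Avro layout, assuming the fastavro library and illustrative field values: records are written with the Snappy codec (DEFLATE would be codec="deflate") into a directory path derived from the (runnumber, project, streamname) partitioning key. The report itself ingested data through Impala; this only shows the format and the partitioned layout.

    # Sketch: write Avro records with Snappy compression into a directory-partitioned layout.
    # fastavro and all names/values below are illustrative assumptions.
    import os
    from fastavro import parse_schema, writer

    schema = parse_schema({
        "name": "Event", "type": "record",
        "fields": [
            {"name": "eventnumber", "type": "long"},
            {"name": "lumiblock",   "type": "long"},
            {"name": "guid",        "type": "string"},
        ],
    })

    def partition_dir(runnumber, project, streamname, root="eventindex_avro"):
        # Directory-based partitioning: one directory per (runnumber, project, streamname).
        return os.path.join(root, f"runnumber={runnumber}", f"project={project}", f"streamname={streamname}")

    records = [{"eventnumber": i, "lumiblock": i // 100, "guid": f"GUID-{i}"} for i in range(1000)]
    out_dir = partition_dir(358031, "data16_13TeV", "physics_Main")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "part-0.avro"), "wb") as out:
        writer(out, schema, records, codec="snappy")   # or codec="deflate"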
Apache Parquet
• Apache Parquet is a columnar storage format. It applies column-oriented encodings (such as dictionary and bit packing) and compression on series of values from the same column, which gives very good compaction ratios.
• When storing data in HDFS in Parquet format, the same partitioning strategy was used as in the Avro case. Two compression algorithms supported by Apache Parquet were used to compress the data (see the sketch below).
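A comparable hedged sketch with pyarrow: write_to_dataset reproduces the directory-based (runnumber, project, streamname) partitioning, Parquet's dictionary encoding is enabled by default, and Snappy is set as the column-chunk compression; table contents and paths are illustrative.

    # Sketch: write a Parquet dataset partitioned by the same key as the Avro case,
    # with Snappy compression (dictionary encoding is pyarrow's default).
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "runnumber":   [358031] * 1000,
        "project":     ["data16_13TeV"] * 1000,
        "streamname":  ["physics_Main"] * 1000,
        "eventnumber": list(range(1000)),
        "guid":        [f"GUID-{i}" for i in range(1000)],
    })

    pq.write_to_dataset(
        table,
        root_path="eventindex_parquet",                       # hypothetical output path
        partition_cols=["runnumber", "project", "streamname"],
        compression="snappy",                                 # column-chunk compression
    )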
Ingestion speed
• Measuring the record ingestion speed into a single data partition should reflect the write performance of the ATLAS EventIndex Core Storage system that can be expected when using different storage techniques. The results of this test are presented in Figure 2; a simplified local timing sketch follows below.
• In general, it is difficult to make a valid performance comparison between writing data to files and writing data to a storage engine. However, because Apache Impala performs writes into a single HDFS directory (Hive partition) serially, the results obtained for the HDFS formats and for HBase or Kudu can be directly compared for single-partition ingestion efficiency.
• Writing to HDFS files encoded with Avro or Parquet delivered much better results (at least by a factor of 5) than storage engines like HBase and Kudu. Since Avro has the most lightweight encoder, it achieved the best ingestion performance. At the other end of the spectrum, HBase was very slow in this test (worse than Kudu). This was most likely caused by the length of the row key (6 concatenated columns), which on average was around 60 bytes. HBase has to encode the key for each of the columns in a row separately, which for long records (with many columns) can be suboptimal.
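The following toy benchmark, run locally with fastavro and pyarrow rather than through Impala, shows one way to measure single-partition ingestion in records per second; absolute numbers will differ greatly from the cluster results in Figure 2.

    # Sketch: measure single-partition ingestion rate (records/second) for two encoders.
    # A local toy benchmark, not the Impala-based setup used in the report.
    import time
    import pyarrow as pa
    import pyarrow.parquet as pq
    from fastavro import parse_schema, writer

    records = [{"eventnumber": i, "guid": f"GUID-{i}"} for i in range(500_000)]
    schema = parse_schema({
        "name": "Event", "type": "record",
        "fields": [{"name": "eventnumber", "type": "long"}, {"name": "guid", "type": "string"}],
    })

    t0 = time.perf_counter()
    with open("part.avro", "wb") as f:
        writer(f, schema, records, codec="snappy")
    print("Avro   :", int(len(records) / (time.perf_counter() - t0)), "records/s")

    t0 = time.perf_counter()
    pq.write_table(pa.table({
        "eventnumber": [r["eventnumber"] for r in records],
        "guid":        [r["guid"] for r in records],
    }), "part.parquet", compression="snappy")
    print("Parquet:", int(len(records) / (time.perf_counter() - t0)), "records/s")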
Random data lookup
• According to the measured results (Figure 3), when accessing data by a record key, Kudu and HBase were the fastest, thanks to their built-in indexing. Values on the plot were measured with cold caches.
• Using Apache Impala for the random lookup test is suboptimal for Kudu and HBase, as a significant amount of time is spent setting up the query (planning, code generation, etc.) before it actually gets executed; a direct-lookup sketch follows below.
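For comparison, a random lookup can bypass that query-setup overhead by going straight to the storage engine. The sketch below uses the happybase client against an HBase Thrift gateway; host, table name and row-key layout are assumptions.

    # Sketch: random lookup of a single record by row key, going straight to HBase
    # through the Thrift gateway (avoids the Impala query-setup cost noted above).
    # Host, table name and row-key layout are hypothetical.
    import time
    import happybase

    conn = happybase.Connection("hbase-thrift.example.org")
    table = conn.table("eventindex")

    row_key = b"358031:data16_13TeV:physics_Main:0000042"   # concatenated key, as in the tests
    t0 = time.perf_counter()
    row = table.row(row_key)                                 # single random read
    print(row, f"lookup took {1000 * (time.perf_counter() - t0):.1f} ms")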
Schema-less tables
• Apache Avro has proven to be a fast, universal encoder for structured data. Due to its very efficient serialization and deserialization, this format can deliver very good performance whenever access to all the attributes of a record is required at the same time – data transportation, staging areas, etc.
• On the other hand, Apache HBase delivers very good random data access performance and the greatest flexibility in structuring the stored data (schema-less tables, illustrated in the sketch below). The performance of batch processing of HBase data heavily depends on the chosen data model and typically cannot compete in this area with the other tested technologies. Therefore, any analytics on HBase data should be performed rather rarely.
• Notably, compression …
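The "schema-less" flexibility mentioned above can be sketched as follows: only the column family is declared when the table is created, and each row may carry a different set of columns (happybase client, hypothetical names).

    # Sketch: HBase "schema-less" rows - only the column family is fixed up front;
    # individual columns can differ per row and be added at write time.
    # Connection details and names are hypothetical.
    import happybase

    conn = happybase.Connection("hbase-thrift.example.org")
    # One column family ("d"); no per-column schema is declared.
    conn.create_table("events_flexible", {"d": dict()})
    table = conn.table("events_flexible")

    table.put(b"run358031:evt1", {b"d:guid": b"GUID-1", b"d:lumiblock": b"17"})
    table.put(b"run358031:evt2", {b"d:guid": b"GUID-2", b"d:trigger": b"HLT_mu26"})  # extra column, no schema change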
Fault Tolerance
• Events are indexed by event number and run number in an HBase database. In this approach, the indexing key resolves to a GUID and pointers to the complete records stored on HDFS (see the lookup sketch below).
• So far both systems have proven to deliver very good event-picking performance, on the level of tens of milliseconds – two orders of magnitude faster than the original approach using MapFiles alone.
• The only concern when running a hybrid approach, in both cases, is the system size and internal coherence – robust procedures for handling updates of the HDFS raw data sets and propagating them to the indexing databases with low latency have to be maintained and monitored.
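A hedged sketch of that hybrid lookup path: the HBase index row resolves the (run number, event number) key to a GUID plus a pointer into the complete records stored on HDFS, which is then read back. Table names, column qualifiers and the pointer encoding are assumptions for illustration.

    # Sketch of the hybrid lookup path: the HBase index row resolves to a GUID and a
    # pointer (file path, offset, length) into the complete records stored on HDFS.
    # Table/column names and the pointer encoding are assumptions.
    import happybase
    import pyarrow.fs as pafs

    index = happybase.Connection("hbase-thrift.example.org").table("eventindex_keys")
    hdfs = pafs.HadoopFileSystem("namenode.example.org", port=8020)

    row = index.row(b"run=358031:evt=0000042")
    guid = row[b"d:guid"].decode()
    path, offset, length = row[b"d:pointer"].decode().split(",")   # e.g. "/data/part-0.avro,1048576,310"

    with hdfs.open_input_file(path) as f:
        f.seek(int(offset))
        record_bytes = f.read(int(length))    # raw bytes of the complete record
    print(guid, len(record_bytes))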
Core Hadoop Concepts
• Applications are written in a high-level
programming language
– No network programming or temporal dependency
• Nodes should communicate as little as possible
– A “shared nothing” architecture
• Data is spread among the machines in advance
– Perform computation where the data is already stored
as often as possible
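One way to see these principles in code is a classic Hadoop Streaming word count: the mapper and reducer below are ordinary Python filters over stdin/stdout, with no network programming, and the framework schedules them on the nodes that already hold the input blocks (in a real job the two functions would be shipped as separate scripts).

    # Sketch: word-count mapper and reducer for Hadoop Streaming - plain Python on
    # stdin/stdout; Hadoop moves the code to the data and sorts map output by key.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                print(f"{word}\t1")

    def reducer(lines):
        # Reducer input arrives sorted by key, so consecutive lines share a word.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)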