Hadoop storage
M. Sandhiya, M.Sc. (IT)
Department of CS & IT
Nadar Saraswathi College of Arts and Science
Theni
Apache Hadoop
• Open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
• Created by Doug Cutting and Mike Cafarella in 2005
• Cutting named the project after his son’s toy elephant
Uses for Hadoop
• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large scale social network analysis
Overview
• HDFS is responsible for storing data on the cluster
• Data files are split into blocks and distributed across the nodes in the cluster
• Each block is replicated multiple times
HDFS Basic Concepts
• HDFS is a file system written in Java, based on Google’s GFS (Google File System)
• Provides redundant storage for massive amounts of data
How are Files Stored
• Files are split into blocks
• Blocks are distributed across many machines at load time
– Different blocks from the same file will be stored on different machines
• Blocks are replicated across multiple machines
• The NameNode keeps track of which blocks make up a file and where they are stored (see the sketch below)
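To make the block and replica layout concrete, the sketch below (Python, assuming a configured Hadoop client and a hypothetical file path) simply wraps the hdfs fsck command, whose report lists each block of a file, its replication factor and the DataNodes holding the replicas.

    # Sketch: inspect how HDFS split a file into blocks and where the replicas live.
    # Assumes a configured Hadoop client; the path is hypothetical.
    import subprocess

    report = subprocess.run(
        ["hdfs", "fsck", "/user/data/events.avro", "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    )
    # The report lists every block of the file, its size, replication factor,
    # and the DataNodes holding each replica.
    print(report.stdout)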
Storage efficiency
• Storage efficiency – with Parquet or Kudu and Snappy compression, the total volume of the data can be reduced by a factor of 10 compared to an uncompressed simple serialization format (see the sketch after this list).
• Data ingestion speed – all tested file-based solutions provide faster ingestion rates (between 2x and 10x) than specialized storage engines or MapFiles (sorted SequenceFiles).
• Random data access time – using HBase or Kudu, a typical random data lookup takes less than 500 ms. With smart HDFS namespace partitioning, Parquet can deliver random lookups on the level of a second, but consumes more resources.
• Data analytics – with Parquet or Kudu it is possible to perform fast and scalable data aggregation, filtering and reporting (typically more than 300k records per second per CPU core).
• Support of in-place data mutation – HBase and Kudu can modify records (schema and values) in place, which is not possible with data stored directly in HDFS files.
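As a rough illustration of the storage-efficiency point above, the following sketch writes the same synthetic records once as uncompressed JSON lines and once as Parquet with Snappy, then compares file sizes; field names are illustrative and the actual reduction factor depends entirely on the data.

    # Sketch: compare on-disk size of an uncompressed text serialization vs Parquet+Snappy.
    # Synthetic records only; real reduction factors depend on the data and schema.
    import json, os
    import pyarrow as pa
    import pyarrow.parquet as pq

    records = [{"runnumber": i % 100, "eventnumber": i, "project": "data16"} for i in range(100_000)]

    with open("events.jsonl", "w") as f:            # simple uncompressed serialization
        for r in records:
            f.write(json.dumps(r) + "\n")

    table = pa.table({
        "runnumber":   [r["runnumber"] for r in records],
        "eventnumber": [r["eventnumber"] for r in records],
        "project":     [r["project"] for r in records],
    })
    pq.write_table(table, "events.parquet", compression="snappy")

    print("jsonl  :", os.path.getsize("events.jsonl"), "bytes")
    print("parquet:", os.path.getsize("events.parquet"), "bytes")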
Approaches for Core Storage
• The data access and ingestion tests were performed on a cluster composed of 14 physical machines, each equipped with 2 CPUs with 8 physical cores (clock speed 2.60 GHz), 64 GB of RAM and 48 SAS drives of 4 TB each.
• Hadoop was installed from the Cloudera Data Hub (CDH) distribution, version 5.7.0, which includes Hadoop core 2.6.0, Impala 2.5.0, Hive 1.1.0, HBase 1.2.0 (configured JVM heap size for region servers = 30 GB) and Kudu 1.0 (configured memory limit = 30 GB).
• Apache Impala (incubating) was used as the data ingestion and data access framework in all the tests presented later in this report.
Evaluated formats and technologies
• Apache Avro is a data serialization standard for a compact binary format, widely used for storing persistent data in HDFS as well as for communication protocols. One of the advantages of using Avro is its lightweight and fast data serialization and deserialization, which can deliver very good ingestion performance.
• Even though it does not have an internal index (as MapFiles do), the HDFS directory-based partitioning technique can be applied to quickly navigate to the collections of interest when fast random data access is needed. In the tests, a tuple of (runnumber, project, streamname) was used as the partitioning key. This gave a good balance between the number of partitions (a few thousand) and the average partition size (hundreds of megabytes).
• Two compression algorithms supported by Apache Avro were used in the tests: Snappy and DEFLATE (a minimal writing sketch follows below).
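A minimal sketch of such an Avro layout, assuming the fastavro library and illustrative field values: records are written with the Snappy codec (DEFLATE would be codec="deflate") into a directory path derived from the (runnumber, project, streamname) partitioning key. The report itself ingested data through Impala; this only shows the format and the partitioned layout.

    # Sketch: write Avro records with Snappy compression into a directory-partitioned layout.
    # fastavro and all names/values below are illustrative assumptions.
    import os
    from fastavro import parse_schema, writer

    schema = parse_schema({
        "name": "Event", "type": "record",
        "fields": [
            {"name": "eventnumber", "type": "long"},
            {"name": "lumiblock",   "type": "long"},
            {"name": "guid",        "type": "string"},
        ],
    })

    def partition_dir(runnumber, project, streamname, root="eventindex_avro"):
        # Directory-based partitioning: one directory per (runnumber, project, streamname).
        return os.path.join(root, f"runnumber={runnumber}", f"project={project}", f"streamname={streamname}")

    records = [{"eventnumber": i, "lumiblock": i // 100, "guid": f"GUID-{i}"} for i in range(1000)]
    out_dir = partition_dir(358031, "data16_13TeV", "physics_Main")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "part-0.avro"), "wb") as out:
        writer(out, schema, records, codec="snappy")   # or codec="deflate"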
Apache Parquet
• Apache Parquet is a columnar storage format. It applies column-oriented encodings (such as dictionary and bit packing) and compression on series of values from the same column, which gives very good compaction ratios.
• When storing data in HDFS in Parquet format, the same partitioning strategy was used as in the Avro case. Two compression algorithms supported by Apache Parquet were used to compress the data (see the sketch below).
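A comparable hedged sketch with pyarrow: write_to_dataset reproduces the directory-based (runnumber, project, streamname) partitioning, Parquet's dictionary encoding is enabled by default, and Snappy is set as the column-chunk compression; table contents and paths are illustrative.

    # Sketch: write a Parquet dataset partitioned by the same key as the Avro case,
    # with Snappy compression (dictionary encoding is pyarrow's default).
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "runnumber":   [358031] * 1000,
        "project":     ["data16_13TeV"] * 1000,
        "streamname":  ["physics_Main"] * 1000,
        "eventnumber": list(range(1000)),
        "guid":        [f"GUID-{i}" for i in range(1000)],
    })

    pq.write_to_dataset(
        table,
        root_path="eventindex_parquet",                       # hypothetical output path
        partition_cols=["runnumber", "project", "streamname"],
        compression="snappy",                                 # column-chunk compression
    )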
Ingestion speed
• Measuring the record ingestion speed into a single data partition should reflect the write performance of the ATLAS EventIndex Core Storage system that can be expected when using different storage techniques. The results of this test are presented in Figure 2; a simplified local timing sketch follows below.
• In general, it is difficult to make a valid performance comparison between writing data to files and writing data to a storage engine. However, because Apache Impala performs writes into a single HDFS directory (Hive partition) serially, the results obtained for the HDFS formats and for HBase or Kudu can be directly compared for single-partition ingestion efficiency.
• Writing to HDFS files encoded with Avro or Parquet delivered much better results (at least by a factor of 5) than storage engines like HBase and Kudu. Since Avro has the most lightweight encoder, it achieved the best ingestion performance. At the other end of the spectrum, HBase was very slow in this test (worse than Kudu). This was most likely caused by the length of the row key (6 concatenated columns), which on average was around 60 bytes. HBase has to encode the key for each of the columns in a row separately, which for long records (with many columns) can be suboptimal.
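The following toy benchmark, run locally with fastavro and pyarrow rather than through Impala, shows one way to measure single-partition ingestion in records per second; absolute numbers will differ greatly from the cluster results in Figure 2.

    # Sketch: measure single-partition ingestion rate (records/second) for two encoders.
    # A local toy benchmark, not the Impala-based setup used in the report.
    import time
    import pyarrow as pa
    import pyarrow.parquet as pq
    from fastavro import parse_schema, writer

    records = [{"eventnumber": i, "guid": f"GUID-{i}"} for i in range(500_000)]
    schema = parse_schema({
        "name": "Event", "type": "record",
        "fields": [{"name": "eventnumber", "type": "long"}, {"name": "guid", "type": "string"}],
    })

    t0 = time.perf_counter()
    with open("part.avro", "wb") as f:
        writer(f, schema, records, codec="snappy")
    print("Avro   :", int(len(records) / (time.perf_counter() - t0)), "records/s")

    t0 = time.perf_counter()
    pq.write_table(pa.table({
        "eventnumber": [r["eventnumber"] for r in records],
        "guid":        [r["guid"] for r in records],
    }), "part.parquet", compression="snappy")
    print("Parquet:", int(len(records) / (time.perf_counter() - t0)), "records/s")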
Random data lookup
• According to the measured results (Figure 3), when accessing data by a record key, Kudu and HBase were the fastest, thanks to their built-in indexing. Values on the plot were measured with cold caches.
• Using Apache Impala for the random lookup test is suboptimal for Kudu and HBase, as a significant amount of time is spent setting up the query (planning, code generation, etc.) before it actually gets executed; a direct-lookup sketch follows below.
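For comparison, a random lookup can bypass that query-setup overhead by going straight to the storage engine. The sketch below uses the happybase client against an HBase Thrift gateway; host, table name and row-key layout are assumptions.

    # Sketch: random lookup of a single record by row key, going straight to HBase
    # through the Thrift gateway (avoids the Impala query-setup cost noted above).
    # Host, table name and row-key layout are hypothetical.
    import time
    import happybase

    conn = happybase.Connection("hbase-thrift.example.org")
    table = conn.table("eventindex")

    row_key = b"358031:data16_13TeV:physics_Main:0000042"   # concatenated key, as in the tests
    t0 = time.perf_counter()
    row = table.row(row_key)                                 # single random read
    print(row, f"lookup took {1000 * (time.perf_counter() - t0):.1f} ms")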
Schema-less tables
• Apache Avro has proven to be a fast, universal encoder for structured data. Due to its very efficient serialization and deserialization, this format can deliver very good performance whenever access to all the attributes of a record is required at the same time – data transportation, staging areas, etc.
• On the other hand, Apache HBase delivers very good random data access performance and the greatest flexibility in structuring the stored data (schema-less tables, illustrated in the sketch below). The performance of batch processing of HBase data heavily depends on the chosen data model and typically cannot compete in this area with the other tested technologies. Therefore, any analytics on HBase data should be performed rather rarely.
• Notably, compression …
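The "schema-less" flexibility mentioned above can be sketched as follows: only the column family is declared when the table is created, and each row may carry a different set of columns (happybase client, hypothetical names).

    # Sketch: HBase "schema-less" rows - only the column family is fixed up front;
    # individual columns can differ per row and be added at write time.
    # Connection details and names are hypothetical.
    import happybase

    conn = happybase.Connection("hbase-thrift.example.org")
    # One column family ("d"); no per-column schema is declared.
    conn.create_table("events_flexible", {"d": dict()})
    table = conn.table("events_flexible")

    table.put(b"run358031:evt1", {b"d:guid": b"GUID-1", b"d:lumiblock": b"17"})
    table.put(b"run358031:evt2", {b"d:guid": b"GUID-2", b"d:trigger": b"HLT_mu26"})  # extra column, no schema change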
Fault Tolerance
• Events are indexed by event number and run number in an HBase database. In this approach, the indexing key resolves to a GUID and pointers to the complete records stored on HDFS (see the lookup sketch below).
• So far both systems have proven to deliver very good event-picking performance, on the level of tens of milliseconds – two orders of magnitude faster than the original approach using MapFiles alone.
• The only concern when running a hybrid approach, in both cases, is the system size and internal coherence – robust procedures for handling updates of the HDFS raw data sets and propagating them to the indexing databases with low latency have to be maintained and monitored.
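A hedged sketch of that hybrid lookup path: the HBase index row resolves the (run number, event number) key to a GUID plus a pointer into the complete records stored on HDFS, which is then read back. Table names, column qualifiers and the pointer encoding are assumptions for illustration.

    # Sketch of the hybrid lookup path: the HBase index row resolves to a GUID and a
    # pointer (file path, offset, length) into the complete records stored on HDFS.
    # Table/column names and the pointer encoding are assumptions.
    import happybase
    import pyarrow.fs as pafs

    index = happybase.Connection("hbase-thrift.example.org").table("eventindex_keys")
    hdfs = pafs.HadoopFileSystem("namenode.example.org", port=8020)

    row = index.row(b"run=358031:evt=0000042")
    guid = row[b"d:guid"].decode()
    path, offset, length = row[b"d:pointer"].decode().split(",")   # e.g. "/data/part-0.avro,1048576,310"

    with hdfs.open_input_file(path) as f:
        f.seek(int(offset))
        record_bytes = f.read(int(length))    # raw bytes of the complete record
    print(guid, len(record_bytes))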
Core Hadoop Concepts
• Applications are written in a high-level
programming language
– No network programming or temporal dependency
• Nodes should communicate as little as possible
– A “shared nothing” architecture
• Data is spread among the machines in advance
– Perform computation where the data is already stored
as often as possible
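One way to see these principles in code is a classic Hadoop Streaming word count: the mapper and reducer below are ordinary Python filters over stdin/stdout, with no network programming, and the framework schedules them on the nodes that already hold the input blocks (in a real job the two functions would be shipped as separate scripts).

    # Sketch: word-count mapper and reducer for Hadoop Streaming - plain Python on
    # stdin/stdout; Hadoop moves the code to the data and sorts map output by key.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                print(f"{word}\t1")

    def reducer(lines):
        # Reducer input arrives sorted by key, so consecutive lines share a word.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)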