© Hortonworks Inc. 2011
Interactive Hadoop via Flash and Memory
Arpit Agarwal
aagarwal@hortonworks.com
@aagarw
Chris Nauroth
cnauroth@hortonworks.com
@cnauroth
HDFS Reads
Architecting the Future of Big Data
HDFS Short-Circuit Reads
Shortcomings of Existing RAM Utilization
• Lack of Control
– Kernel decides what to retain in cache and what to evict based on observations of
access patterns.
• Sub-optimal RAM Utilization
– Tasks for multiple jobs are interleaved on the same node, and one task’s activity
could trigger eviction of data that would have been valuable to retain in cache for
the other task.
Centralized Cache Management
• Provides users with explicit control of which HDFS file paths to keep
resident in memory.
• Allows clients to query location of cached block replicas, opening
possibility for job scheduling improvements.
• Utilizes off-heap memory, not subject to GC overhead or JVM tuning.
Using Centralized Cache Management
• Prerequisites
– Native Hadoop library required; currently supported on Linux only.
– Set the process ulimit for maximum locked memory.
– Configure dfs.datanode.max.locked.memory in hdfs-site.xml, set to the amount of
memory, in bytes, to dedicate to caching.
• New Concepts
– Cache Pool
  – Contains and manages a group of cache directives.
  – Has Unix-style permissions.
  – Can constrain resource utilization by defining a maximum number of cached bytes or a maximum time to live.
– Cache Directive
  – Specifies a file system path to cache.
  – Specifying a directory caches all files in that directory (not recursive).
  – Can specify number of replicas to cache and time to live.
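The two memory prerequisites above interact: the DataNode's locked-memory ulimit has to be at least as large as dfs.datanode.max.locked.memory. A minimal sketch of that comparison follows. It assumes the ulimit is read with `ulimit -l` in a shell that reports kibibytes (as bash typically does), and the class and method names are hypothetical, not part of Hadoop.

```java
// Sketch only: LockedMemoryCheck and fitsWithinUlimit are hypothetical names,
// not Hadoop classes or methods.
final class LockedMemoryCheck {

    /**
     * Returns true when the configured cache size fits within the locked-memory limit.
     * configuredBytes: value of dfs.datanode.max.locked.memory (bytes).
     * ulimitKiB: numeric output of `ulimit -l`, assumed to be in kibibytes.
     */
    static boolean fitsWithinUlimit(long configuredBytes, long ulimitKiB) {
        return configuredBytes <= ulimitKiB * 1024L;
    }

    public static void main(String[] args) {
        long configuredBytes = 256L * 1024 * 1024; // example value: 256 MiB
        long ulimitKiB = 262144L;                  // example value: 256 MiB expressed in KiB
        System.out.println(fitsWithinUlimit(configuredBytes, ulimitKiB));
    }
}
```

A limit reported as "unlimited" is not numeric and would need separate handling before calling this check.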
Using Centralized Cache Management
• CLI: Adding a Cache Pool
> hdfs cacheadmin -addPool common-pool
Successfully added cache pool common-pool.
> hdfs cacheadmin -listPools
Found 1 result.
NAME OWNER GROUP MODE LIMIT MAXTTL
common-pool cnauroth cnauroth rwxr-xr-x unlimited never
Using Centralized Cache Management
• CLI: Adding a Cache Directive
> hdfs cacheadmin -addDirective \
-path /hello-amsterdam \
-pool common-pool
Added cache directive 1
> hdfs cacheadmin -listDirectives
Found 1 entry
ID POOL REPL EXPIRY PATH
1 common-pool 1 never /hello-amsterdam
Using Centralized Cache Management
• CLI: Removing a Cache Directive
> hdfs cacheadmin -removeDirective 1
Removed cached directive 1
> hdfs cacheadmin -removeDirectives \
-path /hello-amsterdam
Removed cached directive 1
Removed every cache directive with path /hello-amsterdam
Using Centralized Cache Management
• API: DistributedFileSystem Methods
public void addCachePool(CachePoolInfo info)
public RemoteIterator<CachePoolEntry> listCachePools()
public long addCacheDirective(CacheDirectiveInfo info)
public RemoteIterator<CacheDirectiveEntry>
listCacheDirectives(CacheDirectiveInfo filter)
public void removeCacheDirective(long id)
Centralized Cache Management Behind the Scenes
• Block files are memory-mapped into the DataNode process.
> pmap `jps | grep DataNode | awk '{ print $1 }'` |
grep blk
00007f92e4b1f000 124928K r--s- /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741827
00007f92ecd21000 131072K r--s- /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741826
Centralized Cache Management Behind the Scenes
• Pages of each block file are 100% resident in memory.
> vmtouch /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741826
Files: 1
Directories: 0
Resident Pages: 32768/32768 128M/128M 100%
Elapsed: 0.001198 seconds
> vmtouch /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741827
Files: 1
Directories: 0
Resident Pages: 31232/31232 122M/122M 100%
Elapsed: 0.00172 seconds
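The page counts reported by vmtouch line up with the mapping sizes reported by pmap. As a small worked check, the sketch below divides each mapping size by an assumed 4 KiB page size (the size implied by 32768 pages covering 128M). The class and method names are illustrative only.

```java
// Sketch only: ResidentPages is an illustrative name, not a Hadoop or vmtouch API.
final class ResidentPages {

    /** Number of pages needed to cover a mapping, rounding up to whole pages. */
    static long pagesFor(long mappingKiB, long pageSizeKiB) {
        return (mappingKiB + pageSizeKiB - 1) / pageSizeKiB;
    }

    public static void main(String[] args) {
        long pageSizeKiB = 4; // assumption: 4 KiB pages
        // Mapping sizes taken from the pmap output: 131072K and 124928K.
        System.out.println(pagesFor(131072, pageSizeKiB)); // 32768, matching blk_1073741826
        System.out.println(pagesFor(124928, pageSizeKiB)); // 31232, matching blk_1073741827
    }
}
```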
HDFS Zero-Copy Reads
• Applications read straight from direct byte buffers, backed by the
memory-mapped block file.
• Eliminates overhead of intermediate copy of bytes to buffer in user
space.
• Applications must change code to use a new read API on
DFSInputStream:
public ByteBuffer read(ByteBufferPool factory, int maxLength,
EnumSet<ReadOption> opts)
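The idea of reading straight from a buffer backed by a memory-mapped file can be illustrated with plain java.nio, without any HDFS classes. The sketch below is an analogy under that assumption, not the HDFS read path: it maps a local temporary file read-only and sums its bytes directly from the mapping, with no intermediate byte array. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: a local-file analogy using java.nio, not the HDFS zero-copy API.
final class MappedReadSketch {

    /** Maps the file read-only and sums its bytes by reading from the mapping itself. */
    static long sumBytes(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (mapped.hasRemaining()) {
                sum += mapped.get(); // each get() reads from the mapped pages
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes four known bytes to a temporary file and reads them back via the mapping. */
    static long demoSum() {
        try {
            Path tmp = Files.createTempFile("block", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, new byte[] {1, 2, 3, 4});
            return sumBytes(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoSum()); // 1 + 2 + 3 + 4 = 10
    }
}
```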
Heterogeneous Storages for HDFS
Goals
• Extend HDFS to support a variety of Storage Media
• Applications can choose their target storage
• Use existing APIs wherever possible
Interesting Storage Media
Storage Medium            Cost          Example Use Case
Spinning Disk (HDD)       Low           High-volume batch data
Solid State Disk (SSD)    10x of HDD    HBase tables
RAM                       100x of HDD   Hive materialized views
Your custom media         ?             ?
HDFS Storage Architecture - Before
HDFS Storage Architecture - Now
Storage Preferences
• Introduce Storage Type per Storage Medium
• Storage Hint from application to HDFS
– Specifies application’s preferred Storage Type
• Advisory
• Subject to available space/quotas
• Fallback Storage is HDD
– May be configurable in the future
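The advisory nature of the hint can be sketched as a small decision function: honor the preferred storage type when it has room, otherwise fall back to HDD. This is an illustration of the behavior described above, not HDFS's actual placement logic. The enum values, the free-space map, and all names are hypothetical, and quotas are left out.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch only: hypothetical names; not the HDFS block placement implementation.
final class StoragePreferenceSketch {

    enum StorageType { HDD, SSD, RAM }

    /** Returns the preferred type if it has enough free space, else the HDD fallback. */
    static StorageType choose(StorageType preferred, long replicaBytes,
                              Map<StorageType, Long> freeBytes) {
        long free = freeBytes.getOrDefault(preferred, 0L);
        return free >= replicaBytes ? preferred : StorageType.HDD;
    }

    /** Example free-space figures for one node (made-up numbers). */
    static Map<StorageType, Long> exampleFreeBytes() {
        Map<StorageType, Long> free = new EnumMap<>(StorageType.class);
        free.put(StorageType.HDD, 1_000L);
        free.put(StorageType.SSD, 100L);
        return free;
    }

    public static void main(String[] args) {
        Map<StorageType, Long> free = exampleFreeBytes();
        System.out.println(choose(StorageType.SSD, 50L, free));  // SSD: enough space
        System.out.println(choose(StorageType.SSD, 500L, free)); // HDD: SSD too full
        System.out.println(choose(StorageType.RAM, 1L, free));   // HDD: no RAM storage listed
    }
}
```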
Storage Preferences (continued)
• Specify preference when creating a file
– Write replicas directly to Storage Medium of choice
• Change preference for an existing file
– E.g. to migrate existing file replicas from HDD to SSD
Quota Management
• Extend existing Quota Mechanisms
• Administrators ensure fair distribution of limited
resources
File Creation with Storage Types
Move existing replicas to target Storage Type
Transient Files (Planned feature)
• Target storage type is Memory
– Writes will go to RAM
– Allow short-circuit writes, equivalent to short-circuit reads, to local in-memory block replicas
• Checkpoint files to disk by changing storage type
• Or discard
• High-performance writes for low-volume transient data
– e.g. Hive Materialized Views
References
• http://hortonworks.com/blog/heterogeneous-storages-hdfs/
• HDFS-2832 – Heterogeneous Storages phase 1 – DataNode as a
collection of storages
• HDFS-5682 – Heterogeneous Storages phase 2 – APIs to expose
Storage Types
• HDFS-4949 – Centralized cache management in HDFS
Page 28
Architecting the Future of Big Data
Ad

More Related Content

What's hot (20)

Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
HBaseCon
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance Evaluation
Schubert Zhang
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Patrick McGarry
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
HBaseCon
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction
HBaseCon
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
Biju Nair
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
HBaseCon
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
Cloudera, Inc.
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
HBaseCon
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
HBaseCon
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
enissoz
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
DataWorks Summit
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
enissoz
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
Cloudera, Inc.
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
JAX London
 
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのかApache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Toshihiro Suzuki
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
Cloudera, Inc.
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
HBaseCon
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance Evaluation
Schubert Zhang
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Patrick McGarry
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
HBaseCon
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction
HBaseCon
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
Biju Nair
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
HBaseCon
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
Cloudera, Inc.
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
HBaseCon
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
HBaseCon
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
enissoz
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
DataWorks Summit
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
enissoz
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
Cloudera, Inc.
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
JAX London
 
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのかApache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Apache HBaseの現在 - 火山と呼ばれたHBaseは今どうなっているのか
Toshihiro Suzuki
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
Cloudera, Inc.
 

Similar to Interactive Hadoop via Flash and Memory (20)

Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Democratizing Memory Storage
Democratizing Memory StorageDemocratizing Memory Storage
Democratizing Memory Storage
DataWorks Summit
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
Chris Nauroth
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Building a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native EraBuilding a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native Era
Alluxio, Inc.
 
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
Chris Almond
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
JasmineMichael1
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
Chris Nauroth
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
Hortonworks
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 
Tutorial Haddop 2.3
Tutorial Haddop 2.3Tutorial Haddop 2.3
Tutorial Haddop 2.3
Atanu Chatterjee
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
Chiou-Nan Chen
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
Steve Loughran
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
Haoyuan Li
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
SUSE Italy
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Democratizing Memory Storage
Democratizing Memory StorageDemocratizing Memory Storage
Democratizing Memory Storage
DataWorks Summit
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
Chris Nauroth
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Building a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native EraBuilding a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native Era
Alluxio, Inc.
 
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
Chris Almond
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
Chris Nauroth
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
Hortonworks
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
Chiou-Nan Chen
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
Steve Loughran
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
Haoyuan Li
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
SUSE Italy
 
Ad

Recently uploaded (20)

Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Interactive Hadoop via Flash and Memory

  • 1. Interactive Hadoop via Flash and Memory
      Arpit Agarwal, aagarwal@hortonworks.com, @aagarw
      Chris Nauroth, cnauroth@hortonworks.com, @cnauroth
      © Hortonworks Inc. 2011
  • 2. HDFS Reads
      Architecting the Future of Big Data
  • 3. HDFS Short-Circuit Reads
  • 4. HDFS Short-Circuit Reads
  • 5. Shortcomings of Existing RAM Utilization
      • Lack of Control
          – Kernel decides what to retain in cache and what to evict based on observations of access patterns.
      • Sub-optimal RAM Utilization
          – Tasks for multiple jobs are interleaved on the same node, and one task’s activity could trigger eviction of data that would have been valuable to retain in cache for the other task.
  • 6. Centralized Cache Management
      • Provides users with explicit control of which HDFS file paths to keep resident in memory.
      • Allows clients to query location of cached block replicas, opening possibility for job scheduling improvements.
      • Utilizes off-heap memory, not subject to GC overhead or JVM tuning.
  • 7. Using Centralized Cache Management
      • Pre-Requisites
          – Native Hadoop library required, currently supported on Linux only.
          – Set process ulimit for maximum locked memory.
          – Configure dfs.datanode.max.locked.memory in hdfs-site.xml, set to the amount of memory to dedicate towards caching.
      • New Concepts
          – Cache Pool
              – Contains and manages a group of cache directives.
              – Has Unix-style permissions.
              – Can constrain resource utilization by defining a maximum number of cached bytes or a maximum time to live.
          – Cache Directive
              – Specifies a file system path to cache.
              – Specifying a directory caches all files in that directory (not recursive).
              – Can specify number of replicas to cache and time to live.
  • 8. Using Centralized Cache Management
  • 9. Using Centralized Cache Management
      • CLI: Adding a Cache Pool
          > hdfs cacheadmin -addPool common-pool
          Successfully added cache pool common-pool.
          > hdfs cacheadmin -listPools
          Found 1 result.
          NAME         OWNER     GROUP     MODE       LIMIT      MAXTTL
          common-pool  cnauroth  cnauroth  rwxr-xr-x  unlimited  never
  • 10. Using Centralized Cache Management
      • CLI: Adding a Cache Directive
          > hdfs cacheadmin -addDirective -path /hello-amsterdam -pool common-pool
          Added cache directive 1
          > hdfs cacheadmin -listDirectives
          Found 1 entry
          ID  POOL         REPL  EXPIRY  PATH
          1   common-pool  1     never   /hello-amsterdam
  • 11. Using Centralized Cache Management
      • CLI: Removing a Cache Directive
          > hdfs cacheadmin -removeDirective 1
          Removed cached directive 1
          > hdfs cacheadmin -removeDirectives -path /hello-amsterdam
          Removed cached directive 1
          Removed every cache directive with path /hello-amsterdam
  • 12. Using Centralized Cache Management
      • API: DistributedFileSystem Methods
          public void addCachePool(CachePoolInfo info)
          public RemoteIterator<CachePoolEntry> listCachePools()
          public long addCacheDirective(CacheDirectiveInfo info)
          public RemoteIterator<CacheDirectiveEntry> listCacheDirectives(CacheDirectiveInfo filter)
          public void removeCacheDirective(long id)
  • 13. Centralized Cache Management Behind the Scenes
  • 14. Centralized Cache Management Behind the Scenes
      • Block files are memory-mapped into the DataNode process.
          > pmap `jps | grep DataNode | awk '{ print $1 }'` | grep blk
          00007f92e4b1f000 124928K r--s- /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741827
          00007f92ecd21000 131072K r--s- /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741826
  • 15. Centralized Cache Management Behind the Scenes
      • Pages of each block file are 100% resident in memory.
          > vmtouch /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741826
                     Files: 1
               Directories: 0
            Resident Pages: 32768/32768  128M/128M  100%
                   Elapsed: 0.001198 seconds
          > vmtouch /data/dfs/data/current/BP-1740238118-127.0.1.1-1395252171596/current/finalized/blk_1073741827
                     Files: 1
               Directories: 0
            Resident Pages: 31232/31232  122M/122M  100%
                   Elapsed: 0.00172 seconds
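The mapping sizes that pmap reports on slide 14 and the page counts that vmtouch reports on slide 15 can be cross-checked with a little arithmetic. The sketch below assumes a 4 KiB page size, which is common on Linux x86-64 but is not stated in the slides. The class and method names are illustrative only.

```java
// Sketch: relate pmap mapping sizes (in KiB) to vmtouch page counts.
// ASSUMPTION: 4 KiB pages; the slides do not state the page size.
final class PageMath {
    static final long PAGE_SIZE_BYTES = 4096; // assumed page size

    // Pages needed to hold a mapping of the given size in KiB, rounded up.
    static long pagesForKiB(long sizeKiB) {
        long bytes = sizeKiB * 1024;
        return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
    }

    // Whole MiB in the mapping, matching the "M" figures vmtouch prints.
    static long mibForKiB(long sizeKiB) {
        return sizeKiB / 1024;
    }

    public static void main(String[] args) {
        long[] mappingsKiB = {124928, 131072}; // sizes taken from the pmap output
        for (long kib : mappingsKiB) {
            System.out.println(kib + "K -> " + pagesForKiB(kib)
                    + " pages, " + mibForKiB(kib) + "M");
        }
        // prints:
        // 124928K -> 31232 pages, 122M
        // 131072K -> 32768 pages, 128M
    }
}
```

Under that assumption the 131072K mapping is a full 128M block (32768 pages) and the 124928K mapping is a 122M block (31232 pages), which agrees with the vmtouch totals.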
  • 16. HDFS Zero-Copy Reads
      • Applications read straight from direct byte buffers, backed by the memory-mapped block file.
      • Eliminates overhead of intermediate copy of bytes to buffer in user space.
      • Applications must change code to use a new read API on DFSInputStream:
          public ByteBuffer read(ByteBufferPool factory, int maxLength, EnumSet<ReadOption> opts)
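The idea of reading from a buffer backed by a memory-mapped file can be illustrated with plain java.nio on a local file. This is an analogy only: it does not use the HDFS read API named on the slide, and the class name, method names, and temporary file are hypothetical stand-ins for a block file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch using java.nio only; NOT the HDFS zero-copy API.
final class MappedReadSketch {
    // Map a local file read-only; the returned buffer is backed by the mapping,
    // so its bytes are not first copied into a separate heap array.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Decode the mapped bytes as UTF-8. The decode step copies into a String
    // and is here only so the demo has something printable.
    static String readAll(Path file) throws IOException {
        return StandardCharsets.UTF_8.decode(mapReadOnly(file)).toString();
    }

    // Write content to a temporary file (hypothetical stand-in for a block
    // file), read it back through the mapping, then clean up.
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("blk_", ".bin");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readAll(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello-amsterdam"));
    }
}
```

In the HDFS case the client would obtain the ByteBuffer from the read method shown on the slide rather than mapping the file itself.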
  • 17. Heterogeneous Storages for HDFS
  • 18. Goals
      • Extend HDFS to support a variety of Storage Media
      • Applications can choose their target storage
      • Use existing APIs wherever possible
  • 19. Interesting Storage Media
          Medium                   Cost          Example Use case
          Spinning Disk (HDD)      Low           High volume batch data
          Solid State Disk (SSD)   10x of HDD    HBase Tables
          RAM                      100x of HDD   Hive Materialized Views
          Your custom Media        ?             ?
  • 20. HDFS Storage Architecture - Before
  • 21. HDFS Storage Architecture - Now
  • 22. Storage Preferences
      • Introduce Storage Type per Storage Medium
      • Storage Hint from application to HDFS
          – Specifies application’s preferred Storage Type
      • Advisory
      • Subject to available space/quotas
      • Fallback Storage is HDD
          – May be configurable in the future
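The advisory nature of the storage hint on slide 22 can be sketched as a small selection function: honor the preferred storage type when it has room, otherwise fall back to HDD. Everything here is a hypothetical model for illustration (the Medium enum, the StorageHintSketch class, the free-space map). None of these names are HDFS classes, and the real placement logic also accounts for quotas and replica placement.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical model of storage media; not an HDFS type.
enum Medium { HDD, SSD, RAM }

final class StorageHintSketch {
    // The slide states that the fallback storage is HDD.
    static final Medium FALLBACK = Medium.HDD;

    // Treat the hint as advisory: use the preferred medium only if it has
    // enough free space for the write, otherwise use the fallback.
    static Medium choose(Medium preferred, long neededBytes, Map<Medium, Long> freeBytes) {
        long free = freeBytes.getOrDefault(preferred, 0L);
        return free >= neededBytes ? preferred : FALLBACK;
    }

    public static void main(String[] args) {
        Map<Medium, Long> free = new EnumMap<>(Medium.class);
        free.put(Medium.SSD, 100L);   // illustrative free space, in bytes
        free.put(Medium.HDD, 1000L);

        System.out.println(choose(Medium.SSD, 50, free));  // fits on SSD
        System.out.println(choose(Medium.SSD, 500, free)); // too large, falls back
        System.out.println(choose(Medium.RAM, 1, free));   // no RAM storage, falls back
    }
}
```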
  • 23. Storage Preferences (continued)
      • Specify preference when creating a file
          – Write replicas directly to Storage Medium of choice
      • Change preference for an existing file
          – E.g. to migrate existing file replicas from HDD to SSD
  • 24. Quota Management
      • Extend existing Quota Mechanisms
      • Administrators ensure fair distribution of limited resources
  • 25. File Creation with Storage Types
  • 26. Move existing replicas to target Storage Type
  • 27. Transient Files (Planned feature)
      • Target storage type is Memory
          – Writes will go to RAM
          – Allow short circuit writes equivalent to Short circuit reads to local in-memory block replicas
      • Checkpoint files to disk by changing storage type
      • Or discard
      • High performance writes for low volume transient data
          – e.g. Hive Materialized Views
  • 28. References
      • http://hortonworks.com/blog/heterogeneous-storages-hdfs/
      • HDFS-2832 – Heterogeneous Storages phase 1 – DataNode as a collection of storages
      • HDFS-5682 – Heterogeneous Storages phase 2 – APIs to expose Storage Types
      • HDFS-4949 – Centralized cache management in HDFS