HDFS: Optimization, Stabilization and
Supportability
April 13, 2016
Chris Nauroth
email: cnauroth@hortonworks.com
twitter: @cnauroth
© Hortonworks Inc. 2011
About Me
Chris Nauroth
• Member of Technical Staff, Hortonworks
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
• Hadoop user since 2010
– Prior employment experience deploying, maintaining and using Hadoop clusters
Page 2
Architecting the Future of Big Data
Motivation
• HDFS engineers are on the front line for operational support of Hadoop.
– HDFS is the foundational storage layer for typical Hadoop deployments.
– Therefore, challenges in HDFS have the potential to impact the entire Hadoop ecosystem.
– Conversely, application problems can become visible at the layer of HDFS operations.
• Analysis of Hadoop Support Cases
– Support case trends reveal common patterns for HDFS operational challenges.
– Those challenges inform what needs to improve in the software.
• Software Improvements
– Optimization: Identify bottlenecks and make them faster.
– Stabilization: Prevent unusual circumstances from harming cluster uptime.
– Supportability: When something goes wrong, provide visibility and tools to fix it.
Thank you to the entire community of Apache contributors.
Logging
• Logging requires a careful balance.
– Too little logging hides valuable operational information.
– Too much logging causes information overload, increased load and greater garbage collection overhead.
• Logging APIs
– Hadoop codebase currently uses a mix of logging APIs.
– Commons Logging and Log4J 1 require additional guard logic to prevent execution of expensive messages.
if (LOG.isDebugEnabled()) {
  LOG.debug("Processing block: " + block); // expensive toString() implementation!
}
– SLF4J simplifies this.
LOG.debug("Processing block: {}", block); // calls toString() only if debug enabled
• Pitfalls
– Forgotten guard logic.
– Logging in a tight loop.
– Logging while holding a shared resource, such as a mutually exclusive lock.
HADOOP-12318: better logging of LDAP exceptions
• Failure to log full details of an authentication failure.
– Very simple patch, huge payoff.
– Include exception details when logging failure.
• Before:
throw new SaslException("PLAIN auth failed: " + e.getMessage());
• After:
throw new SaslException("PLAIN auth failed: " + e.getMessage(), e);
HDFS-9434: Recommission a datanode with 500k blocks
may pause NN for 30 seconds
• Logging is too verbose
– Summary of patch: don’t log too much!
– Move detailed logging to trace level.
– It’s still accessible for edge case troubleshooting, but it doesn’t impact base operations.
• Before:
LOG.info("BLOCK* processOverReplicatedBlock: " +
"Postponing processing of over-replicated " +
block + " since storage + " + storage
+ "datanode " + cur + " does not yet have up-to-date " +
"block information.");
• After:
if (LOG.isTraceEnabled()) {
LOG.trace("BLOCK* processOverReplicatedBlock: Postponing " + block
+ " since storage " + storage
+ " does not yet have up-to-date information.");
}
Troubleshooting
• Kerberos is hard.
– Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
– Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
– When it doesn’t work, finding root cause is challenging.
• Metrics are vital for diagnosis of most operational problems.
– Metrics must be capable of showing that there is a problem. (e.g. RPC call volume spike)
– Metrics also must be capable of identifying the source of that problem. (e.g. user issuing RPC calls)
HADOOP-12426: kdiag
• Kerberos misconfiguration diagnosis.
– Attempts to diagnose multiple sources of potential Kerberos misconfiguration problems.
– DNS
– Hadoop configuration files
– KDC configuration
• kdiag: a command-line tool for diagnosis of Kerberos problems
– Automatically trigger Java diagnostics, such as -Dsun.security.krb5.debug.
– Prints various environment variables, Java system properties and Hadoop configuration options related to security.
– Attempt a login.
– If keytab used, print principal information from keytab.
– Print krb5.conf.
– Validate kinit executable (used for ticket renewals).
HDFS-6982: nntop
• Find activity trends of HDFS operations.
– HDFS audit log contains a record of each file system operation to the NameNode.
– NameNode metrics contain raw counts of operations.
– Identifying load trends from particular users or particular operations has always required ad-hoc scripting to analyze the above sources of information.
• nntop: HDFS operation counts aggregated per operation and per user within time windows.
– curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
– Look for the "TopUserOpCounts" section in the returned JSON.
"ops": [
  {
    "totalCount": 1,
    "opType": "delete",
    "topUsers": [
      {
        "count": 1,
        "user": "chris"
      }
    ]
  }
]
HDFS-7182: JMX metrics aren't accessible when NN is
busy
• Lock contention while attempting to query NameNode JMX metrics.
– JMX metrics are often queried in response to operational problems.
– Some metrics data required acquisition of a lock inside the NameNode. If another thread held this lock, then
metrics could not be accessed.
– During times of high load, the lock is likely to be held by another thread.
– At a time when the metrics are most likely to be needed, they were inaccessible.
– This patch addressed the problem by acquiring the metrics data without requiring the lock to be held.
Managing Load
• RPC call load.
– It’s too easy for a single inefficient job to overwhelm a cluster with RPC load.
– RPC servers accept calls into a single shared queue.
– Overflowing that queue causes increased latency and rejection of calls for all callers, not just the single inefficient
job that caused the problem.
– Load problems can be mitigated with enhanced admission control, client back-off and throttling policies
tailored to real-world usage patterns.
HADOOP-10282: FairCallQueue
• Hadoop RPC Architecture
– Traditionally, Hadoop RPC internally admits incoming RPC calls into a single shared queue.
– Worker threads consume the incoming calls from that shared queue and process them.
– In an overloaded situation, calls spend more time waiting in the queue for a worker thread to become available.
– At the extreme, the queue overflows, which then requires rejecting the calls.
– This tends to punish all callers, not just the caller that triggered the unusually high load.
• RPC Congestion Control with FairCallQueue
– Replace single shared queue with multiple prioritized queues.
– Call is placed into a queue with priority selected based on the calling user’s current history.
– Calls are dequeued and processed with greater frequency from higher-priority queues.
– Under normal operations, when the RPC server can keep up with load, this is not noticeably different from the
original architecture.
– Under high load, this tends to deprioritize users triggering unusually high load, thus allowing room for other
processes to make progress. There is less risk of a single runaway job overwhelming a cluster.
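The prioritization above can be sketched with a toy model, purely illustrative and not the actual FairCallQueue implementation: callers responsible for a larger share of recent traffic are assigned lower-priority queues, and the dequeue loop drains higher-priority queues first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of the FairCallQueue idea (names and thresholds are
// illustrative, not Hadoop's): prioritize calls by the caller's recent share
// of total traffic.
public class FairCallQueueSketch {
  static final int NUM_QUEUES = 4;
  final List<Queue<String>> queues = new ArrayList<>();
  final Map<String, Integer> recentCalls = new HashMap<>();
  int totalCalls = 0;

  FairCallQueueSketch() {
    for (int i = 0; i < NUM_QUEUES; i++) queues.add(new ArrayDeque<>());
  }

  // Heavier users (larger share of recent traffic) get a higher index,
  // i.e. a lower-priority queue.
  int priorityFor(String user) {
    if (totalCalls == 0) return 0;
    double share = recentCalls.getOrDefault(user, 0) / (double) totalCalls;
    if (share > 0.5) return 3;
    if (share > 0.25) return 2;
    if (share > 0.125) return 1;
    return 0;
  }

  void add(String user, String call) {
    queues.get(priorityFor(user)).add(call);
    recentCalls.merge(user, 1, Integer::sum);
    totalCalls++;
  }

  // Serve the highest-priority non-empty queue first. (The real design uses
  // a weighted scheduler; strict priority keeps the sketch short.)
  String take() {
    for (Queue<String> q : queues) {
      if (!q.isEmpty()) return q.poll();
    }
    return null;
  }
}
```

With this model, a light user's call enqueued behind a heavy user's burst is still served ahead of the bulk of the burst.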
HADOOP-10597: RPC Server signals backoff to clients
when all request queues are full
• Client-side backoff from overloaded RPC servers.
– Builds upon work of the RPC FairCallQueue.
– If an RPC server’s queue is full, then optionally send a signal to additional incoming clients to request backoff.
– Clients are aware of the signal, and react by performing exponential backoff before sending additional calls.
– Improves quality of service for clients when server is under heavy load. RPC calls that would have failed will
instead succeed, but with longer latency.
– Improves likelihood of server recovering, because client backoff will give it more opportunity to catch up.
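The client-side reaction can be sketched as a capped exponential backoff helper; the names and the "busy server" model below are hypothetical stand-ins, not the actual Hadoop IPC retry policy.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of exponential backoff in reaction to a server's
// backoff signal. Delays double from a base value up to a cap.
public class BackoffSketch {
  // Delay before the n-th retry (0-based), doubling from baseMillis, capped.
  static long backoffMillis(int attempt, long baseMillis, long capMillis) {
    long delay = baseMillis << Math.min(attempt, 20); // shift-capped to avoid overflow
    return Math.min(delay, capMillis);
  }

  // Retry a call, sleeping for an exponentially growing interval whenever the
  // server signals backoff (modeled here as a boolean supplier). Returns the
  // attempt number that succeeded, or -1 if all attempts were rejected.
  static int callWithBackoff(BooleanSupplier serverBusy, int maxAttempts)
      throws InterruptedException {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (!serverBusy.getAsBoolean()) {
        return attempt; // call accepted
      }
      Thread.sleep(backoffMillis(attempt, 1, 50)); // tiny values for the sketch
    }
    return -1; // gave up
  }
}
```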
HADOOP-12916: Allow RPC scheduler/callqueue backoff
using response times
• More flexibility in back-off policies.
– Triggering backoff when the queue is full is in some sense too late. The problem has already grown too severe.
– Instead, track call response time, and trigger backoff when response time exceeds bounds.
– Any amount of queueing increases RPC response latency. Reacting to unusually high RPC response time can
prevent the problem from becoming so severe that the queue overflows.
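A minimal sketch of the idea, assuming an exponentially decayed average as the response-time tracker (the real patch's accounting may differ): backoff is signaled as soon as the average crosses a threshold, well before the queue overflows.

```java
// Hypothetical sketch of response-time-based backoff. The weight and
// threshold values are illustrative, not Hadoop defaults.
public class ResponseTimeBackoff {
  static final double ALPHA = 0.2; // weight of the newest sample
  final double thresholdMillis;
  double avgMillis = 0.0;

  ResponseTimeBackoff(double thresholdMillis) {
    this.thresholdMillis = thresholdMillis;
  }

  // Record one completed call's response time and report whether new
  // callers should be told to back off.
  boolean record(double responseMillis) {
    avgMillis = ALPHA * responseMillis + (1 - ALPHA) * avgMillis;
    return avgMillis > thresholdMillis;
  }
}
```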
Performance
• Garbage Collection
– NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
– Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are
larger, therefore the memory footprint has increased for tracking block state.)
– Much has been written about garbage collection tuning for large heap JVM processes.
– In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage
collection pressure.
• Block Reporting
– The process by which DataNodes report information about their stored blocks to the NameNode.
– Full Block Report: a complete catalog of all of the node’s blocks, sent infrequently.
– Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.
– All block reporting occurs asynchronously with respect to user-facing operations, so it does not impact end user latency directly.
– However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user
operations sufficiently.
HDFS-7097: Allow block reports to be processed during
checkpointing on standby name node
• Coarse-grained locking impedes block report processing.
– NameNode has a global lock required to enforce mutual exclusion for some operations.
– One such operation is checkpointing performed at the HA standby NameNode: process of creating a new fsimage
representing the full metadata state and beginning a new edit log. This can take a long time in large clusters.
– Block report processing also required holding the lock, and therefore could not proceed during a checkpoint.
• Coarse-grained lock contention can lead to cascading failure and downtime.
– Checkpointing holds lock.
– Frequent incremental block reports from DataNodes block waiting to acquire lock.
– Eventually consumes all available RPC handler threads, all waiting to acquire lock.
– In extreme case, blocks HA NameNode failover, because there is no RPC handler thread available to handle the
failover request.
– Even if HA failover can succeed, may still leave cluster in a state where it appears many nodes have gone dead,
because their blocked heartbeats couldn’t be processed.
• Solution: allow block report processing without holding global lock.
– Block reports now can be processed concurrently with a checkpoint in progress.
– Like most multi-threading and locking logic, required careful reasoning to ensure change was safe.
HDFS-7435: PB encoding of block reports is very inefficient
• Block report RPC message encoding can cause memory allocation inefficiency and garbage
collection churn.
– HDFS RPC messages are encoded using Protocol Buffers.
– Block reports encoded each block ID, length and generation stamp in a Protocol Buffers repeated long field.
– Behind the scenes, this becomes an ArrayList with a default capacity of 10.
– DataNodes in large clusters almost always send a larger block report than this, so ArrayList reallocation churn is almost
guaranteed.
– Data type contained in the ArrayList is Long (note capitalization, not primitive long).
– Boxing and unboxing causes additional allocation requirements.
• Solution: a more GC-friendly encoding of block reports.
– Within the Protocol Buffers RPC message, take over serialization directly.
– Manually encode number of longs, followed by list of primitive longs.
– Eliminates ArrayList reallocation costs.
– Eliminates boxing and unboxing costs by deserializing straight to primitive long.
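The encoding idea can be illustrated with plain DataOutputStream/DataInputStream in place of the Protocol Buffers wire format (an assumption made to keep the sketch self-contained): write an explicit count followed by primitive longs, and read straight back into a long[] with no boxing and no ArrayList growth.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the GC-friendly encoding idea behind HDFS-7435, using stdlib
// streams rather than the actual Protocol Buffers embedding.
public class BlockListCodec {
  static byte[] encode(long[] values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(values.length);          // number of longs, written up front
    for (long v : values) out.writeLong(v); // primitive longs, no boxing
    return bytes.toByteArray();
  }

  static long[] decode(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    long[] values = new long[in.readInt()]; // sized exactly once
    for (int i = 0; i < values.length; i++) values[i] = in.readLong();
    return values;
  }
}
```

Because the count is known before any element is read, the decoder allocates one exactly-sized primitive array instead of a growing list of boxed Longs.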
HDFS-7609: Avoid retry cache collision when Standby
NameNode loading edits
• Idempotence and at-most-once delivery of HDFS RPC messages.
– Some RPC message processing is inherently idempotent: can be applied multiple times, and the final result is still
the same. Example: setPermission.
– Other messages are not inherently idempotent, but the NameNode can still provide an “at-most-once” processing
guarantee by temporarily tracking recently executed operations by a unique call ID. Example: rename.
– The data structure that does this is called the RetryCache.
– This is important in failure modes, such as an HA failover or a network partition, which may cause a client to send
the same message more than once.
• Erroneous multiple RetryCache entries for same operation.
– Duplicate entries caused slowdown.
– Particularly noticeable during an HA transition.
– Bug fix to prevent duplicate entries.
HDFS-9710: Change DN to send block receipt IBRs in
batches
• Incremental block reports trigger multiple RPC calls.
– When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
– Even multiple block receipts translate to multiple individual incremental block report RPCs.
– With consideration of all DataNodes in a large cluster, this can become a huge number of RPC messages for the
NameNode to process.
• Solution: batch multiple block receipt events into a single RPC message.
– Reduces RPC overhead of sending multiple messages.
– Scales better with respect to number of nodes and number of blocks in a cluster.
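A toy batcher illustrates the pattern (the real DataNode also flushes on a heartbeat timer; this sketch flushes only when the batch fills, and the names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching idea in HDFS-9710: buffer block-receipt events and
// send them as one message, instead of one RPC per receipt.
public class IbrBatcher {
  final int batchSize;
  final List<Long> pending = new ArrayList<>();
  final List<List<Long>> sentBatches = new ArrayList<>(); // stands in for RPCs

  IbrBatcher(int batchSize) {
    this.batchSize = batchSize;
  }

  void blockReceived(long blockId) {
    pending.add(blockId);
    if (pending.size() >= batchSize) flush();
  }

  // One "RPC" carries every buffered receipt.
  void flush() {
    if (!pending.isEmpty()) {
      sentBatches.add(new ArrayList<>(pending));
      pending.clear();
    }
  }
}
```

Seven receipts with a batch size of three produce three messages instead of seven, and the reduction grows with cluster size and write rate.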
Liveness
• "...make progress despite the fact that its concurrently executing components ("processes") may
have to "take turns" in critical sections, parts of the program that cannot be simultaneously run
by multiple processes." -Wikipedia
• DataNode Heartbeats
– Responsible for reporting health of a DataNode to the NameNode.
– Operational problems of managing load and performance can block timely heartbeat processing.
– Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and
asynchronous dispatch of commands (e.g. delete block).
• Blocked heartbeat processing can cause cascading failure and downtime.
– Blocked heartbeat processing can make the NameNode think DataNodes are not heartbeating at all, and
therefore are not running.
– DataNodes that stop running are flagged by the NameNode as dead.
– Too many dead DataNodes makes the cluster inoperable as a whole.
– Dead DataNodes must have their replicas copied to other DataNodes to satisfy replication requirements.
– Erroneously flagging DataNodes as dead can cause a storm of wasteful re-replication activity.
HDFS-9239: DataNode Lifeline Protocol: an alternative
protocol for reporting DataNode health
• The lifeline keeps the DataNode alive, despite conditions of unusually high load.
– Optionally run a separate RPC server within the NameNode dedicated to processing of lifeline messages sent by
DataNodes.
– Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for
asynchronous command dispatch, and therefore do not need to contend on a shared lock.
– Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
– Prevents erroneous and costly re-replication activity.
HDFS-9311: Support optional offload of NameNode HA
service health checks to a separate RPC server.
• RPC offload of HA health check and failover messages.
– Similar to problem of timely heartbeat message delivery.
– NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the
NameNode.
– Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
– A NameNode overwhelmed with unusually high load cannot process these messages.
– Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged
outage period.
– The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the
case of unusually high load.
Optimizing Applications
• HDFS Utilization Patterns
– Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
– The FileSystem API can unfortunately make it too easy to implement inefficient call patterns.
HIVE-10223: Consolidate several redundant FileSystem
API calls.
• Hadoop FileSystem API can cause applications to make redundant RPC calls.
• Before:
if (fs.isFile(file)) { // RPC #1
...
} else if (fs.isDirectory(file)) { // RPC #2
...
}
• After:
FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
if (fileStatus.isFile()) { // Local, no RPC
...
} else if (fileStatus.isDirectory()) { // Local, no RPC
...
}
• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
PIG-4442: Eliminate redundant RPC call to get file
information in HPath.
• A similar story of redundant RPC within Pig code.
• Before:
long blockSize = fs.getHFS().getFileStatus(path).getBlockSize(); // RPC #1
short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2
• After:
FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
long blockSize = fileStatus.getBlockSize(); // Local, no RPC
short replication = fileStatus.getReplication(); // Local, no RPC
• Revealed from inspection of HDFS audit log.
– HDFS audit log shows a record of each file system operation executed against the NameNode.
– This continues to be one of the most significant sources of HDFS troubleshooting information.
– In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a
Pig job submission.
HDFS-9924: Asynchronous HDFS Access
• Current Hadoop FileSystem API is inherently synchronous.
– Issue a single synchronous file system call.
– In the case of HDFS, that call is implemented with a synchronous RPC.
– Block waiting for the result.
– Then, client application may proceed.
• Some application usage patterns would benefit from asynchronous access.
– Some applications regularly issue a large sequence of multiple file system calls, with no data dependencies
between the results of those calls.
– For example, Hive partition logic can involve hundreds or thousands of rename operations, where each rename
can execute independently, with no data dependencies on the results of other renames.
public Future<Boolean> rename(Path src, Path dst) throws IOException;
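The benefit of independent calls can be sketched with an executor and futures; the rename below is a local stand-in, not the real DistributedFileSystem RPC, and the pool size is an arbitrary choice for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the usage pattern asynchronous access enables: launch many
// independent "rename" operations at once and collect all results, instead
// of blocking on each call in sequence.
public class AsyncRenameSketch {
  // Placeholder for a synchronous RPC; succeeds unless src and dst match.
  static boolean rename(String src, String dst) {
    return !src.equals(dst);
  }

  static List<Boolean> renameAll(List<String[]> pairs) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<Boolean>> futures = new ArrayList<>();
      for (String[] p : pairs) {
        // All renames are in flight concurrently; none waits on another.
        futures.add(CompletableFuture.supplyAsync(() -> rename(p[0], p[1]), pool));
      }
      List<Boolean> results = new ArrayList<>();
      for (CompletableFuture<Boolean> f : futures) results.add(f.join());
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```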
Summary
• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational
storage layer of the Hadoop ecosystem.
• Optimization
– Performance
– Optimizing Applications
• Stabilization
– Liveness
– Managing Load
• Supportability
– Logging
– Troubleshooting
Thank you!
Q&A

Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
Chris Nauroth
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
Big Data Joe™ Rossi
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
Alfresco Software
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
pbelko82
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
Owen O'Malley
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
HBaseCon
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
Chris Nauroth
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
Big Data Joe™ Rossi
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
Alfresco Software
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
pbelko82
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
Owen O'Malley
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
HBaseCon
 
Ad

Recently uploaded (20)

EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 

Hdfs 2016-hadoop-summit-dublin-v1

• Pitfalls
– Forgotten guard logic.
– Logging in a tight loop.
– Logging while holding a shared resource, such as a mutually exclusive lock.
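The guard-vs-parameterized contrast above can be sketched with a toy logger. This is purely illustrative (it is not the SLF4J implementation); the point is that with a `{}` placeholder the argument's expensive `toString()` runs only when the level is enabled.

```java
// Minimal sketch of lazy parameterized logging (illustrative, not SLF4J itself).
public class LazyLog {
  private final boolean debugEnabled;

  public LazyLog(boolean debugEnabled) {
    this.debugEnabled = debugEnabled;
  }

  // Formats the message only when debug is enabled; otherwise the argument's
  // toString() is never invoked, mirroring SLF4J's {} placeholder behavior.
  public String debug(String format, Object arg) {
    if (!debugEnabled) {
      return null; // message suppressed; arg.toString() never called
    }
    return format.replace("{}", String.valueOf(arg));
  }

  // An argument with a deliberately instrumented toString(), standing in for
  // an expensive implementation such as a block's.
  public static class CountingBlock {
    public int toStringCalls = 0;
    @Override
    public String toString() {
      toStringCalls++;
      return "blk_42";
    }
  }

  public static void main(String[] args) {
    CountingBlock block = new CountingBlock();
    new LazyLog(false).debug("Processing block: {}", block);
    System.out.println("toString calls with debug off: " + block.toStringCalls);
    new LazyLog(true).debug("Processing block: {}", block);
    System.out.println("toString calls with debug on: " + block.toStringCalls);
  }
}
```

The same property is why forgotten guard logic is a pitfall with string concatenation: `"Processing block: " + block` evaluates `toString()` before the logger ever checks the level.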
• 5. HADOOP-12318: better logging of LDAP exceptions
• Failure to log full details of an authentication failure.
– Very simple patch, huge payoff.
– Include exception details when logging failure.
• Before:
  throw new SaslException("PLAIN auth failed: " + e.getMessage());
• After:
  throw new SaslException("PLAIN auth failed: " + e.getMessage(), e);
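The whole fix is the extra constructor argument: wrapping with the cause preserves the root failure (and its stack trace) for the log. A minimal sketch, using plain `Exception` in place of `SaslException` to stay self-contained:

```java
// Sketch of the HADOOP-12318 pattern: wrapping an exception WITH its cause
// keeps the root failure available; dropping the cause loses it.
public class CauseChain {
  public static Exception wrapWithoutCause(Exception e) {
    return new Exception("PLAIN auth failed: " + e.getMessage());
  }

  public static Exception wrapWithCause(Exception e) {
    // The extra constructor argument is the entire fix.
    return new Exception("PLAIN auth failed: " + e.getMessage(), e);
  }

  public static void main(String[] args) {
    Exception root = new IllegalStateException("LDAP server unreachable");
    System.out.println(wrapWithoutCause(root).getCause()); // null: details lost
    System.out.println(wrapWithCause(root).getCause());    // root kept
  }
}
```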
• 6. HDFS-9434: Recommission a datanode with 500k blocks may pause NN for 30 seconds
• Logging is too verbose.
– Summary of patch: don't log too much!
– Move detailed logging to trace level.
– It's still accessible for edge case troubleshooting, but it doesn't impact base operations.
• Before:
  LOG.info("BLOCK* processOverReplicatedBlock: " +
      "Postponing processing of over-replicated " + block +
      " since storage + " + storage + "datanode " + cur +
      " does not yet have up-to-date " + "block information.");
• After:
  if (LOG.isTraceEnabled()) {
    LOG.trace("BLOCK* processOverReplicatedBlock: Postponing " + block +
        " since storage " + storage +
        " does not yet have up-to-date information.");
  }
• 7. Troubleshooting
• Kerberos is hard.
– Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
– Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
– When it doesn't work, finding root cause is challenging.
• Metrics are vital for diagnosis of most operational problems.
– Metrics must be capable of showing that there is a problem. (e.g. RPC call volume spike)
– Metrics also must be capable of identifying the source of that problem. (e.g. user issuing RPC calls)
• 8. HADOOP-12426: kdiag
• Kerberos misconfiguration diagnosis.
– Attempts to diagnose multiple sources of potential Kerberos misconfiguration problems:
– DNS
– Hadoop configuration files
– KDC configuration
• kdiag: a command-line tool for diagnosis of Kerberos problems
– Automatically trigger Java diagnostics, such as -Dsun.security.krb5.debug.
– Prints various environment variables, Java system properties and Hadoop configuration options related to security.
– Attempt a login.
– If a keytab is used, print principal information from the keytab.
– Print krb5.conf.
– Validate the kinit executable (used for ticket renewals).
• 9. HDFS-6982: nntop
• Find activity trends of HDFS operations.
– HDFS audit log contains a record of each file system operation to the NameNode.
– NameNode metrics contain raw counts of operations.
– Identifying load trends from particular users or particular operations has always required ad-hoc scripting to analyze the above sources of information.
• nntop: HDFS operation counts aggregated per operation and per user within time windows.
– curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
– Look for the "TopUserOpCounts" section in the returned JSON.
  "ops": [
    {
      "totalCount": 1,
      "opType": "delete",
      "topUsers": [
        {
          "count": 1,
          "user": "chris"
        }
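The aggregation idea behind nntop can be sketched as a per-(operation, user) counter; this is an illustrative data structure only, not the actual nntop code, and the method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of nntop's aggregation: count operations per
// (opType, user) so the top users per operation can be reported.
public class TopOps {
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  // Record one audited operation, e.g. record("delete", "chris").
  public void record(String opType, String user) {
    counts.computeIfAbsent(opType, k -> new HashMap<>())
          .merge(user, 1, Integer::sum);
  }

  public int count(String opType, String user) {
    return counts.getOrDefault(opType, new HashMap<>()).getOrDefault(user, 0);
  }

  // User with the highest count for an operation, or null if none recorded.
  public String topUser(String opType) {
    Map<String, Integer> perUser = counts.get(opType);
    if (perUser == null) return null;
    return perUser.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey).orElse(null);
  }

  public static void main(String[] args) {
    TopOps top = new TopOps();
    top.record("delete", "chris");
    top.record("listStatus", "etl");
    top.record("listStatus", "etl");
    System.out.println(top.topUser("listStatus"));
  }
}
```

The real nntop additionally keeps rolling time windows so that old activity ages out of the counts.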
• 10. HDFS-7182: JMX metrics aren't accessible when NN is busy
• Lock contention while attempting to query NameNode JMX metrics.
– JMX metrics are often queried in response to operational problems.
– Some metrics data required acquisition of a lock inside the NameNode. If another thread held this lock, then metrics could not be accessed.
– During times of high load, the lock is likely to be held by another thread.
– At a time when the metrics are most likely to be needed, they were inaccessible.
– This patch addressed the problem by acquiring the metrics data without requiring the lock to be held.
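One common way to make metrics readable without the operation lock is to keep them in atomic fields, so a reader can snapshot them at any time. A minimal sketch of that design choice (the field and method names here are hypothetical, not the HDFS-7182 code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the idea behind HDFS-7182: keep metric values in atomic fields
// so a metrics reader never needs the lock that operations hold.
public class LockFreeMetrics {
  private final AtomicLong filesTotal = new AtomicLong();
  private final AtomicLong blocksTotal = new AtomicLong();

  // Writers (operations that may hold the namesystem lock) update atomically...
  public void fileCreated(long newBlocks) {
    filesTotal.incrementAndGet();
    blocksTotal.addAndGet(newBlocks);
  }

  // ...so readers (e.g. a JMX query thread) can read without acquiring any lock.
  public long getFilesTotal() { return filesTotal.get(); }
  public long getBlocksTotal() { return blocksTotal.get(); }

  public static void main(String[] args) {
    LockFreeMetrics m = new LockFreeMetrics();
    m.fileCreated(3);
    System.out.println(m.getFilesTotal() + " files, " + m.getBlocksTotal() + " blocks");
  }
}
```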
• 11. Managing Load
• RPC call load.
– It's too easy for a single inefficient job to overwhelm a cluster with too much RPC load.
– RPC servers accept calls into a single shared queue.
– Overflowing that queue causes increased latency and rejection of calls for all callers, not just the single inefficient job that caused the problem.
– Load problems can be mitigated with enhanced admission control, client back-off and throttling policies tailored to real-world usage patterns.
• 12. HADOOP-10282: FairCallQueue
• Hadoop RPC Architecture
– Traditionally, Hadoop RPC internally admits incoming RPC calls into a single shared queue.
– Worker threads consume the incoming calls from that shared queue and process them.
– In an overloaded situation, calls spend more time waiting in the queue for a worker thread to become available.
– At the extreme, the queue overflows, which then requires rejecting the calls.
– This tends to punish all callers, not just the caller that triggered the unusually high load.
• RPC Congestion Control with FairCallQueue
– Replace single shared queue with multiple prioritized queues.
– Call is placed into a queue with priority selected based on the calling user's current history.
– Calls are dequeued and processed with greater frequency from higher-priority queues.
– Under normal operations, when the RPC server can keep up with load, this is not noticeably different from the original architecture.
– Under high load, this tends to deprioritize users triggering unusually high load, thus allowing room for other processes to make progress. There is less risk of a single runaway job overwhelming a cluster.
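The prioritized-queue scheme above can be sketched with two queues and a weighted dequeue. This is a heavily simplified illustration under stated assumptions (a fixed 3:1 weight, a boolean "heavy user" flag); the real FairCallQueue has multiple levels and a decaying per-user scheduler:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified FairCallQueue sketch: calls from heavy users land in a low-priority
// queue; dequeueing serves the high-priority queue more often by a fixed weight.
public class FairQueue {
  private final Deque<String> high = new ArrayDeque<>();
  private final Deque<String> low = new ArrayDeque<>();
  private int turn = 0;                      // rotates the weighted schedule
  private static final int HIGH_WEIGHT = 3;  // serve high 3x as often as low

  // heavyUser stands in for "caller whose recent call volume was unusually high".
  public void offer(String call, boolean heavyUser) {
    (heavyUser ? low : high).addLast(call);
  }

  public String poll() {
    boolean preferHigh = (turn++ % (HIGH_WEIGHT + 1)) < HIGH_WEIGHT;
    Deque<String> first = preferHigh ? high : low;
    Deque<String> second = preferHigh ? low : high;
    if (!first.isEmpty()) return first.pollFirst();
    return second.pollFirst(); // may be null if both queues are empty
  }
}
```

Note the behavior the slide describes: when the server keeps up, both queues drain and ordering barely changes; under load, the heavy user's calls wait while others make progress.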
• 13. HADOOP-10597: RPC Server signals backoff to clients when all request queues are full
• Client-side backoff from overloaded RPC servers.
– Builds upon work of the RPC FairCallQueue.
– If an RPC server's queue is full, then optionally send a signal to additional incoming clients to request backoff.
– Clients are aware of the signal, and react by performing exponential backoff before sending additional calls.
– Improves quality of service for clients when the server is under heavy load. RPC calls that would have failed will instead succeed, but with longer latency.
– Improves likelihood of the server recovering, because client backoff will give it more opportunity to catch up.
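The exponential backoff reaction can be sketched as a pure delay schedule; the base, cap, and method name here are illustrative assumptions, not Hadoop's actual values:

```java
// Sketch of client-side exponential backoff after a server backoff signal:
// each consecutive rejection doubles the wait, up to a cap. (Real clients
// usually add random jitter; omitted here to keep the schedule deterministic.)
public class Backoff {
  static final long BASE_MILLIS = 100;    // assumed first delay
  static final long MAX_MILLIS = 10_000;  // assumed cap

  // Delay before retry number `attempt` (0-based).
  public static long delayMillis(int attempt) {
    int shift = Math.min(attempt, 20);    // avoid overflow for large attempts
    return Math.min(BASE_MILLIS << shift, MAX_MILLIS);
  }

  public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
      System.out.println("attempt " + i + ": wait " + delayMillis(i) + " ms");
    }
  }
}
```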
• 14. HADOOP-12916: Allow RPC scheduler/callqueue backoff using response times
• More flexibility in back-off policies.
– Triggering backoff when the queue is full is in some sense too late. The problem has already grown too severe.
– Instead, track call response time, and trigger backoff when response time exceeds bounds.
– Any amount of queueing increases RPC response latency. Reacting to unusually high RPC response time can prevent the problem from becoming so severe that the queue overflows.
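One way to track "response time exceeds bounds" is an exponentially weighted moving average with a threshold. This is a sketch of the general technique only; the smoothing factor, threshold, and class name are assumptions, not the HADOOP-12916 implementation:

```java
// Sketch of response-time-based backoff: keep an exponentially weighted
// moving average (EWMA) of RPC response time and signal backoff when it
// exceeds a threshold, before the queue ever overflows.
public class ResponseTimeBackoff {
  private double avgMillis = 0;
  private static final double ALPHA = 0.2;        // assumed EWMA smoothing factor
  private static final double THRESHOLD_MS = 500; // assumed backoff trigger

  public void recordResponse(double millis) {
    avgMillis = ALPHA * millis + (1 - ALPHA) * avgMillis;
  }

  public boolean shouldBackoff() {
    return avgMillis > THRESHOLD_MS;
  }

  public double average() {
    return avgMillis;
  }
}
```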
• 15. Performance
• Garbage Collection
– NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
– Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are larger, therefore the memory footprint has increased for tracking block state.)
– Much has been written about garbage collection tuning for large heap JVM processes.
– In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage collection pressure.
• Block Reporting
– The process by which DataNodes report information about their stored blocks to the NameNode.
– Full Block Report: a complete catalog of all of the node's blocks, sent infrequently.
– Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.
– All block reporting occurs asynchronously from any user-facing operations, so it does not impact end user latency directly.
– However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user operations sufficiently.
• 16. HDFS-7097: Allow block reports to be processed during checkpointing on standby name node
• Coarse-grained locking impedes block report processing.
– NameNode has a global lock required to enforce mutual exclusion for some operations.
– One such operation is checkpointing performed at the HA standby NameNode: the process of creating a new fsimage representing the full metadata state and beginning a new edit log. This can take a long time in large clusters.
– Block report processing also required holding the lock, and therefore could not proceed during a checkpoint.
• Coarse-grained lock contention can lead to cascading failure and downtime.
– Checkpointing holds the lock.
– Frequent incremental block reports from DataNodes block, waiting to acquire the lock.
– Eventually this consumes all available RPC handler threads, all waiting to acquire the lock.
– In the extreme case, this blocks HA NameNode failover, because there is no RPC handler thread available to handle the failover request.
– Even if HA failover can succeed, it may still leave the cluster in a state where it appears many nodes have gone dead, because their blocked heartbeats couldn't be processed.
• Solution: allow block report processing without holding the global lock.
– Block reports now can be processed concurrently with a checkpoint in progress.
– Like most multi-threading and locking logic, this required careful reasoning to ensure the change was safe.
• 17. HDFS-7435: PB encoding of block reports is very inefficient
• Block report RPC message encoding can cause memory allocation inefficiency and garbage collection churn.
– HDFS RPC messages are encoded using Protocol Buffers.
– Block reports encoded each block ID, length and generation stamp in a Protocol Buffers repeated long field.
– Behind the scenes, this becomes an ArrayList with a default capacity of 10.
– DataNodes in large clusters almost always send a larger block report than this, so ArrayList reallocation churn is almost guaranteed.
– The data type contained in the ArrayList is Long (note capitalization, not primitive long).
– Boxing and unboxing causes additional allocation requirements.
• Solution: a more GC-friendly encoding of block reports.
– Within the Protocol Buffers RPC message, take over serialization directly.
– Manually encode the number of longs, followed by a list of primitive longs.
– Eliminates ArrayList reallocation costs.
– Eliminates boxing and unboxing costs by deserializing straight to primitive long.
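The "count followed by primitive longs" scheme can be sketched directly. This is an illustration of the encoding idea with assumed helper names, not the HDFS wire format itself:

```java
import java.nio.ByteBuffer;

// Sketch of the HDFS-7435 idea: instead of a repeated (boxed) field, write
// the count followed by raw primitive longs, and decode straight into long[],
// avoiding both ArrayList growth and Long boxing.
public class BlockListCodec {
  // Encode: [count][one long per value...]; a block report would interleave
  // id, length, and generation stamp triples in this flat array.
  public static byte[] encode(long[] longs) {
    ByteBuffer buf = ByteBuffer.allocate(8 * (longs.length + 1));
    buf.putLong(longs.length);
    for (long v : longs) buf.putLong(v);
    return buf.array();
  }

  public static long[] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int n = (int) buf.getLong();
    long[] out = new long[n]; // sized exactly once, no boxed Longs
    for (int i = 0; i < n; i++) out[i] = buf.getLong();
    return out;
  }
}
```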
• 18. HDFS-7609: Avoid retry cache collision when Standby NameNode loading edits
• Idempotence and at-most-once delivery of HDFS RPC messages.
– Some RPC message processing is inherently idempotent: it can be applied multiple times, and the final result is still the same. Example: setPermission.
– Other messages are not inherently idempotent, but the NameNode can still provide an “at-most-once” processing guarantee by temporarily tracking recently executed operations by a unique call ID. Example: rename.
– The data structure that does this is called the RetryCache.
– This is important in failure modes, such as an HA failover or a network partition, which may cause a client to send the same message more than once.
• Erroneous multiple RetryCache entries for the same operation.
– Duplicate entries caused slowdown.
– Particularly noticeable during an HA transition.
– Bug fix to prevent duplicate entries.
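The at-most-once mechanism can be sketched as follows (a simplified, hypothetical model, not the actual RetryCache implementation): a non-idempotent operation records its result keyed by client ID and call ID, and a retried call with the same key replays the cached result instead of executing the operation a second time.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of an at-most-once retry cache: exactly one cache
 * entry per operation, replayed on duplicate delivery.
 */
public class RetryCacheSketch {
  private final Map<String, Boolean> cache = new HashMap<>();
  private int renameCount = 0;

  public boolean rename(String clientId, int callId) {
    String key = clientId + ":" + callId;
    Boolean cached = cache.get(key);
    if (cached != null) {
      return cached;            // duplicate delivery: replay, don't re-execute
    }
    renameCount++;              // the actual (non-idempotent) work
    boolean result = true;
    cache.put(key, result);     // record exactly one entry per operation
    return result;
  }

  public int getRenameCount() {
    return renameCount;
  }
}
```

The HDFS-7609 bug was that the same operation could end up with multiple entries in this structure, and the resulting extra work was expensive enough to slow down HA transitions noticeably.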
• 19. HDFS-9710: Change DN to send block receipt IBRs in batches
• Incremental block reports trigger multiple RPC calls.
– When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
– Multiple block receipts in quick succession therefore translate to multiple individual incremental block report RPCs.
– Considering all DataNodes in a large cluster, this can become a huge number of RPC messages for the NameNode to process.
• Solution: batch multiple block receipt events into a single RPC message.
– Reduces the RPC overhead of sending multiple messages.
– Scales better with respect to the number of nodes and number of blocks in a cluster.
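The batching idea can be sketched like this (a toy model with an invented class name, not the DataNode code): queue block receipt events as they arrive and flush the whole batch as one message, so one RPC carries many receipts.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of incremental block report batching: receipts are
 * queued and flushed as one "RPC" instead of one message per receipt.
 */
public class IbrBatcher {
  private final List<Long> pending = new ArrayList<>();
  private int rpcCount = 0;

  public void blockReceived(long blockId) {
    pending.add(blockId);       // queue the event instead of sending immediately
  }

  /** Called periodically, or when the batch grows large enough. */
  public int flush() {
    if (pending.isEmpty()) {
      return 0;                 // nothing to report, no RPC sent
    }
    rpcCount++;                 // one RPC carries the whole batch
    int sent = pending.size();
    pending.clear();
    return sent;
  }

  public int getRpcCount() {
    return rpcCount;
  }
}
```

The trade-off is a small delay before the NameNode learns of a new replica, in exchange for far fewer RPCs cluster-wide.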
• 20. Liveness
• "...make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections, parts of the program that cannot be simultaneously run by multiple processes." -Wikipedia
• DataNode Heartbeats
– Responsible for reporting the health of a DataNode to the NameNode.
– Operational problems of managing load and performance can block timely heartbeat processing.
– Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and asynchronous dispatch of commands (e.g. delete block).
• Blocked heartbeat processing can cause cascading failure and downtime.
– Blocked heartbeat processing can make the NameNode think DataNodes are not heartbeating at all, and therefore are not running.
– DataNodes that stop running are flagged by the NameNode as dead.
– Too many dead DataNodes makes the cluster inoperable as a whole.
– Dead DataNodes must have their replicas copied to other DataNodes to satisfy replication requirements.
– Erroneously flagging DataNodes as dead can cause a storm of wasteful re-replication activity.
• 21. HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health
• The lifeline keeps the DataNode alive, despite conditions of unusually high load.
– Optionally run a separate RPC server within the NameNode dedicated to processing lifeline messages sent by DataNodes.
– Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for asynchronous command dispatch, and therefore do not need to contend on a shared lock.
– Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
– Prevents erroneous and costly re-replication activity.
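Why a separate server helps can be shown with a toy queue model (invented names, not the NameNode RPC implementation): the lifeline has its own bounded queue and handler, so a lifeline message is still accepted even when the main RPC queue is saturated.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Illustrative sketch: lifeline messages get their own queue with independent
 * capacity, so they survive saturation of the main RPC queue.
 */
public class LifelineSketch {
  // Deliberately tiny main queue to model an overwhelmed NameNode.
  private final BlockingQueue<String> mainRpcQueue = new ArrayBlockingQueue<>(2);
  private final BlockingQueue<String> lifelineQueue = new ArrayBlockingQueue<>(100);

  /** Full heartbeat: competes with all other traffic on the main queue. */
  public boolean offerHeartbeat(String nodeId) {
    return mainRpcQueue.offer(nodeId);   // may be rejected under heavy load
  }

  /** Simplified lifeline message: independent capacity, independent handler. */
  public boolean offerLifeline(String nodeId) {
    return lifelineQueue.offer(nodeId);
  }
}
```

In the real feature the lifeline server is optional and must be configured explicitly; when enabled, it listens on its own port with its own handler threads.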
• 22. HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server
• RPC offload of HA health check and failover messages.
– Similar to the problem of timely heartbeat message delivery.
– NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the NameNode.
– Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
– A NameNode overwhelmed with unusually high load cannot process these messages.
– Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged outage period.
– The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the case of unusually high load.
• 23. Optimizing Applications
• HDFS Utilization Patterns
– Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
– The FileSystem API unfortunately can make it too easy to implement inefficient call patterns.
• 24. HIVE-10223: Consolidate several redundant FileSystem API calls
• The Hadoop FileSystem API can cause applications to make redundant RPC calls.
• Before:
if (fs.isFile(file)) { // RPC #1
  ...
} else if (fs.isDirectory(file)) { // RPC #2
  ...
}
• After:
FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
if (fileStatus.isFile()) { // Local, no RPC
  ...
} else if (fileStatus.isDirectory()) { // Local, no RPC
  ...
}
• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
• 25. PIG-4442: Eliminate redundant RPC call to get file information in HPath
• A similar story of redundant RPC within Pig code.
• Before:
long blockSize = fs.getHFS().getFileStatus(path).getBlockSize(); // RPC #1
short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2
• After:
FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
long blockSize = fileStatus.getBlockSize(); // Local, no RPC
short replication = fileStatus.getReplication(); // Local, no RPC
• Revealed by inspection of the HDFS audit log.
– The HDFS audit log records each file system operation executed against the NameNode.
– It continues to be one of the most significant sources of HDFS troubleshooting information.
– In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a Pig job submission.
• 26. HDFS-9924: Asynchronous HDFS Access
• The current Hadoop FileSystem API is inherently synchronous.
– Issue a single synchronous file system call.
– In the case of HDFS, that call is implemented with a synchronous RPC.
– Block waiting for the result.
– Then the client application may proceed.
• Some application usage patterns would benefit from asynchronous access.
– Some applications regularly issue a large sequence of file system calls, with no data dependencies between the results of those calls.
– For example, Hive partition logic can involve hundreds or thousands of rename operations, where each rename can execute independently, with no data dependencies on the results of other renames.
public Future<Boolean> rename(Path src, Path dst) throws IOException;
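A Future-based interface like the signature above lets a client overlap independent renames instead of blocking on one round trip at a time. A minimal sketch (this is not the HDFS-9924 implementation; the class name is invented and the "rename" here is simulated rather than a real NameNode RPC):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Illustrative sketch: wrap a blocking rename in a Future so that many
 * independent renames can be in flight concurrently.
 */
public class AsyncRenameSketch {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  /** Simulated synchronous rename; in HDFS this would be a NameNode RPC. */
  private boolean renameSync(String src, String dst) {
    return !src.equals(dst);
  }

  public CompletableFuture<Boolean> rename(String src, String dst) {
    // Issue the call without blocking the caller; collect results later.
    return CompletableFuture.supplyAsync(() -> renameSync(src, dst), pool);
  }

  public void shutdown() {
    pool.shutdown();
  }
}
```

A client would issue all the renames first, then join the futures, so total latency approaches one round trip plus server-side processing rather than thousands of sequential round trips.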
• 27. Summary
• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational storage layer of the Hadoop ecosystem.
• Optimization
– Performance
– Optimizing Applications
• Stabilization
– Liveness
– Managing Load
• Supportability
– Logging
– Troubleshooting
• 28. Thank you! Q&A

Editor's Notes

  • #3: Thank Arpit.
  • #4: We’ll look at specific Apache JIRA issues, some not yet shipped, some still in progress. Small patches often yield big wins. Sometimes those patches are even small enough to fit on a PowerPoint slide, as you’re about to see. Some are larger.
  • #5: These are common challenges for any large Java codebase, not just specific to Hadoop.
  • #6: Too little logging. Size of code change: 3 characters. Without this extra logging information, diagnosis is very challenging.
  • #7: Too much logging.
  • #8: Kerberos is notorious for obtuse error messages that don’t directly point out root cause.
  • #9: These are often steps we need to follow in any case that requires Kerberos troubleshooting. Codifying these steps into a standard tool makes gathering this information easier and more consistent.
  • #10: Helps find the naughty user who is overwhelming your cluster.
  • #15: “smoothing”
  • #16: In contrast to managing an overloaded situation, how can we more effectively handle more load?
  • #18: Garbage collection friendly data structures are particularly relevant to the NameNode, which has a large heap size requirement.
  • #19: Data structure not efficient for duplicate entries. (Not the use case.)
  • #24: We’ve talked about how HDFS can better react to overloaded conditions, and we’ve talked about improving HDFS to handle more total load. What is the source of that load? Is it legitimate?
  • #26: I encourage you to explore and analyze the HDFS audit log in your clusters.
  • #27: Improving the API to encourage more efficient applications.
  • #28: Performance of HDFS itself and also optimizing applications.