Hadoop & Cloud Storage: Object Store Integration in Production (Chris Nauroth)
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object stores such as Amazon S3, Azure Blob Storage, and Google Cloud Storage (GCS), which are designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the object store FileSystem connectors, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users who choose to integrate Hadoop with cloud storage. We use Amazon S3 and the S3A connector as a case study.
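As a hedged illustration of that pluggable FileSystem interface, the sketch below reads a file through the S3A connector using the ordinary Hadoop client API; the bucket name, object path, and inline credentials are placeholder assumptions (production deployments would normally use a credential provider chain or instance roles instead).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials shown inline only for brevity; prefer a credential
    // provider or instance profile in real deployments.
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");   // placeholder
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");   // placeholder

    // The same FileSystem API serves hdfs://, s3a://, wasb://, and others.
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
    try (FSDataInputStream in = fs.open(new Path("s3a://example-bucket/data/part-00000"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      System.out.println(reader.readLine());
    }
  }
}
```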
Keep Your Hadoop Cluster at Its Best! (Chris Nauroth)
Hadoop has become a backbone of many enterprises. While it can do wonders for businesses, it can sometimes be overwhelming for its operators and users. Novice and seasoned Hadoop operators alike are caught unawares by common pitfalls of deploying, tuning and operating a Hadoop cluster. Having spent 5+ years working with hundreds of Hadoop users, running clusters with thousands of nodes, managing tens of petabytes of data and running hundreds of thousands of tasks per day, we have seen unintentional acts, suboptimal configurations and common mistakes result in downtime, SLA violations, many hours of recovery operations and, in some cases, even data loss! Most of these traumas could have been easily avoided by applying easy-to-follow best practices that protect data and optimize performance. In this talk we present real-life stories, common pitfalls and, most importantly, strategies for correctly deploying and managing Hadoop clusters. The talk will empower users and help make their Hadoop journey more fulfilling and rewarding. We will also discuss SmartSense, which can identify latent problems in a cluster and provide recommendations so that an operator can fix them before they manifest as a service degradation or outage.
The document summarizes recommendations for efficiently and effectively managing Apache Hadoop based on observations from analyzing over 1,000 customer bundles. It covers common operational mistakes like inconsistent operating system configurations involving locale, transparent huge pages, NTP, and legacy kernel issues. It also provides recommendations for optimizing configurations involving HDFS name node and data node settings, YARN resource manager and node manager memory settings, and YARN ATS timeline storage. The presentation encourages adopting recommendations built into the SmartSense analytics product to improve cluster operations and prevent issues.
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
This document discusses new features in Apache Hive 2.0, including:
1) Adding procedural SQL capabilities through HPLSQL for writing stored procedures.
2) Improving query performance through LLAP which uses persistent daemons and in-memory caching to enable sub-second queries.
3) Speeding up query planning by using HBase as the metastore instead of a relational database.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized operations.
5) Default use of the cost-based optimizer and continued improvements to statistics collection and estimation.
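As a hedged aside on how clients reach HiveServer2 (the front door for LLAP-accelerated queries), a minimal JDBC connection might look like the sketch below; the host, port, credentials, and table name are placeholder assumptions, and the Hive JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 commonly listens on port 10000; host and table are placeholders.
    String url = "jdbc:hive2://hive-server.example.com:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs")) {
      if (rs.next()) {
        System.out.println("rows: " + rs.getLong(1));
      }
    }
  }
}
```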
Apache Hadoop YARN is the modern Distributed Operating System. It enables the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Multiple organizations are able to leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
In this talk, we'll first cover the current status of Apache Hadoop YARN and how it is faring today in deployments large and small. We will cover different types of YARN deployments, across different environments and scales.
We'll then move on to the exciting present and future of YARN: features that are further strengthening YARN as the first-class resource-management platform for datacenters running enterprise Hadoop. We'll discuss the current status as well as the future promise of features and initiatives such as 10x scheduler throughput improvements, Docker container support on YARN, native support for long-running services (alongside applications) without any changes, seamless application upgrades, fine-grained isolation for multi-tenancy using cgroups on disk and network resources, powerful scheduling features like application priorities and intra-queue preemption across applications, and operational enhancements including insights through Timeline Service v2, a new web UI and better queue management.
HDFS Tiered Storage: Mounting Object Stores in HDFS (DataWorks Summit)
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores in services such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous and asynchronous writes to remote clusters for business continuity planning (BCP) and hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Many enterprises are implementing Hadoop projects to manage and process large datasets. The big question is: how do you configure Hadoop clusters to connect to an enterprise directory containing 100k+ users and groups for access management? Several large enterprises have complex directory servers for managing users and groups, and many advanced features have recently been added to Hadoop user management to support various complex directory server structures.
In this session attendees will learn about: setting up Hadoop nodes with users from Active Directory for executing Hadoop jobs, setting up authentication for enterprise users, and setting up authorization for users and groups using Apache Ranger. Attendees will also learn about the common challenges faced in enterprise environments while interacting with Active Directory, including filtering out the users to be brought into Hadoop from Active Directory, restricting access to a subset of Active Directory users, handling users from nested group structures, and more.
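As one concrete, hedged example of the directory-integration settings involved, the sketch below wires Hadoop's LDAP group mapping to an Active Directory server and applies a user search filter; the server URL, bind DN, and search base are placeholder assumptions, and these values would normally live in core-site.xml rather than code.

```java
import org.apache.hadoop.conf.Configuration;

public class LdapGroupMappingExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Resolve group membership directly from the directory server.
    conf.set("hadoop.security.group.mapping",
        "org.apache.hadoop.security.LdapGroupsMapping");
    conf.set("hadoop.security.group.mapping.ldap.url",
        "ldap://ad.example.com:389");                      // placeholder host
    conf.set("hadoop.security.group.mapping.ldap.bind.user",
        "cn=hadoop-svc,ou=services,dc=example,dc=com");    // placeholder bind DN
    conf.set("hadoop.security.group.mapping.ldap.base",
        "dc=example,dc=com");                              // placeholder search base
    // Filters restrict which users and groups are pulled into Hadoop.
    conf.set("hadoop.security.group.mapping.ldap.search.filter.user",
        "(&(objectclass=user)(sAMAccountName={0}))");
    conf.set("hadoop.security.group.mapping.ldap.search.filter.group",
        "(objectclass=group)");
    System.out.println(conf.get("hadoop.security.group.mapping"));
  }
}
```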
Speakers
Sailaja Polavarapu, Staff Software Engineer, Hortonworks
Velmurugan Periasamy, Director - Engineering, Hortonworks
The document discusses best practices for operating and supporting Apache HBase. It outlines tools like the HBase UI and HBCK that can be used to debug issues. The top categories of issues covered are region server stability problems, read/write performance, and inconsistencies. SmartSense is introduced as a tool that can help detect configuration issues proactively.
The document discusses evolving HDFS to support generalized storage containers in order to better scale the number of files and blocks. It proposes using block containers and a partial namespace approach to initially scale to billions of files and blocks, and eventually much higher numbers. The storage layer is being restructured to support various container types for use cases beyond HDFS like object storage and HBase.
Hadoop Operations - Best Practices from the Field (DataWorks Summit)
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
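To make the ACL and snapshot practices above concrete, here is a minimal, hedged sketch using the HDFS client API; the warehouse path, user name, and snapshot name are placeholder assumptions.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class HdfsProtectionExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path warehouse = new Path("/data/warehouse");   // placeholder path

    // Grant a named user read/execute access without widening group permissions.
    AclEntry entry = new AclEntry.Builder()
        .setScope(AclEntryScope.ACCESS)
        .setType(AclEntryType.USER)
        .setName("etl_user")                        // placeholder user
        .setPermission(FsAction.READ_EXECUTE)
        .build();
    fs.modifyAclEntries(warehouse, Collections.singletonList(entry));

    // Snapshots give a point-in-time view to recover from accidental deletes.
    // The directory must first be made snapshottable by an admin
    // (hdfs dfsadmin -allowSnapshot /data/warehouse).
    fs.createSnapshot(warehouse, "before-nightly-etl");
  }
}
```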
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn, along with their social activities, results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourages innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments, with a variety of hardware and diverse workloads, make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe the Satellite Cluster project, which allowed us to double the number of objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and with practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
Hortonworks provides best practices for system testing Hadoop clusters. It recommends testing across different operating systems, configurations, workloads and hardware to mimic a production environment. The document outlines automating the testing process through continuous integration to test over 15,000 configurations. It provides guidance on test planning, including identifying requirements, selecting hardware and workloads to test upgrades, migrations and changes to security settings.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
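As a small, hedged illustration of the knobs this covers, the sketch below enables internal authentication and web UI ACLs through SparkConf; in practice these are usually supplied via spark-defaults.conf or spark-submit --conf, and the ACL user list is a placeholder assumption.

```java
import org.apache.spark.SparkConf;

public class SecureSparkConfExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("secure-example")
        // Require authentication of Spark's internal connections
        // (on YARN the shared secret is generated automatically).
        .set("spark.authenticate", "true")
        // Enforce access control lists on the web UI.
        .set("spark.acls.enable", "true")
        .set("spark.ui.view.acls", "analyst1,analyst2"); // placeholder users

    // Print the effective settings instead of starting a context,
    // so the sketch runs without a cluster.
    for (scala.Tuple2<String, String> kv : conf.getAll()) {
      System.out.println(kv._1() + " = " + kv._2());
    }
  }
}
```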
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes (DataWorks Summit)
The document discusses scaling HDFS to manage billions of files through distributed storage schemes. It outlines the current HDFS architecture and challenges with namespace and block scaling. It proposes a storage container architecture with distributed block maps and a storage container manager to address these challenges. This would allow HDFS to easily scale to manage trillions of blocks and billions of files across large clusters.
This document discusses Microsoft's use of Apache YARN for scale-out resource management. It describes how YARN is used to manage vast amounts of data and compute resources across many different applications and workloads. The document outlines some limitations of YARN and Microsoft's contributions to address those limitations, including Rayon for improved scheduling, Mercury and Yaq for distributed scheduling, and work on federation to scale YARN across multiple clusters. It provides details on the implementation and evaluation of these contributions through papers, JIRAs, and integration into Apache Hadoop releases.
CBlocks - POSIX-compliant file systems for HDFS (DataWorks Summit)
With YARN running Docker containers, it is possible to run applications that are not HDFS-aware inside these containers. It is hard to customize these applications since most of them assume a POSIX file system with rewrite capabilities. In this talk, we will dive into how we created a block storage layer, how it is being tested internally, and the storage containers that make it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This talk will also explore the current state of Ozone along with CBlocks: the architecture of storage containers, how replication is handled, scaling to millions of volumes, and I/O performance optimizations.
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes to YARN, it now supports running Docker containers alongside process containers. Couple this with the newly added support for running services on YARN, and it allows a host of new possibilities. In this talk, we'll present how to run a potential container cloud on YARN. Leveraging the support in YARN for Docker and services, we can allow users to spin up a bunch of Docker containers for their applications. These containers can be self-contained or wired up to form more complex applications (using the Assemblies support in YARN). We will go over some of the lessons we learned from our experiences handling issues such as resource management, debugging application failures, and running Docker.
Data is the fuel for the idea economy, and being data-driven is essential for businesses to be competitive. HPE works with all the Hadoop partners to deliver packaged solutions to become data driven. Join us in this session and you'll hear about HPE's enterprise-grade Hadoop solution, which encompasses the following:
-Infrastructure – Two industrialized solutions optimized for Hadoop: a standard solution with co-located storage and compute, and an elastic solution that lets you scale storage and compute independently to enable data sharing and prevent Hadoop cluster sprawl.
-Software – A choice of all popular Hadoop distributions, plus Hadoop ecosystem components like Spark and more, and a comprehensive utility to manage your Hadoop cluster infrastructure.
-Services – HPE's data center experts have designed some of the largest Hadoop clusters in the world and can help you design the right Hadoop infrastructure to avoid performance issues and future-proof you against Hadoop cluster sprawl.
-Add-on solutions – Hadoop needs more to fill in the gaps. HPE partners with the right ecosystem partners to bring you solutions such as industrial-grade SQL on Hadoop with Vertica, data encryption with SecureData, the SAP ecosystem with SAP HANA Vora, multitenancy with BlueData, object storage with Scality, and more.
This document summarizes a presentation about new features in Apache Hadoop 3.0 related to YARN and MapReduce. It discusses major evolutions like the re-architecture of the YARN Timeline Service (ATS) to address scalability, usability, and reliability limitations. Other evolutions mentioned include improved support for long-running native services in YARN, simplified REST APIs, service discovery via DNS, scheduling enhancements, and making YARN more cloud-friendly with features like dynamic resource configuration and container resizing. The presentation estimates the timeline for Apache Hadoop 3.0 releases with alpha, beta, and general availability targeted throughout 2017.
This document discusses integrating Docker containers with YARN by introducing a Docker container runtime to the LinuxContainerExecutor in YARN. The DockerContainerRuntime allows YARN to leverage Docker for container lifecycle management and supports features like resource isolation, Linux capabilities, privileged containers, users, networking and images. It remains a work in progress to support additional features around networking, users and images fully.
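As a hedged sketch of how an application might opt a container into the Docker runtime, the snippet below sets the environment variables that the LinuxContainerExecutor's Docker runtime recognizes on a container launch context; the image name is a placeholder assumption.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class DockerLaunchContextExample {
  // Environment variables recognized by the Docker container runtime.
  public static Map<String, String> dockerEnv() {
    Map<String, String> env = new HashMap<>();
    env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
    env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "library/centos:7"); // placeholder image
    return env;
  }

  public static ContainerLaunchContext newContext() {
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(dockerEnv());
    return ctx;
  }

  public static void main(String[] args) {
    System.out.println(newContext().getEnvironment());
  }
}
```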
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance concerns such as backup and disaster recovery (BDR) are not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
Sanjay Radia presents on evolving HDFS to support a generalized storage subsystem. HDFS currently scales well to large clusters and storage sizes but faces challenges with small files and blocks. The solution is to (1) only keep part of the namespace in memory to scale beyond memory limits and (2) use block containers of 2-16GB to reduce block metadata and improve scaling. This will generalize the storage layer to support containers for multiple use cases beyond HDFS blocks.
This document discusses running Spark applications on YARN and managing Spark clusters. It covers challenges like predictable job execution times and optimal cluster utilization. Spark on YARN is introduced as a way to leverage YARN's resource management. Techniques like dynamic allocation, locality-aware scheduling, and resource queues help improve cluster sharing and utilization for multi-tenant workloads. Security considerations for shared clusters running sensitive data are also addressed.
The document summarizes the results of a study that evaluated the performance of different Platform-as-a-Service offerings for running SQL on Hadoop workloads. The study tested Amazon EMR, Google Cloud DataProc, Microsoft Azure HDInsight, and Rackspace Cloud Big Data using the TPC-H benchmark at various data sizes up to 1 terabyte. It found that at 1TB, lower-end systems had poorer performance. In general, HDInsight running on D4 instances and Rackspace Cloud Big Data on dedicated hardware had the best scalability and execution times. The study provides insights into the performance, scalability, and price-performance of running SQL on Hadoop in the cloud.
This document provides information about Hadoop and its components. It discusses the history of Hadoop and how it has evolved over time. It describes key Hadoop components including HDFS, MapReduce, YARN, and HBase. HDFS is the distributed file system of Hadoop that stores and manages large datasets across clusters. MapReduce is a programming model used for processing large datasets in parallel. YARN is the cluster resource manager that allocates resources to applications. HBase is the Hadoop database that provides real-time random data access.
Apache Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides HDFS for distributed file storage and MapReduce as a programming model for distributed computations. Hadoop includes other technologies like YARN for resource management, Spark for fast computation, HBase for NoSQL database, and tools for data analysis, transfer, and security. Hadoop can run on-premise or in cloud environments and supports analytics workloads.
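Since MapReduce comes up throughout these summaries, the canonical word-count job is sketched below to illustrate the programming model; input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts emitted for each word.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```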
Overview of Big Data, Hadoop and Microsoft BI - version 1 (Thanh Nguyen)
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Overview of Big Data & Hadoop version 1 - Tony Nguyen (Thanh Nguyen)
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices and technologists today. However, many people still aren't sure how Big Data and an existing data warehouse can be married to turn that promise into value. This presentation provides an overview of Big Data technology and how Big Data can fit into the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
This document provides an overview of the big data technology stack, including the data layer (HDFS, S3, GPFS), data processing layer (MapReduce, Pig, Hive, HBase, Cassandra, Storm, Solr, Spark, Mahout), data ingestion layer (Flume, Kafka, Sqoop), data presentation layer (Kibana), operations and scheduling layer (Ambari, Oozie, ZooKeeper), and concludes with a brief biography of the author.
This document provides an overview and configuration instructions for Hadoop, Flume, Hive, and HBase. It begins with an introduction to each tool, including what problems they aim to solve and high-level descriptions of how they work. It then provides step-by-step instructions for downloading, configuring, and running each tool on a single node or small cluster. Specific configuration files and properties are outlined for core Hadoop components as well as integrating Flume, Hive, and HBase.
Hadoop security has improved with additions such as HDFS ACLs, Hive column-level ACLs, HBase cell-level ACLs, and Knox for perimeter security. Data encryption has also been enhanced, with support for encrypting data in transit using SSL and data at rest through file encryption or the upcoming native HDFS encryption. Authentication is provided by Kerberos/AD with token-based authorization, and auditing tracks who accessed what data.
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
This document discusses the Stinger initiative to improve the performance of Apache Hive. Stinger aims to speed up Hive queries by 100x, scale queries from terabytes to petabytes of data, and expand SQL support. Key developments include optimizing Hive to run on Apache Tez, the vectorized query execution engine, cost-based optimization using Optiq, and performance improvements from the ORC file format. The goals of Stinger Phase 3 are to deliver interactive query performance for Hive by integrating these technologies.
Hadoop is an open source software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. Hadoop supports various big data applications like HBase for distributed column storage, Hive for data warehousing and querying, Pig and Jaql for data flow languages, and Hadoop ecosystem projects for tasks like system monitoring and machine learning.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
Business intelligence analyzes data to provide actionable information for decision making. Big data, projected to be a $50 billion market by 2017, refers to technologies that capture, store, manage and analyze large, variable data collections. Hadoop is an open source framework for distributed storage and processing of large data sets on commodity hardware, enabling businesses to gain insight from massive amounts of structured and unstructured data. It involves components like HDFS for data storage and MapReduce for processing, along with others for accessing, storing, integrating, and managing data.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-through of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP), followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering authentication, authorization and auditing. Hadoop takes a "defense in depth" approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases and design of authorization improvements in HDFS, Hive and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches to Hive authorization, and show how our approach lends itself to an initial implementation in which the Hive Server owns the data, while a more general implementation remains possible down the road. In the case of HBase, cell-level authorization is explained. The talk will be fairly detailed, targeting a technical audience, including Hadoop contributors.
An Introduction to Hive and its Applications and Implementations (iaeronlineexm)
This document provides an introduction to Hive, a data warehouse infrastructure tool used for querying and analyzing large datasets in Hadoop. It discusses how Hive resides on top of Hadoop and allows users to write SQL-like queries to process structured data using MapReduce. The document also describes Hive's architecture, which includes components like the metastore for storing metadata, the HiveQL process engine for compiling queries, and the execution engine for generating MapReduce jobs. It provides examples of how Hive queries are executed and processed via the different system components.
The document provides an introduction to the Hadoop ecosystem. It discusses the history of Hadoop, originating from Google's paper on MapReduce and Google File System. It describes some of the core components of Hadoop including HDFS for storage, MapReduce for distributed processing, and additional components like Hive, Pig, and HBase. It also discusses different Hadoop distributions from companies like Cloudera, Hortonworks, MapR, and others that package and support Hadoop deployments.
Big Data Hoopla Simplified - TDWI Memphis 2014 (Rajan Kanitkar)
The document provides an overview and quick reference guide to big data concepts including Hadoop, MapReduce, HDFS, YARN, Spark, Storm, Hive, Pig, HBase and NoSQL databases. It discusses the evolution of Hadoop from versions 1 to 2, and new frameworks like Tez and YARN that allow different types of processing beyond MapReduce. The document also summarizes common big data challenges around skills, integration and analytics.
The document describes the architecture and design of the Hadoop Distributed File System (HDFS). It discusses key aspects of HDFS including its master/slave architecture with a single NameNode and multiple DataNodes. The NameNode manages the file system namespace and regulates client access, while DataNodes store and retrieve blocks of data. HDFS is designed to reliably store very large files across machines by replicating blocks of data and detecting/recovering from failures.
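To ground that NameNode/DataNode split, the hedged client-side sketch below writes and then stats a file: the FileSystem API consults the NameNode for metadata and block allocation, while the data itself streams to and from DataNodes transparently. The file path is a placeholder assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS from core-site.xml determines which NameNode we talk to.
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/example.txt");   // placeholder path

    // Write: the NameNode allocates blocks; data streams to DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Metadata (length, replication factor) comes from the NameNode.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("len=" + status.getLen()
        + " replication=" + status.getReplication());
  }
}
```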
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Versionsaimabibi60507
Copy & Past Link👉👉
https://dr-up-community.info/
Pixologic ZBrush, now developed by Maxon, is a premier digital sculpting and painting software renowned for its ability to create highly detailed 3D models. Utilizing a unique "pixol" technology, ZBrush stores depth, lighting, and material information for each point on the screen, allowing artists to sculpt and paint with remarkable precision .
Solidworks Crack 2025 latest new + license codeaneelaramzan63
Copy & Paste On Google >>> https://dr-up-community.info/
The two main methods for installing standalone licenses of SOLIDWORKS are clean installation and parallel installation (the process is different ...
Disable your internet connection to prevent the software from performing online checks during installation
Why Orangescrum Is a Game Changer for Construction Companies in 2025Orangescrum
Orangescrum revolutionizes construction project management in 2025 with real-time collaboration, resource planning, task tracking, and workflow automation, boosting efficiency, transparency, and on-time project delivery.
Discover why Wi-Fi 7 is set to transform wireless networking and how Router Architects is leading the way with next-gen router designs built for speed, reliability, and innovation.
How can one start with crypto wallet development.pptxlaravinson24
This presentation is a beginner-friendly guide to developing a crypto wallet from scratch. It covers essential concepts such as wallet types, blockchain integration, key management, and security best practices. Ideal for developers and tech enthusiasts looking to enter the world of Web3 and decentralized finance.
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...Egor Kaleynik
This case study explores how we partnered with a mid-sized U.S. healthcare SaaS provider to help them scale from a successful pilot phase to supporting over 10,000 users—while meeting strict HIPAA compliance requirements.
Faced with slow, manual testing cycles, frequent regression bugs, and looming audit risks, their growth was at risk. Their existing QA processes couldn’t keep up with the complexity of real-time biometric data handling, and earlier automation attempts had failed due to unreliable tools and fragmented workflows.
We stepped in to deliver a full QA and DevOps transformation. Our team replaced their fragile legacy tests with Testim’s self-healing automation, integrated Postman and OWASP ZAP into Jenkins pipelines for continuous API and security validation, and leveraged AWS Device Farm for real-device, region-specific compliance testing. Custom deployment scripts gave them control over rollouts without relying on heavy CI/CD infrastructure.
The result? Test cycle times were reduced from 3 days to just 8 hours, regression bugs dropped by 40%, and they passed their first HIPAA audit without issue—unlocking faster contract signings and enabling them to expand confidently. More than just a technical upgrade, this project embedded compliance into every phase of development, proving that SaaS providers in regulated industries can scale fast and stay secure.
Get & Download Wondershare Filmora Crack Latest [2025]saniaaftab72555
Copy & Past Link 👉👉
https://dr-up-community.info/
Wondershare Filmora is a video editing software and app designed for both beginners and experienced users. It's known for its user-friendly interface, drag-and-drop functionality, and a wide range of tools and features for creating and editing videos. Filmora is available on Windows, macOS, iOS (iPhone/iPad), and Android platforms.
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AIdanshalev
If we were building a GenAI stack today, we'd start with one question: Can your retrieval system handle multi-hop logic?
Trick question, b/c most can’t. They treat retrieval as nearest-neighbor search.
Today, we discussed scaling #GraphRAG at AWS DevOps Day, and the takeaway is clear: VectorRAG is naive, lacks domain awareness, and can’t handle full dataset retrieval.
GraphRAG builds a knowledge graph from source documents, allowing for a deeper understanding of the data + higher accuracy.
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...Andre Hora
Unittest and pytest are the most popular testing frameworks in Python. Overall, pytest provides some advantages, including simpler assertion, reuse of fixtures, and interoperability. Due to such benefits, multiple projects in the Python ecosystem have migrated from unittest to pytest. To facilitate the migration, pytest can also run unittest tests, thus, the migration can happen gradually over time. However, the migration can be timeconsuming and take a long time to conclude. In this context, projects would benefit from automated solutions to support the migration process. In this paper, we propose TestMigrationsInPy, a dataset of test migrations from unittest to pytest. TestMigrationsInPy contains 923 real-world migrations performed by developers. Future research proposing novel solutions to migrate frameworks in Python can rely on TestMigrationsInPy as a ground truth. Moreover, as TestMigrationsInPy includes information about the migration type (e.g., changes in assertions or fixtures), our dataset enables novel solutions to be verified effectively, for instance, from simpler assertion migrations to more complex fixture migrations. TestMigrationsInPy is publicly available at: https://github.com/altinoalvesjunior/TestMigrationsInPy.
Douwan Crack 2025 new verson+ License codeaneelaramzan63
Copy & Paste On Google >>> https://dr-up-community.info/
Douwan Preactivated Crack Douwan Crack Free Download. Douwan is a comprehensive software solution designed for data management and analysis.
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Ranjan Baisak
As software complexity grows, traditional static analysis tools struggle to detect vulnerabilities with both precision and context—often triggering high false positive rates and developer fatigue. This article explores how Graph Neural Networks (GNNs), when applied to source code representations like Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Data Flow Graphs (DFGs), can revolutionize vulnerability detection. We break down how GNNs model code semantics more effectively than flat token sequences, and how techniques like attention mechanisms, hybrid graph construction, and feedback loops significantly reduce false positives. With insights from real-world datasets and recent research, this guide shows how to build more reliable, proactive, and interpretable vulnerability detection systems using GNNs.
6. Data Storage and Resource Management
HDFS (storage) works closely with MapReduce (data processing) to provide scalable, fault-tolerant, cost-efficient storage for big data.
Ver: 2.6.0
YARN is the cluster resource manager and is designed to be co-deployed with HDFS, so a single cluster provides the ability to move computation to the data.
Pig (scripting) is a high-level language (Pig Latin) for data analysis programs, plus infrastructure for evaluating these programs via MapReduce.
Ver: 0.14.0
Hive (SQL) provides data warehouse infrastructure, enabling data summarization, ad-hoc query and analysis of large data sets. The query language, HiveQL (HQL), is similar to SQL.
Ver: 0.14.0
HCatalog (SQL) is a table & storage management layer that provides users of Pig, MapReduce and Hive with a relational view of data in HDFS.
HBase (NoSQL) is a non-relational database that provides random, real-time access to data in very large tables. HBase is transactional and allows updates, inserts and deletes. HBase includes support for SQL through Phoenix.
Ver: 0.98.4
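A minimal sketch of that random read/write path, assuming the 0.98-era HBase Java client on the classpath; the table "metrics" and column family "d" are hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "metrics");  // hypothetical table
      // Insert (or update) a cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes("42"));
      table.put(put);
      // Random, real-time read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"))));
      table.close();
    }
  }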
Accumulo (NoSQL) is based on Google's BigTable design and adds a few novel improvements, notably cell-based access control.
Ver: 1.6.1
Data Access
Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver, targeting low-latency queries over HBase data.
Ver: 4.2.0
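Because Phoenix is delivered as a JDBC driver, plain java.sql code works against HBase. A minimal sketch, assuming the Phoenix client jar is on the classpath; the ZooKeeper host "zk-host" and the table are hypothetical:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
      // The Phoenix JDBC URL points at the HBase ZooKeeper quorum.
      Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
      Statement stmt = conn.createStatement();
      stmt.executeUpdate(
          "CREATE TABLE IF NOT EXISTS metrics (id INTEGER PRIMARY KEY, val VARCHAR)");
      stmt.executeUpdate("UPSERT INTO metrics VALUES (1, 'hello')");
      conn.commit();  // Phoenix connections are not auto-commit by default.
      ResultSet rs = stmt.executeQuery("SELECT val FROM metrics WHERE id = 1");
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
      conn.close();
    }
  }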
Tez (SQL) leverages the MapReduce paradigm to enable the execution of complex directed acyclic graphs (DAGs) of tasks. Tez eliminates unnecessary tasks, synchronization barriers and I/O to HDFS, speeding up data processing.
Ver: 0.5.2
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster.
Ver: 2.6.0
Spark is an open-source data analytics cluster computing framework. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
Ver: 1.2.1
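A minimal sketch of that load-once, query-repeatedly pattern using Spark's Java API; the HDFS path is hypothetical and spark-core is assumed on the classpath:

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SparkCacheSketch {
    public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("CacheSketch");
      JavaSparkContext sc = new JavaSparkContext(conf);
      // Load once, cache in cluster memory, then query repeatedly.
      JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log").cache();
      long total = lines.count();  // first action materializes the cache
      long errors = lines.filter(l -> l.contains("ERROR")).count();  // served from memory
      System.out.println(total + " lines, " + errors + " errors");
      sc.stop();
    }
  }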
Solr is an open-source enterprise search platform from the Apache Lucene project. It includes full-text search, hit highlighting, faceted search, dynamic clustering, database integration and rich document handling, and it provides distributed search and index replication.
Data Engines
7. Security
Knox (perimeter security) is a REST gateway for Hadoop, providing network isolation, SSO, authentication, authorization and auditing functions.
Ver: 0.5.0
Ranger (security management) is a framework to monitor and manage security across a cluster. It provides central administration, monitoring and security hooks for Hadoop applications.
Ver: 0.4.0
Operations & Management
Oozie (job scheduling) enables administrators to build complex data transformations, provides greater control over long-running jobs, and can schedule repetitions of those jobs.
Ver: 4.1.0
Ambari (cluster management) is an operational framework to provision, manage and monitor Hadoop clusters. It includes a web interface for administrators to start/stop/test services and change configurations.
Ver: 1.7.0
ZooKeeper is a distributed configuration and synchronization service and a naming registry for distributed applications. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
Ver: 3.4.6
Hue (Hadoop User Experience) is a web interface that supports Apache Hadoop and its ecosystem. Hue aggregates the most common Apache Hadoop components into a single interface and targets the user experience.
Ver: 2.6.1
Storm (data streaming) is a distributed real-time computation system for processing fast, large streams of data. Storm topologies can be written in any programming language.
Ver: 0.9.3
Governance & Integration
Falcon is a data governance engine that defines Oozie data pipelines, monitors those pipelines (in coordination with Ambari), and tracks those pipelines for dependencies and audits.
Ver: 0.6.0
Sqoop is a tool that efficiently transfers bulk data between Hadoop and structured datastores such as relational databases.
Ver: 1.4.5
Flume is a distributed service to collect, aggregate and move large amounts of streaming data into HDFS.
Ver: 1.5.2
Kafka (data streaming) is a high-throughput distributed messaging system. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Ver: 0.8.1.1
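A minimal sketch of the publish side, assuming the newer org.apache.kafka.clients Java producer (introduced after the 0.8.1 release listed here); the broker address and topic are hypothetical:

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class KafkaProducerSketch {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      KafkaProducer<String, String> producer = new KafkaProducer<>(props);
      // Publish-subscribe: producers append to a topic log; consumers read at their own pace.
      producer.send(new ProducerRecord<>("events", "key1", "hello"));
      producer.close();
    }
  }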
8. Libraries
File Formats
Avro is a data serialization system. It provides rich data structures; a compact, fast, binary data format; and a container file to store persistent data.
Ver: 1.7.4
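A minimal sketch of the first two points (rich data structures and the binary-friendly record model), assuming the Avro Java library; the schema is hypothetical and would normally live in an .avsc file:

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;

  public class AvroSketch {
    public static void main(String[] args) {
      // Hypothetical record schema, defined inline for brevity.
      String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
          + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}";
      Schema schema = new Schema.Parser().parse(schemaJson);
      GenericRecord user = new GenericData.Record(schema);
      user.put("name", "alice");
      user.put("age", 30);
      // Records like this are what Avro serializes into its compact binary
      // format and persists in container files (see DataFileWriter).
      System.out.println(user);
    }
  }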
Mahout is a suite of machine learning libraries designed to be scalable and robust.
Ver: 0.9.0
ORC (Optimized Row Columnar) is a file format, part of Hive, that provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats; using ORC files improves performance when Hive is reading, writing and processing data.
Ver: 0.14.0
Thrift is a data serialization system that allows you to define data types and service interfaces in a simple definition file.
Ver: 0.13.0
Cascading is an application development platform for building data applications on Apache Hadoop.
Ver:
51. Storage Policy Definition
• Allows mapping between dataset and storage type
  – Directory => Storage Type(s)
  – File => Storage Type(s)
• Defines a creation fallback strategy
  – Strategy for when blocks are allocated and written
  – Where to write blocks when the first storage type is out of space
• Defines a replication fallback strategy
  – Strategy for when blocks are replicated
  – Where to store block replicas when the first storage type is out of space
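A minimal sketch of applying such a mapping programmatically, assuming a Hadoop client in which FileSystem exposes setStoragePolicy (on 2.6-era clients the method lives on DistributedFileSystem); the path and the built-in COLD policy name are illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class StoragePolicySketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Map /data/archive to the built-in COLD policy, which targets ARCHIVE storage.
      fs.setStoragePolicy(new Path("/data/archive"), "COLD");
    }
  }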
67. MapReduce V1 Issues
• Scalability
– Maximum Cluster size – 4,000 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
• Coarse Grained Resource Management
• Single point of failure – Mitigated by High Availability
• Restart is very tricky due to complex state
72. YARN Timeline Server
• Storage and retrieval of YARN application status information
• Covers both current status and historical information
• Addresses MapReduce v1 scalability bottleneck of storing job status information in JobTracker
• Includes 2 kinds of status information
– Generic
- Application start/stop
- Attempts
- Container information
– Application-specific
- Varies depending on specific application running in YARN
- MapReduce map tasks, reduce tasks, counters, etc.
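A minimal sketch of retrieving the generic status information over the Timeline Server's REST interface; the host is hypothetical, 8188 is the usual default port, and the v1 generic application history endpoint is assumed:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class TimelineQuerySketch {
    public static void main(String[] args) throws Exception {
      // Generic history endpoint; adjust host/port for your deployment.
      URL url = new URL("http://timeline-host:8188/ws/v1/applicationhistory/apps");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestProperty("Accept", "application/json");
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream()))) {
        String line;
        while ((line = in.readLine()) != null) {
          // JSON listing of applications: start/stop, attempts, container info.
          System.out.println(line);
        }
      }
    }
  }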
73. Node Labels
What
• Admin associates a label with a set of nodes (NodeManagers).
  – 1 node :: n labels AND 1 label :: n nodes
• Admin allows (sets ACLs) apps scheduled to a Capacity Scheduler queue to schedule on labeled nodes
• Admin can enforce (set default) that all apps scheduled to a queue be scheduled on label ‘A’ nodes
Why
• Need a mechanism to enforce node-level isolation
• Account for resource contention amongst non-YARN managed resources
• Account for hardware or software constraints
74. CGroup Isolation
What
• Admin enables CGroups for CPU Isolation for all YARN application workloads
Why
• Applications need guaranteed access to CPU resources
• To ensure SLAs, need to enforce CPU allocations given to an Application container
#3: First, a quick introduction. My name is Chris Nauroth. I’m a software engineer on the HDFS team at Hortonworks. I’m an Apache Hadoop committer and PMC member. I’m also an Apache Software Foundation member. Some of my major contributions include HDFS ACLs, Windows compatibility and various operability improvements.
Prior to Hortonworks, I worked for Disney and did an initial deployment of Hadoop there. As part of that job, I worked very closely with the systems engineering team responsible for maintaining those Hadoop clusters, so I tend to think back to that team and get excited about things I can do now as a software engineer to help make that team’s job easier.
I’m also here with Suresh Srinivas, one of the founders of Hortonworks, and a long-time Hadoop committer and PMC member. He has a lot of experience supporting some of the world’s largest clusters at Yahoo and elsewhere. Together with Suresh, we have experience supporting Hadoop clusters since 2008.
#4: For today’s agenda, I’d like to start by sharing some analysis that we’ve done of support case trends. In that analysis, we’re going to see that some common patterns emerge, and that’s going to lead into a discussion of configuration best practices and software improvements.
In the second half of the talk, we’ll move into a discussion of key learnings and best practices around how recent HDFS features can help prevent problems or manage day-to-day maintenance.
#17: By convention, snapshots can be referenced as a file system path under sub-directory “.snapshot”.
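A minimal sketch of that convention, assuming a snapshottable directory /data with a snapshot named s1 (both hypothetical); the snapshotted file is read back through the ordinary FileSystem API:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SnapshotReadSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // The snapshot is addressed under the ".snapshot" sub-directory.
      Path snapshotted = new Path("/data/.snapshot/s1/report.csv");
      try (FSDataInputStream in = fs.open(snapshotted)) {
        System.out.println("first byte: " + in.read());
      }
    }
  }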
#18: If you’ve used POSIX ACLs on a Linux file system, then you already know how it works in HDFS too.
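A minimal sketch using the FileSystem ACL API; the path /data and user "bob" are hypothetical, and this mirrors what "hdfs dfs -setfacl -m user:bob:r-x /data" does from the shell:

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.AclEntry;
  import org.apache.hadoop.fs.permission.AclEntryScope;
  import org.apache.hadoop.fs.permission.AclEntryType;
  import org.apache.hadoop.fs.permission.FsAction;

  public class AclSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Grant user "bob" read/execute access on /data.
      AclEntry entry = new AclEntry.Builder()
          .setScope(AclEntryScope.ACCESS)
          .setType(AclEntryType.USER)
          .setName("bob")
          .setPermission(FsAction.READ_EXECUTE)
          .build();
      fs.modifyAclEntries(new Path("/data"), Arrays.asList(entry));
    }
  }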
#24: Here is a screenshot pointing out a change in the HDFS web UI: Total Datanode Volume Failures is a hyperlink. Clicking that jumps to…
#25: …this new screen listing the volume failures in detail. We can see the path of each failed storage location and an estimate of the capacity that was lost. I think of this screen as a to-do list that a systems engineer can work through as part of regular cluster maintenance.
#26: Here is what it looks like when there are no volume failures. I included this picture because this is what we all want it to look like. Of course, it won’t always be that way.