An overview of the development of the Apache Hadoop software stack, including some of the barriers to participation, and how and why to overcome them. It closes with some open discussion points and ideas on how the existing process can be improved.
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition - Steve Loughran
An update of the "Hadoop and Kerberos: the Madness Beyond the Gate" talk, covering recent work on the "Fix Kerberos" JIRA and its first deliverable: KDiag.
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
As Hadoop becomes a critical part of Enterprise data infrastructure, securing Hadoop has become critically important. Enterprises want assurance that all their data is protected and that only authorized users have access to the relevant bits of information. In this session we will cover all aspects of Hadoop security including authentication, authorization, audit and data protection. We will also provide demonstration and detailed instructions for implementing comprehensive Hadoop security.
NL HUG 2016 Feb: Hadoop security from the trenches - Bolke de Bruin
Setting up a secure Hadoop cluster involves a magic combination of Kerberos, Sentry, Ranger, Knox, Atlas, LDAP and possibly PAM. Add encryption on the wire and at rest to the mix and you have, at the very least, an interesting configuration and installation task.
Nonetheless, the fact that there are a lot of knobs to turn doesn't excuse you from the responsibility of taking proper care of your customers' data. In this talk, we'll detail how the different security components in Hadoop interact and how easy it can actually be to set things up correctly, once you understand the concepts and tools. We'll outline a successful secure Hadoop setup with an example.
Cloud deployments of Apache Hadoop are becoming more commonplace, yet Hadoop and its applications don't integrate with cloud storage all that well, a problem that starts right down at the file IO operations. This talk looks at how to make use of cloud object stores in Hadoop applications, including Hive and Spark. It goes from the foundational "what's an object store?" to the practical "what should I avoid?" and the timely "what's new in Hadoop?", the latter covering the improved S3 support in Hadoop 2.8+. I'll explore the details of benchmarking and improving object store IO in Hive and Spark, showing what developers can do to gain performance improvements in their own code, and equally what they must avoid. Finally, I'll look at ongoing work, especially "S3Guard" and what its fast and consistent file metadata operations promise.
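As a small illustration of the kind of IO the talk covers, here is a minimal sketch of reading an object through Hadoop's S3A connector from plain Java. The bucket and object names are hypothetical, and it assumes the hadoop-aws module and valid AWS credentials are already on the classpath and configured.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical bucket/object; credentials are picked up from the usual
    // fs.s3a.* settings or the AWS environment.
    Path path = new Path("s3a://example-bucket/logs/sample.txt");

    // FileSystem.get() selects the S3A connector from the "s3a" scheme.
    FileSystem fs = FileSystem.get(path.toUri(), conf);

    // Metadata calls like getFileStatus() are cheap on a real filesystem but
    // translate into HTTP requests against the object store.
    FileStatus status = fs.getFileStatus(path);
    System.out.println("Length: " + status.getLen());

    // Reading is a streamed GET; seeks can be expensive, so prefer forward,
    // sequential reads where possible.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      reader.lines().forEach(System.out::println);
    }
  }
}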
Securing Hadoop's REST APIs with Apache Knox Gateway, Hadoop Summit, June 6th, ... - Kevin Minder
The Apache Knox Gateway is an extensible reverse proxy framework for securely exposing REST APIs and HTTP-based services at a perimeter. It provides out of the box support for several common Hadoop services, integration with enterprise authentication systems, and other useful features. Knox is not an alternative to Kerberos for core Hadoop authentication or a channel for high-volume data ingest/export. It has graduated from the Apache incubator and is included in Hortonworks Data Platform releases to simplify access, provide centralized control, and enable enterprise integration of Hadoop services.
Structor - Automated Building of Virtual Hadoop Clusters - Owen O'Malley
This document describes Structor, a tool that automates the creation of virtual Hadoop clusters using Vagrant and Puppet. It allows users to quickly set up development, testing, and demo environments for Hadoop without manual configuration. Structor addresses the difficulties of manually setting up Hadoop clusters, particularly around configuration, security testing, and experimentation. It provides pre-defined profiles that stand up clusters of different sizes on various operating systems with or without security enabled. Puppet modules configure and provision the Hadoop services while Vagrant manages the underlying virtual machines.
The document discusses security features in Hortonworks Data Platform (HDP) and Pivotal HD. It covers authentication with Kerberos, authorization and auditing using Apache Ranger, perimeter security with Apache Knox, and data encryption at rest and in transit. Various security flows are illustrated including typical access to Hive through Beeline and adding authorization, firewall routing, and encryption. Installation and configuration of Ranger and Knox are also outlined.
- Kerberos is used to authenticate Hadoop services and clients running on different nodes communicating over a non-secure network. It uses tickets for authentication.
- Key configuration changes are required to enable Kerberos authentication in Hadoop, including setting hadoop.security.authentication to kerberos and generating keytabs containing principal keys for the HDFS services.
- Services are associated with Kerberos principals using keytabs, which are then configured for use by the relevant Hadoop processes and services (see the sketch below).
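To make the keytab/principal relationship concrete, here is a minimal sketch of how a Java client or service process can log in from a keytab using Hadoop's UserGroupInformation API. The principal name and keytab path are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The same switch that cluster services set in core-site.xml:
    // hadoop.security.authentication = kerberos
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Hypothetical principal and keytab; a real deployment generates these
    // with kadmin/ktutil or Active Directory tooling.
    UserGroupInformation.loginUserFromKeytab(
        "hdfs/worker01.example.com@EXAMPLE.COM",
        "/etc/security/keytabs/hdfs.service.keytab");

    System.out.println("Logged in as: "
        + UserGroupInformation.getCurrentUser().getUserName());
  }
}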
A comprehensive overview of the security concepts in the open source Hadoop stack in mid 2015 with a look back into the "old days" and an outlook into future developments.
Hadoop REST API Security with Apache Knox Gateway - DataWorks Summit
The document discusses the Apache Knox Gateway, which is an extensible reverse proxy framework that securely exposes REST APIs and HTTP-based services from Hadoop clusters. It provides features such as support for common Hadoop services, integration with enterprise authentication systems, centralized auditing of REST API access, and service-level authorization controls. The Knox Gateway aims to simplify access to Hadoop services, enhance security by protecting network details and supporting partial SSL, and enable centralized management and control over REST API access.
This document discusses YARN services in Hadoop, which allow long-lived applications to run within a Hadoop cluster. YARN (Yet Another Resource Negotiator) provides an operating system-like platform for data processing by allowing various applications to share cluster resources. The document outlines features for long-lived services in YARN, including log aggregation, Kerberos token renewal, and service registration/discovery. It also discusses how Hadoop 2.6 and later versions implement these features to enable long-running applications that can withstand failures.
Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data ... - Hortonworks
This document provides an overview of how Hortonworks uses Apache Hadoop to enable a modern data architecture. Some key points:
- Hadoop allows organizations to create a "data lake" to store all types of data in one place and process it in various ways for different use cases.
- This provides a multi-use data platform that unlocks new approaches to insights by enabling analysis across all data, rather than just subsets stored in silos.
- A modern data architecture with Hadoop integrates with existing investments while freeing up resources for more valuable tasks by offloading lower value workloads to Hadoop.
- Examples of business applications that can benefit from Hadoop include optimizing customer insights
HadoopCon2015: Multi-Cluster Live Synchronization with Kerberos Federated Hadoop - Yafang Chang
In an enterprise on-premises data center, we may have multiple secured Hadoop clusters for different purposes. Sometimes these Hadoop clusters have different Hadoop distributions or Hadoop versions, or are even located in different data centers. To fulfill business requirements, data synchronization between these clusters can be an important mechanism. However, in the real world the secured multi-cluster story is more complicated than a distcp between two same-version, non-secured Hadoop clusters.
We would like to go through our experience enabling live data synchronization across multiple Kerberos-enabled Hadoop clusters, including functionality verification, multi-cluster configuration, the automated setup process, and more. After that, we will share the use cases among those Kerberos-federated Hadoop clusters. Finally, we will present our common practices for multi-cluster data synchronization.
Hadoop security has improved with additions such as HDFS ACLs, Hive column-level ACLs, HBase cell-level ACLs, and Knox for perimeter security. Data encryption has also been enhanced, with support for encrypting data in transit using SSL and data at rest through file encryption or the upcoming native HDFS encryption. Authentication is provided by Kerberos/AD with token-based authorization, and auditing tracks who accessed what data.
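As one small example of the authorization features listed above, here is a sketch of granting a group read access to an HDFS directory through the filesystem ACL API. The path and group name are hypothetical, and the cluster would need dfs.namenode.acls.enabled set to true.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class HdfsAclExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/data/finance");   // hypothetical directory

    // Grant the (hypothetical) "analysts" group read+execute access,
    // in addition to the existing owner/group/other permissions.
    AclEntry entry = new AclEntry.Builder()
        .setScope(AclEntryScope.ACCESS)
        .setType(AclEntryType.GROUP)
        .setName("analysts")
        .setPermission(FsAction.READ_EXECUTE)
        .build();

    fs.modifyAclEntries(dir, Collections.singletonList(entry));
    System.out.println("ACLs now: " + fs.getAclStatus(dir));
  }
}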
The document discusses Hadoop security today and tomorrow. It describes the four pillars of Hadoop security as authentication, authorization, accountability, and data protection. It outlines the current security capabilities in Hadoop like Kerberos authentication and access controls, and future plans to improve security, such as encryption of data at rest and in motion. It also discusses the Apache Knox gateway for perimeter security and provides a demo of using Knox to submit a MapReduce job.
Redis for Security Data: SecurityScorecard JVM Redis Usage - Timothy Spann
A quick talk about Java, Scala, Spring XD and Spring Data Redis against Redis.
An example of how we are using Redis at SecurityScorecard for security data and some hands-on development in Java and Scala.
CBlocks - POSIX-compliant file systems for HDFS - DataWorks Summit
With YARN running Docker containers, it is possible to run applications that are not HDFS-aware inside these containers. It is hard to customize these applications since most of them assume a POSIX file system with rewrite capabilities. In this talk, we will dive into how we created a block storage layer, how it is being tested internally, and the storage containers which make it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This talk will also explore the current state of Ozone along with CBlocks, covering the architecture of storage containers, how replication is handled, scaling to millions of volumes, and I/O performance optimizations.
Big Data in Containers; Hadoop and Spark in Docker and Mesos - Heiko Loewe
Three examples of containerized Big Data analytics:
1. Installation with Docker and Weave for small and medium setups,
2. Hadoop on Mesos with Apache Myriad,
3. Spark on Mesos
This document discusses security features in Apache Kafka including SSL for encryption, SASL/Kerberos for authentication, authorization controls using an authorizer, and securing Zookeeper. It provides details on how these security components work, such as how SSL establishes an encrypted channel and SASL performs authentication. The authorizer implementation stores ACLs in Zookeeper and caches them for performance. Securing Zookeeper involves setting ACLs on Zookeeper nodes and migrating security configurations. Future plans include moving more functionality to the broker side and adding new authorization features.
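To ground the SSL and SASL/Kerberos configuration described above, here is a minimal sketch of a Java producer configured for SASL_SSL. The broker address, topic and truststore path are hypothetical, and a matching JAAS configuration (or the sasl.jaas.config property) would still be needed for the Kerberos login.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SecureKafkaProducerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker01.example.com:9093"); // hypothetical broker
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    // Encrypt the channel with TLS and authenticate with Kerberos over SASL.
    props.put("security.protocol", "SASL_SSL");
    props.put("sasl.kerberos.service.name", "kafka");
    props.put("ssl.truststore.location", "/etc/security/kafka.client.truststore.jks");
    props.put("ssl.truststore.password", "changeit");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("audit-events", "key", "hello, secure Kafka"));
    }
  }
}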
Apache Knox setup, and Hive and HDFS access using Knox - Abhishek Mallick
There are two ways to set up Apache Knox on a server: using Ambari or manually. The document then provides steps for configuring Knox using Ambari, including entering a master secret password and restarting services. It also provides commands for testing HDFS and Hive access through Knox by curling endpoints or using Beeline.
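As an illustration of the kind of check the document describes, here is a sketch of listing an HDFS directory through a Knox gateway from Java rather than curl. The gateway host, topology name ("default") and credentials are hypothetical placeholders, and it assumes the gateway's TLS certificate is already trusted by the JVM.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical gateway URL:
    // https://<knox-host>:8443/gateway/<topology>/webhdfs/v1/<path>?op=...
    URL url = new URL(
        "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // Knox authenticates the caller (e.g. against LDAP) with HTTP Basic auth,
    // then handles Kerberos to the cluster on the caller's behalf.
    String credentials = Base64.getEncoder()
        .encodeToString("guest:guest-password".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + credentials);

    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      reader.lines().forEach(System.out::println); // JSON directory listing
    }
  }
}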
The document discusses security in Hadoop clusters. It introduces authentication using Kerberos and authorization using access control lists (ACLs). Kerberos provides mutual authentication between services and clients in Hadoop. ACLs control access at the service and file level. The document outlines how to configure Kerberos with Hadoop, including setting principals and keytabs for services. It also discusses integrating Kerberos with an Active Directory domain.
Hadoop security overview discusses Kerberos and LDAP configuration and authentication. It outlines Hadoop security features like authentication and authorization in HDFS, MapReduce, and HBase. The document also introduces Etu appliances and their benefits, as well as troubleshooting Hadoop security issues.
Security in IaaS: attacks, hardening, incident response, forensics, and all about their automation. Although I will talk about general concepts that apply to AWS, Azure and GCP, I will show specific demos and threats in AWS, and I will go into detail on some caveats and hazards in AWS.
This document provides an overview of Apache Hadoop security, both historically and what is currently available and planned for the future. It discusses how Hadoop security is different due to benefits like combining previously siloed data and tools. The four areas of enterprise security - perimeter, access, visibility, and data protection - are reviewed. Specific security capabilities like Kerberos authentication, Apache Sentry role-based access control, Cloudera Navigator auditing and encryption, and HDFS encryption are summarized. Planned future enhancements are also mentioned like attribute-based access controls and improved encryption capabilities.
This document discusses using the HP IBRIX file system as an alternative to HDFS in Hadoop to provide high availability. IBRIX is a segmented, fault tolerant file system that runs on top of RAIDed storage. It avoids single points of failure by distributing metadata across multiple segment servers rather than using a single NameNode as in HDFS. The document outlines how Hadoop could integrate with IBRIX, provides details on IBRIX architecture and configuration, and presents early performance results showing comparable performance to HDFS.
Availability and Integrity in hadoop (Strata EU Edition) - Steve Loughran
The document discusses data availability and integrity in Apache Hadoop. It explains that Hadoop uses replication and checksums to ensure data safety. While the NameNode is a single point of failure, Hadoop 1 provides cold failover capability using Linux HA clustering. Hadoop 2 will introduce live NameNode failover using ZooKeeper. The goal is full stack high availability to make the entire Hadoop system resilient to planned and unplanned outages.
Hadoop has some built-in data protection features like replication, snapshots, and trash bins. However, these may not be sufficient on their own. Hadoop data can still be lost due to software bugs or human errors. A well-designed data protection strategy for Hadoop should include diversified copies of valuable data both within and outside the Hadoop environment. This protects against data loss from both software and hardware failures.
You have probably heard about Big Data, but have you ever wondered what exactly it is? And why should you care?
Mobile is playing a large part in driving this explosion in data; data is also created by apps and other services running in the background. As people move towards more digital channels, huge volumes of data are being created, and this data can be put to many personal and professional uses. Big Data and mobile apps are converging and interacting in the enterprise, transforming the whole mobile ecosystem.
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
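To illustrate the raw-data storage recommendation, here is a minimal sketch of writing clickstream records as Snappy-compressed Avro with the Avro Java API. The schema and field names are invented for the example, and a real pipeline would normally go through an ingest tool or framework rather than a standalone writer.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroClickstreamWriterExample {
  // Invented clickstream schema, for illustration only.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
      + "{\"name\":\"timestamp\",\"type\":\"long\"},"
      + "{\"name\":\"userId\",\"type\":\"string\"},"
      + "{\"name\":\"url\",\"type\":\"string\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    GenericRecord click = new GenericData.Record(schema);
    click.put("timestamp", System.currentTimeMillis());
    click.put("userId", "user-42");
    click.put("url", "/products/1234");

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      // Avro container files are splittable, and Snappy keeps them compact
      // while staying cheap to decompress for downstream processing.
      writer.setCodec(CodecFactory.snappyCodec());
      writer.create(schema, new File("clicks.avro"));
      writer.append(click);
    }
  }
}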
Architectural considerations for Hadoop Applications - hadooparchbook
The document discusses architectural considerations for Hadoop applications using a case study on clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it considers HDFS vs HBase, file formats, and compression formats. SequenceFiles are identified as a good choice for raw data storage as they allow for splittable compression.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
Mrinal Devadas, Hortonworks: Making Sense Of Big Data - PatrickCrompton
This document provides an overview of Hortonworks and its Hortonworks Data Platform (HDP). Hortonworks develops, distributes and supports HDP, which is the only 100% open source Apache Hadoop distribution. Hortonworks focuses on innovation in Apache Hadoop projects, addressing enterprise requirements, enabling ecosystem interoperability, and ensuring no vendor lock-in through its open source approach. The document discusses Hortonworks' contributions to Apache Hadoop and other projects, as well as how HDP can be used for operational data refinery, big data exploration, and application enrichment.
Don't Let Security Be The 'Elephant in the Room' - Hortonworks
Don't let security be the "elephant in the room" for enterprise big data. As big data now includes sensitive data from various sources, there are hidden risks to simply adopting big data technologies without also implementing proper data protection. While traditional IT security approaches provide some coverage, they also have gaps and do not fully address protecting data across its lifecycle and wherever it may travel. A data-centric security approach that encrypts data at capture can lock down data and keep it protected as it is stored, processed, and shared across systems.
Storm Demo Talk - Colorado Springs May 2015 - Mac Moore
The document discusses real-time processing capabilities in Hadoop and Hortonworks Data Platform (HDP). It begins with an introduction to Hortonworks and an overview of real-time streaming architectures on HDP. It then demonstrates streaming capabilities with and without predictive analytics additions. The document highlights how HDP provides a centralized architecture and open data platform to enable real-time and batch processing of any type of data for analytics applications.
Deploying and Managing Hadoop Clusters with AMBARI - DataWorks Summit
Deploying, configuring, and managing large Hadoop and HBase clusters can be quite complex. Just upgrading one Hadoop component on a 2000-node cluster can take a lot of time and expertise, and there have been few tools specialized for Hadoop cluster administrators. AMBARI is an Apache incubator project to deliver Monitoring and Management functionality for Hadoop clusters. This paper presents the AMBARI tools for cluster management, specifically: Cluster pre-configuration and validation; Hadoop software deployment, installation, and smoketest; Hadoop configuration and re-config; and a basic set of management ops including start/stop service, add/remove node, etc. In providing these capabilities, AMBARI seeks to integrate with (rather than replace) existing open-source packaging and deployment technology available in most data centers, such as Puppet and Chef, Yum, Apt, and Zypper.
OSDC 2013 | Introduction into Hadoop by Olivier Renault - NETWAYS
Hortonworks is a company that was founded in 2011 to focus on developing and supporting Apache Hadoop for enterprise use. They distribute the Hortonworks Data Platform (HDP), which is the only 100% open source enterprise Hadoop distribution. HDP includes core Hadoop components like HDFS, YARN, MapReduce as well as data services like Hive, Pig, HBase. It also includes operational services to help manage and monitor large Hadoop clusters.
The document discusses real-time processing in Hadoop and provides an overview of streaming architectures using the Hortonworks Data Platform (HDP). It includes two demos, the first showing a basic streaming scenario and the second integrating predictive analytics. The document aims to introduce HDP's capabilities for real-time streaming and predictive analytics and demonstrate them through examples relevant to logistics companies.
This document provides an introduction to Apache Pig, including:
- Pig is a system for processing large unstructured data using HDFS and MapReduce. It uses a high-level data flow language called Pig Latin.
- Pig aims to increase programmer productivity by abstracting low-level MapReduce jobs and providing a procedural language for parallel data flows.
- Pig components include the Pig engine for parsing, optimizing, and executing queries, and the Grunt shell for running interactive commands (a minimal example of running Pig Latin from Java follows this list).
- The document then covers Pig data types, input/output, relational operations, user-defined functions, and new features in Pig version 0.10.0.
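Here is a small sketch of that idea: a Pig Latin word-count-style script run from Java through the PigServer API in local mode. The input file, field names and aliases are hypothetical, and this is only one of several ways to run Pig (the Grunt shell and pig scripts being the usual ones).

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
  public static void main(String[] args) throws Exception {
    // Run Pig in local mode; ExecType.MAPREDUCE would target a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Hypothetical tab-delimited input file of (user, url) pairs.
    pig.registerQuery("clicks = LOAD 'clicks.txt' AS (user:chararray, url:chararray);");
    pig.registerQuery("by_user = GROUP clicks BY user;");
    pig.registerQuery("counts = FOREACH by_user GENERATE group AS user, COUNT(clicks) AS n;");

    // Iterate over the result of the final alias instead of storing it.
    Iterator<Tuple> it = pig.openIterator("counts");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
  }
}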
Hortonworks provides an overview of their Tez framework for improving Hadoop query processing. Tez aims to accelerate queries by expressing them as dataflow graphs that can be optimized, rather than relying solely on MapReduce. It also aims to empower users by allowing flexible definition of data pipelines and composition of inputs, processors, and outputs. Early results show a 100x speedup on benchmark queries compared to traditional MapReduce.
Here are the steps to load the Avro serialized events into Pig:
1) Load the Avro data using AvroStorage():
enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
2) Describe the schema:
describe enron_emails;
This will show the schema including the fields like message_id, date, from, etc. that were serialized in the Avro data.
3) You can now use Pig operations like FILTER, FOREACH, etc. on the enron_emails relation to extract/transform the data as needed before exporting it for display.
So in summary - LOAD the Avro data, DESCRIBE the schema, then transform it with Pig operations before publishing.
Big Data Analytics - Is Your Elephant Enterprise Ready? - Hortonworks
Hadoop’s cost effective scalability and flexibility to analyze all data types is driving organizations everywhere to embrace big data analytics. From proof of concept to deployment across the enterprise, join Datameer and Hortonworks as we answer the ‘now what?’ when rolling out your Hadoop big data analytics project. This webinar will address critical project components such as data security, data privacy, high availability, user training and use case development.
This document provides an overview of real-time processing capabilities on Hortonworks Data Platform (HDP). It discusses how a trucking company uses HDP to analyze sensor data from trucks in real-time to monitor for violations and integrate predictive analytics. The company collects data using Kafka and analyzes it using Storm, HBase and Hive on Tez. This provides real-time dashboards as well as querying of historical data to identify issues with routes, trucks or drivers. The document explains components like Kafka, Storm and HBase and how they enable a unified YARN-based architecture for multiple workloads on a single HDP cluster.
Internet of Things Crash Course Workshop at Hadoop Summit - DataWorks Summit
This document provides an overview of how a trucking company can use Hortonworks Data Platform (HDP) to gain insights from real-time streaming data generated by sensors in its trucks. The company wants to monitor trucks for locations, violations, and other events. HDP allows the company to ingest streaming data from trucks using Kafka and analyze it in real-time with Storm for alerts or serve it to applications with HBase. The company can also run interactive queries on historical data with Hive and Tez. All of this is run on a single HDP cluster for consistent governance, security, and operations across batch and real-time workloads.
The document discusses strategies for developing agile analytics applications using Hadoop, emphasizing an iterative approach where data is explored interactively to discover insights which then form the basis for shipped applications, rather than trying to design insights up front. It recommends setting up an environment where insights are repeatedly produced and shared with the team using an interactive application from the start to facilitate collaboration between data scientists and developers.
The document discusses building agile analytics applications using Hadoop. It recommends setting up an environment where insights can be repeatedly produced through iterative and interactive exploration of data. The document emphasizes making an application for exploring data rather than trying to design insights directly. Insights are discovered through many iterations of refining the data and interacting with it.
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG - skumpf
The document discusses real-time processing in Hadoop using the Hortonworks Data Platform (HDP). It provides an overview of using HDP for real-time streaming analytics in a logistics scenario. Example applications and architectures are presented, including using Kafka for ingesting sensor data, Storm for stream processing, and HBase for real-time querying. Demos will also illustrate integrating predictive analytics into streaming scenarios.
The document discusses strategies for developing agile analytics applications on Hadoop, emphasizing an iterative approach where the data model and insights evolve through exploration of data in an interactive web application, rather than trying to design insights up front, in order to discover insights rather than define them. It also highlights using techniques like storing data in documents rather than relational structures and using Pig for its ability to handle diverse data formats.
Introduction to Microsoft HDInsight and BI Tools - DataWorks Summit
This document discusses Hortonworks Data Platform (HDP) for Windows. It includes an agenda for the presentation which covers an introduction to HDP for Windows, integrating HDP with Microsoft tools, and a demo. The document lists the speakers and provides information on Windows support for Hadoop components. It describes what is included in HDP for Windows, such as deployment choices and full interoperability across platforms. Integration with Microsoft tools like SQL Server, Excel, and Power BI is highlighted. A demo of using Excel to interact with HDP is promised.
LA HUG - Agile Analytics Applications on HDP - Hortonworks
The document discusses publishing structured event data to applications for analytics. It describes serializing event data from sources like emails and logs into Avro documents and loading them into Pig for processing. The data is then published from Pig to databases and analytics stacks using various options like ElasticSearch, MongoDB, HBase, and Hive/HCatalog for exploration and building analytics applications. Code examples demonstrate loading Avro data into Pig, illustrating the data schema, and publishing the data from Pig to MongoDB. The overall approach emphasizes agility, iteration, and flexibility in building analytics applications on Hadoop.
The document discusses how storage models need to evolve as the underlying technologies change. Object stores like S3 provide scale and high availability but lack semantics and performance of file systems. Non-volatile memory also challenges current models. The POSIX file system metaphor is ill-suited for object stores and NVM. SQL provides an alternative that abstracts away the underlying complexities, leaving just object-relational mapping and transaction isolation to address. The document examines renaming operations, asynchronous I/O, and persistent in-memory data structures as examples of areas where new models may be needed.
August 2018 version of my "What does rename() do?" talk; it includes the full details of what the Hadoop MapReduce and Spark commit protocols are, so the audience will really understand why rename really, really matters.
Put is the new rename: San Jose Summit Edition - Steve Loughran
This is the June 2018 variant of the "Put is the new Rename Talk", looking at Hadoop stack integration with object stores, including S3, Azure storage and GCS.
This document outlines the development history of the Dissident bot from its creation in January 2017 to June 2018. It discusses improvements made over time, including adding conversation mode, a TODO item to develop a Chomsky-Type-1 Grammar AI, and fixing a bug where conversation mode would spam the bot's username. It also provides details on the bot's configuration settings and the methods used to detect spam, bots, and politicians spreading misinformation.
A review of the state of cloud store integration with the Hadoop stack in 2018; including S3Guard, the new S3A committers and S3 Select.
Presented at Dataworks Summit Berlin 2018, where the demos were live.
This document discusses the principles and practices of Extreme Programming (XP), an agile software development process. It describes XP as an intense, test-centric programming process focused on projects with high rates of change. Key practices include pair programming, test-driven development, planning with user stories and tasks, doing the simplest thing that could work, and refactoring code aggressively. Problems may include short-term "hill-climbing" solutions and risks of fundamental design errors. The document provides additional resources on XP and notes that the day's session will involve practicing XP techniques through pair programming.
Steve Loughran expresses dislike for mocking in tests because mock code reflects assumptions rather than reality. Any changes to the real code can break the tests, leading to false positives. Test failures are often "fixed" by editing the test or mock code, which could hide real problems. He proposes avoiding mock tests and instead adding functional tests against real infrastructure with fault injection for integration testing.
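A minimal sketch of the functional-test style argued for here, exercising a real local Hadoop filesystem rather than a mock. The path is hypothetical, and a genuine integration test would target live cluster or object-store infrastructure and inject faults rather than use a local directory.

import static org.junit.Assert.assertEquals;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class RoundTripFunctionalTest {

  @Test
  public void testWriteThenRead() throws Exception {
    // A real FileSystem instance, not a mock: behaviour comes from the actual
    // implementation, so changes in its semantics fail the test instead of
    // silently diverging from our assumptions.
    FileSystem fs = FileSystem.getLocal(new Configuration());
    Path path = new Path("target/test/roundtrip.txt"); // hypothetical path

    byte[] payload = "hello, real filesystem".getBytes(StandardCharsets.UTF_8);
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write(payload);
    }

    byte[] readBack = new byte[payload.length];
    try (FSDataInputStream in = fs.open(path)) {
      in.readFully(readBack);
    }
    assertEquals(new String(payload, StandardCharsets.UTF_8),
        new String(readBack, StandardCharsets.UTF_8));
  }
}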
Berlin Buzzwords 2017 talk: A look at what our storage models, metaphors and APIs are, showing how we need to rethink the Posix APIs to work with object stores, while looking at different alternatives for local NVM.
This is the unabridged talk; the BBuzz talk was 20 minutes including demo and questions, so had ~half as many slides
Dancing Elephants: Working with Object Storage in Apache Spark and Hive - Steve Loughran
A talk looking at the intricate details of working with an object store from Hadoop, Hive, Spark, etc, why the "filesystem" metaphor falls down, and what work myself and others have been up to to try and fix things
Apache Spark and Object Stores - for the London Spark User Group - Steve Loughran
The March 2017 version of the "Apache Spark and Object Stores", includes coverage of the Staging Committer. If you'd been at the talk you'd have seen the projector fail just before the demo. It worked earlier! Honest!
This document discusses using Apache Spark with object stores like Amazon S3 and Microsoft Azure Blob Storage. It covers challenges around classpath configuration, credentials, code examples, and performance commitments when using these storage systems. Key points include using Hadoop connectors like S3A and WASB, configuring credentials through properties or environment variables, and tuning Spark for object store performance and consistency.
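A minimal sketch of the credentials and classpath wiring the summary mentions, passing S3A settings through a Java SparkSession. The bucket and prefix are placeholders, and in practice credentials usually come from environment variables, instance profiles or a credential provider rather than being set explicitly in configuration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkS3AExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("spark-s3a-example")
        .master("local[2]")
        // "spark.hadoop." prefixed options are forwarded to the Hadoop
        // configuration used by the S3A connector.
        .config("spark.hadoop.fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"))
        .config("spark.hadoop.fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"))
        .getOrCreate();

    // Hypothetical bucket and prefix; hadoop-aws and the AWS SDK must be on
    // the executor classpath for the s3a:// scheme to resolve.
    Dataset<Row> logs = spark.read().json("s3a://example-bucket/logs/2017/");
    System.out.println("rows = " + logs.count());

    spark.stop();
  }
}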
This document discusses household information security risks in the post-Sony era. It identifies key risks like data integrity, privacy, and availability issues. It provides examples of vulnerabilities across different devices and platforms like LG TVs, iPads, iPhones, and PS4s. It also discusses vulnerabilities in software like Firefox, Chrome, Internet Explorer, Flash, and SparkContext. It recommends approaches to address these risks like using containers for isolation, validating packages with PGP to ensure authentication, and enabling audit logs.
This document discusses Apache Slider, which allows applications to be deployed and managed on Apache Hadoop YARN. Slider uses an Application Master, agents, and scripts to deploy applications defined in an XML package. The Application Master keeps applications in a desired state across YARN containers and handles lifecycle commands like start, stop, and scaling. Slider integrates with Apache Ambari for graphical management and configuration of applications on YARN.
This document appears to be a list of terms related to computer science and data systems including HDFS, YARN, Kernighan, Cerf, Lamport, Codd, Knuth and SQL. It references people, technologies and concepts but provides no additional context or explanation.
The document discusses Hortonworks' Slider project, which aims to simplify deploying and managing distributed applications on YARN. Slider provides a packaging format for applications, launches application components as YARN containers via an Application Master, and handles service registration and configuration management. It addresses limitations of earlier frameworks by supporting dynamic configurations, embedded usage, and integration with service discovery in Zookeeper.
This document provides guidance on reporting bugs in the Apache Hadoop project. It explains that the JIRA issue tracker is used to report bugs and feature requests, and outlines best practices for submitting high-quality issue reports. These include searching for existing issues and solutions first, providing detailed steps to replicate the problem, and attaching relevant logs and stack traces. The document discourages "help!" emails and stresses that the best way to get a bug fixed is often for the reporter to propose a patch with tests.
5. History: ASF releases slowed
0.20.0 0.20.1 0.20.2 0.21.0 0.20.20{3,4,5}.0
• 64 Releases from 2006-2011
• Branches from the last 2.5 years:
–0.20.{0,1,2} – Stable release without security
–0.20.2xx.y – Stable release with security
–0.21.0 – released, unstable, deprecated
–0.22.0 – orphan, unstable, lack of community
–0.23.x
• Cloudera CDH: fork w/ patches pushed back
6. Now: 2 ASF branches
Hadoop 1.x
• Stable, used in production systems
• Features focus on fixes & low-risk performance
Hadoop 2.x/trunk
• The successor
• Alpha-release. Download and test
• Where features & fixes first go in
• Your new code goes here.
#3: This is my background. Key point: until 2012 I was working on my own things inside a large organisation; now I am an FTE on Hadoop.
#7: There's a conflict of interest here between trunk features and branch-1 commits: the latter get into people's hands faster, but threaten the very feature, stability, that justifies branch-1's existence. All the interesting stuff goes into trunk, which is where I push most of my patches (it's easier to avoid backporting).
#10: Bigtop is ±Fedora: bleeding edge, but it also defines the RPM installation layout and startup scripts for everyone, for consistency. Hortonworks trails with the stable artifacts; the team manages the Apache Hadoop releases and the QA team tests them all. Cloudera do a mix of ASF releases plus their own fork of Hadoop, with a different set/ordering of patches. CDH vs HDP is a matter of argument. One thing to know is that everyone now tends to use Git to manage their individual branches.
#19: Plugin points: yes, I think Google Guice would be the alternative, but, well…
#20: Most people here do not have 500+ clusters with double-digit PB of storage. Those clusters are the best for stress testing the storage and compute layers, but only a few people have them at this scale: Y!, FB. We use Y!'s test clusters for all the Apache & Hortonworks releases.
#21: You have your own issues. Does it scale down enough? Does it assume the LAN is well managed, clocks are in sync, DNS and rDNS work? Your problems, especially the networking ones, are your own. This is why testing them matters.
#22: I'm proposing people write books for the benefit of the project, not the fame and money that comes with writing a book. Anyone else who has written a book will know precisely why I'm doing that.
#24: We do have this for the Apache Incubator, but those are projects above and alongside the existing codebase. I'm wondering here how to get medium-sized bits of work done in a way that is timely, not wasted.
#25: There are no easy answers here, but here are some things I think could be good. Git workflow support: it stops people having to resubmit patches all the time, and git pull can be used to grab and apply a patch. Gerrit code review: it makes reviewing much, much easier. We have HUG events, but they tend not to delve into the codebase; I'm proposing doing exactly that, in regions other than just the Bay Area. I will back this up by offering to host an all-day one at a bar/café near me in Bristol if enough people are interested. I'm also advocating university involvement so that students get more of an idea of Hadoop internals. For those of us outside the Bay Area, remote events are good. We've had some good WebEx'd events recently (e.g. the YARN one), but could do with more. I'd like to see something more interactive, and think we could/should try an online-only Google+ Hangout coding event, possibly using a shared IDE.