This document discusses using HBase for geo-based content processing at NAVTEQ. It outlines the problems with their previous system, such as ineffective scaling and high Oracle licensing costs. Their solution was to implement HBase on Hadoop for horizontal scalability and flexible rules-based processing. Some challenges included unstable early versions of HBase, database design issues, and interfacing batch systems with real-time systems. Cloudera support helped address many technical issues and provide best practices for operating Hadoop and HBase at scale.
Since Facebook publicly confirmed its use of HBase, HBase has been on everyone's lips in discussions around the new "NoSQL" class of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase, including the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and at the Data2Day Meetup on 28.09.2017 in Heidelberg.
The document discusses security in Hadoop clusters. It introduces authentication using Kerberos and authorization using access control lists (ACLs). Kerberos provides mutual authentication between services and clients in Hadoop. ACLs control access at the service and file level. The document outlines how to configure Kerberos with Hadoop, including setting principals and keytabs for services. It also discusses integrating Kerberos with an Active Directory domain.
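As a minimal illustration of how a client or service authenticates once Kerberos is enabled, the sketch below (Java; the realm, principal, and keytab path are purely illustrative assumptions) uses Hadoop's UserGroupInformation API to log in from a keytab before talking to HDFS.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Matches hadoop.security.authentication=kerberos in core-site.xml
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Principal and keytab are illustrative; services use their own
            // service principals (e.g. nn/_HOST@REALM) configured in *-site.xml.
            UserGroupInformation.loginUserFromKeytab(
                "hdfs/[email protected]",
                "/etc/security/keytabs/hdfs.service.keytab");

            // Any subsequent HDFS access now carries the Kerberos credentials.
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.exists(new Path("/user")));
        }
    }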
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera, Inc.
"While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns. "
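To make the write-path point concrete, here is a small, hedged sketch (Java, HBase client API; the table name, column family, and the choice of 16 salt buckets are illustrative assumptions) of one common pattern for write-heavy schemas: prefixing a monotonically increasing key with a salt so inserts spread across regions instead of hot-spotting on one.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedWriteExample {
        private static final int SALT_BUCKETS = 16;  // illustrative

        static byte[] saltedKey(String naturalKey) {
            // Spread sequential keys (e.g. timestamps) across regions.
            int salt = Math.abs(naturalKey.hashCode()) % SALT_BUCKETS;
            return Bytes.toBytes(String.format("%02d-%s", salt, naturalKey));
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {
                Put put = new Put(saltedKey("20111108-12:00:00-sensor42"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                              Bytes.toBytes("reading=17"));
                table.put(put);
            }
        }
    }
Readers then fan out one scan per salt bucket; the trade-off between write distribution and read complexity is exactly the kind of access-pattern decision this talk covers.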
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica (Cloudera, Inc.)
This document discusses HBase, an open-source, non-relational, distributed database built on top of Hadoop. It provides an overview of why HBase is useful, examples of how Navteq uses HBase at scale, and considerations for designing HBase schemas and deploying HBase clusters, including hardware requirements and configuration tuning. The document also outlines some desired future features for HBase like better tools, secondary indexes, and security improvements.
Hadoop REST API Security with Apache Knox Gateway (DataWorks Summit)
The document discusses the Apache Knox Gateway, which is an extensible reverse proxy framework that securely exposes REST APIs and HTTP-based services from Hadoop clusters. It provides features such as support for common Hadoop services, integration with enterprise authentication systems, centralized auditing of REST API access, and service-level authorization controls. The Knox Gateway aims to simplify access to Hadoop services, enhance security by protecting network details and supporting partial SSL, and enable centralized management and control over REST API access.
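As a rough sketch of what REST access through the gateway looks like from a client, the Java snippet below calls WebHDFS via a Knox URL. The host, port, topology name ("default"), and credentials are assumptions for illustration; the /gateway/{topology}/webhdfs/v1 path pattern follows Knox's documented URL layout.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class KnoxWebHdfsExample {
        public static void main(String[] args) throws Exception {
            // Illustrative gateway host and topology name.
            URL url = new URL("https://knox.example.com:8443/gateway/default"
                    + "/webhdfs/v1/tmp?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Knox authenticates the caller (e.g. against LDAP/AD) and then
            // talks Kerberos to the cluster on the caller's behalf.
            String creds = Base64.getEncoder()
                    .encodeToString("guest:guest-password".getBytes("UTF-8"));
            conn.setRequestProperty("Authorization", "Basic " + creds);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }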
Hadoop Meetup Jan 2019 - Overview of Ozone (Erik Krogen)
A presentation by Anu Engineer of Cloudera on the state of the Ozone subproject. He gives a brief introduction to what Ozone is and where it's headed.
This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.
This talk delves into the many ways HBase can be used in a project. Lars will look at practical examples based on real applications in production, for example at Facebook and eBay, and the right approach for those wanting to build their own implementation. He will also discuss advanced concepts such as counters, coprocessors, and schema design.
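For the counters concept specifically, a minimal Java sketch (table and column names are illustrative) shows how HBase performs atomic, server-side increments without a read-modify-write round trip:
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CounterExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("page_stats"))) {
                // Atomically add 1 to the 'views' counter for this page;
                // the increment happens on the region server.
                long views = table.incrementColumnValue(
                        Bytes.toBytes("example.com/index.html"),
                        Bytes.toBytes("s"), Bytes.toBytes("views"), 1L);
                System.out.println("views = " + views);
            }
        }
    }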
The current major release, Hadoop 2.0, offers several significant HDFS improvements including the new append pipeline, federation, wire compatibility, NameNode HA, snapshots, and performance improvements. We describe how to take advantage of these new features and their benefits. We cover some architectural improvements in detail, such as HA, federation, and snapshots. The second half of the talk describes features currently under development for the next HDFS release. This includes much-needed data management features such as backup and disaster recovery. We add support for different classes of storage devices such as SSDs and open interfaces such as NFS; together these extend HDFS into a more general storage system. Hadoop has recently been extended to run first-class on Windows, which expands its enterprise reach and allows integration with the rich tool set available on Windows. As with every release, we will continue improvements to the performance, diagnosability, and manageability of HDFS. To conclude, we discuss reliability, the state of HDFS adoption, and some of the misconceptions and myths about HDFS.
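Of the features listed, snapshots are the easiest to show programmatically. The sketch below (Java; the path and snapshot name are illustrative, and allowing snapshots on a directory is an admin operation) marks a directory snapshottable and then captures a read-only, point-in-time snapshot.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;

    public class SnapshotExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path dir = new Path("/user/warehouse/important");

            // Admin step: allow snapshots on the directory
            // (equivalent to: hdfs dfsadmin -allowSnapshot <dir>).
            HdfsAdmin admin = new HdfsAdmin(URI.create(fs.getUri().toString()), conf);
            admin.allowSnapshot(dir);

            // Capture a point-in-time, read-only image of the directory.
            Path snapshot = fs.createSnapshot(dir, "before-upgrade");
            System.out.println("Created " + snapshot);
        }
    }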
- The document summarizes the state of Apache HBase, including recent releases, compatibility between versions, and new developments.
- Key releases include HBase 1.1, 1.2, and 1.3, which added features like async RPC client, scan improvements, and date-tiered compaction. HBase 2.0 is targeting compatibility improvements and major changes to data layout and assignment.
- New developments include date-tiered compaction for time series data, Spark integration, and ongoing work on async operations, replication 2.0, and reducing garbage collection overhead.
A comprehensive overview of the security concepts in the open source Hadoop stack as of mid-2015, with a look back at the "old days" and an outlook on future developments.
As Hadoop becomes a critical part of Enterprise data infrastructure, securing Hadoop has become critically important. Enterprises want assurance that all their data is protected and that only authorized users have access to the relevant bits of information. In this session we will cover all aspects of Hadoop security including authentication, authorization, audit and data protection. We will also provide demonstration and detailed instructions for implementing comprehensive Hadoop security.
This document provides an overview of securing Hadoop applications and clusters. It discusses authentication using Kerberos, authorization using POSIX permissions and HDFS ACLs, encrypting HDFS data at rest, and configuring secure communication between Hadoop services and clients. The principles of least privilege and separating duties are important to apply for a secure Hadoop deployment. Application code may need changes to use Kerberos authentication when accessing Hadoop services.
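To make the authorization piece concrete, here is a small Java sketch using the HDFS ACL API (paths, user, and group names are illustrative) that grants a read-only analyst group access to a directory beyond what the POSIX owner/group bits allow.
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    public class HdfsAclExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/data/sales");

            AclEntry analysts = new AclEntry.Builder()
                    .setScope(AclEntryScope.ACCESS)
                    .setType(AclEntryType.GROUP)
                    .setName("analysts")                 // illustrative group
                    .setPermission(FsAction.READ_EXECUTE)
                    .build();

            // Equivalent to: hdfs dfs -setfacl -m group:analysts:r-x /data/sales
            fs.modifyAclEntries(dir, Arrays.asList(analysts));
            System.out.println(fs.getAclStatus(dir));
        }
    }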
The document discusses Hadoop security today and tomorrow. It describes the four pillars of Hadoop security as authentication, authorization, accountability, and data protection. It outlines the current security capabilities in Hadoop like Kerberos authentication and access controls, and future plans to improve security, such as encryption of data at rest and in motion. It also discusses the Apache Knox gateway for perimeter security and provides a demo of using Knox to submit a MapReduce job.
Compression Options in Hadoop - A Tale of Tradeoffs (DataWorks Summit)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
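As a concrete example of the codec choice the paper discusses, the Java fragment below configures a MapReduce job to compress intermediate map output with Snappy (fast, lower ratio) and the final output with gzip (slower, better ratio); the job name and output path are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Fast codec for shuffle traffic between map and reduce.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compression-demo");
            // Higher-ratio codec for data that will sit on disk.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            FileOutputFormat.setOutputPath(job, new Path("/tmp/out"));
            // ... set mapper/reducer/input as usual, then job.waitForCompletion(true);
        }
    }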
This presentation will discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of Virtual Machines such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand design, configuration and deployment of Hadoop in virtual infrastructures.
Hadoop Operations - Best Practices from the Field (DataWorks Summit)
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
HBase and HDFS: Understanding FileSystem Usage in HBase (enissoz)
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS for tuning IO performance.
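The data-locality points translate into a handful of client and datanode settings. Below is a hedged Java sketch (the socket path is illustrative, and these keys normally live in hdfs-site.xml and hbase-site.xml rather than code) showing the short-circuit read and HBase checksum settings the document refers to.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ShortCircuitConfigExample {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // Let the HDFS client read local block files directly,
            // bypassing the DataNode for local reads.
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            conf.set("dfs.domain.socket.path",
                     "/var/lib/hadoop-hdfs/dn_socket");   // illustrative path

            // Let HBase verify its own checksums so short-circuit reads
            // can skip the separate HDFS checksum file.
            conf.setBoolean("hbase.regionserver.checksum.verify", true);

            System.out.println("short-circuit = "
                    + conf.getBoolean("dfs.client.read.shortcircuit", false));
        }
    }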
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast? (Accumulo Summit)
Speaker: Mike Drob
Apache Accumulo has long held a reputation for enabling high-throughput operations in write-heavy workloads. In this talk, we use the Yahoo! Cloud Serving Benchmark (YCSB) to put real numbers on Accumulo performance. We then compare these numbers to previous versions, to other databases, and wrap up with a discussion of parameters that can be tweaked to improve them.
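For context on what a write-heavy workload exercises, here is a minimal Java sketch of Accumulo's batch-writer path (the classic 1.x client API is assumed; instance name, ZooKeeper quorum, credentials, and the YCSB-style table name are illustrative).
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class AccumuloWriteExample {
        public static void main(String[] args) throws Exception {
            ZooKeeperInstance instance =
                    new ZooKeeperInstance("accumulo", "zk1.example.com:2181");
            Connector conn = instance.getConnector("bench", new PasswordToken("secret"));

            BatchWriterConfig cfg = new BatchWriterConfig()
                    .setMaxMemory(64 * 1024 * 1024)   // client-side buffer before flushing
                    .setMaxWriteThreads(4);
            BatchWriter writer = conn.createBatchWriter("usertable", cfg);

            Mutation m = new Mutation(new Text("user1000"));
            m.put(new Text("cf"), new Text("field0"), new Value("payload".getBytes()));
            writer.addMutation(m);
            writer.close();   // flushes buffered mutations to tablet servers
        }
    }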
Improving Hadoop Cluster Performance via Linux Configuration (Alex Moundalexis)
Administering a Hadoop cluster isn't easy. Many Hadoop clusters suffer from Linux configuration problems that can negatively impact performance. With vast and sometimes confusing config/tuning options, it can be tempting (and scary) for a cluster administrator to make changes to Hadoop when cluster performance isn't as expected. Learn how to improve Hadoop cluster performance and eliminate common problem areas, applicable across use cases, using a handful of simple Linux configuration changes.
Difference between Hadoop 2 vs Hadoop 3 (Manish Chopra)
Hadoop 3.x includes improvements over Hadoop 2.x such as supporting Java 8 as the minimum version, using erasure coding for fault tolerance which reduces storage overhead, improving the YARN timeline service for better scalability and reliability, and moving default ports out of the ephemeral range to prevent startup failures. Hadoop 3.x also adds support for the Microsoft Azure Data Lake filesystem and provides better scalability by allowing clusters to scale to over 10,000 nodes. Key features for resource management, high availability, and running analytics workloads are also continued from Hadoop 2.x in Hadoop 3.x.
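The erasure-coding point can be shown in a few lines of Java against HDFS 3.x; the directory and the built-in RS-6-3-1024k policy name are used here for illustration. Data written under the directory is then stored as 6 data plus 3 parity blocks (1.5x overhead) instead of 3 full replicas (3x).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ErasureCodingExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(conf);

            Path coldData = new Path("/archive/cold");
            dfs.mkdirs(coldData);

            // Equivalent to: hdfs ec -setPolicy -path /archive/cold -policy RS-6-3-1024k
            dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");
            System.out.println(dfs.getErasureCodingPolicy(coldData));
        }
    }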
This document provides an overview of big data ecosystems, including common log formats, compression techniques, data collection methods, distributed storage options like HDFS and S3, distributed processing frameworks like Hadoop MapReduce and Storm, workflow managers, real-time storage options, and other related topics. It describes technologies like Kafka, HBase, Cassandra, Pig, Hive, Oozie, and Azkaban; compares advantages and disadvantages of HDFS, S3, HBase and other storage systems; and provides references for further information.
Hadoop security overview discusses Kerberos and LDAP configuration and authentication. It outlines Hadoop security features like authentication and authorization in HDFS, MapReduce, and HBase. The document also introduces Etu appliances and their benefits, as well as troubleshooting Hadoop security issues.
Project Savanna automates the deployment of Apache Hadoop on OpenStack. It provisions Hadoop clusters using templates, allows for elastic scaling of nodes, and provides multi-tenancy. The Hortonworks OpenStack plugin uses Ambari to install and manage Hadoop clusters on OpenStack. It demonstrates provisioning a Hadoop cluster with Ambari, installing services, and monitoring the cluster through the Ambari UI. OpenStack provides operational agility while Hadoop is well-suited for its scale-out architecture.
Hadoop has long had strong authentication via integration with Kerberos, authorization via user/group/other HDFS permissions and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC file format will also be covered.
Protecting Enterprise Data in Apache Hadoop (Owen O'Malley)
From Hadoop Summit 2015, San Jose
From Apache BigData 2016, Vancouver
Hadoop has long had strong authentication via integration with Kerberos, authorization via User/Group/Other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC columnar format will also be covered.
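As a concrete illustration of the encryption-zone and key-provider pieces, the Java sketch below creates an encryption zone over a directory using a key that is assumed to already exist in the KMS (for example created with `hadoop key create pii-key`); the path and key name are illustrative.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;

    public class EncryptionZoneExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path secureDir = new Path("/secure/pii");
            fs.mkdirs(secureDir);

            // The 'pii-key' must already exist in the configured KMS; the
            // client finds the KMS through its key provider configuration.
            HdfsAdmin admin = new HdfsAdmin(URI.create(fs.getUri().toString()), conf);
            admin.createEncryptionZone(secureDir, "pii-key");

            // Files written under /secure/pii are now transparently
            // encrypted on write and decrypted on read for authorized users.
        }
    }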
File Format Benchmarks - Avro, JSON, ORC, & Parquet (Owen O'Malley)
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, it is important to benchmark on real data rather than synthetic data. We used the Github logs data available freely from http://githubarchive.org We will make all of the benchmark code open source so that our experiments can be replicated.
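For readers who want to reproduce part of this themselves, here is a small Java sketch that writes an ORC file with the orc-core API (the schema and output path are illustrative); the same data written as Avro, JSON, and Parquet is what the benchmark compares.
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            TypeDescription schema =
                    TypeDescription.fromString("struct<id:bigint,event:string>");

            Writer writer = OrcFile.createWriter(new Path("/tmp/events.orc"),
                    OrcFile.writerOptions(conf).setSchema(schema));

            VectorizedRowBatch batch = schema.createRowBatch();
            LongColumnVector id = (LongColumnVector) batch.cols[0];
            BytesColumnVector event = (BytesColumnVector) batch.cols[1];
            for (int i = 0; i < 10_000; i++) {
                int row = batch.size++;
                id.vector[row] = i;
                byte[] value = ("PushEvent-" + i).getBytes(StandardCharsets.UTF_8);
                event.setVal(row, value);
                if (batch.size == batch.getMaxSize()) {
                    writer.addRowBatch(batch);   // flush a full batch to the file
                    batch.reset();
                }
            }
            if (batch.size != 0) {
                writer.addRowBatch(batch);
            }
            writer.close();
        }
    }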
The document discusses adding ACID transaction support (delete, update, and insert capabilities) to the Hive data warehouse software. It describes the design, which stitches together delta files within each bucket and compacts files to support modifications. The document also covers limitations, development phases, and the reasons for not using HBase for the task.
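A minimal sketch of what the feature enables from a client's point of view (Java over Hive JDBC; the host, credentials, and table definition are illustrative, and the cluster must have Hive transactions enabled):
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveAcidExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // ACID tables are bucketed and stored as ORC.
                stmt.execute("CREATE TABLE IF NOT EXISTS accounts (id INT, balance DOUBLE) "
                        + "CLUSTERED BY (id) INTO 4 BUCKETS STORED AS ORC "
                        + "TBLPROPERTIES ('transactional'='true')");

                stmt.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 250.0)");
                stmt.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1");
                stmt.execute("DELETE FROM accounts WHERE id = 2");
            }
        }
    }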
Structor - Automated Building of Virtual Hadoop Clusters (Owen O'Malley)
This document describes Structor, a tool that automates the creation of virtual Hadoop clusters using Vagrant and Puppet. It allows users to quickly set up development, testing, and demo environments for Hadoop without manual configuration. Structor addresses the difficulties of manually setting up Hadoop clusters, particularly around configuration, security testing, and experimentation. It provides pre-defined profiles that stand up clusters of different sizes on various operating systems with or without security enabled. Puppet modules configure and provision the Hadoop services while Vagrant manages the underlying virtual machines.
The document describes the ORC file format. It discusses the file structure, stripe structure, file layout, integer and string column serialization, compression techniques, projection and predicate filtering capabilities, example file sizes, and compares ORC files to RCFiles and Trevni. The document was authored by Owen O'Malley of Hortonworks for December 2012.
The next generation of Hadoop MapReduce
Arun C. Murthy presented the plans for the next generation of Apache Hadoop MapReduce. The MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.
More information and video available at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Bay Area HUG Feb 2011 introduction and Yahoo refocusing on Apache Hadoop releases.
More information at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Andrew Ryan describes how Facebook operates Hadoop to provide access as a shared resource between groups.
More information and video at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
This document summarizes techniques for optimizing Hive queries, including recommendations around data layout, format, joins, and debugging. It discusses partitioning, bucketing, sort order, normalization, text format, sequence files, RCFiles, ORC format, compression, shuffle joins, map joins, sort merge bucket joins, count distinct queries, using explain plans, and dealing with skew.
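A few of those recommendations can be expressed directly as settings and DDL. The sketch below (Java over Hive JDBC; connection details and table layout are illustrative) enables map-join conversion, creates a partitioned ORC table, and asks Hive to show its plan.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveTuningExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Convert joins against small tables into in-memory map joins.
                stmt.execute("SET hive.auto.convert.join=true");

                // Partitioned, columnar layout so queries can prune and project.
                stmt.execute("CREATE TABLE IF NOT EXISTS clicks (user_id BIGINT, url STRING) "
                        + "PARTITIONED BY (ds STRING) STORED AS ORC");

                // Inspect the plan before running an expensive query.
                try (ResultSet rs = stmt.executeQuery(
                        "EXPLAIN SELECT ds, count(*) FROM clicks "
                        + "WHERE ds = '2013-06-01' GROUP BY ds")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }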
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding -- resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query. Finally, ORC works together with the upcoming query vectorization work providing a high bandwidth reader/writer interface.
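To show what the reader side looks like, here is a hedged Java sketch using orc-core's vectorized reader (the path and column layout are illustrative and match the writer sketch earlier in this listing; column projection and pushdown filters are supplied through Reader.Options in the same API).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.RecordReader;

    public class OrcReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Reader reader = OrcFile.createReader(new Path("/tmp/events.orc"),
                    OrcFile.readerOptions(conf));

            // Batches of ~1024 rows are handed to the caller at a time,
            // which is what the vectorized execution work builds on.
            RecordReader rows = reader.rows();
            VectorizedRowBatch batch = reader.getSchema().createRowBatch();
            long sum = 0;
            while (rows.nextBatch(batch)) {
                LongColumnVector id = (LongColumnVector) batch.cols[0];
                for (int r = 0; r < batch.size; r++) {
                    sum += id.vector[r];
                }
            }
            rows.close();
            System.out.println("sum(id) = " + sum);
        }
    }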
ORC File and Vectorization - Hadoop Summit 2013 (Owen O'Malley)
Eric Hanson and I gave this presentation at Hadoop Summit 2013:
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.
The document discusses ISO 27001, an international standard for information security management systems (ISMS). It describes what an ISMS is, the benefits of ISO 27001 certification, and provides an overview of the methodology for implementing an ISO 27001-compliant ISMS, including defining security policies, conducting risk assessments, developing procedures, performing internal audits, and becoming certified.
This document discusses data encryption in Hadoop. It describes two common cases for encrypting data: using a Crypto API to encrypt/decrypt with an AES key stored in a keystore, and encrypting MapReduce outputs using a CryptoContext. It also covers the Hadoop Encryption Framework APIs, HBase encryption via HBASE-7544, and related JIRAs around Hive and Pig encryption. Key management tools like keytool and potential future improvements like Knox gateway integration are also mentioned.
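The first case, encrypting application data with an AES key held in a keystore, can be illustrated with plain JDK APIs; the keystore path, password, and key alias below are assumptions, and the Hadoop-specific CryptoContext wiring is outside the scope of this sketch.
    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.IvParameterSpec;

    public class AesFromKeystoreExample {
        public static void main(String[] args) throws Exception {
            char[] password = "changeit".toCharArray();      // illustrative
            KeyStore ks = KeyStore.getInstance("JCEKS");
            try (FileInputStream in = new FileInputStream("/etc/security/app.jceks")) {
                ks.load(in, password);
            }
            SecretKey key = (SecretKey) ks.getKey("data-key", password);

            byte[] iv = new byte[16];
            new SecureRandom().nextBytes(iv);

            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ciphertext = cipher.doFinal("sensitive record".getBytes("UTF-8"));
            System.out.println("encrypted " + ciphertext.length + " bytes");
        }
    }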
This document provides an introduction to information system security. It discusses key concepts like security, information security, vulnerabilities, threats, attacks, security policies, and security measures. The document outlines common security risks like interruption, interception, modification, masquerading, and repudiation. It explains that security policies provide guidelines for implementing security controls to protect information system assets from such risks according to the security principles of confidentiality, integrity, and availability.
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.
The document discusses the Optimized Row Columnar (ORC) file format. ORC provides columnar storage, row groups, optimized compression, vectorized processing and other features to improve performance of Hive queries up to 100x. It describes how ORC has been implemented at companies like Facebook and Spotify to significantly reduce storage needs, improve query performance and reduce CPU usage. The document outlines various optimizations in ORC including column projection, stripe splitting, statistics gathering, vectorized processing using SIMD and improvements in compression and encoding of integer, double and string data types.
Parquet is a columnar storage format for Hadoop data. It was developed by Twitter and Cloudera to optimize storage and querying of large datasets. Parquet provides more efficient compression and I/O compared to traditional row-based formats by storing data by column. Early results show a 28% reduction in storage size and up to a 114% improvement in query performance versus the original Thrift format. Parquet supports complex nested schemas and can be used with Hadoop tools like Hive, Pig, and Impala.
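A minimal Java sketch of writing Parquet through the Avro binding (the schema, output path, and codec choice are illustrative; parquet-avro is only one of several bindings):
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteExample {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
                    + "{\"name\":\"user_id\",\"type\":\"long\"},"
                    + "{\"name\":\"url\",\"type\":\"string\"}]}");

            try (ParquetWriter<GenericRecord> writer =
                         AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/clicks.parquet"))
                                 .withSchema(schema)
                                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                                 .build()) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("user_id", 42L);
                record.put("url", "https://example.com/");
                writer.write(record);   // columns are buffered and written per row group
            }
        }
    }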
Data in Hadoop is getting bigger every day, consumers of the data are growing, organizations are now looking at making their Hadoop cluster compliant to federal regulations and commercial demands. Apache Ranger simplifies the management of security policies across all components in Hadoop. Ranger provides granular access controls to data.
The deck describes what security tools are available in Hadoop and their purpose, then moves on to discuss Apache Ranger in detail.
Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache... (Hortonworks)
The document discusses security enhancements in Hortonworks Data Platform (HDP) 2.2, including centralized security with Apache Ranger and API security with Apache Knox. It provides an overview of Ranger's ability to centrally administer, authorize, and audit access across Hadoop. Knox is described as securing access to Hadoop APIs from various devices and applications by acting as a gateway. The document highlights new features for Ranger and Knox in HDP 2.2 such as deeper integration with Hadoop components and Ambari management capabilities.
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
Hadoop security has improved with additions such as HDFS ACLs, Hive column-level ACLs, HBase cell-level ACLs, and Knox for perimeter security. Data encryption has also been enhanced, with support for encrypting data in transit using SSL and data at rest through file encryption or the upcoming native HDFS encryption. Authentication is provided by Kerberos/AD with token-based authorization, and auditing tracks who accessed what data.
This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering Authentication, Authorization and Auditing. Hadoop takes a “defense in depth” approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases and design for authorization improvements in HDFS, Hive and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches for Hive authorization. Further, we show how our approach lends itself to a particular initial implementation choice that has the limitation that the Hive Server owns the data, while an alternate, more general implementation is also possible down the road. In the case of HBase, we describe cell-level authorization. The talk will be fairly detailed, targeting a technical audience, including Hadoop contributors.
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins from Google and Apache projects
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Yahoo, Facebook, and Amazon use Hadoop for applications like log processing, searches, and advertisement targeting
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins at Google and Yahoo
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Facebook and Amazon use Hadoop to process massive amounts of data
If you are searching for the best engineering college in India, you can trust RCE (Roorkee College of Engineering) and its services and facilities. They provide excellent education facilities, highly educated and experienced faculty, well-furnished hostels for both boys and girls, a top computerized library, great placement opportunities, and more at an affordable fee.
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge (CloudIDSummit)
Aaron T. Myers (ATM), Software Engineer, Cloudera, Inc.
The era of “Big Data for the masses” is upon us. Despite the mindshare Big Data has been receiving – driven by the development and distribution of Apache Hadoop – the first commercialized release came only in December 2011, from Cloudera, Inc. Cloudera remains the leading Hadoop platform provider in the market today. Now, with a diverse enterprise and government early adopter customer list, through Cloudera we can get a bird’s eye view of the leading authentication issues beginning to emerge from these companies headed out of the sandbox and into full production.
Speaker Aaron T. Myers (ATM) was one of Cloudera’s earliest engineers and maintains a core focus on Apache Hadoop core, specifically focused on HDFS and Hadoop’s security features. ATM is an Apache Hadoop PMC Member and Committer.
We provide Hadoop training in Hyderabad and Bangalore, including corporate training, with faculty who have 12+ years of experience.
Real-time industry experts from MNCs
Resume preparation by expert professionals
Lab exercises
Interview preparation
Expert advice
http://www.learntek.org/product/big-data-and-hadoop/
http://www.learntek.org
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing and other IT and management courses. We are dedicated to designing, developing and implementing training programs for students, corporate employees and business professionals.
This document provides an overview of Hadoop, including:
1. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware.
2. It describes the architecture of Hadoop, including the Hadoop Distributed File System (HDFS) and MapReduce engine. HDFS uses a master/slave architecture with a NameNode and DataNodes, while MapReduce uses a JobTracker and TaskTrackers.
3. It discusses some common uses of Hadoop in industry, such as for log processing, web search indexing, and ad-hoc queries at large companies like Yahoo, Facebook, and Amazon.
Hive is a data warehouse infrastructure tool used to process large datasets in Hadoop. It allows users to query data using SQL-like queries. Hive resides on HDFS and uses MapReduce to process queries in parallel. It includes a metastore to store metadata about tables and partitions. When a query is executed, Hive's execution engine compiles it into a MapReduce job which is run on a Hadoop cluster. Hive is better suited for large datasets and queries compared to traditional RDBMS which are optimized for transactions.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a master-slave architecture with the NameNode as master and DataNodes as slaves. The NameNode manages file system metadata and the DataNodes store data blocks. Hadoop also includes a MapReduce engine where the JobTracker splits jobs into tasks that are processed by TaskTrackers on each node. Hadoop saw early adoption from companies handling big data like Yahoo!, Facebook and Amazon and is now widely used for applications like advertisement targeting, search, and security analytics.
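A classic word-count job is the usual way to make the MapReduce part of that architecture concrete; the sketch below follows the standard pattern (input and output paths come from the command line).
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);       // emit (word, 1)
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();                 // sum counts per word
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }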
This is a presentation on Apache Hadoop technology. It may be helpful for beginners who want to learn the terminology of Hadoop. It contains some pictures that describe how this technology works. I hope it will be helpful for beginners.
Thank you.
This presentation is about Apache Hadoop technology. It may be helpful for beginners, who will learn some of the terminology of Hadoop. There are also some diagrams that show how this technology works.
Thank you.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and an ecosystem of related projects like Hive, HBase, Pig and Zookeeper that provide additional functions. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and additional tools like Hive, Pig, HBase, Zookeeper, Flume, Sqoop and Oozie that make up its ecosystem. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
Petabyte scale on commodity infrastructure (elliando dias)
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It describes how Hadoop addresses the need to reliably process huge datasets using a distributed file system and MapReduce processing on commodity hardware. It also provides details on how Hadoop has been implemented and used at Yahoo to process petabytes of data and support thousands of jobs weekly on large clusters.
Running An Apache Project: 10 Traps and How to Avoid Them (Owen O'Malley)
This document discusses 10 common mistakes made when running an Apache project and how to avoid them. These include starting development on GitHub without an IP agreement, keeping the project secret instead of promoting it, not fostering diversity among contributors, holding exclusive in-person meetings, including binary artifacts, having a high barrier for committer status, ignoring trademark issues, not rewarding open source work, engaging in stealth development, and licensing problems. The key recommendations are to build an open community, do work transparently, train employees on open source best practices, and ensure proper licensing.
This document summarizes several big data systems and their approaches to providing ACID transactions. It finds that supporting object stores like S3 is becoming critical as streaming data ingest grows. Hive ACID is adding support for Presto and Impala queries. Iceberg is improving by adding delta files and Hive support. Overall, this is an active area of development that will continue changing in the next six months to better handle tasks like GDPR data removal and updating large datasets.
This document provides an overview of the ORC file format. It describes the key requirements and design decisions, including file structure, stripe structure, encoding columns, run length encoding, compression, indexing, and versioning. It also discusses optimizations, debugging, and using ORC from SQL, Java, C++, and the command line. The document is intended to help users and developers better understand how ORC works.
Protect your private data with ORC column encryption (Owen O'Malley)
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads but with integrated support for finding required rows quickly.
Owen O’Malley dives into the progress the Apache community made for adding fine-grained column-level encryption natively into ORC format, which also provides capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.
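A hedged sketch of how this surfaces to a table owner (Java over Hive JDBC). The property names follow the ORC column-encryption design, with 'orc.encrypt' mapping a KMS master-key name to columns and 'orc.mask' choosing a redaction for readers without the key, but the exact syntax should be checked against the ORC documentation for your version; the key, table, and column names are assumptions.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class OrcColumnEncryptionExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // 'pii' is assumed to be a master key managed by Hadoop KMS.
                stmt.execute("CREATE TABLE customers ("
                        + "  id BIGINT, name STRING, ssn STRING, email STRING) "
                        + "STORED AS ORC "
                        + "TBLPROPERTIES ("
                        + "  'orc.encrypt'='pii:ssn,email',"      // encrypt these columns with 'pii'
                        + "  'orc.mask'='nullify:ssn,email')");   // what readers without the key see
            }
        }
    }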
Fine Grain Access Control for Big Data: ORC Column Encryption (Owen O'Malley)
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads, but with integrated support for finding required rows quickly. In this talk, we will outline the progress made in Apache community for adding fine-grained column level encryption natively into ORC format that will also provide capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys providing the additional flexibility to use and manage keys per column centrally.
Fast Access to Your Data - Avro, JSON, ORC, and Parquet (Owen O'Malley)
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
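The "reading a few of the columns" case looks like this from Spark's Java API (the path and column names are illustrative); with a columnar format such as ORC or Parquet, only the selected columns' bytes need to be read, which is where the large differences between formats show up.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    public class ColumnProjectionExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("format-benchmark-sketch")
                    .master("local[*]")
                    .getOrCreate();

            // Read only two columns and apply a filter predicate; both the
            // projection and the predicate can be pushed into the file format.
            Dataset<Row> taxi = spark.read().format("orc").load("/data/taxi_orc");
            long count = taxi.select("pickup_time", "fare")
                             .where(col("fare").gt(10.0))
                             .count();
            System.out.println("rows with fare > 10: " + count);
            spark.stop();
        }
    }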
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Hive, and Presto.
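A hedged sketch of how an Iceberg table is created and read from Spark SQL (Java). The catalog wiring, a Hadoop-type catalog named "local" pointing at an illustrative warehouse path, and the table name are assumptions; the USING iceberg clause and the days(ts) partition transform follow the project's documented Spark integration.
    import org.apache.spark.sql.SparkSession;

    public class IcebergTableExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("iceberg-sketch")
                    .master("local[*]")
                    // Register a Hadoop-type Iceberg catalog named 'local'.
                    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
                    .config("spark.sql.catalog.local.type", "hadoop")
                    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
                    .getOrCreate();

            // Partitioning is a table property, not a directory layout the
            // query author has to know about.
            spark.sql("CREATE TABLE IF NOT EXISTS local.db.events "
                    + "(id BIGINT, ts TIMESTAMP, payload STRING) "
                    + "USING iceberg PARTITIONED BY (days(ts))");

            spark.sql("INSERT INTO local.db.events VALUES "
                    + "(1, TIMESTAMP '2018-09-01 10:00:00', 'hello')");

            // Readers plan against a snapshot of table metadata, not directory listings.
            spark.sql("SELECT count(*) FROM local.db.events").show();
            spark.stop();
        }
    }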
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet (Owen O'Malley)
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.
To provide better security, ORC files are adding column encryption. Column encryption provides the ability to grant access to different columns within the same file. All of the encryption is handled transparently to the user.
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ... (SOFTTECHHUB)
I started my online journey with several hosting services before stumbling upon Ai EngineHost. At first, the idea of paying one fee and getting lifetime access seemed too good to pass up. The platform is built on reliable US-based servers, ensuring your projects run at high speeds and remain safe. Let me take you step by step through its benefits and features as I explain why this hosting solution is a perfect fit for digital entrepreneurs.
Generative Artificial Intelligence (GenAI) in Business (Dr. Tathagat Varma)
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I... (Impelsys Inc.)
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2... (Alan Dix)
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and educations, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxAnoop Ashok
In today's fast-paced retail environment, efficiency is key. Every minute counts, and every penny matters. One tool that can significantly boost your store's efficiency is a well-executed planogram. These visual merchandising blueprints not only enhance store layouts but also save time and money in the process.
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell
With expertise in data architecture, performance tracking, and revenue forecasting, Andrew Marnell plays a vital role in aligning business strategies with data insights. Andrew Marnell’s ability to lead cross-functional teams ensures businesses achieve sustainable growth and operational excellence.
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungenpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web wird als die nächste Generation des HCL Notes-Clients gefeiert und bietet zahlreiche Vorteile, wie die Beseitigung des Bedarfs an Paketierung, Verteilung und Installation. Nomad Web-Client-Updates werden “automatisch” im Hintergrund installiert, was den administrativen Aufwand im Vergleich zu traditionellen HCL Notes-Clients erheblich reduziert. Allerdings stellt die Fehlerbehebung in Nomad Web im Vergleich zum Notes-Client einzigartige Herausforderungen dar.
Begleiten Sie Christoph und Marc, während sie demonstrieren, wie der Fehlerbehebungsprozess in HCL Nomad Web vereinfacht werden kann, um eine reibungslose und effiziente Benutzererfahrung zu gewährleisten.
In diesem Webinar werden wir effektive Strategien zur Diagnose und Lösung häufiger Probleme in HCL Nomad Web untersuchen, einschließlich
- Zugriff auf die Konsole
- Auffinden und Interpretieren von Protokolldateien
- Zugriff auf den Datenordner im Cache des Browsers (unter Verwendung von OPFS)
- Verständnis der Unterschiede zwischen Einzel- und Mehrbenutzerszenarien
- Nutzung der Client Clocking-Funktion
Dev Dives: Automate and orchestrate your processes with UiPath MaestroUiPathCommunity
This session is designed to equip developers with the skills needed to build mission-critical, end-to-end processes that seamlessly orchestrate agents, people, and robots.
📕 Here's what you can expect:
- Modeling: Build end-to-end processes using BPMN.
- Implementing: Integrate agentic tasks, RPA, APIs, and advanced decisioning into processes.
- Operating: Control process instances with rewind, replay, pause, and stop functions.
- Monitoring: Use dashboards and embedded analytics for real-time insights into process instances.
This webinar is a must-attend for developers looking to enhance their agentic automation skills and orchestrate robust, mission-critical processes.
👨🏫 Speaker:
Andrei Vintila, Principal Product Manager @UiPath
This session streamed live on April 29, 2025, 16:00 CET.
Check out all our upcoming Dev Dives sessions at https://community.uipath.com/dev-dives-automation-developer-2025/.
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
Technology Trends in 2025: AI and Big Data AnalyticsInData Labs
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy nowand implement the key findings to improve your business.
Big Data Analytics Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfSoftware Company
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
2. Who Am I?
• Software Architect working on Hadoop since Jan 2006
– Before Hadoop, worked on Yahoo Search’s WebMap
– My first patch on Hadoop was Nutch-197
– First Yahoo Hadoop committer
– Most prolific contributor to Hadoop (by patch count)
– Won the 2008 1TB and 2009 Minute and 100TB Sort Benchmarks
• Apache VP of Hadoop
– Chair of the Hadoop Project Management Committee
– Quarterly reports on the state of Hadoop for Apache Board
Hadoop World NYC - 2009
3. What are the Problems?
• Our shared clusters increase:
– Developer and operations productivity
– Hardware utilization
– Access to data
• Yahoo! wants to put customer and financial data on our Hadoop clusters.
– Great for providing access to all of the parts of Yahoo!
– Need to make sure that only the authorized people have access.
• Rolling out new versions of Hadoop is painful
– Clients need to change and recompile their code
4. Hadoop Security
• Currently, the Hadoop servers trust the users to declare who they are.
– It is very easy to spoof, especially with open source.
– For private clusters, we will keep running without security as an option.
• We need to ensure that users are who they claim to be.
• All access to HDFS (and therefore MapReduce) must be authenticated.
• The standard distributed authentication service is Kerberos (including Active Directory).
• User code isn’t affected, since the security happens in the RPC layer.
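As a concrete illustration of how this surfaces in client code, here is a minimal sketch of a Kerberos keytab login through Hadoop's UserGroupInformation API as it exists in later secure releases; the principal, keytab path, and HDFS path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Switch the client from "trust the declared user" to Kerberos authentication.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Placeholder principal and keytab; interactive users would rely on kinit tickets instead.
    UserGroupInformation.loginUserFromKeytab(
        "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

    // After login, normal HDFS calls are authenticated in the RPC layer; user code is unchanged.
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/user/alice")));
  }
}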
5. HDFS Security
• Hadoop security is grounded in HDFS security.
– Other services such as MapReduce store their state in HDFS.
• Use of Kerberos allows a single sign-on, where the Hadoop commands pick up and use the user’s tickets.
• The framework authenticates the user to the Name Node using Kerberos before any operations.
• The Name Node is also authenticated to the user.
• The client can request an HDFS Access Token to get access later without going through Kerberos again.
– Prevents authentication storms against the KDC as MapReduce jobs launch!
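A minimal sketch of fetching such tokens up front, using the FileSystem.addDelegationTokens call from later Hadoop releases (where they are known as delegation tokens); the NameNode URI and renewer principal are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

public class FetchTokensSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://nn.example.com:8020"), conf);

    // Collect tokens once, so later accesses do not go back to the KDC.
    Credentials creds = new Credentials();
    Token<?>[] tokens = fs.addDelegationTokens("jt/nn.example.com@EXAMPLE.COM", creds);
    for (Token<?> t : tokens) {
      System.out.println("Fetched token: " + t.getKind());
    }
  }
}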
6. Accessing a File
• The user uses Kerberos (or an HDFS Access Token) to authenticate to the Name Node.
• They request to open a file X.
• If they have permission to file X, the Name Node returns a token for reading the blocks of X.
• The user presents these tokens when communicating with the Data Nodes to show they have access.
• There are also tokens for writing blocks when the file is being created.
7. MapReduce Security
• Framework authenticates user to Job Tracker before they can submit, modify, or kill jobs.
• The Job Tracker authenticates itself to the user.
• Job’s logs (including stdout) are only visible to the user.
• Map and Reduce tasks actually run as the user.
• Tasks’ working directories are protected from others.
• The Job Tracker’s system directory is no longer readable and writable by everyone.
• Only the reduce tasks can get the map outputs.
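As a small illustration of per-job protection, the sketch below sets job-level view and modify ACLs using the property names from later releases; the user and group names are placeholders, and ACL enforcement also has to be enabled cluster-wide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobAclSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "acl-sketch");

    // Placeholder ACLs: user "alice" plus group "analysts" may view the job and its logs,
    // but only "alice" may modify or kill it. Format is "users groups" (space separated).
    job.getConfiguration().set("mapreduce.job.acl-view-job", "alice analysts");
    job.getConfiguration().set("mapreduce.job.acl-modify-job", "alice");

    System.out.println(job.getConfiguration().get("mapreduce.job.acl-view-job"));
  }
}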
8. Interactions with HDFS
• MapReduce jobs need to read and write HDFS files as the user.
• Currently, we store the user name in the job.
• With security enabled, we will store HDFS Access Tokens in the job.
• The job needs a token for each HDFS cluster.
• The tokens will be renewed by the Job Tracker so they don’t expire for long running jobs.
• When the job completes, the tokens will be cancelled.
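A sketch of how a job can gather tokens for every HDFS cluster it touches, using the TokenCache helper from later releases; the cluster host names and paths are hypothetical, and the standard input formats normally collect tokens for their own input paths automatically.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.security.TokenCache;

public class MultiClusterTokensSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "multi-cluster-sketch");

    // Hypothetical paths on two different HDFS clusters; one token is needed per NameNode.
    Path[] paths = {
        new Path("hdfs://clusterA.example.com:8020/data/input"),
        new Path("hdfs://clusterB.example.com:8020/data/reference")
    };

    // Stores the fetched tokens in the job's credentials, where the Job Tracker
    // can renew them and cancel them when the job finishes.
    TokenCache.obtainTokensForNamenodes(job.getCredentials(), paths, job.getConfiguration());
  }
}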
9. Interactions with Higher Layers
• Yahoo uses a workflow manager named Oozie to submit MapReduce jobs on behalf of the user.
• We could store the user’s credentials with a modifier (oom/oozie) in Oozie to access Hadoop as the user.
• Or we could create token-granting tokens for HDFS and MapReduce and store those in Oozie.
• In either case, such proxies are a potential source of security problems, since they store a large number of users’ access credentials.
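For comparison, later Hadoop releases added a proxy-user (impersonation) mechanism for exactly this kind of middle tier: the service authenticates as itself and then acts on behalf of the end user. A minimal sketch, assuming the cluster's hadoop.proxyuser.* settings trust the service account and "alice" is a placeholder user:

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserSketch {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();

    // The service (e.g. a workflow manager) has already logged in as itself;
    // it then creates a proxy UGI that acts on behalf of "alice".
    UserGroupInformation proxy =
        UserGroupInformation.createProxyUser("alice", UserGroupInformation.getLoginUser());

    // Everything inside doAs runs with alice's identity, subject to the
    // cluster's hadoop.proxyuser.* impersonation settings.
    boolean exists = proxy.doAs((PrivilegedExceptionAction<Boolean>) () ->
        FileSystem.get(conf).exists(new Path("/user/alice")));
    System.out.println(exists);
  }
}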
10. Web UIs
• Hadoop, and especially MapReduce, makes heavy use of the web UIs.
• These also need to be authenticated…
• Fortunately, there is a standard solution for Kerberos and HTTP, named SPNEGO.
• SPNEGO is supported by all of the major browsers.
• All of the servlets will use SPNEGO to authenticate the user and enforce permissions appropriately.
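From the client side, a SPNEGO-protected UI can also be reached programmatically; the sketch below uses the AuthenticatedURL helper from the hadoop-auth module that appeared after this talk, assumes a Kerberos ticket is already in the login cache, and points at a made-up Job Tracker URL.

import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.security.authentication.client.AuthenticatedURL;

public class SpnegoClientSketch {
  public static void main(String[] args) throws Exception {
    // Negotiates SPNEGO using the caller's Kerberos credentials (e.g. from kinit).
    AuthenticatedURL.Token token = new AuthenticatedURL.Token();
    HttpURLConnection conn = new AuthenticatedURL()
        .openConnection(new URL("http://jobtracker.example.com:50030/jobtracker.jsp"), token);
    System.out.println("HTTP status: " + conn.getResponseCode());
  }
}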
11. Remaining Security Issues
• We are not encrypting on the wire.
– It will be possible within the framework, but not in 0.22.
• We are not encrypting on disk.
– For either HDFS or MapReduce.
• Encryption is expensive in terms of CPU and IO speed.
• Our current threat model is that the attacker has access to a user account, but not root.
– They can’t sniff the packets on the network.
12. Backwards Compatibility
• API
• Protocols
• File Formats
• Configuration
13. API Compatibility
• Need to mark APIs with
– Audience: Public, Limited Private, Private
– Stability: Stable, Evolving, Unstable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Xxxx {…}
– Developers need to ensure that 0.22 is backwards compatible with 0.21
• Defined new APIs designed to be future-proof:
– MapReduce – Context objects in org.apache.hadoop.mapreduce
– HDFS – FileContext in org.apache.hadoop.fs
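As a reminder of what the future-proof MapReduce API looks like, here is a minimal mapper written against the org.apache.hadoop.mapreduce Context-object API; the class name and emitted key are arbitrary examples.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: all interaction goes through the Context object, which lets the
// framework evolve without breaking compiled user code.
public class LineCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private static final Text KEY = new Text("lines");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(KEY, ONE);
  }
}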
14. Protocol Compatibility
• Currently all clients of a server must be the same version (0.18, 0.19, 0.20, 0.21).
• Want to enable forward and backward compatibility
• Started work on Avro
– Includes the schema of the information as well as the data
– Can support different schemas on the client and server
– Still need to make the code tolerant of version differences
– Avro provides the mechanisms
• Avro will be used for file version compatibility too
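A small sketch of the schema-resolution idea with Avro's generic Java API: the reader supplies its own (newer) schema and Avro reconciles it with the writer's schema stored in the file; the file names are hypothetical.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroEvolutionSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical newer schema used by the reader; the writer's schema travels inside the file.
    Schema readerSchema = new Schema.Parser().parse(new File("heartbeat_v2.avsc"));

    // The datum reader resolves differences (added fields with defaults, dropped fields, etc.).
    GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(readerSchema);
    try (DataFileReader<GenericRecord> fileReader =
             new DataFileReader<>(new File("heartbeats.avro"), datumReader)) {
      for (GenericRecord record : fileReader) {
        System.out.println(record);
      }
    }
  }
}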
15. Configuration
• Configuration in Hadoop is a string to string map.
• Maintaining backwards compatibility of configuration knobs was done case by case.
• Now we have standard infrastructure for declaring old knobs deprecated.
• Also have cleaned up a lot of the names in 0.21.
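The deprecation infrastructure works roughly like the sketch below, shown with the Configuration.addDeprecation helper as it appears in later releases; the mapred.job.name to mapreduce.job.name renaming is one of the 0.21 cleanups, but treat the exact call as illustrative.

import org.apache.hadoop.conf.Configuration;

public class ConfigDeprecationSketch {
  public static void main(String[] args) {
    // Register the old knob as a deprecated alias of the new one.
    Configuration.addDeprecation("mapred.job.name", "mapreduce.job.name");

    Configuration conf = new Configuration();
    // Old clients can keep setting the old name...
    conf.set("mapred.job.name", "legacy-client-job");
    // ...while new code reads the new name and still sees the value.
    System.out.println(conf.get("mapreduce.job.name"));
  }
}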