Kerberos is the system that underpins the vast majority of strong authentication across the Apache HBase/Hadoop application stack. Kerberos errors have brought many to their knees, and it is often referred to as “black magic” or “the dark arts”; a long-standing joke about how few people understand how it works. This talk will cover the types of problems that Kerberos does and does not solve for HBase, decrypt some jargon on the related libraries and technology that enable Kerberos authentication in HBase and Hadoop, and distill some basic takeaways designed to help users develop an application that can securely communicate with a “kerberized” HBase installation.
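As a concrete illustration of that last point, the sketch below shows the typical shape of a Java client that logs in from a keytab before opening an HBase connection. It is a minimal sketch only: the principal, keytab path, table, and row key are illustrative placeholders, and the hbase-site.xml/core-site.xml on the classpath must already carry the cluster's Kerberos settings.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHBaseClient {
  public static void main(String[] args) throws IOException {
    // Picks up hbase-site.xml/core-site.xml from the classpath; those files must
    // already declare hbase.security.authentication=kerberos and related settings.
    Configuration conf = HBaseConfiguration.create();

    // Log in from a keytab instead of relying on an external kinit/ticket cache.
    // The principal and keytab path are illustrative placeholders.
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "myapp/[email protected]", "/etc/security/keytabs/myapp.keytab");

    // The HBase connection negotiates SASL/GSSAPI using the login above.
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("my_table"))) {
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println("Cells returned: " + result.size());
    }
  }
}
```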
Apache Phoenix Query Server (PhoenixCon 2016) - Josh Elser
This document discusses Apache Phoenix Query Server, which provides a client-server abstraction for Apache Phoenix using Apache Calcite's Avatica sub-project. It allows Phoenix to have thin clients by offloading computational resources to query servers running on Hadoop clusters. This enables non-Java clients through a standardized HTTP API. The query server implementation uses HTTP, Protocol Buffers for serialization, and common libraries like Jetty and Dropwizard Metrics. It aims to simplify Phoenix client development and improve performance and scalability.
Apache Phoenix’s relational database view over Apache HBase delivers a powerful tool which enables users and developers to quickly and efficiently access their data using SQL. However, Phoenix only provides a Java client, in the form of a JDBC driver, which limits Phoenix access to JVM-based applications. The Phoenix QueryServer is a standalone service which provides the building blocks to use Phoenix from any language, not just those running in a JVM. This talk will serve as a general purpose introduction to the Phoenix QueryServer and how it complements existing Apache Phoenix applications. Topics covered will range from design and architecture of the technology to deployment strategies of the QueryServer in production environments. We will also include explorations of the new use cases enabled by this technology like integrations with non-JVM based languages (Ruby, Python or .NET) and the high-level abstractions made possible by these basic language integrations.
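To make the thin-client idea concrete, the sketch below opens a JDBC connection against a Query Server over HTTP with Protocol Buffers serialization. The host, port (8765 is the Query Server's usual default), and table are illustrative placeholders; only the thin-client JAR is needed on the application's classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThinClientExample {
  public static void main(String[] args) throws Exception {
    // Host and port are placeholders; 8765 is the Query Server's default port.
    String url = "jdbc:phoenix:thin:url=http://pqs.example.com:8765;serialization=PROTOBUF";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         // my_table is a hypothetical, pre-existing Phoenix table.
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getLong("id") + "\t" + rs.getString("name"));
      }
    }
  }
}
```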
Effective Testing of Apache Accumulo Iterators - Josh Elser
Accumulo Summit 2016. Apache Accumulo’s Iterators are a powerful API which developers leverage to efficiently perform operations like aggregations and filters, reducing the latency of these operations by orders of magnitude. However, Iterators are notoriously difficult to implement correctly. This talk will introduce an Iterator testing harness designed to improve the code quality of newly created Iterators, catch common runtime pitfalls, and present an end-to-end testing solution for Iterators.
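For context, below is a sketch of the kind of small, server-side Iterator such a harness is meant to exercise: a Filter subclass whose only job is to drop entries with empty values. The class name is illustrative; a harness like the one described in the talk can then drive an Iterator like this through the seek, re-seek, and deep-copy scenarios that are easy to get wrong by hand.

```java
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Filter;

/**
 * A deliberately small server-side Filter: it keeps only entries whose value
 * is non-empty. The Filter base class handles the iterator plumbing
 * (seek/next/deepCopy), which is where hand-rolled Iterators most often go wrong.
 */
public class NonEmptyValueFilter extends Filter {
  @Override
  public boolean accept(Key k, Value v) {
    return v.getSize() > 0;
  }
}
```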
Data-Center Replication with Apache Accumulo - Josh Elser
This document describes the implementation of data replication in Apache Accumulo. It discusses justifying the need for replication to handle failures, describes how replication is implemented using write-ahead logs, and outlines future work including replicating to other systems and improving consistency.
Apache HBase Internals You Hoped You Never Needed to Understand - Josh Elser
Covers numerous internal features, concepts, and implementations of Apache HBase. The focus will be driven from an operational standpoint, investigating each component enough to understand its role in Apache HBase and the generic problem it is trying to solve. Topics will range from HBase’s RPC system to the new Procedure v2 framework, to filesystem and ZooKeeper use, to backup and replication features, to region assignment and row locks. Each topic will be covered at a high level, attempting to distill the often complicated details down to the most salient information.
This talk will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2017, and the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been a long time in development, including rewritten region assignment, performance improvements (RPC, a rewritten write pipeline, etc.), async clients, a C++ client, off-heaping of the memstore and other buffers, Spark integration, shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical detail on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Phoenix 5.0 is the next big and most exciting milestone release because of Phoenix’s integration with Apache Calcite, which adds a lot of performance benefits through the new query optimizer and helps Phoenix integrate with other data sources, especially those also based on Calcite. It also has a lot of compelling features such as encoded columns, Kafka and Hive integration, improvements in secondary index rebuilding, and many performance improvements.
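As a taste of one of those features, the sketch below uses the HBase 2.0 asynchronous client to issue a non-blocking Get. The table and row names are placeholders; error handling and connection cleanup are omitted for brevity.

```java
import java.util.concurrent.CompletableFuture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.AsyncConnection;
import org.apache.hadoop.hbase.client.AsyncTable;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AsyncGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // createAsyncConnection returns a CompletableFuture, so connection setup
    // itself does not block the calling thread.
    CompletableFuture<AsyncConnection> connFuture =
        ConnectionFactory.createAsyncConnection(conf);

    connFuture.thenCompose(conn -> {
      AsyncTable<?> table = conn.getTable(TableName.valueOf("my_table"));
      return table.get(new Get(Bytes.toBytes("row1")));
    }).thenAccept((Result result) ->
        System.out.println("Got " + result.size() + " cells")
    ).join(); // block here only so this demo JVM does not exit early
  }
}
```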
Apache Phoenix: Past, Present and Future of SQL over HBase - enissoz
HBase, as the NoSQL database of choice in the Hadoop ecosystem, has already proven itself at scale and in many mission-critical workloads across hundreds of companies. Phoenix, as the SQL layer on top of HBase, has increasingly become the tool of choice as the perfect complement to HBase. Phoenix is now being used more and more for very low latency querying and fast analytics across a large number of users in production deployments. In this talk, we will cover what makes Phoenix attractive to current and prospective HBase users, like SQL support, JDBC, data modeling, secondary indexing, and UDFs, and also go over recent improvements like the Query Server, ODBC drivers, ACID transactions, Spark integration, etc. We will conclude by looking into items in the pipeline and how Phoenix and HBase interact with other engines like Hive and Spark.
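A minimal sketch of that SQL/JDBC experience is shown below: creating a table, adding a secondary index, upserting a row, and querying it back through the standard Phoenix JDBC driver. The ZooKeeper quorum, table, and column names are illustrative placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuickstart {
  public static void main(String[] args) throws Exception {
    // "Thick" JDBC URL: the driver talks to HBase directly via ZooKeeper.
    // The ZooKeeper quorum below is an illustrative placeholder.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1.example.com:2181");
         Statement stmt = conn.createStatement()) {

      stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
          + " host VARCHAR NOT NULL,"
          + " ts DATE NOT NULL,"
          + " metric_value DOUBLE,"
          + " CONSTRAINT pk PRIMARY KEY (host, ts))");

      // A secondary index lets queries filter on a non-rowkey column efficiently.
      stmt.execute("CREATE INDEX IF NOT EXISTS metrics_value_idx ON metrics (metric_value)");

      stmt.executeUpdate("UPSERT INTO metrics VALUES ('host1', CURRENT_DATE(), 42.0)");
      conn.commit(); // Phoenix buffers mutations until commit (auto-commit is off by default)

      try (ResultSet rs =
               stmt.executeQuery("SELECT host, metric_value FROM metrics WHERE metric_value > 10")) {
        while (rs.next()) {
          System.out.println(rs.getString("host") + " -> " + rs.getDouble("metric_value"));
        }
      }
    }
  }
}
```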
Apache Phoenix with Actor Model (Akka.io) for real-time Big Data Programming... - Trieu Nguyen
Apache Phoenix with Actor Model (Akka.io) for real-time Big Data Programming Stack
Why do we still need SQL for Big Data?
How can we make Big Data more responsive and faster?
This document summarizes a presentation about Apache Phoenix, an open-source project that allows HBase to be queried with SQL. It discusses what Phoenix is, why tracing is important, and the features of a new tracing web app created for Phoenix, including listing traces, visualizing trace distributions and individual trace details. Programming challenges in creating the app and new issues filed are also summarized.
Apache Phoenix: Use Cases and New Features - HBaseCon
James Taylor (Salesforce) and Maryann Xue (Intel)
This talk will be broken into two parts: Phoenix use cases and new Phoenix features. Three use cases will be presented as lightning talks by individuals from 1) Sony, about its social media NewsSuite app, 2) eHarmony, on its matching service, and 3) Salesforce.com, on its time-series metrics engine. Two new features will be discussed in detail by the engineers who developed them: ACID transactions in Phoenix through Apache Tephra, and cost-based query optimization through Apache Calcite. The focus will be on helping end users more easily develop scalable applications on top of Phoenix.
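To illustrate the transaction feature, here is a minimal sketch of a Phoenix transactional table in use, assuming a cluster where Phoenix transactions (backed by Apache Tephra) have been enabled; connection details, table, and column names are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class PhoenixTransactionSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Transactions must also be enabled on the cluster (Tephra transaction
    // manager running, phoenix.transactions.enabled=true server-side);
    // this client-side property alone is not sufficient.
    props.setProperty("phoenix.transactions.enabled", "true");

    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:zk1.example.com:2181", props);
         Statement stmt = conn.createStatement()) {

      stmt.execute("CREATE TABLE IF NOT EXISTS accounts ("
          + " id VARCHAR PRIMARY KEY, balance DOUBLE) TRANSACTIONAL=true");

      conn.setAutoCommit(false);
      // Both UPSERTs become visible atomically at commit; a conflicting
      // concurrent transaction would cause one of the commits to fail.
      stmt.executeUpdate("UPSERT INTO accounts VALUES ('alice', 90.0)");
      stmt.executeUpdate("UPSERT INTO accounts VALUES ('bob', 110.0)");
      conn.commit();
    }
  }
}
```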
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse - Josh Elser
An overview of Apache Phoenix and Apache HBase from the angle of a traditional data warehousing solution. This talk focuses on where this open-source architecture fits into the market and outlines the features and integrations of the product, showing that it is a viable alternative to traditional data warehousing solutions.
This talk will be an overview of the new features and improvements currently implemented for the Apache Accumulo 1.8.0 release. It will discuss some of these exciting changes with a focus on what is of most importance to users.
- The document summarizes the state of Apache HBase, including recent releases, compatibility between versions, and new developments.
- Key releases include HBase 1.1, 1.2, and 1.3, which added features like async RPC client, scan improvements, and date-tiered compaction. HBase 2.0 is targeting compatibility improvements and major changes to data layout and assignment.
- New developments include date-tiered compaction for time series data, Spark integration, and ongoing work on async operations, replication 2.0, and reducing garbage collection overhead.
Hortonworks Technical Workshop: HBase and Apache Phoenix - Hortonworks
This document provides an overview of Apache HBase and Apache Phoenix. It discusses how HBase is a scalable, non-relational database that can store large volumes of data across commodity servers. Phoenix provides a SQL interface for HBase, allowing users to interact with HBase data using familiar SQL queries and functions. The document outlines new features in Phoenix for HDP 2.2, including improved support for secondary indexes and basic window functions.
- Hive originally only supported updating partitions by overwriting entire files, which caused issues for concurrent readers and limited functionality like row-level updates.
- The need for ACID transactions in Hive arose from wanting to support updating data in near real-time as it arrives and making ad hoc data changes without complex workarounds.
- Hive's ACID implementation stores changes as delta files, uses the metastore to manage transactions and locks, and runs compactions to merge deltas into base files (a minimal usage sketch follows this list).
- There were initial issues around correctness, performance, usability and resilience, but many have been addressed with ongoing work focused on further improvements and new features like multi-statement transactions and better integration with LLAP.
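As referenced in the list above, here is a minimal JDBC sketch of what using a Hive ACID table looks like from a client, assuming a HiveServer2 endpoint on a cluster with the ACID prerequisites (transaction manager, compactor threads) configured; host, credentials, table, and column names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAcidSketch {
  public static void main(String[] args) throws Exception {
    // The HiveServer2 URL and credentials are illustrative placeholders.
    String url = "jdbc:hive2://hs2.example.com:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive_user", "");
         Statement stmt = conn.createStatement()) {

      // ACID tables are ORC tables flagged as transactional (and, in this
      // Hive generation, bucketed).
      stmt.execute("CREATE TABLE IF NOT EXISTS events ("
          + " id INT, payload STRING)"
          + " CLUSTERED BY (id) INTO 4 BUCKETS"
          + " STORED AS ORC TBLPROPERTIES ('transactional'='true')");

      // Row-level changes land in delta files; compactions later merge them
      // into the base files, as described above.
      stmt.execute("INSERT INTO events VALUES (1, 'created'), (2, 'created')");
      stmt.execute("UPDATE events SET payload = 'updated' WHERE id = 1");
      stmt.execute("DELETE FROM events WHERE id = 2");
    }
  }
}
```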
This talk will give an overview of two exciting releases for Apache HBase 2.0 and Phoenix 5.0. HBase provides a NoSQL column store on Hadoop for random, real-time read/write workloads. Phoenix provides SQL on top of HBase. HBase 2.0 contains a large number of features that were a long time in development, including rewritten region assignment, performance improvements (RPC, a rewritten write pipeline, etc.), async clients and WAL, a C++ client, off-heaping of the memstore and other buffers, shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into detail on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Phoenix 5.0 is the next big Phoenix release because of its integration with HBase 2.0 and a lot of performance improvements in support of secondary indexes. It has many important new features such as encoded columns, Kafka and Hive integration, and many other performance improvements. This session will also describe the use cases that HBase and Phoenix are a good architectural fit for.
Speaker: Alan Gates, Co-Founder, Hortonworks
Apache Phoenix is a SQL skin over HBase that allows for low latency SQL queries over HBase data. It transforms SQL queries into native HBase APIs like scans and puts. Phoenix supports features like secondary indexing, multi-tenancy, and limited hash joins. It aims to leverage existing SQL tooling while providing performance optimizations like parallel scans. Upcoming features include improved secondary indexing and transaction support. Phoenix maps to existing HBase tables and allows dynamic columns to extend schemas during queries.
Yifeng Jiang presented on Apache Hive's present and future capabilities. Hive has achieved 100x performance improvements through technologies like ORC file format, Tez execution engine, and vectorized processing. Upcoming features like LLAP caching and a persistent Hive server aim to provide sub-second query response times for interactive analytics. Hive continues to evolve as the standard SQL interface for Hadoop, supporting a wide range of use cases from ETL and reporting to real-time analytics.
The Evolution of a Relational Database Layer over HBase - DataWorks Summit
Apache Phoenix is a SQL query layer over Apache HBase that allows users to interact with HBase through JDBC and SQL. It transforms SQL queries into native HBase API calls for efficient parallel execution on the cluster. Phoenix provides metadata storage, SQL support, and a JDBC driver. It is now a top-level Apache project after originally being developed at Salesforce. The speaker discussed Phoenix's capabilities like joins and subqueries, new features like HBase 1.0 support and functional indexes, and future plans like improved optimization through Calcite and transaction support.
The document discusses bringing multi-tenancy to Apache Zeppelin through the use of Apache Livy. Livy is an open-source REST interface that allows interacting with Spark from anywhere and enables features like multi-user sessions and security. It improves on previous versions of interactive analysis in Zeppelin by allowing custom user sessions through Livy and improving security and isolation between users through mechanisms like SPNEGO and impersonation. The integration of Livy provides multi-tenancy, security, and isolation for interactive analysis in Zeppelin.
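For a feel of the Livy API that Zeppelin builds on, the sketch below creates an interactive session over plain REST. The host is a placeholder and 8998 is Livy's usual default port; on a secured cluster the request would additionally authenticate via SPNEGO, and the proxyUser field only takes effect when impersonation is enabled.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivySessionSketch {
  public static void main(String[] args) throws Exception {
    String livy = "http://livy.example.com:8998"; // illustrative host; 8998 is Livy's default port
    HttpClient http = HttpClient.newHttpClient();

    // Create an interactive Scala/Spark session. With impersonation enabled,
    // Livy runs it as the named proxyUser rather than the Livy service principal.
    HttpRequest createSession = HttpRequest.newBuilder()
        .uri(URI.create(livy + "/sessions"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"kind\": \"spark\", \"proxyUser\": \"alice\"}"))
        .build();
    HttpResponse<String> created = http.send(createSession, HttpResponse.BodyHandlers.ofString());
    System.out.println("Session created: " + created.body());

    // Once the session is idle, code is submitted as statements, e.g.:
    //   POST /sessions/<id>/statements  with body {"code": "sc.parallelize(1 to 10).sum()"}
    // (session-id parsing and status polling omitted for brevity).
  }
}
```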
The document summarizes Apache Phoenix and HBase as an enterprise data warehouse solution. It discusses how Phoenix provides OLTP and analytics capabilities over HBase. It then covers various use cases where companies are using Phoenix and HBase, including for web analytics and time series data. Finally, it discusses optimizations that can be made to the schema design, queries, and writes in Phoenix to improve performance.
This document summarizes a presentation about Apache Phoenix and HBase. It discusses the past, present, and future of SQL on HBase. In the past section, it describes Phoenix's architecture and key features like secondary indexes, joins, and aggregation. The present section highlights recent Phoenix releases including row timestamps, transactions using Tephra, and the new Phoenix Query Server. The future section mentions upcoming integrations with Calcite and Hive.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance, including when and how to use ORC’s schema evolution, bloom filters, and predicate push-down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
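As a companion to those command-line tools, the ORC Java reader API exposes the same kind of metadata programmatically: the schema, row count, compression codec, and per-column statistics. The sketch below assumes an existing ORC file at a placeholder path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcMetadataDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The path is a placeholder; any local or HDFS ORC file will do.
    Reader reader = OrcFile.createReader(new Path("/tmp/example.orc"),
        OrcFile.readerOptions(conf));

    System.out.println("Schema:      " + reader.getSchema());
    System.out.println("Rows:        " + reader.getNumberOfRows());
    System.out.println("Compression: " + reader.getCompressionKind());

    ColumnStatistics[] stats = reader.getStatistics();
    for (int i = 0; i < stats.length; i++) {
      // toString() includes min/max/count where the column type supports them.
      System.out.println("Column " + i + ": " + stats[i]);
    }
  }
}
```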
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas... - HBaseCon
Both Spark and HBase are widely used, but how to use them together with high performance and simplicity is a very hard topic. The Spark HBase Connector (SHC) provides feature-rich and efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key-value store and complex relational SQL queries and enables users to perform complex data analytics on top of HBase using Spark.
SHC implements the standard Spark data source APIs, and leverages the Spark catalyst engine for query optimization. To achieve high performance, SHC constructs the RDD from scratch instead of using the standard HadoopRDD. With the customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes the maintenance very easy, while achieving a good tradeoff between performance and simplicity.
SHC also supports Phoenix-encoded data as input to HBase, in addition to Avro data. Defaulting to a simple native binary encoding is susceptible to future changes and is a risk for users who write data from SHC into HBase; as SHC evolves, backwards compatibility needs to be handled properly. Rather than relying on that default, SHC needs to support a more standard and well-tested format like Phoenix.
In this talk, we will demo how SHC works, how to use SHC in secure/non-secure clusters, how SHC works with multi-HBase clusters, etc. This talk will also benefit people who use Spark and other data sources (besides HBase) as it inspires them with ideas of how to support high performance data source access at the Spark DataFrame level.
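A rough sketch of what using SHC looks like from Spark SQL is shown below: a JSON catalog maps DataFrame columns onto the HBase row key and column families, and the connector is loaded as a data source. Table, column, and option names here are illustrative, and the exact data source name and option keys may vary between SHC releases.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShcReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("shc-read").getOrCreate();

    // The catalog maps Spark SQL columns onto the HBase row key and column
    // families; the table and column names are illustrative.
    String catalog = "{"
        + "\"table\":{\"namespace\":\"default\",\"name\":\"metrics\"},"
        + "\"rowkey\":\"key\","
        + "\"columns\":{"
        + "  \"id\":    {\"cf\":\"rowkey\", \"col\":\"key\",    \"type\":\"string\"},"
        + "  \"value\": {\"cf\":\"d\",      \"col\":\"value\",  \"type\":\"double\"}"
        + "}}";

    Dataset<Row> df = spark.read()
        .option("catalog", catalog)
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .load();

    // Projections and filters like these are candidates for SHC's column
    // pruning and predicate pushdown.
    df.select("id", "value").filter("value > 10").show();
  }
}
```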
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
This document discusses new features in Apache Hive 2.0, including:
1) Adding procedural SQL capabilities through HPLSQL for writing stored procedures.
2) Improving query performance through LLAP which uses persistent daemons and in-memory caching to enable sub-second queries.
3) Speeding up query planning by using HBase as the metastore instead of a relational database.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized operations.
5) Default use of the cost-based optimizer and continued improvements to statistics collection and estimation.
Apache Hive 2.0 provides major new features for SQL on Hadoop such as:
- HPLSQL which adds procedural SQL capabilities like loops and branches.
- LLAP which enables sub-second queries through persistent daemons and in-memory caching.
- Using HBase as the metastore which speeds up query planning times for queries involving thousands of partitions.
- Improvements to Hive on Spark and the cost-based optimizer.
- Many bug fixes and under-the-hood improvements were also made while maintaining backwards compatibility where possible.
SQL on Hadoop: Batch, Interactive, and Beyond.
Public presentation showing the history and where Hortonworks is looking to go with 100% open source technology.
Apache Hive, Apache SparkSQL, Apache Phoenix, and Apache Druid
Future of Data New Jersey - HDF 3.0 Deep Dive - Aldrin Piri
This document provides an overview and agenda for an HDF 3.0 Deep Dive presentation. It discusses new features in HDF 3.0 like record-based processing using a record reader/writer and QueryRecord processor. It also covers the latest efforts in the Apache NiFi community like component versioning and introducing a registry to enable capabilities like CI/CD, flow migration, and auditing of flows. The presentation demonstrates record processing in NiFi and concludes by discussing the evolution of Apache NiFi and its ecosystem.
This document discusses the evolution of Hadoop and its use cases in the adtech industry. It describes how Hadoop was initially used primarily for batch processing via Hive and MapReduce. Over time, improvements like Tez, Presto, and Impala enabled faster interactive SQL queries on big data. The document also outlines how the Hadoop ecosystem is now used for real-time log collection, reporting, model generation, and more across the entire adtech stack. Key recent developments discussed include improvements in Hive like LLAP that enable sub-second SQL and ACID transactions, as well as tools like Cloudbreak for deploying Hadoop clusters in the cloud.
This document discusses extending the functionality of Apache NiFi through custom processors and controller services. It provides an overview of the NiFi architecture and repositories, describes how to create extensions with minimal dependencies using Maven archetypes, and notes that most extensions can be developed within hours. Quick prototyping of data flows is possible using existing binaries, applications, and scripting languages. Resources for the NiFi developer guide and example Maven projects are also listed.
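As a flavor of what such an extension looks like, below is a minimal sketch of a custom processor: it pulls a FlowFile from the incoming queue, tags it with an attribute, and routes it to a success relationship. Class, attribute, and relationship names are illustrative; a real processor bundle would typically be generated from the NiFi Maven archetype and would also declare property descriptors.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

/**
 * A trivial processor: it tags each incoming FlowFile with an attribute and
 * routes it to "success". Real processors would also declare property
 * descriptors and additional relationships (e.g., failure).
 */
public class TagFlowFileProcessor extends AbstractProcessor {

  static final Relationship REL_SUCCESS = new Relationship.Builder()
      .name("success")
      .description("FlowFiles that were tagged")
      .build();

  @Override
  public Set<Relationship> getRelationships() {
    return Collections.singleton(REL_SUCCESS);
  }

  @Override
  public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
      return; // nothing queued on the incoming connection
    }
    flowFile = session.putAttribute(flowFile, "tagged.by", "TagFlowFileProcessor");
    session.transfer(flowFile, REL_SUCCESS);
  }
}
```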
Apache Deep Learning 101 - DWS Berlin 2018 - Timothy Spann
Apache Deep Learning 101 with Apache MXNet, Apache NiFi, MiniFi, Apache Tika, Apache OpenNLP, Apache Spark, Apache Hive, Apache HBase, Apache Livy and Apache Hadoop. Using Python, we run various existing models via MXNet Model Server and via Python APIs. We also use NLP for entity resolution.
This document provides an introduction to Apache Kafka. It begins with an overview of Kafka as a distributed messaging system that is real-time, scalable, low latency, and fault tolerant. It then covers key concepts such as topics, partitions, producers, consumers, and replication. The document explains how Kafka achieves fast reads and writes through its design and use of disk flushing and replication for durability. It also discusses how Kafka can be used to build real-time systems and provides examples like connected cars. Finally, it introduces Apache Metron as an example of a cyber security solution built on Kafka.
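To ground the producer/consumer concepts, here is a minimal sketch of a Java producer publishing a keyed record, so that all records for the same key land on the same partition and stay ordered. The broker address, topic, and payload are illustrative placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // The broker address and topic name are illustrative.
    props.put("bootstrap.servers", "kafka1.example.com:9092");
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());
    // acks=all waits for the full in-sync replica set, trading latency for durability.
    props.put("acks", "all");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // The key determines the partition, so records for the same car stay ordered.
      ProducerRecord<String, String> record =
          new ProducerRecord<>("vehicle-telemetry", "car-42", "{\"speed\": 88}");
      producer.send(record, (metadata, exception) -> {
        if (exception != null) {
          exception.printStackTrace();
        } else {
          System.out.printf("Wrote to %s-%d @ offset %d%n",
              metadata.topic(), metadata.partition(), metadata.offset());
        }
      });
    }
  }
}
```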
IoT with Apache MXNet and Apache NiFi and MiniFi - DataWorks Summit
1) The document discusses using Apache MXNet for industrial IoT applications. MiniFi ingests camera images and sensor data at the edge and runs Apache MXNet to recognize objects in images. The data is then stored in Hadoop.
2) It describes using Apache MXNet on edge devices like the Raspberry Pi and Nvidia Jetson TX1 to perform tasks like image recognition from cameras and sensors.
3) The document provides information on setting up Apache MXNet on various IoT devices and edge servers to enable machine learning and deep learning capabilities for industrial IoT applications.
MiniFi and Apache NiFi: IoT in Berlin, Germany 2018 - Timothy Spann
Future of Data: Berlin
Apache NiFi and MiniFi with Apache MXNet and TensorFlow for IoT from edge devices like Raspberry Pis, including Python and other tools.
Apache MXNet for IoT with Apache NiFi. Using Apache MXNet with Apache NiFi and MiniFi for IoT use cases: ingesting, managing, orchestrating, and running IoT workloads.
This document provides an agenda and overview for a presentation on deep learning on Hortonworks Data Platform (HDP). The presentation will cover using TensorFlow with Apache NiFi, running TensorFlow on YARN, using pre-built models with Apache MXNet, running an MXNet model server with NiFi, and running MXNet in Zeppelin notebooks and on YARN. It recommends installing CPU and GPU versions of frameworks on appropriate nodes and discusses options like TensorFlow, MXNet, and PyTorch. The document also outlines integrating Apache MXNet with NiFi for tasks like image classification using models on edge nodes or a model server.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Speaker: Alan Gates, Co-founder, Hortonworks
As seen at our meetup on 2017 Feb 21.
https://www.meetup.com/futureofdata-budapest/events/236853376/
Author: Marton Elek, Hortonworks
Data at Scales and the Values of Starting Small with Apache NiFi & MiNiFi - Aldrin Piri
This document discusses Apache NiFi and Apache MiNiFi. It begins with an overview of NiFi, describing its key features like guaranteed delivery, data buffering, and data provenance. It then introduces MiNiFi as a smaller version of NiFi that can operate on edge devices with limited resources. A use case is presented of a courier service gathering data from disparate sources using both NiFi and MiNiFi. The document concludes by discussing the NiFi ecosystem and encouraging participation in the community.
The document provides an introduction and overview of Apache NiFi and its architecture. It discusses how NiFi can be used to effectively manage and move data between different producers and consumers. It also summarizes key NiFi features like guaranteed delivery, data buffering, prioritization, and data provenance. Finally, it briefly outlines the NiFi architecture and components as well as opportunities for the future of the MiniFi project.
Data Con LA 2018 - Streaming and IoT by Pat Alwell - Data Con LA
Hortonworks DataFlow (HDF) is built with the vision of creating a platform that enables enterprises to build dataflow management and streaming analytics solutions that collect, curate, analyze and act on data in motion across the datacenter and cloud. Do you want to be able to provide a complete end-to-end streaming solution, from an IoT device all the way to a dashboard for your business users with no code? Come to this session to learn how this is now possible with HDF 3.1.
The document discusses Apache NiFi and its role in the Hadoop ecosystem. It provides an overview of NiFi, describes how it can be used to integrate with Hadoop components like HDFS, HBase, and Kafka. It also discusses how NiFi supports stream processing integrations and outlines some use cases. The document concludes by discussing future work, including improving NiFi's high availability, multi-tenancy, and expanding its ecosystem integrations.
Curb your insecurity with HDP - Tips for a Secure Cluster - ahortonworks
NOTE: Slides contain GIFs, which may appear as dark images.
You got your cluster installed and configured. You celebrate, until the party is ruined by your company's Security officer stamping a big "Deny" on your Hadoop cluster. And oops!! You cannot place any data onto the cluster until you can demonstrate it is secure. In this session you will learn the tips and tricks to fully secure your cluster for data at rest, data in motion and all the apps including Spark. Your Security officer can then join your Hadoop revelry (unless you don't authorize him to, with your newly acquired admin rights)