This document discusses how Apache Atlas and Apache Ranger can be used together to provide a metadata-driven, secure data lake. Apache Atlas provides metadata services and tagging capabilities, and Apache Ranger uses the tags in Atlas to dynamically define and enforce access policies. The integration allows Ranger policies to apply automatically and adapt as Atlas metadata, such as tags, is updated. The document demonstrates how Atlas tags on columns and tables can be used to create time-based and PII data access policies in Ranger.
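As a rough illustration of that flow, the sketch below (Python, using the requests library) tags a Hive column with a PII classification through Atlas' REST v2 API; the hostnames, credentials, and qualified name are placeholders, and the endpoints should be checked against your Atlas version. Once the tag lands in Atlas, Ranger's tagsync propagates it, so an existing tag-based policy on PII covers the column automatically.

```python
import requests

ATLAS = "https://atlas-host:21000"   # placeholder Atlas endpoint
AUTH = ("admin", "admin")            # placeholder credentials

# Look up the GUID of a Hive column by its qualified name.
col = requests.get(
    f"{ATLAS}/api/atlas/v2/entity/uniqueAttribute/type/hive_column",
    params={"attr:qualifiedName": "default.customers.ssn@cl1"},
    auth=AUTH, verify=False).json()
guid = col["entity"]["guid"]

# Attach a PII classification (tag) to that column. Ranger tagsync picks
# this up, so a tag-based policy on "PII" now applies to the column
# without any change to resource-based policies.
requests.post(
    f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII"}],
    auth=AUTH, verify=False)
```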
This document provides an overview of Apache Atlas and how it addresses big data governance issues for enterprises. It discusses how Atlas provides a centralized metadata repository that allows users to understand data across Hadoop components. It also describes how Atlas integrates with Apache Ranger to enable dynamic security policies based on metadata tags. Finally, it outlines new capabilities in upcoming Atlas releases, including cross-component data lineage tracking and a business taxonomy/catalog.
The document outlines Renault's big data initiatives from 2014-2016 which progressed from an initial sandbox to a full industrialized big data platform. Key steps included implementing a new Hadoop infrastructure in 2015, industrializing the platform in 2016 to host production projects and POCs, and designing for scalability, isolation, simplified operations, and data protection. The document also discusses deploying quality projects to the data lake, ingestion scenarios, interactive SQL analytics, security measures including tokenization, and the next steps of federation and dynamic data change management.
This document summarizes improvements made to HDFS to optimize performance, stabilize operations, and improve supportability. Key areas discussed include logging enhancements, metrics and tools for troubleshooting, load management through RPC improvements, and changes to reduce garbage collection overhead and improve liveness detection. Specific optimizations covered range from code changes to reduce logging verbosity to adding batch processing of block reports.
The document discusses accelerating enterprise adoption of Apache Hadoop through a capability-driven approach. It outlines four core tenets for a Hadoop journey: having a capability-driven framework, using a heterogeneous set of technologies, choosing the right fit of open source and commercial solutions, and developing a flexible operating model. Case studies show how following these tenets can help reduce data processing times and give business users improved analytics capabilities.
This document discusses architecting Hadoop for adoption and data applications. It begins by explaining how traditional systems struggle as data volumes increase and how Hadoop can help address this issue. Potential Hadoop use cases are presented such as file archiving, data analytics, and ETL offloading. Total cost of ownership (TCO) is discussed for each use case. The document then covers important considerations for deploying Hadoop such as hardware selection, team structure, and impact across the organization. Lastly, it discusses lessons learned and the need for self-service tools going forward.
This document discusses strategies for filling a data lake by improving the process of data onboarding. It advocates using a template-based approach to streamline data ingestion from various sources and reduce dependence on hardcoded procedures. The key aspects are managing ELT templates and metadata through automated metadata extraction. This allows generating integration jobs dynamically based on metadata passed at runtime, providing flexibility to handle different source data with one template. It emphasizes reducing the risks associated with large data onboarding projects by maintaining a standardized and organized data lake.
Securing Enterprise Healthcare Big Data by the Combination of Knox/F5, Ranger... (DataWorks Summit)
Data security is critical to the success of large enterprises such as Mayo Clinic (MC), and healthcare data stored on the enterprise Big Data platforms is no exception. At MC, healthcare Big Data ingestion, storage, processing and analytics all take place in enterprise-secured environments, including the Sandbox, Dev, Int/Test and Prod Hadoop clusters. Primary data security in these enterprise-secured Hadoop clusters has been achieved at MC by combining the Knox Gateway/F5 load balancer, Ranger authorization and auditing, two-factor local authentication (TFA), and Kerberos authentication coupled to MC Active Directory and LDAP. In other words, any major HDFS, HBase or Hive healthcare data operation at MC has to go through the dedicated Knox Gateway or F5 balancer (for Knox HA) via the REST API, which interacts with Ranger and the other primary security components involved. Data security on the Big Data platforms at MC will be strengthened further by the ongoing network segmentation and SSL enablement of the related Hadoop ecosystem components. These approaches have significantly improved data security for the success of the MC Big Data program, although accessing the data requires highly skilled clients or applications.
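For a concrete sense of what "going through Knox" looks like to a client, here is a minimal Python sketch of an HDFS directory listing issued through a Knox gateway with WebHDFS; the host, topology name ("default"), path, and credentials are placeholders, not MC's actual configuration.

```python
import requests

# All traffic goes through the Knox gateway (or the F5 VIP in front of it);
# Knox authenticates the caller against AD/LDAP and forwards to WebHDFS.
KNOX = "https://knox-host:8443/gateway/default"   # placeholder host and topology

resp = requests.get(
    f"{KNOX}/webhdfs/v1/data/claims",
    params={"op": "LISTSTATUS"},
    auth=("clinical_user", "********"),
    verify="/etc/security/knox-ca.pem")           # TLS terminates at Knox

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```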
This document discusses navigating user data management and data discovery. It provides an overview of evaluating and selecting data management tools for a Hadoop data lake. Key criteria for evaluation include metadata curation, lineage and versioning, integration capabilities, and performance. Several vendors were evaluated, with Global ID, Attivio, and Waterline Data scoring highest based on the criteria. The presentation emphasizes selecting a limited number of tools based on business and user requirements.
Implementing a Data Lake with Enterprise Grade Data Governance (Hortonworks)
Hadoop provides a powerful platform for data science and analytics, where data engineers and data scientists can leverage myriad data from external and internal data sources to uncover new insight. Such power is also presenting a few new challenges. On the one hand, the business wants more and more self-service, and on the other hand IT is trying to keep up with the demand for data, while maintaining architecture and data governance standards.
In this webinar, Andrew Ahn, Data Governance Initiative Product Manager at Hortonworks, will address the gaps and offer best practices in providing end-to-end data governance in HDP. Andrew Ahn will be followed by Oliver Claude of Waterline Data, who will share a case study of how Waterline Data Inventory works with HDP in the Modern Data Architecture to automate the discovery of business and compliance metadata, data lineage, as well as data quality metrics.
The document summarizes the Cask Data Application Platform (CDAP), which provides an integrated framework for building and running data applications on Hadoop and Spark. It consolidates the big data application lifecycle by providing dataset abstractions, self-service data, metrics and log collection, lineage, audit, and access control. CDAP has an application container architecture with reusable programming abstractions and global user and machine metadata. It aims to simplify deploying and operating big data applications in enterprises by integrating technologies like YARN, HBase, Kafka and Spark.
The document discusses the rise of Big Data as a Service (BDaaS) and how recent technological advancements have enabled its emergence. It provides a brief history of Hadoop and how improvements in networking, storage, virtualization and containers have addressed earlier limitations. It defines BDaaS and describes the public cloud and on-premises deployment models. Finally, it highlights how BlueData's software platform can deliver an integrated BDaaS solution both on-premises and across multiple public clouds including AWS.
Integrated Data Warehouse with Hadoop and Oracle Database - Gwen (Chen) Shapira
This document discusses building an integrated data warehouse with Oracle Database and Hadoop. It provides an overview of big data and why data warehouses need Hadoop. It also gives examples of how Hadoop can be integrated into a data warehouse, including using Sqoop to import and export data between Hadoop and Oracle. Finally, it discusses best practices for using Hadoop efficiently and avoiding common pitfalls when integrating Hadoop with a data warehouse.
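A minimal example of the Sqoop-based integration the document refers to might look like the following Python wrapper around the sqoop CLI; the JDBC URL, schema, and paths are placeholders, and in practice the same job is usually launched straight from the shell or an orchestrator.

```python
import subprocess

# Pull an Oracle table into HDFS as Parquet with four parallel mappers.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//oradb-host:1521/ORCL",  # placeholder database
    "--username", "dw_reader",
    "--password-file", "/user/etl/.ora_pwd",   # keep credentials off the command line
    "--table", "SALES.ORDERS",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
    "--as-parquetfile",
], check=True)
```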
Strata San Jose 2017 - Ben Sharma Presentation (Zaloni)
The document discusses creating a modern data architecture using a data lake. It describes Zaloni as a provider of data lake management solutions, including a data lake management and governance platform and self-service data platform. It outlines key features of a data lake such as storing different types of data, creating standardized datasets, and providing shorter time to insights. The document also discusses Zaloni's data lake maturity model and reference architecture.
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL and their critical business operations on SAP applications. Organisations need this data to be available in real-time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable the new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real time with change data capture (CDC) technology
• Use Attunity Replicate for SAP, as other organisations do, to stream SAP data into Kafka
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)... (Hortonworks)
This document discusses using Hadoop and the Hortonworks Data Platform (HDP) for big data applications. It outlines how HDP can help organizations optimize their existing data warehouse, lower storage costs, unlock new applications from new data sources, and achieve an enterprise data lake architecture. The document also discusses how Talend's data integration platform can be used with HDP to easily develop batch, real-time, and interactive data integration jobs on Hadoop. Case studies show how companies have used Talend and HDP together to modernize their data architecture and improve product inventory and pricing forecasting.
Presentation from Data Science Conference 2.0 held in Belgrade, Serbia. The focus of the talk was to address the challenges of deploying a Data Lake infrastructure within the organization.
Insights into Real-world Data Management Challenges (DataWorks Summit)
Oracle began with the belief that the foundation of IT was managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle’s Integrated Cloud is one cloud for the entire business, meeting everyone’s needs. It’s about connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
The convergence of reporting and interactive BI on Hadoop (DataWorks Summit)
Since the early days of Hive, SQL on Hadoop has evolved from being a SQL wrapper on top of MapReduce to a viable replacement for the traditional EDW. In the meantime, while SQL-on-Hadoop vendors were busy adding enterprise capabilities and comparing their TPC-DS prowess against Hive, a niche industry emerged on the side for OLAP (a.k.a. “Interactive BI”) on Hadoop data. Unlike general-purpose SQL-on-Hadoop engines, which deal with the multiple aspects of warehousing, including reporting, OLAP-on-Hadoop engines focus almost exclusively on answering OLAP queries fast by using implementation techniques that had not been part of the SQL-on-Hadoop toolbox so far.
But SQL-on-Hadoop engines are not standing still. After having made huge progress in catching up to traditional EDWs for reporting workloads, SQL-on-Hadoop engines are now setting their sights on interactive BI. This is great news for enterprises. As the line between reporting and OLAP gets blurred, enterprises can now start considering using a single engine for both reporting and Interactive BI on their Hadoop data, as opposed to having to host, manage, and license two separate products.
Can a single engine satisfy both your reporting and Interactive BI needs? This may be a hard question to answer. Vendors use inconsistent terminology to describe their products and make ambitious and sometimes conflicting claims. This makes it very hard for enterprises to compare products, let alone decide which is the product that best matches their needs.
In this presentation, we’ll provide an overview of the different approaches to OLAP on Hadoop, and explain the key technologies behind each of them. We’ll use consistent terminology to describe what you get from multiple proprietary and open source products and outline advantages and disadvantages. You’ll come out equipped with the knowledge you need to read past marketing and sales pitches. You’ll be able to compare products and make an informed decision on whether a single engine for both reporting and Interactive BI on Hadoop is right for you.
Speaker
Gustavo Arocena, Big Data Architect, IBM
This webinar series covers Apache Kafka and Apache Storm for streaming data processing. It also discusses the new streaming innovations for Kafka and Storm included in HDP 2.2.
Big data security challenges differ somewhat from those of traditional client-server applications: big data systems are distributed in nature, which introduces unique security vulnerabilities. The Cloud Security Alliance (CSA) has categorized the security and privacy challenges into four aspects of the big data ecosystem: infrastructure security, data privacy, data management, and integrity and reactive security. Each of these aspects is further divided into the following security challenges:
1. Infrastructure security
a. Secure distributed processing of data
b. Security best practices for non-relational data stores
2. Data privacy
a. Privacy-preserving analytics
b. Cryptographic technologies for big data
c. Granular access control
3. Data management
a. Secure data storage and transaction logs
b. Granular audits
c. Data provenance
4. Integrity and reactive security
a. Endpoint input validation/filtering
b. Real-time security/compliance monitoring
In this talk, we are going to refer to the above classification and identify existing security controls, best practices, and guidelines. We will also paint a big picture of how the collective usage of all discussed security controls (Kerberos, TDE, LDAP, SSO, SSL/TLS, Apache Knox, Apache Ranger, Apache Atlas, Ambari Infra, etc.) can address fundamental security and privacy challenges across the entire Hadoop ecosystem. We will also briefly discuss recent security incidents involving Hadoop systems.
Speakers
Krishna Pandey, Staff Software Engineer, Hortonworks
Kunal Rajguru, Premier Support Engineer, Hortonworks
As containerization continues to gain momentum and become a de facto standard for application deployment, challenges around containerization of big data workloads are coming to light. Great strides have been made within the open source communities towards running big data workloads in containers, but much is left to be done.
Apache Hadoop YARN is the modern distributed operating system for big data applications. It has morphed the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. At its core, YARN has a very powerful scheduler which enforces global cluster level invariants and helps sites manage user and operator expectations of elastic sharing, resource usage limits, SLAs, and more. YARN recently increased its support for Docker containerization and added a YARN service framework supporting long-running services.
In this session we will explore the emerging patterns and challenges related to containers and big data workloads, including running applications such as Apache Spark, Apache HBase, and Kubernetes in containers on YARN.
Speakers
Billie Rinaldi, Principal Software Engineer, Hortonworks
Shane Kumpf, Software Engineer, Hortonworks
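As a hedged sketch of what Dockerized containers on YARN look like in practice, the snippet below submits the stock distributed-shell application with the Docker runtime selected through launch-environment variables; the jar path and image are placeholders, and the cluster must already have the Docker runtime enabled in its YARN configuration.

```python
import subprocess

DSHELL_JAR = "/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar"  # placeholder path

# Run a one-off command inside a Docker container scheduled by YARN.
subprocess.run([
    "yarn", "jar", DSHELL_JAR,
    "-jar", DSHELL_JAR,
    "-shell_command", "python -V",
    "-shell_env", "YARN_CONTAINER_RUNTIME_TYPE=docker",
    "-shell_env", "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=library/python:3.6",
    "-num_containers", "1",
], check=True)
```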
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges of traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. Areas covered include the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... (NoSQLmatters)
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
Worldpay processes billions of transactions annually and stores vast amounts of transaction and customer data. In 2015, Worldpay committed to building a new enterprise data platform on Hadoop to provide analytics, reporting, and machine learning capabilities. The platform uses a multi-tenancy model with different "tenancy types" like data warehousing, decision services, APIs, and technical insights. Each tenancy type has its own components and services. Worldpay's platform currently has live implementations for data warehousing and is developing multiple decision services, with a goal of supporting tens of services within two years.
Hortonworks provides an open source Apache Hadoop data platform for managing large volumes of data. It was founded in 2011 and went public in 2014. Hortonworks has over 800 employees across 17 countries and partners with over 1,350 technology companies. Hortonworks' Data Platform is a collection of Apache projects that provides data management, access, governance, integration, operations and security capabilities. It supports batch, interactive and real-time processing on a shared infrastructure using the YARN resource management system.
The document summarizes research done at the Barcelona Supercomputing Center on evaluating Hadoop platforms as a service (PaaS) compared to infrastructure as a service (IaaS). Key findings include:
- Provider (Azure HDInsight, Rackspace CBD, etc.) did not significantly impact performance of wordcount and terasort benchmarks.
- Data size and number of datanodes were more important factors, with diminishing returns on performance from adding more nodes.
- PaaS can save on maintenance costs compared to IaaS but may be more expensive depending on workload and VM size needed. Tuning may still be required with PaaS.
Zurich Insurance is implementing a data lake to help address key trends in the insurance industry like digital transformation, emerging risks, and regulatory changes. The data lake will provide capabilities needed to store both structured and unstructured data at low cost, create business views on demand, support different workloads, enable rapid changes, and make data, analytics, and apps seamless. Zurich's conceptual architecture places all raw data into a single store with history and provides curation layers to build line of business and group level views for consumption.
Effective data governance is imperative to the success of Data Lake initiatives. Without governance policies and processes, information discovery and analysis is severely impaired. In this session we will provide an in-depth look into the Data Governance Initiative launched collaboratively between Hortonworks and partners from across industries. We will cover the objectives of Data Governance Initiatives and demonstrate key governance capabilities of the Hortonworks Data Platform.
The document discusses using natural language processing (NLP) techniques like word2vec to analyze structured clinical data. Clinical encounters can be treated as "sentences" with vitals, labs, procedures, diagnoses, and prescriptions as "words". The author ingested clinical records into such "sentences" and uses Spark's word2vec implementation on Hadoop to explore relationships between clinical concepts. The approach is demonstrated on a dataset from a Kaggle diabetes prediction competition, with questions taken afterwards.
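A minimal PySpark sketch of the idea, treating each encounter as a bag of clinical "words"; the toy codes and parameters are illustrative only, not the author's actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("clinical-word2vec").getOrCreate()

# Each row is one encounter; the tokens are made-up vitals/labs/diagnosis/drug codes.
encounters = spark.createDataFrame(
    [(["HBA1C_HIGH", "METFORMIN", "E11.9"],),
     (["BP_HIGH", "LISINOPRIL", "I10"],),
     (["HBA1C_HIGH", "INSULIN", "E11.9"],)],
    ["codes"])

# Real data would use a much larger corpus and minCount.
w2v = Word2Vec(vectorSize=50, minCount=1, inputCol="codes", outputCol="vector")
model = w2v.fit(encounters)

# Which clinical concepts land near metformin in the embedding space?
model.findSynonyms("METFORMIN", 3).show()
```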
Apache Atlas provides centralized metadata services and cross-component dataset lineage tracking for Hadoop components. It aims to enable transparent, reproducible, auditable and consistent data governance across structured, unstructured, and traditional database systems. The near term roadmap includes dynamic access policy driven by metadata and enhanced Hive integration. Apache Atlas also pursues metadata exchange with non-Hadoop systems and third party vendors through REST APIs and custom reporters.
Apache Atlas provides metadata services and a centralized metadata repository for Hadoop platforms. It aims to enable data governance across structured and unstructured data through hierarchical taxonomies. Upcoming features include expanded dataset lineage tracking and integration with Apache Kafka and Ranger for dynamic access policy management. Challenges of big data management include scaling traditional tools to handle large volumes of entities and metadata, and Atlas addresses this through its decentralized and metadata-driven approach.
Hellmar Becker, a DevOps engineer, presented on securing Hadoop in an enterprise context at a summit in Dublin on April 14, 2016. The challenges of securing Hadoop include its lack of security by default and the risks of data loss, privacy breaches, and system intrusions. ING uses Hadoop for data storage, advanced analytics, real-time processing, and reporting. To secure Hadoop, ING implemented perimeter security, integrated Hadoop with its Active Directory for authentication and authorization using Ranger and Kerberos, and developed custom scripts to sync user groups efficiently, working around Ranger's limitations. Further improvements could include integrating OS and Hadoop security and using Identity and Policy Authentication for a centralized user database.
This document discusses Azure HDInsight and how it provides a managed Hadoop as a service on Microsoft's cloud platform. Key points include:
- Azure HDInsight runs Apache Hadoop and related projects like Hive and Pig in a cloud-based cluster that can be set up in minutes without hardware to deploy or maintain.
- It supports running queries and analytics jobs on data stored locally in HDFS or in Azure cloud storage like Blob storage and Data Lake Store.
- An IDC study found that Microsoft customers using cloud-based Hadoop through Azure HDInsight have 63% lower total cost of ownership than an on-premises Hadoop deployment.
ING Bank has developed a data lake architecture to centralize and govern all of its data. The data lake will serve as the "memory" of the bank, holding all data relevant for reporting, analytics, and data exchanges. ING formed an international data community to collaborate on Hadoop implementations and identify common patterns for file storage, deep data analytics, and real-time usage. Key challenges included the complexity of Hadoop, difficulty of large-scale collaboration, and ensuring analytic data received proper security protections. Future steps include standardizing building blocks, defining analytical model production, and embedding analytics in governance for privacy compliance.
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
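As a rough sketch of the "Kafka as ingestion hub plus Spark Streaming" pattern described above, assuming the spark-streaming-kafka connector is on the classpath and a broker at broker1:9092 (both placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-event-counts")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

# Direct stream from a Kafka topic; messages arrive as (key, value) pairs.
events = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

# Simple per-batch word count over the message values.
counts = (events.map(lambda kv: kv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```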
HPE provides optimized server architectures for Hadoop including the Apollo 4200 server which offers high storage density. HPE also offers a reference architecture for Hadoop that separates compute and storage resources for better performance, using optimized servers like Moonshot for processing and Apollo for storage. Additionally, HPE contributes to Apache Spark through HP Labs to improve efficiency and scale of memory and performance.
Apache Atlas: Data Governance for Hadoop. Strata London 2015 (Sean Roberts)
Apache Hadoop is being adopted across all industries for its ability to store and process an abundance of new types of data in a modern data architecture. But this "Any Data" architecture presents a challenge: organizations must reconcile data management realities as they bring existing and new data from disparate platforms under management.
Apache Atlas proposes to provide governance capabilities in Hadoop that use both prescriptive and forensic models enriched by business taxonomical metadata. It is designed to exchange metadata with other tools and processes within and outside of the Hadoop stack, thereby enabling platform-agnostic governance.
Open Data Fueling Innovation - Kristen Honey (scoopnewsgroup)
The document discusses the United States' leadership in open government and open data initiatives. It provides details on programs like the Open Government Initiative, Open Government Partnership, and open data policies. It then highlights the impact of open data across various federal agencies and programs, including examples in international development, finance, agriculture, education, health, precision medicine, and policing. Open data is fueling innovation and improved government services.
Hadoop World 2011: Mike Olson Keynote Presentation (Cloudera, Inc.)
Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform due to the innovative work of a vibrant developer community and on the rapid adoption of the platform among large enterprises. He will highlight how enterprises have transformed themselves into data-driven organizations, highlighting compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles for the key constituents across the Hadoop community, ecosystem and enterprises.
The IDEA Lab offers tools and programs to promote innovation across the Department of Health and Human Services. It provides accelerators and funding for new ideas, as well as training and support for entrepreneurs and innovators. The Health Data Initiative makes over 2,100 public health datasets available to spur innovative applications and improve health outcomes.
The document summarizes the past, present, and future of Hadoop at LinkedIn. It describes how LinkedIn initially implemented PYMK on Oracle in 2006, then moved to Hadoop in 2008 with 20 nodes, scaling up to over 10,000 nodes and 1000 users by 2016 running various big data frameworks. It discusses the challenges of scaling hardware and processes, and how LinkedIn developed tools like HDFS Dynamometer, Dr. Elephant, Byte-Ray and SoakCycle to help with scaling, performance tuning, dependency management and integration testing of Hadoop clusters. The future may include the Dali project to make data more accessible through different views.
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa... (Yahoo Developer Network)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads. This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, that fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.
Speakers:
David Alves, Software Engineer at Cloudera working on the Kudu team and a PhD student at UT Austin. David is a committer at the Apache Software Foundation and has contributed to several open source projects, including Apache Cassandra and Apache Drill.
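To make the "fast scans plus fast random access behind a single API" claim concrete, here is a hedged PySpark sketch of reading a Kudu table through the kudu-spark connector; the master address and table name are placeholders, and the connector package is assumed to be available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-scan").getOrCreate()

# Read a Kudu table as a DataFrame via the kudu-spark data source.
events = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master:7051")        # placeholder master
          .option("kudu.table", "impala::default.events")   # placeholder table
          .load())

# Analytic scan over the columnar storage; individual rows remain
# updatable through the Kudu API at the same time.
events.groupBy("event_type").count().show()
```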
The document discusses LLAP (Live Long and Process), a new execution engine in Apache Hive 2.0 that enables sub-second analytical queries. LLAP keeps a small subset of frequently accessed data in memory to enable faster query processing times compared to traditional Hive architectures that rely on disk access. It works by running Hive query fragments simultaneously in both YARN containers and long-running daemon processes that cache data in memory. This allows for highly concurrent query execution without specialized YARN configurations. The document provides details on how LLAP is implemented and evaluates its performance benefits based on benchmarks and customer case studies.
This document provides an agenda and overview of a presentation by Data Transformed on big data analytics using the KNIME platform. The presentation includes an introduction of Data Transformed and KNIME, a live demonstration of forecasting energy usage from customer data using KNIME and Hadoop tools, and a question and answer session. It promotes Data Transformed's services around data management, analytics, and consulting and highlights KNIME's capabilities for comprehensive data processing.
The document discusses Apache Atlas, which is a data governance solution for Hadoop. It provides a centralized business catalog to organize data assets along business terms. This improves data governance and compliance and enables faster discovery of data. The business catalog provides a common taxonomy and supports features like tagging, lineage tracking, and dynamic access control integrated with Ranger. It aims to reduce the time analysts spend searching for data from 50-80% to less than 25%.
Apache Atlas provides data governance capabilities for Hadoop including data classification, metadata management, and data lineage/provenance. It models metadata using a flexible type system and stores metadata in a property graph database for relationships and lineage queries. Key features include cross-component lineage mapping, reusable tagging policies for access control, and a business catalog to organize assets by common business terms.
The Atlas/ Ranger integration represents a paradigm shift for big data governance and security. Enterprises can now implement dynamic classification-based security policies, in addition to role-based security. Ranger’s centralized platform empowers data administrators to define security policy based on Atlas metadata tags or attributes and apply this policy in real-time to the entire hierarchy of data assets including databases, tables and columns.
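As an illustrative sketch only, the snippet below creates a tag-based Ranger policy through the public REST API so that members of an "analysts" group can SELECT from any Hive asset carrying the PII tag; the host, service name, group, and field layout are assumptions that should be verified against your Ranger version rather than taken as the exact schema.

```python
import requests

RANGER = "https://ranger-host:6182"   # placeholder Ranger admin endpoint
AUTH = ("admin", "admin")             # placeholder credentials

policy = {
    "service": "cl1_tag",             # the tag-type service linked to the Hive repo
    "name": "PII-analyst-select",
    "isEnabled": True,
    "resources": {"tag": {"values": ["PII"], "isExcludes": False}},
    "policyItems": [{
        "groups": ["analysts"],
        "accesses": [{"type": "hive:select", "isAllowed": True}],
        "delegateAdmin": False,
    }],
}

# One policy per tag: it follows the PII classification wherever Atlas applies it.
requests.post(f"{RANGER}/service/public/v2/api/policy",
              json=policy, auth=AUTH, verify=False)
```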
The document discusses extending data governance in Hadoop ecosystems using Apache Atlas and partner solutions including Waterline Data, Attivio, and Trifacta. It highlights how these vendors have embraced Apache's open source community commitment and are integrating their products with Atlas, creating a rich, innovative ecosystem around a common metadata store backed by Atlas. The session will showcase how these three vendors extend governance capabilities by integrating their products with Atlas.
As organizations pursue Big Data initiatives to capture new opportunities for data-driven insights, data governance has become table stakes, both for external regulatory compliance and for business value extraction within the enterprise. This session will introduce Apache Atlas, a project incubated by Hortonworks along with a group of industry leaders across several verticals, including financial services, healthcare, pharma, oil and gas, retail and insurance, to help address data governance and metadata needs with an open, extensible platform governed under the aegis of the Apache Software Foundation. Apache Atlas empowers organizations to harvest metadata across the data ecosystem and to govern and curate data lakes by applying consistent data classification with a centralized metadata catalog.
In this talk, we will present the underpinnings of the architecture of Apache Atlas and conclude with a tour of governance capabilities within Apache Atlas, showcasing various features for open metadata modeling, data classification, and visualizing cross-component lineage and impact. We will also demo how Apache Atlas delivers a complete view of data movement across several analytic engines, such as Apache Hive, Apache Storm and Apache Kafka, and its capabilities to effectively classify and discover datasets.
The document discusses new security and governance capabilities in Hortonworks Data Platform (HDP) provided by Apache Atlas and Ranger. Apache Atlas provides data governance by capturing metadata and enabling users to define tags, classifications, and policies. Ranger integrates with Atlas to enable dynamic, tag-based access policies. Together, Atlas and Ranger provide deep visibility into the security administration process, fine-grained security definition, and centralized management of security policies across HDP components.
The integration of Ranger and Atlas is a fundamental shift in how access to assets within the Hadoop ecosystem is provisioned. It allows those who understand the content and classification of data to assign proper permissions based on data-specific attributes, rather than the current location- and user-based model. Furthermore, it provides a clear separation of duties and ensures that responsibility for maintaining data access security remains with the most appropriate teams, i.e. those who know the data best.
Moreover, data classification changes in Atlas trigger corresponding changes to the appropriate Ranger authorization rules. This provides an agile approach to authorization and reduces the workload and stress on operational teams, allowing for faster and more accurate delivery.
With the ongoing evolution and maturation of the Hadoop ecosystem’s tools and services, data-driven authorization will scale in parallel. Essentially, it collapses the many policies defined across multiple services into a single policy per tag (data classification) that spans services. It takes careful planning and architecture to unlock these features in Atlas and Ranger.
The presentation will be a tutorial on how to:
• Structure user groups
• Add custom fields, entities, and tags to Atlas
• Inherit and chain Atlas entities and tags
• Configure Ranger to sync Atlas tags and assign permissions based on those tags
• Assign conditional permissions based on Atlas tags’ properties
• Integrate Atlas into your ingestion framework to auto-assign metadata to your data
• Walk through the full flow of creating an entity, adding custom fields, adding tags, and configuring policies in Ranger to utilize the tag
The tutorial will also highlight key features in Atlas and different integration points within the Hadoop ecosystem. At the end of the tutorial, attendees should gain functional knowledge on how to authorize assets based on their metadata.
Speaker
Amer Issa, Senior Platform and Security Architect, Hortonworks
Hortonworks Oracle Big Data Integration (Hortonworks)
Slides from joint Hortonworks and Oracle webinar on November 11, 2014. Covers the Modern Data Architecture with Apache Hadoop and Oracle Data Integration products.
The document discusses Apache Atlas, an open source project aimed at solving data governance challenges in Hadoop. It proposes Atlas to provide capabilities like data classification, metadata exchange, centralized auditing, search and lineage tracking, and security policies. The architecture would involve a type system to define metadata, a graph database to store metadata, and search and lineage functionality. A governance certification program is also proposed to ensure partner solutions integrate well with Atlas and Hadoop.
The document provides an overview of Apache Atlas, a metadata management and governance solution for Hadoop data lakes. It discusses Atlas' architecture, which uses a graph database to store types and instances. Atlas also includes search capabilities and integration with Hadoop components like Hive to capture lineage metadata. The remainder of the document outlines Atlas' roadmap, with goals of adding additional component connectors, a governance certification program, and generally moving towards a production release.
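A small Python sketch of how those search and lineage capabilities are typically exercised through Atlas' REST API, using basic search to find entities tagged PII and then pulling their lineage graphs; the endpoint, credentials, and tag name are placeholders.

```python
import requests

ATLAS = "https://atlas-host:21000"   # placeholder Atlas endpoint
AUTH = ("admin", "admin")            # placeholder credentials

# Basic search: all Hive tables carrying the PII classification.
hits = requests.get(
    f"{ATLAS}/api/atlas/v2/search/basic",
    params={"typeName": "hive_table", "classification": "PII", "limit": 25},
    auth=AUTH, verify=False).json().get("entities", [])

# Walk the stored lineage graph for each hit.
for entity in hits:
    lineage = requests.get(
        f"{ATLAS}/api/atlas/v2/lineage/{entity['guid']}",
        params={"direction": "BOTH", "depth": 3},
        auth=AUTH, verify=False).json()
    print(entity.get("displayText", entity["guid"]),
          "->", len(lineage.get("relations", [])), "lineage edges")
```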
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks... (Hortonworks)
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, and Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax, XStream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
This document discusses Apache Ranger and Apache Atlas for security and governance in Hadoop. It provides an overview of Ranger's centralized authorization and auditing capabilities for Hadoop components using policies. It also describes Atlas' capabilities for metadata management, data lineage, classification using tags, and integrations with Ranger for classification-based security. The document concludes with a demo and Q&A section.
Hortonworks Hybrid Cloud - Putting you back in control of your dataScott Clinton
The document discusses Hortonworks' solutions for managing data across hybrid cloud environments. It proposes getting all data under management, combating growing cloud data silos, and consistently securing and governing data across locations. Hortonworks offers the Hortonworks Data Platform, Hortonworks Dataflow, and Hortonworks DataPlane to provide a modern hybrid data architecture with cloud-native capabilities, security and governance, and the ability to extend to edge locations. The document also highlights Hortonworks' professional services and open source community initiatives around hybrid cloud data.
Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big DataMats Johansson
This document provides an overview of Hortonworks DataFlow, which is powered by Apache NiFi. It discusses how the growth of IoT data is outpacing our ability to consume it and how NiFi addresses the new requirements around collecting, securing and analyzing data in motion. Key features of NiFi are highlighted such as guaranteed delivery, data provenance, and its ability to securely manage bidirectional data flows in real-time. Common use cases like predictive analytics, compliance and IoT optimization are also summarized.
Apache Hadoop Summit 2016: The Future of Apache Hadoop an Enterprise Architec...PwC
Hadoop Summit is an industry-leading Hadoop community event for business leaders and technology experts (such as architects, data scientists and Hadoop developers) to learn about the technologies and business drivers transforming data. PwC is helping organizations unlock their data possibilities to make data-driven decisions.
Learn more: http://hortonworks.com/hdf/
Log data can be complex to capture, typically collected in limited amounts and difficult to operationalize at scale. HDF expands the capabilities of log analytics integration options for easy and secure edge analytics of log files in the following ways:
More efficient collection and movement of log data by prioritizing, enriching and/or transforming data at the edge to dynamically separate critical data. The relevant data is then delivered into log analytics systems in a real-time, prioritized and secure manner.
Cost-effective expansion of existing log analytics infrastructure by improving error detection and troubleshooting through more comprehensive data sets.
Intelligent edge analytics to support real-time content-based routing, prioritization, and simultaneous delivery of data into Connected Data Platforms, log analytics and reporting systems for comprehensive coverage and retention of Internet of Anything data.
The document outlines a presentation about enterprise data science at scale. The agenda includes networking, announcements, a main presentation on introducing data science at scale, building and deploying models collaboratively, training models with all of the data, and putting models to work in streaming applications, followed by Q&A. The main presentation discusses common data science challenges, such as data spread across multiple locations, too many tools, difficulty sharing insights and operationalizing models, and the limitations of desktop environments. It introduces Apache Spark as a distributed processing platform, Jupyter and Zeppelin notebooks, and deploying models as a virtual service. A demo uses customer churn data to train a random forest model to predict churn and deploys it to production to deliver insights.
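The churn demo is only described above, not shown; the following is a hedged sketch of what such a model might look like with the PySpark ML API, where the input path, schema, and column names are invented for illustration.

    # Hypothetical sketch: random forest churn model with Spark ML.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("churn-demo").getOrCreate()
    df = spark.read.csv("/data/churn.csv", header=True, inferSchema=True)  # placeholder path

    label = StringIndexer(inputCol="churned", outputCol="label")           # placeholder column
    features = VectorAssembler(
        inputCols=["tenure_months", "monthly_charges", "support_calls"],   # placeholder columns
        outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[label, features, rf]).fit(train)
    model.transform(test).select("label", "prediction").show(5)

    # The fitted pipeline can then be saved and served behind a REST endpoint,
    # which corresponds to the "deploy as a virtual service" step in the agenda.
    model.write().overwrite().save("/models/churn-rf")                     # placeholder path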
Balancing data democratization with comprehensive information governance: bui...DataWorks Summit
If information is the new oil, then governance is its “safety data sheet.” As demand for data as the raw material for competitive differentiation continues to rise, enterprises face growing challenges in identifying and valuing data and ensuring its appropriate use to extract the right information. To make effective business decisions, organizations need to trust their data so that they can impute the right value and use it for the right purposes while satisfying any organizational or regulatory mandates. A number of analytics and data science initiatives fail to reach their potential because no information governance framework is in place. Robust information governance capabilities can help organizations develop trust in their data and empower them to make decisions confidently.
In this session Sanjeev Mohan, Research Analyst at Gartner, and Srikanth Venkat, Sr. Director of Product Management at Hortonworks, will walk you through an end-to-end architectural blueprint for information governance and best practices for helping organizations understand, secure, and govern diverse types of data in enterprise data lakes.
Speaker
Sanjeev Mohan, Gartner, Research Analyst
Srikanth Venkat, Hortonworks, Senior Director, Product Management
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
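To ground that list, here is a minimal sketch of the kind of settings involved, assuming Spark on YARN in a Kerberized HDP-era cluster; the exact property names vary slightly across Spark versions, and the principal, keytab path, and host values are placeholders.

    # Hypothetical sketch: enabling authentication and wire/disk encryption
    # for a PySpark job. Authorization (which tables a user may read) is
    # enforced by Ranger/Sentry on the underlying services, not set here.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.authenticate", "true")               # SASL auth between Spark processes
            .set("spark.network.crypto.enabled", "true")     # encrypt RPC traffic (Spark 2.2+)
            .set("spark.io.encryption.enabled", "true")      # encrypt shuffle/spill files
            .set("spark.yarn.principal", "analyst@EXAMPLE.COM")                  # placeholder
            .set("spark.yarn.keytab", "/etc/security/keytabs/analyst.keytab"))   # placeholder

    spark = (SparkSession.builder
             .appName("secured-job")
             .config(conf=conf)
             .getOrCreate())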
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year, with the goal of an open, collaborative ecosystem around shared metadata.
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
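As a small, hedged example of such a pipeline (the sample documents and the choice of logistic regression are illustrative, not taken from the deck):

    # Hypothetical sketch: Spark ML text pipeline (tokenize -> TF-IDF -> classify).
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("text-mining").getOrCreate()
    docs = spark.createDataFrame(
        [("spark makes large scale text mining practical", 1.0),
         ("completely unrelated note about lunch plans", 0.0)],
        ["text", "label"])                                   # toy example data

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
        IDF(inputCol="tf", outputCol="features"),
        LogisticRegression(maxIter=10),
    ])
    model = pipeline.fit(docs)
    model.transform(docs).select("text", "prediction").show(truncate=False)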
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations today process various types of data in different formats. Most often this data arrives in free form; as the number of consumers of this data grows, it becomes imperative that this free-flowing data adhere to a schema. A schema gives data consumers a clear expectation of the type of data they are getting and shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being affected when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and so on.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
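The summary above gives no API details, so the sketch below uses a Confluent-compatible registry REST interface as a stand-in to show the register-then-evolve flow; the host, subject name, and Avro schema are invented, and the Hortonworks Schema Registry exposes its own, similar endpoints.

    # Illustrative only: register an Avro schema, then add a backward-compatible
    # new version so existing consumers keep working.
    import json
    import requests

    REGISTRY = "http://schema-registry:8081"    # placeholder host/port
    subject = "truck-events-value"              # hypothetical subject name

    v1 = {"type": "record", "name": "TruckEvent",
          "fields": [{"name": "truck_id", "type": "string"},
                     {"name": "speed", "type": "int"}]}
    requests.post(f"{REGISTRY}/subjects/{subject}/versions",
                  json={"schema": json.dumps(v1)})

    # Evolution: a new optional field with a default keeps the change
    # backward compatible, so consumers on v1 are not impacted.
    v2 = dict(v1, fields=v1["fields"] + [
        {"name": "geo_hash", "type": ["null", "string"], "default": None}])
    resp = requests.post(f"{REGISTRY}/subjects/{subject}/versions",
                         json={"schema": json.dumps(v2)})
    print("registered schema id:", resp.json())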
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but with massive amounts of data, training a new model can take hours. That is a problem when the model needs to stay up to date: when recommending TV programs while they are being transmitted, for example, the model should take into account the users watching a program at that moment.
The promise of online recommendation systems is fast adaptation to change, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used to detect anomalies in IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines such as Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current deep learning research treats step-motherly. As the demo shows, LSTM networks can learn very complex system behavior; in this case the data comes from a physical model simulating bearing vibration. One drawback of deep learning is that a very large labeled training data set is normally required. This is particularly interesting because we show how unsupervised machine learning can be used in conjunction with deep learning, so no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source; only open-source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be built around actual use cases that cut across components. The system tests can generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives relative to actual defects is higher, and chasing them is generally wasteful.
At Hortonworks, we have designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into a recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
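One of the points above, intelligent key design, is easiest to see in code. Below is a hedged sketch of a salted, reverse-timestamp row key for time-series writes, using the happybase client against an HBase Thrift gateway; the table, column family, and host names are placeholders.

    # Hypothetical sketch: salt the row key so sequential timestamps do not
    # all land on one region (hot-spotting), and reverse the timestamp so the
    # newest readings for a sensor sort first.
    import hashlib
    import time
    import happybase

    SALT_BUCKETS = 16
    LONG_MAX = (1 << 63) - 1

    def salted_key(sensor_id: str, ts_millis: int) -> bytes:
        # deterministic salt prefix derived from the sensor id
        bucket = int(hashlib.md5(sensor_id.encode()).hexdigest(), 16) % SALT_BUCKETS
        reverse_ts = LONG_MAX - ts_millis
        return f"{bucket:02d}|{sensor_id}|{reverse_ts}".encode()

    conn = happybase.Connection("hbase-thrift-host")     # placeholder host
    table = conn.table("sensor_readings")                # placeholder table
    table.put(salted_key("sensor-42", int(time.time() * 1000)),
              {b"d:value": b"21.7"})                     # placeholder column family "d"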
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable a 2X throughput increase for the Capacity Scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently supports only MySQL Cluster as a backend. Hops opens up new directions for Hadoop once metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the single-use-case or PoC phase, where data governance, as far as backup and disaster recovery (BDR) is concerned, is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
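To make the snapshot caveats above tangible, here is a minimal sketch of the built-in HDFS snapshot workflow driven from Python via the standard hdfs/hadoop CLIs; the directory, cluster, and snapshot names are placeholders, and as the talk stresses, a snapshot alone is not a complete backup strategy.

    # Hypothetical sketch: take a point-in-time HDFS snapshot and copy it
    # off-cluster with distcp for disaster recovery.
    import subprocess
    from datetime import datetime

    data_dir = "/data/warehouse"                                   # placeholder path
    snap_name = "daily-" + datetime.now().strftime("%Y%m%d")

    # One-time (admin): mark the directory as snapshottable.
    subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", data_dir], check=True)

    # Recurring: create the named snapshot. Note the caveat above: files that
    # are still open for write are not frozen by this operation.
    subprocess.run(["hdfs", "dfs", "-createSnapshot", data_dir, snap_name], check=True)

    # Copy the snapshot to a DR cluster so it survives a datacenter failure.
    subprocess.run(["hadoop", "distcp",
                    f"hdfs://prod-nn:8020{data_dir}/.snapshot/{snap_name}",   # placeholder
                    f"hdfs://dr-nn:8020/backups/{snap_name}"], check=True)    # placeholder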
#8: TALK TRACK
Open Enterprise Hadoop enables trusted governance, with:
Data lifecycle management along the entire lifecycle
Modeling with metadata, and
Interoperable solutions that can access a common metadata store.
[NEXT SLIDE]
SUPPORTING DETAIL
Trusted Governance
Why this matters to our customers: As data accumulates in an HDP cluster, the enterprise needs governance policies to control how that data is ingested, transformed and eventually retired. This keeps those Big Data assets from turning into big liabilities that you can’t control.
Proof point: HDP includes 100% open source Apache Atlas and Apache Falcon for centralized data governance coordinated by YARN. These data governance engines provide those mature data management and metadata modeling capabilities, and they are constantly strengthened by members of the Data Governance Initiative. The Data Governance Initiative (DGI) is working to develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. The DGI coalition includes Hortonworks partner SAS and customers Merck, Target, Aetna and Schlumberger. Together, we assure that Hadoop:
Snaps into existing frameworks to openly exchange metadata
Addresses enterprise data governance requirements within its own stack of technologies
Citation: “As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard.” | http://hortonworks.com/press-releases/hortonworks-establishes-data-governance-initiative/
#12: Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance initiative and others from the Hadoop community. The core functionality defined by the project includes the following:
Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop
Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
Security and Policy Engine – implement engines to protect and rationalize data access according to compliance policy
#13: Show – clearly identify customer metadata. Change:
Add a customer classification example – Aetna – to give the use case story continuity. Use DX procedures for diagnosis
** bring metadata from external systems into Hadoop – keep it together
#15: Show – clearly identify customer metadata. Change:
Add a customer classification example – Aetna – to give the use case story continuity. Use DX procedures for diagnosis
** bring metadata from external systems into Hadoop – keep it together
#18: - Learn about who our users are and what their needs are, to validate that we are solving the right problem
Open-ended half-hour discussions about processes, challenges and current tools
We record the interviews so that we can focus on the conversation and analyze them afterward
#19: - Test our prototype in InVision, a click-through prototyping tool
- Walk users through scenarios and watch how they respond
- Remind our participants that we aren’t testing them, we’re testing the design and encourage thinking aloud
#20: - Re-watch recordings and capture verbatim quotes on sticky notes
- Affinity mapping
- Group feedback into categories and look for trends and insights
- For this project we translated our sticky notes into Trello to share with the team remotely. We starred the stickies that represented common themes and valuable insights.
#21: Was the product well understood?
Is the product something they would use?
Where is the value?
#38: Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance initiative and others from the Hadoop community. The core functionality defined by the project includes the following:
Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop
Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
Security and Policy Engine – implement engines to protect and rationalize data access according to compliance policy