Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 - Seetharam Venkatesh
Apache Falcon is a data management platform that allows users to centrally manage data lifecycles across Hadoop clusters. It defines data entities like clusters, feeds, and processes to represent data pipelines. Falcon then automatically generates workflows to orchestrate the movement of data according to defined policies for replication, retention, and late data handling. It also provides data governance features like lineage tracing, auditing, and tagging. The latest version of Falcon includes new capabilities for disaster recovery mirroring and replication to cloud storage services.
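For a concrete sense of how Falcon expresses these policies, the sketch below (Python, using the requests library) submits a hypothetical feed entity with a retention policy on a source cluster and a replication target. The host name, cluster names, paths, and user are assumptions, and the XML follows the general shape of a Falcon feed definition rather than any specific deployment.

```python
import requests

# Hypothetical Falcon feed: hourly click data kept 90 days on the source cluster
# and replicated to a backup cluster with a 12-month retention window.
feed_xml = """
<feed name="rawClicksFeed" description="hourly click data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <late-arrival cut-off="hours(6)"/>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-04-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="backupCluster" type="target">
      <validity start="2015-04-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(12)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
"""

# Submit the entity to the Falcon server; Falcon then generates the workflows
# that enforce replication, retention, and late-data handling.
resp = requests.post(
    "http://falcon.example.com:15000/api/entities/submit/feed",
    params={"user.name": "etl"},
    headers={"Content-Type": "text/xml"},
    data=feed_xml,
)
print(resp.status_code, resp.text)
```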
This document provides an overview of Apache Atlas and how it addresses big data governance issues for enterprises. It discusses how Atlas provides a centralized metadata repository that allows users to understand data across Hadoop components. It also describes how Atlas integrates with Apache Ranger to enable dynamic security policies based on metadata tags. Finally, it outlines new capabilities in upcoming Atlas releases, including cross-component data lineage tracking and a business taxonomy/catalog.
Implementing a Data Lake with Enterprise Grade Data Governance - Hortonworks
Hadoop provides a powerful platform for data science and analytics, where data engineers and data scientists can leverage myriad data from external and internal sources to uncover new insights. This power also presents new challenges: on the one hand, the business wants more and more self-service, and on the other hand IT is trying to keep up with the demand for data while maintaining architecture and data governance standards.
In this webinar, Andrew Ahn, Data Governance Initiative Product Manager at Hortonworks, will address the gaps and offer best practices in providing end-to-end data governance in HDP. Andrew Ahn will be followed by Oliver Claude of Waterline Data, who will share a case study of how Waterline Data Inventory works with HDP in the Modern Data Architecture to automate the discovery of business and compliance metadata, data lineage, as well as data quality metrics.
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop - Hortonworks
With the introduction of YARN, Hadoop has emerged as a first-class citizen in the data center, since a single Hadoop cluster can now power multiple applications and hold more data. This advance has also put a spotlight on the need for a more comprehensive approach to Hadoop security.
Hortonworks recently acquired Hadoop security company XA Secure to provide a common interface for central administration of security policy and coordinated enforcement across authentication, authorization, audit and data protection for the entire Hadoop stack.
In this presentation, Balaji Ganesan and Bosco Durai (previously with XA Secure, now with Hortonworks) introduce HDP Advanced Security, review a comprehensive set of Hadoop security requirements and demonstrate how HDP Advanced Security addresses them.
Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3 - Hortonworks
The document discusses using Hortonworks Data Platform (HDP) and Red Hat JBoss Data Virtualization to create a data lake solution and virtual data marts. It describes how a data lake enables storing all types of data in a single repository and accessing it through tools. Virtual data marts allow lines of business to access relevant data through self-service interfaces while maintaining governance and security over the central data lake. The presentation includes demonstrations of virtual data marts integrating data from Hadoop and other sources.
Apache Atlas provides data governance capabilities for Hadoop including data classification, metadata management, and data lineage/provenance. It models metadata using a flexible type system and stores metadata in a property graph database for relationships and lineage queries. Key features include cross-component lineage mapping, reusable tagging policies for access control, and a business catalog to organize assets by common business terms.
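As an illustration of how that metadata can be consumed programmatically, the sketch below (Python, requests) searches for Hive tables carrying a hypothetical PII classification and then pulls lineage for the first hit. The endpoints follow Atlas's v2 REST API as found in later releases; the host, credentials, and tag name are assumptions.

```python
import requests

ATLAS = "http://atlas.example.com:21000"
AUTH = ("admin", "admin")  # assumed credentials for a demo instance

# Basic search: Hive tables tagged with the (hypothetical) PII classification.
search = requests.get(
    f"{ATLAS}/api/atlas/v2/search/basic",
    params={"typeName": "hive_table", "classification": "PII"},
    auth=AUTH,
).json()

for entity in search.get("entities", []):
    print(entity["guid"], entity["attributes"].get("qualifiedName"))

# Lineage for the first matching table: upstream and downstream processes in the graph.
if search.get("entities"):
    guid = search["entities"][0]["guid"]
    lineage = requests.get(
        f"{ATLAS}/api/atlas/v2/lineage/{guid}",
        params={"direction": "BOTH", "depth": 3},
        auth=AUTH,
    ).json()
    print(lineage.get("relations"))
```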
The document discusses extending data governance in Hadoop ecosystems using Apache Atlas and partner solutions including Waterline Data, Attivio, and Trifacta. It highlights how these vendors have embraced Apache's open source community model and are integrating their products with Atlas, creating a rich, innovative ecosystem around a common metadata store backed by Atlas. The session showcases how these three vendors extend governance capabilities through their Atlas integrations.
Effective data governance is imperative to the success of data lake initiatives. Without governance policies and processes, information discovery and analysis are severely impaired. In this session we will provide an in-depth look at the Data Governance Initiative launched collaboratively between Hortonworks and partners from across industries. We will cover the objectives of the initiative and demonstrate key governance capabilities of the Hortonworks Data Platform.
This webinar series covers Apache Kafka and Apache Storm for streaming data processing. Also, it discusses new streaming innovations for Kafka and Storm included in HDP 2.2
The document provides an overview of Apache Atlas, a metadata management and governance solution for Hadoop data lakes. It discusses Atlas' architecture, which uses a graph database to store types and instances. Atlas also includes search capabilities and integration with Hadoop components like Hive to capture lineage metadata. The remainder of the document outlines Atlas' roadmap, with goals of adding additional component connectors, a governance certification program, and generally moving towards a production release.
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance - Hortonworks
Hortonworks Data Platform 2.2 includes Apache Falcon for Hadoop data governance. In this 30-minute webinar, we discussed why the enterprise needs Falcon for governance, and demonstrated data pipeline construction, policies for data retention and management with Ambari. We also discussed new innovations including: integration of user authentication, data lineage, an improved interface for pipeline management, and the new Falcon capability to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3.
Spark and Hadoop Perfect Together by Arun Murthy - Spark Summit
Spark and Hadoop work perfectly together. Spark is a key tool in Hadoop's toolbox that provides elegant developer APIs and accelerates data science and machine learning. It can process streaming data in real time for applications like web analytics and insurance claims processing. The future of Spark and Hadoop includes innovating the core technologies, providing seamless data access across data platforms, and further accelerating data science tools and libraries.
Hadoop-based data lakes have become increasingly popular within today's modern data architectures for their scalability, ability to handle data variety, and low cost. Many organizations start slowly with their data lake initiatives, but as these grow they run into challenges with data consistency, quality, and security, and lose confidence in their data lake initiatives.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes, its relationship with productivity, and how it helps organizations meet regulatory and compliance requirements. The talk advocates adopting a different mindset for designing and implementing flexible governance mechanisms on Hadoop data lakes.
Don't Let Security Be The 'Elephant in the Room' - Hortonworks
Don't let security be the "elephant in the room" for enterprise big data. As big data now includes sensitive data from various sources, there are hidden risks to simply adopting big data technologies without also implementing proper data protection. While traditional IT security approaches provide some coverage, they also have gaps and do not fully address protecting data across its lifecycle and wherever it may travel. A data-centric security approach that encrypts data at capture can lock down data and keep it protected as it is stored, processed, and shared across systems.
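The "encrypt at capture" idea can be illustrated with a few lines of Python. The sketch below uses the cryptography library's Fernet recipe purely as a stand-in for whatever format-preserving or field-level encryption a given product applies; the field names and values are invented for the example.

```python
from cryptography.fernet import Fernet

# In practice the key would come from a key-management service, not be generated inline.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt sensitive fields as the record is captured, before it lands in storage
# or moves through downstream processing, so the data stays protected end to end.
record = {"customer_id": "C-1029", "ssn": "123-45-6789", "purchase_amount": "42.50"}
protected = dict(record)
protected["ssn"] = fernet.encrypt(record["ssn"].encode()).decode()

print(protected)                                   # safe to store, process, and share
print(fernet.decrypt(protected["ssn"].encode()))   # only key holders can recover the value
```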
Hortonworks and Clarity Solution Group - Hortonworks
Many organizations are leveraging social media to understand consumer sentiment and opinions about brands and products. Analytics in this area, however, is in its infancy and does not always provide a compelling result for effective business impact. Learn how consumer organizations can benefit by integrating social data with enterprise data to drive more profitable consumer relationships. This webinar is presented by Hortonworks and Clarity Solution Group, and will focus on the evolution of Hadoop, the clear advantage of Hortonworks distribution, and business challenges solved by “Consumer720.”
This document discusses strategies for successfully utilizing a data lake. It notes that creating a data lake is just the beginning and that challenges include data governance, metadata management, access, and effective use of the data. The document advocates for data democratization through discovery, accessibility, and usability. It also discusses best practices like self-service BI and automated workload migration from data warehouses to reduce costs and risks. The key is to address the "data lake dilemma" of these challenges to avoid a "data swamp" and slow adoption.
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1 - Hortonworks
As the enterprise's big data program matures and Apache Hadoop becomes more deeply embedded in critical operations, the ability to support and operate it efficiently and reliably becomes increasingly important. To help enterprises operate a modern data architecture at scale, Red Hat and Hortonworks have collaborated to integrate Hortonworks Data Platform with Red Hat's proven platform technologies. Join us in this interactive 3-part webinar series, as we demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data.
Discover HDP 2.1: Apache Solr for Hadoop Search - Hortonworks
This document appears to be a presentation about Apache Solr for Hadoop search using the Hortonworks Data Platform (HDP). The agenda includes an overview of Apache Solr and Hadoop search, a demo of Hadoop search, and a question and answer section. The presentation discusses how Solr provides scalable indexing of data stored in HDFS and powerful search capabilities. It also includes a reference architecture showing how Solr integrates with Hadoop for search and indexing.
10 Amazing Things To Do With a Hadoop-Based Data Lake - VMware Tanzu
Greg Chase, Director of Product Marketing, presents the big data talk "10 Amazing Things to Do With a Hadoop-Based Data Lake" at Strata Conference + Hadoop World 2014 in NYC.
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services... - Hortonworks
This document discusses optimizing a traditional enterprise data warehouse (EDW) architecture with Hortonworks Data Platform (HDP). It provides examples of how HDP can be used to archive cold data, offload expensive ETL processes, and enrich the EDW with new data sources. Specific customer case studies show cost savings ranging from $6-15 million by moving portions of the EDW workload to HDP. The presentation also outlines a solution model and roadmap for implementing an optimized modern data architecture.
Supporting Financial Services with a More Flexible Approach to Big Data - Hortonworks
The document discusses how Hortonworks Data Platform (HDP) enables a modern data architecture with Apache Hadoop. HDP provides a common data set stored in HDFS that can be accessed through various applications for batch, interactive, and real-time processing. This allows organizations to store all their data in one place and access it simultaneously through multiple means. YARN is the architectural center of HDP and enables this modern data architecture. HDP also provides enterprise capabilities like security, governance, and operations to make Hadoop suitable for business use.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ... - Hortonworks
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas - DataWorks Summit
The community for Apache Atlas and Apache Ranger, which are foundational components for security and governance across the Hadoop stack, has spawned a robust partner ecosystem of tools and platforms. These partner solutions build upon the extensibility these platforms offer through open, robust APIs and integration patterns to provide innovative "better-together" capabilities. In this talk, we will showcase how the ecosystem of partners is building value-added capabilities to address GDPR based on the Apache Ranger and Apache Atlas frameworks to complement the Hadoop ecosystem. The talk will include multiple partner demonstrations covering how to identify, map, and classify personal data; harvest and maintain metadata; track and map the movement of data through your enterprise; and enforce appropriate controls to monitor access and usage of personal data to help organizations address GDPR. We will also provide a short overview of the Gov Ready and Sec Ready programs and how partners can benefit from the certification process as part of this program.
Speakers
Ali Bajwa, Principal Solutions Engineer, Hortonworks
Srikanth Venkat, Senior Director Product Management, Hortonworks
The document discusses Apache Atlas, an open source project aimed at solving data governance challenges in Hadoop. It proposes Atlas to provide capabilities like data classification, metadata exchange, centralized auditing, search and lineage tracking, and security policies. The architecture would involve a type system to define metadata, a graph database to store metadata, and search and lineage functionality. A governance certification program is also proposed to ensure partner solutions integrate well with Atlas and Hadoop.
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive... - DataWorks Summit
Emerging regulations such as GDPR and the increasing incidence of data breaches such as the one at Equifax are bringing a firm's handling and processing of sensitive data, such as the personal data of its customers and employees, into focus. Enterprises now need to be able to discover and manage sensitive data usage to answer compliance and regulatory reporting requirements and to prevent reputational damage in the event of a data breach. In this talk, we will outline how, using the foundation of open source technologies such as Apache Ranger, Apache Atlas, and the recently announced Hortonworks DataPlane Service platform components, data stewards, analysts, and data engineers can better understand their sensitive data assets across multiple data lakes at scale. We will demonstrate how enterprises can get a comprehensive 360-degree view of their sensitive data, including where such data is located, who is accessing what data and how frequently, when such data was accessed, deleted, or moved, how the data is protected, and where this data came from. In addition we will show how such data can be discovered and profiled to understand its characteristics. We will also demonstrate organization and classification use cases for such sensitive data to facilitate its curation into collections for various business purposes, and how such collections can be aggregated and summarized to provide a single view of the sensitive data footprint in an enterprise from risk management and audit/compliance/forensics perspectives.
Speakers
Srikanth Venkat, Senior Director, Product Management, Hortonworks
Ashwin Rajeeva, Founder, Vidyash OU
This document discusses Apache Ranger and Apache Atlas for security and governance in Hadoop. It provides an overview of Ranger's centralized authorization and auditing capabilities for Hadoop components using policies. It also describes Atlas' capabilities for metadata management, data lineage, classification using tags, and integrations with Ranger for classification-based security. The document concludes with a demo and Q&A section.
Hortonworks and Voltage Security webinar - Hortonworks
Securing Hadoop data is a hot topic for good reason – no matter where you are in your Hadoop implementation plans, it’s best to define your data security approach now, not later. Hortonworks and Voltage Security are focused on deeply integrating Hadoop with your existing data center technologies and team capabilities. Attend this discussion to learn about a central policy administration framework across security requirements for authentication, authorization, auditing and data protection.
The Atlas/ Ranger integration represents a paradigm shift for big data governance and security. Enterprises can now implement dynamic classification-based security policies, in addition to role-based security. Ranger’s centralized platform empowers data administrators to define security policy based on Atlas metadata tags or attributes and apply this policy in real-time to the entire hierarchy of data assets including databases, tables and columns.
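A tag-based policy of that kind can be created through Ranger's public REST API. The sketch below (Python, requests) is a minimal illustration assuming a tag-based service named "tags", a PII classification coming from Atlas, and default admin credentials; the JSON fields follow the general shape of Ranger's policy model rather than a specific release.

```python
import requests

RANGER = "http://ranger.example.com:6080"
AUTH = ("admin", "admin")  # assumed demo credentials

# Allow only the 'compliance' group to SELECT from any Hive asset tagged PII in Atlas.
# Because the policy is keyed on the tag, it follows the data: newly tagged tables
# and columns are covered automatically, with no per-object policy changes.
policy = {
    "service": "tags",              # assumed name of the Ranger tag-based service
    "name": "pii-select-compliance-only",
    "isEnabled": True,
    "resources": {"tag": {"values": ["PII"], "isExcludes": False}},
    "policyItems": [
        {
            "groups": ["compliance"],
            "accesses": [{"type": "hive:select", "isAllowed": True}],
            "delegateAdmin": False,
        }
    ],
}

resp = requests.post(f"{RANGER}/service/public/v2/api/policy", json=policy, auth=AUTH)
print(resp.status_code, resp.json() if resp.ok else resp.text)
```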
Smarter Analytics: Big Data and Predictive Governance - IBM Danmark
The document discusses how big data, social media, systemic models, and governance can be used for predictive governance. It describes how analyzing large amounts of diverse data from various sources can help anticipate crises and their impacts. By monitoring data in real-time from sensors, images, reports, and meetings, predictive models can be generated to simulate potential outcomes and help decision makers plan accordingly. When combined with data governance practices to validate data quality, these tools allow issues to be addressed proactively before they become larger problems.
Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ... - DATAVERSITY
Big Data is all the rage. Everybody is asking about Big Data, researching Big Data, considering Big Data, some are even doing Big Data. Certainly many people are asking questions about Big Data Governance. We have some answers for them.
This Real-World Data Governance webinar with Bob Seiner will focus on the strength of Big Data Governance as a concept and a practice, and will highlight how Big Data and Data Governance both benefit and hurt each other.
This session will include:
Defining Big Data Governance
Ways to Govern Big Data
Making the Connection for IT and Business People
Determining the Vitality of Big Data Governance
Considerations for Big Data Governance
Big data governance as a corporate governance imperative - Guy Pearce
Poor data governance impacts reputation risk through data breaches, privacy violations, and acting on poor-quality data. Furthermore, there are some important differences between what data governance means for big data and what it means for operational data.
Because poor data governance impacts reputation risk, it has considerable implications for the Board of Directors, for whom reputation risk is the number one risk according to Deloitte (2013).
This presentation, targeting the Board of Directors and the C-Suite and delivered at the National Data Governance and Privacy Congress in Calgary, Canada, presents some reasons why data governance is critical, from the perspective of both the C-Suite and the Board of Directors.
(Also on YouTube at http://youtu.be/QR4KO3Yx0n4)
Originally Published: Jan 21, 2015
The size and complexity of data make it difficult for companies to unlock the true value of their data. IBM Information Integration Governance can improve data quality, protect sensitive data, and reduce cost and risk. Free up your resources and get more out of your data.
The document discusses challenges and opportunities for data governance in the era of big data. It argues that traditional hierarchical models of data governance are insufficient and that a hybrid approach is needed that combines hierarchical control with networked empowerment. Specifically, it recommends (1) focusing on digitalizing trust through social capital, (2) shifting from predictive analytics to lifetime customer value, and (3) establishing Chief Data Officer leadership to oversee a collaborative, hybrid approach.
This document provides an overview of MasterCard's approach to securing big data. It discusses security pillars like perimeter security, access security, visibility security and data security. It also covers infrastructure and data architecture vulnerabilities and recommends steps like implementing role-based access controls, encrypting data, regularly monitoring systems and updating software. The document emphasizes that security is an ongoing process requiring collaboration, training and maturity across people, processes and technologies.
Apache Atlas. Data Governance for Hadoop. Strata London 2015 - Sean Roberts
Apache Hadoop is being adopted across all industries for its ability to store and process an abundance of new types of data in a modern data architecture. But this "Any Data" architecture presents a challenge when organizations must reconcile data management realities as they bring existing and new data from disparate platforms under management.
Apache Atlas proposes to provide governance capabilities in Hadoop that use both prescriptive and forensic models, enriched by business taxonomical metadata. It is designed to exchange metadata with other tools and processes within and outside of the Hadoop stack, thereby enabling platform-agnostic governance.
Introduction to Data Governance
A seminar hosted by Embarcadero Technologies, where Christopher Bradley presented a session on Data Governance covering:
Drivers for Data Governance & Benefits
Data Governance Framework
Organization & Structures
Roles & responsibilities
Policies & Processes
Programme & Implementation
Reporting & Assurance
Building a Data Pipeline from Scratch - Joe Crobak - Hakka Labs
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
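To make the "event framework plus message bus plus serialization" pieces concrete, here is a minimal Python sketch using the kafka-python client; the broker address, topic name, and event fields are assumptions for illustration.

```python
import json
import time

from kafka import KafkaProducer

# Message bus: a Kafka topic acting as the central pipe between event producers
# and the batch (Hadoop) and real-time consumers of a Lambda-style architecture.
producer = KafkaProducer(
    bootstrap_servers=["kafka.example.com:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),  # data serialization
)

# Event framework: capture a user event with a consistent, self-describing shape.
event = {
    "event_type": "page_view",
    "user_id": 42,
    "url": "/pricing",
    "timestamp": time.time(),
}

producer.send("user-events", value=event)
producer.flush()
```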
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop - Hortonworks
Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Apache Falcon co-founder and committer, lead this 30-minute webinar, including:
+ Why you need Apache Falcon
+ Key new Falcon features
+ Demo: Defining data pipelines with replication; policies for retention and late data arrival; managing Falcon server with Ambari
This document summarizes Hortonworks' Data Cloud, which allows users to launch and manage Hadoop clusters on cloud platforms like AWS for different workloads. It discusses the architecture, which uses services like Cloudbreak to deploy HDP clusters and stores data in scalable storage like S3 and metadata in databases. It also covers improving enterprise capabilities around storage, governance, reliability, and fault tolerance when running Hadoop on cloud infrastructure.
This document discusses how Apache Atlas and Apache Ranger can be used together to provide a metadata-driven and secure data lake. Apache Atlas provides metadata services and tagging capabilities. Apache Ranger uses the tags in Atlas to dynamically define and enforce access policies. The integration allows Ranger policies to automatically apply and change as Atlas metadata such as tags are updated. The document demonstrates how tags in Atlas for columns and tables can be used to create time-based and PII data access policies in Ranger.
This document provides an agenda and overview of topics for a Hortonworks data movement and management meetup. The agenda includes networking, introductions, discussions on Falcon use cases and releases, Hive disaster recovery, server-side extensions, ADF/instance search, Hive-based ingestion/export, Spark integration, and Sqoop 2 features. An overview of Falcon describes its high-level abstraction of Hadoop data processing services. Usage scenarios focus on dataset replication, lifecycle management, and lineage/traceability. The document also discusses Falcon examples for replication, retention, and late data handling.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, and Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax, XStream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Keynote slides from Big Data Spain, November 2016. Includes some thoughts on how the Hadoop ecosystem is growing and changing to support the enterprise, including Hive, Spark, NiFi, security and governance, streaming, and the cloud.
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop: pros and cons of different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
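As one small example of the ELT pattern described here, the PySpark sketch below lands a raw extract unchanged and then, in a separate late-binding step, writes it out as a partitioned, compressed ORC table registered in the Hive metastore. The paths, schema, and table names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("elt-orders-load")
         .enableHiveSupport()          # register output through the Hive metastore / HCatalog
         .getOrCreate())

# Extract/Load: read the raw batch exactly as it landed (no transformation at ingest).
raw = spark.read.json("hdfs:///landing/orders/2016-06-01/")

# Transform late, on the grid: type the timestamp and derive a partition column.
typed = (raw
         .withColumn("order_ts", F.to_timestamp("order_ts"))
         .withColumn("dt", F.to_date("order_ts")))

# Write partitioned, compressed ORC so downstream queries prune partitions and read less data.
(typed.write
      .mode("append")
      .partitionBy("dt")
      .format("orc")
      .option("compression", "zlib")
      .saveAsTable("staging.orders"))
```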
The document discusses Apache NiFi and its role in the Hadoop ecosystem. It provides an overview of NiFi, describes how it can be used to integrate with Hadoop components like HDFS, HBase, and Kafka. It also discusses how NiFi supports stream processing integrations and outlines some use cases. The document concludes by discussing future work, including improving NiFi's high availability, multi-tenancy, and expanding its ecosystem integrations.
Hortonworks Data in Motion Webinar Series - Part 1 - Hortonworks
VIEW THE ON-DEMAND WEBINAR: http://hortonworks.com/webinar/introduction-hortonworks-dataflow/
Learn about Hortonworks DataFlow (HDF™) and how you can easily augment your existing data systems – Hadoop and otherwise. Learn what Dataflow is all about and how Apache NiFi, MiNiFi, Kafka and Storm work together for streaming analytics.
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NiFi/MiNiFi - Haimo Liu
Introducing the new Hortonworks DataFlow (HDF) release, HDF 2.0. It also provides an introduction to the flow management part of the platform, powered by Apache NiFi and MiNiFi.
Learn about HDF and how you can easily augment your existing data systems - Hadoop and otherwise. Learn what Dataflow is all about and how Apache NiFi, MiNiFi, Kafka and Storm work together for streaming analytics.
Hadoop Present - Open Enterprise Hadoop - Yifeng Jiang
The document is a presentation on enterprise Hadoop given by Yifeng Jiang, a Solutions Engineer at Hortonworks. The presentation covers updates to Hadoop Core including HDFS and YARN, data access technologies like Hive, Spark and stream processing, security features in Hadoop, and Hadoop management with Apache Ambari.
Hadoop & cloud storage object store integration in production (final) - Chris Nauroth
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, recent emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store, GCS, which are designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the Object Store FileSystem connector, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users that choose to integrate Hadoop with Cloud Storage. We use S3 and S3A connector as case study.
This document provides an overview of Hadoop and its ecosystem. It discusses the evolution of Hadoop from version 1 which focused on batch processing using MapReduce, to version 2 which introduced YARN for distributed resource management and supported additional data processing engines beyond MapReduce. It also describes key Hadoop services like HDFS for distributed storage and the benefits of a Hadoop data platform for unlocking the value of large datasets.
This document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows Hadoop applications to work with both HDFS and cloud storage transparently. Recent enhancements to the S3A file system connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries with S3A compared to earlier versions. Upcoming work on output committers, object store abstraction, and consistency are outlined.
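A minimal PySpark sketch of reading through the S3A connector is shown below; the bucket, paths, and credential values are placeholders, and in practice the keys would usually come from instance roles or a credential provider rather than job configuration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-cloud-read")
         # S3A connector settings; shown inline for clarity, normally set in core-site.xml
         # or supplied via IAM roles / credential providers instead of literal keys.
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
         .config("spark.hadoop.fs.s3a.fast.upload", "true")
         .getOrCreate())

# Same DataFrame API as HDFS: only the scheme changes from hdfs:// to s3a://.
events = spark.read.orc("s3a://example-datalake/warehouse/events/")
events.groupBy("event_type").count().show()
```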
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se... - DataWorks Summit
Today enterprises desire to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to leverage these new advances that the cloud can provide to ease the management of their data lakes in the cloud. We will also show how the enterprise can have consistent governance and security controls in the cloud for their ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that their business context demands. Additionally, we will outline some use cases and patterns as well as best practices to rationally manage such a multi-cluster data lake infrastructure in the cloud.
HDF Powered by Apache NiFi Introduction - Milind Pandit
The document discusses Apache NiFi and its role in managing enterprise data flows, providing an overview of NiFi's key features and capabilities for reliable data transfer, preparation, and routing. It also demonstrates how NiFi is used in common use cases and provides examples of building simple data flows in NiFi to ingest, filter, and deliver data.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
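One of those newer capabilities, ingestion with MERGE, looks roughly like the sketch below when driven from Python through HiveServer2 with the pyhive client. The host, database, and table names are assumptions, and the target table must be a transactional (ACID) Hive table.

```python
from pyhive import hive

# Connect to HiveServer2 (LLAP or classic); host and credentials are placeholders.
conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="etl",
                       database="warehouse")
cursor = conn.cursor()

# Upsert a staging batch into a transactional target table with a single MERGE:
# matched rows are updated in place, new rows are inserted.
cursor.execute("""
    MERGE INTO warehouse.customers AS t
    USING staging.customer_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET email = s.email, updated_at = s.updated_at
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.updated_at)
""")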
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
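The hands-on portion centers on snippets like the following self-contained scikit-learn example, which trains and evaluates a classifier on a built-in dataset; it runs anywhere Python and scikit-learn are installed, not only in CDSW.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Load a small, well-known dataset and hold out 30% of it for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Train a supervised model and evaluate it on the held-out split.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
```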
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is that HBase's write-ahead log (WAL) has specific durability requirements, which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
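Reading those Phoenix tables from Python is straightforward through the Phoenix Query Server. The sketch below uses the phoenixdb client against a hypothetical crime table; the host, table, and column names are assumptions based on the description above.

```python
import phoenixdb

# Connect to the Phoenix Query Server (thin client endpoint).
conn = phoenixdb.connect("http://phoenix-queryserver.example.com:8765/", autocommit=True)
cursor = conn.cursor()

# Query the (hypothetical) Philadelphia crime table loaded by the NiFi flow.
cursor.execute(
    "SELECT dc_dist, text_general_code, dispatch_date "
    "FROM philly_crime "
    "WHERE dispatch_date >= ? "
    "ORDER BY dispatch_date DESC "
    "LIMIT 10",
    ["2016-01-01"],
)
for district, offense, dispatched in cursor.fetchall():
    print(district, offense, dispatched)
```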
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to big data, it is not always trivial to design applications that make the most of it, nor the simplest to operate. Since it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables have to be considered when observing anomalies or even outages. Adding to the equation is the fact that HBase is still an evolving product, with different release versions in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges they have encountered in scaling to support the world catalog, and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
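To give a feel for what such an index table holds, here is a small Python sketch using happybase that writes and reads one index entry directly through the HBase Thrift gateway. It deliberately uses simple puts rather than the Spark-generated HFile bulk loads described above, and the host, table, and column names are invented for illustration.

```python
import happybase

# Connect through the HBase Thrift gateway (host and port are placeholders).
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
index = connection.table("dataset_global_index")

# One index entry: the row key is the record's natural key; the value records where the
# latest version of that record lives in the data lake, so ingestion jobs can locate
# and update it without scanning the whole dataset.
index.put(b"order#12345", {
    b"loc:file": b"hdfs:///datalake/orders/part-00042.parquet",
    b"loc:offset": b"1873",
    b"loc:version": b"2019-04-02T08:15:00Z",
})

print(index.row(b"order#12345"))
```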
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix - DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
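As a rough illustration of what this looks like from the application side, the sketch below uses the python phoenixdb client against a Phoenix Query Server; the endpoint is a placeholder, and the table options that enable Omid-backed transactions (TRANSACTIONAL, TRANSACTION_PROVIDER) should be verified against the Phoenix version in use.

```python
# Sketch: a transactional Phoenix table backed by Omid, accessed through the
# Phoenix Query Server with the python `phoenixdb` client. Endpoint is a
# placeholder; table options may differ across Phoenix versions.

import phoenixdb

conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=False)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS ad_events (
        campaign_id BIGINT NOT NULL,
        event_id    BIGINT NOT NULL,
        spend       DECIMAL(12, 4),
        CONSTRAINT pk PRIMARY KEY (campaign_id, event_id)
    ) TRANSACTIONAL=true, TRANSACTION_PROVIDER='OMID'
""")

# Short transaction: both upserts commit atomically under Omid's
# snapshot-isolation protocol, or not at all.
cur.execute("UPSERT INTO ad_events VALUES (42, 1001, 0.0150)")
cur.execute("UPSERT INTO ad_events VALUES (42, 1002, 0.0125)")
conn.commit()

cur.execute("SELECT COUNT(*) FROM ad_events WHERE campaign_id = 42")
print(cur.fetchone())
conn.close()
```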
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
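As an illustration, the sketch below shows roughly how a row-level filter policy can be created through Ranger's public REST API so that one tenant's analysts only see their own rows. The endpoint, service name, credentials, and exact JSON field names are assumptions to verify against the deployed Ranger release; Florida Blue's actual policies are of course not shown here.

```python
# Hypothetical sketch: creating a row-level filter policy via Ranger's public
# REST API. Endpoint, service name, credentials, and JSON field names are
# assumptions to verify against the Ranger version actually deployed.

import requests

RANGER = "https://ranger.example.com:6182"
AUTH = ("admin", "changeme")  # placeholder credentials

policy = {
    "service": "cm_hive",                 # assumed Hive service name in Ranger
    "name": "claims_rowfilter_tenant_a",
    "policyType": 2,                      # 2 = row-level filter policy
    "resources": {
        "database": {"values": ["claims"]},
        "table": {"values": ["member_claims"]},
    },
    "rowFilterPolicyItems": [
        {
            "groups": ["tenant_a_analysts"],
            "accesses": [{"type": "select", "isAllowed": True}],
            "rowFilterInfo": {"filterExpr": "tenant_id = 'A'"},
        }
    ],
}

resp = requests.post(
    f"{RANGER}/service/public/v2/api/policy",
    json=policy,
    auth=AUTH,
    verify=False,  # demo only; use proper TLS verification in practice
)
resp.raise_for_status()
print(resp.json().get("id"))
```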
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
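As a small illustration of how the cost-based optimizer is typically exercised, the sketch below uses the presto-python-client to enable automatic join reordering and distribution selection and to collect table statistics with ANALYZE; the host, catalog, and table names are placeholders, and the property and parameter names are assumptions that can differ between Presto releases.

```python
# Sketch: exercising Presto's cost-based optimizer from the
# presto-python-client. Host, catalog, and table names are placeholders;
# session property names may differ slightly across Presto releases.

import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
    session_properties={                        # assumed parameter name
        "join_reordering_strategy": "AUTOMATIC",  # let the CBO reorder joins
        "join_distribution_type": "AUTOMATIC",    # pick broadcast vs partitioned
    },
)
cur = conn.cursor()

# The CBO needs statistics; for Hive tables they can be collected with ANALYZE.
cur.execute("ANALYZE hive.web.page_views")
cur.fetchall()

# EXPLAIN shows the join order and distribution the optimizer chose.
cur.execute("""
    EXPLAIN
    SELECT u.country, count(*)
    FROM hive.web.page_views v
    JOIN hive.web.users u ON v.user_id = u.user_id
    GROUP BY u.country
""")
for row in cur.fetchall():
    print(row[0])
```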
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
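A minimal tracking sketch (with made-up data, model, and metric values) shows how little code is involved:

```python
# Minimal MLflow tracking sketch: a few lines in the training script log
# parameters, metrics, and a deployable model. Values here are illustrative.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=7)

with mlflow.start_run(run_name="ridge-demo"):
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X, y)
    rmse = mean_squared_error(y, model.predict(X)) ** 0.5

    mlflow.log_param("alpha", alpha)            # hyperparameter
    mlflow.log_metric("rmse", rmse)             # evaluation metric
    mlflow.sklearn.log_model(model, "model")    # deployable packaging of the model

# Each run is recorded automatically; `mlflow ui` (or a remote tracking server)
# then shows parameters, metrics, and artifacts side by side.
```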
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, along with various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive into deeply in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments while having an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premises and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud, and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, detecting things such as item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
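As a rough sketch of one such building block (not any specific vendor's system), the snippet below runs a pretrained object detector over a single camera frame and counts detections per class; a real shelf-monitoring deployment would add a product-specific model, object tracking, planogram matching, and stream handling around it.

```python
# Rough sketch of one building block behind shelf monitoring: run a pretrained
# object detector on a single camera frame and count what it sees.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = Image.open("shelf_frame.jpg").convert("RGB")   # one frame from a camera stream

with torch.no_grad():
    predictions = model([to_tensor(frame)])[0]          # boxes, labels, scores

counts = {}
for label, score in zip(predictions["labels"], predictions["scores"]):
    if score >= 0.6:                                     # confidence threshold
        counts[int(label)] = counts.get(int(label), 0) + 1

# Map class ids to names downstream, then flag low stock, disorganized shelves,
# or a person lingering in an aisle who may need assistance.
print(counts)
```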
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
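SpaRC's actual algorithm is described in the paper and code; purely as a toy illustration of the underlying idea, the PySpark sketch below links reads that share k-mers, which is the kind of signal used to partition reads by molecule of origin before assembling each cluster independently.

```python
# Toy PySpark sketch: reads that share k-mers likely come from the same gene
# or genome, so link them and assemble each group independently. SpaRC's real
# algorithm (graph-based, with high-frequency k-mer filtering) is more involved.

from itertools import combinations
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmer-clustering-sketch").getOrCreate()
sc = spark.sparkContext

K = 5
reads = sc.parallelize([
    ("r1", "ACGTACGTGGA"),
    ("r2", "TACGTGGAAC"),
    ("r3", "TTTTCCCCGGG"),
])

def kmers(record):
    read_id, seq = record
    return [(seq[i:i + K], read_id) for i in range(len(seq) - K + 1)]

# k-mer -> reads containing it; a k-mer shared by several reads links them
shared = (reads.flatMap(kmers)
               .groupByKey()
               .mapValues(set)
               .filter(lambda kv: len(kv[1]) > 1))

# Emit linked read pairs; a full solution would feed these edges to a
# connected-components step to form the final clusters.
edges = (shared.flatMap(lambda kv: [tuple(sorted(p)) for p in combinations(kv[1], 2)])
               .distinct())

print(edges.collect())   # [('r1', 'r2')] -- r1 and r2 overlap, r3 stands alone
spark.stop()
```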
#10: Thanks Justin,
Here are Falcon’s primary features.
1 The first is to manage the data lifecycle in one common place.
2 The second is to facilitate quick deployment of replication for business continuity and disaster recovery use cases. This includes monitoring and a base set of policies for replication and retention.
3 Lastly, Falcon provides foundational audit and compliance features – visualization and tracking of entity lineage, and collection of audit logs.
#12: This is the high level Falcon Architecture
Falcon runs as a standalone server as part of your Hadoop cluster
A user creates entity specifications and submits them to Falcon using the API
Falcon validates and saves entity specifications to HDFS
Falcon uses Oozie as its default scheduler
Dashboard for entity viewing in Falcon UI
Ambari integration for management
#13: Feeds have location, replication schedule and retention policies
Meta info includes frequency, where data is coming from (source), where to replicate it (target), and how long to retain it
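As a hypothetical sketch of what such a feed looks like in practice, the snippet below defines a simplified feed entity (placeholder cluster names, paths, frequencies, and retention limits) and submits it over Falcon's REST API; the entity schema and endpoint should be checked against the Falcon release in use.

```python
# Hypothetical sketch: a simplified Falcon feed (source location, replication
# target, retention) submitted over Falcon's REST API. All names, paths,
# frequencies, and the endpoint are placeholders to adapt.

import requests

feed_xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed name="raw-clicks" description="hourly click stream" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>     <!-- source keeps 90 days -->
    </cluster>
    <cluster name="failover-cluster" type="target">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(36)" action="delete"/>   <!-- DR copy kept longer -->
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl" group="analytics" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
"""

resp = requests.post(
    "http://falcon-server:15000/api/entities/submit/feed",
    params={"user.name": "etl"},          # pseudo auth; Kerberos in secured clusters
    data=feed_xml.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
)
print(resp.status_code, resp.text)
```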
#14: Let's take a look at the Data Pipeline, or workflow.
** read high level **
#16: Once a pipeline is created, you'll want to run it.
This means you'll probably want monitoring as well.
Falcon, in conjunction with Ambari, provides centralized monitoring.
** bullets **
#17: OK, let's chat about replication with Falcon – which is very efficient.
In this example we have a primary cluster with a typical workflow.
There is a business requirement to replicate this to a failover cluster.
** bullets **
#18: Falcon has flexible data retention policies; it's able to model business compliance requirements.
Sophisticated retention policies expressed in one place
Simplify data retention for audit, compliance, or for data re-processing
In this example, different datasets in a workflow can have different retention policies.
#19: We realize that many types of workflows have inputs from different systems, which may be in different regions. Falcon has logic built in to handle this potentially tricky situation.
#20: HCatalog – metadata shared across the whole platform
File locations become abstract (not hard-coded)
Data types become shared (not redefined per tool)
Partitioning and HDFS-optimized
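One way to see the benefit in practice: tools address data through the shared metastore by table name instead of hard-coding HDFS paths. The PySpark sketch below assumes a metastore-registered table named clicks.raw_events, which is a placeholder.

```python
# Sketch of the "abstract file locations" point: with a shared metastore
# (HCatalog/Hive), tools address data by table name rather than by HDFS path.
# The table name `clicks.raw_events` is a placeholder.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hcatalog-access-sketch")
         .enableHiveSupport()        # use the shared Hive/HCatalog metastore
         .getOrCreate())

# Without shared metadata: every tool hard-codes paths, formats, and schemas.
raw_by_path = spark.read.parquet("hdfs://nn/data/clicks/2015-04-15-00")

# With HCatalog: location, format, schema, and partitions come from the
# metastore, so the same table definition is shared by Hive, Pig, and Spark.
raw_by_table = spark.table("clicks.raw_events")
print(raw_by_table.where("dt = '2015-04-15'").count())
```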