This document discusses how Apache Atlas and Apache Ranger can be used together to provide a metadata-driven, secure data lake. Apache Atlas provides metadata services and tagging capabilities, and Apache Ranger uses the tags in Atlas to dynamically define and enforce access policies. The integration allows Ranger policies to apply automatically and adapt as Atlas metadata, such as tags, is updated. The document demonstrates how Atlas tags on columns and tables can be used to create time-based and PII data access policies in Ranger.
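As a rough illustration of that flow, the sketch below (Python, using the requests library) tags a Hive column with a PII classification through Atlas' REST v2 API; the hostnames, credentials, and qualified name are placeholders, and the endpoints should be checked against your Atlas version. Once the tag lands in Atlas, Ranger's tagsync propagates it, so an existing tag-based policy on PII covers the column automatically.

```python
import requests

ATLAS = "https://atlas-host:21000"   # placeholder Atlas endpoint
AUTH = ("admin", "admin")            # placeholder credentials

# Look up the GUID of a Hive column by its qualified name.
col = requests.get(
    f"{ATLAS}/api/atlas/v2/entity/uniqueAttribute/type/hive_column",
    params={"attr:qualifiedName": "default.customers.ssn@cl1"},
    auth=AUTH, verify=False).json()
guid = col["entity"]["guid"]

# Attach a PII classification (tag) to that column. Ranger tagsync picks
# this up, so a tag-based policy on "PII" now applies to the column
# without any change to resource-based policies.
requests.post(
    f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII"}],
    auth=AUTH, verify=False)
```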
This document provides an overview of Apache Atlas and how it addresses big data governance issues for enterprises. It discusses how Atlas provides a centralized metadata repository that allows users to understand data across Hadoop components. It also describes how Atlas integrates with Apache Ranger to enable dynamic security policies based on metadata tags. Finally, it outlines new capabilities in upcoming Atlas releases, including cross-component data lineage tracking and a business taxonomy/catalog.
The document outlines Renault's big data initiatives from 2014-2016 which progressed from an initial sandbox to a full industrialized big data platform. Key steps included implementing a new Hadoop infrastructure in 2015, industrializing the platform in 2016 to host production projects and POCs, and designing for scalability, isolation, simplified operations, and data protection. The document also discusses deploying quality projects to the data lake, ingestion scenarios, interactive SQL analytics, security measures including tokenization, and the next steps of federation and dynamic data change management.
This document summarizes improvements made to HDFS to optimize performance, stabilize operations, and improve supportability. Key areas discussed include logging enhancements, metrics and tools for troubleshooting, load management through RPC improvements, and changes to reduce garbage collection overhead and improve liveness detection. Specific optimizations covered range from code changes to reduce logging verbosity to adding batch processing of block reports.
The document discusses accelerating enterprise adoption of Apache Hadoop through a capability-driven approach. It outlines four core tenets for a Hadoop journey: having a capability-driven framework, using a heterogeneous set of technologies, choosing the right fit of open source and commercial solutions, and developing a flexible operating model. Case studies show how following these tenets can help reduce data processing times and give business users improved analytics capabilities.
This document discusses architecting Hadoop for adoption and data applications. It begins by explaining how traditional systems struggle as data volumes increase and how Hadoop can help address this issue. Potential Hadoop use cases are presented such as file archiving, data analytics, and ETL offloading. Total cost of ownership (TCO) is discussed for each use case. The document then covers important considerations for deploying Hadoop such as hardware selection, team structure, and impact across the organization. Lastly, it discusses lessons learned and the need for self-service tools going forward.
This document discusses strategies for filling a data lake by improving the process of data onboarding. It advocates using a template-based approach to streamline data ingestion from various sources and reduce dependence on hardcoded procedures. The key aspects are managing ELT templates and metadata through automated metadata extraction. This allows generating integration jobs dynamically based on metadata passed at runtime, providing flexibility to handle different source data with one template. It emphasizes reducing the risks associated with large data onboarding projects by maintaining a standardized and organized data lake.
Securing Enterprise Healthcare Big Data by the Combination of Knox/F5, Ranger... (DataWorks Summit)
Data security is critical to the success of large enterprises such as Mayo Clinic (MC), and healthcare data stored on the enterprise Big Data platforms is no exception. At MC, healthcare Big Data ingestion, storage, processing and analytics all take place in enterprise-secured environments, including the Sandbox, Dev, Int/Test and Prod Hadoop clusters. Primary data security in these enterprise-secured Hadoop clusters has been achieved at MC by combining the Knox Gateway/F5 load balancer, Ranger authorization and auditing, two-factor local authentication (TFA), and Kerberos authentication coupled to MC Active Directory and LDAP. In other words, any major HDFS, HBase or Hive healthcare data operation at MC has to go through the dedicated Knox Gateway or F5 balancer (for Knox HA) via the REST API, which interacts with Ranger and the other primary security components involved. Data security on the Big Data platforms at MC will be strengthened further by the ongoing network segmentation and SSL enablement of the related Hadoop ecosystem components. These approaches have significantly improved data security for the success of the MC Big Data program, although accessing the data requires highly skilled clients or applications.
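For a concrete sense of what "going through Knox" looks like to a client, here is a minimal Python sketch of an HDFS directory listing issued through a Knox gateway with WebHDFS; the host, topology name ("default"), path, and credentials are placeholders, not MC's actual configuration.

```python
import requests

# All traffic goes through the Knox gateway (or the F5 VIP in front of it);
# Knox authenticates the caller against AD/LDAP and forwards to WebHDFS.
KNOX = "https://knox-host:8443/gateway/default"   # placeholder host and topology

resp = requests.get(
    f"{KNOX}/webhdfs/v1/data/claims",
    params={"op": "LISTSTATUS"},
    auth=("clinical_user", "********"),
    verify="/etc/security/knox-ca.pem")           # TLS terminates at Knox

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```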
This document discusses navigating user data management and data discovery. It provides an overview of evaluating and selecting data management tools for a Hadoop data lake. Key criteria for evaluation include metadata curation, lineage and versioning, integration capabilities, and performance. Several vendors were evaluated, with Global ID, Attivio, and Waterline Data scoring highest based on the criteria. The presentation emphasizes selecting a limited number of tools based on business and user requirements.
Implementing a Data Lake with Enterprise Grade Data Governance (Hortonworks)
Hadoop provides a powerful platform for data science and analytics, where data engineers and data scientists can leverage myriad data from external and internal data sources to uncover new insight. Such power is also presenting a few new challenges. On the one hand, the business wants more and more self-service, and on the other hand IT is trying to keep up with the demand for data, while maintaining architecture and data governance standards.
In this webinar, Andrew Ahn, Data Governance Initiative Product Manager at Hortonworks, will address the gaps and offer best practices in providing end-to-end data governance in HDP. Andrew Ahn will be followed by Oliver Claude of Waterline Data, who will share a case study of how Waterline Data Inventory works with HDP in the Modern Data Architecture to automate the discovery of business and compliance metadata, data lineage, as well as data quality metrics.
The document summarizes the Cask Data Application Platform (CDAP), which provides an integrated framework for building and running data applications on Hadoop and Spark. It consolidates the big data application lifecycle by providing dataset abstractions, self-service data, metrics and log collection, lineage, audit, and access control. CDAP has an application container architecture with reusable programming abstractions and global user and machine metadata. It aims to simplify deploying and operating big data applications in enterprises by integrating technologies like YARN, HBase, Kafka and Spark.
The document discusses the rise of Big Data as a Service (BDaaS) and how recent technological advancements have enabled its emergence. It provides a brief history of Hadoop and how improvements in networking, storage, virtualization and containers have addressed earlier limitations. It defines BDaaS and describes the public cloud and on-premises deployment models. Finally, it highlights how BlueData's software platform can deliver an integrated BDaaS solution both on-premises and across multiple public clouds including AWS.
Integrated Data Warehouse with Hadoop and Oracle Database - Gwen (Chen) Shapira
This document discusses building an integrated data warehouse with Oracle Database and Hadoop. It provides an overview of big data and why data warehouses need Hadoop. It also gives examples of how Hadoop can be integrated into a data warehouse, including using Sqoop to import and export data between Hadoop and Oracle. Finally, it discusses best practices for using Hadoop efficiently and avoiding common pitfalls when integrating Hadoop with a data warehouse.
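A minimal example of the Sqoop-based integration the document refers to might look like the following Python wrapper around the sqoop CLI; the JDBC URL, schema, and paths are placeholders, and in practice the same job is usually launched straight from the shell or an orchestrator.

```python
import subprocess

# Pull an Oracle table into HDFS as Parquet with four parallel mappers.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//oradb-host:1521/ORCL",  # placeholder database
    "--username", "dw_reader",
    "--password-file", "/user/etl/.ora_pwd",   # keep credentials off the command line
    "--table", "SALES.ORDERS",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
    "--as-parquetfile",
], check=True)
```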
Strata San Jose 2017 - Ben Sharma Presentation (Zaloni)
The document discusses creating a modern data architecture using a data lake. It describes Zaloni as a provider of data lake management solutions, including a data lake management and governance platform and self-service data platform. It outlines key features of a data lake such as storing different types of data, creating standardized datasets, and providing shorter time to insights. The document also discusses Zaloni's data lake maturity model and reference architecture.
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL and their critical business operations on SAP applications. Organisations need this data to be available in real-time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable the new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real time with change data capture (CDC) technology
• Use Attunity Replicate for SAP, as other organisations do, to stream SAP data into Kafka
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)... (Hortonworks)
This document discusses using Hadoop and the Hortonworks Data Platform (HDP) for big data applications. It outlines how HDP can help organizations optimize their existing data warehouse, lower storage costs, unlock new applications from new data sources, and achieve an enterprise data lake architecture. The document also discusses how Talend's data integration platform can be used with HDP to easily develop batch, real-time, and interactive data integration jobs on Hadoop. Case studies show how companies have used Talend and HDP together to modernize their data architecture and improve product inventory and pricing forecasting.
Presentation from Data Science Conference 2.0 held in Belgrade, Serbia. The focus of the talk was to address the challenges of deploying a Data Lake infrastructure within the organization.
Insights into Real-world Data Management Challenges (DataWorks Summit)
Oracle began with the belief that the foundation of IT was managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle’s Integrated Cloud is one cloud for the entire business, meeting everyone’s needs. It’s about connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
The convergence of reporting and interactive BI on Hadoop (DataWorks Summit)
Since the early days of Hive, SQL on Hadoop has evolved from being a SQL wrapper on top of MapReduce to a viable replacement for the traditional EDW. In the meantime, while SQL-on-Hadoop vendors were busy adding enterprise capabilities and comparing their TPC-DS prowess against Hive, a niche industry emerged on the side for OLAP (a.k.a. “Interactive BI”) on Hadoop data. Unlike general-purpose SQL-on-Hadoop engines, which deal with the multiple aspects of warehousing, including reporting, OLAP-on-Hadoop engines focus almost exclusively on answering OLAP queries fast by using implementation techniques that had not been part of the SQL-on-Hadoop toolbox so far.
But SQL-on-Hadoop engines are not standing still. After having made huge progress in catching up to traditional EDWs for reporting workloads, SQL-on-Hadoop engines are now setting their sights on interactive BI. This is great news for enterprises. As the line between reporting and OLAP gets blurred, enterprises can now start considering using a single engine for both reporting and Interactive BI on their Hadoop data, as opposed to having to host, manage, and license two separate products.
Can a single engine satisfy both your reporting and Interactive BI needs? This may be a hard question to answer. Vendors use inconsistent terminology to describe their products and make ambitious and sometimes conflicting claims. This makes it very hard for enterprises to compare products, let alone decide which is the product that best matches their needs.
In this presentation, we’ll provide an overview of the different approaches to OLAP on Hadoop, and explain the key technologies behind each of them. We’ll use consistent terminology to describe what you get from multiple proprietary and open source products and outline advantages and disadvantages. You’ll come out equipped with the knowledge you need to read past marketing and sales pitches. You’ll be able to compare products and make an informed decision on whether a single engine for both reporting and Interactive BI on Hadoop is right for you.
Speaker
Gustavo Arocena, Big Data Architect, IBM
This webinar series covers Apache Kafka and Apache Storm for streaming data processing. It also discusses the new streaming innovations for Kafka and Storm included in HDP 2.2.
Big data security challenges differ somewhat from those of traditional client-server applications: big data systems are distributed in nature, which introduces unique security vulnerabilities. The Cloud Security Alliance (CSA) has categorized the security and privacy challenges into four aspects of the big data ecosystem: infrastructure security, data privacy, data management, and integrity and reactive security. Each of these aspects is further divided into the following security challenges:
1. Infrastructure security
a. Secure distributed processing of data
b. Security best practices for non-relational data stores
2. Data privacy
a. Privacy-preserving analytics
b. Cryptographic technologies for big data
c. Granular access control
3. Data management
a. Secure data storage and transaction logs
b. Granular audits
c. Data provenance
4. Integrity and reactive security
a. Endpoint input validation/filtering
b. Real-time security/compliance monitoring
In this talk, we are going to refer to the above classification and identify existing security controls, best practices, and guidelines. We will also paint a big picture of how the collective usage of all discussed security controls (Kerberos, TDE, LDAP, SSO, SSL/TLS, Apache Knox, Apache Ranger, Apache Atlas, Ambari Infra, etc.) can address fundamental security and privacy challenges across the entire Hadoop ecosystem. We will also briefly discuss recent security incidents involving Hadoop systems.
Speakers
Krishna Pandey, Staff Software Engineer, Hortonworks
Kunal Rajguru, Premier Support Engineer, Hortonworks
As containerization continues to gain momentum and become a de facto standard for application deployment, challenges around containerization of big data workloads are coming to light. Great strides have been made within the open source communities towards running big data workloads in containers, but much is left to be done.
Apache Hadoop YARN is the modern distributed operating system for big data applications. It has morphed the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. At its core, YARN has a very powerful scheduler which enforces global cluster level invariants and helps sites manage user and operator expectations of elastic sharing, resource usage limits, SLAs, and more. YARN recently increased its support for Docker containerization and added a YARN service framework supporting long-running services.
In this session we will explore the emerging patterns and challenges related to containers and big data workloads, including running applications such as Apache Spark, Apache HBase, and Kubernetes in containers on YARN.
Speakers
Billie Rinaldi, Principal Software Engineer, Hortonworks
Shane Kumpf, Software Engineer, Hortonworks
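As a hedged sketch of what Dockerized containers on YARN look like in practice, the snippet below submits the stock distributed-shell application with the Docker runtime selected through launch-environment variables; the jar path and image are placeholders, and the cluster must already have the Docker runtime enabled in its YARN configuration.

```python
import subprocess

DSHELL_JAR = "/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar"  # placeholder path

# Run a one-off command inside a Docker container scheduled by YARN.
subprocess.run([
    "yarn", "jar", DSHELL_JAR,
    "-jar", DSHELL_JAR,
    "-shell_command", "python -V",
    "-shell_env", "YARN_CONTAINER_RUNTIME_TYPE=docker",
    "-shell_env", "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=library/python:3.6",
    "-num_containers", "1",
], check=True)
```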
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges of traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. Areas covered include the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... (NoSQLmatters)
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
Worldpay processes billions of transactions annually and stores vast amounts of transaction and customer data. In 2015, Worldpay committed to building a new enterprise data platform on Hadoop to provide analytics, reporting, and machine learning capabilities. The platform uses a multi-tenancy model with different "tenancy types" like data warehousing, decision services, APIs, and technical insights. Each tenancy type has its own components and services. Worldpay's platform currently has live implementations for data warehousing and is developing multiple decision services, with a goal of supporting tens of services within two years.
Hortonworks provides an open source Apache Hadoop data platform for managing large volumes of data. It was founded in 2011 and went public in 2014. Hortonworks has over 800 employees across 17 countries and partners with over 1,350 technology companies. Hortonworks' Data Platform is a collection of Apache projects that provides data management, access, governance, integration, operations and security capabilities. It supports batch, interactive and real-time processing on a shared infrastructure using the YARN resource management system.
The document summarizes research done at the Barcelona Supercomputing Center on evaluating Hadoop platforms as a service (PaaS) compared to infrastructure as a service (IaaS). Key findings include:
- Provider (Azure HDInsight, Rackspace CBD, etc.) did not significantly impact performance of wordcount and terasort benchmarks.
- Data size and number of datanodes were more important factors, with diminishing returns on performance from adding more nodes.
- PaaS can save on maintenance costs compared to IaaS but may be more expensive depending on workload and VM size needed. Tuning may still be required with PaaS.
Zurich Insurance is implementing a data lake to help address key trends in the insurance industry like digital transformation, emerging risks, and regulatory changes. The data lake will provide capabilities needed to store both structured and unstructured data at low cost, create business views on demand, support different workloads, enable rapid changes, and make data, analytics, and apps seamless. Zurich's conceptual architecture places all raw data into a single store with history and provides curation layers to build line of business and group level views for consumption.
Effective data governance is imperative to the success of Data Lake initiatives. Without governance policies and processes, information discovery and analysis is severely impaired. In this session we will provide an in-depth look into the Data Governance Initiative launched collaboratively between Hortonworks and partners from across industries. We will cover the objectives of Data Governance Initiatives and demonstrate key governance capabilities of the Hortonworks Data Platform.
The document discusses using natural language processing (NLP) techniques like word2vec to analyze structured clinical data. Clinical encounters can be treated as "sentences" with vitals, labs, procedures, diagnoses, and prescriptions as "words". The author ingested clinical records into such "sentences" and uses Spark's word2vec implementation on Hadoop to explore relationships between clinical concepts. The approach is demonstrated on a dataset from a Kaggle diabetes prediction competition, with questions taken afterwards.
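A minimal PySpark sketch of the idea, treating each encounter as a bag of clinical "words"; the toy codes and parameters are illustrative only, not the author's actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("clinical-word2vec").getOrCreate()

# Each row is one encounter; the tokens are made-up vitals/labs/diagnosis/drug codes.
encounters = spark.createDataFrame(
    [(["HBA1C_HIGH", "METFORMIN", "E11.9"],),
     (["BP_HIGH", "LISINOPRIL", "I10"],),
     (["HBA1C_HIGH", "INSULIN", "E11.9"],)],
    ["codes"])

# Real data would use a much larger corpus and minCount.
w2v = Word2Vec(vectorSize=50, minCount=1, inputCol="codes", outputCol="vector")
model = w2v.fit(encounters)

# Which clinical concepts land near metformin in the embedding space?
model.findSynonyms("METFORMIN", 3).show()
```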
Apache Atlas provides centralized metadata services and cross-component dataset lineage tracking for Hadoop components. It aims to enable transparent, reproducible, auditable and consistent data governance across structured, unstructured, and traditional database systems. The near term roadmap includes dynamic access policy driven by metadata and enhanced Hive integration. Apache Atlas also pursues metadata exchange with non-Hadoop systems and third party vendors through REST APIs and custom reporters.
Apache Atlas provides metadata services and a centralized metadata repository for Hadoop platforms. It aims to enable data governance across structured and unstructured data through hierarchical taxonomies. Upcoming features include expanded dataset lineage tracking and integration with Apache Kafka and Ranger for dynamic access policy management. Challenges of big data management include scaling traditional tools to handle large volumes of entities and metadata, and Atlas addresses this through its decentralized and metadata-driven approach.
Hellmar Becker, a DevOps engineer, presented on securing Hadoop in an enterprise context at a summit in Dublin on April 14, 2016. The challenges of securing Hadoop include its lack of security by default and the risks of data loss, privacy breaches, and system intrusions. ING uses Hadoop for data storage, advanced analytics, real-time processing, and reporting. To secure Hadoop, ING implemented perimeter security, integrated Hadoop with its Active Directory for authentication and authorization using Ranger and Kerberos, and developed custom scripts to sync user groups efficiently, working around Ranger's limitations. Further improvements could include integrating OS and Hadoop security and using Identity and Policy Authentication for a centralized user database.
This document discusses Azure HDInsight and how it provides a managed Hadoop as a service on Microsoft's cloud platform. Key points include:
- Azure HDInsight runs Apache Hadoop and related projects like Hive and Pig in a cloud-based cluster that can be set up in minutes without hardware to deploy or maintain.
- It supports running queries and analytics jobs on data stored locally in HDFS or in Azure cloud storage like Blob storage and Data Lake Store.
- An IDC study found that Microsoft customers using cloud-based Hadoop through Azure HDInsight have 63% lower total cost of ownership than an on-premises Hadoop deployment.
ING Bank has developed a data lake architecture to centralize and govern all of its data. The data lake will serve as the "memory" of the bank, holding all data relevant for reporting, analytics, and data exchanges. ING formed an international data community to collaborate on Hadoop implementations and identify common patterns for file storage, deep data analytics, and real-time usage. Key challenges included the complexity of Hadoop, difficulty of large-scale collaboration, and ensuring analytic data received proper security protections. Future steps include standardizing building blocks, defining analytical model production, and embedding analytics in governance for privacy compliance.
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
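As a rough sketch of the "Kafka as ingestion hub plus Spark Streaming" pattern described above, assuming the spark-streaming-kafka connector is on the classpath and a broker at broker1:9092 (both placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-event-counts")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

# Direct stream from a Kafka topic; messages arrive as (key, value) pairs.
events = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

# Simple per-batch word count over the message values.
counts = (events.map(lambda kv: kv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```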
HPE provides optimized server architectures for Hadoop including the Apollo 4200 server which offers high storage density. HPE also offers a reference architecture for Hadoop that separates compute and storage resources for better performance, using optimized servers like Moonshot for processing and Apollo for storage. Additionally, HPE contributes to Apache Spark through HP Labs to improve efficiency and scale of memory and performance.
Apache Atlas: Data Governance for Hadoop. Strata London 2015 (Sean Roberts)
Apache Hadoop is being adopted across all industries for its ability to store and process an abundance of new types of data in a modern data architecture. But this "Any Data" architecture presents a challenge: organizations must reconcile data management realities as they bring existing and new data from disparate platforms under management.
Apache Atlas proposes to provide governance capabilities in Hadoop that use both prescriptive and forensic models enriched by business taxonomical metadata. It is designed to exchange metadata with other tools and processes within and outside of the Hadoop stack, thereby enabling platform-agnostic governance.
Open Data Fueling Innovation - Kristen Honey (scoopnewsgroup)
The document discusses the United States' leadership in open government and open data initiatives. It provides details on programs like the Open Government Initiative, Open Government Partnership, and open data policies. It then highlights the impact of open data across various federal agencies and programs, including examples in international development, finance, agriculture, education, health, precision medicine, and policing. Open data is fueling innovation and improved government services.
Hadoop World 2011: Mike Olson Keynote Presentation (Cloudera, Inc.)
Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform due to the innovative work of a vibrant developer community and on the rapid adoption of the platform among large enterprises. He will highlight how enterprises have transformed themselves into data-driven organizations, highlighting compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles for the key constituents across the Hadoop community, ecosystem and enterprises.
The IDEA Lab offers tools and programs to promote innovation across the Department of Health and Human Services. It provides accelerators and funding for new ideas, as well as training and support for entrepreneurs and innovators. The Health Data Initiative makes over 2,100 public health datasets available to spur innovative applications and improve health outcomes.
The document summarizes the past, present, and future of Hadoop at LinkedIn. It describes how LinkedIn initially implemented PYMK on Oracle in 2006, then moved to Hadoop in 2008 with 20 nodes, scaling up to over 10,000 nodes and 1000 users by 2016 running various big data frameworks. It discusses the challenges of scaling hardware and processes, and how LinkedIn developed tools like HDFS Dynamometer, Dr. Elephant, Byte-Ray and SoakCycle to help with scaling, performance tuning, dependency management and integration testing of Hadoop clusters. The future may include the Dali project to make data more accessible through different views.
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa... (Yahoo Developer Network)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads. This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, that fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.
Speakers:
David Alves, Software Engineer at Cloudera working on the Kudu team and a PhD student at UT Austin. David is a committer at the Apache Software Foundation and has contributed to several open source projects, including Apache Cassandra and Apache Drill.
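To make the "fast scans plus fast random access behind a single API" claim concrete, here is a hedged PySpark sketch of reading a Kudu table through the kudu-spark connector; the master address and table name are placeholders, and the connector package is assumed to be available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-scan").getOrCreate()

# Read a Kudu table as a DataFrame via the kudu-spark data source.
events = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master:7051")        # placeholder master
          .option("kudu.table", "impala::default.events")   # placeholder table
          .load())

# Analytic scan over the columnar storage; individual rows remain
# updatable through the Kudu API at the same time.
events.groupBy("event_type").count().show()
```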
The document discusses LLAP (Live Long and Process), a new execution engine in Apache Hive 2.0 that enables sub-second analytical queries. LLAP keeps a small subset of frequently accessed data in memory to enable faster query processing times compared to traditional Hive architectures that rely on disk access. It works by running Hive query fragments simultaneously in both YARN containers and long-running daemon processes that cache data in memory. This allows for highly concurrent query execution without specialized YARN configurations. The document provides details on how LLAP is implemented and evaluates its performance benefits based on benchmarks and customer case studies.
This document provides an agenda and overview of a presentation by Data Transformed on big data analytics using the KNIME platform. The presentation includes an introduction of Data Transformed and KNIME, a live demonstration of forecasting energy usage from customer data using KNIME and Hadoop tools, and a question and answer session. It promotes Data Transformed's services around data management, analytics, and consulting and highlights KNIME's capabilities for comprehensive data processing.
The document discusses Apache Atlas, which is a data governance solution for Hadoop. It provides a centralized business catalog to organize data assets along business terms. This improves data governance and compliance and enables faster discovery of data. The business catalog provides a common taxonomy and supports features like tagging, lineage tracking, and dynamic access control integrated with Ranger. It aims to reduce the time analysts spend searching for data from 50-80% to less than 25%.
Apache Atlas provides data governance capabilities for Hadoop including data classification, metadata management, and data lineage/provenance. It models metadata using a flexible type system and stores metadata in a property graph database for relationships and lineage queries. Key features include cross-component lineage mapping, reusable tagging policies for access control, and a business catalog to organize assets by common business terms.
The Atlas/ Ranger integration represents a paradigm shift for big data governance and security. Enterprises can now implement dynamic classification-based security policies, in addition to role-based security. Ranger’s centralized platform empowers data administrators to define security policy based on Atlas metadata tags or attributes and apply this policy in real-time to the entire hierarchy of data assets including databases, tables and columns.
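As an illustrative sketch only, the snippet below creates a tag-based Ranger policy through the public REST API so that members of an "analysts" group can SELECT from any Hive asset carrying the PII tag; the host, service name, group, and field layout are assumptions that should be verified against your Ranger version rather than taken as the exact schema.

```python
import requests

RANGER = "https://ranger-host:6182"   # placeholder Ranger admin endpoint
AUTH = ("admin", "admin")             # placeholder credentials

policy = {
    "service": "cl1_tag",             # the tag-type service linked to the Hive repo
    "name": "PII-analyst-select",
    "isEnabled": True,
    "resources": {"tag": {"values": ["PII"], "isExcludes": False}},
    "policyItems": [{
        "groups": ["analysts"],
        "accesses": [{"type": "hive:select", "isAllowed": True}],
        "delegateAdmin": False,
    }],
}

# One policy per tag: it follows the PII classification wherever Atlas applies it.
requests.post(f"{RANGER}/service/public/v2/api/policy",
              json=policy, auth=AUTH, verify=False)
```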
The document discusses extending data governance in Hadoop ecosystems using Apache Atlas and partner solutions including Waterline Data, Attivio, and Trifacta. It highlights how these vendors have embraced Apache's open source community commitment and are integrating their products with Atlas, creating a rich, innovative ecosystem around a common metadata store backed by Atlas. The session will showcase how these three vendors extend governance capabilities by integrating their products with Atlas.
As organizations pursue Big Data initiatives to capture new opportunities for data-driven insights, data governance has become table stakes, both for external regulatory compliance and for business value extraction within the enterprise. This session will introduce Apache Atlas, a project incubated by Hortonworks along with a group of industry leaders across several verticals, including financial services, healthcare, pharma, oil and gas, retail and insurance, to help address data governance and metadata needs with an open, extensible platform governed under the aegis of the Apache Software Foundation. Apache Atlas empowers organizations to harvest metadata across the data ecosystem and to govern and curate data lakes by applying consistent data classification with a centralized metadata catalog.
In this talk, we will present the underpinnings of the architecture of Apache Atlas and conclude with a tour of governance capabilities within Apache Atlas, showcasing various features for open metadata modeling, data classification, and visualizing cross-component lineage and impact. We will also demo how Apache Atlas delivers a complete view of data movement across several analytic engines, such as Apache Hive, Apache Storm and Apache Kafka, and its capabilities to effectively classify and discover datasets.
The document discusses new security and governance capabilities in Hortonworks Data Platform (HDP) provided by Apache Atlas and Ranger. Apache Atlas provides data governance by capturing metadata and enabling users to define tags, classifications, and policies. Ranger integrates with Atlas to enable dynamic, tag-based access policies. Together, Atlas and Ranger provide deep visibility into the security administration process, fine-grained security definition, and centralized management of security policies across HDP components.
The integration of Ranger and Atlas is a fundamental shift in how access to assets within the Hadoop ecosystem is provisioned. It allows those who understand the content and classification of data to assign proper permissions based on data-specific attributes, rather than the current location- and user-based model. Furthermore, it provides a clear separation of duties and ensures that responsibility for maintaining data access security remains with the most appropriate teams, i.e. those who know the data best.
Moreover, data classification changes in Atlas trigger corresponding changes to the appropriate Ranger authorization rules. This provides an agile approach to authorization and reduces the workload and stress on operational teams, allowing for faster and more accurate delivery.
With the ongoing evolution and maturation of the Hadoop ecosystem’s tools and services, data-driven authorization will scale in parallel. Essentially, it collapses the many policies defined across multiple services into a single policy per tag (data classification) that spans services. It takes careful planning and architecture to unlock these features in Atlas and Ranger.
The presentation will be a tutorial on how to:
• Structure user groups
• Add custom fields, entities, and tags to Atlas
• Inherit and chain Atlas entities and tags
• Configure Ranger to sync Atlas tags and assign permissions based on those tags
• Assign conditional permissions based on Atlas tags’ properties
• Integrate Atlas into your ingestion framework to auto-assign metadata to your data
• Walk through the full flow of creating an entity, adding custom fields, adding tags, and configuring policies in Ranger to utilize the tag
The tutorial will also highlight key features in Atlas and different integration points within the Hadoop ecosystem. At the end of the tutorial, attendees should gain functional knowledge on how to authorize assets based on their metadata.
Speaker
Amer Issa, Senior Platform and Security Architect, Hortonworks
Hortonworks Oracle Big Data Integration (Hortonworks)
Slides from joint Hortonworks and Oracle webinar on November 11, 2014. Covers the Modern Data Architecture with Apache Hadoop and Oracle Data Integration products.
The document discusses Apache Atlas, an open source project aimed at solving data governance challenges in Hadoop. It proposes Atlas to provide capabilities like data classification, metadata exchange, centralized auditing, search and lineage tracking, and security policies. The architecture would involve a type system to define metadata, a graph database to store metadata, and search and lineage functionality. A governance certification program is also proposed to ensure partner solutions integrate well with Atlas and Hadoop.
The document provides an overview of Apache Atlas, a metadata management and governance solution for Hadoop data lakes. It discusses Atlas' architecture, which uses a graph database to store types and instances. Atlas also includes search capabilities and integration with Hadoop components like Hive to capture lineage metadata. The remainder of the document outlines Atlas' roadmap, with goals of adding additional component connectors, a governance certification program, and generally moving towards a production release.
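A small Python sketch of how those search and lineage capabilities are typically exercised through Atlas' REST API, using basic search to find entities tagged PII and then pulling their lineage graphs; the endpoint, credentials, and tag name are placeholders.

```python
import requests

ATLAS = "https://atlas-host:21000"   # placeholder Atlas endpoint
AUTH = ("admin", "admin")            # placeholder credentials

# Basic search: all Hive tables carrying the PII classification.
hits = requests.get(
    f"{ATLAS}/api/atlas/v2/search/basic",
    params={"typeName": "hive_table", "classification": "PII", "limit": 25},
    auth=AUTH, verify=False).json().get("entities", [])

# Walk the stored lineage graph for each hit.
for entity in hits:
    lineage = requests.get(
        f"{ATLAS}/api/atlas/v2/lineage/{entity['guid']}",
        params={"direction": "BOTH", "depth": 3},
        auth=AUTH, verify=False).json()
    print(entity.get("displayText", entity["guid"]),
          "->", len(lineage.get("relations", [])), "lineage edges")
```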
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks... (Hortonworks)
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, and Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax, XStream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
This document discusses Apache Ranger and Apache Atlas for security and governance in Hadoop. It provides an overview of Ranger's centralized authorization and auditing capabilities for Hadoop components using policies. It also describes Atlas' capabilities for metadata management, data lineage, classification using tags, and integrations with Ranger for classification-based security. The document concludes with a demo and Q&A section.
Hortonworks Hybrid Cloud - Putting you back in control of your dataScott Clinton
The document discusses Hortonworks' solutions for managing data across hybrid cloud environments. It proposes getting all data under management, combating growing cloud data silos, and consistently securing and governing data across locations. Hortonworks offers the Hortonworks Data Platform, Hortonworks Dataflow, and Hortonworks DataPlane to provide a modern hybrid data architecture with cloud-native capabilities, security and governance, and the ability to extend to edge locations. The document also highlights Hortonworks' professional services and open source community initiatives around hybrid cloud data.
Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big DataMats Johansson
This document provides an overview of Hortonworks DataFlow, which is powered by Apache NiFi. It discusses how the growth of IoT data is outpacing our ability to consume it and how NiFi addresses the new requirements around collecting, securing and analyzing data in motion. Key features of NiFi are highlighted such as guaranteed delivery, data provenance, and its ability to securely manage bidirectional data flows in real-time. Common use cases like predictive analytics, compliance and IoT optimization are also summarized.
Apache Hadoop Summit 2016: The Future of Apache Hadoop an Enterprise Architec...PwC
Hadoop Summit is an industry-leading Hadoop community event for business leaders and technology experts (such as architects, data scientists and Hadoop developers) to learn about the technologies and business drivers transforming data. PwC is helping organizations unlock their data possibilities to make data-driven decisions.
Learn more: http://hortonworks.com/hdf/
Log data can be complex to capture, typically collected in limited amounts and difficult to operationalize at scale. HDF expands the capabilities of log analytics integration options for easy and secure edge analytics of log files in the following ways:
More efficient collection and movement of log data by prioritizing, enriching and/or transforming data at the edge to dynamically separate critical data. The relevant data is then delivered into log analytics systems in a real-time, prioritized and secure manner.
Cost-effective expansion of existing log analytics infrastructure by improving error detection and troubleshooting through more comprehensive data sets.
Intelligent edge analytics to support real-time content-based routing, prioritization, and simultaneous delivery of data into Connected Data Platforms, log analytics and reporting systems for comprehensive coverage and retention of Internet of Anything data.
The document outlines a presentation about enterprise data science at scale. The agenda includes networking, announcements, a main presentation on introducing data science at scale, building and deploying models collaboratively, training models with all of the data, and putting models to work in streaming applications, followed by Q&A. The main presentation discusses common data science challenges, such as data spread across multiple locations, too many tools, difficulty sharing insights and operationalizing models, and the limitations of desktop environments. It introduces Apache Spark as a distributed processing platform, Jupyter and Zeppelin notebooks, and deploying models as a virtual service. A demo uses customer churn data to train a random forest model to predict churn and deploys it to production to deliver insights.
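The churn demo is only described above, not shown; the following is a hedged sketch of what such a model might look like with the PySpark ML API, where the input path, schema, and column names are invented for illustration.

    # Hypothetical sketch: random forest churn model with Spark ML.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("churn-demo").getOrCreate()
    df = spark.read.csv("/data/churn.csv", header=True, inferSchema=True)  # placeholder path

    label = StringIndexer(inputCol="churned", outputCol="label")           # placeholder column
    features = VectorAssembler(
        inputCols=["tenure_months", "monthly_charges", "support_calls"],   # placeholder columns
        outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[label, features, rf]).fit(train)
    model.transform(test).select("label", "prediction").show(5)

    # The fitted pipeline can then be saved and served behind a REST endpoint,
    # which corresponds to the "deploy as a virtual service" step in the agenda.
    model.write().overwrite().save("/models/churn-rf")                     # placeholder path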
Balancing data democratization with comprehensive information governance: bui...DataWorks Summit
If information is the new oil, then governance is its “safety data sheet.” As demand for data as the raw material for competitive differentiation continues to rise, enterprises face growing challenges in identifying and valuing data and ensuring its appropriate use to extract the right information. To make effective business decisions, organizations need to trust their data so that they can impute the right value and use it for the right purposes while satisfying any organizational or regulatory mandates. A number of analytics and data science initiatives fail to reach their potential because no information governance framework is in place. Robust information governance capabilities can help organizations develop trust in their data and empower them to make decisions confidently.
In this session Sanjeev Mohan, Research Analyst at Gartner, and Srikanth Venkat, Sr. Director of Product Management at Hortonworks, will walk you through an end-to-end architectural blueprint for information governance and best practices for helping organizations understand, secure, and govern diverse types of data in enterprise data lakes.
Speaker
Sanjeev Mohan, Gartner, Research Analyst
Srikanth Venkat, Hortonworks, Senior Director, Product Management
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
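To ground that list, here is a minimal sketch of the kind of settings involved, assuming Spark on YARN in a Kerberized HDP-era cluster; the exact property names vary slightly across Spark versions, and the principal, keytab path, and host values are placeholders.

    # Hypothetical sketch: enabling authentication and wire/disk encryption
    # for a PySpark job. Authorization (which tables a user may read) is
    # enforced by Ranger/Sentry on the underlying services, not set here.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.authenticate", "true")               # SASL auth between Spark processes
            .set("spark.network.crypto.enabled", "true")     # encrypt RPC traffic (Spark 2.2+)
            .set("spark.io.encryption.enabled", "true")      # encrypt shuffle/spill files
            .set("spark.yarn.principal", "analyst@EXAMPLE.COM")                  # placeholder
            .set("spark.yarn.keytab", "/etc/security/keytabs/analyst.keytab"))   # placeholder

    spark = (SparkSession.builder
             .appName("secured-job")
             .config(conf=conf)
             .getOrCreate())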
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year, with the goal of an open, collaborative ecosystem around shared metadata.
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
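As a small, hedged example of such a pipeline (the sample documents and the choice of logistic regression are illustrative, not taken from the deck):

    # Hypothetical sketch: Spark ML text pipeline (tokenize -> TF-IDF -> classify).
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("text-mining").getOrCreate()
    docs = spark.createDataFrame(
        [("spark makes large scale text mining practical", 1.0),
         ("completely unrelated note about lunch plans", 0.0)],
        ["text", "label"])                                   # toy example data

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
        IDF(inputCol="tf", outputCol="features"),
        LogisticRegression(maxIter=10),
    ])
    model = pipeline.fit(docs)
    model.transform(docs).select("text", "prediction").show(truncate=False)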
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations today process various types of data in different formats. Most often this data arrives in free form; as the number of consumers of this data grows, it becomes imperative that this free-flowing data adhere to a schema. A schema gives data consumers a clear expectation of the type of data they are getting and shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being affected when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and so on.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
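The summary above gives no API details, so the sketch below uses a Confluent-compatible registry REST interface as a stand-in to show the register-then-evolve flow; the host, subject name, and Avro schema are invented, and the Hortonworks Schema Registry exposes its own, similar endpoints.

    # Illustrative only: register an Avro schema, then add a backward-compatible
    # new version so existing consumers keep working.
    import json
    import requests

    REGISTRY = "http://schema-registry:8081"    # placeholder host/port
    subject = "truck-events-value"              # hypothetical subject name

    v1 = {"type": "record", "name": "TruckEvent",
          "fields": [{"name": "truck_id", "type": "string"},
                     {"name": "speed", "type": "int"}]}
    requests.post(f"{REGISTRY}/subjects/{subject}/versions",
                  json={"schema": json.dumps(v1)})

    # Evolution: a new optional field with a default keeps the change
    # backward compatible, so consumers on v1 are not impacted.
    v2 = dict(v1, fields=v1["fields"] + [
        {"name": "geo_hash", "type": ["null", "string"], "default": None}])
    resp = requests.post(f"{REGISTRY}/subjects/{subject}/versions",
                         json={"schema": json.dumps(v2)})
    print("registered schema id:", resp.json())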
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but with massive amounts of data, training a new model can take hours. That is a problem when the model needs to stay up to date: when recommending TV programs while they are being transmitted, for example, the model should take into account the users watching a program at that moment.
The promise of online recommendation systems is fast adaptation to change, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used to detect anomalies in IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines such as Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current deep learning research treats step-motherly. As the demo shows, LSTM networks can learn very complex system behavior; in this case the data comes from a physical model simulating bearing vibration. One drawback of deep learning is that a very large labeled training data set is normally required. This is particularly interesting because we show how unsupervised machine learning can be used in conjunction with deep learning, so no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source; only open-source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be built around actual use cases that cut across components. The system tests can generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives relative to actual defects is higher, and chasing them is generally wasteful.
At Hortonworks, we have designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into a recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
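One of the points above, intelligent key design, is easiest to see in code. Below is a hedged sketch of a salted, reverse-timestamp row key for time-series writes, using the happybase client against an HBase Thrift gateway; the table, column family, and host names are placeholders.

    # Hypothetical sketch: salt the row key so sequential timestamps do not
    # all land on one region (hot-spotting), and reverse the timestamp so the
    # newest readings for a sensor sort first.
    import hashlib
    import time
    import happybase

    SALT_BUCKETS = 16
    LONG_MAX = (1 << 63) - 1

    def salted_key(sensor_id: str, ts_millis: int) -> bytes:
        # deterministic salt prefix derived from the sensor id
        bucket = int(hashlib.md5(sensor_id.encode()).hexdigest(), 16) % SALT_BUCKETS
        reverse_ts = LONG_MAX - ts_millis
        return f"{bucket:02d}|{sensor_id}|{reverse_ts}".encode()

    conn = happybase.Connection("hbase-thrift-host")     # placeholder host
    table = conn.table("sensor_readings")                # placeholder table
    table.put(salted_key("sensor-42", int(time.time() * 1000)),
              {b"d:value": b"21.7"})                     # placeholder column family "d"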
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable a 2X throughput increase for the Capacity Scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently supports only MySQL Cluster as a backend. Hops opens up new directions for Hadoop once metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the single-use-case or PoC phase, where data governance, as far as backup and disaster recovery (BDR) is concerned, is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
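To make the snapshot caveats above tangible, here is a minimal sketch of the built-in HDFS snapshot workflow driven from Python via the standard hdfs/hadoop CLIs; the directory, cluster, and snapshot names are placeholders, and as the talk stresses, a snapshot alone is not a complete backup strategy.

    # Hypothetical sketch: take a point-in-time HDFS snapshot and copy it
    # off-cluster with distcp for disaster recovery.
    import subprocess
    from datetime import datetime

    data_dir = "/data/warehouse"                                   # placeholder path
    snap_name = "daily-" + datetime.now().strftime("%Y%m%d")

    # One-time (admin): mark the directory as snapshottable.
    subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", data_dir], check=True)

    # Recurring: create the named snapshot. Note the caveat above: files that
    # are still open for write are not frozen by this operation.
    subprocess.run(["hdfs", "dfs", "-createSnapshot", data_dir, snap_name], check=True)

    # Copy the snapshot to a DR cluster so it survives a datacenter failure.
    subprocess.run(["hadoop", "distcp",
                    f"hdfs://prod-nn:8020{data_dir}/.snapshot/{snap_name}",   # placeholder
                    f"hdfs://dr-nn:8020/backups/{snap_name}"], check=True)    # placeholder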
#8: TALK TRACK
Open Enterprise Hadoop enables trusted governance, with:
Data lifecycle management along the entire lifecycle
Modeling with metadata, and
Interoperable solutions that can access a common metadata store.
[NEXT SLIDE]
SUPPORTING DETAIL
Trusted Governance
Why this matters to our customers: As data accumulates in an HDP cluster, the enterprise needs governance policies to control how that data is ingested, transformed and eventually retired. This keeps those Big Data assets from turning into big liabilities that you can’t control.
Proof point: HDP includes 100% open source Apache Atlas and Apache Falcon for centralized data governance coordinated by YARN. These data governance engines provide those mature data management and metadata modeling capabilities, and they are constantly strengthened by members of the Data Governance Initiative. The Data Governance Initiative (DGI) is working to develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. The DGI coalition includes Hortonworks partner SAS and customers Merck, Target, Aetna and Schlumberger. Together, we assure that Hadoop:
Snaps into existing frameworks to openly exchange metadata
Addresses enterprise data governance requirements within its own stack of technologies
Citation: “As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard.” | http://hortonworks.com/press-releases/hortonworks-establishes-data-governance-initiative/
#12: Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance initiative and others from the Hadoop community. The core functionality defined by the project includes the following:
Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop
Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
Security and Policy Engine – implement engines to protect and rationalize data access according to compliance policy
#13: Show – clearly identify customer metadata. Change:
Add a customer classification example – Aetna – to give the use case story continuity. Use DX procedures for diagnosis
** bring metadata from external systems into Hadoop – keep it together
#15: Show – clearly identify customer metadata. Change:
Add a customer classification example – Aetna – to give the use case story continuity. Use DX procedures for diagnosis
** bring metadata from external systems into Hadoop – keep it together
#18: - Learn about who our users are and what their needs are, to validate that we are solving the right problem
Open-ended half-hour discussions about processes, challenges and current tools
We record the interviews so that we can focus on the conversation and analyze them afterward
#19: - Test our prototype in InVision, a click-through prototyping tool
- Walk users through scenarios and watch how they respond
- Remind our participants that we aren’t testing them, we’re testing the design and encourage thinking aloud
#20: - Re-watch recordings and capture verbatim quotes on sticky notes
- Affinity mapping
- Group feedback into categories and look for trends and insights
- For this project we translated our sticky notes into Trello to share with the team remotely. We starred the stickies that represented common themes and valuable insights.
#21: Was the product well understood?
Is the product something they would use?
Where is the value?
#38: Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance initiative and others from the Hadoop community. The core functionality defined by the project includes the following:
Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop
Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
Security and Policy Engine – implement engines to protect and rationalize data access according to compliance policy