NAVIGATING THE WORLD OF
USER DATA MANAGEMENT AND DATA
DISCOVERY
SMITI SHARMA, VIRTUSTREAM - EMC
About Me
2
Principal Engineer & Lead,
Big Data Cloud, Virtustream
 Oracle – Principal Engineer & PM team
 EMC – Big Data Lead
 Pivotal – Global CTO
 Virtustream EMC
Areas of expertise
 Architecting, developing and managing Mission Critical
Transactional and Analytics Platforms
 RDBMS & NoSQL – Hadoop Platforms
 Product Management and Development
 ODPi RT member
Sessions at Hadoop Summit 2016
 Wed: 11:30am Room 211: User Data Management in HD
 Wed: 4:10pm Room 212: Building PAAS for Smart cities
Smiti.sharma@virtustream.com
@smiti_sharma
About Virtustream
 Enterprise-Class Cloud Solutions
− Cloud IAAS, PAAS
− Cloud Software Solutions
− Cloud Managed Services
− Cloud Professional Services
 Developer of xStream Cloud Management Platform SW
 Inventor of the µVM™ (MicroVM) Cloud Technology
 Industry-leading Cloud offerings in areas of
− SAP Landscape, HANA
− Storage
− Big Data
 Close partnerships with SAP, EMC, VMWare
 Service Provider to 2,000 Global Workloads
Global Footprint
Agenda
 Data Management Overview
 Project Background & Context
 High level Architecture and data flow
 Solution criteria
 Evaluation criteria of Data Management Tools
 Differentiation factors
 Proposed Solution
 Conclusions
User Data Landscape (diagram)
Master Data Management encompasses:
- Metadata (Business, Technical, Operational)
- Reference Data
- Transactional Data
Driving Factors for Data Management
• IT custodian of business data
• Data Characteristics
– Business Value
• Analytical vs. transactional systems
– Volume and Volatility
– Complexity of data types and formats
– Adaptive feedback from IT to Business
– Reusability factor – across different teams
– De-duplication factor
7
What is Master Data Management?
MDM is an organizational approach to managing data as a corporate asset.
It is an application framework and collection of tools implemented with the business goal of providing, managing and using data intelligently and productively.
Multi-Domain & Organizational MDM
Metadata Data
Reference
Data
Transactional
Data
9
Domain 1: Product
Master Data Metadata Data
Reference
Data
Transactional
Data
Domain 3: Logistics
Master Data
Metadata Data
Reference
Data
Transactional
Data
Domain 2: Supply Chain
Master Data
Universal Ideology... somewhat.
Data Management for Hadoop: Why, and At What Stage
[Fig 1: Source TDWI – data management for Hadoop, today vs. in three years, with growth indicator]
[Fig 2: Source TDWI – primary strategy to improve quality of data managed in Hadoop]
Traditional MDM Challenges
• Uni-directional movement of data
• Static and limited identification patterns
• Focused mainly on Transactional systems – data type/Hadoop OSS integration limited
• Non-adaptive solutions to rapidly changing “schema”
• Limiting performance
Building a “Data Management Layer” for a Hadoop Data Lake
Project(s) and Context
• Project initiated at two Large Retailers
• Goal to extend the analytical Data Lake
– As of Late 2015 Data Lake built only for Analytics
– Pulls data from Transactional, ERP, POS systems
– Implemented using ODPi (Pivotal/Hortonworks) Distribution and Greenplum for MPPDB
• Next Generation Data Lake
– Current ETL system reaching performance and scale limits → Move ETL into Hadoop
– Move BIDW and Transactional reporting to Hadoop
– Increase users on this system – Security and Quality constraints
– In-store SKU count ~ 500 Million ; Online SKU count ~ 5 Million
• Complex Master Data Management around existing systems
– For Hadoop – the EIM integration didn’t exist and/or processes were not in place
– Little to no interest from EIM data integration team
14
Key Problem Statement (at least 1 of them!)
Evaluate and Prototype the
Data Management Strategy and Product(s)
to enhance and enrich the
“Next Generation Data Lake”
15
High Level Logical Data Architecture (diagram)
• Data Sources (Raw)/Aggregated*: Inventory data, Logistics, Product/Vendor data
• Ingest layer feeding the Data Fabric / “Landing Zone”: Object store, HDFS, MPP DB, RDBMS, NoSQL/NewSQL, with a Processing Framework for in-memory processing
• Data Management: Metadata Management (ingestion and indexing) into a Metadata repository, Policy management and Business Rules Engine
• Enterprise Security Framework (AD/LDAP)
• Query/Access and Visualization Layer
– API to access data sources
– Interfaces with the Metadata Repository to define the data query path
– Potentially custom portals for user queries as well as standard tools
LEGEND: Data ingest to persistence or memory layer; Federated query; Ingest to Metadata management layer; Cross-reference for rules, policies and Metadata
Solution Requirements
• Inherent Data processing requirements
• Incoming data from sources e.g. Kafka, Storm, Sqoop, Spark
• Be able to manage complex data types e.g. Video files from POS
• Data placement based on priority and sensitivity – memory or disk
• Handling both Synchronous and Async (In-band and out-of-band)
• Integration with existing EIM tools
• Performance requirements
• Increasing ingest volume of data and expanding sources
• Varied Data Type support and considerations
17
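As an illustration of the ingest and data-placement requirements above, here is a minimal sketch, assuming the kafka-python client and hypothetical topic, broker and path names; it is not the project's actual pipeline.

# Priority-based placement at ingest time: records tagged as high priority or
# sensitive stay in an in-memory buffer for the processing framework; the rest
# land in the disk/HDFS landing zone.
import json
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "pos-events",                          # hypothetical topic
    bootstrap_servers="broker:9092",       # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

hot_buffer = []                            # stand-in for an in-memory data grid

for msg in consumer:
    record = msg.value
    if record.get("priority") == "high" or record.get("sensitivity") == "pii":
        hot_buffer.append(record)          # keep in memory for low-latency access
    else:
        # append to a landing-zone file; a real pipeline would write to HDFS
        with open("/landing_zone/pos-events.jsonl", "a") as out:
            out.write(json.dumps(record) + "\n")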
File Format Considerations

| File format type | Embedded Metadata | Compression & Splittable | HQL/SQL interface viability | Popularity in current and new landscape | Support for Schema evolution |
| CSV/Text | No | No^ | Hive/HAWQ | Most common | Limited |
| Avro | Yes | Yes | Hive | Increasing footprint | Yes |
| JSON | Yes | No^ | Hive/MongoDB | Increasing footprint | Yes |
| RC Files | Limited | Not as well | Hive (RW) | Yes | No |
| ORC Files | No | Yes | Hive (RW), Impala (R) | Yes | No |
| Sequence Files (binary format) | No | Yes | Hive (RW), Impala (R) | None today | Limited |
| Parquet | Yes | Yes | Yes – Hive and Impala | Increasing footprint | Limited |

Other considerations:
• Read/Write performance
• Source, application and development effort and support
• Hierarchical model
(A short PySpark sketch of the embedded-schema difference follows the table.)
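To make the embedded-metadata and schema-evolution columns concrete, here is a minimal PySpark sketch; paths are illustrative, a Spark session is assumed, and reading Avro requires the spark-avro package. CSV needs its schema declared or inferred, while Parquet and Avro carry their schema with the data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

# CSV carries no embedded schema: columns must be declared or inferred.
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/raw/inventory.csv"))          # hypothetical path

# Parquet embeds the schema in the file footer, so the reader recovers
# column names and types directly, with no external definition.
parquet_df = spark.read.parquet("/data/curated/inventory.parquet")
parquet_df.printSchema()

# Avro also embeds its schema and supports schema evolution: new optional
# fields with defaults can be added without rewriting older files.
avro_df = spark.read.format("avro").load("/data/curated/inventory.avro")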
Key Evaluation and Selection
Criteria
20
Initial Challenge
• Too many tools to choose from
• Each claimed to be a Metadata management tool
• Each claimed security and integration features
• Resistance from the EIM team when initially involved
• Translating Data Management Ideology to tasks of
evaluation
21
Project Approach
• Build a list of KPIs to evaluate the tools (a simple weighted scoring sketch follows this list)
• Worked with the EIM team (best-practices advice & SME engagement) and with the business and IT teams supporting the Data Lake project
• Vendor Identification – a list of 5
• Implementation
• Minimized the scope of the project
• Decided to tackle integration with legacy EIM at a later date
• After evaluation, focused on implementing no more than 2 Data management tools for the Next-Gen Data Fabric Platform
22
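To keep the evaluation quantitative rather than subjective, a weighted KPI scoring matrix can be kept as simple as the sketch below. The criteria, weights and scores are illustrative placeholders, not the customer's actual KPIs or the vendors' results.

# Weighted KPI scoring: each criterion carries a weight, each vendor gets a
# score per criterion, and vendors are ranked by weighted total.
weights = {
    "metadata_curation": 0.35,
    "lineage_versioning": 0.20,
    "integration": 0.25,
    "performance_ease_of_use": 0.20,
}

scores = {  # illustrative 1-5 scores only
    "Vendor A": {"metadata_curation": 4, "lineage_versioning": 3,
                 "integration": 4, "performance_ease_of_use": 5},
    "Vendor B": {"metadata_curation": 5, "lineage_versioning": 4,
                 "integration": 3, "performance_ease_of_use": 3},
}

def weighted_total(vendor_scores):
    return sum(weights[k] * v for k, v in vendor_scores.items())

for vendor, s in sorted(scores.items(), key=lambda kv: -weighted_total(kv[1])):
    print(f"{vendor}: {weighted_total(s):.2f}")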
Metadata Curation and Management (1/2): Data Profiling
• Define Business Metadata (is reference data available within the tool or outside?)
• Automation and flexibility in crawling HDFS and understanding the various formats
– Range of file formats supported
– Reading each file to extract metadata
– Both for data already persisted and for incoming new files in real time
– Cross-reference with a lookup or repository of pre-existing classes and profiles
– Maturity of attaching context or facets to the atomic data
– Ability to retrieve descriptive and structural Metadata even with no Metadata within the content
• Storing the profiled data – actual data and metadata – in a repository
• Custom tagging as well as recognizing Metadata information
• Translation and integration with industry certifications and models
(A stdlib-only crawling/profiling sketch follows.)
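A minimal, standard-library-only sketch of the crawl-and-profile requirement above, assuming the files are reachable through a mounted or staged local path (a production crawler would use an HDFS client). It detects formats from magic bytes, so some descriptive metadata can be recovered even when none is declared alongside the file.

import os

def detect_format(path):
    with open(path, "rb") as f:
        head = f.read(4)
    if head == b"PAR1":
        return "parquet"          # Parquet files start with the PAR1 magic
    if head[:3] == b"Obj":
        return "avro"             # Avro object container files start with 'Obj'
    if head[:1] in (b"{", b"["):
        return "json"
    return "text/unknown"

def crawl(root):
    profile = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            profile.append({
                "path": full,
                "format": detect_format(full),
                "size_bytes": os.path.getsize(full),
            })
    return profile   # would be persisted to the metadata repository

print(crawl("/staging/landing_zone"))   # hypothetical staging path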
Metadata Curation and Management (2/2): Data Classification
• Ability to classify data based on user-defined categories
– Search/crawl and identification (“Facet Finder”) and efficiency of the internal repository
– Presence of Data Models, if any
– Features around custom Metadata and Tagging
• Once classified, ability for Metadata information to be indexed and searchable through APIs or Web Interfaces
– Efficiency of search and indexing
– Richness of integration with NLP toolkits (NLTK)
• Data Remediation
• Data Archiving and policy implementation
• Notification: configurable triggers based on user-defined criteria
(A facet-based classification sketch follows.)
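A minimal sketch of user-defined classification and searchable metadata: regex “facets” tag sampled values, and a small inverted index makes the resulting tags searchable. The facet patterns and sample values are illustrative only.

import re
from collections import defaultdict

facets = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "sku":   re.compile(r"^SKU-\d{6,}$"),
    "zip":   re.compile(r"^\d{5}(-\d{4})?$"),
}

def classify(value):
    # return the list of user-defined categories that match this value
    return [name for name, pattern in facets.items() if pattern.search(value)]

index = defaultdict(set)      # tag -> set of "dataset.column" identifiers

def ingest_field(dataset, column, sample_values):
    for v in sample_values:
        for tag in classify(v):
            index[tag].add(f"{dataset}.{column}")

ingest_field("pos_sales", "customer_email", ["jane@example.com"])
ingest_field("inventory", "item_code", ["SKU-000123"])

print(index["email"])   # fields classified as containing email addresses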
Lineage and Versioning
• Be able to identify the origin of data – i.e. from
– Transactional systems, Dump files, Another HDFS file, Repository etc.
– Level of depth of data origination and lineage
• Ability of the solution to sense and preserve Metadata versions around a given entity during the capture process and afterwards
• Ability to support deduplication using the Entity’s metadata (see the sketch below)
– On the fly, without impacting performance
25
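A minimal sketch of how lineage and metadata versioning can be modeled: each entity records an origin, an ordered list of metadata versions, and a content hash so duplicate payloads are detected on the fly. The field names are illustrative, not any vendor's actual model.

import hashlib
import time
from dataclasses import dataclass, field

@dataclass
class MetadataVersion:
    captured_at: float
    attributes: dict
    content_hash: str

@dataclass
class Entity:
    name: str
    origin: str                     # e.g. "oracle:erp.orders" or "hdfs:/landing/file1"
    versions: list = field(default_factory=list)

    def add_version(self, attributes, payload: bytes):
        digest = hashlib.sha256(payload).hexdigest()
        if self.versions and self.versions[-1].content_hash == digest:
            return False            # duplicate payload detected, skip (dedup)
        self.versions.append(MetadataVersion(time.time(), attributes, digest))
        return True

e = Entity("orders_2016_06", origin="oracle:erp.orders")
e.add_version({"format": "parquet", "columns": 42}, b"...file bytes...")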
Integration
• Ability to integrate its metastore with enterprise MDM/EIM systems
– Maturity of Metadata Entity Readers (Input/Output) Artifacts from Metastore
– Bi-directional API for other tool integration to identify lineage
– Bi-directional API for other tool integration for SIEM threat assessment and detection
– While maintaining user and security context
• Integration with the various tools of Ingestion, Transformation & Consumption
– Spark, Storm, Kafka, Informatica, Data Stage etc.
• Integration with security tools – LDAP, ACLs, encryption
• Rules and Policy engine
26
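A minimal sketch of the bi-directional integration requirement: pushing a metadata entity to an enterprise EIM/MDM REST endpoint and pulling lineage back, while carrying the calling user's security context. The URL, token and payload shape are hypothetical.

import requests

BASE = "https://eim.example.com/api/v1"        # hypothetical EIM endpoint
HEADERS = {"Authorization": "Bearer <token>",  # user/security context
           "Content-Type": "application/json"}

entity = {"name": "pos_sales", "type": "hive_table", "tags": ["pii", "retail"]}

# push the entity's metadata out to the enterprise catalog
resp = requests.post(f"{BASE}/entities", json=entity, headers=HEADERS, timeout=30)
resp.raise_for_status()

# pull lineage for the same entity back into the data-lake metadata layer
lineage = requests.get(f"{BASE}/entities/pos_sales/lineage",
                       headers=HEADERS, timeout=30).json()
print(lineage)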
Performance, Accuracy and Ease of Use
• Sample visualization of Metadata with Native Reporting tools & others
• Ability to process compressed and encrypted files
• Level of Error and exception handling built in during all processes
• Impact on performance from
– Crawling, scanning and profiling
– Classification & transformation
• Enable notifications of data availability - how customizable are they?
• Self-service discovery portal leveraging curated artifacts
27
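A minimal sketch of configurable, user-defined notification triggers for data availability (the "how customizable are they?" question above). The criteria and the notify() action are placeholders; a real implementation would send email or webhook notifications.

# Each trigger pairs a name with a user-defined predicate over the event.
triggers = [
    {"name": "new_pos_files",
     "when": lambda event: event["dataset"] == "pos_sales" and event["new_files"] > 0},
    {"name": "large_drop",
     "when": lambda event: event["size_gb"] > 100},
]

def notify(trigger_name, event):
    print(f"[notify] {trigger_name}: {event}")     # stand-in for email/webhook

def on_data_available(event):
    for t in triggers:
        if t["when"](event):
            notify(t["name"], event)

on_data_available({"dataset": "pos_sales", "new_files": 12, "size_gb": 3.4})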
Some of the notable Vendors evaluated
• Attivio
• Global ID
• Waterline Data
• Zaloni
• Adaptive Inc.
28
At the time of this study, Apache Falcon and Apache Ranger were new; little analysis was done on these products.
Vendor Evaluation Scoreboard (Template)
29
Vendor Evaluation Summary Results

| Metadata curation and management | Lineage and versioning | Integration | Performance, Accuracy and Ease of use |
| Attivio | Global ID | Zaloni | Attivio |
| Global ID | Attivio | Attivio | Zaloni |
| Waterline Data | Waterline Data | Global ID | Waterline Data |
| Zaloni | Zaloni | Waterline Data | Global ID |

Evaluation Summary
• All tools had satisfactory features overall, with emphasis in 1 or 2 areas.
• Your choice of tools needs to align with Business and User Requirements.
• Waterline: automated data discovery, self-service
• Attivio: data curation – discovery, search, flexibility of tagging; performant and scalable
• Global ID: efficient at mapping logical models, overlapping data identification and pattern matching
• Zaloni: notable interface for data mapping and flow, integration with external tools

CAVEAT: Based on criteria driven by customer needs. Your own evaluation, and updates from vendors, will affect the results.
High Level Logical Data Architecture (recap of the diagram shown earlier, repeated before the proposed tool mapping)
High Level Logical Data Architecture with selected tools (diagram)
• Data Sources (Raw)/Aggregated*: Inventory data, Logistics, Product/Vendor data
• Ingest: Flume/Kafka/Spring XD into the Data Fabric / “Landing Zone”: Object store, HDFS, MPP DB, RDBMS, NoSQL/NewSQL
• Processing Framework: Apache Spark / GemFire
• Metadata Management: Attivio, Global ID, feeding the Metadata repository, Policy management and Business Rules Engine
• Enterprise Security Framework (AD/LDAP)
• Access: Custom Portal / other evaluations (TBD)
– API to access data sources
– Interfaces with the Metadata Repository to define the data query path
– Potentially custom portals for user queries as well as standard tools
LEGEND: Data ingest to persistence or memory layer; Federated query; Ingest to Metadata management layer; Cross-reference for rules, policies and Metadata
Key Takeaways
 Market
− Metadata Mgmt tools in the market are still evolving for Data Lake architectures
− Ever-growing and rich partner ecosystem
− Hadoop does not offer a sufficient policy engine or action framework
 Customer
− Choice of tool is IT- and business-driven. Sponsorship is important!
− To drive adoption, an easy-to-use and intuitive product is a must
− Balancing multi-vendor and functionality: limit the number of tools to 3
− Recommendation: use Information Management Professional Services with the selected tool(s)
 Process
− Evaluation of the tools: reviews and demos of the tools versus a full-fledged POC
− Build an adaptive matrix of KPI measurements, customized to your organization; unless quantified, the evaluation will be very subjective
− Beware of the trap: analysis paralysis
− Multiple business units drive this decision
− Functionality scope: workflows, ETL processes and integration, or pure-play data management
− Integration with existing EIM tools was deferred as a priority: a huge part of the success
− Investment/Cost: existing tools, level of effort and implementation
References
• References were made to the following documents:
– TDWI: Hadoop for the Enterprise
– The MDM Institute
• Acknowledgements to the following contributors for additional work:
– EMC IT Team
– The customer’s IT team, for prototyping along with EMC Field resources
37