The Power of Data
May 2019
About me
• BI, Data Warehousing and Big Data Evangelist since 1983.
• Before joining Ultimate, I was Chief Big Data Architect at Visa, and before that I was VP of Data Architecture at Fidelity Investments.
• My first job was with Bob Earle, “the father of OLAP”.
• I worked in the Finance group at Coors Brewing Company, where we created some of the first data warehouses.
• I have given many presentations at IOUW, RMOUG, TDWI, Collaborate, Gartner Group, Oracle Open World and the BI Summit. I have also given Metadata and Data Governance presentations for HIMSS.
• I have a degree in Statistics, an MBA in Finance and a Master's in Computer Science.
• I have authored Oracle Essbase & Oracle OLAP: A Guide to Oracle’s Multidimensional Solutions, published by Oracle Press, and Oracle Data Warehousing, published by SAMS.
Agenda
• Challenge
• Analytics – What is it?
• The Power of Data
• Data Governance
• Solutions
• The Data Lake – Cloudera – HDP 3.1
• LLAP and Vectorization
• DataPlane – Ranger and Atlas – DSS and DLM
"Good judgment comes from
experience, and a lot of that comes
from bad judgment." - Will Rogers
You keep using that word. I
do not think it means what
you think it means.
What do you mean by “analytics”?
Challenge – Analytics and Data Governance
There are two parts to
“analytics”
• The mathy stuff
• The query & reporting stuff
With Analytics I can Predict Behavior
Benford’s Law
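Benford's Law says that in many naturally occurring numeric datasets the leading digit d appears with probability log10(1 + 1/d), so a leading 1 shows up about 30% of the time. A minimal sketch of how that can be checked follows; the function names and the sample data are hypothetical and not taken from the slides.

```python
import math
from collections import Counter


def benford_expected(d):
    """Expected share of values whose leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)


def leading_digit(value):
    """First significant digit of a non-zero number."""
    digits = f"{abs(value):.15e}"  # scientific notation, e.g. '4.539...e+03'
    return int(digits[0])


def leading_digit_shares(values):
    """Observed share of each leading digit among the non-zero values."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical sample: powers of 2 are known to follow Benford's Law closely.
sample = [2 ** n for n in range(1, 200)]
observed = leading_digit_shares(sample)

for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

A large gap between the observed and expected shares is a reason to look closer at the data, not proof of a problem.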
Tesla and LinkedIn Think Resumes Are Overrated.
They Use Neuroscience-Based Games Instead
www.inc.com/kevin-j-ryan/pymetrics-replacing-resumes-with-brain-games.html
That's the philosophy touted by Frida Polli, co-founder and CEO of hiring startup
Pymetrics. The company makes games meant to determine whether a candidate
would be a good fit in a specific role at your company. Polli says that so far, the
platform has been more effective at finding the right hires than traditional resumes.
The results have been promising. Polli says that some companies have more than
doubled the percentage of candidates they hire out of those they invite for in-person
interviews. One-year retention rates have increased by between 30 and 60 percent.
And companies are reporting that job performance has improved among newly hired
candidates.
How to tell if someone will repay a loan?
What’s the smartest way to predict loan payback?
What is a Data Lake?
A single place to store every type of data in its native format with no fixed limits on account size or file
size, high throughput to increase analytic performance and native integration with the Hadoop
ecosystem.
An architectural shift in the BI World that uses Hadoop to deliver deep insight across a large,
broad, diverse set of data at efficient scale.
The primary view of self-service BI is publishing data
The Power of Data
Find Any Business Data in Sub-second
• Each CPU scans local in-memory columns
• Scans use super-fast SIMD vector instructions
• Billions of rows/sec scan rate per CPU core
GDPR – What is it?
• In effect since May 25, 2018
• Potential penalty per infraction: 4% of annual global turnover or €20M, whichever is greater
• Global impact
5 Key General Data Protection Regulation Obligations
• Rights of EU Data Subjects
• Security of Personal Data
• Consent
• Accountability of Compliance
• Data Protection by Design and by Default
www.eugdpr.org
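The "4% or €20M" figure is an upper bound: for the higher fine tier, GDPR caps the penalty at the greater of €20 million or 4% of total worldwide annual turnover. A minimal sketch of that arithmetic, with a hypothetical function name and made-up turnover figures:

```python
def gdpr_max_fine_eur(annual_global_turnover_eur):
    """Upper bound of the higher GDPR fine tier, in euros.

    The cap is the greater of EUR 20 million or 4% of turnover.
    """
    four_percent = annual_global_turnover_eur * 4 / 100
    return max(20_000_000, four_percent)


# Hypothetical companies: EUR 100M and EUR 1B in annual global turnover.
print(gdpr_max_fine_eur(100_000_000))
print(gdpr_max_fine_eur(1_000_000_000))
```

For the smaller company the €20M floor applies; for the larger one the 4% term dominates.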
Pillars of our comprehensive Data Governance Solution
• Identity: Guarding access to the cluster itself. Technical concepts: Authentication, Network isolation.
• Discovery: Finding Data Assets and Definitions. Technical concepts: Business Glossary, Technical Glossary and Search.
• Access: Defining what users and applications can do with data. Technical concepts: Data Policies, Authorization.
• Data Protection: Protecting data in the cluster from unauthorized visibility. Technical concepts: Encryption, tokenization, data masking.
• Visibility: Reporting on where data came from and how it’s being used. Technical concepts: Auditing, Lineage.
Tool named on the slide: Knox
Pillars of our comprehensive Data Governance Solution: tools named on the slide
• Knox
• Knox/Active LDAP Kerberos
• Ranger
• DataPlane & Atlas
• Hardware, File and Column Encryption
ACCESS - Establish and Implement Data Policies
▪ Accomplish: Manage and automate the information lifecycle from ingestion to purge, cradle to grave, based on the unified metadata catalog
- Role-Based Authorization
- Allow an Analyst to see PII data but not a Developer
- Allow for Masking of Data
- Allow for automated enforcement of Data Retention Policies, such as 7 days in Kafka
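The policy goals above can be sketched in a few lines. This is only an illustration under stated assumptions: the policy tables, role names and function names are hypothetical, and a real deployment would keep such rules in a policy engine rather than in application code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy tables, mirroring the examples on the slide.
PII_READERS = {"analyst"}                    # roles allowed to read PII columns
RETENTION = {"kafka": timedelta(days=7)}     # retention period per data store


def can_read_pii(role):
    """Role-based authorization: only listed roles may see PII."""
    return role.lower() in PII_READERS


def is_expired(store, created_at, now):
    """True when a record has outlived the retention period for its store."""
    return now - created_at > RETENTION[store]


now = datetime(2019, 5, 20, tzinfo=timezone.utc)
print(can_read_pii("Analyst"), can_read_pii("Developer"))
print(is_expired("kafka", now - timedelta(days=8), now))
```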
Dynamic Row Filtering & Column Masking: Apache Ranger with Apache Hive

Source table: ww_customers
Country  National ID  CC No             DOB        MRN         Name           Policy ID
US       232323233    4539067047629850  9/12/1969  8233054331  John Doe       nj23j424
US       333287465    5391304868205600  8/13/1979  3736885376  Jane Doe       cadsd984
Germany  T22000129    4532786256545550  3/5/1963   876452830A  Ernie Schwarz  KK-2345909

Ranger Policy Enforcement
Query Rewritten based on Dynamic Ranger Policies: Filter rows by region & apply relevant column masking

User 1: Joe
Location: US
Group: Analyst
Original Query:
SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Result:
Country  National ID  CC No                MRN   Name
US       xxxxx3233    4539 xxxx xxxx xxxx  null  John Doe
US       xxxxx7465    5391 xxxx xxxx xxxx  null  Jane Doe
Users from US Analyst group see data for US persons with CC and National ID (SSN) as masked values and MRN is nullified

User 2: Ivanna
Location: EU
Group: HR
Original Query:
SELECT country, nationalid, name, mrn FROM ww_customers
Result:
Country  National ID  Name           MRN
Germany  T22000129    Ernie Schwarz  876452830A
EU HR Policy Admins can see unmasked but are restricted by row filtering policies to see data for EU persons only

Groups shown on the slide: Analysts, HR, Marketing
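The effect of the two policies can be reproduced with a small sketch. It applies a row filter and per-column masks to in-memory rows; it does not use Ranger or Hive, which enforce the same idea by rewriting the query. The policy table, the EU country set and all function names are hypothetical.

```python
# Rows copied from the slide's ww_customers example (DOB and Policy ID omitted).
CUSTOMERS = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]

# Assumption: which countries count as "EU" for the row filter.
EU_COUNTRIES = {"Germany"}


def show_last4(value):
    """Mask everything except the last four characters."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first4_card(value):
    """Keep the first four card digits and mask the remaining groups."""
    return value[:4] + " xxxx xxxx xxxx"


def nullify(_value):
    """Replace the value with null."""
    return None


# Hypothetical policy table keyed by (location, group).
POLICIES = {
    ("US", "Analyst"): {
        "row_filter": lambda row: row["country"] == "US",
        "masks": {"nationalid": show_last4, "ccnumber": show_first4_card,
                  "mrn": nullify},
    },
    ("EU", "HR"): {
        "row_filter": lambda row: row["country"] in EU_COUNTRIES,
        "masks": {},  # HR sees unmasked values
    },
}


def run_query(columns, location, group):
    """Return the selected columns after row filtering and column masking."""
    policy = POLICIES[(location, group)]
    result = []
    for row in CUSTOMERS:
        if not policy["row_filter"](row):
            continue
        masked = {}
        for col in columns:
            mask = policy["masks"].get(col)
            masked[col] = mask(row[col]) if mask else row[col]
        result.append(masked)
    return result


joe = run_query(["country", "nationalid", "ccnumber", "mrn", "name"], "US", "Analyst")
ivanna = run_query(["country", "nationalid", "name", "mrn"], "EU", "HR")
```

Both users query the same table; only the policy attached to their location and group differs.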
Visibility - Apache Ranger Audits - Data Access
⬢ Comprehensive scalable audit logging
⬢ Audits for:
  ⬢ Resource Access Events with user context
  ⬢ Policy Edits/Creation/Deletion
  ⬢ User session information
  ⬢ Component plugin policy sync operations
Tag (Classification) Based Masking
Masking Policy
For any Hive columns tagged as containing PII:
• Allow HR to see data in the clear for any type of
PII
• Apply ‘Nullify’ mask to columns classified as
type ‘MRN’ for Analysts
• Apply ‘Hash’ as masking option to columns
classified as type ‘Password’
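A tag-based policy keys the mask on a column's classification rather than on its name, so newly tagged columns pick up the rule automatically. The sketch below is an illustration only: the tag assignments and function name are hypothetical, SHA-256 stands in for whatever hash the masking option actually uses, and applying the hash to every group except HR is an assumption the slide does not spell out.

```python
import hashlib

# Hypothetical classifications, as a metadata catalog might assign them.
COLUMN_TAGS = {"mrn": "MRN", "password": "Password", "nationalid": "PII"}


def mask_value(column, value, group):
    """Apply the slide's three rules based on the column's tag."""
    tag = COLUMN_TAGS.get(column)
    if tag is None or group == "HR":
        return value                      # untagged, or HR sees data in the clear
    if tag == "MRN" and group == "Analyst":
        return None                       # 'Nullify' mask
    if tag == "Password":
        # 'Hash' mask; SHA-256 is an assumption for illustration.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    return value


print(mask_value("mrn", "8233054331", "HR"))
print(mask_value("mrn", "8233054331", "Analyst"))
```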
CONSUMABILITY: Understand shape of Hive
column data with statistical profiler, example:
Profile shows box plot and histogram for
distribution of column values
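The numbers behind a box plot and a histogram are easy to compute for a single column. A minimal sketch using only the Python standard library follows; the function names and sample values are hypothetical and this is not the profiler the slides refer to.

```python
import statistics


def box_plot_stats(values):
    """Five-number summary that a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def histogram(values, bins):
    """Counts per equal-width bin; the maximum falls in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical column values.
column = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = box_plot_stats(column)
bins = histogram(column, 4)
```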
Data Steward Studio (DSS)
DataPlane DSS - Understanding
DataPlane DSS – Where is my PII?
DataPlane DSS – What Tables are Accessed?
DataPlane DSS – PII Trends
DataPlane DSS – Data Lineage
DataPlane DLM as Backup and DR
Questions and Answers
Q&A
