Public Information
Lessons Learned Migrating from IBM BigInsights
to Hortonworks Data Platform
DataWorks Summit 2018
Lisa Coleman – Data and Analytics Infrastructure
Robert Tucker – HDP Platform Administrator
Mica Glover – HDP Platform Administrator
May 2018
OUR MISSION
The mission of the association is to
facilitate the financial security of its
members, associates and their families
through provision of a full range of highly
competitive financial products and services;
in so doing, USAA seeks to be the provider
of choice for the military community.
THE USAA STANDARD
• Keep our membership and mission first
• Live our core values: Service, Loyalty,
Honesty, Integrity
• Be authentic and build trust
• Create conditions for people to succeed
• Purposefully include diverse perspectives
for superior results
• Innovate and build for the future
Agenda
• Challenges & Drivers
• Scope of Work
• Lessons Learned
Platform Challenges and Drivers
Themes: Interoperability; Support Model; Velocity of Change; Compatibility; IBM Strategy & Guidance
• In-place upgrades not possible without GPFS
• Stack integration – Platform Symphony, WLM, BigSQL
• Common BI tools in the industry not certified for connectivity via GPFS
• Limited ability to leverage Streaming Data Platforms, as none were certified on our version of IBM BigInsights
• Difficult resolution process
• Evolution of documentation to support version upgrade
• Potential added expenses for professional services
• Inconsistency in stability of enhancements and break-fix releases
• Difficulty securing the proper resources for assistance
• Limits our ability to innovate and take advantage of new capabilities
• Slow adoption of new ODPi projects
Apache Project Support

Apache Project  Usage                     Hortonworks 2.5   BigInsights 4.2
Hadoop          Core Components           2.7.3             2.7.2
Hive            SQL on Hadoop             1.2.1             1.2.1
Sqoop           Data Ingestion            1.4.6             1.4.6
Ambari          Cluster Management        2.4               2.2.0
Spark           IoT, Streaming            1.6.2 and 2.0*    1.6.1
Zeppelin        Data Science              0.6.0             N/A
Atlas           Governance                0.7.0             N/A
Ranger          Security                  0.6.0             0.5.2
Knox            Security                  0.9.0             0.7.0
Tez             Improve Hive Performance  0.7.0             N/A
NiFi            IoT                       1                 N/A
Flume           Data Ingestion            1.5.2             1.6.0
Kafka           IoT, Streaming            0.10.0.1          0.9.0.1
Pig             ETL                       1.2.0             0.15.0
Storm           IoT, Streaming            1.0.1             N/A
HBase           NoSQL                     1.1.2             1.2.0
Phoenix         SQL Interface for HBase   4.7.0             4.6.1
Solr            Social Media / NLP        5.2.2             5.5
Accumulo        NoSQL                     1.7.0             N/A

(Required Capability column: the per-project check marks are not recoverable from the slide text.)
Key Innovations
Focus areas: Operations; Security / Governance; Security

Capability | Hortonworks 2.5 | BigInsights 4.2
Rolling Upgrade – zero cluster downtime / full cluster HA | Yes. Proven. GA multiple releases | No equivalent offering
Express Upgrade – controlled cluster downtime | Yes. Proven. GA multiple releases | New, unproven
Cluster preventative maintenance | Yes, Hortonworks SmartSense | No equivalent offering
Role Based Access Control (RBAC) | Yes, via Ranger | Yes, via Ranger
Basic tag policy – access and entitlements can be based on attributes | Yes, via Atlas and Ranger | No, Atlas not part of distribution
Geo-based policy – access policy based on location | Yes, via Atlas and Ranger | No, Atlas not part of distribution
Time-based policy – access policy based on time windows | Yes, via Atlas and Ranger | No, Atlas not part of distribution
Prohibitions – restrictions on combining two data sets that are each compliant on their own but not when combined | Yes, via Atlas and Ranger | No, Atlas not part of distribution
Scope of Work

Environment Set Up: Provision for HDP Readiness
• AD integration
• Enable Kerberos
• Standard HDFS/local filesystem layout
• Establish DB / FTP connections

Component Migration: Batch Jobs & Scripts
• DDL replication
• Convert POSIX to HDFS commands
• Convert Hive CLI to Beeline
• Convert existing ingest utilities

Data Migration: Historic & Incremental Loads
• Re-evaluated all data assets
• Bulk data loads required specialized configuration
• Data validation

Cut-Over & Retirement: Phased Releases
• Planning with data support teams
• Parallel runs
• Execute phased turn-off
• Convert clients to use Knox and Hive

Transition: Hand-Off to Support Teams
• Documentation
• Access provisioning
• Knowledge transfer
• Sign-off
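
The "Convert POSIX to HDFS commands" item in the scope of work can be illustrated with a small sketch. Scripts written against GPFS could use ordinary POSIX file commands, while on HDFS the same operations go through `hdfs dfs`. The mapping table, the function name `posix_to_hdfs`, and the choice to pass unknown lines through unchanged are assumptions made for illustration; this is not the conversion tooling used in the migration.

```python
import shlex

# Hypothetical, deliberately small mapping: POSIX command -> `hdfs dfs` subcommand.
SIMPLE_COMMANDS = {"ls": "-ls", "cat": "-cat", "cp": "-cp", "mv": "-mv"}


def posix_to_hdfs(line: str) -> str:
    """Translate one POSIX file command into an `hdfs dfs` command line.

    Lines that are not recognised are returned unchanged so that a
    reviewer can convert them by hand.
    """
    tokens = shlex.split(line)
    if not tokens:
        return line
    cmd, args = tokens[0], tokens[1:]
    flags = [a for a in args if a.startswith("-")]
    paths = [a for a in args if not a.startswith("-")]

    if cmd in SIMPLE_COMMANDS:
        # POSIX flags such as `ls -l` are dropped here and need manual review.
        opts = [SIMPLE_COMMANDS[cmd]]
    elif cmd == "mkdir":
        opts = ["-mkdir"] + (["-p"] if "-p" in flags else [])
    elif cmd == "rm":
        opts = ["-rm"]
        if any("r" in f.lower() for f in flags):
            opts.append("-r")  # recursive delete
        if any("f" in f for f in flags):
            opts.append("-f")  # do not fail on missing paths
    else:
        return line

    return " ".join(["hdfs", "dfs"] + [shlex.quote(t) for t in opts + paths])


if __name__ == "__main__":
    for original in ["ls /data/landing", "mkdir -p /data/a/b", "rm -rf /data/tmp"]:
        print(original, "=>", posix_to_hdfs(original))
```

A table-driven approach like this only covers the simple cases; pipelines, globbing, and commands with no HDFS counterpart would still need to be rewritten by hand.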
In-Scope Component Summary
• Hive Tables: 4,700
• Data Volume: 500 TB
• Env Files: 233
• Python Scripts: 499
• Linux Scripts: 4,000
• Pig Scripts: 243
• Prod Jobs: 7,356
Workload Transition
Lessons Learned

Security and Access
• LDAP/AD integration
• Kerberos
• Ranger
• Knox – non-Kerberos connectivity to Hive

Ingest Framework
• Rewrite all ingest utilities for HDFS
• Standardize metadata delivery to Atlas
• Standard asset request
• Adoption of data stewardship

Stakeholder Buy-In
• Lack of dedicated resources across the enterprise required third-party assistance
• Extensive knowledge transfer sessions to aid in transition
• Developed training plan

Table Data Migration
• Modified pathing/directory structure on HDFS
• Quality checks
• Networked an additional set of nodes and leveraged HDFS client copy from local to move data (due to GPFS → HDFS)

Operational Maturity
• Enterprise monitoring
• Code repository
• Automated code deploy
• Managed asset provisioning
• HA services

Code Management
• Transition from Hive CLI to Beeline
• Optimized file format (ORC/Parquet/Avro)
• Conversion of all scripts to acquire Kerberos ticket
• Transition from GPFS to HDFS
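
Two of the code-migration items above, acquiring a Kerberos ticket in every script and moving from the Hive CLI to Beeline, can be sketched together. The host name, realm, keytab path, and the helper names `kinit_command`, `hive_cli_to_beeline`, and `run_with_ticket` are all hypothetical; only the `kinit -kt` and `beeline -u ... -e/-f` invocations are standard. This is an illustration under those assumptions, not the scripts used in the migration.

```python
import shlex
import subprocess
from typing import List

# Hypothetical connection string; host, port, and realm are placeholders.
JDBC_URL = (
    "jdbc:hive2://hiveserver.example.com:10000/default;"
    "principal=hive/_HOST@EXAMPLE.COM"
)


def kinit_command(keytab: str, principal: str) -> List[str]:
    """Build the `kinit` call that obtains a ticket from a keytab."""
    return ["kinit", "-kt", keytab, principal]


def hive_cli_to_beeline(hive_cmd: str, jdbc_url: str = JDBC_URL) -> List[str]:
    """Rewrite `hive -e <query>` or `hive -f <file>` as a Beeline call."""
    tokens = shlex.split(hive_cmd)
    if len(tokens) != 3 or tokens[0] != "hive" or tokens[1] not in ("-e", "-f"):
        raise ValueError(f"unsupported Hive CLI command: {hive_cmd!r}")
    return ["beeline", "-u", jdbc_url, tokens[1], tokens[2]]


def run_with_ticket(keytab: str, principal: str, hive_cmd: str) -> None:
    """Acquire a ticket, then run the converted command (needs a real cluster)."""
    subprocess.run(kinit_command(keytab, principal), check=True)
    subprocess.run(hive_cli_to_beeline(hive_cmd), check=True)


if __name__ == "__main__":
    print(kinit_command("/path/to/etl.keytab", "etl@EXAMPLE.COM"))
    print(hive_cli_to_beeline("hive -f load_accounts.hql"))
```

Clients that cannot use Kerberos would instead connect through Knox, as the Security and Access items note; that path is not covered by this sketch.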
Editor's Notes

  • #5: ODPi – open data platform initiative: goal was to contribute to open source collaboratively and provide interoperability across different distributions
  • #8: Prework to evaluate all data prior to migration – removed the need to migrate some data
      • Delete any data you no longer need
      • Delete duplicate data
      • Compress all data with Bzip2
      • Identify which Hive tables do not need to be migrated
      • Delete any temporary tables
      • Understand who is consuming your data and how they are consuming it
      • Understand whether your project conforms to the Ingest Framework methodology
    Additional notes:
      • HDFS client needed on BigInsights cluster
      • SAS clients, which were big consumers of BigSQL, had to be converted
      • Dedicated team to address a majority of the script conversions
      • Post-migration security model implementation required for some assets which did not have an assigned owner
      • KT (knowledge transfer) – office hours, data consumption through Hive, search of available assets
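
The "Compress all data with Bzip2" step in note #8 could look roughly like the following for a file staged on a local filesystem before it is copied to HDFS. The function name `compress_bzip2`, the chunk size, and the choice to write the output next to the source file are assumptions for illustration; the notes do not say how compression was actually performed.

```python
import bz2
import shutil
from pathlib import Path


def compress_bzip2(src: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Write `<src>.bz2` next to `src` and return the new path.

    The file is streamed in chunks so large files do not need to fit
    in memory. The original file is left in place.
    """
    dst = src.with_name(src.name + ".bz2")
    with src.open("rb") as fin, bz2.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

Keeping the original until the compressed copy has been validated fits the "quality checks" theme of the migration; deleting it afterwards is left to the caller.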
  • #11: Add Code section, e.g. transition from Hive CLI to Beeline. Add Ranger in Security section. I removed the Code Management section.
      • GPFS shielded USAA from having to implement Kerberos
      • Memory management – JVMs, Spark
      • Case mismatch between Unix accounts
    Code Migration
      • Conversion of all scripts to acquire Kerberos tickets
      • Hive CLI to Beeline
      • GPFS to HDFS
      • Python libraries
    Table Data
      • Pathing changed
      • Quality checks
      • Networked additional set of nodes and leveraged HDFS client copy from local to move data
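
Note #11 mentions a case mismatch between Unix accounts without further detail. Assuming this means the same account name appeared with different letter case in different systems (for example a directory entry versus a local Unix account), a pre-migration check might look like the sketch below. The function name `find_case_mismatches` and the sample account names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_case_mismatches(accounts: Iterable[str]) -> Dict[str, List[str]]:
    """Group account names that differ only by letter case.

    Returns a mapping from the lower-cased name to the sorted list of
    distinct spellings, restricted to names with more than one spelling.
    """
    groups = defaultdict(set)
    for name in accounts:
        groups[name.lower()].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical account names gathered from two systems.
    names = ["jsmith", "JSmith", "etl_svc", "ETL_SVC", "hive"]
    for key, spellings in find_case_mismatches(names).items():
        print(key, spellings)
```

Running a check like this before enabling Kerberos and Ranger policies would surface accounts whose ownership or permissions might not line up across systems.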