Best Practices With Hadoop
LEAN
1. Eliminate waste
2. Automate processes
3. Empower the team
4. Continuously improve
5. Build quality in
6. Plan for change
7. Optimize the whole
AGILE
Profile data sources for data quality
Identify join conditions and data quality rules
Analyst defines mapping specification
Estimate project scope based on impact analysis
Define Requirements
Analyze & Design
Build
Test
Deploy
Support & Maintain
[Figure: change request (CR) process flow]
Data Warehouse Team
GMNA Applications Team (Irina)
Change Request
Confirmation Request
Telephone Tag
Automated Workflow/Tracking (customer satisfaction)
Status Request
Status Update
CR Review Committee
Data Dictionary to clarify requirements (13 days)
Semi-Weekly Review
Notify Customer (customer satisfaction) (5 days)
Clarify Requirements
Requirements Clarification
Approved Changes
Add CR To List
Assign Resource
Design Approval
Bypass Council for simple CRs (26 days)
Production CR Submission
Daily ETL Batch Run
Bypass Committee for simple changes (8 days)
CR Approval
Test Results Distribution
Approved Designs
Forward CR Request To Developer
CRs: P1x12, P2x35, P3x124
Test Scheduling
Approved CR & Design Docs
Requirements Review
Design & Development
Development Team
Testing Handoff
Design Docs & Schedule
Automated Regression Testing
Test Execution
Test Team
CR Approval & Schedule
Test Case Development
Charge Request
Production Deployment
Production Execution
Infrastructure Team
Timings: 8.8 Days, 13.3 Days, 30 Minutes, 26 Days, 180 Minutes, 1 Day, 15 Minutes, 12.8 Days, 180 Minutes, 8.5 Days, 90 Minutes, 0.3 Days, 15 Minutes
Days/Weeks, Mins/Hrs
To-Be Process
Hours, Mins
As-Is Process: Find data sources and targets
Assign Business Request
Business submits request for new information in report
Analyst receives request from business
Define Requirements
Analyst creates requirements definition based on request
Analyst requests clarification from business
Preview Data
Analyst describes requirement to App Developer
App Developer identifies table to meet requirements
Complexity (# Data Sources): each data source in a different application requires another app developer to get involved
Establish a data governance framework to ensure confidence, integrity, transparency, & security
Empower analyst to find & preview data on their own
Days/Weeks, Mins/Hrs
To-Be Process: Find data sources and targets
Browse and navigate data lineage to find data sources and targets
INFORMATION CONFIDENCE
MDM
Deliver Context-Rich Information Through Metadata Management
INFORMATION TRANSPARENCY
Metadata Management, Data Quality, Retention / Privacy
Deliver Consistent, Correct, & Complete Information Through Pervasive Data Quality
INFORMATION INTEGRITY
As-Is Process: Profile Data
Analyst profiles in spreadsheet or creates & runs SQL scripts to profile data
Browser-based tool to profile data without the help of a developer
Days/Weeks, Mins/Hrs
To-Be Process: Column Profiling (DEMO)
Identify duplicate records and uniqueness violations
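The column profiling step (identifying duplicate records and uniqueness violations) can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the implementation of the tool shown in the demo: the sample rows, the column names, and the `profile_column` helper are all hypothetical.

```python
from collections import Counter

def profile_column(rows, column):
    """Return basic profile statistics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    # Values that occur more than once violate uniqueness.
    duplicates = {v: n for v, n in counts.items() if n > 1}
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "duplicates": duplicates,
        # A candidate key must have no duplicates and no nulls.
        "is_unique": not duplicates and len(non_null) == len(values),
    }

# Hypothetical sample data, for illustration only.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 2, "email": None},
]

id_profile = profile_column(customers, "customer_id")
email_profile = profile_column(customers, "email")
print(id_profile)
print(email_profile)
```

Here `customer_id` would be flagged as a uniqueness violation, and `email` as containing nulls, before either column is relied on as a key.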
As-Is Process: Identify join conditions and data quality rules
Build E-R Model
Developer builds E-R model based on table relationships
Perform Join Analysis
Request DQ Rules
Fix DQ Issues
Developer writes script to resolve data quality issue (which has probably been written before)
Analyst verifies data quality rules fix issues
Complex DQ issues that are not easy to resolve may need to go back to the business
Use cross-table profiling and join analysis in DI developer tool
Search common metadata repository for data quality rules already created
Days/Weeks, Mins/Hrs
To-Be Process: Join Analysis (DEMO)
View PK-FK relationships
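As a rough sketch of what a join analysis over a PK-FK relationship checks, the snippet below tests whether a candidate primary key is unique and how many foreign key values find a match. It is an assumption-laden illustration rather than the demoed tool's logic: the tables, the key names, and the `join_analysis` helper are hypothetical.

```python
def join_analysis(parent_rows, parent_key, child_rows, child_key):
    """Check a candidate PK-FK relationship between two lists of dict rows."""
    parent_keys = {row[parent_key] for row in parent_rows}
    child_values = [row[child_key] for row in child_rows]
    # Foreign key values with no matching parent row (orphans).
    orphans = sorted({v for v in child_values
                      if v is not None and v not in parent_keys})
    matched = sum(1 for v in child_values if v in parent_keys)
    return {
        "parent_key_unique": len(parent_keys) == len(parent_rows),
        "match_rate": matched / len(child_values) if child_values else 0.0,
        "orphans": orphans,
    }

# Hypothetical sample tables, for illustration only.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
    {"order_id": 13, "customer_id": 1},
]

result = join_analysis(customers, "customer_id", orders, "customer_id")
print(result)
```

A match rate below 1.0 or a non-empty orphan list suggests the join condition needs a data quality rule before it is used in a mapping.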
As-Is Process: Analyst defines mapping specification
Add Sources & Targets
Analyst adds sources and targets to mapping document
Identify Field Mappings
Developer(s) recommend source-to-target field mappings
Developer requests clarification from Analyst
Days/Weeks, Mins/Hrs
To-Be Process: Define Specification (DEMO)
Specify transformation logic using reusable expressions
Include transformation descriptions to instruct developer
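One way to picture a mapping specification built from reusable expressions, with a description attached to each rule for the developer, is as plain data that can be executed directly. The sketch below is hypothetical throughout (field names, expression names, and the `apply_mapping` helper are assumptions) and does not reflect any particular product's specification format.

```python
# Reusable expressions, registered once and referenced by name.
EXPRESSIONS = {
    "trim_upper": lambda v: v.strip().upper(),
    "to_cents": lambda v: round(float(v) * 100),
    "copy": lambda v: v,
}

# Each rule names a source field, a target field, a reusable expression,
# and a description that tells the developer what the rule is meant to do.
MAPPING_SPEC = [
    {"source": "cust_name", "target": "CUSTOMER_NAME", "expr": "trim_upper",
     "description": "Trim whitespace and upper-case the name."},
    {"source": "amount", "target": "AMOUNT_CENTS", "expr": "to_cents",
     "description": "Convert a decimal currency amount to integer cents."},
    {"source": "order_id", "target": "ORDER_ID", "expr": "copy",
     "description": "Pass through unchanged."},
]

def apply_mapping(spec, source_row):
    """Build a target row by applying each rule's expression to its source field."""
    return {
        rule["target"]: EXPRESSIONS[rule["expr"]](source_row[rule["source"]])
        for rule in spec
    }

source_row = {"cust_name": "  acme corp ", "amount": "12.50", "order_id": 7}
target_row = apply_mapping(MAPPING_SPEC, source_row)
print(target_row)
```

Because the specification is data rather than prose, the same rules can be previewed by the analyst and executed by the developer without a translation step.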
As-Is Process: Estimate project scope based on impact analysis
Search SCCS
Developer searches source code control for reports or other objects impacted
Code Review
Estimate QA Time
QA provides estimate on how long to retest reports and other affected objects
Leverage metadata management with data lineage to perform impact analysis
Days/Weeks, Mins/Hrs
To-Be Process: Estimate project scope based on impact analysis
Lineage object
View upstream lineage
View downstream impact
View downstream lineage
View upstream impact
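Impact analysis over data lineage amounts to a graph traversal: downstream of an object to see what a change affects, upstream to see where its data comes from. The following is a minimal sketch assuming lineage is available as a simple adjacency map; the object names and the `downstream` and `upstream` helpers are hypothetical.

```python
from collections import deque

# Directed lineage edges: data flows from each key to the objects in its list.
# All object names are hypothetical.
LINEAGE = {
    "crm.customers": ["stg.customers"],
    "erp.orders": ["stg.orders"],
    "stg.customers": ["dw.sales_fact"],
    "stg.orders": ["dw.sales_fact"],
    "dw.sales_fact": ["report.revenue", "report.churn"],
}

def downstream(graph, start):
    """Return every object reachable from start (breadth-first traversal)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def upstream(graph, start):
    """Return every object that feeds start, by traversing reversed edges."""
    reverse = {}
    for src, targets in graph.items():
        for tgt in targets:
            reverse.setdefault(tgt, []).append(src)
    return downstream(reverse, start)

impacted = downstream(LINEAGE, "stg.orders")
sources = upstream(LINEAGE, "report.revenue")
print(sorted(impacted))
print(sorted(sources))
```

The size of the downstream set gives a first estimate of how many objects and reports need retesting when `stg.orders` changes.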
As-Is Process
Clarify Specification
Analyst clarifies source-target mapping specification
Developer needs clarification from specification
Translate Specification
Developer translates mapping specification into mapping logic
Specify Results
Analyst specifies expected results
Unit Test
Developer performs unit testing
System Test
QA tests all affected objects and reports
Acceptance Test
Analyst verifies that target data meets business requirements
Test results are inconsistent with business requirements. Go back to specification step
Change Objects
Developer makes necessary changes to affected objects and reports
Automatically generate mapping logic
Deploy early and test often to speed up iterations
Use comparative profiling
Request analyst to preview target data in dev/test environment
Reuse DI & DQ rules in unified developer environment
Automatically compare actual results with expected results
Days/Weeks, Mins/Hrs
Data Integration
Master Data Management
Data Quality
ODBC/JDBC Access
Quality
Test Data Management & Archiving
Web Services
Retention
B2B
PowerCenter
Privacy
Freshness
SOA/Composite Apps
Business Intelligence
Mainframe
Databases
Unstructured and Semi-structured
Applications
Cloud
Social Media
NoSQL
To-Be Process
Mapping logic automatically generated from analyst-defined specification
Extend mapping to consume web services (Consume WS DEMO)
Extend mapping to dynamically mask sensitive data (Dynamic Masking DEMO)
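The idea of dynamically masking sensitive data can be illustrated with a small sketch in which masking is applied at read time depending on who is asking, while the stored row is left unchanged. This is an assumption for illustration only and not how any specific product implements masking; the field list, the `privileged` flag, and both helpers are hypothetical.

```python
# Fields treated as sensitive in this sketch (hypothetical list).
SENSITIVE_FIELDS = {"ssn", "email"}

def mask_value(value, keep_last=4):
    """Replace all but the last keep_last characters with asterisks."""
    text = str(value)
    if len(text) <= keep_last:
        return "*" * len(text)
    return "*" * (len(text) - keep_last) + text[-keep_last:]

def mask_row(row, privileged=False):
    """Return a copy of row, masking sensitive fields for non-privileged readers."""
    if privileged:
        return dict(row)
    return {
        key: mask_value(value) if key in SENSITIVE_FIELDS and value is not None else value
        for key, value in row.items()
    }

row = {"name": "Ann", "ssn": "123-45-6789", "email": "ann@example.com"}
masked = mask_row(row)
unmasked = mask_row(row, privileged=True)
print(masked)
```

The original `row` is never modified, which is what distinguishes this style of masking from permanently scrubbing the data at rest.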
To-Be Process
Deploy to PowerCenter (DEMO)
To-Be Process
Compare profiling statistics during development to ensure data quality
Expected Results
Comparative Profiling (DEMO)
Data Validation (DEMO)
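Automatically comparing actual results with expected results can be reduced to a keyed diff between two row sets. The snippet below is a minimal sketch of that idea, assuming both sets share a key column; the sample rows and the `compare_results` helper are hypothetical and stand in for whatever the validation tool does internally.

```python
def compare_results(expected, actual, key):
    """Diff two lists of dict rows by key and report every discrepancy."""
    exp = {row[key]: row for row in expected}
    act = {row[key]: row for row in actual}
    return {
        # Keys the analyst expected but the mapping did not produce.
        "missing": sorted(exp.keys() - act.keys()),
        # Keys the mapping produced that were not expected.
        "unexpected": sorted(act.keys() - exp.keys()),
        # Keys present in both whose rows differ.
        "mismatched": sorted(k for k in exp.keys() & act.keys() if exp[k] != act[k]),
    }

# Hypothetical expected results (from the analyst) and actual target rows.
expected = [{"id": 1, "total": 100}, {"id": 2, "total": 250}, {"id": 3, "total": 75}]
actual = [{"id": 1, "total": 100}, {"id": 2, "total": 205}, {"id": 4, "total": 10}]

diff = compare_results(expected, actual, "id")
passed = not any(diff.values())
print(diff, passed)
```

A failing diff points at the specific rows to revisit, rather than sending the whole mapping back to the specification step.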
1. Project Request
2. Define Requirements
3. Analyze & Design
4. Build
5. Test
6. Deploy
7. Support & Maintain
Q&A