OC - Module 2 - DA Lifecycle 021312

Uploaded by

Lakshmi Devi
Module 2 – Data Analytics Lifecycle
1
Module 2: Data Analytics Lifecycle

Upon completion of this module, you should be able to:


• Apply the Data Analytics Lifecycle to a case study scenario
• Frame a business problem as an analytics problem
• Identify the four main deliverables in an analytics project

2
Module 2: Data Analytics Lifecycle

During this module the following topics are covered:


• Data Analytics Lifecycle
• Roles for a Successful Analytics Project
• Case Study to apply the data analytics lifecycle

3
How to Approach Your Analytics Problems
Your Thoughts?

• How do you currently approach your analytics problems?
• Do you follow a methodology or some kind of framework?
• How do you plan for an analytic project?

4
Value of Using the Data Analytics Lifecycle

• Focus your time

• Ensure rigor and completeness

• Enable better transition to members of the cross-functional analytic teams
  - Repeatable
  - Scale to additional analysts
  - Support validity of findings

“A journey of a thousand miles begins with a single step” (Lao Tzu)

5
Need For a Process to Guide Data Science Projects
1. Well-defined processes can help guide any analytic project
2. Focus of Data Analytics Lifecycle is on Data Science projects, not business intelligence
3. Data Science projects tend to require a more consultative approach, and differ in a few ways
  - More due diligence in Discovery phase
  - More projects which lack shape or structure
  - Less predictable data

6
Key Roles for a Successful Analytic Project
Role: Description
• Business User: Someone who benefits from the end results and can consult and advise project team on value of end results and how these will be operationalized
• Project Sponsor: Person responsible for the genesis of the project, providing the impetus for the project and core business problem, generally provides the funding and will gauge the degree of value from the final outputs of the working team
• Project Manager: Ensure key milestones and objectives are met on time and at expected quality.
• Business Intelligence Analyst: Business domain expertise with deep understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
• Data Engineer: Deep technical skills to assist with tuning SQL queries for data management, extraction and support data ingest to analytic sandbox
• Database Administrator (DBA): Database Administrator who provisions and configures database environment to support the analytical needs of the working team
• Data Scientist: Provide subject matter expertise for analytical techniques, data modeling, applying valid analytical techniques to given business problems and ensuring overall analytical objectives are met

7
Data Analytics Lifecycle
[Figure: circular lifecycle with six phases: 1 Discovery, 2 Data Prep, 3 Model Planning, 4 Model Building, 5 Communicate Results, 6 Operationalize. Checkpoint questions shown between phases:]
• After Discovery: Do I have enough information to draft an analytic plan and share for peer review?
• After Data Prep: Do I have enough good quality data to start building the model?
• After Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
• After Model Building: Is the model robust enough? Have we failed for sure?
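The lifecycle can be read as a gated loop: each checkpoint question either lets the team move forward or sends it back. The sketch below is one possible way to express that in Python. The pairing of each question with a phase follows an assumed reading of the diagram, and the names Phase, CHECKPOINTS and next_phase, as well as the rule of stepping back exactly one phase, are assumptions made for illustration.

```python
from enum import IntEnum


class Phase(IntEnum):
    DISCOVERY = 1
    DATA_PREP = 2
    MODEL_PLANNING = 3
    MODEL_BUILDING = 4
    COMMUNICATE_RESULTS = 5
    OPERATIONALIZE = 6


# Checkpoint questions from the diagram, keyed by the phase they close
# (the pairing is an assumed reading of the diagram).
CHECKPOINTS = {
    Phase.DISCOVERY: "Do I have enough information to draft an analytic plan "
                     "and share for peer review?",
    Phase.DATA_PREP: "Do I have enough good quality data to start building "
                     "the model?",
    Phase.MODEL_PLANNING: "Do I have a good idea about the type of model to "
                          "try? Can I refine the analytic plan?",
    Phase.MODEL_BUILDING: "Is the model robust enough? Have we failed for "
                          "sure?",
}


def next_phase(current: Phase, checkpoint_passed: bool) -> Phase:
    """Advance when the checkpoint is passed; otherwise step back one phase.

    Stepping back exactly one phase is a simplification of this sketch.
    """
    if checkpoint_passed:
        return Phase(min(current + 1, Phase.OPERATIONALIZE))
    return Phase(max(current - 1, Phase.DISCOVERY))


phase = Phase.DATA_PREP
print(CHECKPOINTS[phase])
print(next_phase(phase, checkpoint_passed=False).name)
```

The point of the sketch is that the lifecycle is iterative: failing a checkpoint is a normal outcome that returns the team to earlier work rather than ending the project.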

8
Data Analytics Lifecycle
Phase 1: Discovery
• Learn the Business Domain
  - Determine amount of domain knowledge needed to orient you to the data and interpret results downstream
  - Determine the general analytic problem type (such as clustering, classification)
  - If you don’t know, then conduct initial research to learn about the domain area you’ll be analyzing
• Learn from the past
  - Have there been previous attempts in the organization to solve this problem?
  - If so, why did they fail? Why are we trying again? How have things changed?

11
Data Analytics Lifecycle
Phase 1: Discovery
• Resources
  - Assess available technology
  - Available data – sufficient to meet your needs
  - People for the working team
  - Assess scope of time for the project in calendar time and person-hours
  - Do you have sufficient resources to attempt the project? If not, can you get more?

12
Data Analytics Lifecycle
Phase 1: Discovery
• Frame the problem: framing is the process of stating the analytics problem to be solved
  - State the analytics problem, why it is important, and to whom
  - Identify key stakeholders and their interests in the project
  - Clearly articulate the current situation and pain points
  - Objectives – identify what needs to be achieved in business terms and what needs to be done to meet the needs
  - What is the goal? What are the criteria for success? What’s “good enough”?
  - What is the failure criterion (when do we just stop trying or settle for what we have)?
  - Identify the success criteria, key risks, and stakeholders (such as RACI)

13
Tips for Interviewing the Analytics Sponsor
• Even if you are “given” an analytic problem, you should work with clients to clarify and frame the problem
  - You’re typically handed solutions; you need to identify the problem and their desired outcome
Sponsor Interview Tips
• Prepare for the interview – draft your questions, review with colleague, team
• Use open-ended questions, don’t ask leading questions
• Probe for details, follow-up
• Don’t fill every silence – give them time to think
• Let them express their ideas, don’t put words in their mouth, let them share their feelings
• Ask clarifying questions, ask why – is that correct? Am I on target? Is there anything else?
• Use active listening – repeat it back to make sure you heard it correctly
• Don’t express your opinions
• Be mindful of your body language and theirs – use eye contact, be attentive
• Minimize distractions
• Document what you heard and review it back with the sponsor

14
Tips for Interviewing the Analytics Sponsor

Interview Questions
• What is the business problem you’re trying to solve?
• What is your desired outcome?
• Will the focus and scope of the problem change if the following dimensions change:
  - Time – analyzing 1 year or 10 years’ worth of data?
  - People – how would this project change this?
  - Risk – conservative to aggressive
  - Resources – none to unlimited (tools, tech, …)
  - Size and attributes of Data
• What data sources do you have?
• What industry issues may impact the analysis?
• What timelines are you up against?
• Who could provide insight into the project? Consulted?
• Who has final say on the project?

15
Data Analytics Lifecycle
Phase 1: Discovery
• Formulate Initial Hypotheses
  - IH, H1, H2, H3, … Hn
  - Gather and assess hypotheses from stakeholders and domain experts
  - Preliminary data exploration to inform discussions with stakeholders during the hypothesis forming stage
• Identify Data Sources – Begin Learning the Data
  - Aggregate sources for previewing the data and provide high-level understanding
  - Review the raw data
  - Determine the structures and tools needed
  - Scope the kind of data needed for this kind of problem

16
Using a Sample Case Study to Track the
Phases in the Data Analytics Lifecycle
Mini Case Study: Churn Prediction for
Yoyodyne Bank
Situation Synopsis
• Yoyodyne Bank, a retail bank, wants to improve the Net Present Value
(NPV) and retention rate of customers
• They want to establish an effective marketing campaign targeting
customers to reduce the churn rate by at least five percent
• The bank wants to determine whether those customers are worth
retaining. In addition, the bank also wants to analyze reasons for
customer attrition and what they can do to keep them
• The bank wants to build a data warehouse to support Marketing
and other related customer care groups
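To make the churn target concrete, the sketch below computes a churn rate and the rate the campaign would need to reach. The synopsis does not say whether "five percent" is a relative reduction or a drop in percentage points; this example assumes a relative reduction. The customer counts and the function names churn_rate and target_churn_rate are hypothetical.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Share of customers lost over the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def target_churn_rate(baseline: float, relative_reduction: float = 0.05) -> float:
    """Churn rate after a relative reduction (assumed reading of 'five percent')."""
    return baseline * (1 - relative_reduction)


# Hypothetical figures, not taken from the case study.
baseline = churn_rate(customers_at_start=10_000, customers_lost=800)
print(f"baseline churn rate: {baseline:.3f}")
print(f"target churn rate:   {target_churn_rate(baseline):.3f}")
```

Agreeing on this definition with the sponsor during Discovery avoids arguing later about whether the success criterion was met.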

18
How to Frame an Analytics Problem

Sample Business Problems
• How can we improve on x?
• What’s happening real-time? Trends?
• How can we use analytics to differentiate ourselves?
• How can we use analytics to innovate?
• How can we stay ahead of our biggest competitor?

Qualifiers
Will the focus and scope of the problem change if the following dimensions change:
• Time
• People – how would x change this?
• Risk – conservative/aggressive
• Resources – none/unlimited
• Size of Data?

Analytical Approach
• Define an analytical approach, including key terms, metrics, and data needed.

Mini Case Study: Churn Prediction for Yoyodyne Bank

Sample Business Problem
• How can we improve Net Present Value (NPV) and retention rate of the customers?

Qualifiers
• Time: Trailing 5 months
• People: Working team and business users from the Bank
• Risk: the project will fail if we cannot determine valid predictors of churn
• Resources: EDW, analytic sandbox, OLTP system
• Data: Use 24 months for the training set, then analyze 5 months of historical data for those customers who churned

Analytical Approach
• How do we identify churn/no churn for a Yoyodyne Bank customer?
• Pilot study followed by full scale analytical model

19
Data Analytics Lifecycle
Phase 2: Data Preparation
• Prepare Analytic Sandbox
  - Work space for the analytic team
  - 10x+ vs. EDW
• Perform ELT
  - Determine needed transformations
  - Assess data quality and structuring
  - Derive statistically useful measures
  - Extract data and determine data connections for raw data, OLTP transactions, OLAP cubes or data feeds
  - Big ELT and Big ETL
• Useful Tools for this phase:
  - For Data Transformation & Cleansing: SQL, Hadoop, MapReduce, Alpine Miner
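The ELT pattern named above loads raw data into the sandbox first and transforms it there, rather than transforming before loading. A minimal sketch follows, using Python's built-in sqlite3 module as a stand-in for the analytic sandbox. The table names, columns and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw transaction rows: (customer_id, amount as text, channel).
RAW_ROWS = [
    (1, "25.00", "debit"),
    (1, "40.50", "debit"),
    (2, "n/a", "atm"),      # bad amount; it still lands in the raw table
    (2, "100.00", "atm"),
]


def run_elt(rows):
    conn = sqlite3.connect(":memory:")  # stand-in for the analytic sandbox
    # Extract + Load: land the data unchanged, including the bad record.
    conn.execute(
        "CREATE TABLE raw_txn (customer_id INTEGER, amount TEXT, channel TEXT)"
    )
    conn.executemany("INSERT INTO raw_txn VALUES (?, ?, ?)", rows)
    # Transform: done inside the database, after loading.
    conn.execute(
        """
        CREATE TABLE customer_summary AS
        SELECT customer_id,
               COUNT(*) AS txn_count,
               SUM(CAST(amount AS REAL)) AS total_amount
        FROM raw_txn
        WHERE amount GLOB '[0-9]*'   -- keep rows whose amount starts with a digit
        GROUP BY customer_id
        """
    )
    result = conn.execute(
        "SELECT customer_id, txn_count, total_amount "
        "FROM customer_summary ORDER BY customer_id"
    ).fetchall()
    conn.close()
    return result


print(run_elt(RAW_ROWS))
```

Because the raw table is kept intact, the team can revisit the filtering rule later without re-extracting from the source systems.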

20
Data Analytics Lifecycle
Phase 2: Data Preparation
• Familiarize yourself with the data thoroughly
  - List your data sources
  - What’s needed vs. what’s available
• Data Conditioning
  - Clean and normalize data
  - Discern what you keep vs. what you discard
• Survey & Visualize
  - Overview, zoom & filter, details-on-demand
  - Descriptive Statistics
  - Data Quality
• Useful Tools for this phase:
  - Descriptive Statistics on candidate variables for diagnostics & quality
  - Visualization: R (base package, ggplot and lattice), GnuPlot, Ggobi/Rggobi, Spotfire, Tableau
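As a small illustration of data conditioning and descriptive statistics, the sketch below drops missing values, summarizes a candidate variable, and rescales it. It uses only Python's standard library. Reading "normalize" as min-max scaling is an assumption, and the balance values are hypothetical.

```python
import statistics


def clean(values):
    """Drop missing entries (None) and keep the rest as floats."""
    return [float(v) for v in values if v is not None]


def describe(values):
    """Basic descriptive statistics for one candidate variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def min_max_normalize(values):
    """Rescale values to the range [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical account balances with two missing entries.
balances = clean([100, None, 300, 500, None, 700])
summary = describe(balances)
scaled = min_max_normalize(balances)
print(summary)
print(scaled)
```

Comparing the count before and after cleaning is a quick data quality check: a large drop suggests the variable may not be usable as a predictor.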

22
Data Analytics Lifecycle
Phase 3: Model Planning
• Determine Methods
  - Select methods based on hypotheses, data structure and volume
  - Ensure techniques and approach will meet business objectives
• Techniques & Workflow
  - Candidate tests and sequence
  - Identify and document modeling assumptions
• Useful Tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC

24
Data Analytics Lifecycle
Phase 3: Model Planning
• Data Exploration
• Variable Selection
  - Inputs from stakeholders and domain experts
  - Capture essence of the predictors, leverage a technique for dimensionality reduction
  - Iterative testing to confirm the most significant variables
• Model Selection
  - Conversion to SQL or database language for best performance
  - Choose technique based on the end goal
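One simple way to start the iterative testing of candidate variables is to rank them by how strongly each correlates with the outcome. The sketch below does this with a hand-written Pearson correlation. It is a basic screening step swapped in for illustration, not the dimensionality reduction technique the slide refers to, and the variable names and toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_variables(candidates, target):
    """Sort candidate variables by absolute correlation with the target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = churned, 0 = retained.
churned = [1, 1, 0, 0, 1, 0]
candidates = {
    "branch_visits": [1, 0, 1, 0, 1, 0],
    "debit_txn_per_month": [2, 3, 12, 15, 1, 9],
}
ranked = rank_variables(candidates, churned)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

A ranking like this is only a starting point for discussion with domain experts, since correlation with the outcome does not capture interactions between variables.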

26
Sample Research: Churn Prediction in Other Verticals
Mini Case Study: Churn Prediction for Yoyodyne Bank
• After conducting research on churn prediction, you have identified many methods for analyzing customer churn across multiple verticals (those in bold are taught in this course)
• At this point, a Data Scientist would assess the methods and select the best model for the situation
Market Sector: Analytic Techniques/Methods Used
• Wireless Telecom: DMEL method (data mining by evolutionary learning)
• Retail Business: Logistic regression, ARD (automatic relevance determination), decision tree
• Daily Grocery: MLR (multiple linear regression), ARD, and decision tree
• Wireless Telecom: Neural network, decision tree, hierarchical neurofuzzy systems, rule evolver
• Retail Banking: Multiple regression
• Wireless Telecom: Logistic regression, neural network, decision tree

28
Data Analytics Lifecycle
Phase 4: Model Building
• Develop data sets for testing, training, and production purposes
  - Need to ensure that the model data is sufficiently robust for the model and analytical techniques
  - Smaller, test sets for validating approach, training set for initial experiments
• Get the best environment you can for building models and workflows… fast hardware, parallel processing
• Useful Tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner
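The idea of separate training and test sets can be sketched in a few lines of plain Python. The example below splits a tiny data set and fits a one-variable logistic regression by gradient descent. The data, the feature (change in monthly debit transactions) and all function names are hypothetical, and a real project would use one of the tools listed above rather than this hand-rolled fit.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, lr=0.1, epochs=1000):
    """Fit p(churn) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in rows) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in rows) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def accuracy(rows, w, b):
    """Share of rows where the 0.5 threshold matches the label."""
    hits = sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in rows)
    return hits / len(rows)


# Hypothetical rows: (change in monthly debit transactions, churned flag).
ROWS = [(-4, 1), (-3, 1), (-2, 1), (-1, 1), (1, 0), (2, 0), (3, 0), (4, 0)]

train, test = train_test_split(ROWS)
w, b = fit_logistic(train)
print("train accuracy:", accuracy(train, w, b))
print("test accuracy: ", accuracy(test, w, b))
```

Keeping the test rows out of the fit is what makes the second accuracy figure an honest answer to "Is the model robust enough?".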

29
Data Analytics Lifecycle
Phase 5: Communicate Results
Did we succeed? Did we fail?
• Interpret the results
• Compare to the initial hypotheses (IHs) from Phase 1
• Identify key findings
• Quantify business value
• Summarize findings, depending on audience
Mini Case Study: Churn Prediction for Yoyodyne Bank
For the Yoyodyne case study, what would be some possible results and key findings?

31
Data Analytics Lifecycle
Phase 6: Operationalize
• Run a pilot
• Assess the benefits
• Deliver final deliverables
• Model Execution in Production Environment
• Define process to update and retrain the model, as needed
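One way to assess the benefits of a pilot is to compare churn in the pilot group with churn in a comparable group that did not receive the campaign. The slides do not prescribe a control group, so that design is an assumption of this sketch, and the counts and function names are hypothetical.

```python
def rate(churned: int, total: int) -> float:
    """Churn rate for one group."""
    if total <= 0:
        raise ValueError("total must be positive")
    return churned / total


def pilot_lift(pilot_churned, pilot_total, control_churned, control_total):
    """Return (absolute, relative) churn reduction of the pilot versus control."""
    pilot = rate(pilot_churned, pilot_total)
    control = rate(control_churned, control_total)
    absolute = control - pilot
    relative = absolute / control if control else 0.0
    return absolute, relative


# Hypothetical pilot: 30 of 1,000 targeted customers churned,
# against 40 of 1,000 in the comparison group.
absolute, relative = pilot_lift(30, 1000, 40, 1000)
print(f"absolute reduction: {absolute:.3f}")
print(f"relative reduction: {relative:.1%}")
```

Numbers like these feed directly into the decision on whether to move from the pilot to a full scale rollout.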

33
Analytic Plan
Mini Case Study: Churn Prediction for Retail Banking
Components of Analytic Plan (Retail Banking: Yoyodyne Bank)
• Phase 1: Discovery, Business Problem Framed: How do we identify churn/no churn for a customer?
• Initial Hypotheses: Transaction volume and type are key predictors of churn rates.
• Data: 5 months of customer account history.
• Phase 3: Model Planning, Analytic Technique: Logistic regression to identify most influential factors predicting churn.
• Phase 5: Result & Key Findings:
  - Once customers stop using their accounts for gas and groceries, they will soon erode their accounts and churn.
  - If customers use their debit card fewer than 5 times per month, they will leave the bank within 60 days.
• Business Impact: If we can target customers who are high-risk for churn, we can reduce customer attrition by 25%. This would save $3 million in lost customer revenue and avoid $1.5 million in new customer acquisition costs each year.
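The debit card finding and the business impact figures in the plan can be turned into a short worked example. The sketch below flags customers under the stated threshold of 5 debit transactions per month and adds up the two annual amounts from the plan. The customer names and counts are hypothetical, as are the identifiers is_high_risk and annual_benefit.

```python
def is_high_risk(debit_txns_last_month: int, threshold: int = 5) -> bool:
    """Flag a customer who used the debit card fewer than `threshold` times."""
    return debit_txns_last_month < threshold


# Hypothetical monthly debit card counts per customer.
monthly_debit_txns = {"cust_a": 12, "cust_b": 4, "cust_c": 0, "cust_d": 5}
flagged = sorted(
    name for name, count in monthly_debit_txns.items() if is_high_risk(count)
)
print("high-risk customers:", flagged)

# Annual figures stated in the analytic plan.
REVENUE_SAVED = 3_000_000         # lost customer revenue avoided
ACQUISITION_AVOIDED = 1_500_000   # new customer acquisition costs avoided
annual_benefit = REVENUE_SAVED + ACQUISITION_AVOIDED
print(f"combined annual benefit: ${annual_benefit:,}")
```

Stating the rule as code makes the boundary explicit: a customer with exactly 5 transactions is not flagged, which matches "fewer than 5".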

35
Key Outputs from a Successful Analytic Project, by Role
Role: Description; What the Role Needs in the Final Deliverables
• Business User
  Description: Someone who benefits from the end results and can consult and advise project team on value of end results and how these will be operationalized
  Needs: Sponsor Presentation addressing:
  - Are the results good for me?
  - What are the benefits of the findings?
  - What are the implications of this for me?
• Project Sponsor
  Description: Person responsible for the genesis of the project, providing the impetus for the project and core business problem, generally provides the funding and will gauge the degree of value from the final outputs of the working team
  Needs: Sponsor Presentation addressing:
  - What’s the business impact of doing this?
  - What are the risks? ROI?
  - How can this be evangelized within the organization (and beyond)?
• Project Manager
  Description: Ensure key milestones and objectives are met on time and at expected quality.
• Business Intelligence Analyst
  Description: Business domain expertise with deep understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
  Needs:
  - Show the analyst presentation
  - Determine if the reports will change
• Data Engineer
  Description: Deep technical skills to assist with tuning SQL queries for data management, extraction and support data ingest to analytic sandbox
  Needs:
  - Share the code from the analytical project
  - Create technical document on how to implement it.
• Database Administrator (DBA)
  Description: Database Administrator who provisions and configures database environment to support the analytical needs of the working team
  Needs:
  - Share the code from the analytical project
  - Create technical document on how to implement it.
• Data Scientist
  Description: Provide subject matter expertise for analytical techniques, data modeling, applying valid analytical techniques to given business problems
  Needs:
  - Show the analyst presentation
  - Share the code

36
4 Core Deliverables to Meet Most Stakeholder Needs
1. Presentation for Project Sponsors
• “Big picture” takeaways for executive level stakeholders
• Determine key messages to aid their decision-making process
• Focus on clean, easy visuals for the presenter to explain and for the
viewer to grasp
2. Presentation for Analysts
• Business process changes
• Reporting changes
• Fellow Data Scientists will want the details and are comfortable with
technical graphs (such as ROC curves, density plots, histograms)
3. Code for technical people
4. Technical specs of implementing the code

37
Analyst Wish List for a Successful Analytics Project

Data & Workspaces


• Access to all the data, including aggregated OLAP data, BI tools, raw data, structured and
various states of unstructured data as needed
• Up-to-date data dictionary to describe the data
• Area for staging and production data sets
• Ability to move data back and forth between workspaces and staging areas
• Analytic sandbox with strong compute power to experiment and play with the data

Tools
• Statistical/mathematical/visual software of choice for a given situation and problem set,
such as SAS, Matlab, R, Java tools, Tableau, Spotfire
• Collaboration: an online platform or environment for collaboration and communicating
with team members
• Tool or place to log errors with systems, environments or data sets

39
Concepts in Practice
Greenplum’s Approach to Analytics

• Magnetic: Attract all kinds of data
• Agile: Flexible and elastic data structures
• Deep: Rich data repository and algorithmic engine

[Figure: EDC Platform cycle: get data into the cloud, analyze and model in the cloud, push results back into the cloud. Axes: Past to Future, Facts to Interpretation. Quadrant questions: What happened, where and when? How and why did it happen? What will happen? How can we do better?]

Source: MAD Skills: New Analysis Practices for Big Data, March 2009

40
“The pessimist – complains about the wind
The optimist – expects it to change
The leader – adjusts the sails”

John Maxwell (Leadership Author)

41
Check Your Knowledge
• In which phase would you expect to invest most of your project time and
why? Where would you expect to spend the least time? Your Thoughts?

• What are the benefits of doing a pilot program before a full scale rollout of a
new analytical methodology? Discuss this in the context of the mini case
study.

• What kinds of tools would be used in the following phases, and for which
kinds of use scenarios?
  - Phase 2: Data Preparation
  - Phase 4: Model Building
• Now that you have completed the analytical project at Yoyodyne, you have an
opportunity to repurpose this approach for an online eCommerce company.
What phases of the lifecycle do you need to focus on to identify ways to do
this?

42
Module 2: Summary

Key points covered in this module:


• The Data Analytics Lifecycle was applied to a case study
scenario
• A business problem was framed as an analytics problem
• The four main deliverables in an analytics project were
identified

43
Lab Exercise 1: Introduction to Data Environment
This first lab introduces the Analytics Lab Environment you
will be working on throughout the course.

After completing the tasks in this lab you should be able to:
• Authenticate and access the Virtual Machine (VM)
assigned to you for all of your lab exercises
• Locate data sets you will be working with for the
course’s labs
• Use meta commands and PSQL to navigate through
the data sets
• Create subsets of the big data, using table joins and
filters, to analyze in subsequent lab exercises
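Creating a subset with a table join and a filter might look like the sketch below. The lab itself uses PSQL against the course database; Python's built-in sqlite3 module is used here only as a stand-in so the example runs anywhere, and the tables, columns and rows are hypothetical.

```python
import sqlite3


def build_subset():
    conn = sqlite3.connect(":memory:")  # stand-in for the lab database
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE transactions (customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
        INSERT INTO transactions VALUES (1, 20.0), (1, 35.0), (2, 50.0), (3, 5.0);

        -- Join the two tables, filter, and keep the result as a smaller table.
        CREATE TABLE east_large_txn AS
        SELECT c.customer_id, c.region, t.amount
        FROM customers AS c
        JOIN transactions AS t ON t.customer_id = c.customer_id
        WHERE c.region = 'east' AND t.amount >= 10.0;
        """
    )
    rows = conn.execute(
        "SELECT customer_id, region, amount FROM east_large_txn "
        "ORDER BY customer_id, amount"
    ).fetchall()
    conn.close()
    return rows


print(build_subset())
```

Saving the joined and filtered result as its own table keeps later exercises fast, because they query the small subset instead of repeating the join.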

44
