
DATA ANALYTICS LIFECYCLE
Author: FU
Date: Mar-2022
Objectives

After studying this chapter, the student should be able to:
 Understand the Data Analytics Lifecycle
 Understand the key roles for a successful analytics project
 Understand what the analytics team should learn and what is needed for data discovery
Content

1. Data Analytics Lifecycle Overview
2. Phase 1: Discovery
3. Phase 2: Data Preparation
4. Phase 3: Model Planning
5. Phase 4: Model Building
6. Phase 5: Communicate Results
7. Phase 6: Operationalize
8. Case Study: Global Innovation Network and Analysis (GINA)
1.1 Key Roles for a Successful Analytics Project

 Business User
 Project Sponsor
 Project Manager
 Business Intelligence Analyst
 Database Administrator
 Data Engineer
 Data Scientist
FIGURE 2-1 Key roles for a successful analytics project
1.2 Process overview (1)

 The Data Analytics Lifecycle designed for Big Data problems and data science projects has six phases
 Project work can occur in several phases at once. For most phases in the lifecycle, the movement can be either forward or backward
FIGURE 2-2 Overview of Data Analytics Lifecycle
1.2 Process overview (2)

 Phase 1—Discovery
o Learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past
o Assesses the resources available to support the project: people, technology, time, and data
o Important activities: framing the business problem as an analytics challenge and formulating initial hypotheses
 Phase 2—Data preparation
o Requires the presence of an analytic sandbox
o Execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. Data is transformed in the ETLT process so the team can work with it and analyze it (a minimal sketch follows below).
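As a hedged illustration of this ELT step, the sketch below extracts a file, loads it unchanged into a sandbox table, and then builds a transformed working copy. The file path, table names, and columns are illustrative assumptions, not anything prescribed by the chapter.

```python
# Minimal ELT sketch: extract a CSV, load it raw into a sandbox table,
# then transform a working copy for analysis. Names are illustrative only.
import sqlite3
import pandas as pd

sandbox = sqlite3.connect("analytic_sandbox.db")   # stand-in for the team's sandbox

# Extract: pull the raw file as-is (assumed path and schema)
raw = pd.read_csv("exports/sales_extract.csv")

# Load: preserve the untouched raw data in the sandbox first (the "EL" of ELT)
raw.to_sql("raw_sales", sandbox, if_exists="replace", index=False)

# Transform: build an analysis-ready copy without altering the raw table
clean = raw.copy()
clean.columns = [c.strip().lower() for c in clean.columns]   # normalize headers
clean = clean.drop_duplicates()
clean.to_sql("sales_clean", sandbox, if_exists="replace", index=False)
```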
1.2 Process overview (3)

 Phase 3—Model planning
o Determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase
o Explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models
 Phase 4—Model building
o Develops datasets for testing, training, and production purposes
o Builds and executes models based on the work done in the model planning phase
o Considers whether its existing tools will suffice for running the models
1.2 Process overview (4)

 Phase 5—Communicate results
o In collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1
o Identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
 Phase 6—Operationalize
o Delivers final reports, briefings, code, and technical documents
o Runs a pilot project to implement the models in a production environment
2. Phase 1: Discovery

 Learning the Business Domain


 Resources (technology, tools, systems, data, and
people)
 Framing the Problem
 Identifying Key Stakeholders
 Interviewing the Analytics Sponsor
 Developing Initial Hypotheses
 Identifying Potential Data Sources
2.1 Learning the Business Domain

 Learn and investigate the problem, develop context and understanding, and learn about the data sources needed and available for the project
 Formulate initial hypotheses that can later be tested with data
 Data scientists generally have deep computational and quantitative knowledge broadly applied across many disciplines
o Deep knowledge of the methods, techniques, and ways for applying heuristics to a variety of business and conceptual problems
o Deep knowledge of a domain area, coupled with quantitative expertise
2.2 Resources

 Assess the resources available to support the project: technology, tools, systems, data, and people
 Consider the available tools and technology the team will be using and the types of systems needed for later phases to operationalize the models
 What types of skills and roles will be needed for the recipients of the model being developed? The answer will influence the techniques the team selects and the kind of implementation the team chooses to pursue in subsequent phases of the Data Analytics Lifecycle
 Computing resources: consider the types of data available and whether the team needs to collect additional data, purchase it from outside sources, or transform existing data
2.3 Framing the Problem

 Framing is the process of stating the analytics problem to be solved.


 A best practice is to write down the problem statement and share it with
the key stakeholders
 Identify main objectives of the project, identify what needs to be achieved
in business terms, and identify what needs to be done to meet the needs.
Need to consider the objectives and the success criteria for the project
 What is the team attempting to achieve by doing the project, and what will
be considered “good enough” as an outcome of the project?
 Need to document and share with the project team and key stakeholders
 The best practice is to share the statement of goals and success criteria
with the team and confirm alignment with the project sponsor’s
expectations
2.4 Identifying Key Stakeholders

 An important step is to identify the key stakeholders and their interests in the project
 Identify the success criteria, key risks, and stakeholders,
which should include anyone who will benefit from the
project or will be significantly impacted by the project
 When interviewing stakeholders, learn about the domain
area and any relevant history from similar analytics
projects.
 Critical to articulate the pain points as clearly as possible
to address them and be aware of areas to pursue or avoid
as the team gets further into the analytical process
2.5 Interviewing the Analytics Sponsor (1)

 When interviewing the main stakeholders, the team needs to take time to thoroughly interview the project sponsor, who funds the project or provides the high-level requirements.
 It is critical to thoroughly understand the sponsor’s
perspective to guide the team in getting started on
the project.
2.5 Interviewing the Analytics Sponsor (2)

 Some tips for interviewing project sponsors:


o Prepare for the interview; draft questions, and review with colleagues.
o Use open-ended questions; avoid asking leading questions.
o Probe for details and pose follow-up questions.
o Avoid filling every silence in the conversation; give the other person time to
think.
o Let the sponsors express their ideas and ask clarifying questions, such as “Why?
Is that correct? Is this idea on target? Is there anything else?”
o Use active listening techniques; repeat back what was heard to make sure the
team heard it correctly, or reframe what was said.
o Try to avoid expressing the team’s opinions, which can introduce bias; instead,
focus on listening.
o Be mindful of the body language of the interviewers and stakeholders; use eye
contact where appropriate, and be attentive.
o Minimize distractions.
o Document what the team heard, and review it with the sponsors.
2.5 Interviewing the Analytics Sponsor (3)

 Common questions that are helpful to ask during the discovery phase when interviewing the project sponsor:
o What business problem is the team trying to solve?
o What is the desired outcome of the project?
o What data sources are available?
o What industry issues may impact the analysis?
o What timelines need to be considered?
o Who could provide insight into the project?
o Who has final decision-making authority on the project?
o How will the focus and scope of the problem change if the following dimensions change: time, people, risk, resources, and the size and attributes of data?
2.6 Developing Initial Hypotheses

 Developing a set of Initial Hypotheses (IHs) is a key facet of the discovery phase; it involves forming ideas that the team can test with data
 It is best to come up with a few primary hypotheses to test and then be creative about developing several more
 These IHs form the basis of the analytical tests the team will use in later
phases and serve as the foundation for the findings in Phase 5
 The team can compare its answers with the outcome of an experiment or test to
generate additional possible solutions to problems. As a result, the team
will have a much richer set of observations to choose from and more
choices for agreeing upon the most impactful conclusions from a project
 Another part of this process involves gathering and assessing hypotheses
from stakeholders and domain experts who may have their own
perspective on what the problem is, what the solution should be, and how
to arrive at a solution
2.7 Identifying Potential Data Sources

 Five main activities during this step of the discovery phase:
o Identify data sources
o Capture aggregate data sources
o Review the raw data
o Evaluate the data structures and tools needed
o Scope the sort of data infrastructure needed for this type of
problem
3. Phase 2: Data Preparation

 Preparing the Analytic Sandbox


 Performing ETLT
 Learning About the Data
 Data Conditioning
 Survey and Visualize
 Common Tools for the Data Preparation Phase
4. Phase 3: Model Planning

 Data Exploration and Variable Selection


 Model Selection
 Common Tools for the Model Planning Phase
4.1 Data Exploration and Variable Selection

 To understand the relationships among the variables to inform selection of the variables and methods, and
 To understand the problem domain
 A common way to do this is to use tools to perform data visualizations (a minimal sketch follows below)
 Stakeholders and subject matter experts have instincts and hunches about what the data science team should consider and analyze; they have a good grasp of the problem and domain but may not be aware of the subtleties within the data or the model needed to accept or reject a hypothesis
 Approach problems with an unbiased mind-set and be ready to question all assumptions
 Question the incoming assumptions and test the initial ideas of the project sponsors and stakeholders
 Depending on the objectives, consider an alternate method, reduce the number of data inputs, or transform the inputs to allow the team to use the best method for a given business problem
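The minimal sketch promised above: one quick, hedged way to explore variable relationships is to rank candidate predictors by their correlation with a target. The DataFrame, its columns, and the target name "outcome" are assumptions for illustration, not data from the chapter.

```python
# Quick variable-relationship scan: rank numeric columns by absolute
# correlation with an assumed target column named "outcome".
import pandas as pd

def rank_candidate_variables(df: pd.DataFrame, target: str = "outcome") -> pd.Series:
    """Return numeric columns ordered by |correlation| with the target."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Example usage with a toy frame standing in for sandbox data
df = pd.DataFrame({
    "outcome":   [1, 0, 1, 1, 0, 1],
    "tenure":    [5, 1, 6, 4, 2, 7],
    "purchases": [3, 1, 4, 3, 0, 5],
    "region_id": [2, 2, 1, 3, 1, 2],
})
print(rank_candidate_variables(df))
```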
4.2 Model Selection

 The main goal is to choose an analytical technique, or a short list of candidate techniques, based on the end goal of the project (a comparison sketch follows below)
 A model simply refers to an abstraction from reality
 The team observes events happening in real data and constructs models that emulate this behavior with a set of rules and conditions
 In data mining and machine learning, rules and conditions are grouped into several general sets of techniques, such as classification, association rules, and clustering
 The team identifies and documents the modeling assumptions it makes as it chooses and constructs preliminary models
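The comparison sketch referenced above: a hypothetical shortlist of two classification candidates scored with cross-validation. The synthetic dataset, the candidate models, and the accuracy criterion are all assumptions made for illustration.

```python
# Shortlisting candidate techniques: score two classifiers with cross-validation
# and keep the stronger one. The data is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Mean cross-validated accuracy per candidate technique
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```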
4.3 Common Tools for the Model Planning Phase

 R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code
 SQL Analysis services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models (a stand-in sketch follows below)
 SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB
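The stand-in sketch referenced above illustrates the general in-database idea only; it assumes nothing about SQL Server Analysis Services itself. SQLite is used purely as a placeholder engine, and the table and columns are invented.

```python
# Stand-in for in-database analytics: push an aggregation to the database
# instead of pulling raw rows out. SQLite here is only a placeholder engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ideas (region TEXT, year INTEGER, funded INTEGER);
    INSERT INTO ideas VALUES
        ('EMEA', 2011, 1), ('EMEA', 2011, 0), ('APJ', 2011, 1), ('AMER', 2011, 0);
""")

# The aggregation runs inside the database; only summary rows come back.
for row in conn.execute("""
        SELECT region, COUNT(*) AS submissions, SUM(funded) AS funded_ideas
        FROM ideas
        GROUP BY region
        ORDER BY submissions DESC"""):
    print(row)
```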
5. Phase 4: Model Building (1)

 Develop datasets for training, testing, and production purposes
 The analytical model is developed and fit on the training data and evaluated (scored) against the test data (see the sketch below)
 Model planning and model building can overlap quite a bit, and in practice one can iterate back and forth between the two phases for a while before settling on a final model.
 Execute the models defined in Phase 3.
FIGURE 2-6 Model building phase
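The sketch referenced above shows the basic fit-on-training, score-on-test pattern. The synthetic dataset, the logistic regression model, and the 70/30 split are illustrative assumptions rather than the chapter's prescription.

```python
# Fit on training data, evaluate (score) on held-out test data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a test set; a production scoring set would be kept separate again.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```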
5. Phase 4: Model Building (2)

 Questions to consider include these:


o Does the model appear valid and accurate on the test data?
o Does the model output/behavior make sense to the domain
experts? That is, does it appear as if the model is giving answers
that make sense in this context?
o Do the parameter values of the fitted model make sense in the
context of the domain?
o Is the model sufficiently accurate to meet the goal?
o Does the model avoid intolerable mistakes?
o Are more data or more inputs needed? Do any of the inputs need
to be transformed or eliminated?
o Will the kind of model chosen support the runtime requirements?
o Is a different form of the model required to address the business
problem? If so, go back to the model planning phase and revise
the modeling approach.
5.1 Common Tools for the Model Building Phase

 Commercial Tools
o SAS Enterprise Miner
o SPSS Modeler
o MATLAB
o Alpine Miner
o STATISTICA and Mathematica
 Free or Open Source tools
o R and PL/R
o Octave
o WEKA
o Python
o SQL in-database implementations, such as MADlib
6. Phase 5: Communicate Results

 Consider how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats, assumptions, and any limitations of the results
 Determine if the team succeeded or failed in its objectives
 Make sure the analysis is robust rather than searching for ways to show results when the results may not be there
 By this phase, the team will have determined which model or models address the analytical challenge in the most appropriate way
 The team will also have ideas of some of the findings as a result of the project
7. Phase 6: Operationalize (1)

 Communicate the benefits of the project more broadly and set up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users
 Approach deploying the new analytical methods or models in a production environment
 Learn by undertaking a small-scope pilot deployment before a wide-scale rollout, to understand the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.
7. Phase 6: Operationalize (2)

FIGURE 2-9 Key outputs from a successful analytics project
8. Case Study: Global Innovation Network
and Analysis (GINA)

 EMC’s Global Innovation Network and Analytics (GINA) team is a group of senior technologists located in centers of excellence (COEs) around the world.
 Team’s charter
o To engage employees across global COEs to drive innovation, research, and university
partnerships. In 2012, a newly hired director wanted to improve these activities and
provide a mechanism to track and analyze the related information.
o To create more robust mechanisms for capturing the results of its informal
conversations with other thought leaders within EMC, in academia, or in other
organizations, which could later be mined for insights
 Provide a means to share ideas globally and increase knowledge sharing
among GINA members who may be separated geographically. It planned to
create a data repository containing both structured and unstructured data to
accomplish three main goals.
o Store formal and informal data.
o Track research from global technologists.
o Mine the data for patterns and insights to improve the team’s operations and strategy.
8.1. Phase 1: Discovery (1)

 Identifying data sources


o Although GINA was a group of technologists skilled in many different aspects
of engineering, it had some data and ideas about what it wanted to explore
but lacked a formal team that could perform these analytics.
o After consulting with various experts including Tom Davenport, a noted expert
in analytics at Babson College, and Peter Gloor, an expert in collective
intelligence and creator of CoIN (Collaborative Innovation Networks) at MIT,
the team decided to crowdsource the work by seeking volunteers within EMC.
 Various roles on the working team were fulfilled.
o Business User, Project Sponsor, Project Manager: Vice President from Office of
the CTO
o Business Intelligence Analyst: Representatives from IT
o Data Engineer and Database Administrator (DBA): Representatives from IT
o Data Scientist: Distinguished Engineer, who also developed the social graphs
shown in the GINA case study
8.1. Phase 1: Discovery (2)

 Two main categories of data


o First category represented five years of idea submissions from EMC’s
internal innovation contests, known as the Innovation Roadmap
(formerly called the Innovation Showcase). Data is a mix of structured
data, such as idea counts, submission dates, inventor names, and
unstructured content, such as the textual descriptions of the ideas
themselves.
o Second category of data encompassed minutes and notes representing
innovation and research activity from around the world. This also
represented a mix of structured and unstructured data.
 Structured data included attributes such as dates, names, and geographic
locations.
 Unstructured documents contained the “who, what, when, and where”
information that represents rich data about knowledge growth and transfer within
the company. This type of information is often stored in business silos that have
little to no visibility across disparate research teams.
8.1. Phase 1: Discovery (3)

 10 main IHs that the GINA team developed were as follows:


o IH1: Innovation activity in different geographic regions can be
mapped to corporate strategic directions
o IH2: The length of time it takes to deliver ideas decreases when
global knowledge transfer occurs as part of the idea delivery
process.
o IH3: Innovators who participate in global knowledge transfer
deliver ideas more quickly than those who do not.
o IH4: An idea submission can be analyzed and evaluated for the
likelihood of receiving funding.
o IH5: Knowledge discovery and growth for a particular topic can be
measured and compared across geographic regions.
o IH6: Knowledge transfer activity can identify research-specific
boundary spanners in disparate regions.
8.1. Phase 1: Discovery (4)

 10 main IHs that the GINA team developed were as follows:


o IH7: Strategic corporate themes can be mapped to geographic regions.
o IH8: Frequent knowledge expansion and transfer events reduce the
time it takes to generate a corporate asset from an idea.
o IH9: Lineage maps can reveal when knowledge expansion and transfer did not (or has not) result in a corporate asset.
o IH10: Emerging research topics can be classified and mapped to
specific ideators, innovators, boundary spanners, and assets.
 The GINA (IHs) can be grouped into two categories:
o Descriptive analytics of what is currently happening to spark further
creativity, collaboration, and asset generation
o Predictive analytics to advise executive management of where it
should be investing in the future
8.2 Phase 2: Data Preparation

 The team partnered with its IT department to set up a new analytics sandbox to store and experiment on the data. During the data exploration exercise, the data scientists and data engineers began to notice that certain data needed conditioning and normalization. In addition, the team realized that several missing datasets were critical to testing some of the analytic hypotheses.
 As the team explored the data, it quickly realized that if it did not have data of sufficient quality or could not get good quality data, it would not be able to perform the subsequent steps in the lifecycle process. As a result, it was important to determine what level of data quality and cleanliness was sufficient for the project being undertaken. In the case of GINA, the team discovered that many of the names of the researchers and people interacting with the universities were misspelled or had leading and trailing spaces in the datastore (a conditioning sketch follows below). Seemingly small problems such as these in the data had to be addressed in this phase to enable better analysis and data aggregation in subsequent phases.
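The conditioning sketch referenced above: a small, assumed example of trimming stray whitespace and normalizing casing so that records for the same researcher aggregate together. The names are invented, not GINA data.

```python
# Name conditioning: trim stray whitespace, collapse internal spaces, and
# normalize casing so records for the same researcher aggregate together.
import re
import pandas as pd

def normalize_name(name: str) -> str:
    name = re.sub(r"\s+", " ", name.strip())   # trim and collapse whitespace
    return name.title()                        # consistent casing

raw_names = pd.Series(["  alice  o'brien", "Alice O'Brien ", "BOB  SMITH"])
clean_names = raw_names.map(normalize_name)
print(clean_names.value_counts())   # the two Alice variants now count as one
```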
8.3 Phase 3: Model Planning (1)

 For much of the dataset, it seemed feasible to use social network analysis techniques to look at the networks of innovators within EMC.
 In other cases, it was difficult to come up with appropriate ways to test hypotheses due to the lack of data. In one case (IH9), the team made a decision to initiate a longitudinal study to begin tracking data points over time regarding people developing new intellectual property. This data collection would enable the team to test the following two ideas in the future:
o IH8: Frequent knowledge expansion and transfer events reduce the amount of time it takes to generate a corporate asset from an idea.
o IH9: Lineage maps can reveal when knowledge expansion and transfer did not (or has not) result in a corporate asset.
 The team needed to establish goal criteria for the study, such as the end goal of a successful idea that had traversed the entire journey
8.3 Phase 3: Model Planning (2)

 Parameters related to the scope of the study included the following considerations:
o Identify the right milestones to achieve this goal.
o Trace how people move ideas from each milestone toward the goal.
o Once this is done, trace ideas that die, and trace others that reach the goal. Compare the journeys of ideas that make it and those that do not.
o Compare the times and the outcomes using a few different methods (depending on how the data is collected and assembled). These could be as simple as t-tests or perhaps involve different types of classification algorithms (a t-test sketch follows below).
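The t-test sketch referenced above: a two-sample comparison of idea delivery times (in weeks) for groups with and without global knowledge transfer. The sample values are invented placeholders, not GINA data.

```python
# Simple two-sample t-test comparing time-to-delivery (in weeks) for ideas
# with and without global knowledge transfer. The samples are invented.
from scipy import stats

with_transfer    = [12, 14, 11, 13, 10, 15, 12]
without_transfer = [18, 16, 20, 17, 19, 21, 18]

# Welch's t-test (does not assume equal variances between the two groups)
t_stat, p_value = stats.ttest_ind(with_transfer, without_transfer, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```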
8.4 Phase 4: Model Building (1)

 In Phase 4, the GINA team employed several analytical methods:
o Natural Language Processing (NLP) techniques on the textual descriptions of the Innovation Roadmap ideas
o Social network analysis using R and RStudio; the data scientist then developed social graphs and visualizations of the network of communications related to innovation using R’s ggplot2 package (a stand-in sketch follows below). Examples of this work are shown in Figures 2-10 and 2-11.
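The GINA team did this work in R with ggplot2; as a hedged stand-in only, the sketch below expresses the same idea in Python with networkx, building a small invented communication graph and ranking people by degree centrality to surface potential influencers.

```python
# Social network sketch (stand-in for the team's R/ggplot2 work): build a
# communication graph and rank people by degree centrality. Edges are invented.
import networkx as nx

edges = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("erin", "alice"), ("erin", "frank"),
]
G = nx.Graph(edges)

# Degree centrality: a simple proxy for who communicates with the most people
centrality = nx.degree_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{person}: {score:.2f}")
```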
8.4 Phase 4: Model Building (2)

FIGURE 2-10 Social graph [27] visualization of idea submitters and finalists
FIGURE 2-11 Social graph visualization of top innovation influencers


8.5 Phase 5: Communicate Results (1)

 The team found several ways to cull the results of the analysis and identify
the most impactful and relevant findings. This project was considered
successful in identifying boundary spanners and hidden innovators.
 As a result, the CTO office launched longitudinal studies to begin data
collection efforts and track innovation results over longer periods of
time. The GINA project promoted knowledge sharing related to
innovation and researchers spanning multiple areas within the
company and outside of it.
 GINA also enabled EMC to cultivate additional intellectual property
that led to additional research topics and provided opportunities to
forge relationships with universities for joint academic research in the
fields of Data Science and Big Data. In addition, the project was
accomplished with a limited budget, leveraging a volunteer force of
highly skilled and distinguished engineers and data scientists.
8.5 Phase 5: Communicate Results (2)

 One of the key findings from the project is that there was a disproportionately
high density of innovators in Cork, Ireland. Each year, EMC hosts an innovation
contest, open to employees to submit innovation ideas that would drive new
value for the company. When looking at the data in 2011, 15% of the finalists and
15% of the winners were from Ireland.
 These are unusually high numbers, given the relative size of the Cork COE
compared to other larger centers in other parts of the world. After further
research, it was learned that the COE in Cork, Ireland had received focused
training in innovation from an external consultant, which was proving effective.
The Cork COE came up with more innovation ideas, and better ones, than it had
in the past, and it was making larger contributions to innovation at EMC. It would
have been difficult, if not impossible, to identify this cluster of innovators through
traditional methods or even anecdotal, word-of-mouth feedback.
 Applying social network analysis enabled the team to find a pocket of people
within EMC who were making disproportionately strong contributions. These
findings were shared internally through presentations and conferences and
promoted through social media and blogs.
8.6 Phase 6: Operationalize (1)

 Key findings from the project include these:


o The CTO office and GINA need more data in the future, including a
marketing initiative to convince people to inform the global
community on their innovation/research activities.
o Some of the data is sensitive, and the team needs to consider
security and privacy related to the data, such as who can run the
models and see the results.
o In addition to running models, a parallel initiative needs to be
created to improve basic Business Intelligence activities, such as
dashboards, reporting, and queries on research activities
worldwide.
o A mechanism is needed to continually reevaluate the model after deployment. Assessing the benefits is one of the main goals of this stage, as is defining a process to retrain the model as needed (a minimal sketch follows below).
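The minimal sketch promised above: one simple way to frame continual reevaluation is a scheduled check that flags retraining when live accuracy drifts too far below the baseline agreed in Phase 1. The threshold, metric, labels, and function name are all hypothetical assumptions.

```python
# Hypothetical post-deployment check: compare the model's recent accuracy to a
# baseline agreed in Phase 1 and flag retraining when it degrades too far.
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.80   # assumed success criterion from Phase 1
TOLERATED_DROP = 0.05      # assumed drift tolerance

def needs_retraining(y_true, y_pred) -> bool:
    """Return True when live accuracy falls too far below the baseline."""
    live_accuracy = accuracy_score(y_true, y_pred)
    return live_accuracy < BASELINE_ACCURACY - TOLERATED_DROP

# Example with placeholder labels and predictions from the last scoring run
print(needs_retraining([1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 0, 0, 0]))
```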
8.6 Phase 6: Operationalize (2)

 Table 2-3 outlines an analytics plan for the GINA case study example.
Summary

 The Data Analytics Lifecycle is an approach to managing and executing analytical projects. This approach describes the process in six phases.
o Discovery
o Data preparation
o Model planning
o Model building
o Communicate results
o Operationalize
 Through these steps, data science teams can identify problems and perform rigorous investigation of the datasets needed for in-depth analysis
