0% found this document useful (0 votes)
65 views7 pages

CRISP DM For Data Science

CRISP-DM is a popular framework for executing data science projects that provides a natural description of the data science life cycle. It consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. While providing a task-focused approach, it fails to address team and communication issues. The framework should be combined with other approaches to address its weaknesses.

Uploaded by

mohityy
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
65 views7 pages

CRISP DM For Data Science

CRISP-DM is a popular framework for executing data science projects that provides a natural description of the data science life cycle. It consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. While providing a task-focused approach, it fails to address team and communication issues. The framework should be combined with other approaches to address its weaknesses.

Uploaded by

mohityy
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

EVALUATING

CRISP-DM FOR
DATA SCIENCE

How can you use the classic data


science life cycle on your next
project?

Data Science Process Alliance


Integrating data science process effectiveness research with industry leading agile training expertise
Data Science Process Alliance
Integrating data science process effectiveness research with industry leading agile training expertise

Executive Summary
What is CRISP-DM?
Published in 1999, CRISP-DM (CRoss Industry
Standard Process for Data Mining (CRISP-DM)
is the most popular framework for executing
data science projects. It provides a natural
description of a data science life cycle (the
workflow in data-focused projects).

However, this task-focused approach for


executing projects fails to address team and
communication issues. Thus, CRISP-DM should
be combined with other team coordination
frameworks. Results of a 2020 DSPA poll on the use of
data science process frameworks.

Six Phases
1. Business understanding
What does the business need?
2. Data understanding
What data do we have / need? Is it clean?
3. Data preparation
How do we organize the data for modeling?
4. Modeling
What modeling techniques should we apply?
5. Evaluation
What best meets the business objectives?
6. Deployment
How do stakeholders access the results?

How can you use CRISP-DM on your next Project?


Every project, team, and organization is
Evaluating CRISP-DM
unique. So to evaluate CRISP-DM for your next
project, first review its key concepts. Then, 1. Review the CRISP-DM framework
assess its strengths and weaknesses. Finally, 2. Explore Strengths & Weaknesses
consider some keys tips for its use. 3. Actions to consider

www.DataScience-PM.com © Data Science Process Alliance 2021


EVAUATING CRISP - DM FOR DATA SCIENCE

Reviewing CRISP-DM
Diving into the CRISP-DM Phases
I. Business Understanding
The Business Understanding phase focuses on understanding the objectives and requirements of the project.
While many teams hurry through this phase, establishing a strong business understanding is like building the
foundation of a house – absolutely essential. Aside from the third task, the three other tasks in this phase are
foundational project management activities that are universal to most projects:
1. Determine business objectives: understand what the customer / client is trying to
achieve, including the business success criteria.
2. Assess situation: Determine resources availability, project requirements, assess
risks and contingencies, and conduct a cost-benefit analysis.
3. Determine project goals: In addition to defining the business objectives, you should
also define what success looks like from a technical data mining perspective.
4. Produce project plan: Select technologies and tools and define detailed plans for
each project phase.

II. Data Understanding


Adding to the foundation of Business Understanding, the Data Understanding phase focuses on identifying,
collecting, and analyzing data sets that can help the project. This phase also has four tasks:
1. Collect initial data: Acquire the necessary data and (if necessary)
load it into your analysis tool.
2. Describe data: Examine the data and document its surface
properties like data format, number of records, or field identities.
3. Explore data: Dig deeper into the data. Query it, visualize it, and
identify relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document any
quality issues.

III. Data Preparation


This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. A common
rule of thumb is that 50% to 80% of the project effort is in the data preparation phase. This phase has five tasks:

1. Select data: Determine which data sets will be used and document reasons
for inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to
garbage-in, garbage-out. A common practice during this task is to correct,
impute, or remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example, derive
someone’s body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple
sources.
5. Format data: Re-format data as necessary. For example, you might convert
string values that store numbers to numeric values so that you can perform
mathematical operations.

www.DataScience-PM.com © Data Science Process Alliance 2021


EVAUATING CRISP - DM FOR DATA SCIENCE

Reviewing CRISP-DM
Diving into the CRISP-DM Phases
IV. Modeling
Modeling is often regarded as data science’s most exciting work. In this phase, the team builds and assesses
various models based, often using several different modeling techniques. Although the CRISP-DM guide suggests
to “iterate model building and assessment until you strongly believe that you have found the best model(s)”, in
practice teams might iterating until they have a “good enough” model. This phase has four tasks:
1. Select modeling techniques: Determine which algorithms to try (e.g. regression,
neural net).
2. Generate test design: Pending your modeling approach, you might need to split the
data into training, test, and validation sets.
3. Build model: As glamorous as this might sound, this might just be executing a few
lines of code like “reg = LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing against each other, and
the data scientist needs to interpret the model results based on domain knowledge,
the pre-defined success criteria, and the test design.

V. Evaluation
Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation
phase looks more broadly at which model best meets the business and what to do next. This phase has three
tasks:
1. Evaluate results: Do the models meet the business success criteria?
Which one(s) should we approve for the business?
2. Review process: Review the work accomplished. Was anything
overlooked? Were all steps properly executed? Summarize findings
and correct anything if needed.
3. Determine next steps: Based on the previous three tasks, determine
whether to proceed to deployment, iterate further, or initiate new
projects.

VI. Deployment
A model is not particularly useful unless the customer can access its results. So, deployment should be thought of
in terms of what does it take to actually use the results of the project. Depending on the project, this can be as
simple as sharing a report or as complex as implementing a live real-time predictive model. This final phase has
four tasks:

1. Plan deployment: Develop and document a plan for deploying the model.
2. Plan monitoring and maintenance: Develop a thorough monitoring and
maintenance plan to avoid issues during the operational phase (or post-project
phase) of a model.
3. Produce final report: The project team documents a summary of the project
which might include a final presentation of data mining results.
4. Review project: Conduct a project retrospective about what went well, what
could have been better, and how to improve in the future.

www.DataScience-PM.com © Data Science Process Alliance 2021


EVAUATING CRISP - DM FOR DATA SCIENCE

Analyzing CRISP-DM
Strengths and Weaknesses
Strengths & Benefits
Common Sense: Data scientists naturally follow a CRISP-DM-like Key Strengths:
process. When people are asked to do a data science project without
project management direction, they tend toward a CRISP-like
Common sense steps
methodology and can easily identify with the CRISP-DM phases and
doing iterations.
Cyclical: CRISP-DM can support the iterative nature of data science Easy to understand
(but how to actually do iterations is not defined)
Adopt-able: CRISP-DM can be implemented without much training, Defines a shared
organizational role changes, or controversy.
vocabulary for the
Right Start: The initial focus on Business Understanding, an often-
overlooked step, is helpful to align technical work with business steps in a project
needs and to steer data scientists away from jumping into a problem
without properly understanding business objectives.
Flexible: A loose CRISP-DM implementation can be flexible to
provide many of the benefits of agile principles and practices. By
accepting that a project starts with significant unknowns, the user can
cycle through steps, each time gaining a deeper understanding of the
data and the problem. The empirical knowledge learned from
previous cycles can then feed into the following cycles.

Weaknesses & Challenges


Not a Team Coordination Framework: Perhaps most significantly, Key Weaknesses:
CRISP-DM is not a true project management methodology because it
implicitly assumes that its user is a single person or small, tight-knit
team and ignores the teamwork coordination necessary for larger
Not clear when to "loop
projects. back" to a previous
Can ignore stakeholders: CRISP-DM phases and tasks can be phase
done with minimal input from stakeholders.
Outdated: CRISP-DM has not been updated since 1999 and is
Missing phases
criticized for not meeting the considerations of modern big data
science projects (e.g., operational support). (operational support)
Documentation Heavy: The full-fledged CRISP-DM approach
requires a lot of time-consuming documentation (although most No structured
teams seem to skip much of it). In fact, nearly every task has a
communication with
documentation step. While documenting one’s work is key in a
mature process, CRISP-DM’s documentation requirements might
stakeholders
unnecessarily slow the team from actually delivering increments.
Slow starts: The process matches closely with building a waterfall-
like approach, which could delay business value delivery by spending
too much time on the early phases, without incremental learning.

www.DataScience-PM.com © Data Science Process Alliance 2021


EVAUATING CRISP - DM FOR DATA SCIENCE

Going Forward
Key Actions to Consider
1. Combine with a team coordination process
There needs to be a mechanism for the team to communicate and prioritize work.
The team process should define how the team communicates, prioritizes tasks and
“loops back” to previous project phases.
Teams can leverage the CRISP-DM phases, and then use a framework such as
Scrum, Kanban or Data Driven Scrum to prioritize potential tasks.

2. Ensure multiple experiments / iterations


Iterate quickly and do not fall get pulled into a waterfall of sequential work.
Rather, try to deliver thin vertical slices of end-to-end value. Your first deliverable
might not be too useful. That’s okay. Iterate.
While it’s important to do multiple iterations, each team needs to think through how
iterations are defined and then evaluated.

3. Define team roles


CRISP-DM does not include roles (nor a team).
Data science efforts are increasingly a team sport.
Roles can include stakeholders / product owners (to ensure the insight is
actionable), as well as a process expert.

4. Ensure actionable insight


CRISP-DM lacks a communication structure with stakeholders.
How does the team ensure actionable insight?
Be sure to communicate and set expectations with stakeholders frequently.

5. Add phases (if needed) and define the subitems within each phase
Add steps or phases for practices like git version control and ML ops.
Be clear how tasks (within a phase) are defined.
Some tasks that should be explicitly discussed include: bias checks, accuracy
assessments, business validation, and dev dicussions.

6. Document enough…but not too much


CRISP-DM can be documentation heavy; for example, CRISP-DM calls for 12
reports prior to data collection.
So, do what’s reasonable and appropriate but don’t go overboard.

www.DataScience-PM.com © Data Science Process Alliance 2021


Data Science Process Alliance
Training & Consulting

About the Data Science Process Alliance


Combining data science process research with industry leading agile training, the Data Science Process Alliance
(www.DataScience-PM.com) is the only organization dedicated specifically to improving data science project management.

Training and Certification


Better Process
Delivers Improved The Data Science Team Lead™

Results (DSTL) course provides in-depth,


comprehensive, and actionable

EFFICIENCY training in data science project


Repeatable processes drive process management. DSTL members will
efficiency and help ensure the be ready to lead data science
highest value analyses are explored projects.

Course Components
Get individualized consulting w/ four 30-min one-on-one sessions
Access 6 hours of on-demand video content
VALIDITY Examine a real-world case study
Help ensure accurate, fair, non-
Discuss trends, blogs and other emerging items
biased, and where needed,
transparent results Get exclusive Training on Data Driven Scrum
Access a library of curated resources (white papers, videos, etc)
Differentiate yourself with the DSTL certification

Register
ACTIONABLE INSIGHTS
Ensure insights that are actionable
and understood by stakeholders via
better communication and Private Group Training
coordination Is your team looking to more effectively manage data science projects?
Have the team go through a cohort of the DSTL class. Or we can
customize a training program, specific to your organizational needs.
Contact us to learn more.

Consulting
Unlike other consultants, all we focus on is data science project
management. Contact us to learn what better project management can
mean for your organization.

www.DataScience-PM.com © Data Science Process Alliance 2021

You might also like