
Martínez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Kull, M., Lachiche, N., Ramírez-Quintana, M. J., & Flach, P. A. (2019). CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Transactions on Knowledge and Data Engineering. Advance online publication. https://doi.org/10.1109/TKDE.2019.2962680

Peer reviewed version. This is the author accepted manuscript (AAM); the final published version (version of record) is available online via IEEE at https://ieeexplore.ieee.org/document/8943998 (DOI: 10.1109/TKDE.2019.2962680). This document is made available in accordance with publisher policies; please cite only the published version using the reference above. Full terms of use: http://www.bristol.ac.uk/red/research-policy/pure/user-guides/brp-terms/ (University of Bristol – Bristol Research Portal).

CRISP-DM Twenty Years Later:
From Data Mining Processes to Data Science Trajectories

Fernando Martínez-Plumed, Lidia Contreras-Ochando, Cèsar Ferri, José Hernández-Orallo, Meelis Kull, Nicolas Lachiche, María José Ramírez-Quintana and Peter Flach

Abstract—CRISP-DM (CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus
about two decades old. According to many surveys and user polls it is still the de facto standard for developing data mining and
knowledge discovery projects. However, undoubtedly the field has moved on considerably in twenty years, with data science now the
leading term being favoured over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for
purpose for data science projects. We argue that if the project is goal-directed and process-driven the process model view still largely
holds. On the other hand, when data science projects become more exploratory the paths that the project can take become more
varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it
can be used to categorise data science projects (goal-directed, exploratory or data management). We examine seven real-life
exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big
Data Public Working Group. We anticipate this categorisation can help project planning in terms of time and cost characteristics.

Index Terms—Data Science Trajectories, Data Mining, Knowledge Discovery Process, Data-driven Methodologies.

1 INTRODUCTION

TOWARDS the end of the previous century, when the systematic application of data mining techniques to extract knowledge from data was becoming more and more common in industry, some companies and institutions saw the need to join forces to identify good practices, as well as common mistakes, from their past experiences. With funding from the European Union, a team of experienced data mining engineers developed a generally applicable data mining methodology which over time would become widely accepted. In 1999 the first version of the CRoss-Industry Standard Process for Data Mining, better known as CRISP-DM, was introduced [1]. This straightforward methodology was conceived to catalogue and guide the most common steps in data mining projects. It soon became the "de facto standard for developing data mining and knowledge discovery projects" [2], and it is still today the most widely-used analytic methodology according to many opinion polls.

In the last two decades the ubiquity of electronic devices and sensors, the use of social networks and the capacity for storing and exchanging these data have all dramatically increased the opportunities for extracting knowledge through data mining projects. The diversity of the data has increased – in origin, format and modalities – and so has the variety of techniques coming from machine learning, data management, visualisation, causal inference and other areas. But, more importantly, compared to twenty years ago there are many more ways in which data can be monetised, through new kinds of applications, interfaces and business models. While the area of deriving value from data has grown exponentially in size and complexity, it has also become much more exploratory under the umbrella of data science. In the latter, data-driven and knowledge-driven stages interact, in contrast to the traditional data mining process, which starts from precise business goals that translate into a clear data mining task, which ultimately converts "data to knowledge". In other words, not only has the nature of the data changed but also the processes for extracting value from it.

Clearly these changes did not happen overnight, and new methodologies have been proposed in the meantime to accommodate some of the changes. For instance, IBM introduced ASUM-DM [3], SAS introduced SEMMA [4], and there are many others, as we will review in more detail in the following section. However, the original CRISP-DM model can still be recognised in these more recent proposals, which remain focused on the traditional paradigm of a sequential list of stages from data to knowledge. We would argue that they are still, in essence, data mining methodologies that do not fully embrace the diversity of data science projects.

F. Martínez-Plumed, L. Contreras-Ochando, C. Ferri, J. Hernández-Orallo and M. J. Ramírez-Quintana are with Universitat Politècnica de València, Spain. E-mail: {fmartinez,liconoc,cferri,jorallo,mramirez}@dsic.upv.es. M. Kull is with the University of Tartu, Estonia. E-mail: [email protected]. N. Lachiche is with the Université de Strasbourg, France. E-mail: [email protected]. P. Flach is with the University of Bristol and the Alan Turing Institute, U.K. E-mail: [email protected].

In this paper we investigate the extent to which, after twenty years, the original CRISP-DM and the underlying data mining paradigm remain applicable for the much wider range of data science projects we see today. We identify new activities in data science, from data simulation to narrative exploration. We propose a general diagram containing the possible activities that can be included in a data science project. Based on examples, we distinguish
particular trajectories through this space that distinguish different kinds of data science projects. We propose that these trajectories can be used as templates for data scientists when planning their data science projects and, in this way, explore new activities that could be added to or removed from their workflows. Together, they represent a new Data Science Trajectories (DST) model.

On the one hand, this DST model represents an important overhaul of the original CRISP-DM initiative. However, we have been careful not to discard CRISP-DM completely, as it still represents one of the most common trajectories in data science: those that go from data to knowledge when there is a clear business goal that translates into a data mining goal. One could say that DST is "backwards compatible" with CRISP-DM, while allowing the considerable additional flexibility that twenty-first-century data science demands.

[Figure 1 shows the six CRISP-DM phases – Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment – sequenced around a central data store.]
Fig. 1. The CRISP-DM process model of data mining.
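The canonical sequence in Figure 1 can be written down as a minimal sketch (ours, not an artifact of the paper; the figure's feedback arrows between phases are deliberately simplified to a linear order here):

```python
# The six CRISP-DM phases in their canonical order (Fig. 1). The actual
# process model also allows feedback loops (e.g. from Evaluation back to
# Business Understanding); this linear list is a simplification.
from typing import Optional

CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modelling",
    "Evaluation",
    "Deployment",
]

def next_phase(phase: str) -> Optional[str]:
    """Return the phase that canonically follows `phase` (None after Deployment)."""
    i = CRISP_DM_PHASES.index(phase)
    return CRISP_DM_PHASES[i + 1] if i + 1 < len(CRISP_DM_PHASES) else None
```

For example, `next_phase("Modelling")` yields `"Evaluation"`, matching the clockwise order of the figure.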
In this paper we identify some other trajectories that capture the common routes of data science projects, but the flexibility of the DST map makes it possible to incorporate current and new methodologies in the development and deployment of data science projects.

The contributions of the paper are the following:
• Recognition of the limitations of the original CRISP-DM and other related methodologies, given the diversity of data science projects today.
• Identification of more exploratory activities that are common in data science but not covered by CRISP-DM, leading to a more flexible and comprehensive DST map.
• Recognition of popular trajectories in this space describing well-known practices in data science, which could be used as templates, making the DST model exemplary rather than prescriptive.
• Some general suggestions on how the DST model can be coupled with actual project management methodologies in order to be customised to different organisations and contexts.

The rest of the paper is organised as follows. Section 2 revisits CRISP-DM and other related variations that have been introduced in the last two decades. The identification of new activities and the formulation of the DST map is included in Section 3. Section 4 illustrates these trajectories on real cases of data science projects, using a precise notation for trajectory charts. In Section 5 we discuss data science project management by considering the three kinds of activities, looking at these seven real cases plus 51 use cases from the NIST Big Data Public Working Group. Section 6 compares the model with software methodologies and the scientific method, suggesting how organisations can couple it with existing and new methodologies, and discusses particular ethical issues and the challenge of data science automation. The appendix includes more detail about the experimental analysis of the 7 + 51 use cases covered in the paper.

2 CRISP-DM AND RELATED PROCESS MODELS

In this section we give a succinct description of the most used and cited data mining and knowledge discovery methodologies, providing for each an overview of its evolution, basis and primary characteristics. For a more comprehensive description of these methodologies we refer the reader to [5], [6]. Fayyad, Piatetsky-Shapiro and Smyth define Knowledge Discovery in Databases (KDD) as "the overall process of knowledge discovery from data, including how the data is stored and accessed, how algorithms can be scaled to massive datasets and still run efficiently, how results can be interpreted and visualised, and how the overall human-machine interaction can be modeled and supported", and data mining as a single step in this process, turning suitably pre-processed data into patterns that can subsequently be turned into valuable and actionable knowledge [7]. However, data mining is often used as a synonym for KDD, and we will not distinguish between the two meanings in this paper.

As already mentioned in the introduction, CRISP-DM [1] can be viewed as the canonical approach from which most of the subsequent proposals have evolved (both for data mining and data science process models). It elaborates and extends the steps in the original KDD proposal into six steps: Business understanding, Data understanding, Data preparation, Modelling, Evaluation, and Deployment. Figure 1 depicts the six steps of CRISP-DM and the way they are sequenced in a typical data mining application.

Several process models and methodologies were developed around the turn of the century using CRISP-DM as a basis, but with varying objectives. Some examples include:
• The Human-Centered Approach to Data Mining [8], [9], which involves a holistic understanding of the entire Knowledge Discovery Process, considering people's involvement and interpretation in each phase, and emphasising that the target user is the data engineer.
• SEMMA [4], which stands for Sample, Explore, Modify, Model and Assess: the proprietary methodology developed by SAS (www.sas.com) to develop data mining products, mainly focused on the technical aspects.
• Cabena's [10] model, used in the marketing and sales domain, this being one of the first process models which took the business objectives into account.
• Buchner's [11] model, adapted to the development of web mining projects and focused on an online customer (incorporating the available operational and materialised data as well as marketing knowledge).
• Two Crows [12], which takes advantage of some insights from (first versions of) CRISP-DM before its release, and proposes a non-linear list of steps (very close to those of KDD), so that it is possible to go back and forth.
• D^3M [13], a domain-driven data mining approach proposed to promote the paradigm shift from "data-centered knowledge discovery" to "domain-driven, actionable knowledge delivery".

There are also some other relevant approaches not directly related to the KDD task. The 5 A's Process [14], originally developed by SPSS (http://www.spss.com/), already included an "Automate" step which helps non-expert users to automate the whole data mining process by applying already defined methods to new data, but it does not contain steps to understand the business objectives and to test data quality. Another approach that tries to assist users in the DM process is [15]. All these were influential for CRISP-DM. In 1996 Motorola developed the 6σ approach [16], which emphasises measurement and statistical control techniques for quality and excellence in management. Another approach is the KDD Roadmap [17], an iterative data mining methodology whose main contribution is the introduction of the "resourcing" task, consisting in the integration of databases from multiple sources to form the operational database.

The evolution of these data mining process models and methodologies is graphically depicted in Figure 2. The arrows in the figure indicate that CRISP-DM incorporates principles and ideas from most of the aforementioned methodologies, while also forming the basis for many later proposals. CRISP-DM is still considered the most complete data mining methodology in terms of meeting the needs of industrial projects, and has become the most widely used process for DM projects according to the KDnuggets polls (https://www.kdnuggets.com/) held in 2002, 2004, 2007 and 2014. In short, CRISP-DM is considered the de facto standard for analytics, data mining, and data science projects.

To corroborate this view from data science experts, we also checked that CRISP-DM is still a very common methodology for data mining applications. For instance, just focussing on the past four years, we can find a large number of conventional studies applying or slightly adapting the CRISP-DM methodology to many different domains: healthcare [18], [19], [20], [21], signal processing [22], engineering [23], [24], education [25], [26], [27], [28], [29], logistics [30], production [31], [32], sensors and wearable applications [33], tourism [34], warfare [35], sports [36] and law [37].

However, things have evolved in the business application of data mining since CRISP-DM was published. Several new methodologies have appeared as extensions of CRISP-DM, showing how it can be modernised without changing it fundamentally. For instance, the CRISP-DM 2.0 Special Interest Group (SIG) was established with the aim of meeting the changing needs of DM with an improved version of the CRISP-DM process. This version was scheduled to appear in the late 2000s, but the group was discontinued before the new version could be delivered. Other examples include:
• Cios et al.'s Six-step discovery process [38], [39], which adapts the CRISP-DM model to the needs of the academic research community (research-oriented descriptions, explicit feedback mechanisms, extension of discovered knowledge to other domains, etc.).
• RAMSYS (RApid collaborative data Mining SYStem) [40], a methodology for developing collaborative DM and KD projects with geographically diverse groups.
• ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics) [3], a methodology which refines and extends CRISP-DM, adding infrastructure, operations, deployment and project management sections as well as templates and guidelines, personalised for IBM's practices.
• CASP-DM [41], which addresses specific challenges of machine learning and data mining for context change and model reuse handling.
• HACE [42], a Big Data processing framework based on a three-tier structure: a "Big Data mining platform" (Tier I), challenges on information sharing and privacy, and Big Data application domains (Tier II), and Big Data mining algorithms (Tier III).

The aforementioned methodologies have in common that they are designed to spend a great deal of time in the business understanding phase, aiming to gather as much information as possible before starting a data mining project. However, the current data deluge as well as the experimental and exploratory nature of data science projects require less rigid, more lightweight and flexible methodologies. In response, big IT companies have introduced similar lifecycles and methodologies for data science projects. For example, in 2015 IBM released the Foundational Methodology for Data Science (FMDS) [43], a 10-stage data science methodology that – although bearing some similarities to CRISP-DM – emphasises a number of new practices such as the use of very large data volumes, the incorporation of text analytics into predictive modelling and the automation of some of the processes. In 2017 Microsoft released the Team Data Science Process (TDSP) [44], an "agile, iterative, data science methodology to deliver predictive analytics solutions and intelligent applications efficiently", intended to improve team collaboration and learning.

At a high level, both FMDS and TDSP have much in common with CRISP-DM. This demonstrates the latter's flexibility, which allows the inclusion of new specific steps (such as analytic and feedback phases/tasks) that are missing in the original proposal. On the other hand, methodologies such as FMDS and TDSP are in essence still data mining methodologies that assume a clearly identifiable goal from the outset. In the next section we argue that data science calls for a much more exploratory mindset.

3 FROM GOAL-DIRECTED DATA MINING PROCESSES TO EXPLORATORY DATA SCIENCE TRAJECTORIES

As is evident from the previous section, the perspective of CRISP-DM and related methodologies is that data mining is a process starting from a relatively clear business goal and data that have already been collected and are available for further computational processing. This kind of process is
[Figure 2: a diagram relating KDD (Piatetsky-Shapiro, 1991; Fayyad et al., 1996) and CRISP-DM (Chapman et al., 2000) to 6-sigma (Harry & Schroeder, 1999; Pyzdek, 2003), SEMMA (SAS Institute, 2015), the Human-Centered approach (Brachman and Anand, 1996; Gertosio and Dussauchoy, 2014), the KDD Roadmap (Debuse et al., 2001), Cabena et al. (1997), Anand & Buchner (Buchner et al., 1999), Two Crows (Two Crows Corporation, 1999), the 5 A's (SPSS, 1999; de Pisón Ascacibar, 2003; SPSS, 2007), Cios et al. (Cios et al., 2000; Cios & Kurgan, 2005), RAMSYS (Moyle & Jorge, 2001), D^3M (Cao, 2010), HACE (X. Wu et al., 2014), ASUM-DM (IBM, 2015), CRISP-DM 2.0, CASP-DM (Martínez-Plumed et al., 2017), FMDS (IBM, 2015) and TDSP (Microsoft, 2016).]
Fig. 2. Evolution of most relevant Data Mining and Data Science models and methodologies (in white and light blue, respectively). KDD and CRISP-DM are the 'canonical' methodologies, depicted in grey. Adapted from [6]. The years are those of the most representative papers, not the years in which the model was introduced.

akin to mining for valuable minerals or metals at a given geographic location where the existence of the minerals or metals has been established: data are the ore, in which valuable knowledge can be found. Whenever this kind of metaphor is applicable, we suggest that CRISP-DM is a good methodology to follow and still holds its own after twenty years.

However, data science is now a much more commonly used term than data mining in the context of knowledge discovery. A quick query on Google Trends shows that the former became a more frequent search term than the latter in early 2016 and is now more than twice as common. So what is data science? There seem to be two broad senses in which the term is used: (a) the science OF data; and (b) applying scientific methods TO data. From the first perspective, data science is seen as an academic subject that studies data in all its manifestations, together with methods and algorithms to manipulate, analyse, visualise and enrich data. It is methodologically close to computer science and statistics, combining theoretical, algorithmic and empirical work. From the second perspective, data science spans both academia and industry, extracting value from data using scientific methods, such as statistical hypothesis testing or machine learning. Here the emphasis is on solving domain-specific problems in a data-driven way. Data are used to build models, design artefacts, and generally increase understanding of the subject. If we wanted to distinguish these two senses then we could call the first theoretical data science and the second applied data science. In this paper we are really concerned with the latter, and henceforth we use the term 'data science' in this applied sense.

The key difference we perceive between data mining twenty years ago and data science today is that the former is goal-oriented and concentrates on the process, while the latter is data-oriented and exploratory. Developed from the goal-oriented perspective, CRISP-DM is all about processes and the different tasks and roles within those processes. It views the data as an ingredient towards achieving the goal – an important ingredient, but not more. In other words, from the data mining perspective, the process takes centre stage. In contrast, in contemporary data science the data take centre stage: we know or suspect there is value in these data; how do we unlock it? What are the possible operations we can apply to the data to unlock and utilise their value? While moving away from the process, the methodology becomes less prescriptive and more inquisitive: things you can do to data rather than things you should do to data.

To continue with the 'mining' metaphor: if data mining is like mining for precious metals, data science is like prospecting: searching for deposits of precious metals where profitable mines can be located. Such a prospecting process is fundamentally exploratory and can include some of the following activities:

Goal exploration: finding business goals which can be achieved in a data-driven way;
Data source exploration: discovering new and valuable sources of data;
Data value exploration: finding out what value might be extracted from the data;
Result exploration: relating data science results to the business goals;
Narrative exploration: extracting valuable stories (e.g., visual or textual) from the data;
Product exploration: finding ways to turn the value extracted from the data into a service or app that delivers something new and valuable to users and customers.

While it is possible to see (weak) links between these exploratory activities and CRISP-DM phases (e.g. goal exploration relates to business understanding, and result exploration relates to modelling and evaluation), the former are typically more open-ended than the CRISP-DM phases. In data science, the order of activities depends on the domain as well as on the decisions and discoveries of the data scientist. For example, after getting unsatisfactory results in data value exploration performed on given data, it might be necessary to do further data source exploration. Alternatively, if no data are given then data source exploration would come before data value exploration. Sometimes neither of these
[Figure 3: an outer ring of exploratory activities (Goal Exploration, Data Source Exploration, Data Value Exploration, Result Exploration, Narrative Exploration, Product Exploration) surrounds an inner ring of CRISP-DM activities (Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment), with the data management activities (Data Acquisition, Data Simulation, Data Architecting, Data Release) at the core.]
Fig. 3. The DST map, containing the outer circle of exploratory activities, inner circle of CRISP-DM (or goal-directed) activities, and at the core the data management activities.
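The grouping of activities in Figure 3 can be encoded as a small lookup table. The sketch below is ours (not an artifact of the paper); the activity names and their three rings follow the figure, while the `kind_of` helper is a hypothetical convenience for anyone tooling around the DST map:

```python
# The three rings of the DST map (Fig. 3), encoded as a lookup table.
# Activity names are taken from the figure; the helper function is ours.
DST_MAP = {
    "exploratory": [  # outer ring
        "Goal Exploration", "Data Source Exploration", "Data Value Exploration",
        "Result Exploration", "Narrative Exploration", "Product Exploration",
    ],
    "goal-directed": [  # inner ring: the six CRISP-DM phases
        "Business Understanding", "Data Understanding", "Data Preparation",
        "Modelling", "Evaluation", "Deployment",
    ],
    "data management": [  # core
        "Data Acquisition", "Data Simulation", "Data Architecting", "Data Release",
    ],
}

def kind_of(activity: str) -> str:
    """Return which ring of the DST map an activity belongs to."""
    for kind, activities in DST_MAP.items():
        if activity in activities:
            return kind
    raise ValueError(f"unknown activity: {activity}")
```

For instance, `kind_of("Data Simulation")` returns `"data management"`, reflecting its position at the core of the map.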

activities is required, and sometimes these activities would be run several times.

Data science projects are certainly not only about exploration, and contain more goal-driven parts as well. The standard six phases of the CRISP-DM model from business understanding to deployment are all still valid and relevant. However, in data science projects it is common to see only partial traces through CRISP-DM. For example, sometimes there is no need for activities beyond data preparation, as the prepared data are the final product of the project. Data that are scraped from different sources, integrated and cleansed can be published or sold for various purposes, or can be loaded into a data warehouse for OLAP querying. The CRISP-DM phases are also often interrupted by further exploratory activities, whenever the data scientist decides to seek more information and new ideas.

We hence see a successful data science project as following a trajectory through a space like the one depicted in Figure 3. In contrast to the CRISP-DM model there are no arrows here, because the activities are not to be taken in any pre-determined order. It is the responsibility of the project's leader(s) to decide which step to take next, based on the available information, including the results of previous activities. Even though the space contains all the CRISP-DM phases, these are not necessarily run in the standard order, as the goal-driven activities are interleaved with exploratory activities, and these can sometimes set new goals or provide new data.

Data take centre stage in data science, and the terms 'data preparation' and 'modelling' no longer fully capture the variety of practical work that might be carried out on the data. Two decades ago, many applications, especially those falling under the term business intelligence, were based on analysing their own data (e.g., customer behaviour) and extracting patterns from it that would meet the business goals. But today, many more options are considered.

For instance, causal inference [45] has recently been pointed out as a new evolution of data analysis, aimed at understanding the cause-effect connections in data. Causal inference from data focuses on answering questions of the "what if" type and relies on methods that incorporate causal knowledge (such as Structural Causal Models [46], the Potential Outcomes Framework [47] or Linear non-Gaussian acyclic models [48]). Hernan et al. [49] discuss how data science can tackle causal inference from data by considering it as a new kind of data science task known as counterfactual prediction. Basically, counterfactual prediction requires the incorporation of domain expert knowledge not only to formulate the goals or questions to be answered and to identify or generate the data sources, but also to formally describe the causal structure of the system. This task and others performing causal inference fit well within CRISP-DM (under the modelling step), but expert knowledge becomes crucial (and, as a result, the inner stages of the CRISP-DM process are harder to automate). For its part, the business understanding phase reinforces its first-stage position in these circumstances, as this must be the place where the expert understanding of the domain is converted into the models and queries needed for the subsequent steps (data understanding, preparation, modelling and evaluation).

However, under the causal inference framework, data science must play a more active role with the data. Data is not just an input of the system: "a causal understanding of the data is essential to be able to predict the consequences of interventions, such as setting a given variable to some specified value" [48]. This suggests a more iterative process where we could need to generate new data, for instance through randomised experiments or performing simulations.
[Figure 4: the DST map of Figure 3 overlaid with an example trajectory, shown as numbered transitions (0, 1, 2, 3, 5, ...) between activities such as Goal Exploration, Data Preparation, Result Exploration and Product Exploration.]
Fig. 4. Example trajectory through a data science project.

domain where they were collected (e.g., the data collected by an electronic payment system can be bought and used by a multinational company to know where a new store will be best located, or can be used by an environmental agency to obtain petrol consumption patterns). The huge size and complexity of the data in some applications nowadays also suggest that handling the data requires important technical work on curation and infrastructure. In other words, the CRISP-DM model included the 'data' as a static disk cylinder in the middle of the process (see Figure 1), but we want to highlight the activities around this disk, going beyond data preparation and integration3. Given the variety of scenarios for using the data from others or from yourself, for your own or others' benefit, we consider the following data management activities:

Data acquisition: obtaining or creating relevant data, for example by installing sensors or apps;
Data simulation: simulating complex systems in order to produce useful data and to ask causal (e.g., what-if) questions;
Data architecting: designing the logical and physical layout of the data and integrating different data sources;
Data release: making the data available through databases, interfaces and visualisations.

Once the set of activities has been introduced, a trajectory is simply an acyclic directed graph over activities, usually representing a sequence, but occasionally forking to represent when things are done in parallel (by different individuals or groups in a data science team). An example of a trajectory through the DST map is given in Figure 4, where the goal is established as a first step in a data-driven way (goal exploration), and relevant data is then explored to extract valuable knowledge (data value exploration). Classical CRISP-DM activities are performed to clean and transform the data (data preparation), which will be used to train a particular machine learning model (modelling). Finally, the most appropriate end-user product and/or presentation is explored (product exploration) in order to turn the value extracted from the data into a valuable product for users and customers. This example will be visited in full detail in Section 4.1.

As we will do in the next section, we can represent trajectories more compactly, by removing those activities that are not used. Still, if an activity happens more than once in a trajectory, we only show the same activity once. For these DST charts, we use numbered arrows to show the process (possibly visiting the same activity more than once)4. More precisely, a trajectory chart is defined as follows:

• A DST chart is a directed graph that only includes activities (once) and connections (transitions) between them (as directed solid arrows).
• All arrows are numbered from 0 to N, showing the sequence of transitions between activities. Consequently, we cannot have unlimited loops.
• We use three different types of boxes for activities (circles for exploration activities, rounded squares for CRISP-DM activities, and cylinders for data management activities).
• If two or more arrows have the same number, it means that they take place in parallel (or their sequential order is unattested or unimportant).
• A trajectory can go through the same activity more than once. If the trajectory moves from A to B more than once, we annotate this as a single arrow with a single label, showing as many transition numbers as needed, separated by commas.
• Every trajectory has an entrance transition (with number 0 and not starting from any activity) and an exit transition (with number N and not ending in any activity).

By following the transitions from 0 to N, we derive one single trajectory from the chart (remember that repeated numbers are not alternatives, but things going in parallel). Having introduced the graphical notation for the charts, which completes our DST model, in the following section we present some real-life scenarios and discuss the order of exploratory, goal-directed and data management activities in these scenarios.
3. Despite disk cylinders not being cognitively associated with activities as a representation, we have decided to use them to emphasise the correspondence with the original CRISP-DM model.
4. Note that a trajectory chart represents one single trajectory; it is not a pattern for a set of trajectories. CRISP-DM is actually a pattern rather than a single trajectory chart, as it admits several trajectories, especially through the use of the backwards arrows.
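As a concrete illustration of the chart rules above, the following sketch (our own, not part of the paper's model) encodes a chart like the one of Figure 4 as a list of numbered transitions and derives the single trajectory by following the numbers from 0 to N, grouping same-numbered (parallel) arrows together. The activity names echo the running example; the data representation itself is an assumption of this sketch:

```python
from collections import defaultdict

# A DST chart as numbered transitions (number, source, target).
# `None` marks the entrance (no source) and the exit (no target).
transitions = [
    (0, None, "Goal Exploration"),
    (1, "Goal Exploration", "Data Value Exploration"),
    (2, "Data Value Exploration", "Data Preparation"),
    (3, "Data Preparation", "Modelling"),
    (4, "Modelling", "Product Exploration"),
    (5, "Product Exploration", None),
]

def derive_trajectory(transitions):
    """Follow transition numbers 0..N; arrows sharing a number run in parallel."""
    steps = defaultdict(list)
    for number, _src, dst in transitions:
        steps[number].append(dst)
    trajectory = []
    for number in sorted(steps):
        # Repeated numbers are parallel branches, not alternatives;
        # the exit transition (target None) contributes no activity.
        parallel = [d for d in steps[number] if d is not None]
        if parallel:
            trajectory.append(parallel)
    return trajectory

print(derive_trajectory(transitions))  # five steps, one activity entered per step
```

Because every arrow carries a finite number from 0 to N, the derivation always terminates, mirroring the "no unlimited loops" rule above.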

4 EXAMPLES OF DATA SCIENCE TRAJECTORIES

The set of cases we include in this section is not meant to be exhaustive, but aims to show a diverse range of common data science trajectories that illustrate alignments, and especially misalignments, with parts (and in most cases the whole) of the CRISP-DM model, by showing exploratory and data management activities. The exemplar trajectories are also useful to illustrate the graphical notation that we use for the trajectory charts. For each case, we explain the domain and context in a separate subsection, while the sequence of activities is explained in the captions of the corresponding figures.

4.1 Tourism recommender

[Figure 5: trajectory chart: goal exploration (0) → data value exploration (1) → data preparation (2) → modelling (3) → product exploration (4) → exit (5)]

Fig. 5. Tourism recommender: A possible trajectory for the development of a location and activity recommendation system (Section 4.1). Once the goal is established as a first step (goal exploration), the company decides to use the users' location and activity histories, retrieved from third-party location-based services and networks, as the relevant data (data value exploration). Then the data preparation activity creates a user-location-activity rating tensor which can be used to implement and train a recommendation system (modelling stage). Once the best model is selected and evaluated (note that the evaluation against business goals in CRISP-DM is not necessary here), the company may explore the most appropriate end-user product and presentation (product exploration), either through simple visualisations or through the development of mobile/web apps.
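The user-location-activity rating tensor mentioned in the caption above can be sketched in a few lines. This is our own toy illustration, not from the paper: the check-in triples, names and count-based scoring are all made-up assumptions standing in for real location-based-service data:

```python
from collections import Counter

# Hypothetical check-in records as (user, location, activity) triples,
# as might be retrieved from location-based services. Toy data only.
checkins = [
    ("u0", "park", "cycling"),
    ("u0", "park", "cycling"),
    ("u1", "museum", "sightseeing"),
    ("u1", "park", "running"),
]

users = sorted({u for u, _, _ in checkins})
locations = sorted({l for _, l, _ in checkins})
activities = sorted({a for _, _, a in checkins})

# A sparse user x location x activity tensor of visit counts, standing in
# for the rating tensor produced by the data preparation activity.
tensor = Counter(checkins)

def recommend_activity(user, location):
    """Activity recommendation: if we visit some place, what can we do there?"""
    return max(activities, key=lambda a: tensor[(user, location, a)])

print(recommend_activity("u1", "park"))  # → running
```

A real system would replace the raw counts with a factorisation or collaborative-filtering model at the modelling stage; the point here is only the shape of the data that data preparation produces.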

With the increasing popularity of location-based services, there is a large amount of this sort of data being accumulated. For instance, real-time data is being collected from drivers who use the Waze5 navigation app, from pedestrians who use the public-transportation app Moovit6, and from the popular social network for athletes Strava7, which monitors how cyclists and runners move around the city, giving it an unprecedented view of thousands of moving points across cities. All this information can be collected from thousands of smartphones being walked or driven around a city, and can be used by many different companies with very different purposes. For instance, a tour operator would be interested in answering questions related to location recommendation (if we want to do something, where shall we go?) or activity recommendation (if we visit some place, what can we do there?). By exploiting the information retrieved from the aforementioned networks, the company could then decide to create a collaborative smart tourism recommendation system to provide personalised trip plans as well as suitable and adequate offers and activities (accommodation, restaurants, museums, transport, shopping and other attractions) appropriate to the user's profile. We find real-world examples such as Google Travel8, a service developed to plan upcoming trips with summarised information about the user's destination in several categories such as day plans, reservations, best routes, etc. In this example, a possible trajectory is shown in Figure 5.

4.2 Environmental simulator

Simulation processes are an effective resource that may be used to create a whole system in order to generate data that is usually difficult (or expensive) to collect. Moreover, the simulation of complex systems also provides additional advantages, such as the possibility of analysing different scenarios and, in this way, estimating the costs and consequences of the alternatives. For instance, agencies and researchers can integrate traffic simulation models with real data about meteorological conditions (e.g., obtained from weather stations located around the city) to build models of pollution spread for different pollutants, which are generally linked to fuel combustion as by-products of these processes. The generated system can be used not only to predict the level of the pollutants, but also to simulate the effect on pollution of, for instance, restricting the circulation of cars in certain parts of the city, since the temporal and spatial resolution of emissions is essential to predict the concentration of pollutants near roadways. A trajectory of working with a simulated system for predicting traffic and pollution is shown in Figure 6.

Transportation agencies and researchers have in the past estimated emissions using one average speed and volume on a long stretch of roadway. With MOVES, there is an opportunity for higher precision and accuracy. Integrating a microscopic traffic simulation model (such as VISSIM) with MOVES allows one to obtain precise and accurate emission estimates. The proposed emission rate estimation process can also be extended to gridded emissions for ozone modelling, or to localised air quality dispersion modelling.

[Figure 6: trajectory chart: data source exploration (0), then data acquisition and data simulation in parallel (1), data preparation (2), modelling (3), product exploration (4), exit (5)]

Fig. 6. Environmental simulator: Possible trajectory of an application for predicting pollution in cities (Section 4.2). The first activity selects the data sources for the traffic parameters and topology of a city, as well as real meteorological data, all by means of data source exploration. The real data about weather conditions can then be collected by sensors distributed along the city (data acquisition), and simulated data about traffic can be generated (data simulation). In order to make predictions, all the collected data have to be converted (data preparation) to a format or structure suitable for being processed by the machine learning techniques (modelling). The generated models are then evaluated according to a certain quality criterion (again, not against any business goal), and the best model is then used to make the predictions. Finally, the municipalities can explore the most appropriate end-user presentation (product exploration), e.g., a web or mobile app, and the most effective way to communicate the alerts (e.g., text messages, email alerts or pop-up mobile ads).

5. https://www.waze.com/
6. https://moovit.com/
7. https://www.strava.com/
8. https://www.google.com/travel/

4.3 Insurance refining

Insurance companies can use driving history records, locations and real-time data based on ubiquitous Internet of Things (IoT) sensors to offer context-based insurance plans
and behavioural policy pricing to their clients. This data can be used to create much more complete user profiles including, for instance, how much time the vehicle is in use, frequent destinations, whether drivers change lane excessively, their driving speeds, to what extent they respect traffic rules, or whether they use their smartphone while driving, among many other things. All this information may be used to allow safer drivers to pay less for auto insurance. This may be considered a special data science project where the insurance company has already deployed a data mining-based product (customer profiling) which could be potentially enriched by means of different new data explorations. This would make a shift from insurance companies being reactive claim payers to proactive risk managers. Some major auto insurance companies are already using this sort of data9. Fig. 7 shows the trajectory followed, which, apart from the classical CRISP-DM cycle used to develop the current customer profiling product, involves new activities.

[Figure 8: trajectory chart: business understanding (0) → data understanding (1) → data preparation (2) → data source exploration (3) → result exploration (4, 6) and back to data source exploration (5, 7) → exit (8)]

Fig. 8. Sales OLAP: Trajectory for the analysis of sales in retailing (Section 4.4). The first developments imply the preparation of the data mart, led by a data scientist who goes through the first activities of a data mining project: business understanding, choosing a process of interest; data understanding, identifying the needed data (what are the facts, the dimensions, their hierarchies?); and data preparation, thus building the datamart. These activities are usually performed using so-called ETL (Extract, Transform and Load) tools in data warehousing, which help in the process of migrating and integrating data from the original data sources to the data warehouse. The second part of the trajectory involves possibly several analysts/managers extracting value from the datamart by getting the right data (data source exploration) and analysing the results (result exploration), iterating in loops until they come to decisions.
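The iterative drill-down and roll-up operations at the heart of the Sales OLAP scenario (Section 4.4) can be sketched as a plain group-by over a fact table. This is our own toy illustration; the fact table, dimension names and figures are made-up assumptions, not data from the paper:

```python
from collections import defaultdict

# A tiny sales fact table with a location hierarchy (region > city) and a
# time dimension: (region, city, month, amount). Illustrative numbers only.
facts = [
    ("North", "Leeds",   "2019-01", 120.0),
    ("North", "Leeds",   "2019-02",  80.0),
    ("North", "York",    "2019-01",  60.0),
    ("South", "Bristol", "2019-01", 200.0),
]
DIMS = {"region": 0, "city": 1, "month": 2}

def aggregate(facts, dims):
    """Sum sales over the chosen dimensions. Dropping a dimension is a
    roll-up; adding one back is a drill-down along the hierarchy."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

print(aggregate(facts, ["region"]))          # roll-up to region totals
print(aggregate(facts, ["region", "city"]))  # drill-down to city level
```

Rolling up to ["region"] answers the manager's aggregate question, while drilling down to ["region", "city"] or ["region", "city", "month"] refines it, which is exactly the loop between data source exploration and result exploration in the trajectory of Fig. 8.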
[Figure 7: trajectory chart: a full CRISP-DM pass (business understanding (0) → data understanding (1) → data preparation (2) → modelling (3) → evaluation (4) → deployment (5)), followed by data value exploration (6), data source exploration (7) and data acquisition (8), then a second pass through data preparation (9), modelling (10), evaluation (11) and deployment (12), and exit (13)]

Fig. 7. Insurance refining: For insurance companies (Section 4.3) aiming at improving already deployed products (e.g., customer profiling using data mining), a possible trajectory may imply, after a complete CRISP-DM trajectory: the exploration of the value of the data, where the insurance company realises that combining analytical applications (e.g., behavioural models based on customer profile data) with streams of real-time data (e.g., driver's behaviour, vehicle sensors, satellite data, weather reports, etc.) could be an important source for refining the products and services offered; the exploration of new data sources, where the company decides what should be acquired and/or sensorised to create detailed and personalised assessments of risks; and, finally, data acquisition, where it is decided which kind of sensor technology and which smart or wearable devices should be used, and where/how they should be installed/used to obtain relevant data.

4.4 Sales OLAP

In a supermarket, managers regularly analyse information regarding the results of merchandise sales since, as a critical resource, it directly influences the operational efficiency of commercial enterprises. For this purpose, managers usually look at the results of various predefined queries, reports and indicators, and can also refine their queries to get a better understanding of the sales. Such managers either write their own queries or use reporting tools. But they need an appropriate representation, a star (or multidimensional) schema, and data organised into datamarts. These datamarts usually come with supporting software (OLAP tools) to make human analysis easier, both by lowering the cognitive load of the user to understand/manipulate the data and by speeding up the database system itself. The data analyst, who is here a manager using user-oriented tools, can iteratively explore data and results through typical drill-down and roll-up operations along the hierarchies in order to visualise key business issues. The trajectory consists of two main developments: the creation of a data warehouse, which can be assimilated to the first stages of CRISP-DM, and a more explorative period at the end, as illustrated in Figure 8.

4.5 Repository publishing

Data publishing means curating data and making it available in a form that makes it easy for others to extract value. So data is both the starting point and the product. Some amount of data value exploration has happened as part of this process, but there is not a very concrete business goal (yet) for which the data is being made available. Some data mining has happened to support the data value exploration and data understanding process, but data publishing takes the place of deployment. In this way a data repository can be created, serving as a data library for storing data sets that can be used for data analysis, sharing and reporting. Many examples of data repositories can be found through platforms such as re3data10, which allows users to search among a vast number of different data repositories around the world using a simple or advanced search over different characteristics. Another similar example is paperswithcode.com, a free resource for researchers and practitioners to find and follow the latest state-of-the-art machine learning papers and code. The company behind this (Atlas ML) has explored how to present data regarding trending machine learning research, state-of-the-art leaderboards and the code to implement it. This way users can have access, in a unified and genuinely comprehensive manner, to papers (fetched from several venues, repositories and open-source and free-licence related projects) and to their code on different repositories, which can help with reviewing content from different perspectives to discover and compare research. A possible trajectory for both examples is shown in Figure 9.

9. http://fortune.com/2016/01/11/car-insurance-companies-track/
10. https://www.re3data.org/

4.6 Parking App

Smart cities are an emergent concept that refers to an urban area that uses types of electronic data collection sensors
to supply information which is used to manage assets and resources efficiently. Smart cities technology makes it possible to monitor what is happening in the city and to make decisions to improve the city's evolution. Local governments and city councils usually realise that the real-time raw data collected (e.g., from citizens, sensors, devices, etc.) could be an important source for enhancing the quality of their living environment by improving the performance of urban services such as energy, transportation and utilities, in order to reduce resource consumption, wastage and overall costs. For example, the open CityOS platform11 is open-source software that supports the visualisation of real-time data and mobile applications of smart cities. This platform has been adopted by several smart city projects. One of the developed applications is a smart parking app for the city of Dubrovnik (Smart Parking Dubrovnik) that allows drivers to find vacant parking spots, visualising them on an interactive map. In this example, a possible trajectory for this application is shown in Figure 10.

[Figure 9: trajectory chart: data source exploration (0) → data acquisition (1) → data preparation (2) → data architecting (3) → data release (4) → exit (5)]

Fig. 9. Repository publishing: A possible trajectory for generating a data repository (Section 4.5) includes the activities of data source exploration, when data comes from external sources, and data acquisition, where the required data is downloaded, scraped and explored; data preparation, where data is parsed, curated and structured; data architecting, where data is annotated, stored and managed in order to provide easy access to the users; and data release, where both the data and the automatic data extraction pipelines are shared under different licences for public use.

[Figure 10: trajectory chart: data source exploration (0) → data acquisition (1) → data preparation (2) → product exploration (3) → exit (4)]

Fig. 10. Parking App: A possible trajectory for the development of the Smart Parking Dubrovnik app (Section 4.6). The first step is to determine what data should be acquired (data source exploration) and how to collect them (data acquisition), which may imply the development of specific sensors. Then, the following actions are performed in real time: the data gathered by the sensors are transformed to a format (data preparation) that allows determining which parking spots are free and which ones are occupied. Finally, an app is developed for visualising the vacant parking spots on a map on the screen of users' mobiles (product exploration).

4.7 Payment geovisualisation

Credit card transactions are a rich source of data that banks and other payment platforms can exploit in many ways. BBVA, a major Spanish bank, through their Data & Analytics division, has been exploring several ways of making this data valuable. They realised that the historical information of what is bought by different people (nationalities) at different times, dates and locations could be an important source for monetisation, as many other companies (retailers, restaurants, etc.) could be interested in this information. They decided to create an interactive representation, so that users could learn about the spending behaviour of tourists in Spain by having access to this information across several variables, with a general free demo application and particularised (or more detailed) applications for companies. The application, which reveals the data simply and clearly, was made attractive with stories such as: "Ever wondered when the French buy their food?", "Which places do the Germans flock to on their holidays?", or "Sit back and discover the dynamics of spending in Spain"12. In this example, a possible trajectory that might have been taken is shown in Figure 11.

[Figure 11: trajectory chart: data value exploration (0) → goal exploration (1) → data preparation (2) → result exploration (3) → narrative exploration (4) → exit (5)]

Fig. 11. Payment geovisualisation: A possible trajectory for the tourism spending example (Section 4.7). This includes the steps of data value exploration, where the bank systematically looked through the data it held; goal exploration, where the bank considered the potential goals and chose to build an interactive website; data preparation, where the data were integrated and prepared to be queried for visualisation; result exploration, where the visualisations were analysed to decide which companies to offer particularised applications for; and narrative exploration, where example stories were compiled in order to attract the audience to the visualisation tool.

11. http://cityos.io
12. The result can be found here: http://bbvatourism.vizzuality.com

5 ACTIVITY TYPES FOR PROJECT MANAGEMENT

In the previous section we have seen a rich variety of data science trajectories. Some include the data mining process (in part or entirely) as a key component of the trajectory, but others mostly exclude it. We have even seen some cases where the conversion of data into knowledge by modelling or learning is not part of the process, but they are still considered genuine data science trajectories, as data is used to generate value. This ranges from projects only featuring the non-inferential part of "business intelligence" (e.g., building a data warehouse and obtaining aggregated tables, cubes and other graphical representations from the data) [50] to those that follow more exploratory or interactive scenarios, such as those common in visual analytics [51].

Such variability across data science projects poses challenges for project managers, who need to hire suitable people and make time and cost estimates. Exploratory activities require expert data scientists and increase time and cost uncertainty, whereas data management activities require more data engineers and are more easily contained within a fixed time interval and budget. The DST model (see Figure 3) can help project planning by clearly separating exploratory, CRISP-DM (goal-directed) and data management activities, which each have different time and cost characteristics.

In order to better understand the nature of our seven illustrative examples from the previous section, Figure 12 shows a Venn diagram of the three kinds of activities. We can see which of the seven use cases in Section 4 fall in each of the possible regions according to how relevant (in number or importance) the three kinds of activities are (details of the methodology to estimate this are given in Appendix A). For example, for the Tourism recommender case (location-based services, with its DST in Figure 5) both the exploratory and the CRISP-DM activities play an important role, and


[Figure 12: Venn diagram with three sets labelled Exploratory, CRISP-DM and Data Management, and the seven use cases (Payment geovisualisation, Sightseeing advisor, Smart Parking, Drivers profiling, Sales OLAP, Pollution simulator, Repository publishing) placed in its regions]

Fig. 12. Venn diagram of the three kinds of activities (exploratory, CRISP-DM and data management) and the seven use cases introduced in Section 4.

[Figure 13: the same three-set Venn diagram with the number of NIST use cases falling in each region; the largest count (23) sits in the data-management-only region]

Fig. 13. Venn diagram of the three kinds of activities (exploratory, CRISP-DM and data management) and the number of use cases from the NIST Big Data Public Working Group Use Cases [52] which fall into each region.

this is shown by their location in the Venn diagram. Overall, we see that most of the use cases are located in regions where exploration is important, as expected.

However, this picture should not be mistaken as representative of the whole range of data science applications, many of which may follow a more traditional CRISP-DM workflow or may give more relevance to data management. In Section 2 we not only referred to polls that recognised CRISP-DM as the methodology that is still prevalent among data scientists (despite its limitations), but also included a bibliographic survey covering the past four years, with an important number of domains where CRISP-DM is still used extensively. All the applications reviewed there fit CRISP-DM well, with no or very little adaptation of the original formulation, and include mostly CRISP-DM activities. This shows that CRISP-DM is still fit for purpose for one of the areas in Figure 12.

Apart from projects that fit in the CRISP-DM category, and those that are more explorative, it is worth looking at some other projects that have a stronger component in the data management part. In order to do this, we have examined the NIST Big Data Public Working Group Use Cases [52], as per their version 3.0. This is a very comprehensive set of 51 real use cases and their requirements, gathered by the NBD-PWG Use Cases and Requirements Subgroup at the US National Institute of Standards and Technology (NIST). Following the approach used for our seven illustrative cases, we went through the 51 NIST cases. The first significant insight is that we did not find any activity that is not represented in Figure 3. This shows that our model is comprehensive and captures a wide range of activities associated with any kind of data science project, including those that are more data-heavy. Also, when we look at the distribution of activities we see clear patterns, which confirm what we already knew about the types of applications included in this NIST collection. In particular, Figure 13 shows a Venn diagram of the NIST cases and how many of them fall into each of the possible regions that emanate from the three kinds of activities.

Some further insights can be extracted from this diagram. Unsurprisingly, since this is a collection of Big Data projects, we find nearly half of them located in the Data Management (only) region. But there are also some other cases that are combined with the exploratory and/or CRISP-DM activities. Interestingly, even though this collection is about Big Data projects, we have at least one exemplar in each region.

This focus on the three kinds of activities and their possible regions of overlap provides a useful characterisation of data science projects. Data science teams and their organisations can do a similar analysis of their projects and compare a new project specification against them. We recommend the following procedure: (1) Even at very early stages of a project, it is already possible to identify the activities that will be required. By analysing how many there are and how significant they are for each kind (exploratory, CRISP-DM or data management), it is possible to identify to which region of the Venn diagram the project belongs. If the project has one or more strong exploration components, it will be more open-ended. Consequently, more expert data scientists will be needed, with good knowledge about the domain and its causal models. Furthermore, planning will be more involved. If the project has a strong data management component, more data engineers will be needed, as well as more hardware and software resources. (2) By comparing to other projects of that region, one can estimate the project costs more accurately than by comparing against the whole collection of projects, and use some of the trajectories in that region as patterns or prototypes for the appropriate DST for the project. As a result, the types of activities in Figure 3 (exploratory, CRISP-DM and data management) are a practical, yet powerful, way of describing a data science project, prior to going into the more detailed flow of its trajectory, which can be useful for predictive and explanatory questions about the project.

6 DISCUSSION

Standardised processes are not the same as methodologies [53], and many methodologies do not necessarily include guided processes, where one can follow a series of steps linearly. Two cases that are close to data science are quite illustrative. The first case is software engineering, which has many methodologies [54], none of which seems to be the best for all situations, depending on many internal and external factors. Software development, like many other engineering problems, has a structure that
resembles CRISP-DM in many ways (starting with business needs and ending up in deployment and maintenance of the outcome of the process), but it would be likewise inappropriate to use the same linear flow for all problems and circumstances. The similarities have suggested the application or adaptation of software development methodologies for data science (or big data) projects [55], but it is perhaps the general project management methodologies that may be more appropriate, or some specific ideas such as design patterns [56]. Also, we can learn from some novel lightweight methodologies, such as Extreme Programming (XP) [57], which attempted to add flexibility to the process, allowing teams to develop software, from requirements to deployment, in a more efficient way.

The second case is methodology in science. The whole process of scientific discovery is usually question-driven, rather than data-driven or goal-driven, but is generally much more flexible in the initial trajectories (surprising observations, serendipity, etc.), while stricter when it comes to hypothesis testing, replicable experimental design, etc. Despite the analogies between some trajectories in data science and the methodologies in science, there is an ongoing controversy about whether the traditional scientific method is obsolete under the irruption of data science [58], [59], or whether data science methodologies should learn more from the general scientific method [60], [61].

In the absence of more rigid schemes, this diversity of methodologies and trajectories may create uncertainty for project management. This is mitigated by three important aspects of our DST model. First, we define trajectories over a well-defined collection of activities, which can be encapsulated and documented, similar to the original substages in CRISP-DM. DST thus allows data scientists to design their data science projects as well as explore new activities that could be added to or removed from their workflows. This is especially useful for teams, as they can agree on and locate themselves (and subteams) in some of the subactivities of the trajectory. Secondly, existing trajectories can be used as templates so that new projects can use them as references. A new project may find the best match in the catalogue of trajectories rather than forcing it to fit a process model such as CRISP-DM that may not suit the project well and may cause planning difficulties and a bad estimation of effort (e.g., resources, costs, project expertise, completion plans, etc.). Actually, if the estimations of resources and costs using DST are more accurate than using CRISP-DM, this would

While the trajectory perspective may allow for a more systematic (and even automated) analysis at the process level, it is no surprise that the more flexible, less systematic character of the new activities (exploration and data management) highlights the challenges for the automation of data science. For instance, while the automation of the modelling stage of CRISP-DM has been achieved to a large extent under the AutoML paradigm [63], [64], many other parts of CRISP-DM still escape automation, such as data wrangling or model deployment. Beyond data mining, many new competences have been identified as necessary for a data scientist, including both technical and non-technical skills, such as communicating results, leading a team, being creative, etc. [65], [66], [67], [68], and they are usually associated with the exploration activities. Data scientists are expected to cover a wide range of soft skills, such as being proactive, curious and inquisitive, being able to tell a story about the data and visualise the insights appropriately, and focusing on traceability and trust. Most of the new explorative steps beyond CRISP-DM identified in this paper imply these soft skills and the use of business knowledge and vision that is far from the capabilities that AI provides today, and will be harder to automate in the years to come.

The trajectory model does not yet explicitly address all the ethical and legal issues around data science [69], an area that is becoming more relevant in data science than in the previous data mining paradigm, even if problems such as fairness and privacy already existed for data mining. The increased relevance comes especially from the incentives behind many data science projects, which focus on the monetisation of the data through the exploration of new data products. This usually implies the use of data for purposes that are different from those that created the data in the first place, such as social networks, digital assistants or wearable devices. The most relevant ethical issues will appear in the new activities: goal exploration, data source exploration, data value exploration, result exploration, product exploration, and data acquisition. These are also the parts of the trajectories where more senior data scientists will be involved, assuming higher awareness of and training on ethical issues [70] than other more technical, less senior data scientists or team members.

The DST is also motivated by the causal approach to data science. In this case, it is not so much that new exploratory activities are needed, but new data management activities,
be evidence for validity and usefulness in an organisa- required to generate data for the discovery of the causal
tion. Thirdly, trajectories can be mapped with project plans structure: data acquisition and simulation. These are a series
directly, assigning deadlines to transitions, and assigning of activities that are becoming more and more relevant, as
personnel and budget to activities. Iterations on activities we have also seen in the large Big Data NIST repository and
are explicit in the trajectories, which also allows for spiral the associated trajectories that we explored in section 5.
models where subparts of the trajectory are iterated from In conclusion, CRISP-DM still plays an important role
small to big or until a given criterion is met (or a resource is as a common framework for setting up and managing
exhausted). data mining projects. However, the world today is a very
All this paves the way to the introduction of proper different place from the world in which CRISP-DM was
data science project management methodologies, and the conceived over two decades ago. In this paper we have
reuse of statistics and experiences from activities used in argued that the shift from data mining to data science is
previous projects. Techniques from the area of workflow not just terminological, but signifies an evolution towards
inference and management could also be applied to analyse a much wider range of approaches, in which the main
trajectories [62], estimate costs and success rates, and extract value-adding component may be undetermined at the out-
patterns that fit a domain or organisation. set and needs to be discovered as part of the project. For
such exploratory projects the CRISP-DM framework will be too restrictive. We have proposed a new Data Science Trajectories (DST) framework which expands CRISP-DM by including exploratory activities such as goal exploration, data source exploration and data value exploration. Entry points into, trajectories through and exit points out of this richer set of data science steps can vary greatly among data science projects. We have illustrated this by means of a broad range of exemplar projects and the trajectories they embody.

Data science is still a young subject, with many open questions regarding its nature and methodology. While other authors approach these questions from a top-down perspective [71], what we have attempted here is more bottom-up, starting from something that is generally accepted to be productive in the data mining context, and investigating how it can be generalised to account for the much richer data science context. We hence see this as part of a larger, ongoing conversation and hope that the perspective offered here will be received as a positive contribution.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their comments, which motivated the analysis in Section 5. This material is based upon work supported by the EU (FEDER), the Spanish MINECO under grant RTI2018-094403-B-C3, and the Generalitat Valenciana PROMETEO/2019/098. F. Martínez-Plumed was also supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission (JRC) HUMAINT project (CT-EX2018D335821-101), and UPV (PAID-06-18). J. Hernández-Orallo is also funded by FLI grant RFP2-152.

REFERENCES

[1] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, "CRISP-DM 1.0 step-by-step data mining guide," 2000.
[2] O. Marbán, J. Segovia, E. Menasalvas, and C. Fernández-Baizán, "Toward data mining engineering: A software engineering approach," Information Systems, vol. 34, no. 1, pp. 87–107, 2009.
[3] IBM, "Analytics solutions unified method," ftp://ftp.software.ibm.com/software/data/sw-library/services/ASUM.pdf, 2005.
[4] SAS, "SEMMA data mining methodology," http://www.sas.com/technologies/analytics/datamining/miner/semma.html, 2005.
[5] L. A. Kurgan and P. Musilek, "A survey of knowledge discovery and data mining process models," The Knowledge Engineering Review, vol. 21, no. 1, pp. 1–24, 2006.
[6] G. Mariscal, O. Marban, and C. Fernandez, "A survey of data mining and knowledge discovery process models and methodologies," The Knowledge Engineering Review, vol. 25, no. 2, pp. 137–166, 2010.
[7] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Commun. ACM, vol. 39, no. 11, pp. 27–34, Nov. 1996.
[8] R. J. Brachman and T. Anand, "The process of knowledge discovery in databases," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1996, pp. 37–57.
[9] C. Gertosio and A. Dussauchoy, "Knowledge discovery from industrial databases," Journal of Intelligent Manufacturing, vol. 15, no. 1, pp. 29–37, 2004.
[10] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, Discovering Data Mining: From Concept to Implementation. Prentice-Hall, Inc., 1998.
[11] A. G. Buchner, M. D. Mulvenna, S. S. Anand, and J. G. Hughes, "An internet-enabled knowledge discovery process," in Proc. of the 9th Int. Database Conf., Hong Kong, 1999, pp. 13–27.
[12] H. A. Edelstein, Introduction to Data Mining and Knowledge Discovery. Two Crows, 1998.
[13] L. Cao, "Domain-driven data mining: Challenges and prospects," IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 6, pp. 755–769, 2010.
[14] C. Brunk, J. Kelly, and R. Kohavi, "MineSet: An integrated system for data mining," in KDD, 1997, pp. 135–138.
[15] A. Bernstein, F. Provost, and S. Hill, "Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification," IEEE Trans. on Knowledge and Data Engineering, vol. 17, no. 4, pp. 503–518, 2005.
[16] M. J. Harry, "Six sigma: a breakthrough strategy for profitability," Quality Progress, vol. 31, no. 5, p. 60, 1998.
[17] J. Debuse, B. de la Iglesia, C. Howard, and V. Rayward-Smith, "Building the KDD roadmap," in Industrial Knowledge Management. Springer, 2001, pp. 179–196.
[18] O. Niaksu, "CRISP data mining methodology extension for medical domain," Baltic J. of Modern Computing, vol. 3, no. 2, p. 92, 2015.
[19] D. Asamoah and R. Sharda, "Adapting CRISP-DM process for social network analytics: Application to healthcare," in 21st Americas Conf. on Information Systems, Puerto Rico, 2015.
[20] N. Njiru and E. Opiyo, "Clustering and visualizing the status of child health in Kenya: A data mining approach," International Journal of Social Science and Technology, 2018.
[21] N. Azadeh-Fard, F. M. Megahed, and F. Pakdil, "Variations of length of stay: a case study using control charts in the CRISP-DM framework," International Journal of Six Sigma and Competitive Advantage, vol. 11, no. 2-3, pp. 204–225, 2019.
[22] A. Dåderman and S. Rosander, "Evaluating frameworks for implementing machine learning in signal processing: A comparative study of CRISP-DM, SEMMA and KDD," 2018.
[23] M. Rogalewicz and R. Sika, "Methodologies of knowledge discovery from data and data mining methods in mechanical engineering," Management and Production Engineering Review, vol. 7, no. 4, pp. 97–108, 2016.
[24] S. Huber, H. Wiemer, D. Schneider, and S. Ihlenfeldt, "DMME: Data mining methodology for engineering applications – a holistic extension to the CRISP-DM model," Procedia CIRP, vol. 79, pp. 403–408, 2019.
[25] C. Barclay, A. Dennis, and J. Shepherd, "Application of the CRISP-DM model in predicting high school students' examination (CSEC/CXC) performance," Knowledge Discovery Process and Methods to Enhance Organizational Performance, p. 279, 2015.
[26] D. B. Fernández and S. Luján-Mora, "Uso de la metodología CRISP-DM para guiar el proceso de minería de datos en LMS," in Tecnología, innovación e investigación en los procesos de enseñanza-aprendizaje. Octaedro, 2016, pp. 2385–2393.
[27] L. Almahadeen, M. Akkaya, and A. Sari, "Mining student data using CRISP-DM model," International Journal of Computer Science and Information Security, vol. 15, no. 2, p. 305, 2017.
[28] D. Oreski, I. Pihir, and M. Konecki, "CRISP-DM process model in educational setting," in Economic and Social Development: Book of Proceedings, 2017, pp. 19–28.
[29] E. Espitia, A. F. Montilla et al., "Applying CRISP-DM in a KDD process for the analysis of student attrition," in Colombian Conference on Computing. Springer, 2018, pp. 386–401.
[30] V. Tumelaire, E. Topan, and A. Wilbik, "Development of a repair cost calculation model for DAF Trucks NV using the CRISP-DM framework," Master's thesis, Eindhoven University of Technology, 2015.
[31] F. Schäfer, C. Zeiselmair, J. Becker, and H. Otten, "Synthesizing CRISP-DM and quality management: A data mining approach for production processes," in 2018 IEEE International Conference on Technology Management, Operations and Decisions (ICTMOD). IEEE, 2018, pp. 190–195.
[32] E. G. Nabati and K.-D. Thoben, "On applicability of big data analytics in the closed-loop product lifecycle: Integration of CRISP-DM standard," in IFIP International Conference on Product Lifecycle Management. Springer, 2016, pp. 457–467.
[33] H. Nagashima and Y. Kato, "APREP-DM: a framework for automating the pre-processing of a sensor data analysis based on CRISP-DM," in 2019 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). IEEE, 2019, pp. 555–560.
[34] S. B. Gómez, M. C. Gómez, and J. B. Quintero, "Inteligencia de negocios aplicada al ecoturismo en Colombia: Un caso de estudio aplicando la metodología CRISP-DM," in 14th Iberian Conference on Information Systems and Technologies, CISTI 2019. IEEE Computer Society, 2019, p. 8760802.
[35] R. Ganger, J. Coles, J. Ekstrum, T. Hanratty, E. Heilman, J. Boslaugh, and Z. Kendrick, "Application of data science within the army intelligence warfighting function: problem summary and key findings," in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, vol. 11006. International Society for Optics and Photonics, 2019, p. 110060N.
[36] R. P. Bunker and F. Thabtah, "A machine learning framework for sport result prediction," Applied Computing and Informatics, 2017.
[37] R. Barros, A. Peres, F. Lorenzi, L. K. Wives, and E. H. da Silva Jaccottet, "Case law analysis with machine learning in Brazilian court," in International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 2018, pp. 857–868.
[38] K. J. Cios, A. Teresinska, S. Konieczna, J. Potocka, and S. Sharma, "A knowledge discovery approach to diagnosing myocardial perfusion," IEEE Engineering in Medicine and Biology Magazine, vol. 19, no. 4, pp. 17–25, 2000.
[39] K. J. Cios and L. A. Kurgan, "Trends in data mining and knowledge discovery," in Advanced Techniques in Knowledge Discovery and Data Mining. Springer, 2005, pp. 1–26.
[40] S. Moyle and A. Jorge, "RAMSYS – a methodology for supporting rapid remote collaborative data mining projects," in ECML/PKDD 2001 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning: Internal SolEuNet Session, 2001, pp. 20–31.
[41] F. Martínez-Plumed, L. C. Ochando, C. Ferri, P. A. Flach, J. Hernández-Orallo, M. Kull, N. Lachiche, and M. J. Ramírez-Quintana, "CASP-DM: context aware standard process for data mining," CoRR, vol. abs/1709.09003, 2017. [Online]. Available: http://arxiv.org/abs/1709.09003
[42] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data mining with big data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.
[43] J. Rollins, "Why we need a methodology for data science," 2015. [Online]. Available: https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14824USEN
[44] R. B. Severtson, "What is the team data science process?" 2017. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
[45] J. Pearl and D. Mackenzie, The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
[46] J. Pearl, "The seven tools of causal inference, with reflections on machine learning," Commun. ACM, vol. 62, no. 3, pp. 54–60, 2019.
[47] G. W. Imbens and D. B. Rubin, "Rubin causal model," The New Palgrave Dictionary of Economics, pp. 1–10, 2017.
[48] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen, "A linear non-Gaussian acyclic model for causal discovery," J. Mach. Learn. Res., vol. 7, pp. 2003–2030, Dec. 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248619
[49] M. A. Hernán, J. Hsu, and B. Healy, "A second chance to get causal inference right: A classification of data science tasks," CHANCE, vol. 32, no. 1, pp. 42–49, Jan. 2019. [Online]. Available: http://dx.doi.org/10.1080/09332480.2019.1579578
[50] S. Chaudhuri, U. Dayal, and V. Narasayya, "An overview of business intelligence technology," Communications of the ACM, vol. 54, no. 8, pp. 88–98, 2011.
[51] D. Keim, G. Andrienko, J.-D. Fekete, C. Görg, J. Kohlhammer, and G. Melançon, "Visual analytics: Definition, process, and challenges," in Information Visualization. Springer, 2008, pp. 154–175.
[52] NIST, "NIST big data interoperability framework: Volume 3, use cases and general requirements," NIST Special Publication, vol. 1500, p. 344, 2019.
[53] J. Saltz, K. Crowston et al., "Comparing data science project management methodologies via a controlled experiment," in Proceedings of the 50th Hawaii International Conference on System Sciences, 2017.
[54] L. R. Vijayasarathy and C. W. Butler, "Choice of software development methodologies: Do organizational, project, and team characteristics matter?" IEEE Software, vol. 33, no. 5, pp. 86–94, 2016.
[55] V. D. Kumar and P. Alencar, "Software engineering for big data projects: Domains, methodologies and gaps," in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 2886–2895.
[56] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Pearson Education, 1995.
[57] K. Auer and R. Miller, Extreme Programming Applied: Playing to Win. Addison-Wesley Longman Publishing Co., Inc., 2001.
[58] C. Anderson, "The end of theory: The data deluge makes the scientific method obsolete," Wired Magazine, vol. 16, no. 7, 2008.
[59] R. Kitchin, "Big data, new epistemologies and paradigm shifts," Big Data & Society, vol. 1, no. 1, 2014. [Online]. Available: https://doi.org/10.1177/2053951714528481
[60] S. Carrol and D. Goodstein, "Defining the scientific method," Nature Methods, vol. 6, p. 237, 2009.
[61] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, "Theory-guided data science: A new paradigm for scientific discovery from data," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2318–2331, 2017.
[62] W. van der Aalst and K. M. van Hee, Workflow Management: Models, Methods, and Systems. MIT Press, 2004.
[63] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 847–855.
[64] I. Guyon, L. Sun-Hosoya, M. Boullé, H. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag et al., "Analysis of the AutoML challenge series 2015–2018," 2017.
[65] "8 Skills You Need to Be a Data Scientist," https://blog.udacity.com/2014/11/data-science-job-skills.html, Nov. 2014.
[66] V. Dhar, "Data science and prediction," Communications of the ACM, vol. 56, no. 12, pp. 64–73, 2013.
[67] M. Loukides, What Is Data Science? O'Reilly Media, Inc., Apr. 2011.
[68] European Commission, "European e-Competence Framework," 2016. [Online]. Available: http://www.ecompetences.eu/
[69] M. Taddeo and L. Floridi, "Theme issue 'the ethical impact of data science'," 2016.
[70] S. Russell, S. Hauert, R. Altman, and M. Veloso, "Ethics of artificial intelligence," Nature, vol. 521, no. 7553, pp. 415–416, 2015.
[71] L. Cao, "Data science: a comprehensive overview," ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 43, 2017.
[72] M. Ponsen, K. Tuyls, M. Kaisers, and J. Ramon, "An evolutionary game-theoretic analysis of poker strategies," Entertainment Computing, vol. 1, no. 1, pp. 39–45, 2009.

APPENDIX

In section 5 we portray summarised information about 51 use cases extracted from the NIST Big Data Public Working Group [52]. In this appendix we give more information about this source of cases and the methodology we used to process them. The National Institute of Standards and Technology (NIST) sought to establish relations among industry professionals to further the secure and effective adoption of Big Data and to develop consensus on definitions, taxonomies, secure reference architectures, security and privacy, and, from these, a standards roadmap. With this aim, the NIST Big Data Public Working Group (NBD-PWG) was launched with extensive participation from industry, academia, and government. The results from this group are reported in the NIST Big Data Interoperability Framework series of volumes which, among definitions, taxonomies, requirements, etc., contains a set of 51 original use cases gathered by the NBD-PWG Use Cases and Requirements Subgroup (https://bigdatawg.nist.gov/show_InputDoc.php). The report includes examples in the following broad areas: government operations (4 cases), commercial (8), defense (3), healthcare and life sciences (10), deep learning and social media (6), research (4), astronomy and physics (5), earth,
environmental and polar sciences (10) and energy (1). For each use case, the report presents its requirements and challenges.

[Figure 14: ternary plot; labelled points include Environmental Simulation, Insurance Refinement, Parking App, Payment Geovisualisation, Repository Publishing, Sales OLAP, Tourism Recommender and the NIST use cases; the three axes are Exploratory, CRISP-DM and Data Management.]

Fig. 14. Ternary plot depicting the proportions of the three activity types (exploratory, CRISP-DM and data management) for the seven use cases in section 4 and the 51 use cases from the NIST Big Data Public Working Group [52] (numbers show how many NIST use cases fall in the same point).
Aiming at better understanding the nature of these 51 use cases, we classify them according to how relevant the three kinds of activities (exploratory, CRISP-DM and data management) are. In this regard, each use case is modelled as a DST following its definition in [52]. We then determine whether a case has a significant number of activities for each of the three groups of activities. We have three possible variables (i.e., types of activity) and 2³ = 8 potential combinations ("application types") depending on how many activities of each type a use case involves. In this regard, we set a threshold to determine whether there is a significant use of a specific type of activity, in terms of the number of activities used. In particular, for the present study we set this threshold at a minimum of 2 activities. The results are those shown in Figures 12 and 13 in section 5.
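This thresholding scheme can be sketched in code. The sketch below is illustrative only: the activity labels and the example case are invented for the purpose of the example, and only the three-way threshold test (a minimum of 2 activities per type, giving 2³ = 8 application types) follows the text:

```python
# Classify a use case by which of the three activity types it uses
# significantly (>= THRESHOLD activities of that type), yielding one of
# 2^3 = 8 possible "application types". Activity labels are illustrative.
EXPLORATORY = {"goal exploration", "data source exploration",
               "data value exploration", "result exploration",
               "product exploration"}
CRISP_DM = {"business understanding", "data understanding",
            "data preparation", "modelling", "evaluation", "deployment"}
DATA_MANAGEMENT = {"data acquisition", "data simulation",
                   "data architecting", "data release"}

THRESHOLD = 2  # minimum number of activities for "significant" use

def application_type(activities: list[str]) -> tuple[bool, bool, bool]:
    """Return (exploratory?, CRISP-DM?, data management?) for a use case."""
    acts = {a.lower() for a in activities}
    return (len(acts & EXPLORATORY) >= THRESHOLD,
            len(acts & CRISP_DM) >= THRESHOLD,
            len(acts & DATA_MANAGEMENT) >= THRESHOLD)

# A made-up use case with 2 exploratory, 2 CRISP-DM and 1 data
# management activity: significant in the first two types only.
case = ["goal exploration", "data source exploration",
        "modelling", "evaluation", "data acquisition"]
print(application_type(case))  # (True, True, False)
```

With three boolean indicators, every use case falls into exactly one of the eight combinations, which is what Figures 12 and 13 aggregate.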
On the other hand, and in order to support the analysis performed in section 5, we have also analysed the percentages of the three types of activities as positions in a ternary plot (or simplex plot in game theory [72]) for all the illustrative examples from section 4 as well as the NIST use cases. This way, Figure 14 visualises the relative importance of the three activity types for each point (use case), where their positions in the plot represent their different compositions. Using percentages or ratios (instead of absolute numbers) here makes sense as there are no big differences in the number of activities involved in each use case (i.e., they range from 3 to 7 activities, with 4.2 ± 1.3 activities on average).
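The position of each use case in the ternary plot is simply the normalised composition of its activity counts; a minimal sketch (the counts in the example are invented, but lie within the 3–7 activity range reported above):

```python
def ternary_position(n_expl: int, n_crisp: int, n_dm: int) -> tuple[float, float, float]:
    """Proportions of the three activity types; the coordinates sum to 1,
    which is what places each use case inside the simplex."""
    total = n_expl + n_crisp + n_dm
    return (n_expl / total, n_crisp / total, n_dm / total)

# A case with 2 exploratory, 2 CRISP-DM and 1 data management activity:
print(ternary_position(2, 2, 1))  # (0.4, 0.4, 0.2)
```

Two cases with the same composition map to the same point, which is why Figure 14 annotates some points with the number of NIST use cases they represent.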
The previous classifications show two things: (1) there is
no case which has an activity that is not captured by our
set of activities; (2) while our selection of illustrative exam-
ples in section 4 was made to emphasise the exploratory
activities, which are more distinctive in the new conception
of data science, the use cases in the NIST dataset are more
related to data management.