TKDE_Data_Science_Trajectories_PF
Martínez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Kull, M., Lachiche, N., Ramírez-Quintana, M. J., & Flach, P. A. (2019). CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Transactions on Knowledge and Data Engineering. Advance online publication. https://doi.org/10.1109/TKDE.2019.2962680
This is the author accepted manuscript (AAM). The final published version (version of record) is available online
via IEEE at https://ieeexplore.ieee.org/document/8943998. Please refer to any applicable terms of use of the
publisher.
Abstract—CRISP-DM (CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls, it is still the de facto standard for developing data mining and knowledge discovery projects. However, the field has undoubtedly moved on considerably in twenty years, with data science now favoured as the leading term over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for purpose for data science projects. We argue that if the project is goal-directed and process-driven, the process model view still largely holds. On the other hand, when data science projects become more exploratory, the paths that the project can take become more varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it can be used to categorise data science projects (goal-directed, exploratory or data management). We examine seven real-life exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big Data Public Working Group. We anticipate that this categorisation can help project planning in terms of time and cost characteristics.
Index Terms—Data Science Trajectories, Data Mining, Knowledge Discovery Process, Data-driven Methodologies.
1 INTRODUCTION
• Two Crows [12], which takes advantage of some insights from (first versions of) CRISP-DM (before release), and proposes a non-linear list of steps (very close to those from KDD), so it is possible to go back and forth.
• D³M [13], a domain-driven data mining approach proposed to promote the paradigm shift from "data-centered knowledge discovery" to "domain-driven, actionable knowledge delivery".

There are also some other relevant approaches not directly related to the KDD task. The 5 A's Process [14], originally developed by SPSS², already included an "Automate" step which helps non-expert users to automate the whole DM process by applying already defined methods to new data, but it does not contain steps to understand the business objectives or to test data quality. Another approach that tries to assist users in the DM process is [15]. All these were influential for CRISP-DM. In 1996 Motorola developed the 6σ approach [16], which emphasises measurement and statistical control techniques for quality and excellence in management. Another approach is the KDD Roadmap [17], an iterative data mining methodology whose main contribution is the introduction of the "resourcing" task, consisting in the integration of databases from multiple sources to form the operational database.

2. http://www.spss.com/

The evolution of these data mining process models and methodologies is graphically depicted in Figure 2. The arrows in the figure indicate that CRISP-DM incorporates principles and ideas from most of the aforementioned methodologies, while also forming the basis for many later proposals. CRISP-DM is still considered the most complete data mining methodology in terms of meeting the needs of industrial projects, and has become the most widely used process for DM projects according to the KDnuggets polls (https://www.kdnuggets.com/) held in 2002, 2004, 2007 and 2014. In short, CRISP-DM is considered the de facto standard for analytics, data mining and data science projects.

Fig. 2. Evolution of the most relevant Data Mining and Data Science models and methodologies (in white and light blue, respectively). KDD and CRISP-DM are the 'canonical' methodologies, depicted in grey. Adapted from [6]. The years are those of the most representative papers, not the years in which the model was introduced.

To corroborate this view from data science experts, we also checked that CRISP-DM is still a very common methodology for data mining applications. For instance, focussing just on the past four years, we can find a large number of conventional studies applying or slightly adapting the CRISP-DM methodology to many different domains: healthcare [18], [19], [20], [21], signal processing [22], engineering [23], [24], education [25], [26], [27], [28], [29], logistics [30], production [31], [32], sensors and wearable applications [33], tourism [34], warfare [35], sports [36] and law [37].

However, things have evolved in the business application of data mining since CRISP-DM was published. Several new methodologies have appeared as extensions of CRISP-DM, showing how it can be modernised without changing it fundamentally. For instance, the CRISP-DM 2.0 Special Interest Group (SIG) was established with the aim of meeting the changing needs of DM with an improved version of the CRISP-DM process. This version was scheduled to appear in the late 2000s, but the group was discontinued before the new version could be delivered. Other examples include:

• Cios et al.'s six-step discovery process [38], [39], which adapts the CRISP-DM model to the needs of the academic research community (research-oriented descriptions, explicit feedback mechanisms, extension of discovered knowledge to other domains, etc.).
• RAMSYS (RApid collaborative data Mining SYStem) [40], a methodology for developing collaborative DM and KD projects with geographically diverse groups.
• ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics) [3], a methodology which refines and extends CRISP-DM, adding infrastructure, operations, deployment and project management sections as well as templates and guidelines, personalised for IBM's practices.
• CASP-DM [41], which addresses specific challenges of machine learning and data mining for context change and model reuse handling.
• HACE [42], a Big Data processing framework based on a three-tier structure: a "Big Data mining platform" (Tier I); challenges on information sharing and privacy, and Big Data application domains (Tier II); and Big Data mining algorithms (Tier III).

The aforementioned methodologies have in common that they are designed to spend a great deal of time in the business understanding phase, aiming at gathering as much information as possible before starting a data mining project. However, the current data deluge, as well as the experimental and exploratory nature of data science projects, requires less rigid, more lightweight and flexible methodologies. In response, big IT companies have introduced similar lifecycles and methodologies for data science projects. For example, in 2015 IBM released the Foundational Methodology for Data Science (FMDS) [43], a 10-stage data science methodology that – although bearing some similarities to CRISP-DM – emphasises a number of new practices, such as the use of very large data volumes, the incorporation of text analytics into predictive modelling and the automation of some of the processes. In 2017 Microsoft released the Team Data Science Process (TDSP) [44], an "agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently" and to improve team collaboration and learning.

At a high level, both FMDS and TDSP have much in common with CRISP-DM. This demonstrates the latter's flexibility, which allows new specific steps (such as analytic and feedback phases/tasks) that are missing in the original proposal to be included. On the other hand, methodologies such as FMDS and TDSP are in essence still data mining methodologies that assume a clearly identifiable goal from the outset. In the next section we argue that data science calls for a much more exploratory mindset.

3 FROM GOAL-DIRECTED DATA MINING PROCESSES TO EXPLORATORY DATA SCIENCE TRAJECTORIES

As is evident from the previous section, the perspective of CRISP-DM and related methodologies is that data mining is a process starting from a relatively clear business goal and data that have already been collected and are available for further computational processing. This kind of process is
akin to mining for valuable minerals or metals at a given geographic location where the existence of the minerals or metals has been established: data are the ore, in which valuable knowledge can be found. Whenever this kind of metaphor is applicable, we suggest that CRISP-DM is a good methodology to follow and still holds its own after twenty years.

However, data science is now a much more commonly used term than data mining in the context of knowledge discovery. A quick query on Google Trends shows that the former became a more frequent search term than the latter in early 2016 and is now more than twice as common. So what is data science? There seem to be two broad senses in which the term is used: (a) the science OF data; and (b) applying scientific methods TO data. From the first perspective, data science is seen as an academic subject that studies data in all its manifestations, together with methods and algorithms to manipulate, analyse, visualise and enrich data. It is methodologically close to computer science and statistics, combining theoretical, algorithmic and empirical work. From the second perspective, data science spans both academia and industry, extracting value from data using scientific methods such as statistical hypothesis testing or machine learning. Here the emphasis is on solving domain-specific problems in a data-driven way. Data are used to build models, design artefacts, and generally increase understanding of the subject. If we wanted to distinguish these two senses, we could call the first theoretical data science and the second applied data science. In this paper we are concerned with the latter, and henceforth we use the term 'data science' in this applied sense.

The key difference we perceive between data mining twenty years ago and data science today is that the former is goal-oriented and concentrates on the process, while the latter is data-oriented and exploratory. Developed from the goal-oriented perspective, CRISP-DM is all about processes and the different tasks and roles within those processes. It views the data as an ingredient towards achieving the goal – an important ingredient, but not more. In other words, from the data mining perspective, the process takes centre stage. In contrast, in contemporary data science the data take centre stage: we know or suspect there is value in these data, so how do we unlock it? What are the possible operations we can apply to the data to unlock and utilise their value? Moving away from the process, the methodology becomes less prescriptive and more inquisitive: things you can do to data rather than things you should do to data.

To continue with the 'mining' metaphor: if data mining is like mining for precious metals, data science is like prospecting: searching for deposits of precious metals where profitable mines can be located. Such a prospecting process is fundamentally exploratory and can include some of the following activities:

Goal exploration: finding business goals which can be achieved in a data-driven way;
Data source exploration: discovering new and valuable sources of data;
Data value exploration: finding out what value might be extracted from the data;
Result exploration: relating data science results to the business goals;
Narrative exploration: extracting valuable stories (e.g., visual or textual) from the data;
Product exploration: finding ways to turn the value extracted from the data into a service or app that delivers something new and valuable to users and customers.

While it is possible to see (weak) links between these exploratory activities and CRISP-DM phases (e.g. goal exploration relates to business understanding, and result exploration relates to modelling and evaluation), the former are typically more open-ended than the CRISP-DM phases. In data science, the order of activities depends on the domain as well as on the decisions and discoveries of the data scientist. For example, after getting unsatisfactory results in data value exploration performed on given data, it might be necessary to do further data source exploration. Alternatively, if no data are given, then data source exploration would come before data value exploration. Sometimes neither of these activities is required, and sometimes these activities would be run several times.

Data science projects are certainly not only about exploration, and contain more goal-driven parts as well. The standard six phases of the CRISP-DM model, from business understanding to deployment, are all still valid and relevant. However, in data science projects it is common to see only partial traces through CRISP-DM. For example, sometimes there is no need for activities beyond data preparation, as the prepared data are the final product of the project. Data that are scraped from different sources, integrated and cleansed can be published or sold for various purposes, or can be loaded into a data warehouse for OLAP querying. The CRISP-DM phases are also often interrupted by further exploratory activities, whenever the data scientist decides to seek more information and new ideas.

We hence see a successful data science project as following a trajectory through a space like the one depicted in Figure 3. In contrast to the CRISP-DM model there are no arrows here, because the activities are not to be taken in any pre-determined order. It is the responsibility of the project's leader(s) to decide which step to take next, based on the available information, including the results of previous activities. Even though the space contains all the CRISP-DM phases, these are not necessarily run in the standard order, as the goal-driven activities are interleaved with exploratory activities, and these can sometimes set new goals or provide new data.

Fig. 3. The DST map, containing the outer circle of exploratory activities, the inner circle of CRISP-DM (or goal-directed) activities, and at the core the data management activities.

Data take centre-stage in data science, and the terms 'data preparation' and 'modelling' no longer fully capture the variety of practical work that might be carried out on the data. Two decades ago, many applications, especially those falling under the term business intelligence, were based on analysing their own data (e.g., customer behaviour) and extracting patterns from it that would meet the business goals. But today, many more options are considered.

For instance, causal inference [45] has recently been pointed out as a new evolution of data analysis aimed at understanding the cause-effect connections in data. Causal inference from data focuses on answering questions of the "what if" type and relies on methods that incorporate causal knowledge (such as Structural Causal Models [46], the Potential Outcomes Framework [47] or Linear Non-Gaussian Acyclic Models [48]). Hernán et al. [49] discuss how data science can tackle causal inference from data by considering it as a new kind of data science task known as counterfactual prediction. Basically, counterfactual prediction requires incorporating domain expert knowledge not only to formulate the goals or questions to be answered and to identify or generate the data sources, but also to formally describe the causal structure of the system. This task and others performing causal inference fit well within CRISP-DM (under the modelling step), but expert knowledge becomes crucial (and, as a result, the inner stages of the CRISP-DM process are harder to automate). For its part, the business understanding phase reinforces its first-stage position in these circumstances, as this must be the place where the expert understanding of the domain is converted into the models and queries needed for the subsequent steps (data understanding, preparation, modelling and evaluation).

However, under the causal inference framework, data science must play a more active role with the data. Data are not just an input of the system: "a causal understanding of the data is essential to be able to predict the consequences of interventions, such as setting a given variable to some specified value" [48]. This suggests a more iterative process where we may need to generate new data, for instance through randomised experiments or by performing simulations on the observed or generated data, using the expert's causal knowledge in the form of graphical models together with other kinds of domain knowledge or extracted patterns. All these operations are difficult to integrate in the CRISP-DM model and may require new generative activities for data acquisition and simulation.

Another relevant area where CRISP-DM seems to fall short is when thinking about "data-driven products", such as a mobile app that takes information from the location of its users and recommends routes to other users according to their patterns. The product is the data and the knowledge extracted from it. This perspective was unusual two decades ago, but it is now widespread. Also, nowadays the data might have multiple uses, even far away from the context or domain where they were collected (e.g., the data collected by an electronic payment system can be bought and used by a multinational company to decide where a new store will be best located, or can be used by an environmental agency to obtain petrol consumption patterns). The huge size and complexity of the data in some applications nowadays also suggest that handling the data requires important technical work on curation and infrastructure. In other words, the CRISP-DM model included the 'data' as a static disk cylinder in the middle of the process (see Figure 1), but we want to highlight the activities around this disk, going beyond data preparation and integration³. Given the variety of scenarios for using data from others or from yourself, for your own or others' benefit, we consider the following data management activities.

Data acquisition: obtaining or creating relevant data, for example by installing sensors or apps;
Data simulation: simulating complex systems in order to produce useful data and ask causal (e.g., what-if) questions;
Data architecting: designing the logical and physical layout of the data and integrating different data sources;
Data release: making the data available through databases, interfaces and visualisations.

3. Despite disk cylinders not being cognitively associated with activities as a representation, we have decided to use them to emphasise the correspondence with the original CRISP-DM model.

Once the set of activities has been introduced, a trajectory is simply an acyclic directed graph over activities, usually representing a sequence, but occasionally forking to represent things done in parallel (by different individuals or groups in a data science team). An example of a trajectory through the DST map is given in Figure 4, where the goal is established as a first step in a data-driven way (goal exploration), and relevant data are then explored to extract valuable knowledge (data value exploration). Classical CRISP-DM activities are performed to clean and transform the data (data preparation), which will be used to train a particular machine learning model (modelling). Finally, the most appropriate end-user product and/or presentation is explored (product exploration) in order to turn the value extracted from the data into a valuable product for users and customers. This example will be visited in full detail in section 4.1.

[Fig. 4: an example trajectory through the DST map, with numbered transitions; described in detail in section 4.1.]

As we will do in the next section, we can represent trajectories more compactly, by removing those activities that are not used. Still, if an activity happens more than once in a trajectory, we only show that activity once. For these DST charts, we use numbered arrows to show the process (possibly visiting the same activity more than once)⁴. More precisely, a trajectory chart is defined as follows:

• A DST chart is a directed graph that only includes activities (once) and connections (transitions) between them (as directed solid arrows).
• All arrows are numbered from 0 to N, showing the sequence of transitions between activities. Consequently, we cannot have unlimited loops.
• We use three different types of boxes for activities (circles for exploration activities, rounded squares for CRISP-DM activities, and cylinders for data management activities).
• If two or more arrows have the same number, the corresponding transitions take place in parallel (or their sequential order is unattested or unimportant).
• A trajectory can go through the same activity more than once. If the trajectory moves from A to B more than once, we annotate this as a single arrow with a single label, showing as many transition numbers as needed, separated by commas.
• Every trajectory has an entrance transition (with number 0 and not starting from any activity) and an exit transition (with number N and not ending in any activity).

By following the transitions from 0 to N, we derive one single trajectory from the chart (remember that repeated numbers are not alternatives, but things going in parallel). Having introduced the graphical notation for the charts, which completes our DST model, in the following section we present some real-life scenarios and discuss the order of exploratory, goal-directed and data management activities in these scenarios.

4. Note that a trajectory chart represents one single trajectory; it is not a pattern for a set of trajectories. CRISP-DM is actually a pattern and not a single trajectory chart, as CRISP-DM admits several trajectories, especially through the use of the backwards arrows.
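The rules above are compact enough to encode directly. The following Python sketch is our own illustration (not part of the paper's notation): it represents a DST chart as numbered transitions, checks the entrance/exit rules, and derives the single trajectory by following the transitions from 0 to N. The example encodes the Figure 4 trajectory described above.

```python
from collections import defaultdict

# The three kinds of activities in the DST map (drawn as circles,
# rounded squares and cylinders respectively in the charts).
EXPLORATORY = "exploratory"
CRISP_DM = "crisp-dm"
DATA_MGMT = "data management"

class TrajectoryChart:
    """A single DST trajectory chart: activities plus transitions 0..N.

    Transition 0 is the entrance (source None); transition N is the
    exit (target None). Transitions sharing a number run in parallel.
    """

    def __init__(self):
        self.kind = {}                        # activity name -> kind
        self.transitions = defaultdict(list)  # number -> [(source, target)]

    def add_activity(self, name, kind):
        self.kind[name] = kind

    def add_transition(self, number, source, target):
        self.transitions[number].append((source, target))

    def derive_trajectory(self):
        """Follow the transitions from 0 to N and list the activities visited."""
        numbers = sorted(self.transitions)
        assert numbers[0] == 0 and all(
            src is None for src, _ in self.transitions[0]
        ), "a chart needs an entrance transition numbered 0"
        assert all(
            tgt is None for _, tgt in self.transitions[numbers[-1]]
        ), "a chart needs an exit transition numbered N"
        visited = []
        for n in numbers:
            for _, target in self.transitions[n]:  # equal numbers: parallel
                if target is not None:
                    visited.append(target)
        return visited

# Encoding of the example trajectory of Figure 4 (section 4.1).
chart = TrajectoryChart()
for name, kind in [("goal exploration", EXPLORATORY),
                   ("data value exploration", EXPLORATORY),
                   ("data preparation", CRISP_DM),
                   ("modelling", CRISP_DM),
                   ("product exploration", EXPLORATORY)]:
    chart.add_activity(name, kind)

chart.add_transition(0, None, "goal exploration")                      # entrance
chart.add_transition(1, "goal exploration", "data value exploration")
chart.add_transition(2, "data value exploration", "data preparation")
chart.add_transition(3, "data preparation", "modelling")
chart.add_transition(4, "modelling", "product exploration")
chart.add_transition(5, "product exploration", None)                   # exit

print(chart.derive_trajectory())
# ['goal exploration', 'data value exploration', 'data preparation',
#  'modelling', 'product exploration']
```

Because every transition carries a number, the bounded-loop rule is enforced by construction: a chart with N transitions can be replayed deterministically, which is what makes trajectory charts usable as templates for project plans.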
Fig. 12. Venn diagram of the three kinds of activities (exploratory, CRISP-DM and data management) and the seven use cases introduced in section 4.

Fig. 13. Venn diagram of the three kinds of activities (exploratory, CRISP-DM and data management) and the number of use cases from the NIST Big Data Public Working Group Use Cases [52] which fall into each region.
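The region assignment behind these Venn diagrams can be computed mechanically from trajectories: a use case falls in the region given by the kinds of activities its trajectory contains. A minimal Python sketch follows; the activity-kind table mirrors the three kinds of activities in the DST map, but the two example trajectories are hypothetical illustrations, not the assignments made in the paper.

```python
# Classify a use case into a Venn region (a subset of the three activity
# kinds) from the activities appearing in its trajectory.

ACTIVITY_KIND = {
    "goal exploration": "exploratory",
    "data value exploration": "exploratory",
    "product exploration": "exploratory",
    "data preparation": "crisp-dm",
    "modelling": "crisp-dm",
    "evaluation": "crisp-dm",
    "data acquisition": "data management",
    "data simulation": "data management",
    "data release": "data management",
}

def venn_region(trajectory):
    """The Venn region of a use case: the set of activity kinds it touches."""
    return frozenset(ACTIVITY_KIND[activity] for activity in trajectory)

# Hypothetical trajectories for two of the exemplars named in Fig. 12.
examples = {
    "sightseeing": ["goal exploration", "data value exploration",
                    "data preparation", "modelling", "product exploration"],
    "pollution simulator": ["data simulation", "data preparation",
                            "modelling", "evaluation"],
}

for name, trajectory in examples.items():
    print(name, "->", sorted(venn_region(trajectory)))
# sightseeing -> ['crisp-dm', 'exploratory']
# pollution simulator -> ['crisp-dm', 'data management']
```

Applied to a catalogue of recorded trajectories, the same few lines reproduce the counts per region shown in Fig. 13.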
resembles CRISP-DM in many ways (starting with business needs and ending up in deployment and maintenance of the outcome of the process), but it would be likewise inappropriate to use the same linear flow for all problems and circumstances. The similarities have suggested the application or adaptation of software development methodologies for data science (or big data) projects [55], but it is perhaps the general project management methodologies that may be more appropriate, or some specific ideas such as design patterns [56]. Also, we can learn from some novel lightweight methodologies, such as Extreme Programming (XP) [57], which attempted to add flexibility to the process, allowing teams to develop software, from requirements to deployment, in a more efficient way.

The second case is methodology in science. The whole process of scientific discovery is usually question-driven, rather than data-driven or goal-driven, but is generally much more flexible in the initial trajectories (surprising observations, serendipity, etc.) – while more strict when it comes to hypothesis testing, replicable experimental design, etc. Despite the analogies between some trajectories in data science and the methodologies in science, there is an ongoing controversy over whether the traditional scientific method is obsolete under the irruption of data science [58], [59], or whether data science methodologies should learn more from the general scientific method [60], [61].

In the absence of more rigid schemes, this diversity of methodologies and trajectories may create uncertainty for project management. This is mitigated by three important aspects of our DST model. First, we define trajectories over a well-defined collection of activities, which can be encapsulated and documented, similar to the original substages in CRISP-DM. DST thus allows data scientists to design their data science projects as well as explore new activities that could be added to or removed from their workflows. This is especially useful for teams, as they can agree on and locate themselves (and subteams) in some of the subactivities of the trajectory. Secondly, existing trajectories can be used as templates, so that new projects can use them as references. A new project may find the best match in the catalogue of trajectories rather than being forced to fit a process model such as CRISP-DM that may not suit the project well and may cause planning difficulties and a bad estimation of effort (e.g., resources, costs, project expertise, completion plans, etc.). Indeed, if the estimations of resources and costs using DST prove more accurate than those using CRISP-DM, this would be evidence of its validity and usefulness in an organisation. Thirdly, trajectories can be mapped to project plans directly, assigning deadlines to transitions, and assigning personnel and budget to activities. Iterations on activities are explicit in the trajectories, which also allows for spiral models where subparts of the trajectory are iterated from small to big or until a given criterion is met (or a resource is exhausted).

All this paves the way for the introduction of proper data science project management methodologies, and the reuse of statistics and experiences from activities used in previous projects. Techniques from the area of workflow inference and management could also be applied to analyse trajectories [62], estimate costs and success rates, and extract patterns that fit a domain or organisation.

While the trajectory perspective may allow for a more systematic (and even automated) analysis at the process level, it is no surprise that the more flexible, less systematic character of the new activities (exploration and data management) highlights the challenges for the automation of data science. For instance, while the automation of the modelling stage of CRISP-DM has been achieved to a large extent under the AutoML paradigm [63], [64], many other parts of CRISP-DM still escape automation, such as data wrangling or model deployment. Beyond data mining, many new competences have been identified as necessary for a data scientist, including both technical and non-technical skills, such as communicating results, leading a team, being creative, etc. [65], [66], [67], [68], and these are usually associated with the exploration activities. Data scientists are expected to cover a wide range of soft skills, such as being proactive, curious and inquisitive, being able to tell a story about the data and visualise the insights appropriately, and focusing on traceability and trust. Most of the new explorative steps beyond CRISP-DM identified in this paper imply these soft skills and the use of business knowledge and vision that is far from the capabilities that AI provides today, and will be harder to automate in the years to come.

The trajectory model does not yet explicitly address all the ethical and legal issues around data science [69], an area that is becoming more relevant in data science than in the previous data mining paradigm, even if problems such as fairness and privacy already existed for data mining. The increased relevance comes especially from the incentives behind many data science projects, which focus on the monetisation of the data through the exploration of new data products. This usually implies the use of data for purposes different from those that created the data in the first place, such as social networks, digital assistants or wearable devices. The most relevant ethical issues will appear in the new activities: goal exploration, data source exploration, data value exploration, result exploration, product exploration, and data acquisition. These are also the parts of the trajectories where more senior data scientists will be involved, assuming higher awareness of and training on ethical issues [70] than other more technical, less senior data scientists or team members.

The DST is also motivated by the causal approach to data science. In this case, it is not so much that new exploratory activities are needed, but new data management activities, required to generate data for the discovery of the causal structure: data acquisition and simulation. These activities are becoming more and more relevant, as we have also seen in the large NIST Big Data repository and the associated trajectories that we explored in section 5.

In conclusion, CRISP-DM still plays an important role as a common framework for setting up and managing data mining projects. However, the world today is a very different place from the world in which CRISP-DM was conceived over two decades ago. In this paper we have argued that the shift from data mining to data science is not just terminological, but signifies an evolution towards a much wider range of approaches, in which the main value-adding component may be undetermined at the outset and needs to be discovered as part of the project. For
such exploratory projects the CRISP-DM framework will be too restrictive. We have proposed a new Data Science Trajectories (DST) framework which expands CRISP-DM by including exploratory activities such as goal exploration, data source exploration and data value exploration. Entry points into, trajectories through and exit points out of this richer set of data science steps can vary greatly among data science projects. We have illustrated this by means of a broad range of exemplar projects and the trajectories they embody.

Data science is still a young subject, with many open questions regarding its nature and methodology. While other authors approach these questions from a top-down perspective [71], what we have attempted here is more bottom-up, starting from something that is generally accepted to be productive in the data mining context, and investigating how it can be generalised to account for the much richer data science context. We hence see this as part of a larger, ongoing conversation and hope that the perspective offered here will be received as a positive contribution.

ACKNOWLEDGMENTS

REFERENCES

[12] H. A. Edelstein, Introduction to Data Mining and Knowledge Discovery. Two Crows, 1998.
[13] L. Cao, "Domain-driven data mining: Challenges and prospects," IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 6, pp. 755–769, 2010.
[14] C. Brunk, J. Kelly, and R. Kohavi, "MineSet: An integrated system for data mining," in KDD, 1997, pp. 135–138.
[15] A. Bernstein, F. Provost, and S. Hill, "Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification," IEEE Trans. on Knowledge and Data Engineering, vol. 17, no. 4, pp. 503–518, 2005.
[16] M. J. Harry, "Six sigma: a breakthrough strategy for profitability," Quality Progress, vol. 31, no. 5, p. 60, 1998.
[17] J. Debuse, B. de la Iglesia, C. Howard, and V. Rayward-Smith, "Building the KDD roadmap," in Industrial Knowledge Management. Springer, 2001, pp. 179–196.
[18] O. Niaksu, "CRISP data mining methodology extension for medical domain," Baltic J. of Modern Computing, vol. 3, no. 2, p. 92, 2015.
[19] D. Asamoah and R. Sharda, "Adapting CRISP-DM process for social network analytics: Application to healthcare," in 21st Americas Conf. on Information Systems, Puerto Rico, 2015.
[20] N. Njiru and E. Opiyo, "Clustering and visualizing the status of child health in Kenya: A data mining approach," International Journal of Social Science and Technology, 2018.
[21] N. Azadeh-Fard, F. M. Megahed, and F. Pakdil, "Variations of length of stay: a case study using control charts in the CRISP-
We thank the anonymous reviewers for their comments, DM framework,” International Journal of Six Sigma and Competitive
Advantage, vol. 11, no. 2-3, pp. 204–225, 2019.
which motivated the analysis in Section 5. This mate- [22] A. Dåderman and S. Rosander, “Evaluating frameworks for im-
rial is based upon work supported by the EU (FEDER), plementing machine learning in signal processing: A comparative
and the Spanish MINECO under grant RTI2018-094403-B- study of CRISP-DM, semma and kdd,” 2018.
C3, the Generalitat Valenciana PROMETEO/2019/098. F. [23] M. Rogalewicz and R. Sika, “Methodologies of knowledge discov-
ery from data and data mining methods in mechanical engineer-
Martı́nez-Plumed was also supported by INCIBE (Ayudas ing,” Management and Production Engineering Review, vol. 7, no. 4,
para la excelencia de los equipos de investigación avanzada pp. 97–108, 2016.
en ciberseguridad), the European Commission (JRC) HU- [24] S. Huber, H. Wiemer, D. Schneider, and S. Ihlenfeldt, “DMME:
Data mining methodology for engineering applications–a holistic
MAINT project (CT-EX2018D335821-101), and UPV (PAID- extension to the CRISP-DM model,” Procedia CIRP, vol. 79, pp.
06-18). J. H-Orallo is also funded by an FLI grant RFP2-152. 403–408, 2019.
[25] C. Barclay, A. Dennis, and J. Shepherd, “Application of the
CRISP-DM model in predicting high school students’ examination
R EFERENCES (csec/cxc) performance,” Knowledge Discovery Process and Methods
[1] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, to Enhance Organizational Performance, p. 279, 2015.
C. Shearer, and R. Wirth, “CRISP-DM 1.0 step-by-step data mining [26] D. B. Fernández and S. Luján-Mora, “Uso de la metodologı́a
guide,” 2000. CRISP-DM para guiar el proceso de minerı́a de datos en lms,”
[2] O. Marbán, J. Segovia, E. Menasalvas, and C. Fernández-Baizán, in Tecnologı́a, innovación e investigación en los procesos de enseñanza-
“Toward data mining engineering: A software engineering ap- aprendizaje. Octaedro, 2016, pp. 2385–2393.
proach,” Information systems, vol. 34, no. 1, pp. 87–107, 2009. [27] L. Almahadeen, M. Akkaya, and A. Sari, “Mining student data
[3] IBM, “Analytics solutions unified method,” ftp://ftp.software. using CRISP-DM model,” International Journal of Computer Science
ibm.com/software/data/sw-library/services/ASUM.pdf, 2005. and Information Security, vol. 15, no. 2, p. 305, 2017.
[4] SAS, “Semma data mining methodology,” http://www.sas.com/ [28] D. Oreski, I. Pihir, and M. Konecki, “CRISP-DM process model
technologies/analytics/datamining/miner/semma.html, 2005. in educational setting,” Economic and Social Development: Book of
[5] L. A. Kurgan and P. Musilek, “A survey of knowledge discovery Proceedings, pp. 19–28, 2017.
and data mining process models,” The Knowledge Engineering Re- [29] E. Espitia, A. F. Montilla et al., “Applying CRISP-DM in a kdd pro-
view, vol. 21, no. 1, pp. 1–24, 2006. cess for the analysis of student attrition,” in Colombian Conference
[6] G. Mariscal, O. Marban, and C. Fernandez, “A survey of data on Computing. Springer, 2018, pp. 386–401.
mining and knowledge discovery process models and methodolo- [30] V. Tumelaire, E. Topan, and A. Wilbik, “Development of a re-
gies,” The Knowledge Engineering Review, vol. 25, no. 02, pp. 137– pair cost calculation model for daf trucks nv using the CRISP-
166, 2010. DM framework,” Ph.D. dissertation, Master’s thesis, Eindhoven
[7] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The kdd process University of Technology, 2015.
for extracting useful knowledge from volumes of data,” Commun. [31] F. Schäfer, C. Zeiselmair, J. Becker, and H. Otten, “Synthesizing
ACM, vol. 39, no. 11, pp. 27–34, Nov. 1996. CRISP-DM and quality management: A data mining approach
[8] R. J. Brachman and T. Anand, “Advances in knowledge discovery for production processes,” in 2018 IEEE International Conference on
and data mining,” U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, Technology Management, Operations and Decisions (ICTMOD). IEEE,
and R. Uthurusamy, Eds. Menlo Park, CA, USA: American 2018, pp. 190–195.
Association for Artificial Intelligence, 1996, ch. The Process of [32] E. G. Nabati and K.-D. Thoben, “On applicability of big data an-
Knowledge Discovery in Databases, pp. 37–57. alytics in the closed-loop product lifecycle: Integration of CRISP-
[9] C. Gertosio and A. Dussauchoy, “Knowledge discovery from DM standard,” in IFIP International Conference on Product Lifecycle
industrial databases,” Journal of Intelligent Manufacturing, vol. 15, Management. Springer, 2016, pp. 457–467.
no. 1, pp. 29–37, 2004. [33] H. Nagashima and Y. Kato, “Aprep-dm: a framework for automat-
[10] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, ing the pre-processing of a sensor data analysis based on CRISP-
Discovering data mining: from concept to implementation. Prentice- DM,” in 2019 IEEE International Conference on Pervasive Computing
Hall, Inc., 1998. and Communications Workshops (PerCom Workshops). IEEE, 2019,
[11] A. G. Buchner, M. D. Mulvenna, S. S. Anand, and J. G. Hughes, pp. 555–560.
“An internet-enabled knowledge discovery process,” in Proc. of the [34] S. B. Gómez, M. C. Gómez, and J. B. Quintero, “Inteligencia de
9th Int. Database Conf., Hong Kong, vol. 1999, 1999, pp. 13–27. negocios aplicada al ecoturismo en colombia: Un caso de estudio
13
aplicando la metodologı́a CRISP-DM,” in 14th Iberian Conference on [56] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design patterns:
Information Systems and Technologies, CISTI 2019. IEEE Computer elements of reusable object-oriented software. Pearson Education,
Society, 2019, p. 8760802. 1995.
[35] R. Ganger, J. Coles, J. Ekstrum, T. Hanratty, E. Heilman, [57] K. Auer and R. Miller, Extreme programming applied: playing to win.
J. Boslaugh, and Z. Kendrick, “Application of data science within Addison-Wesley Longman Publishing Co., Inc., 2001.
the army intelligence warfighting function: problem summary and [58] C. Anderson, “The end of theory: The data deluge makes the
key findings,” in Artificial Intelligence and Machine Learning for scientific method obsolete,” Wired magazine, vol. 16, no. 7, pp. 16–
Multi-Domain Operations Applications, vol. 11006. International 07, 2008.
Society for Optics and Photonics, 2019, p. 110060N. [59] R. Kitchin, “Big data, new epistemologies and paradigm shifts,”
[36] R. P. Bunker and F. Thabtah, “A machine learning framework for Big Data & Society, vol. 1, no. 1, p. 2053951714528481, 2014.
sport result prediction,” Applied computing and informatics, 2017. [Online]. Available: https://doi.org/10.1177/2053951714528481
[37] R. Barros, A. Peres, F. Lorenzi, L. K. Wives, and E. H. da Silva Jac- [60] S. Carrol and D. Goodstein, “Defining the scientific method,” Nat
cottet, “Case law analysis with machine learning in brazilian Methods, vol. 6, p. 237, 2009.
court,” in International Conference on Industrial, Engineering and [61] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee,
Other Applications of Applied Intelligent Systems. Springer, 2018, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, “Theory-
pp. 857–868. guided data science: A new paradigm for scientific discovery from
data,” IEEE Transactions on Knowledge and Data Engineering, vol. 29,
[38] K. J. Cios, A. Teresinska, S. Konieczna, J. Potocka, and S. Sharma,
no. 10, pp. 2318–2331, 2017.
“A knowledge discovery approach to diagnosing myocardial per-
[62] W. Van Der Aalst, K. M. Van Hee, and K. van Hee, Workflow
fusion,” Engineering in Medicine and Biology Magazine, IEEE, vol. 19,
management: models, methods, and systems. MIT press, 2004.
no. 4, pp. 17–25, 2000.
[63] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-
[39] K. J. Cios and L. A. Kurgan, “Trends in data mining and knowl- weka: Combined selection and hyperparameter optimization of
edge discovery,” in Advanced techniques in knowledge discovery and classification algorithms,” in Proceedings of the 19th ACM SIGKDD
data mining. Springer, 2005, pp. 1–26. international conference on Knowledge discovery and data mining.
[40] S. Moyle and A. Jorge, “Ramsys-a methodology for supporting ACM, 2013, pp. 847–855.
rapid remote collaborative data mining projects,” in ECML/PKDD [64] I. Guyon, L. Sun-Hosoya, M. Boullé, H. Escalante, S. Escalera,
2001 Workshop on Integrating Aspects of Data Mining, Decision Sup- Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag et al., “Analysis of
port and Meta-Learning: Internal SolEuNet Session, 2001, pp. 20–31. the automl challenge series 2015-2018,” 2017.
[41] F. Martı́nez-Plumed, L. C. Ochando, C. Ferri, P. A. Flach, [65] “8 Skills You Need to Be a Data Scientist,” https://blog.udacity.
J. Hernández-Orallo, M. Kull, N. Lachiche, and M. J. Ramı́rez- com/2014/11/data-science-job-skills.html, Nov. 2014.
Quintana, “CASP-DM: context aware standard process for data [66] V. Dhar, “Data science and prediction,” Communications of the
mining,” CoRR, vol. abs/1709.09003, 2017. [Online]. Available: ACM, vol. 56, no. 12, pp. 64–73, 2013.
http://arxiv.org/abs/1709.09003 [67] M. Loukides, What Is Data Science? ”O’Reilly Media, Inc.”, Apr.
[42] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, “Data mining with big 2011.
data,” IEEE transactions on knowledge and data engineering, vol. 26, [68] E. Commission, “European e-Competence Framework,” 2016.
no. 1, pp. 97–107, 2014. [Online]. Available: http://www.ecompetences.eu/
[43] J. Rollins, “Why we need a methodology for data science,” [69] M. Taddeo and L. Floridi, “Theme issue ‘the ethical impact of data
2015. [Online]. Available: https://www-01.ibm.com/common/ science’,” 2016.
ssi/cgi-bin/ssialias?htmlfid=IMW14824USEN [70] S. Russell, S. Hauert, R. Altman, and M. Veloso, “Ethics of artificial
[44] R. B. Severtson, “What is the team data science process?” intelligence,” Nature, vol. 521, no. 7553, pp. 415–416, 2015.
2017. [Online]. Available: https://docs.microsoft.com/en-us/ [71] L. Cao, “Data science: a comprehensive overview,” ACM Comput-
azure/machine-learning/team-data-science-process/overview ing Surveys (CSUR), vol. 50, no. 3, p. 43, 2017.
[72] M. Ponsen, K. Tuyls, M. Kaisers, and J. Ramon, “An evolutionary
[45] J. Pearl and D. Mackenzie, The book of why: the new science of cause
game-theoretic analysis of poker strategies,” Entertainment Com-
and effect. Basic Books, 2018.
puting, vol. 1, no. 1, pp. 39–45, 2009.
[46] J. Pearl, “The seven tools of causal inference, with reflections on
machine learning.” Commun. ACM, vol. 62, no. 3, pp. 54–60, 2019.
[47] G. W. Imbens and D. B. Rubin, “Rubin causal model,” The new
palgrave dictionary of economics, pp. 1–10, 2017.
A PPENDIX
[48] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen, “A In section 5 we portray summarised information about 51
linear non-gaussian acyclic model for causal discovery,” J. Mach. use cases extracted from the NIST Big Data Public Working
Learn. Res., vol. 7, pp. 2003–2030, Dec. 2006. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1248547.1248619 Group [52]. In this appendix we give more information
[49] M. A. Hernán, J. Hsu, and B. Healy, “A second chance to get about this source of cases and the methodology we used to
causal inference right: A classification of data science tasks,” process them. The National Institute of Standards and Tech-
CHANCE, vol. 32, no. 1, p. 42–49, Jan 2019. [Online]. Available: nology (NIST) sought to establish relations among industry
http://dx.doi.org/10.1080/09332480.2019.1579578
professionals to further the secure and effective adoption of
[50] S. Chaudhuri, U. Dayal, and V. Narasayya, “An overview of
business intelligence technology,” Communications of the ACM, Big Data and develop consensus on definitions, taxonomies,
vol. 54, no. 8, pp. 88–98, 2011. secure reference architectures, security and privacy, and,
[51] D. Keim, G. Andrienko, J.-D. Fekete, C. Görg, J. Kohlhammer, from these, a standards roadmap. With this aim, the NIST
and G. Melançon, “Visual analytics: Definition, process, and chal-
Big Data Public Working Group (NBD-PWG) was launched
lenges,” in Information visualization. Springer, 2008, pp. 154–175.
[52] D. N. B. D. I. Framework, “NIST big data interoperability frame- with extensive participation by industry, academia, and
work: Volume 3, use cases and general requirements,” NIST Special government. The results from this group are reported in
Publication, vol. 1500, p. 344, 2019. the NIST Big Data Interoperability Framework series of vol-
[53] J. Saltz, K. Crowston et al., “Comparing data science project man- umes which, among definitions, taxonomies, requeriments,
agement methodologies via a controlled experiment,” in Proceed-
ings of the 50th Hawaii International Conference on System Sciences, etc., contains a set of 51 original use cases gathered by
2017. the NBD-PWG Use Cases and Requirements Subgroup13 .
[54] L. R. Vijayasarathy and C. W. Butler, “Choice of software de- The report includes examples in the following broad areas:
velopment methodologies: Do organizational, project, and team government operations (4 cases), commercial (8), defense
characteristics matter?” IEEE software, vol. 33, no. 5, pp. 86–94,
2016. (3), healthcare and life sciences (10), deep learning and social
[55] V. D. Kumar and P. Alencar, “Software engineering for big data media (6), research (4), astronomy and physics (5), earth,
projects: Domains, methodologies and gaps,” in 2016 IEEE Interna-
tional Conference on Big Data (Big Data). IEEE, 2016, pp. 2886–2895. 13. https://bigdatawg.nist.gov/show InputDoc.php
Fig. 14. Ternary plot depicting the proportions of the three activity types (exploratory, CRISP-DM and data management) for the seven use cases in Section 4 and the 51 use cases from the NIST Big Data Public Working Group [52] (numbers show how many NIST use cases fall on the same point).
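The position of each use case in the ternary plot is simply the normalised share of each activity type in its trajectory. A hedged sketch of how such proportions, and a crude dominant category (goal-directed/CRISP-DM, exploratory, or data management), might be derived; the activity-to-type tagging below is hypothetical, not the exact coding scheme used for Fig. 14:

```python
# Hypothetical tagging of activities into the three types of Fig. 14.
ACTIVITY_TYPE = {
    "GoalExploration": "exploratory",
    "DataSourceExploration": "exploratory",
    "DataAcquisition": "data_management",
    "Modelling": "crisp_dm",
    "Evaluation": "crisp_dm",
    "Deployment": "crisp_dm",
}

def type_proportions(activities):
    """Fraction of a use case's activities falling in each type."""
    counts = {"exploratory": 0, "crisp_dm": 0, "data_management": 0}
    for act in activities:
        counts[ACTIVITY_TYPE[act]] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

use_case = ["GoalExploration", "DataAcquisition",
            "Modelling", "Evaluation", "Deployment"]
props = type_proportions(use_case)          # coordinates in the ternary plot
dominant = max(props, key=props.get)        # crude project category
```

Two use cases with the same mix of activity types land on the same point of the plot, which is why several NIST use cases coincide in Fig. 14.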