Modularity HICSS Final Afterreview

Helping Data Science Students Develop Task Modularity

Jeffrey S. Saltz, Syracuse University
Robert Heckman, Syracuse University
Kevin Crowston, Syracuse University
Sangseok You, HEC Paris
Yatish Hedge, Syracuse University

Abstract

This paper explores the skills needed to be a data scientist. Specifically, we report on a mixed method study of a project-based data science class, where we evaluated student effectiveness with respect to dividing a project into appropriately sized modular tasks, which we termed task modularity. Our results suggest that while data science students can appreciate the value of task modularity, they struggle to achieve effective task modularity. As a first step, based on our study, we identified six task decomposition best practices. However, these best practices do not fully address this gap of how to enable data science students to effectively use task modularity. We note that while computer science / information system programs typically teach modularity (e.g., the decomposition process and abstraction), there remains a need to identify a model, corresponding to that used for computer science / information system students, to teach modularity to data science students.

1. Introduction

Data Science is an emerging discipline that combines expertise across a range of domains, including software development, data management and statistics. Data science projects typically have a goal to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest and sentiment [1]. Big Data is a related field, often thought of as a subset of data science, in that data science applies to large and small data sets and covers the end-to-end process of collecting, analyzing and communicating the results of the analysis. With the increasing ability to collect, store and analyze an ever-growing diversity of data that is being generated with increasing frequency, the field of data science is growing rapidly.

As a new field, much has been written about the use of data science and algorithms that can generate useful results. Unfortunately, less has been written about the project skills students need to learn in order to be skilled data scientists [2], especially on how these emerging data scientists can work together on a data science project.

One aspect of enabling a team to work well together is having the team be able to break the project into modular components [3]. A modular approach enables the team to proceed more quickly and effectively [4]. Furthermore, it has also been noted that modularity brings increased flexibility, a better ability to deal with complexity and the accommodation of uncertainty [5]. More generally, it has been shown that leveraging modularity delivers significant benefits within many contexts, such as manufacturing [6] and, perhaps most commonly, software development [7].

The modularity of a solution can be considered as a continuum describing the degree to which the components of a solution can be separated, worked on independently, and recombined [8]. With respect to data science, the use of R [9] is an example of one aspect of leveraging modularity. The Comprehensive R Archive Network (CRAN) contains thousands of “packages”, which can be installed and loaded as needed. These packages enable a team to easily leverage modules developed by others, such as using an advanced machine learning module via a function call, and are a key aspect of modularity (and the growth of R).

However, another aspect of modularity, task modularity, is concerned with how a data science team breaks down its activities into modular “chunks of work” that can be worked on in parallel, in a coordinated manner. One important benefit of task modularity is that it helps reduce the need to coordinate details of a team member’s work with other team members.

Since data scientists need to work on complex problems, providing a framework for data science students to effectively use task modularity should be a key aspect of data science education. Unfortunately, there are few studies exploring team process effectiveness within a data science context. There has also been minimal research on how to best develop modular thinking in the students who will become the next generation of data science practitioners.
However, there has been a recent study that demonstrated that the Kanban process methodology was a promising approach for collaboration and coordination in data science teams [10]. Unfortunately, neither that study nor any other study explored how to help data science teams increase their task modularity. With this in mind, we chose to explore modularity within a Kanban process methodology context.

Specifically, our research explores if using a Kanban data science project methodology improves a student’s ability to use modular concepts, as compared to the baseline situation that is how most data science student teams currently work, that is, without a well-defined process methodology. Thus, we focused on the following research questions:

RQ1: Do data science students naturally apply task modularity while working on a data science project?

RQ2: Does using a Kanban process methodology improve modular thinking in data science students and lead to improved task modularity?

To address our research questions, we report on a mixed method study that explores students using the Kanban process methodology and evaluates if the methodology impacts a student’s ability to think in a more modular manner.

The rest of the paper is structured as follows. In Section 2, we review modularity in project management, software engineering, as well as in a data science context. Section 3 describes the methodology for our study. Section 4 discusses the findings from our study. Finally, Section 5 presents a synthesis of our research results and Section 6 discusses limitations and possible next steps.

2. Background

To provide context for our exploration of task modularity and the Kanban process methodology, this section provides background on the need for improved data science team coordination, project modularity, the growing use of Kanban in the classroom, as well as the current research that focuses on data science education.

As previously noted, one key enabler of team coordination is to be able to decompose a project into modular tasks. Because there has been minimal research reported on data science team collaboration or a data scientist’s use of task modularity, in this section we also explore related domains. However, it must be noted that while data science projects do have parallels to other domains, there are also differences as compared to these other types of projects. For example, Chen, Kazman & Kaziyev [18] argue that the use of agile techniques for data science is new and necessitates careful adaptation, as it is dramatically different from smaller, more traditional, data analytics efforts. Furthermore, compared to software development, data science projects have an increased focus on data, what data is needed and the availability, quality and timeliness of the data [1, 19, 20]. Thus, while there are some parallels to other domains, one cannot assume findings in those other domains will be applicable within a data science context.

2.1. The Need for Improved Coordination

Vanauer, Bohle and Hellingrath [11] noted the lack of an empirically grounded big data science methodology. Hence, not surprisingly, it has been observed that most data projects are managed in an ad hoc fashion, that is, at a low level of process maturity [12]. Indeed, it has been argued that data science projects need to focus on people, process and technology [13, 14] and not just on algorithms used by data scientists.

Thus, the need for more guidance is recognized with respect to how data scientists can best work together; for example, a Gartner Consulting report advocates for more careful management of the analysis processes, though a specific methodology is not identified [15]. Chen, Kazman & Matthes [16] studied 23 large enterprises and confirmed this gap. This gap is reinforced by Espinosa and Armour [17], who noted that the main challenge in a data science project is task coordination.

2.2. Task Modularity

We note that the benefit of task modularity is that it supports complex problem-solving by enabling a team member to focus on smaller challenges, rather than needing to focus on the entire problem [21, 22]. When using a modular approach, one leverages a general set of design principles that involves breaking up a problem into discrete chunks [23] and “building a complex product or process from smaller subsystems that can be designed and worked on independently yet function together as a whole” [24].

Hence, task modularity is likely to be important to data science students due to the benefits of decomposing tasks and allowing different team members to work on different aspects of the project, similar to what is done in software development projects. In other words, enabling or improving
modularity can provide data science students with a mechanism to improve team effectiveness.

To better understand the potential importance of task modularity within a data science project, we first explore modularity in the fields of project management and software development.

2.2.1. Modularity and Project Management. Task decomposition is specifically addressed in the Project Management Body of Knowledge, sometimes known as PMBOK [25]. The Project Management Body of Knowledge defines a work breakdown structure, which is “a hierarchical decomposition of the total scope of work to be carried out by the project team to accomplish the project objectives and create the required deliverables” [26]. In other words, in the project management literature, task decomposition is typically viewed as a linear, structured, top-down process for creating a hierarchy of sub-tasks. Task decomposition results in a hierarchical view of project work that includes simple operations, tasks, and sub-tasks. The process typically has a detailed series of steps focused on strategic goals, priorities, required resources, logical sequence, and milestones, and eventually produces a task flowchart that shows all levels of project breakdown.

While the PMBOK guidelines are useful within general project management for their emphasis on high-level goals, priorities, resources required, milestones, and completion criteria, this top-down, linear, hierarchical process may not be optimal for data science projects in which future tasks and goals are often dependent on results of previous analyses, and cannot be precisely specified in advance.

2.2.2. Modularity and Software Engineering. In the earliest days of software development, programmers intuitively decomposed a programming task into modules, and programming was essentially a craft discipline. However, in the early 70’s, Niklaus Wirth described a decomposition procedure that aimed to identify modules of a solution whose work could proceed independently of other work [27] and was used to describe the design of systems [28]. Later, modularity was noted as an approach in the design of software products [29]. Over time, more formal rules evolved (e.g., structured programming, object-oriented analysis and design, etc.) and guidelines for decomposition became formalized and incorporated into programming tools, such as integrated development environments.

This focus on work that can be done independently of other work is a central principle in the team-based task decomposition that has now been integrated into software development coding tools. In fact, the notion of modularity is central in the design and production of software artifacts, especially for large and complex projects [5]. In other words, for information system / computer science development efforts, the widespread adoption of object-oriented languages and the diffusion of component-based development, as well as other popular trends in software engineering, mean that software developers are exposed to modular thinking throughout their post-secondary education as well as after graduation, when they then use those concepts within a software development context. This is very different from data scientists, who are often not exposed to this task modularity during their data science education.

2.3. Data Science Education

Perhaps because it is a new domain, we note that there has been little focus on what skills data science students should gain that could improve their ability to execute data science team projects. For example, there has not been significant discussion of the challenges that students might encounter when they are doing a data science project.

However, there has been some research on students working on data science projects within a course context. For example, one study explored how student teams worked on data science projects using different methodologies [30] and a different study discussed a project-focused data science course [31], but the focus of that study was on the viability of using real-world projects, not on how the team actually worked together. In fact, neither of these efforts focused on the task modularity within a project, and no research has been identified that focuses on this topic within a data science context.

There has also been some research published on the slightly more general topic of data science education. For example, some have focused on designing a data science curriculum [32, 33], others have focused on the overall design of an introductory data science course [34, 35], and yet another focused on data science pair-programming [36]. In the end, it is not surprising that it has been noted that there has been little research reported on how to educate data science students [37].

2.4. Kanban in the Classroom

Kanban was created for lean manufacturing, but has been adopted across a number of domains, including software development [38]. A key aspect of this methodology is the Kanban board, where the work in progress can easily be seen and tracked [39]. Specifically, the phases of a project are shown as
columns on a Kanban board. Within each phase, there is typically a defined maximum number of work-in-progress tasks. Using this framework, the team defines a prioritized list of what to do. Then, based on the number of allowed simultaneous tasks at each phase (column on the board), a task flows through the defined process. Limiting the number of tasks within any one phase, known as limiting work-in-progress (WIP), helps to ensure that the team minimizes bottlenecks [40] and wasted effort and also enables agility, in that the team can quickly reprioritize tasks that have been proposed but not started.

More specifically, the following are the key Kanban principles, based on Anderson’s description of the Kanban methodology [39]. First, “visualize the workflow” refers to splitting the work into pieces, writing each task on a card, putting that card on the board and using named columns to illustrate where each item is in the workflow. Making the work visible, along with blockers, bottlenecks and queues, is believed to lead to increased communication and collaboration.

The other key aspect of Kanban is to limit the work-in-progress (WIP). By limiting how much unfinished work is in progress, the team can hopefully reduce the time it takes an item to travel through the Kanban system (i.e., for the task to be completed). This limiting of WIP can also help avoid problems caused by task switching. The idea is that by using work-in-progress limits the team can smooth the flow of work and make sure the team is focused on getting work completed, as well as collect metrics to analyze flow.

There is growing research demonstrating the benefits when student teams use a Kanban approach. For example, it has been empirically shown that Kanban provides increased motivation and project activity control [41]. In addition, a case study of students using the Kanban methodology found that the students who applied the Kanban principles in their project work perceived an increase in outcome success [40]. It was also found that the majority of the students expressed positive views about Kanban in their project work and appreciated its value as part of their university education. Others have also reported on the benefits of using a Kanban-based methodology for capstone projects [42].

More generally, a recent study statistically compared the effectiveness of the Scrum and Kanban methods for software development projects [43] and found that both Scrum and Kanban led to the development of successful projects, but that the Kanban method was better than the Scrum method.

In terms of research on Kanban practices within a data science classroom context, in an experiment comparing several different process methodologies for use within a group data science project, it was noted that a Scrum methodology performed the worst due to the challenge of students being able to estimate task duration and Kanban performed the best, in part due to not requiring explicit task estimation [10].

Hence, with this background in mind, our research focused on the task modularity skills and capabilities of data science students when using the Kanban methodology.

3. Methodology

Our study examines the ability of data science students to use task modularity when using a Kanban process on a group project. In this section, we first describe the context and class environment and then review the Kanban process used within the course.

3.1. Study Context / Environment

A total of 152 students in a graduate-level introduction to data science class were put into teams to work on a semester-long project. All students received the same large lecture instruction as well as weekly time in a smaller lab section. There were 7 lab sections in the course. There were four section instructors, with each instructor teaching one, two or three sections. Students were randomly assigned to teams, which comprised four to six students per team, and all team members were from the same section. In total, there were 31 teams in our study.

While most of the students were graduate information system students, 13 percent were in other graduate programs, mainly business administration or public policy. The class had a broad spectrum of student undergraduate majors, including fields such as information technology, engineering, and business.

The mixed method study explored how the Kanban process methodology affected modular thinking and task modularity across the teams via three complementary approaches. First, we explored the task modularity of the actual student projects. Second, we also surveyed the students on their perceptions of how the methodology encouraged task modularity. Finally, we augmented this data with semi-structured interviews of the section instructors.
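
As a rough illustration of the Kanban mechanics summarized above (a prioritized to-do list, named columns, and a work-in-progress limit per column), the following Python sketch models a board in which a task may move to the next column only when that column has room. It is a minimal sketch and not the tooling used in the study, where the teams used Trello. The class name KanbanBoard, its methods, the limit of two tasks per in-progress column, and the third task name are illustrative assumptions; the column names follow the default configuration suggested to the student teams.

```python
from collections import OrderedDict


class KanbanBoard:
    """Toy Kanban board (hypothetical): ordered columns with optional WIP limits."""

    def __init__(self, wip_limits):
        # wip_limits maps column name -> max cards allowed (None = unlimited).
        self.wip_limits = dict(wip_limits)
        self.columns = OrderedDict((name, []) for name in wip_limits)

    def add_task(self, task):
        """Append a task to the first column (the prioritized to-do list)."""
        first = next(iter(self.columns))
        self.columns[first].append(task)

    def has_room(self, column):
        """Return True if the column is below its WIP limit."""
        limit = self.wip_limits[column]
        return limit is None or len(self.columns[column]) < limit

    def advance(self, task):
        """Move a task one column to the right if the WIP limit allows it."""
        names = list(self.columns)
        for i, name in enumerate(names[:-1]):
            if task in self.columns[name]:
                target = names[i + 1]
                if not self.has_room(target):
                    return False  # blocked: the next phase is full
                self.columns[name].remove(task)
                self.columns[target].append(task)
                return True
        return False  # task not found, or already in the last column


# Assumed WIP limit of 2 for the in-progress columns (illustrative only).
board = KanbanBoard({"to do": None, "doing": 2, "validating": 2, "done": None})
for t in ["link weather data", "fit linear model", "plot satisfaction map"]:
    board.add_task(t)

# Try to start every task; the third is blocked by the WIP limit on "doing".
moved = [board.advance(t) for t in list(board.columns["to do"])]
print(moved)
print(dict(board.columns))
```

In this sketch the third task stays in “to do” because “doing” already holds two tasks; holding work back in this way is the behavior that WIP limits are intended to produce.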
3.2. Kanban Process Description

At the start of the project, the students received an explanation of the Kanban process to be used during their team project. This explanation typically took about an hour of class time (including student Q&A). Throughout the semester, the teams received feedback on their use of the methodology from their instructor.

The Kanban project methodology was based on the Kanban pipeline process management described by Anderson [39] and summarized in the previous section. The teams were given the freedom to define the columns (the different phases of the project) on their board that they thought were most useful; however, a default configuration was suggested that had four columns: “to do”, “doing”, “validating” and “done”. To help define and track work, the teams used Trello (www.trello.com), which is an online web-based tool for visualizing a board.

From a task perspective, each team was asked to define what they wanted to investigate (i.e., tasks such as “link weather data to our previously collected data”). These ideas were all listed (in a prioritized order) in their “to do” column. Then, as space permitted (based on the number of allowed simultaneous tasks at each step), a task was permitted to flow to the next column on the board. In other words, when a task was completed within a column, that task was moved to the next column and so on across the board until the task was completed. Each team set their own WIP limits, and the WIP limit per column was typically the number of people in the team.

As the board allowed (based on the work-in-progress limits), new tasks could be started. Each team also decided on the size of the “chunks of work” (tasks to be done). However, it was explained to the teams that the smaller / more detailed the task, the easier it would be for the team to understand potential bottlenecks.

Hence, the process to do a data science project could be thought of as a pipeline with requests entering one end and improved data insight coming out the other end. The team worked through the project pipeline throughout the project with no specific deadlines for interim deliverables. The goal was to make sure, at the end of the semester, that there was not a lot of time spent on an effort that did not complete (better to get fewer tasks all the way through the pipeline). In summary, the key Kanban-based principles, based on Anderson’s description of the Kanban methodology, were explained to the students, and included concepts such as visualizing the workflow, limiting work-in-progress and focusing on the flow of the Kanban board.

3.3. Measuring Task Modularity

One approach to measure the task modularity for a project that uses Kanban is to evaluate the Kanban board used by that team. Specifically, we explored each board to see if each of the tasks was defined in a modular fashion.

Each team’s Kanban board was evaluated twice. The first board evaluation was three weeks after the project started, and the second evaluation was one month later, after instructor coaching on task modularity.

Two independent coders evaluated each of the 31 Kanban boards to determine each board’s task modularity. The coders were experienced data scientists and were provided high-level guidelines to evaluate the tasks (e.g., evaluate the required time to complete the task), as well as some specific criteria to help evaluate the tasks (e.g., did the task have a clearly defined goal). However, it was also recognized that determining what was a “good modular task” required some human judgment. Perhaps due to this required judgment, after training, the coders agreed on 85% of the coding decisions. Disagreements were discussed and agreed upon to create a final coded data set.

In general, for poor/low modularity, the tasks were either too big or too small (in terms of taking a reasonable amount of time to complete), and/or the inputs/outputs were not clearly defined (so, when reading the task, the goals of the task were difficult to know). Examples of tasks with low modularity include “identify drivers for customer satisfaction”, which was the overall goal of the project and so was too vague a task, and “compute average customer satisfaction”, which, based on the team’s current status, would have taken just a couple of minutes to complete.

For good/high modularity, the tasks were appropriate in terms of the expected scope of the task (e.g., it would take a reasonable amount of time to complete). In addition, the inputs and outputs of that task were clear. An example of a task with good modularity was “generate a linear model for customer satisfaction based on our 10 identified target variables”.

4. Findings

4.1. Task Modularity for the Project
As shown in Table 1, for the initial analysis, of the 31 teams analyzed, only two teams (6%) had good task modularity and 19 (61%) were rated as poor. While there was some improvement after instructor coaching, there were still only six teams (19%) having good task modularity. Taken together, we can note that many of the teams had significant challenges creating modular tasks.

Table 1: Task and Project level modularity

        Initial     Follow-up
Poor    19 (61%)    11 (35%)
Fair    10 (32%)    14 (45%)
Good     2 (6%)      6 (19%)

Furthermore, the instructors’ perceptions matched the findings of the analysis of the Kanban boards: while there was an improvement from previous semesters, the students still struggled with task modularity. One interesting finding when getting feedback from the instructors was that the boards provided a vehicle through which the instructors were able to easily provide structured feedback on a student’s (or the student team’s) task modularity. Hence, within this context, instructors thought that the use of this process was helpful, in that students were able to learn via the feedback (coaching) that the instructors were able to provide throughout the semester.

4.2. Student Perceptions

At the end of the project, each student was given a survey to complete. Out of the 152 students, 134 responded to the end-of-semester survey (88% response rate).

One question was a 5-point Likert-type item (the extent to which they agreed or disagreed with the following statement: “I think using a Kanban board improved the task modularity of our project”) and 84% of the students agreed or strongly agreed with the statement. Hence, the students thought that the methodology improved their task modularity.

In addition, to better understand the students’ thinking with respect to modularity, they were also asked an open-ended question. Specifically, the survey asked, “Please provide some context / information on your answer to the previous question relating to if you thought using a Kanban board improved the modularity of the project”. The answers collected were analyzed through an iterative process of item surfacing, refinement and regrouping.

4.2.1. Kanban Helping with Task Modularity. An analysis of the comments showed that most students (76%) understood at least some of the benefits of task modularity when executing their projects. Some of the student comments were quite insightful, and clearly articulated benefits associated with breaking a complex project down into smaller chunks. Specifically, we identified four key themes noted by the students, each of which is described below.

Smaller tasks improve understanding of the overall project - Students were able to articulate one of the key benefits of modularity, that is, task decomposition, via comments such as:

“I think that the usage of trello boards really helped to divide the entire project into smaller tasks”

“We used Trello to detail each and every task we worked on. For example, a task like visualization was split into two people where one would handle bar charts and maps and the other would handle scatter plots and heat maps. Similarly, models were also split.”

“Everyone uploaded tasks individually which they thought were important and later we could discuss them and find out which one is actually needed”

Task modularity improves overall project tracking - A different theme noted by students was that having task modularity enabled them to more easily track progress of their project. This was exemplified by comments such as:

“It helped us in having a look at our progress with the tasks and the pending tasks”

“We could focus more efficiently on the tasks in the to do section”

“Moving completed tasks to completed section and proposed tasks to the to do list helped us to focus on a few tasks at a time in an efficient manner”

“This helped us understand how well we were paced with the project”

Modular tasks facilitate distributing the work - Students also realized that by decomposing the project into modular components, they could more easily divide the work across the team, as noted via the following student comments:
“Having individual tasks also helped us in the “While trying to understand the data science
distribution of work among the team members” concepts at a basic level, I don't think we internalized
how to break them down into modules or small tasks”
“This division of tasks also helped in distribution of
the tasks among individual team members” “It was sometimes difficult to divide your task into
smaller chunks due to lack of proper
Task modularity improves individual task communication”
tracking – The final theme noted by students was
that using a modular approach also enabled the teams Similarly, a related challenge in decomposing the
to more easily track the progress of individual tasks, project was to have a good grasp on how large each
exemplified by comments such as: module should be – being too large or too small was
noted as being problematic:
“Moving the finished tasks to the completed section
also helped us keep track of the status of the pending “It can help us to divide big tasks but we still face
work” some problems like how to divide tasks equally”

“Each team member could monitor the other persons “There were times when we tried to divide up a task
work right to the very detail” into pieces just so that people could have a task to
move but in the end, we would either complete the
“We could break up the project into multiple tasks task together or individually before comparing
and tackle them one at a time” results”

4.2.2 Kanban Not Helping with Task Modularity. However, the analysis of student comments also showed that a number of students found it difficult to effectively modularize their project tasks, which was consistent with what was noted by the analysis of the actual Kanban boards (that were created within Trello). In reviewing student comments, 22% of the students articulated challenges in modularizing their work. These comments suggest that at least some students were willing to articulate the problems they experienced when trying to break down a complex project into workable chunks. Our analysis of the Kanban boards, described in the previous section, indicates that these problems were probably more widespread than even these comments indicate.

Specifically, in reviewing the student comments, we noted two key themes relating to the Kanban process not being helpful with respect to modularity.

Hard to divide complex tasks into chunks of appropriate size/scope - Knowing that it is useful to create subtasks is not the same as being able to create subtasks that are useful. This was a key challenge noted when evaluating the Kanban boards, and was also noted by several students, for example:

“Our team was unsure on how to break down the more complex tasks, which led us to having a team member focus on one modeling solution at a time in a silo, rather than having it broken out more incrementally”

“Division of work looked fine on trello board, but implementation of the tasks on separate machines and then combining them was a bigger task”

Process was confusing - In addition, some students were confused with respect to how to use the process. This could have been due to the combination of working on a difficult project, working with a process methodology that was new to them and trying to create modular components, which was also a new concept. Additional education and training might address these issues, which were typically noted in very general terms, such as:

“[the process] actually made this process more confusing”

“[the process] was just a burden as we did not need that as opposed to meeting regularly as a team”

“Our group was also very small, the task assignment process is usually only helpful when the group is 10+”

5. Discussion

From our analysis of the project boards, it is clear that decomposing large and complex data science tasks into discrete, re-combinable subtasks was a challenge for most of the data science students. While some teams did improve from their initial efforts,
even after gaining comfort in the use of the Kanban process, task modularity in most teams was not as good as one would hope to see, since only 19% of the teams achieved good task modularity. Hence, the evidence from our study makes it clear that decomposing large and complex tasks into discrete, re-combinable subtasks was still a challenge for many data science teams.

However, it is also clear from our study that students did recognize the rationale for, and benefits of, effective decomposition of complex project tasks. This suggests that the challenge is not student motivation, but rather, that creating good modular sub-tasks is difficult and that the students need more than the training and coaching provided within the Kanban context. In other words, the fact that there were still many teams who were unable to decompose effectively suggests that opportunities remain in terms of how to best enable data science teams to effectively achieve task modularity.

As a first step towards the goal of helping students improve task modularity, we provide some potential best practices for task decomposition. Specifically, based on our observations, we developed six guidelines that might help students improve their ability to effectively decompose data science projects into workable sub-tasks. These approaches are often complementary in nature, in that they could be used in conjunction with each other to help in the process of task decomposition.

Note that these six practices could be explored, refined and elaborated upon in future research investigating data science task modularity.

Have a specific and concise task title - Since the task description, or title, is often how people refer to the task, it is important to have a well-defined task name. Titles should also be short and focused. It is helpful to start the title with a verb. The title is the first step to ensure everyone understands what will be done within the task.

Have a well-defined goal - Ensure that the task has a clearly defined goal. In other words, what is the purpose of the task/module, and why is it important to complete the task? How does this module help create actionable insight? It is also important that others can easily understand the goal of the task, which should be suggested by the task title, but elaborated as needed to ensure a consistent view across the team of the goal of the task.

Define task inputs and outputs - In addition to having a clearly defined goal, it is also important to clearly articulate what inputs the module needs (e.g., data attribute columns within a data file, which might have been generated as an output from a previous task, such as data cleaning). It is also important to clearly define the outputs that will be generated from the module, which might be cleaned data sets, visualizations or model output. By having clearly defined inputs and outputs, teams can achieve high cohesion and low coupling.

Ensure reasonable task duration - Care should be given to the duration of the task. A task that will take one month to complete will involve significant work, and will lessen the impact of a modular approach. Similarly, it is possible to define tasks with durations that are too short, leading to a focus on the trivial and excessive task management overhead. The study suggests that one of the hardest challenges for students was to understand how to best decide on the granularity of the task. In essence, the challenge is to ensure the task is “not too big, but also, not too small”. For example, one could try to use a time-based approach, where one tries to determine the granularity of the task by suggesting task duration. However, this can be difficult to implement, since estimating how long a task will take is one of the difficulties students face when executing data science projects [10].

Ensure a logical start and end - Tasks should have a natural and logical start and finish, which could be analogous to single entry and exit in software modules. Based on this approach, a logical task can be broken into subtasks. If there are more than 7 (+/- 2) sub-tasks per phase, as suggested by PMBOK, then that task should be broken into smaller tasks. One keeps breaking down tasks until one defines a small set of subtasks. One risk in this approach is that one might create too many small tasks.

Define accountability and responsibility - Every task should have a clear-cut person (or team) working on the task, and there should be a clearly defined owner of that task. In addition, every task should have clearly defined completion criteria. This approach helps to ensure clearly defined tasks that others can easily understand.

6. Conclusion

Task modularity within a data science context is a new area that has not previously been studied. To address our first research question, we note that an analysis of the students’ initial attempts at task modularity demonstrates that data science students do not naturally apply task modularity to their projects. To address our second research question, we note that in general, the students still had difficulty achieving task decomposition and task modularity, even after exposure to the Kanban methodology and task modularity coaching support by the instructors.

Some of the challenges in achieving task modularity were that it was difficult to divide
complex tasks, and many of the teams’ tasks were perceived to be complex. In addition, those complex tasks were difficult to size/scope, so they often were either very large or very small.

At a higher level, we note that while computer science/information system programs typically teach modularity (e.g., the decomposition process and abstraction along with topics such as patterns and components), to date, there does not seem to be a corresponding approach of how to teach modularity within a data science context. Our rules of thumb do not fully address this gap, and there remains a need to improve these potential best practices and, more importantly, to identify a model for teaching modularity to data science students that corresponds to the one used for computer science/information system students.

6.1. Potential Next Steps

First, we note that the evolution within the software development domain is perhaps analogous to the task decomposition challenges facing data science students and practitioners today, where a tool-focused approach might be applicable within a data science context. However, data science is typically viewed via a data flow construct, not an object-oriented approach. Hence, future work needs to investigate the applicability of using a tool-based approach for modular data science efforts, but with the acknowledgement of how data scientists typically work.

Specifically, related to exploring tools to support modularity, one could explore group coordination and decomposition tools that could be integrated with code modules. One such example of a group coordination tool is Trello (www.trello.com), which was used in this study and provides boards to make task decomposition visible. Future work could explore how such a team-based tool could be integrated within a code-based modular development environment, which could make task decomposition more focused and hence easier for data scientists and data science students.

6.2. Limitations

This mixed method study had several limitations, which additional research could address. For example, this effort leveraged graduate students. Junior data science professionals or undergraduate students might yield different results. Related to this, most of the students were information system students and it is possible that participants with a different background, especially more computer science focused students, via their significant object-oriented programming experience, might have a better approach to data science task modularity. Furthermore, the type of data science project might have impacted our results, and so, additional case studies could be done to identify if the type of project impacts a student’s ability to achieve task modularity.

7. References

[1] M. Das, R. Cui, D. R. Campbell, G. Agrawal, and R. Ramnath, Towards methods for systematic research on big data, in Big Data (Big Data), IEEE International Conference on, pp. 2072-2081, 2015.
[2] J. Saltz and I. Shamshurin, Big Data Team Process Methodology: A Literature Review and the Identification of Critical Factors for a Project’s Success, in Big Data (Big Data), IEEE International Conference on, 2016.
[3] C. Baldwin and K. Clark, Modularity in the Design of Complex Engineering Systems, Working Paper, Harvard Business School, Boston, MA, 2004.
[4] C. Mattmann, Computing: A vision for data science. Nature, 493(7433), 473-475, 2013.
[5] Y. Yeo and J. Hahn, The Role of Project Modularity in Information Systems Development, 35th International Conference on Information Systems, 2014.
[6] A. Salonen, R. Rajala and A. Virtanen, Leveraging the benefits of modularity in the provision of integrated solutions: A strategic learning perspective. Industrial Marketing Management, 2017.
[7] D. Sturtevant, Modular Architectures Make You Agile in the Long Run. IEEE Software, (1), 104-108, 2017.
[8] M. Schilling, Toward a general modular systems theory and its application to interfirm product modularity. Academy of Management Review, 25(2), 312-334, 2000.
[9] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, available at https://www.R-project.org/, 2016.
[10] J. Saltz, I. Shamshurin and K. Crowston, Comparing Data Science Project Management Methodologies via a Controlled Experiment, in Hawaii International Conference on System Sciences (HICSS), 2017.
[11] M. Vanauer, C. Bohle and B. Hellingrath, Guiding the Introduction of Big Data in Organizations: A Methodology with Business- and Data-Driven Ideation and Enterprise Architecture Management-Based Implementation, in Hawaii International Conference on System Sciences (HICSS), 2015.
[12] A. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. Elmore, S. Madden and A. Parameswaran, DataHub: Collaborative Data Science & Dataset Version Management at Scale, Biennial Conference on Innovative Data Systems Research (CIDR), 2015.
[13] J. Gao, A. Koronios and S. Selle, Towards A Process View on Critical Success Factors in Big Data
Analytics Projects, 21st Americas' Conference on Information Systems, 2015.
[14] N. Grady, M. Underwood, A. Roy and W. Chang, Big Data: Challenges, practices and technologies: NIST Big Data Public Working Group workshop at IEEE Big Data, in Big Data (Big Data), IEEE International Conference on, pp. 11-15: IEEE, 2014.
[15] N. Chandler and T. Oestreich, Use analytic business processes to drive business performance, ed: Gartner, 2015.
[16] H. Chen, R. Kazman and F. Matthes, Demystifying big data adoption: Beyond IT fashion and relative advantage. In Proc. DIGIT, 2015.
[17] J. Espinosa and F. Armour, The Big Data Analytics Gold Rush: A Research Framework for Coordination and Governance, in 49th Hawaii International Conference on System Sciences (HICSS), pp. 1112-1121: IEEE, 2016.
[18] H. Chen, R. Kazman and S. Haziyev, Agile Big Data Analytics for Web-based Systems: An Architecture-centric Approach, IEEE Transactions on Big Data, 2016.
[19] V. Dhar, Data science and prediction, Communications of the ACM, vol. 56, no. 12, pp. 64-73, 2013.
[20] J. Saltz, The need for new processes, methodologies and tools to support big data teams and improve big data project effectiveness, in Big Data (Big Data), IEEE International Conference on, 2015.
[21] C.Y. Baldwin and K.B. Clark, The Value and Costs of Modularity, Working Paper, Harvard Business School, Boston, MA, 2001.
[22] S. Brusoni, L. Marengo, A. Prencipe and M. Valente, The Value and Costs of Modularity: A Problem-Solving Perspective, European Management Review (4:2), pp. 121-132, 2007.
[23] R.N. Langlois, Modularity in Technology and Organization, Journal of Economic Behavior & Organization (49:1), pp. 19-37, 2002.
[24] C.Y. Baldwin and K.B. Clark, Managing in an age of modularity. Harvard Business Review, 84-93, September-October, 1997.
[25] PMBOK Guide: A Guide to the Project Management Body of Knowledge, Project Management Institute (PMI), Pennsylvania, 2004.
[26] Work Breakdown Structure, in Wikipedia, 2018, DOI: en.wikipedia.org/wiki/Work_breakdown_structure
[27] F. P. Brooks, Mythical Man-Month. Datamation, 20(12), 44-52, 1974.
[28] D. Parnas, On the Criteria to Be Used in Decomposing Systems Into Modules, Communications of the ACM, 15, 1053-1058, 1972.
[29] C. Szyperski, Independently Extensible Systems - Software Engineering Potential and Challenges, Australian Computer Science Communications, 18, 203-212, 1996.
[30] J. Saltz, R. Heckman and I. Shamshurin, Exploring How Different Project Management Methodologies Impact Data Science Students, In Proceedings of the 25th European Conference on Information Systems (ECIS), 2017.
[31] J. Saltz and R. Heckman, Big Data Science Education: A case study of a Project-Focused Introductory Course, Themes in Science and Technology Education, Special issue on Big Data in Education, 8(2), 2016.
[32] B. Ramamurthy, A Practical and Sustainable Model for Learning and Teaching Data Science. Proceedings of the 47th ACM Technical Symposium on Computing Science Education: ACM, 169-174, 2016.
[33] P. Anderson, J. Bowring, R. McCauley, G. Pothering, and C. Starr, An undergraduate degree in data science: curriculum and a decade of implementation experience. In Proceedings of the 45th ACM technical symposium on Computer science education (pp. 145-150). ACM, 2014.
[34] Y. Gil, Teaching parallelism without programming: a data science curriculum for non-CS students. Proceedings of the Workshop on Education for High-Performance Computing: IEEE Press, 42-48, 2014.
[35] R. Brunner and E. Kim, Teaching Data Science, Procedia Computer Science, 80, pp. 1947-1956, 2016.
[36] J. Saltz and I. Shamshurin, Does Pair Programming work in a Data Science Context: An Initial Case Study, in Big Data (Big Data), 2017 IEEE International Conference on, 2017.
[37] M. Mellody, Training Students to Extract Value From Big Data, Summary of a Workshop, The National Academies Press, Washington DC, 2014.
[38] M. Ahmad, J. Markkula and M. Oivo, Kanban in software development: A systematic literature review, Software Engineering and Advanced Applications (SEAA), 39th EUROMICRO Conference on, pp. 9-16: IEEE, 2013.
[39] D.J. Anderson, Kanban: Successful Evolutionary Change for Your Technology Business. Blue Hole Press, 2010.
[40] M. O. Ahmad, J. Markkula, and M. Oivo, "Kanban in software development: A systematic literature review," in Software Engineering and Advanced Applications (SEAA), 39th EUROMICRO Conference on, pp. 9-16: IEEE, 2013.
[41] M. Ikonen, E. Pirinen, F. Fagerholm, P. Kettunen and P. Abrahamsson, On the impact of Kanban on software project work: An empirical case study investigation. In Engineering of Complex Computer Systems (ICECCS), 16th IEEE International Conference on (pp. 305-314). IEEE, 2011.
[42] A. Neyem, J. Diaz-Mosquera, J. Munoz-Gama and J. Navon, Understanding Student Interactions in Capstone Courses to Improve Learning Experiences. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, 2017.
[43] H. Lei, F. Ganjeizadeh, P.K. Jayachandran and P. Ozcan, A statistical analysis of the effects of Scrum and Kanban on software development projects. Robotics and Computer-Integrated Manufacturing, 43, 59-67, 2017.
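
As a supplementary illustration, the six task decomposition practices listed in the Discussion could be expressed as a simple checklist applied to each card on a task board. The sketch below is only one possible reading of those practices, not a tool used in the study. The TaskCard structure, the check_task function and all numeric thresholds are hypothetical; the paper gives no numeric limits beyond the 7 (+/- 2) sub-task guideline, and the sketch does not check whether a title starts with a verb.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskCard:
    """One card on a task board (hypothetical structure, not Trello's API)."""
    title: str
    goal: str = ""
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    owner: str = ""
    done_criteria: str = ""
    est_days: float = 0.0
    subtasks: List["TaskCard"] = field(default_factory=list)


# Illustrative thresholds (assumptions, not values reported in the paper).
MAX_TITLE_WORDS = 8
MIN_DAYS, MAX_DAYS = 0.5, 10.0
MAX_SUBTASKS = 9  # upper end of the 7 (+/- 2) guideline


def check_task(task: TaskCard) -> List[str]:
    """Return one warning for each practice the card appears to miss."""
    warnings = []
    if not task.title.strip() or len(task.title.split()) > MAX_TITLE_WORDS:
        warnings.append("title: missing or too long")
    if not task.goal.strip():
        warnings.append("goal: not stated")
    if not task.inputs or not task.outputs:
        warnings.append("inputs/outputs: not fully defined")
    if not (MIN_DAYS <= task.est_days <= MAX_DAYS):
        warnings.append("duration: outside the assumed range")
    if len(task.subtasks) > MAX_SUBTASKS:
        warnings.append("subtasks: more than 7 (+/- 2), split the task further")
    if not task.owner.strip() or not task.done_criteria.strip():
        warnings.append("accountability: owner or completion criteria missing")
    return warnings


# A card that follows the practices, and a vague one similar to the
# single-word "modeling" tasks that students struggled to break down.
good = TaskCard(
    title="Clean survey responses",
    goal="Produce an analysis-ready table for the modeling tasks",
    inputs=["raw_survey.csv"],
    outputs=["clean_survey.csv"],
    owner="Team member A",
    done_criteria="No missing values in key columns",
    est_days=3,
)
vague = TaskCard(title="Modeling")

print(check_task(good))
print(check_task(vague))
```

A checklist of this kind cannot judge whether a task is sized well for a particular project; it only flags cards where one of the practices has clearly been skipped, which might prompt a team discussion about how to split or clarify the task.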
