Modularity HICSS Final Afterreview
Jeffrey S. Saltz, Syracuse University
Robert Heckman, Syracuse University
Kevin Crowston, Syracuse University
Sangseok You, HEC Paris
Yatish Hedge, Syracuse University
RQ1: Do data science students naturally apply task modularity while working on a data science project?

RQ2: Does using a Kanban process methodology improve modular thinking in data science students and lead to improved task modularity?

To address our research questions, we report on a mixed method study that explores students using the Kanban process methodology and evaluates if the methodology impacts a student’s ability to think in a more modular manner.
The rest of the paper is structured as follows. In Section 2, we review modularity in project management, software engineering, as well as in a data science context. Section 3 describes the methodology for our study. Section 4 discusses the findings from our study. Finally, Section 5 presents a synthesis of our research results and Section 6 discusses limitations and possible next steps.

Vanauer, Bohle and Hellingrath [11] noted the lack of an empirically grounded big data science methodology. Hence, not surprisingly, it has been observed that most data projects are managed in an ad hoc fashion, that is, at a low level of process maturity [12]. Indeed, it has been argued that data science projects need to focus on people, process and technology [13,14] and not just on algorithms used by data scientists.
Thus, the need for more guidance is recognized with respect to how data scientists can best work together; for example, a Gartner Consulting report advocates for more careful management of the analysis processes, though a specific methodology is not identified [15]. Chen, Kazman & Matthes [16] studied 23 large enterprises and confirmed this gap. This gap is reinforced by Espinosa and Armour [17], who noted that the main challenge in a data science project is task coordination.

2.2 Task Modularity
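To make the notion of a modular task concrete, the following minimal Python sketch represents the cards on a Kanban-style board as records with an explicit goal, inputs, outputs, owner and completion criteria, in the spirit of the guidelines proposed in Section 5. It is an illustration only and was not part of the study: the TaskCard fields, the review_card and unresolved_inputs helpers, the duration bounds and the sample board are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskCard:
    """One card on a Kanban-style board (all field names are illustrative)."""
    title: str
    goal: str
    inputs: list[str]
    outputs: list[str]
    owner: str
    done_when: str
    est_days: float


def review_card(card: TaskCard, min_days: float = 0.5, max_days: float = 10.0) -> list[str]:
    """Return guideline warnings for one card.

    The day bounds and the eight-word title limit are assumed values,
    not thresholds taken from the paper.
    """
    warnings = []
    if len(card.title.split()) > 8:
        warnings.append("title is long; consider a shorter, verb-first title")
    if not card.goal.strip():
        warnings.append("no goal stated")
    if not card.inputs or not card.outputs:
        warnings.append("inputs and outputs should both be listed")
    if not card.owner.strip() or not card.done_when.strip():
        warnings.append("owner and completion criteria should be defined")
    if not (min_days <= card.est_days <= max_days):
        warnings.append("estimated duration falls outside the chosen range")
    return warnings


def unresolved_inputs(cards: list[TaskCard], raw_inputs: set[str]) -> dict[str, list[str]]:
    """Map each card title to inputs that no earlier card or raw source provides."""
    available = set(raw_inputs)
    missing = {}
    for card in cards:
        gaps = [name for name in card.inputs if name not in available]
        if gaps:
            missing[card.title] = gaps
        available.update(card.outputs)
    return missing


# Hypothetical board: two well-scoped cards and one vague, oversized card.
board = [
    TaskCard("Clean survey data", "Remove invalid rows so models see usable data",
             ["raw_survey.csv"], ["clean_survey.csv"], "Ana", "No missing ages remain", 2),
    TaskCard("Fit baseline model", "Provide a reference accuracy for later models",
             ["clean_survey.csv"], ["baseline_metrics.json"], "Ben", "Metrics file written", 3),
    TaskCard("Do modeling", "", ["features.csv"], [], "", "", 30),
]

for card in board:
    print(card.title, "->", review_card(card))
print(unresolved_inputs(board, raw_inputs={"raw_survey.csv"}))
```

Under these assumptions the first two cards pass every check, while the third is flagged for a missing goal, missing outputs, missing owner and completion criteria, and an overly long duration. Its input is also reported as unresolved, because no earlier card produces it. A check of this kind only covers the mechanical part of the guidelines; deciding where a complex task should be cut remains a judgment call.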
“Each team member could monitor the other persons work right to the very detail”

“We could break up the project into multiple tasks and tackle them one at a time”

4.2.2 Kanban Not Helping with Task Modularity. However, the analysis of student comments also showed that a number of students found it difficult to effectively modularize their project tasks, which was consistent with what was noted by the analysis of the actual Kanban boards (that were created within Trello). In reviewing student comments, 22% of the students articulated challenges in modularizing their work. These comments suggest that at least some students were willing to articulate the problems they experienced when trying to break down a complex project into workable chunks. Our analysis of the Kanban boards, described in the previous section, indicates that these problems were probably more widespread than even these comments indicate.
Specifically, in reviewing the student comments, we noted two key themes relating to the Kanban process not being helpful with respect to modularity.

Hard to divide complex tasks into chunks of appropriate size/scope – Knowing that it is useful to create subtasks is not the same as being able to create subtasks that are useful. This was a key challenge noted when evaluating the Kanban boards, and was also noted by several students, for example:

“Our team was unsure on how to break down the more complex tasks, which led us to having a team member focus on one modeling solution at a time in a silo, rather than having it broken out more incrementally”

“There were times when we tried to divide up a task into pieces just so that people could have a task to move but in the end, we would either complete the task together or individually before comparing results”

“Division of work looked fine on trello board, but implementation of the tasks on separate machines and then combining them was a bigger task”

Process was confusing – In addition, some students were confused with respect to how to use the process. This could have been due to the combination of working on a difficult project, working with a process methodology that was new to them and trying to create modular components, which was also a new concept. Additional education and training might address these issues, which were typically noted in very general terms, such as:

“[the process] actually made this process more confusing”

“[the process] was just a burden as we did not need that as opposed to meeting regularly as a team”

“Our group was also very small, the task assignment process is usually only helpful when the group is 10+”

5. Discussion

From our analysis of the project boards, it is clear that decomposing large and complex data science tasks into discrete, re-combinable subtasks was a challenge for most of the data science students. While some teams did improve from their initial efforts, even after gaining comfort in the use of the Kanban process, task modularity in most teams was not as
good as one would hope to see, since only 19% of the teams achieved good task modularity. Hence, the evidence from our study makes it clear that decomposing large and complex tasks into discrete, re-combinable subtasks was still a challenge for many data science teams.
However, it is also clear from our study that students did recognize the rationale for, and benefits of, effective decomposition of complex project tasks. This suggests that the challenge is not student motivation, but rather, that creating good modular sub-tasks is difficult and that the students need more than the training and coaching provided within the Kanban context. In other words, the fact that there were still many teams who were unable to decompose effectively suggests that opportunities remain in terms of how to best enable data science teams to effectively achieve task modularity.
As a first step towards the goal of helping students improve task modularity, we provide some potential best practices for task decomposition. Specifically, based on our observations, we developed six guidelines that might help students improve their ability to effectively decompose data science projects into workable sub-tasks. These approaches are often complementary in nature, in that they could be used in conjunction with each other to help in the process of task decomposition.
Note that these six practices could be explored, refined and elaborated upon in future research investigating data science task modularity.
Have a specific and concise task title - Since the task description, or title, is often how people refer to the task, it is important to have a well-defined task name. Titles should also be short and focused. It is helpful to start the title with a verb. The title is the first step to ensure everyone understands what will be done within the task.
Have a well-defined goal - Ensure that the task has a clearly defined goal. In other words, what is the purpose of the task/module, and why is it important to complete the task? How does this module help create actionable insight? It is also important that others can easily understand the goal of the task, which should be suggested by the task title, but elaborated as needed to ensure a consistent view across the team of the goal of the task.
Define task inputs and outputs - In addition to having a clearly defined goal, it is also important to clearly articulate what inputs the module needs (e.g., data attribute columns within a data file, which might have been generated as an output from a previous task, such as data cleaning). It is also important to clearly define the outputs that will be generated from the module, which might be cleaned data sets, visualizations or model output. By having clearly defined inputs and outputs, teams can achieve high cohesion and low coupling.
Ensure reasonable task duration - Care should be given to the duration of the task. A task that will take one month to complete will involve significant work, and will lessen the impact of a modular approach. Similarly, it is possible to define tasks with durations that are too short, leading to a focus on the trivial and excessive task management overhead. The study suggests that one of the hardest challenges for students was to understand how to best decide on the granularity of the task. In essence, the challenge is to ensure the task is “not too big, but also, not too small”. For example, one could try to use a time-based approach, where one tries to determine the granularity of the task by suggesting task duration. However, this can be difficult to implement, since estimating the time it takes to do a task is one of the difficulties students face when executing data science projects [10].
Ensure a logical start and end - Tasks should have a natural and logical start and finish, which could be analogous to single entry and exit in software modules. Based on this approach, a logical task can be broken into subtasks. If there are more than 7 (+/- 2) sub-tasks per phase, as suggested by PMBOK, then that task should be broken into smaller tasks. One keeps breaking down tasks until one defines a small set of subtasks. One risk in this approach is that one might create too many small tasks.
Define accountability and responsibility - Every task should have a clear-cut person (or team) working on the task, and there should be a clearly defined owner of that task. In addition, every task should have clearly defined completion criteria. This approach helps to ensure clearly defined tasks that others can easily understand.

6. Conclusion

Task modularity within a data science context is a new area that has not previously been studied. To address our first research question, we note that an analysis of the students’ initial attempts at task modularity demonstrates that data science students do not naturally apply task modularity to their projects. To address our second research question, we note that in general, the students still had difficulty achieving task decomposition and task modularity, even after exposure to the Kanban methodology and task modularity coaching support by the instructors.
Some of the challenges in achieving task modularity were that it was difficult to divide
complex tasks, and many of the team’s tasks were perceived to be complex. In addition, those complex tasks were difficult to size/scope, so they often were either very large or very small.
At a higher level, we note that while computer science/information system programs typically teach modularity (e.g., the decomposition process and abstraction along with topics such as patterns and components), to date, there does not seem to be a corresponding approach of how to teach modularity within a data science context. Our rules of thumb do not fully address this gap and there remains a need to improve these potential best practices and, more importantly, identify a corresponding model to that used for computer science / information system students, to teach modularity to data science students.

6.1. Potential Next Steps

First, we note that the evolution within the software development domain is perhaps analogous to the task decomposition challenges facing data science students and practitioners today, where a tool-focused approach might be applicable within a data science context. However, data science is typically viewed via a data flow construct, not an object-oriented approach. Hence, future work needs to investigate the applicability of using a tool-based approach for modular data science efforts, but with the acknowledgement of how data scientists typically work.
Specifically, related to exploring tools to support modularity, one could explore group coordination and decomposition tools that could be integrated with code modules. One such example of a group coordination tool is Trello (www.trello.com), which was used in this study and provides boards to make task decomposition visible. Future work could explore how such a team-based tool could be integrated within a code-based modular development environment, which could make task decomposition more focused and hence easier for data scientists and data science students.

6.2. Limitations

This mixed method study had several limitations, which additional research could address. For example, this effort leveraged graduate students. Junior data science professionals or undergraduate students might yield different results. Related to this, most of the students were information system students and it is possible participants with a different background, especially more computer science focused students, via their significant object-oriented programming experience, might have a better approach to data science task modularity. Furthermore, the type of data science project might have impacted our results, and so, additional case studies could be done to identify if the type of project impacts a student’s ability to achieve task modularity.

7. References

[1] M. Das, R. Cui, D. R. Campbell, G. Agrawal, and R. Ramnath, Towards methods for systematic research on big data, in Big Data (Big Data), IEEE International Conference on, pp. 2072-2081, 2015.
[2] J. Saltz and I. Shamshurin, Big Data Team Process Methodology: A Literature Review and the Identification of Critical Factors for a Project’s Success, in Big Data (Big Data), IEEE International Conference on, 2016.
[3] C. Baldwin and K. Clark, Modularity in the Design of Complex Engineering Systems, Working Paper, Harvard Business School, Boston, MA, 2004.
[4] C. Mattmann, Computing: A vision for data science. Nature, 493(7433), 473-475, 2013.
[5] Y. Yeo and J. Hahn, The Role of Project Modularity in Information Systems Development, 35th International Conference on Information Systems, 2014.
[6] A. Salonen, R. Rajala and A. Virtanen, Leveraging the benefits of modularity in the provision of integrated solutions: A strategic learning perspective. Industrial Marketing Management, 2017.
[7] D. Sturtevant, Modular Architectures Make You Agile in the Long Run. IEEE Software, (1), 104-108, 2017.
[8] M. Schilling, Toward a general modular systems theory and its application to interfirm product modularity. Academy of Management Review, 25(2), 312–334, 2000.
[9] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, available at https://www.R-project.org/, 2016.
[10] J. Saltz, I. Shamshurin and K. Crowston, Comparing Data Science Project Management Methodologies via a Controlled Experiment, in Hawaii International Conference on System Sciences (HICSS), 2017.
[11] M. Vanauer, C. Bohle and B. Hellingrath, Guiding the Introduction of Big Data in Organizations: A Methodology with Business-and Data-Driven Ideation and Enterprise Architecture Management-Based Implementation, in Hawaii International Conference on System Sciences (HICSS), 2015.
[12] A. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. Elmore, S. Madden and A. Parameswaran, DataHub: Collaborative Data Science & Dataset Version Management at Scale, Biennial Conference on Innovative Data Systems Research (CIDR), 2015.
[13] J. Gao, A. Koronios and S. Selle, Towards A Process View on Critical Success Factors in Big Data
Analytics Projects, 21st Americas' Conference on Information Systems, 2015.
[14] N. Grady, M. Underwood, A. Roy and W. Chang, Big Data: Challenges, practices and technologies: NIST Big Data Public Working Group workshop at IEEE Big Data, in Big Data (Big Data), IEEE International Conference on, pp. 11-15: IEEE, 2014.
[15] N. Chandler and T. Oestreich, Use analytic business processes to drive business performance, ed: Gartner, 2015.
[16] H. Chen, R. Kazman and F. Matthes, Demystifying big data adoption: Beyond IT fashion and relative advantage. In Proc. DIGIT, 2015.
[17] J. Espinosa and F. Armour, The Big Data Analytics Gold Rush: A Research Framework for Coordination and Governance, in 49th Hawaii International Conference on System Sciences (HICSS), pp. 1112-1121: IEEE, 2016.
[18] H. Chen, R. Kazman and S. Haziyev, Agile Big Data Analytics for Web-based Systems: An Architecture-centric Approach, IEEE Transactions on Big Data, 2016.
[19] V. Dhar, Data science and prediction, Communications of the ACM, vol. 56, no. 12, pp. 64-73, 2013.
[20] J. Saltz, The need for new processes, methodologies and tools to support big data teams and improve big data project effectiveness, in Big Data (Big Data), IEEE International Conference on, 2015.
[21] C.Y. Baldwin and K.B. Clark, The Value and Costs of Modularity, Working Paper, Harvard Business School, Boston, MA, 2001.
[22] S. Brusoni, L. Marengo, A. Prencipe and M. Valente, The Value and Costs of Modularity: A Problem-Solving Perspective, European Management Review (4:2), pp. 121-132, 2007.
[23] R.N. Langlois, Modularity in Technology and Organization, Journal of Economic Behavior & Organization (49:1), pp. 19-37, 2002.
[24] C.Y. Baldwin and K.B. Clark, Managing in an age of modularity. Harvard Business Review 84–93 September-October, 1997.
[25] PMBOK Guide: A Guide to the Project Management Body of Knowledge, Project Management Institute (PMI), Pennsylvania, 2004.
[26] Work Breakdown Structure, in Wikipedia, 2018, DOI: en.wikipedia.org/wiki/Work_breakdown_structure
[27] F. P. Brooks, Mythical Man-Month. Datamation, 20(12), 44-52, 1974.
[28] D. Parnas, On the Criteria to Be Used in Decomposing Systems Into Modules, Communications of the ACM, 15, 1053–1058, 1972.
[29] C. Szyperski, Independently Extensible Systems – Software Engineering Potential and Challenges, Australian Computer Science Communications, 18, 203–212, 1996.
[30] J. Saltz, R. Heckman and I. Shamshurin, Exploring How Different Project Management Methodologies Impact Data Science Students, In Proceedings of the 25th European Conference on Information Systems (ECIS), 2017.
[31] J. Saltz and R. Heckman, Big Data Science Education: A case study of a Project-Focused Introductory Course, Themes in Science and Technology Education, Special issue on Big Data in Education, 8(2), 2016.
[32] B. Ramamurthy, A Practical and Sustainable Model for Learning and Teaching Data Science. Proceedings of the 47th ACM Technical Symposium on Computing Science Education: ACM, 169-174, 2016.
[33] P. Anderson, J. Bowring, R. McCauley, G. Pothering, and C. Starr, An undergraduate degree in data science: curriculum and a decade of implementation experience. In Proceedings of the 45th ACM technical symposium on Computer science education (pp. 145-150). ACM, 2014.
[34] Y. Gil, Teaching parallelism without programming: a data science curriculum for non-CS students. Proceedings of the Workshop on Education for High-Performance Computing: IEEE Press, 42-48, 2014.
[35] R. Brunner and E. Kim, Teaching Data Science, Procedia Computer Science, 80, pp. 1947-1956, 2016.
[36] J. Saltz and I. Shamshurin, Does Pair Programming work in a Data Science Context: An Initial Case Study, in Big Data (Big Data), 2017 IEEE International Conference on, 2017.
[37] M. Mellody, Training Students to Extract Value From Big Data, Summary of a Workshop, The National Academies Press, Washington DC, 2014.
[38] M. Ahmad, J. Markkula and M. Oivo, Kanban in software development: A systematic literature review, Software Engineering and Advanced Applications (SEAA), 39th EUROMICRO Conference on, pp. 9-16: IEEE, 2013.
[39] D.J. Anderson, Kanban: Successful Evolutionary Change for Your Technology Business. Blue Hole Press, 2010.
[40] M. O. Ahmad, J. Markkula, and M. Oivo, "Kanban in software development: A systematic literature review," in Software Engineering and Advanced Applications (SEAA), 39th EUROMICRO Conference on, pp. 9-16: IEEE, 2013.
[41] M. Ikonen, E. Pirinen, F. Fagerholm, P. Kettunen and P. Abrahamsson, On the impact of Kanban on software project work: An empirical case study investigation. In Engineering of Complex Computer Systems (ICECCS), 16th IEEE International Conference on (pp. 305-314). IEEE, 2011.
[42] A. Neyem, J. Diaz-Mosquera, J. Munoz-Gama and J. Navon, Understanding Student Interactions in Capstone Courses to Improve Learning Experiences. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, 2017.
[43] H. Lei, F. Ganjeizadeh, P.K. Jayachandran and P. Ozcan, A statistical analysis of the effects of Scrum and Kanban on software development projects. Robotics and Computer-Integrated Manufacturing, 43, 59-67, 2017.