Software Development Process Mining: Discovery, Conformance Checking and Enhancement
João Caldeira
Fernando Brito e Abreu
Instituto Universitário de Lisboa (ISCTE-IUL)
ISTAR-IUL
Lisboa, Portugal
{jcppc, fba}@iscte-iul.pt
Abstract—Software development has become a fundamental process in any business or organization. As a consequence, together with other emergent technologies, new development platforms (IDEs) are being created, mainly in the cloud (e.g., Eclipse Orion, Cloud9, Codio), requiring different approaches to the way software development can be studied. Empirical studies on software development are most often based on data taken from software configuration management repositories, source code management systems and issue tracking tools, but not from the IDEs themselves, because they do not publicly record data regarding developers’ activities. We aim to bring forward new insights on the software development process by analyzing how developers use their IDE. Based upon process mining techniques such as process discovery and conformance checking, this missing perspective will hopefully allow the discovery of coding patterns, the search for programmer behaviors and the detection of deviations from prescribed processes. Finally, we expect to provide advice for individual software process enhancement.

Keywords—software development, software process, process mining, pattern discovery, software engineering

I. INTRODUCTION

As a result of the direct or indirect dependence on software in our daily lives, software development has become a fundamental process in any business or organization. Consequently, it is vital to carefully study, understand and improve such a process [1]. However, the development process is often not formalized (i.e. a model is not available), especially in agile methodologies such as Scrum, Extreme Programming (XP), Feature-driven development (FDD), Agile Unified Process (AUP) or Open Unified Process (OpenUP).

While agile approaches give a wide berth for a development team to define its own development process, a set of stereotypical activities should be in place. For instance, in XP, developers are expected to record user stories, write tests before coding, program in pairs, use refactoring techniques, do check-ins frequently (continuous integration), adopt a coding standard, own the code collectively (i.e. anyone can improve any part of the code at any time) and not work beyond 40 hours a week [2]. The efficiency and effectiveness of agile approaches rely on the adoption of those activities, but due to the invisibility of software development activities [3], it is not straightforward to check whether members of a development team are actually performing them.

Regarding the quality of the software development process, several approaches have been proposed to improve and assess it, either at the organizational level (e.g. CMMI - Capability Maturity Model Integration [4]), project level (e.g. Team Software Process [5]) or individual level (e.g. Personal Software Process [6]). Those approaches were conceived with non-agile software development processes in mind and rely heavily on collecting evidence of current practice. This frequently implies a considerable overhead of manual data collection regarding software process activities. Besides this intrusiveness, collected data is regularly adulterated due to the well-known Hawthorne Effect [7], which prevents reaching valid conclusions. It is therefore not surprising that little progress has been made on researching the quality of agile software development [8].

Meanwhile, the increasing role of the open-source movement has allowed a considerable increase in the availability of free software engineering tools, namely in Eclipse, the dominant IDE for its most widespread programming language, Java. That availability has led to a progressive use of a multitude of tools during a software development project, far beyond the traditional ones that dominated the first decades of computer programming, namely the code editor, compiler, linker and configuration management system (CMS).

Nowadays, developers use modeling tools, code generators, code recommendation tools, code smell detection tools, refactoring tools, metrics collection tools, software structure visualization tools, test generators and test coverage tools, to name just a few. Those tools are becoming increasingly intertwined within the IDE, and the latter is progressively integrated with cloud-based services that allow cooperative work (e.g. GitHub, SourceForge) and provide services such as a CMS, an issue tracking system, project documentation and a wiki.

A recent survey mentions that around 2/3 of the data sources used for data analysis on software development projects come from CMS (e.g. CVS, SVN, Git or Mercurial) [9]. These systems maintain a history of changes in all files, and track the authors responsible for the modifications. CMS data include information such as commit messages, commit time, commit author, and commit type (added, modified, or deleted). However, in agile software development initiatives, using a multitude of techniques and tools as aforementioned, it is not possible to characterize and understand the process by
Authorized licensed use limited to: Brno University of Technology. Downloaded on August 17,2024 at 19:15:08 UTC from IEEE Xplore. Restrictions apply.
A. Process Discovery in Software Development

Software development is usually performed by small teams, in short iterations. According to the famous Manifesto for Agile Software Development, agile teams are supposed to “value individuals and interactions over processes and tools” [10]. We believe this kind of strong claim was stated to contrast with highly structured/organized software development methodologies. By then, the most representative surrogate of the latter was the Rational Unified Process (RUP), which prescribed a proprietary set of practices organized in processes, and tools to support the same practices [11]. Agile approaches, like the aforementioned ones, also prescribe a set of agile practices. However, the way they are applied is left for each team to decide. Many of those practices (e.g. refactoring, regression testing) are unlikely to be applied without tools nowadays, so the cited manifesto claim became, at least partially, an anachronism. Nevertheless, the process still remains mostly hidden from outsiders. Personnel turnover in agile teams becomes a hindrance, and things get worse if teams are geographically distributed, due to the tacit knowledge problem [12].

We claim that it is possible to discover the process of an agile team using a given IDE where a set of development tools are integrated, by mining the plethora of events generated by those tools. Our objective is to make that process explicit, using an appropriate modeling language (e.g. Petri Nets or BPMN), either for a single developer or for a development team as a whole, evidencing the role of each team member. As an example, we may identify the activities a developer executes when using the IDE, such as: open projects, add contents, remove contents, refactor, and save a project, among others.

The process discovery problem raises the following hypotheses:
[H01a] Software development processes discovered from events recorded from development tools used are insufficiently detailed and accurate to be useful for software engineering purposes.
[H01b] It is not possible to provide feedback to agile developers in real time regarding the process being executed, namely by being able to differentiate the role of each team member.

B. Conformance Checking in Software Development

In the last few years, agile development methodologies became mainstream in software development organizations. Meanwhile, the development process has evolved from an individual task to a more collaborative one. This is supported by new tools delivered by public or private IDEs and has introduced new challenges in validating the adherence of individuals to the agile methodologies supposedly adopted by each organization or department. Conformance validation can be used to check process rules and improve processes within any organization [13]. If each developer in a team is left alone, without a perception of his/her alignment with the expected process, ultimately it may lead to a lack of quality in the delivered products.

The process conformance checking problem raises the following hypotheses:
[H02a] It is not possible to identify divergences between what agile developers do in practice and what they were supposed to do (as prescribed in a process model) by mining event logs generated by IDE usage.
[H02b] There is no significant variability in the roles performed within a development team in agile approaches.

C. Process Enhancement in Software Development

Software development is a socio-technical activity [1]. Every project has its own needs in terms of requirements, technologies and human resources that should be allocated to each task. Successful software development projects not only require people with the right programming skills, but also with the right behavior. Productivity is derived from both: skills and behavior. A major difficulty in identifying the best human resources for a specific project is caused by the fact that we often have no clue on how programmers behave individually or in groups, while developing software. That behavioral information provides a new perspective that may contribute to improving software engineering project management and/or the individual adoption of best practices. Moreover, getting information on the installed plugins, their co-occurrence and their usage patterns may provide interesting insights that can be used to generate IDE configuration advice.

Difficulties in researching the improvement of software development processes and tools have been reported by industry [14] [15] and academia [16]. These difficulties have in common a lack of methods to mine the usage of processes and artifacts. Indeed, a limited awareness of actual processes and tools usage hampers our ability to improve them. Accordingly, the main questions we can put forward are: Can we extract relevant information from the IDE, in order to make it adaptable to the developer's profile? Can we really use it to improve the overall process?

The process enhancement problem raises the following hypotheses:
[H03a] It is not possible to provide feedback to the agile developer, regarding the quality of the process he/she is executing.
[H03b] It is not possible to automatically adapt the IDE to improve the developer’s performance, based on the analysis of the IDE event logs.

D. Methodology

This research endeavor will apply a combination of two methodologies: the Scientific Method (SM) and Design Science Research (DSR). The latter will be used in the conception and operationalization of the research instrument that will be described in section IV. This instrument will allow collecting data that will be used for assessing our research hypotheses, as prescribed by the SM.

While delving into the research problems that were previously identified, we formulated several research
hypotheses. The SM is a fundamental technique used by scientists to raise hypotheses and produce theories. A theory is a conceptual framework that explains existing facts or predicts new facts. It assumes that scientific knowledge is predictive and that cause and effect relationships exist. Knowledge in an area is expressed as a set of theories, and theories are raised upon non-refuted hypotheses. The SM progresses through a series of steps: (i) observe facts, (ii) formulate hypotheses, (iii) design and execute the experiment (which implies the availability of collection instruments and subjects from which data can be collected), (iv) analyze data and interpret the results, (v) raise a theory, and (vi) disseminate results for peer validation.

A shortage of empirical studies in the information technology domain, and more specifically in software development, has been reported in the literature [1][3][9]. This is repeatedly related to the lack of consistent methods in data gathering or the lack of real-life use cases. As a result, the data analysis is not trustworthy, and the findings of those studies cannot easily be validated within the research community. We expect that our automated data collection approach, to be described in section IV, will mitigate this problem. As such, we will be able to test the aforementioned set of hypotheses that emerged from our research problems.

DSR has its roots in engineering and is appropriate when developing new technologies for solving problems, such as the ones described herein. DSR helps in gaining problem understanding, identifying systemically appropriate solutions, and effectively evaluating new and innovative solutions. The DSR methodology prescribes several activities that are being adapted to our context: (i) problem identification and motivation, (ii) definition of the objectives for a solution, (iii) design and development, (iv) demonstration, (v) evaluation, and (vi) communication / dissemination.

Both SM and DSR approaches encompass the publication of results for peer scrutiny. We will privilege narrow-scope conferences and journals with a high rank, to better focus on our research concerns, and also because these are the forums where the best researchers of the relevant community are expected to present their works. Submitting to those forums will enable us to maximize the quality of the feedback received on our work, even in rejection situations.

IV. CURRENT WORK

To foster a shared understanding of what the current process really is, we will use model discovery techniques. They allow reverse engineering the software process model by mining event logs taken from real software development activities. Those activities are expected to characterize the underlying process model, since a given execution flow (a sequence of consecutive or parallelized executed activities), from start to end, corresponds to what we call a “process instance”.

We will use conformance checking techniques for diagnosing whether actual software development activities (again captured as event logs) are following a given process model. Our objective here will be to provide each developer with an “agile dashboard” where the current adoption of the agile activities will be gauged against best practices. The latter can be based on values taken from the best results obtained with the team or the organization.

Finally, process enhancement aims at improving an existing software process model with information extracted from actual software process instances, once again captured as logs of events raised during the various activities of the software development process. We expect to identify and highlight the most frequent activity paths in development, highlight resources, such as people, systems and roles, and how they are related, and potentially predict process time, discover bottlenecks, monitor resource utilization and measure service levels. From a socio-technical perspective, we expect to identify bad practices and good practices amongst the developers and profile them using clustering analysis or other classification techniques. Another issue that we expect to address with our software process mining based approach is identifying the friction factors that have a negative impact on the software development pace.

In this research work we expect to adopt a holistic approach where events generated by all tools used will be considered. A standardized format represented by XES [10] events will be used for the sake of interoperability, namely with process mining tools.

The number of events to be generated by a development team within a project iteration (e.g. a Scrum sprint) can be very large. We will therefore assess whether a big data open-source platform (e.g., Cassandra⁵ or HBase⁶) will be required for storing and processing the collected data in our research.

⁵ cassandra.apache.org
⁶ hbase.apache.org

The aforementioned approach is expected to scaffold exploratory activities on top of the collected data, allowing the community to do benchmarking, evaluate software engineering best practices and assess software engineering research topics like the ones we have previously identified, by means of structured empirical studies [17]. Since we will be dealing with large amounts of data with different formats, coming from different sources, the techniques and tools to perform software analysis must be aligned with the challenges imposed by those scenarios. As pointed out earlier by [21], big data technologies not only deal with the data challenges mentioned above, but also help in leveraging visualization capabilities to foster qualitative perception and reasoning. We firmly believe that a combination of process mining techniques using machine learning algorithms supported by big data technologies would be the best approach to tackle the identified research problems [18] [19] [20].

A. Eclipse Plugin Development and Initial Process Mining

A standard and automated process to collect IDE data can reduce the effort needed to validate results and sustain conclusions. We plan to leverage the events logged by the Eclipse IDE and potentially take one step further and try to port the same principles to a cloud IDE. Apart from the plugin to collect the events from the IDE functions, the fundamental activity is to mine the data by using one type of process mining: process discovery.
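As an illustration of the discovery step named above, the following is a minimal sketch, not the plugin's actual implementation. It assumes that each IDE event carries a case identifier (here a hypothetical IDE session), an activity name and a timestamp; the sample events, the session identifiers and the function names are all invented for illustration. It derives a directly-follows graph, one of the simplest outputs of process discovery, and then flags the steps of a trace that were never observed in the log.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical IDE events: (case id, activity, ISO timestamp).
# Using one IDE session as the case id is an assumption of this sketch.
EVENTS = [
    ("session-1", "open project", "2016-01-11T09:00:00"),
    ("session-1", "add contents", "2016-01-11T09:05:00"),
    ("session-1", "refactoring", "2016-01-11T09:20:00"),
    ("session-1", "save project", "2016-01-11T09:21:00"),
    ("session-2", "open project", "2016-01-11T10:00:00"),
    ("session-2", "add contents", "2016-01-11T10:02:00"),
    ("session-2", "remove contents", "2016-01-11T10:10:00"),
    ("session-2", "save project", "2016-01-11T10:12:00"),
]


def build_traces(events):
    """Group events by case and order each case by timestamp."""
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((datetime.fromisoformat(timestamp), activity))
    return {case: [act for _, act in sorted(rows)] for case, rows in by_case.items()}


def directly_follows(traces):
    """Count how often activity b directly follows activity a in any trace."""
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts


def deviations(trace, allowed):
    """Return the consecutive steps of a trace that the graph never observed."""
    return [pair for pair in zip(trace, trace[1:]) if pair not in allowed]


traces = build_traces(EVENTS)
dfg = directly_follows(traces)

for (source, target), count in sorted(dfg.items()):
    print(f"{source} -> {target}: {count}")
```

On this sample log, the transition from "open project" to "add contents" is counted twice, and `deviations(["open project", "save project"], dfg)` reports that single unseen step. A real study would replace the directly-follows graph with a richer model (e.g. a Petri net) produced by a dedicated process mining tool.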
We have built a preliminary version of an Eclipse events capture plugin that sends those events, wrapped as JSON objects, across a microservices architecture to a cloud server. The latter will convert the events to the XES format [10] and will apply process mining and machine learning algorithms. We are currently performing some beta tests to find out the adequate granularity level of the relevant events and filter out the remaining ones. Some preliminary validation experiments are being prepared within programming classes of several software development courses at two public universities in the Lisbon area (ISCTE-IUL⁷ and UNL⁸).

Once our experimental setup is validated, we plan to make the event-capture plugin available in the Eclipse Marketplace⁹ for public download. Users of this plugin will be offered a private dashboard on our cloud server, where they will be able to observe their profile and historical data on their own development process. Furthermore, they will get IDE configuration advice (e.g. suggestions on plugins used by peers with a similar profile). This approach will hopefully foster widespread participation and will allow us to conduct several wide-scale experiments on software development process mining.

⁷ www.iscte-iul.pt/en
⁸ www.unl.pt/en
⁹ marketplace.eclipse.org

B. Process Adherence Conformance Checking

To understand developers' adherence to some agile methodologies, such as Scrum and others, we intend to use conformance checking techniques [23]. The latter cover different perspectives such as: i) the control-flow perspective, responsible for analyzing the order of activities, ii) the organizational perspective, which focuses on highlighting resources, such as people, systems and roles, and how they are related, iii) the case perspective, characterized by the actors working on a process or its own path, and iv) the time perspective, concerned with the frequency of events and their timing, allowing us to predict remaining process time, discover bottlenecks, monitor resource utilization and measure service levels.

C. Development Process and Tools Improvement

Process enhancement and mainly software development improvement are some of the most active topics in the research community and software industry [10]. Using mainly the organizational and case perspectives, we will improve our understanding of developers' behavior and expect to identify development process patterns, detect trends and perform predictions. We consider using clustering and other classification techniques to build developer profiles upon the two aforementioned dimensions: skills and behavior. Some profiles will be surrogates of good practices and others of bad practices. Although that profiling will mainly be for private gauging and improvement, like in the Personal Software Process approach [6], matching those profiles to the required roles in a development team may turn out to be a very useful tool in allocating the most adequate developers to satisfy specific project requirements. Having the right people doing the right tasks contributes to improving productivity and overall software quality.

V. CONCLUSION

Most empirical studies on software-related topics cover product issues. As for the ones targeting the process dimension, many open research problems have been identified [1]. We observed in the literature that there is a lack of understanding on how developers behave during the development process itself, while using their main workbench, the IDE. Current IDEs are indeed toolboxes that offer a plethora of facilities beyond the traditional edit-compile-run ones, such as debuggers, target runtime emulators, code generators, testbed tools, auditing, code visualization or lifecycle management tools. Those tools, offered as extensions to the IDE (aka plugins), generate events that can be trapped by the IDE. We are developing a cloud-based architecture that will analyze those events using process mining and classification techniques. By applying machine learning techniques on data collected from many developers, we expect to derive a set of profiles that will help characterize developer roles along several perspectives.

Our approach is aligned with recent European recommendations for future software engineering related studies [15]. We expect to corroborate those recommendations by falsifying the null hypotheses stated in section III. If so, this new research thread will bring a new analytics dimension in the software development process engineering domain.

REFERENCES

[1] A. Fuggetta, E. Di Nitto, and P. Milano, “Software Process,” Proc. Futur. Softw. Eng., pp. 1–12, 2014.
[2] K. Beck, Extreme Programming Explained: Embrace Change. 2004.
[3] F. P. J. Brooks, “No silver bullet-essence and accidents of software engineering,” Proc. IFIP Tenth World Comput. Conf., pp. 1069–1076, 1986.
[4] M. B. Chrissis, M. Konrad, and S. Shrum, CMMI for Development: Guidelines for Process Integration and Product Improvement. Pearson Education, 2011.
[5] W. S. Humphrey, Introduction to the Team Software Process(sm). Addison-Wesley Professional, 2000.
[6] W. S. Humphrey, Introduction to the Personal Software Process. Addison-Wesley Professional, 1997.
[7] J. G. Adair, “The Hawthorne effect: A reconsideration of the methodological artifact,” Journal of Applied Psychology, vol. 69, no. 2, pp. 334–345, May 1984.
[8] I. Dubielewicz, B. Hnatkowska, Z. Huzar, and L. Tuzinkiewicz, “Quality Assurance in Agile Software Development,” Adv. Appl. Model. Eng., pp. 155–176, 2014.
[9] R. L. Novais, A. Torres, T. S. Mendes, M. Mendonça, and N. Zazworka, “Software evolution visualization: A systematic mapping study,” Inf. Softw. Technol., vol. 55, no. 11, pp. 1860–1883, 2013.
[10] W. Van Der Aalst et al., “Process mining manifesto,” Lect. Notes Bus. Inf. Process., vol. 99 LNBIP, pp. 169–194, 2012.
[11] D. Teams, “Rational Unified Process Best Practices for Software,” Development, pp. 1–21, 2004.
[12] A. Chua and S. Pan, “Knowledge transfer and organizational learning in IS offshore sourcing,” Omega, vol. 36, no. 2, pp. 267–281, 2008.
[13] V. Rubin, I. Lomazova, and W. M. P. van der Aalst, “Agile development with software process mining,” Proc. 2014 Int. Conf. Softw. Syst.
Process - ICSSP 2014, pp. 70–74, 2014.
[14] N. W. Paper, “SOFTWARE Key Enabler for Innovation,” no. July, 2014.
[15] N. E. Software and S. Initiative, “Networked European Software and Services Initiative Complementary Recommendations for WP 2016 / 2017 on SOFTWARE ENGINEERING,” no. October, pp. 2014–2017, 2014.
[16] W. Poncin, A. Serebrenik, and M. Van Den Brand, “Process Mining Software Repositories,” 2011 15th Eur. Conf. Softw. Maint. Reengineering, pp. 5–14, 2011.
[17] D. Zhang and T. Xie, “Software analytics in practice: mini tutorial,” p. 997, Jun. 2012.
[18] W. Van Der Aalst, and S. Member, “Service Mining: Using Process Mining to Discover, Check, and Improve Service Behavior,” vol. 6, no. November, pp. 525–535, 2013.
[19] D. Zhang, Y. Dang, S. Han, and T. Xie, “Teaching and Training for Software Analytics,” in 2012 IEEE 25th Conference on Software Engineering Education and Training, 2012, pp. 92–92, 2012.
[20] M. Brhel, H. Meth, A. Maedche, and K. Werder, “Exploring principles of user-centered agile software development: A literature review,” Inf. Softw. Technol., vol. 61, pp. 163–181, 2015.
[21] R. M. Fontana, V. Meyer, S. Reinehr, and A. Malucelli, “Progressive Outcomes: A framework for maturing in agile software development,” J. Syst. Softw., vol. 102, pp. 88–108, 2015.
[22] G. Lee and W. Xia, “Toward agile: An integrated analysis of quantitative and qualitative field data on software development agility,” MIS Q., vol. 34, no. 1, pp. 87–114, 2010.
[23] R. Nayak and T. Qiu, “A Data Mining Application: Analysis of Problems Occurring During a Software Project Development Process,” Int. J. Softw. Eng. Knowl. Eng. IJSEKE, vol. 15, no. 4, pp. 647–663, 2005.
[24] Y. Simmhan, S. Aman, A. Kumbhare, R. Liu, S. Stevens, and Q. Zhou, “Cloud-based software platform for data-driven smart grid management,” Comput. Sci. Eng., vol. 15, no. 4, pp. 1–11, 2013.
[25] R. Bryant, R. Katz, and E. Lazowska, “Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society,” Comput. Res. Assoc., pp. 1–15, 2008.
[26] G. Liu, M. Zhang, and F. Yan, “Large-Scale Social Network Analysis Based on MapReduce,” in 2010 International Conference on Computational Aspects of Social Networks, 2010, pp. 487–490, 2010.
[27] N. Chen, S. C. H. Hoi, and X. Xiao, “Software process evaluation: a machine learning framework with application to defect management process,” Empir. Softw. Eng., vol. 19, no. 6, pp. 1531–1564, 2014.
[28] A. Rajaraman and J. D. Ullman, “Mining of Massive Datasets,” Lect. Notes Stanford CS345A Web Min., vol. 67, p. 328, 2011.