
Received 24 May 2023, accepted 14 June 2023, date of publication 12 July 2023, date of current version 19 July 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3294840

Requirements Engineering in Machine Learning Projects
ANA GJORGJEVIKJ , KOSTADIN MISHEV , LJUPCHO ANTOVSKI ,
AND DIMITAR TRAJANOV , (Member, IEEE)
Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, 1000 Skopje, North Macedonia
Corresponding author: Ana Gjorgjevikj ([email protected])

ABSTRACT Over the last decade, machine learning methods have revolutionized a large number of domains
and provided solutions to many problems that people could hardly solve in the past. The availability
of large amounts of data, powerful processing architectures, and easy-to-use software frameworks have
made machine learning a popular, readily available, and affordable option in many different domains and
contexts. However, the development and maintenance of production-level machine learning systems have
proven to be quite challenging, as these activities require an engineering approach and solid best practices.
Software engineering offers a mature development process and best practices for conventional software
systems, but some of them are not directly applicable to the new programming paradigm imposed by
machine learning. The same applies to the requirements engineering best practices. Therefore, this article
provides an overview of the requirements engineering challenges in the development of machine learning
systems that have been reported in the research literature, along with their proposed solutions. Furthermore,
it presents our approach to overcoming those challenges in the form of a case study. Through this mixed-
method study, the article tries to identify the necessary adjustments to (1) the best practices for conventional
requirements engineering and (2) the conventional understanding of certain types of requirements to better
fit the specifics of machine learning. Moreover, the article tries to emphasize the relevance of properly
conducted requirements engineering activities in addressing the complexity of machine learning systems,
as well as to motivate further discussion on the requirements engineering best practices in developing such
systems.

INDEX TERMS Machine learning, requirements engineering, software engineering, software requirements.

I. INTRODUCTION
Artificial intelligence (AI) and its sub-field machine learning (ML) have had significant research activity and commercial use for decades, but over the last decade, they have become significantly more popular and accessible to the wider community. To a large extent, that has happened as a result of the significant progress made in the ML sub-field known as deep learning (DL) [1], which relies on deep neural networks to learn meaningful representations from raw data and bypasses the need for manual feature engineering. The significant achievements that DL has made possible in many fields (e.g., [2], [3]) have been mainly a result of the improvements in the techniques used to train deep neural networks, the availability of larger datasets and more powerful computers, as well as the significantly reduced training time [4]. This progress has gradually made ML algorithms ubiquitous in many areas of our society and everyday activities.

ML methods introduce a different approach to software programming in which, instead of writing problem-solving instructions in software code, learning algorithms learn solutions to problems through data. This new approach generally consists of specifying a goal of the program behavior, e.g., by collecting relevant data, limiting the solution search space through a rough skeleton of code, and letting the learning algorithm find the best solution [5].

The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.
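The contrast described above (hand-written instructions versus a goal specified through data and a rough code skeleton) can be illustrated with a deliberately tiny sketch. The function names and toy data below are our own hypothetical illustration, not from the cited works; real systems search far larger solution spaces with learning algorithms rather than exhaustive threshold search:

```python
# Conventional programming: a human writes the problem-solving logic.
def is_spam_rule_based(message: str) -> bool:
    # Explicit, hand-authored rules.
    return "free money" in message.lower() or message.isupper()


# ML-style programming: only the goal (maximize accuracy on labeled data)
# and a rough solution skeleton (a threshold classifier) are specified;
# the "logic" (the threshold itself) is found by searching over the data.
def fit_threshold(examples):
    """examples: (suspicious_word_count, is_spam) pairs."""
    best_threshold, best_accuracy = 0, 0.0
    for t in range(10):  # the constrained solution search space
        accuracy = sum(
            (count >= t) == label for count, label in examples
        ) / len(examples)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold
```

For instance, `fit_threshold([(0, False), (1, False), (3, True), (5, True)])` yields a decision rule that was never written by hand, which is precisely why the requirements for such components depend on the available data as much as on stakeholder intent.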

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
72186 VOLUME 11, 2023

Although ML systems typically require a significant amount of conventional software code to support the ML models which are at their core [6], the new approach to software programming introduced by ML challenges the established software development process and best practices. The ML software development process is characterized by its data-centricity, non-linearity, and multiple feedback loops between stages, which can become even more complex in systems with multiple ML components that interact in complex ways [7]. Previous experience has shown that while developing and deploying ML systems can sometimes be a relatively fast and inexpensive process, maintaining such systems over time can be challenging and costly, mainly because ML systems are prone to accumulating hidden technical debt [6]. In addition to engineering challenges, AI systems introduce a new set of challenges related to predicting their exact behavior in different situations, predicting their effect on individuals or society, and ensuring their trustworthiness. Sometimes it can be challenging to predict the behavior and outcomes of AI systems precisely because of their complexity, susceptibility to imperfections of the data they learn from, the difficulty in interpreting the functional processes that generate their output, as well as any new behavior arising from their interactions with the world or changes in their environment [8]. In that context, the research literature has reported an example of bias in a commercial ML system that had been discovered only after the system had been released for use, and negative user experiences had been reported [9]. Developing an appropriate solution to a real problem through ML is a complex process that requires meticulous analysis of the system capabilities, behavior, risks, limitations, qualities, and intended/unintended use cases. It also requires analysis of the potential trade-offs between the stakeholders' (sometimes too high) expectations and their feasibility constrained by the available data and resources, between the aspiration for higher model accuracy (often leading to higher model complexity) and the compliance with quality, ethical, and legal constraints, and between the time spent on experimenting and the expected time to delivery of an initial value to the stakeholders, to name just a few.

The analysis of ML systems' feasibility, the formulation of their important quality, ethical, and legal attributes, their limitations, constraints, and risks, the decisions on the acceptable trade-offs, and the choice of system validation strategies in agreement with their stakeholders are all activities that belong to the requirements engineering (RE) stage in the conventional software development process. This leads to the conclusion that RE activities are as crucial to the ML development process as they are to the conventional one. However, when ML components replace conventionally programmed ones, the software requirements should correspond to the different development process and the ML specifics. Otherwise, the consequences of incorrectly engineered or missing requirements may be even greater in the case of ML systems, given the effects these systems may have on individuals, the control mechanisms they require, and the ethical or legal requirements they are subjected to. Nevertheless, not much is available on RE for ML systems, nor have the RE activities from the related domain processes, such as Cross-Industry Standard Process for Data Mining (CRISP-DM) and Knowledge Discovery in Databases (KDD), been detailed sufficiently by the RE and ML communities [10]. All of the above was a motivation for this article, which tries to answer the following research questions:
1) Are conventional RE activities relevant to the ML development process, what challenges does this process bring them, and what are their necessary adjustments to better fit into this process?
2) What types of requirements are particularly important in addressing the ML systems specifics, do these specifics affect the conventional understanding of the requirements, and how should this understanding be adjusted?
The article answers the research questions through a mixed-method study, i.e., (1) a review of previously published literature in the fields of ML and RE, and (2) a case study involving a research project of the authors of this article [11]. The mixed-method study was primarily motivated by the lack of practical examples of (1) RE activities in ML projects and (2) requirements specifications for ML systems reported in the literature. The case study gave us the opportunity to share the challenges we faced during the RE activities in a research project involving ML, our approach to dealing with those challenges, and excerpts from the requirements specification for the developed ML system.
The objectives of this article are the following:
1) Emphasizing the importance of RE activities in dealing with the complexity of ML systems.
2) Analyzing the aspects of conventional RE that need to be adjusted to the ML specifics.
3) Giving an overview and sharing our experiences on this relatively unexplored to date, but, in our opinion, important topic.
The rest of the article is organized as follows. First, an overview of the related work on RE for AI/ML systems is presented. Next, a description of the methodology used to identify relevant articles for the research questions is given. Two sections dedicated to answering the research questions follow. The article concludes with a discussion of the most important findings and a conclusion.

II. RELATED WORK
The number of research articles dedicated to RE for AI and ML systems is relatively small to date, as noted in [10] and [12] also. Furthermore, in their review Martínez-Fernández et al. [13] have identified only one article, i.e., [10], that covers the whole RE process. For these reasons, this section includes articles that are not entirely dedicated to RE for AI or ML systems but which mention this process as part of the broader software engineering process they analyze.
Belani et al. [14] discuss the RE challenges in the development of systems which the authors call AI-based complex


systems. The article presents the RE4AI taxonomy of challenges related to recognized elements of AI (data, model, and system), which are aligned with the typical RE activities. Vogelsang and Borg [10] present the findings from interviews with four data scientists on their experience with RE activities in the development of ML systems. Among other findings, the authors conclude that requirements engineers should be aware of the new quality requirements and integrate the ML specifics in the established RE process. Chuprina et al. [15] describe their ongoing work on an artifact-based RE method for data-centric systems, i.e., systems that include both AI and ML systems. The authors state that these systems require a new approach to RE and define a conceptual model of artifacts, contents, and relations that should guide the RE process. Heyn et al. [16] identify four challenging RE areas in the development of systems which the authors call AI-intense systems, i.e., systems that fundamentally depend on AI functionalities. The four areas include (1) defining requirements for the context in which the system would operate, (2) defining quality attributes and data requirements, (3) defining performance metrics and monitoring whether the system has the guaranteed behavior, and (4) gaining an understanding of the human factors that influence the user acceptance and trust.
Studer et al. [17] extend the CRISP-DM data mining process model to address the specifics of the ML development process. This new process model consists of six phases, i.e., (1) business and data understanding, (2) data preparation, (3) modeling, (4) evaluation, (5) deployment, and (6) monitoring and maintenance. The authors provide a description of the RE activities throughout the phases. While Kästner and Kang [12] describe a course on software engineering for systems which the authors call AI-enabled systems, they also mention software requirements as one of the stages in the software engineering life cycle. In that context, the authors emphasize the lack of specification for AI components, the importance of identifying and measuring quality requirements beyond model accuracy, the importance of defining safety and security requirements, as well as the importance of properly planned error handling. Zhang et al. [18] have surveyed 195 DL practitioners to identify software engineering challenges in DL application development. The authors present 13 findings that reveal the challenges in different development phases, and 7 improvement recommendations. Requirement analysis, integration testing, acceptance testing, and problem definition are identified as the most labor-consuming tasks throughout the process. Requirement analysis is recognized as a more difficult task in DL applications than in conventional ones. Kuwajima et al. [19] study the open problems in engineering safety-critical ML systems, particularly in terms of ML model/system requirements, design, and verification. The authors conclude that ML models are characterized by a lack of requirements specification, design specification, interpretability, and robustness. Through gap analysis of standard quality models and ML model characteristics, they conclude that the lack of requirements specification and robustness have the greatest impact on those models. Rahman et al. [20] present a project which uses ML in detection and correction of transaction errors. In terms of RE, the authors emphasize the need for iterative refinement of the requirements since they evolve frequently. They further emphasize the importance of properly conducted feasibility analysis of ML systems in relation to the available data, as well as the importance of identifying the data requirements before any large data acquisition. Wan et al. [21] present analyses of information obtained from 14 interviewees and 342 survey respondents from 26 countries. The authors' analyses reveal significant differences between the ML and non-ML software development process at different stages (e.g., requirements, design, and testing). Some of the differences related to RE include the need for preliminary experiments while collecting the requirements for ML systems, the greater uncertainty of the requirements, and the need to anticipate any potential performance degradation. Giray [22] presents an overview of research articles on software engineering for ML systems. In terms of RE, the author points to the challenges with proper management of customer expectations, with the requirements elicitation, analysis, and specification, with the new quality attributes, and with the new types of requirements such as data requirements. The author suggests that future research should focus on improving the alignment of performance metrics with business objectives, proper integration of the requirements for ML and non-ML components, risk assessment frameworks, and data privacy regulations. Martínez-Fernández et al. [13] provide a review of 248 articles on software engineering for AI systems, of which 17 are dedicated to software requirements. The authors conclude that many of the latter focus on quality attributes, several deal with specification approaches, and only one offers a holistic view of the RE process. They point to the software requirements as one of the underrepresented areas in the entire set of articles, with great potential for further research. In terms of quality, they emphasize that standards developed for conventional software systems should be updated. Serban and Visser [23] analyze software architectures that enable robust integration of ML components through a systematic literature review, interviews, and a survey. They identify RE challenges such as (1) the difficulty in understanding the project and estimating the effort in advance, (2) the difficulty in defining functional requirements for ML components, and (3) the potential regulatory restrictions. Pereira and Thomas [24] analyze the safety challenges in the development of ML-based cyber-physical systems. In terms of RE, the authors indicate that while high-level requirements can be defined explicitly, the low-level requirements are defined implicitly through the dataset, making the requirements traceability inapplicable. They suggest specifying requirements for data management, model development, model testing/verification, and model


deployment. The potential risks include incomplete data definition, incorrect loss function, wrong performance metrics, incompleteness of the testing process, and inadequacy of the safe operation values. Lorenzoni et al. [25] summarize the software engineering practices and challenges in developing ML models. Through their review of research articles, the authors have found an evident lack of techniques related to RE for ML models. Lwakatare et al. [26] present a taxonomy of engineering challenges related to commercial systems containing ML components. Through several case studies, the authors have identified five stages in the evolution of ML components, from an experimental stage to autonomous functioning. Some of the presented challenges, which, in our opinion, are related to RE, are those associated with the problem formulation and the desired outcome specification in the experimental stage, as well as with the failure to evaluate models with business-centric metrics in the noncritical deployment stage. Maass and Storey [27] analyze whether ML could benefit from conceptual modeling. Additionally, the authors outline specification languages useful in specifying various types of requirements for ML systems. Villamizar et al. [28] present a catalog of 45 concerns related to ML systems that should help requirements engineers in defining requirements for such systems. The concerns cover five perspectives, i.e., objectives, user experience, infrastructure, model, and data. In a second research article, Villamizar et al. [29] propose an approach for analysis and specification of the five perspectives of ML systems outlined in [28]. The authors provide a diagram of ML tasks and concerns, as well as a specification template. Pei et al. [30] review research articles published from 2016 to 2022 on RE-related collaboration challenges occurring between the different roles involved in ML development. The authors summarize the solutions proposed in the reviewed literature and give an example from the industry. Ahmad et al. [31] present a systematic mapping study of 43 primary studies on RE for AI. The authors analyze (1) the methodologies used in specifying requirements for AI-based software, (2) their limitations, (3) the evaluation method which the primary studies use, and (4) the application domains. The authors also provide recommendations for future research. Ahmad et al. [32] also analyze human-centered approaches in RE for AI software. Their (1) analysis of industry guidelines for AI software and (2) survey of industry practitioners have revealed the current practices and gaps. Jahic et al. [33] propose a textual domain-specific language that facilitates the specification of data requirements and necessary "recognition skills" the neural networks should acquire through their training. Through an example, the authors show the benefits of the proposed approach. Through a literature review, De Hond et al. [34] outline guidelines and quality criteria for development and evaluation of AI models for healthcare. The guidelines include many aspects relevant to RE, such as understanding the problem and its context, quality requirements, risk management planning, and similar.
Several research articles focus on non-functional requirements (NFRs) and quality-related aspects of AI/ML systems. Pons and Ozkaya [35] summarize the unique characteristics of several quality attributes of AI systems that are used by the public sector, i.e., security, privacy, data-centricity, sustainability, and explainability. Horkoff [36] outlines a set of challenges associated with NFRs for ML systems, as well as research directions to solve them. The author states that the current knowledge of NFRs should be at least partially rethought in the context of ML, because although many techniques related to NFRs for non-ML systems are still valid, some need adjustment or complete renewal. Kuwajima and Ishikawa [37] analyze the quality attributes relevant to AI systems. The authors try to identify what needs to be modified or added to quality standards for them to be adapted to the ML specifics and the Ethics Guidelines for Trustworthy AI from the European Commission [38]. Siebert et al. [39] present a process for constructing quality models for ML systems, describe the elements of the process, and present a use case from the industry. The authors conclude that some of the existing quality attributes relevant to conventional software systems should be redefined, and new ones relevant to ML systems should be added. Nakamichi et al. [40] propose a requirements-driven method for deriving quality attributes for ML systems. They extend conventional quality attributes with those relevant to ML systems and describe a method that allows deriving quality attributes and measurements dependent on ML systems' goals. Habibullah and Horkoff [41] present findings from interviews with ML industry practitioners regarding ML-relevant NFRs, their measurement, and challenges. The authors conclude that the NFRs for ML systems are neither well structured nor well documented, their measurement is challenging, and, although important, their consideration in ML systems is still at an early stage. In a journal article, Habibullah et al. [42] extend these findings by analyzing the importance of different NFRs, their associated challenges, and the different perception of NFRs that exists between practitioners from industry and academia. Habibullah et al. [43] present an exploratory study on the definitions of NFRs relevant to ML systems, their shared characteristics, and past research interest in each NFR. The authors conclude that the research interest in different NFRs differs significantly, and they manage to identify six clusters of NFRs sharing similar properties and purpose. Hu et al. [44] address reliability requirements for machine vision components by defining relevant image transformations, classes of reliability requirements, a method for instantiating requirements of each class of reliability requirements using human performance data, and, finally, a method to verify that components satisfy such requirements. The requirements are defined as a tolerated range of visual changes which should not affect the component behavior.
As mentioned at the beginning of this section, a small number of research articles cover the challenges imposed by ML specifics throughout the whole RE process, as it is


done in this article. Compared to [10], which analyzes RE challenges through interviews with data scientists, our article does so through a literature review and a case study. Several research articles [18], [21], [23], [25] cover the challenges associated with the various software engineering activities during ML software development. However, in our opinion, the challenges related to RE activities are not covered as extensively as in our article. Two research articles [13], [22] provide an in-depth overview of the challenges associated with the different stages of ML software development, including those related to RE. Compared to the referenced articles, our article (1) summarizes the challenges not only in terms of the conventional RE activities but also in terms of a variety of conventional types of requirements, (2) presents insights into the reasons for the relevance of those RE activities and types of requirements for ML systems, (3) provides a brief overview of the most relevant definitions and trade-offs for a set of ML-specific quality attributes, and (4) shares our experiences in dealing with those challenges in a real ML project, along with excerpts from its requirements specification. Therefore, in our opinion, it gives a broader overview of the topic. Our article also differs from [31] in the research questions it answers and the method it uses to answer them. Namely, our article (1) focuses particularly on the challenges introduced in the conventional RE process by the ML specifics and on the ways to address them, (2) systematizes them by conventional RE activities and a large set of conventional requirement types, (3) reviews research articles which may not be explicitly devoted to RE for ML but are implicitly related to a RE activity or requirement type (e.g., articles related to risks, limitations, success metrics, assumptions, constraints, and various quality attributes of ML systems), and, finally, (4) shares our practical experience in dealing with those challenges through a case study.

III. METHOD
This article answers the research questions through (1) a review of literature in the fields of RE and ML and (2) a case study. The article is organized according to the conventional RE activities and software requirements. Most sections begin with a short definition of the RE activity or software requirement to which they are dedicated, continue with a brief review of the ML domain literature relevant to the activity/requirement, and end with experiences from the case study. The following two sections describe the methodology and its limitations.

A. RESEARCH ARTICLES REVIEW
The conventional RE activities and software requirements were analyzed through well-known publications from the RE domain (e.g., [45]). The impact on the conventional RE activities and software requirements in ML projects was analyzed through a review of previously published research articles that were identified using Google Scholar (https://scholar.google.com/) through the search criteria given in Table 1, in the period of June-July 2021. The article search was repeated in April 2023 to find relevant articles published after the initial search. A description of the process follows.

TABLE 1. Search queries used in identifying relevant research articles.

The initial attempt to identify previously published research articles relevant to our research questions was based on search criteria 1 and 2 in Table 1. However, the query results mainly consisted of articles devoted to the use of ML methods to facilitate RE, which is irrelevant to this article. One of the reasons for such results could be the small number of research articles devoted to RE for ML at the time of searching. Another reason could be the use of inconsistent terminology for certain RE activities or software requirements, such as (1) the use of synonyms for the term "requirement," (2) the disagreement over the naming of certain RE activities, e.g., "requirements validation" over "requirements verification", further discussed in [45], or (3) the disagreement over the nature, terminology, and definition of the non-functional requirements, further discussed in [46]. A third reason could be the significant difference between the conventional and the ML software development process, leading to a potential terminological inconsistency of the second one with the first. The potentially relevant articles were initially selected based on their title and abstract, taking into account only journal, conference, and conference workshop articles, and preprints (available on arXiv, https://arxiv.org/), all written in English. These initially selected articles were then analyzed more thoroughly by us. The articles in the final selection were not necessarily dedicated to RE for ML or AI systems in their entirety but contained findings on the subject. Since the number of selected articles was again small, the references of, and the articles citing, those articles entirely dedicated to RE for ML or AI (e.g., [10], [36]) were analyzed in the same manner to identify other relevant articles. Finally, the selected articles were used to extract and synthesize the answers to the research questions. This process is illustrated in Figure 1.
The search criterion 3 in Table 1 allowed us to identify influential articles in specific sub-fields of ML. Although not explicitly dedicated to RE for ML, some of these articles

FIGURE 1. General flowchart of the literature review process (search criteria 1 and 2).

contain findings that, in our opinion, should be considered during RE in ML projects. These findings include new types of implementation, ethical, or trustworthiness risks, new types of success metrics, assumptions, limitations, and similar. Due to the broadness of the sub-fields this search criterion covers, the articles were selected based on our estimation of their usefulness in answering the research questions. A thorough review of these sub-fields is out of the article's scope, and therefore, throughout this article, we only briefly summarized the findings we considered important. This process is illustrated in Figure 2.
A widely accepted classification of quality attributes relevant to ML systems does not exist at the time of writing, although certain research articles address this challenge (e.g., [37]). Therefore, the research articles dedicated to ML-specific quality attributes were identified through search criterion 4 in Table 1, but this list of quality attributes should not be considered a complete one. Section V-D summarizes the findings we found relevant to RE from a selected set of articles dedicated to each quality attribute, regardless of their mentioning of RE-related terminology, since the RE literature indicates that elicitation, prioritization, and specification of quality requirements in a specific, measurable, attainable, relevant, and time-sensitive manner falls in the domain of RE [45]. More recent review articles with a large number of citing articles were prioritized in our selection process. Their references were used to find articles that provide definitions of and insights into the relevant quality attributes as well. Furthermore, for each quality attribute, a brief summary of some of its trade-offs with other quality attributes was compiled. This way, we tried to emphasize the importance of those quality attributes to ML systems, emphasize the consequences of giving them insufficient attention during the RE activities, and provide the reader with valuable references for further reading. This process is illustrated in Figure 2.
Finally, despite our efforts to identify and include in our review as many of the previously published research articles relevant to RE for ML systems as possible, due to the aforementioned challenges and the volume of articles in certain ML sub-fields (e.g., certain quality attributes), relevant articles may still be missing.

B. CASE STUDY
The object of our case study is an ML system, Academic Disciplines Detector (ADD), which detects concepts defined as academic disciplines by the community editing Wikipedia, based on textual excerpts from their Wikipedia articles and their similarity to the academic disciplines that are part of expert-created classification systems [11]. As an example of an integrative ML system, incorporating several custom-trained and third-party ML models in its core functionalities while attempting to solve a real-world challenge,

VOLUME 11, 2023 72191


A. Gjorgjevikj et al.: Requirements Engineering in Machine Learning Projects

we believe that ADD is a suitable object of our case study. Although the inclusion of a single case study may be considered a limitation, we believe that our experience can still be helpful in analyzing the RE challenges in ML projects.

FIGURE 2. General flowchart of the literature review process (search criteria 3 and 4).

IV. REQUIREMENTS ENGINEERING ACTIVITIES IN MACHINE LEARNING PROJECTS
Requirements engineering covers the activities related to (1) requirements development (requirements elicitation, analysis, specification, and validation) and (2) requirements management, which are inevitable activities in any project regardless of its approach to software development (e.g., waterfall or agile) because they give reassurance that the problem is properly understood and resolved [45]. This section analyzes the research questions related to the relevance of conventional RE activities to the ML development process, the challenges this process brings to the activities, and their necessary adjustments to better fit into this process.

A. REQUIREMENTS ELICITATION AND ANALYSIS
1) LITERATURE REVIEW
As with all other software systems, the success of production-level ML systems depends primarily on their fulfillment of specific business goals or end-user needs. Goodfellow et al. [4] indicate that the definition of goals and performance metrics, as a first step towards successful practical application of ML, should always be guided by the problem to be solved. In that sense, any software development project begins with activities that provide a proper understanding of the problem to be solved, the factors that have motivated the project, and the context in which the system would be used. In conventional RE, the requirements elicitation is the process through which the stakeholders’ needs and constraints are identified, and it is intertwined with the requirements analysis and requirements specification activities [45].

Identification of all relevant stakeholders is inevitable for a successful elicitation of the requirements. However, in ML systems, the requirements may depend not only on the stakeholders’ needs but on the available data as well. In that sense, data scientists assess the feasibility of the stakeholders’ requirements through analysis and experiments, so, as Vogelsang and Borg [10] indicate, they are important stakeholders to be consulted during the requirements elicitation. Certain stakeholders may have unrealistic expectations of the ML systems’ performance, adoption process, or functionality, so they should be helped in making their targets more reasonable, as well as in accepting the uncertainty of the time and cost estimates [13]. Stakeholders should be aware that despite its enormous potential, ML introduces nontrivial challenges to the software development process, which can sometimes make it a less suitable (e.g., in terms of interpretability) or a more expensive option (e.g., in terms of time/resources) than other available options. For example, while DL stands out in solving closed-end classification problems with sufficiently large training datasets and test datasets that closely resemble those from training, any deviation from these assumptions or misunderstanding of DL limitations can be a source of problems [47]. Supervised DL algorithms may require at least 10 million labeled examples to achieve or exceed human performance [4], which can hardly be obtained in certain domains. Furthermore, Martínez-Fernández et al. [13] bring attention to the applicability of research results in practice because sometimes they can oversimplify reality and be inapplicable in real conditions. In short, the decision to implement an ML-based solution to a problem should be based primarily on the outcome of the problem-specific analyses.

In addition to the stakeholders’ requirements related to the system functionality, it is essential to understand their requirements related to the system quality attributes. For example, these include their interpretability requirements, and when less interpretable classes of models are taken into consideration, their requirements for the system output explainability, as further elaborated in Sections V-B and V-D1. Furthermore, it is important to properly collect the stakeholders’ security, privacy, and safety requirements, as well as to identify potential sources of bias that may lead to a discriminatory outcome for a particular group of individuals. Therefore, legal experts are another important group of stakeholders to be consulted during requirements elicitation [10].
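The feasibility analyses mentioned above are often empirical. As an illustration only (not taken from the ADD project or the cited works), the following Python sketch shows one way a data scientist might probe, during elicitation, whether the amount of available labeled data plausibly supports a stakeholder’s performance target, by plotting a learning curve with scikit-learn; the dataset, model, and metric below are hypothetical stand-ins.

```python
# Hedged sketch: estimate whether more labeled data is likely to help,
# before committing to a stakeholder's performance target.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the project's labeled dataset (assumption:
# a binary text/feature classification task with ~2000 examples).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-validated F1 score at increasing fractions of the training data.
sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1",
)

for n, scores in zip(sizes, test_scores):
    print(f"{n:5d} training examples -> mean CV F1 = {scores.mean():.3f}")
```

A curve that has already flattened suggests that collecting more data alone will not close the gap to the target, which is useful evidence when helping stakeholders make their expectations more reasonable.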


D’Amour et al. [48] also highlight that real ML systems typically have behavioral requirements that go beyond generalization to an independent and identically distributed test dataset (e.g., requirements on interpretability, fairness). When those relevant requirements are not well specified and enforced on the ML pipeline, i.e., are ‘‘underspecified’’, many near-optimal solutions that fit such incomplete specification but behave differently in different dimensions (e.g., the previously mentioned interpretability or fairness) may exist and be selected over the desired one, which can be a cause of failure in applied ML [48]. In addition, deep neural networks are sometimes prone to learning undesirable ‘‘shortcut’’ solutions to problems, i.e., decision rules that perform well on independent and identically distributed test data but fail on out-of-distribution data (data that may be closer to real data) [49]. Avoiding such solutions, therefore, requires a thorough understanding of what makes a particular solution easy to be learned in a given context, the impact of the various factors throughout the ML pipeline, and their interactions [49].

2) CASE STUDY
The ADD project was motivated by the importance of the established disciplinary system to society and the challenges in tracking its changes over time. Our previous work [50], [51] had made us aware of these challenges, so we hypothesized that integrating different data sources into a data-driven methodology could be helpful in addressing them. Additionally, we identified a gap in the field related to the use of Wikipedia and the new ML breakthroughs. Neither a detailed study of Wikipedia’s potential in this field nor a study of the potential of those ML breakthroughs when applied to sufficiently large domain-specific datasets was available. Therefore, we hypothesized that if used appropriately, Wikipedia could provide large amounts of data to maximize the capabilities of ML algorithms in studying the disciplines, their relations, and evolution over time [11]. All of the above made ADD conceptually and methodologically different from the similar methodologies proposed in research articles (for more details, see [11]).

Due to the research nature of the ADD project, the requirements elicitation was mainly done through individual activities, e.g., analysis of available classification systems of academic disciplines, reading related literature, and analysis of Wikipedia’s policies. To better understand how to make ADD as useful as possible to the communities for which it was intended, the characteristics of its stakeholders were identified first. The stakeholders were classified into four classes: (1) team member, (2) research community member (Knowledge Organization (KO) and related fields), (3) research community member (ML and Natural Language Processing (NLP)), and (4) data consumer. Given the fact that we did not have direct representatives of some of the stakeholder classes, studying their characteristics through imaginary personas [45] helped us better define their (hypothesized) needs and requirements. For example, the research communities for which ADD was intended were divided into two classes due to the hypothesized differences in their interests and level of knowledge in the fields that ADD covered. While we expected that the KO research community members have extensive knowledge of the achievements related to academic disciplines detection, we did not expect this to be the case with the members of the ML and NLP communities. On the other hand, the latter communities were expected to have greater familiarity with the ML and NLP methods used in ADD, which, among other things, implied different presentation clarity expectations by the different communities. Furthermore, we expected that the ML and NLP community members would be mainly interested in the comparison of the state-of-the-art text encoders based on deep neural networks to conventional text analysis methods on a new domain-specific dataset, while the other communities in the comparison of the detected academic disciplines to previously published results. A separate stakeholder class interested in the end results but with limited knowledge of the technical aspects of ADD was called data consumers.

The analysis of some of the available academic discipline classification systems and Wikipedia’s policies resulted in a number of insights that were later incorporated into the functional and non-functional requirements. Several examples include Wikipedia’s policies on article titles, lead sections, and life cycle, as well as the ML domain recommendations for imbalanced dataset evaluation metrics (for more details, see [11]). The initial requirements were further refined in the data analysis and experimentation phases, e.g., through analysis of Wikipedia’s article titles, article interlinks, category graph, and similar. In addition to refining the already identified requirements, new requirements were discovered in these phases, such as requirements related to data preprocessing.

B. REQUIREMENTS SPECIFICATION
1) LITERATURE REVIEW
In conventional software development, the approach to formal specification of the requirements largely depends on the selected approach to software development, with best practices already in place for each. We are not aware of best practices defined for ML systems specifically. Maass and Storey [27] indicate that the field of conceptual modeling already has proven specification languages for functional, non-functional, and business requirements. Data requirements should use those already available for database systems and linked data, whereas the approaches to specifying performance, ethical, interpretability, and resilience requirements require further refinement [27]. Adaptation of formal methods has been proposed as another possible direction for designing AI systems with provable correctness against mathematically specified requirements [52]. Model-driven engineering principles have also been used in specifying requirements for neural networks [33]. Through a literature review, Ahmad et al. [31] have found that the most


commonly used modeling notations or languages in specifying requirements for AI systems are (1) Unified Modeling Language (UML), (2) Goal-Oriented RE (GORE) and (3) Domain Specific Models, but the authors conclude that they still have limitations when it comes to their use for this purpose.

2) CASE STUDY
The requirements specification approach used in the ADD project was mainly based on best practices from conventional RE, tailored to ML specifics when necessary. A combination of text and visual models was used. Excerpts from the ADD requirements specification are given in Section V.

C. REQUIREMENTS VALIDATION
1) LITERATURE REVIEW
The requirements validation ensures that the right requirements which meet the needs of the stakeholders are captured, and it is performed through activities such as requirements reviews, development of conceptual tests, definition of acceptance criteria and similar [45]. The use of ML affects this RE phase as well, especially in terms of testing approaches. Riccio et al. [53] point to a limitation in the effectiveness of conventional testing approaches when applied to ML systems, primarily because of the program logic dependence on training data and the stochastic nature of the learning process. The authors emphasize the need for novel techniques that address the specifics of ML systems. Since the testing of ML systems is a rapidly progressing research field at the moment, readers are directed to Riccio et al. [53] and Zhang et al. [54] for a more thorough review of the challenges and novel methods.

When it comes to measuring DL models’ performance, Geirhos et al. [49] show that measuring performance only on an independent and identically distributed test dataset can sometimes be misleading if the assumption that the data generation and sampling mechanisms are the same is not justified. The authors suggest that testing on out-of-distribution data should become a standard practice in order to distinguish desired solutions from ‘‘shortcut’’ solutions [49]. Furthermore, the results presented in [48] indicate that models should be explicitly tested for any required behavior that is not guaranteed by the independent and identically distributed test dataset, as some required behavior will almost certainly be underspecified. These tests should be application specific and based on the requirements [48].

2) CASE STUDY
The validation of the requirements for ADD was done through reviews, development of test cases for conventionally programmed software components, planning the ML-based components testing, and defining criteria that the components and the system had to meet. The planning of the ML-based components testing included defining criteria for collecting representative training/test datasets, identifying types of potentially ambiguous examples that may cause incorrect predictions (e.g., scientific terminology or notable people related to a particular academic discipline), identifying exceptions (e.g., articles on academic disciplines that do not comply with Wikipedia’s policies), defining approaches to test model performance changes over time, and similar [11].

V. REQUIREMENTS FOR MACHINE LEARNING SYSTEMS
This section analyzes the research questions related to the relevance of different types of software requirements in addressing the ML specifics, the impact ML specifics have on their conventional understanding, and the necessary adjustments of this understanding. In conventional software systems, the requirements can be organized into multiple levels of abstraction, where the lower levels refine the higher ones. For example, Wiegers and Beatty [45] suggest a three-level model of requirements consisting of business, user, and functional requirements, accompanied by non-functional and data requirements. This section analyzes some of the well-known types of requirements in the context of ML systems, and uses this model to a limited extent in organizing its subsections (for the exact three-level model of requirements, the readers are referred to the referenced publication). Nevertheless, the analyzed requirements in this section are only a subset of the broader set of requirements and relevant information suggested by the RE literature to date. Furthermore, through the organization of the subsections, we do not attempt to suggest a particular approach to organizing the requirements in requirements specifications, so we direct the readers to publications and standards written for that particular purpose.

A. HIGH-LEVEL (BUSINESS) REQUIREMENTS
The business requirements, usually coming from stakeholders familiar with the reasons for undertaking the project, refer to the needs that initiated the project and the desired outcome [45]. A description of some of the information that may be part of these requirements follows, analyzed in the context of ML systems. It is supplemented with excerpts from the ADD requirements, given in Table 2.

1) OBJECTIVES
a: LITERATURE REVIEW
Defining the objectives to be achieved through the use of a particular software system is an essential factor for the success of the project, regardless of whether it involves ML or not. The specificity of ML systems in terms of their potential impact on individuals, groups, and even society, requires defining objectives aligned with the already recognized ethical principles for this type of systems by the community. In the ethical guidelines and principles for AI systems published recently by the public (e.g., Ethics Guidelines for Trustworthy AI [38]) and private sector (e.g., Google AI Principles3), convergence towards several ethical principles has been observed, i.e., transparency, justice and fairness,

3 https://ai.google/responsibility/principles/


non-maleficence, responsibility, and privacy, together with beneficence, freedom and autonomy, trust, dignity, sustainability, and solidarity [55]. Although some divergence of the ethical principle definitions and uncertainty about their implementation in practice has been highlighted in [55], we firmly believe that the aspiration for development of ethical ML systems based on recognized principles has to be clearly and unambiguously stated in the high-level requirements that guide the project. Consequently, these principles have to be included in the low-level requirements, incorporated in the development practices, implemented in the system, validated, and monitored.

b: CASE STUDY
An excerpt from the objectives of ADD is given in Table 2. Some of the listed objectives exactly refer to the development of a thoroughly evaluated system, transparent in terms of its methodology.

2) SUCCESS/PERFORMANCE METRICS
a: LITERATURE REVIEW
The ML community has defined various performance metrics appropriate for different types of ML problems, such as accuracy, precision, recall, f-measure, mean squared error, and others. Performance is usually measured on a dataset unseen during the model training stage to ensure proper functioning of the model on unseen data in a real-world setting. Nevertheless, in certain tasks it may be challenging to find an ML performance metric that corresponds to the desired system behavior, or measuring that behavior may be impractical [4]. In addition, a preferred and realistic level of performance that makes the ML system worthwhile, safe, and useful has to be determined [4]. It is recommended to document the approaches to uncertainty and variability, e.g., k-fold cross-validation, as well [9]. Some articles [10] indicate that it is the requirements engineer’s job to translate the customer expectations to appropriate metrics.

b: CASE STUDY
ADD success metrics were defined at two levels of abstraction, i.e., (1) high-level success metrics that refer to the system’s overall success in achieving its objectives and (2) ML performance metrics specified for each ML component separately, together with the expected performance. An excerpt from the high-level success metrics for ADD is given in Table 2. The high-level success metrics indicate that the number of detected academic disciplines should be similar to that in expert-created classification systems. They also require processing of multiple Wikipedia exports over a period of four years, in order to demonstrate the low variability of the test performance, and the high overlap of the detected disciplines in adjacent processed exports. Examples of ML performance metrics, along with reasons for selecting them over others, are given in Table 3 and Section V-C.

3) LIMITATIONS
a: LITERATURE REVIEW
ML models are trained and tested under certain assumptions and conditions. Therefore, it should not be assumed that they work equally well in other settings. As the ‘‘no free lunch’’ theorem for ML [56] states, no algorithm is universally better than any other, including random guessing, when averaged over all possible tasks [4]. For example, the limitations of DL models, summarized in [47], indicate that they have poor performance when their training data is limited, when their test data differs significantly from their training data, as well as in broad example spaces filled with novelties. The quality attribute definitions have their own limitations as well [57]. Given these facts on the one hand, and ML systems’ potential effects on individuals (or even autonomy in some cases) on the other, it is essential that the limitations of ML systems are clearly stated, communicated to their stakeholders, and agreed upon. Communicating clearly what the system outputs mean and what they do not, and what the intended and unintended use cases are, helps in avoiding misinterpretations or inappropriate use. As Jacovi et al. [58] highlight, vaguely specifying the expected behavior of an AI system, which users should trust to be upheld (called a ‘‘contract’’ by the authors), can lead to unwarranted trust in the system and its misuse, as users may implicitly assume ‘‘contracts’’ that during the development of the system have not been considered to be upheld.

b: CASE STUDY
An excerpt from the limitations of ADD is given in Table 2. The limitations in the excerpt primarily refer to the interpretation of the system output, i.e., what the output means and what it does not, and its proper use, i.e., intended and unintended use cases.

4) RISKS
a: LITERATURE REVIEW
Risks are conditions that should be identified, evaluated, and controlled, because they can negatively affect the success of a project in terms of user acceptance, implementation, competition, and similar [45]. ML systems face risks that are not inherent in conventional software systems, like specific ethical, moral, legal, security, and other similar risks. While AI algorithms have the potential to augment human well-being, at the same time, they can sometimes exhibit behavior with unintended and unanticipated consequences by their creators, both positive and negative [8]. As their properties and operating environments become too complex to allow an analytical formalization of some of their behaviors, predicting their effects on individuals and the society becomes challenging as well [8]. In that sense, anticipating any potential risks from the influence these systems have on people and the other way round, although absolutely necessary, can be rather challenging. This section summarizes some of the risks specific to ML systems, like the implementation challenges that may turn into risks. An additional discussion on the


risks associated with various quality attributes is available in Section V-D.

Engineering robust ML systems has specific challenges that are not inherent in other types of software systems. ML models are highly sensitive to changes in their input data distribution and learning hyperparameters, and such changes may lead to model retraining, further affecting all of its dependent models in a way that cannot always be predicted [6]. They may depend on data from external systems or models, changes of which may be beyond our control, and be sensitive to changes in the environment which they interact with [6]. Inadequate model update frequency in frequently changing environments can be a risk factor to its performance, as can failing to evaluate the model performance on an important data slice, especially if it differs from the overall performance [59]. Reproducibility, debuggability, and auditability are important aspects that require version control of the model specifications [59] and tracking of the data on which the model was trained, but proper data management and versioning are more complex than doing the same for software code [7]. ML systems face specific security, privacy, and safety risks that must be adequately addressed because of their potential consequences. The lack of interpretability or explainability is another risk factor to the stakeholders’ trust and acceptance of the system. In the context of DL, many cases of failure can be attributed to so-called ‘‘shortcut’’ solutions [49]. Underspecifying relevant behavior to be learned by the ML pipeline can lead to such ‘‘shortcut’’ or otherwise undesirable solutions, because the ML pipeline can choose one such solution over another which has the same test performance and much greater compliance with the ‘‘unspecified’’ but desired behavioral requirements [48], [49].

b: CASE STUDY
An excerpt from the risks of the ADD project is given in Table 2. The risks in the excerpt refer to the potential nonacceptance of the system by the users (due to its significant differences from previously published systems and methodologies), as well as to the difficulty of precisely defining the ground truth in the evaluation process (due to imprecisely defined domain-specific terminology and the nonexistence of a widely accepted finite set of academic disciplines).

TABLE 2. An excerpt from the high-level requirements for ADD. The implementation of the requirements is described in [11].

B. USER REQUIREMENTS
1) LITERATURE REVIEW
Since ML systems may have user interactions that affect users and their acceptance of the system, collecting the user expectations from such systems can reveal useful interactions and quality requirements. For example, studies have shown that the perception of an ML system’s interpretability depends on the audience to which the explanations are presented and the task [60], [61]. Although different types of post-hoc explanations may be appropriate for different end users in different tasks (e.g., textual explanations, visual explanations, local explanations, explanations by example, etc., further discussed in [60]), it is essential that they are aligned with the user mental models, needs, and use cases [62]. Amershi et al. [63] state that in many ML systems, their users have been able to come up with new possibilities for explanations, other than the ones they have received. Heyn et al. [16] emphasize the importance of understanding user needs and interactions with the system during RE, in order to provide users with functionalities they would accept, trust, and use properly.

Nevertheless, in the more general context of business analytics projects, Wiegers and Beatty [45] state that elicitation of the user expectations from such systems is insufficient to reveal the complex knowledge needed to develop them. The same is true for ML systems. Moreover, features in ML systems are introduced not only as a result of user needs, but for other reasons as well, like the availability of certain data, the need to collect additional data through user interactions,


and similar [64]. These systems may provide outcomes based on user data, and their behavior may evolve as they collect more data, so in this context, Yang et al. [65] distinguish four types of AI systems based on two factors, i.e., their capability uncertainty and output complexity. While the first type has bounded capabilities and a fixed set of outputs, the fourth has evolving capabilities and adaptive open-ended outputs, making it difficult to predict what the fourth type of AI systems can reliably do, when they can fail or how likely the failures are, in order to plan appropriate interactions [65].

2) CASE STUDY
Due to the experimental nature of ADD and the characteristics of its users, it does not have a user interface, but users interact with the application by running its modules with a set of required parameters, after providing them with the necessary files [11]. Users receive the results in local files generated by the modules. In this sense, we were able to identify most of the usage scenarios that involve different classes of users. However, due to the inherent uncertainty of ML models, it is still possible to have model outcomes or failures that have not been anticipated.

C. FUNCTIONAL REQUIREMENTS
1) LITERATURE REVIEW
In general, functional requirements describe what a software system should be capable of doing. Typically, the expected behavior of conventional software components is precisely specified in the functional requirements. This is not the case with ML models, which learn how to relate the input data to the expected outcome through a training process. Nevertheless, ML systems commonly consist of both conventionally programmed functionalities and functionalities implemented by ML models. In that sense, certain functional requirements are defined conventionally, by explicitly specifying the rules that relate inputs to outputs. At the same time, those functionalities that require training an ML model are described through the function that the model is expected to learn and the expected performance. Kuwajima and Ishikawa [37] indicate that while conventional software can be decomposed into smaller functions that have separate requirements, design, and implementation, the functions implemented by ML models are usually large and fuzzy, sometimes accompanied by large datasets. They suggest dividing these large functions into smaller ones by specifying relevant domain-specific conditions/contexts through training/test dataset partitions and then evaluating the models on each of them [37]. Kuwajima et al. [19] suggest that model requirements are specified through the expected operational data distribution, which can then be agreed upon and enable the collection of test data that reflects the real operational conditions. The authors further suggest that in such a case, the training data can be designed to allow the achievement of that requirements specification [19]. In a similar context, Mitchell et al. [9] discuss why measuring the overall performance on the entire dataset may be insufficient and why its disaggregation across different data subsets is needed. They suggest identifying factors related to variable performance, like categories of data instances with similar characteristics, instrumentation, environmental conditions etc., and measuring the performance across these factors when possible [9]. Defining the expected performance across the relevant factors and their combinations is essential, as some data subsets may be more critical than others in the context in which the system is used [66], so measuring performance changes over individual factors or their combination becomes possible [9]. An example of environment requirements specification through a data distribution matrix, as well as an example of performance requirements for each environment through a confusion matrix, is given in [19].

2) CASE STUDY
ADD consists of several ML models supported by conventional software components. Therefore, the behavior of the conventional components was fully specified in the functional requirements. In contrast, only the desired behavior, performance expectations, assumptions, constraints, dependencies and similar, were specified for the ML-based components. An excerpt from the functional requirements for the text classification component is given in Table 3. The test dataset was sampled from the operational data according to the expected data distribution across the two classes, as detailed in [11]. Due to the highly imbalanced data distribution, the f-measure was selected over the accuracy, with the expected level of performance defined by class [11].

D. QUALITY REQUIREMENTS
A quality attribute can be defined as a measurable and testable property of a system that shows how well the needs of the stakeholders are met, i.e., the quality requirements are qualifications of certain functional requirements, or qualifications of the whole system [67]. Examples of quality attributes are reliability, efficiency, robustness, usability, scalability, and many others. Different quality attributes can be of different importance to different categories of systems. For example, the specifics of ML systems require paying particular attention to quality attributes related to ethics and trust. At the same time, these systems face new types of challenges that do not occur in conventional software systems (e.g., in security and privacy), so adaptation of some of the conventional quality attribute definitions, or even defining new attributes, may be necessary [36], [37].

In complex systems, quality attributes can hardly be achieved in isolation, without affecting other attributes, so designing a system that meets its predefined quality requirements is partly about making the right trade-offs [67]. The same is true for ML systems, in which, while optimizing an explicitly specified objective, the learning algorithm may neglect some other which it was not explicitly instructed to optimize. Therefore, the quality attributes relevant to an
VOLUME 11, 2023 72197
A. Gjorgjevikj et al.: Requirements Engineering in Machine Learning Projects
TABLE 3. An excerpt from the functional requirements, assumptions, constraints, and dependencies of the text classification component. The
implementation of the requirements is described in [11].
ML system should be identified in cooperation with its stakeholders, formally defined, incorporated into the data and learning algorithm, and evaluated through appropriate data and metrics, while addressing any potential trade-offs. This section briefly reviews quality attributes with specific meaning and relevance to the ML domain. Because of the vague boundaries between certain quality attributes and the rapid progress of ML, the list should not be considered exhaustive or complete. For a more thorough overview of each quality attribute, the reader is referred to the referenced articles. An excerpt from the quality requirements for the ADD system as a whole is given in Table 4. While some requirements refer to the conventional aspects of software development, such as the requirements for system scalability or usability, some particularly address ML specifics, such as the requirements for the ML models' interpretability or their robustness to noisy input data.

1) INTERPRETABILITY
a: LITERATURE REVIEW
In the context of ML systems, interpretability can be defined as an ability to explain or present in a comprehensible way to a person [68]. It is related to the barriers to optimization and evaluation that arise from the problem formulation incompleteness in the ML domain, like the discrepancy between the real objective and the one that is actually optimized, the inability to define and evaluate all edge cases, the difficulties in defining ethics or trust requirements, and similar [68]. There are many other terms that are often associated or equated with the term ''interpretability,'' such as ''explainability,'' ''transparency,'' or ''understandability,'' among others. Therefore, this section attempts to summarize their similarities and differences as reported in the literature.

While some authors make a distinction between the terms ''interpretability'' and ''explainability,'' others use them interchangeably [69]. However, several research articles have found that the ML community uses the term ''interpretable'' more often than the term ''explainable'' [69], [70]. Lipton [71] points out that interpretability is associated with different notions, i.e., transparency (understanding how the model works at the level of the entire model, its components, or training algorithm) and post-hoc interpretability (giving an explanation of the model decision, which does not necessarily explain how the model came to that decision). Transparent models are understandable to a certain degree by themselves, i.e., simulatable if a person can reason about them as a whole, decomposable if all their parts are understandable to a person without additional tools, and algorithmically transparent if a person can follow the process of producing an output from an input [60], [71]. ML models that lack transparency need a different level of post-hoc explanations, which may even apply to transparent models, based on the audience and their level of complexity [60]. For example, while linear/logistic regression, decision trees, rule-based models, or k-nearest neighbors are considered transparent models, support vector machines or various types of deep neural networks are considered models that lack transparency [60]; high-dimensional linear models, rule-based models with a large number of rules, or deep
decision trees tend to become less interpretable [71]. Nevertheless, quantification measures of model interpretability have yet to be formalized by the community [60], [70].

TABLE 4. An excerpt from the quality requirements for the ADD system as a whole. The implementation of the requirements is described in [11].
Carvalho et al. [69] state that interpretability is essentially
a subjective concept, so accordingly, when it is defined and
addressed, the domain of the problem, the use case, and
the needs of the audience asking questions about the model
decisions should be considered. For interpretability to be
implemented in the right way, it is important to analyze
what makes an explanation understandable, reasonable, and
human-friendly to its recipients in the specific context [69].
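To make this concrete, a post-hoc, model-agnostic explanation can be produced even for an opaque model; the following minimal Python sketch uses scikit-learn's permutation importance, with a synthetic dataset and model choice that are illustrative placeholders rather than artifacts of the reviewed works:

```python
# Sketch of a post-hoc, model-agnostic explanation: permutation importance
# shuffles one feature at a time and measures how much the score drops.
# The synthetic dataset and the model choice are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: mean importance {importance:.3f}")
```

Whether such scores constitute an understandable explanation for a given audience remains the requirements question discussed above.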
According to Miller [72], while most of the work in the domain relies on the researchers' intuition of what constitutes a good explanation, it may be useful to look at the findings from psychology, philosophy, or cognitive science on how people give explanations to each other.
The trade-off between interpretability and performance is
frequently discussed, because complex models that usually
have better performance tend to be less interpretable. How-
ever, such a trade-off may not exist in some cases when the
data is well structured and the features are of high quality, but
even when it exists, the development of sophisticated explain-
ability methods can help overcome it [60]. Herm et al. [73]
have shown empirically that this trade-off is less gradual
than assumed, when analyzed from the end-user perspective.
They have further shown that rather than being a curve, the
trade-off exhibits a grouped structure and is context depen-
dent (e.g., on the data complexity) [73].
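The trade-off discussed above can be probed empirically on a concrete dataset; a minimal sketch, assuming scikit-learn is available and using a bundled public dataset as a stand-in for real project data, compares a transparent model with a less interpretable ensemble:

```python
# Compare a shallow (transparent) decision tree against a gradient-boosted
# ensemble on held-out data; a small accuracy gap would support preferring
# the interpretable model. The dataset is an illustrative stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
boost = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

acc_tree = tree.score(X_test, y_test)
acc_boost = boost.score(X_test, y_test)
print(f"shallow tree: {acc_tree:.3f}, gradient boosting: {acc_boost:.3f}")
```

In line with the grouped, context-dependent structure reported in [73], the observed gap will vary with the dataset, so such a comparison is a per-project measurement rather than a general rule.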
b: CASE STUDY
Table 4 contains an excerpt from the interpretability requirements for the ADD system. It includes requirements for the consideration of inherently interpretable ML models in the experimentation phase, a preference for such models when they perform similarly to the less interpretable ones, visualization of the models' input/output, and outputting supplementary data that allows further result analysis.
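The stated preference for inherently interpretable models when performance is comparable can be expressed as a simple selection rule; the candidate names, scores, and tolerance below are hypothetical illustrations, not ADD's actual values:

```python
# Prefer the most interpretable candidate whose score is within a tolerance
# of the best score. Candidates are ordered from most to least interpretable;
# all names, scores, and the tolerance are hypothetical.
def select_model(candidates, tolerance=0.03):
    """candidates: list of (name, score) pairs, most interpretable first."""
    best_score = max(score for _, score in candidates)
    for name, score in candidates:
        if best_score - score <= tolerance:
            return name

chosen = select_model([("logistic_regression", 0.90),
                       ("decision_tree", 0.91),
                       ("deep_network", 0.92)])
print(chosen)  # logistic_regression: within 0.03 of the best score
```

Making such a rule explicit in the requirements turns the preference into a decision that can be reviewed and tested, rather than an ad-hoc choice.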
2) FAIRNESS
a: LITERATURE REVIEW
With the increased use of ML algorithms in making deci-
sions about individuals, ensuring an outcome that is fair
and non-discriminatory in relation to sensitive characteris-
tics (e.g., gender, race) requires serious attention from the
ML practitioners. Fairness can be defined as an absence
of prejudice or favoritism towards individuals or groups
based on certain inherent or acquired characteristics they possess [74]. The ML community has proposed a number of different formal definitions of fairness. Some target individual fairness, i.e., similarly treating similar individuals, while others target group fairness, i.e., treating different groups equally. However, fairness definitions have their limitations too, as discussed in [57] and [75]. The most basic definition, known as Fairness Through Unawareness, requires protected attributes not to be explicitly used in decision-making processes, but it still has shortcomings, as other features may contain discriminatory information analogous to that in the protected attributes [75]. Demographic Parity, also known as Statistical Parity [76], requires membership in a protected class not to be correlated with the decision. Equalized Odds [77] requires protected and unprotected classes to have equal true-positive and false-positive rates, while Equal Opportunity [77] is a weaker notion than Equalized Odds and requires non-discrimination only over the ''advantaged'' outcome. While the previous three definitions fall in the category of group fairness, the following two belong to the
category of individual fairness. The Individual Fairness [76] definition requires similar individuals to get similar predictions by an algorithm under some carefully chosen similarity metric. The Counterfactual Fairness [75] definition indicates that protected attributes should not be a cause of the predictor in any individual instance.

Bias in ML can be addressed in three stages, i.e., (1) by removing it from the data through pre-processing, (2) by modifying the learning algorithm, and (3) by reassigning the model predictions, when the model is treated as a black box [74]. One source of bias in ML is the data itself, in which bias can take a variety of forms, like historical bias already existing in the real world and therefore existing in the data, representation bias when the data lacks diversity by a particular criterion, bias in measuring a particular feature, aggregation bias, and many other types [74]. One example of bias found in ML results, as well as a discussion of the risk of its inheritance and even amplification by other dependent models, is presented in [78], where a set of widely used word vectors, the distances of which represent relationships between words, was found to contain a gender bias. Since bias can be inadvertently introduced into ML systems in a number of ways and at various stages of their development, its identification and addressing should begin as early as possible. Properly defined fairness requirements are the right place to start.

3) ROBUSTNESS
a: LITERATURE REVIEW
In line with the general robustness definition, ML algorithms should be capable of learning robust models even in the presence of noisy training data, and the models should remain robust at operation time. This makes robustness a rather broad attribute, closely related to many of the others described in this section.

ML systems' ability to stay robust at operation time, when faced with input different from that seen during training, is essential, because non-robust ML systems may not only show poor performance but may wrongly assume good performance and confidently take a wrong action [79]. The robustness of ML models has been studied extensively in the context of adversarial examples: inputs designed to force a model to produce erroneous outputs, most commonly through small perturbations which make the new input close to the original one according to a domain-specific distance metric, but misclassified by the model [80]. Evaluating model robustness is important for several reasons, i.e., (1) to prevent models from misbehaving due to adversaries, (2) to use their good worst-case robustness as evidence that they will not misbehave in the real world due to unforeseen randomness, and (3) to compare models with human abilities [80]. To address ML models' robustness properly, their performance expectations should be defined rigorously, deviations from such expectations should be prevented, and methods to identify/correct such deviations should be defined, all of which leads to accountability in the ML field [81].

A commonly discussed trade-off is that of models' robustness and accuracy. Recent works have shown that larger and more complex datasets are needed for robust learning than for standard learning [82], as well as a trade-off between the accuracy of models trained for adversarial robustness and their standard accuracy achieved when trained on unperturbed inputs [83]. They have further shown that the learned feature representations differ in the two settings, but models that encode a prior about human perception seem invariant to perturbations to which humans are invariant [83]. Recent works also study methods to mitigate the robustness-accuracy trade-off [84].

b: CASE STUDY
Table 4 contains an excerpt from the robustness requirements for the ADD system. It includes requirements for proper input file format validation, default input values, and robustness to exceptions during the processing of large input files. It also includes ML-specific robustness requirements, i.e., requirements for ML model robustness to ambiguous or non-standard input examples.

4) SECURITY
a: LITERATURE REVIEW
The growing use of ML in many different domains, including safety-critical ones, requires an understanding of the new security vulnerabilities that are not present in other types of systems, and strengthening the robustness against them. Nevertheless, Carlini et al. [80] indicate that while the studies of adversarial examples in new domains are advancing rapidly, the design of systems robust to such examples is slower.

To analyze the security of a system, it is necessary to identify (1) security goals, i.e., requirements that, if violated, result in a compromise of an asset, and (2) a threat model [85]. This model defines the conditions under which a defense is designed to be secure and the security guarantees it provides [80]. Some of the models proposed in the literature consider the adversary's (1) goal/incentives, i.e., accessing system assets or denying normal operation, and (2) capability, i.e., its knowledge of the system and the constraints on its capability [85]. Others consider the adversary's (1) goal, (2) knowledge, i.e., complete knowledge of the model or a varying degree of black-box access, and (3) capability [80]. In the context of supervised learning, the violations can be classified across three dimensions, i.e., (1) influence (causative or exploratory), (2) security violation (integrity or availability), and (3) specificity of the adversary's intention (targeted or indiscriminate) [85]. Different examples of learning in adversarial environments have been described in the literature, e.g., in [86] and [87].

Barreno et al. [85] have found that improving the worst-case robustness of an algorithm can make it less effective on average. Based on their analysis of the most common shortcomings of adversarial example defenses, Carlini et al. [80] have defined a set of guidelines for defense
evaluation, emphasizing the extreme caution and skepticism that this process requires.

5) PRIVACY
a: LITERATURE REVIEW
There are certain situations in which the exposure of a model, its parameters, or its training data should be prevented due to confidentiality or privacy. Still, ML models' capacity to memorize elements of the training data makes it challenging to provide guarantees that participation in a training set does not affect the privacy of the individuals [86]. The adversaries usually aim at recovering the training data or the model, like recovering partially known inputs with the most probable values or extracting the training data using the outputs [86]. Several methodologies for addressing privacy concerns in the ML domain are differential privacy, federated learning, and data encryption.

Differential privacy represents a mathematically rigorous definition of privacy, which ensures that the output of a database analysis is distributed very similarly to the output of the analysis of another database that differs from the first in one row only, while bounding the maximum divergence between the two distributions by a privacy loss parameter [88]. Federated learning refers to a setting where many clients collaboratively train a model while being orchestrated by a central server/service provider and while keeping their training data decentralized [89]. The learning objective is achieved through updates that contain the minimum necessary information for the learning task and which are suitable for immediate aggregation [89]. Another way to preserve data privacy is to train a model or make inferences on encrypted data using methods like homomorphic encryption or secure multi-party computation. Several examples of their use in the ML domain include customizing ML algorithms to use homomorphic encryption in the training and inference stages [90], making predictions with neural networks on encrypted data using homomorphic encryption [91], and others.

In the context of trade-offs that come from the use of privacy-preserving methods, Brundage et al. [92] point to trade-offs between the privacy benefits, the model quality, the developers' experience, and the costs in computation, communication, or energy consumption. Papernot et al. [86] point to a fundamental tension between the security/privacy and the precision in ML systems with a finite capacity. In terms of neural networks and homomorphic encryption, Gilad-Bachrach et al. [91] indicate that adding encryption makes the training process slower, at the same time preventing the data scientists from inspecting the data or tuning the model during training.

6) SAFETY
a: LITERATURE REVIEW
In the context of ML, Varshney [93] defines safety as minimization of the risk and uncertainty associated with harmful events, i.e., events related to a sufficiently high cost in some human sense. The author identifies several sources of risk in ML systems, i.e., (1) the assumption that the training data comes from the operational data distribution, (2) low probability density of the operational data distribution in certain regions, (3) uncertainty coming from the way the test set was instantiated, and (4) dependence of the loss function on the predicted and actual values only [93]. Several approaches to mitigate these risks include ensuring an inherently safe design, adding safety factors or margins, adding additional procedural safeguards beyond those designed in the core functionality, and ensuring a safe fail [93]. In the context of supervised and reinforcement learning, Amodei et al. [79] have identified several sources of safety risks, i.e., (1) a loss function that inadvertently ignores aspects of complex environments that could be harmful if changed at operation time, (2) a loss function minimized by an easy solution during training which was not the designer's true intention, (3) substituting the correct loss function with another one because the former is too expensive for frequent evaluation, and (4) failure to ensure safe actions when the system encounters unseen input. Furthermore, Jacovi et al. [58] indicate that adequate verification of the existence of a certain risk (an undesirable but possible event) from the use of an AI system is a prerequisite for verification of the existence of Human-AI trust.

E. OTHER REQUIREMENTS
1) ASSUMPTIONS
a: LITERATURE REVIEW
Assumptions are an almost inevitable aspect of ML system development. For example, assumptions are made when certain aspects of the problem to be solved or its data are not observable. Based on those assumptions, the real problem is translated to an ML problem, and an appropriate class of models is selected to solve it. Other examples of assumptions are those related to the data distribution across different classes in the real dataset, assumptions that the statistical properties are similar across the entire dataset [94], and similar. Assumptions are made about quality attributes as well. Deviations from the assumptions on which a particular class of models is based can be a source of problems, as summarized for DL models in [47]. Furthermore, in DL, the assumptions made about the neural network architecture, training data, loss function, and optimization algorithm not only constrain the problem solutions that can be learned but also determine how easily a particular solution can be learned and, therefore, may inadvertently create opportunities to learn an undesirable ''shortcut'' solution to a problem that does not work well in real-world settings [49]. Unclearly defined or omitted assumptions affect accountability in AI systems, as they leave room for avoiding responsibility for any errors resulting from wrong assumptions, by blaming unavoidable and inexplicable software ''bugs'' [81]. Therefore, identifying and documenting the assumptions prevents stakeholders from neglecting or misinterpreting them in the development process and allows for appropriate addressing of their effects.
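One way to keep documented assumptions from being neglected is to turn them into executable checks wherever possible; the following minimal Python sketch validates an assumed class distribution against a labeled data sample (the class names, shares, and deviation threshold are hypothetical, not ADD's actual figures):

```python
# Turn a documented data-distribution assumption into an executable check:
# compare the assumed class shares with those observed in a labeled sample.
# Class names, shares, and the allowed deviation are hypothetical.
from collections import Counter

ASSUMED_CLASS_SHARES = {"relevant": 0.1, "irrelevant": 0.9}

def check_distribution_assumption(labels, assumed_shares, max_deviation=0.05):
    total = len(labels)
    counts = Counter(labels)
    return all(abs(counts.get(cls, 0) / total - share) <= max_deviation
               for cls, share in assumed_shares.items())

sample = ["relevant"] * 12 + ["irrelevant"] * 88
print(check_distribution_assumption(sample, ASSUMED_CLASS_SHARES))  # True
```

Running such a check on new data samples makes a violated assumption visible early, instead of surfacing later as unexplained model misbehavior.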
TABLE 5. Brief summary of the relevance of conventional RE activities to the ML development process, the challenges this process brings them, and the
necessary adjustments to these activities to better fit into the process.
b: CASE STUDY
The assumptions for ADD were defined at two levels of abstraction, i.e., high-level assumptions and assumptions related to specific functionalities. An excerpt from the first type is given in Table 2. It includes our assumptions related to the definitions and subordination of the domain-specific thematic structure detected by ADD and its related ones. These statements are considered assumptions due to the lack of precise definitions and widely accepted understanding of their subordination in the domain. By clearly documenting them, we ensured that all stakeholders of ADD share the same understanding. The second type of assumptions includes those related to ML models, e.g., the expected data distribution, relevant features, or the appropriate class of models. An excerpt from the assumptions related to the text classification component is given in Table 3.

2) DEPENDENCIES
a: LITERATURE REVIEW
Dependencies are external factors or components on which a project or system depends but which are beyond its control, so they can turn into risks if left undefined or inappropriately monitored [45]. As already mentioned in Section V-A4, ML systems are inevitably dependent on data, and they often depend on external models or external software libraries, so the team implementing the system may not always have full control over them. For those reasons, proper documentation of the dependencies and monitoring of their effects on the ML system are crucial. Breck et al. [59] have summarized a set of best practices for preventing potential risks arising from dependencies, which can sometimes lead to model misbehavior even without outputs strange enough to trigger monitoring mechanisms.

b: CASE STUDY
Table 3 contains an excerpt from the ADD dependencies defined for the text classification component. They primarily refer to the component's dependence on third-party pre-trained text encoders, software libraries for ML and NLP functionalities, and the Wikipedia XML export files which ADD takes as input.

3) CONSTRAINTS
a: LITERATURE REVIEW
Constraints are restrictions on the design and implementation choices that the developers can make about a solution, which can result from decisions made by management,
TABLE 6. Brief summary of the importance of different types of requirements (high-level, user, and functional) in addressing the ML specifics, the
challenges that ML specifics bring to their conventional understanding, and the necessary adjustments of this understanding.
requirements from external stakeholders, requirements for compliance with standards or agreements, and a variety of other reasons [45]. In the context of ML systems, examples include policy constraints that may enforce certain requirements, e.g., on privacy [23]. Other examples include data constraints which describe meaningful feature ranges, feature dependencies, or invariants, ensuring the data validity after its transformation [27]. Constraints may encode certain prior knowledge or a preference towards a simpler class of models [4], or in other ways guide the ML pipeline in learning models that satisfy a broader set of behavioral requirements that are sometimes not covered by the standard ML testing process (e.g., requirements related to interpretability or fairness) [48]. For ML systems that continuously learn and change their behavior, hard-coding rules for system behavior that prevent the system from learning behavior that does not conform to the relevant standards agreed upon among stakeholders has also been suggested [95].

b: CASE STUDY
Table 3 contains an excerpt from the constraints of the ADD project. They refer to certain experimental choices imposed on the developers of the text classification component in order to meet the high-level objectives of ADD related to the comparison of state-of-the-art and conventional text representation methods.

VI. DISCUSSION
This section offers a summary of all findings previously stated in the article, related to the importance of the RE activities in the development of ML systems, the importance of certain types of requirements, the challenges associated
TABLE 7. Brief summary of the importance of different types of quality requirements in addressing the ML specifics, the challenges that ML specifics
bring to their conventional understanding, and the necessary adjustments of this understanding.
with RE activities, and those associated with the conventional understanding of the requirements.

ML systems have become ubiquitous in many segments of our lives due to the numerous benefits from their use. Nevertheless, ML systems are complex systems which learn their behavior from data. Since data can be imperfect or reflect historical human biases, ML systems are at risk of acquiring these imperfections through the learning process. Furthermore, ML systems can implement complex decision functions, which depend on many factors and which may lead to outcomes that cannot always be predicted with certainty. Therefore, the importance of identifying, analyzing, documenting, and validating the expected behavior of an ML system, the intended and unintended use cases, the risks, limitations, assumptions, the performance and quality expectations, or the required compliance with
TABLE 8. Brief summary of the importance of different types of requirements (assumptions, dependencies, and constraints) in addressing the ML
specifics, the challenges that ML specifics bring to their conventional understanding, and the necessary adjustments of this understanding.
ethical/legal constraints should not be underestimated. On the contrary, these RE activities should be given attention as early as possible in the ML development process. The literature review provided in this article confirms that carefully conducted RE activities can add value to the rather complex ML development process, in the same way that they add value to the conventional software development process.

Nevertheless, the ML development process has its own specifics that affect the already established RE best practices. The results of the literature review and the case study are consistent in terms of the significant impact that the ML development process has on the conventional, well-established RE activities, but they also highlight the benefits of these activities in dealing with the complexity of the process. ML introduces new activities through which requirements are identified and refined (e.g., data analysis and experimentation), introduces non-trivial challenges to be anticipated in the RE phase, and makes some of the established RE best practices inapplicable. However, at the time of writing, RE best practices for ML systems do not exist and have yet to be defined by the community. Table 5 briefly summarizes the findings related to (1) the relevance of conventional RE activities to the ML development process, (2) the challenges that this process brings to them, and (3) their necessary adjustments to better fit into this process, all of which are presented in the previous sections of this article.

The literature review also confirms the importance of each of the requirement types considered in this article to ML systems. However, the conventional understanding of some of them (e.g., functional requirements or certain quality attributes) may require adjustment in the context of ML. Table 6, Table 7, and Table 8 briefly summarize the findings related to (1) the importance of the different types of requirements in addressing the ML specifics, (2) the challenges that ML specifics bring to the conventional understanding of these requirements, and (3) the necessary adjustments of this understanding, all of which are presented in the previous sections of this article.

Finally, given the current prevalence of ML in software development, we believe that the number of research articles on this topic will continue to grow in the coming years, offering experiences from real ML projects, as well as new or adjusted methodologies that better fit the ML development process. However, until widely accepted RE best practices for ML systems are available, we believe that the already established RE models, applied with awareness of the ML specifics, provide a solid foundation for a thorough and shared understanding of what needs to be implemented in and what is expected from an ML system, while minimizing the risk of neglecting important requirements.
VII. CONCLUSION
Machine learning has become a common choice in modern software development across many domains. Nevertheless, while it can provide data-driven solutions to many problems that people find difficult to solve, at the same time it challenges the well-established software development best practices. Furthermore, machine learning introduces new technical and ethical challenges of which the stakeholders must be fully aware even before the project begins. Since the requirements engineering activities provide a proper understanding of the problem and ensure the implementation of an appropriate solution, these are the right activities where solving machine learning challenges should begin. As the requirements engineering activities are also influenced by the machine learning specifics, but best practices do not exist yet, this article aims to analyze the impact that machine learning has on conventional requirements engineering activities and types of requirements, to emphasize the importance of proper requirements engineering in machine learning projects, and to share our experience through a case study. Most importantly, the purpose of this article is to motivate further discussion and sharing of practical experiences on this important topic.

REFERENCES
[6] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, ''Hidden technical debt in machine learning systems,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 28, 2015, pp. 2503–2511.
[7] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann, ''Software engineering for machine learning: A case study,'' in Proc. IEEE/ACM 41st Int. Conf. Softw. Eng., Softw. Eng. Pract. (ICSE-SEIP), May 2019, pp. 291–300.
[8] I. Rahwan et al., ''Machine behaviour,'' Nature, vol. 568, no. 7753, pp. 477–486, 2019.
[9] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, ''Model cards for model reporting,'' in Proc. Conf. Fairness, Accountability, Transparency. New York, NY, USA: Association for Computing Machinery, Jan. 2019, pp. 220–229.
[10] A. Vogelsang and M. Borg, ''Requirements engineering for machine learning: Perspectives from data scientists,'' in Proc. IEEE 27th Int. Requirements Eng. Conf. Workshops (REW), Sep. 2019, pp. 245–251.
[11] A. Gjorgjevikj, K. Mishev, and D. Trajanov, ''ADD: Academic disciplines detector based on Wikipedia,'' IEEE Access, vol. 8, pp. 7005–7019, 2020.
[12] C. Kästner and E. Kang, ''Teaching software engineering for AI-enabled systems,'' 2020, arXiv:2001.06691.
[13] S. Martínez-Fernández, J. Bogner, X. Franch, M. Oriol, J. Siebert, A. Trendowicz, A. M. Vollmer, and S. Wagner, ''Software engineering for AI-based systems: A survey,'' 2021, arXiv:2105.01984.
[14] H. Belani, M. Vukovic, and Ž. Car, ''Requirements engineering challenges in building AI-based complex systems,'' in Proc. IEEE 27th Int. Requirements Eng. Conf. Workshops (REW), Sep. 2019, pp. 252–255.
[15] T. Chuprina, D. Mendez, and K. Wnuk, ''Towards artefact-based requirements engineering for data-centric systems,'' 2021, arXiv:2103.05233.
ments engineering for data-centric systems,’’ 2021, arXiv:2103.05233.
because, in the future, machine learning systems will become
[16] H.-M. Heyn, E. Knauss, A. P. Muhammad, O. Eriksson, J. Linder,
even more present in our daily lives. P. Subbiah, S. K. Pradhan, and S. Tungal, ‘‘Requirement engineering chal-
The presented literature review and case study findings lenges for AI-intense systems development,’’ 2021, arXiv:2103.10270.
confirm that the machine learning development process [17] S. Studer, T. Binh Bui, C. Drescher, A. Hanuschkin, L. Winkler, S. Peters,
and K.-R. Mueller, ‘‘Towards CRISP-ML(Q): A machine learning process
affects the conventional, well-established requirements engi- model with quality assurance methodology,’’ 2020, arXiv:2003.05155.
neering activities, but they also confirm the relevance of [18] X. Zhang, Y. Yang, Y. Feng, and Z. Chen, ‘‘Software engineering
these activities to the process. Furthermore, the findings practice in the development of deep learning applications,’’ 2019,
arXiv:1910.03156.
confirm the relevance of the different requirement types con-
[19] H. Kuwajima, H. Yasuoka, and T. Nakae, ‘‘Engineering problems in
sidered in this article to machine learning systems, as well machine learning systems,’’ Mach. Learn., vol. 109, no. 5, pp. 1103–1126,
as the necessary adjustment of the conventional understand- May 2020.
ing of some of them in the context of machine learning [20] M. Saidur Rahman, E. Rivera, F. Khomh, Y.-G. Guéhéneuc, and
B. Lehnert, ‘‘Machine learning software engineering in practice: An indus-
(e.g., functional requirements or certain quality attributes). trial case study,’’ 2019, arXiv:1906.07154.
Therefore, we believe that the future research should con- [21] Z. Wan, X. Xia, D. Lo, and G. C. Murphy, ‘‘How does machine learning
tinue focusing on adjusting (1) the requirements engineering change software development practices?’’ IEEE Trans. Softw. Eng., vol. 47,
no. 9, pp. 1857–1871, Sep. 2021.
activities and (2) the understanding of the different require-
[22] G. Giray, ‘‘A software engineering perspective on engineering machine
ment types so they fit even better into the machine learning learning systems: State of the art and challenges,’’ J. Syst. Softw., vol. 180,
development process, as well as on presenting require- Oct. 2021, Art. no. 111031.
ments engineering experiences from real machine learning [23] A. Serban and J. Visser, ‘‘Adapting software architectures to machine
learning challenges,’’ 2021, arXiv:2105.12422.
projects. [24] A. Pereira and C. Thomas, ‘‘Challenges of machine learning applied to
safety-critical cyber-physical systems,’’ Mach. Learn. Knowl. Extraction,
vol. 2, no. 4, pp. 579–602, Nov. 2020.
REFERENCES
[25] G. Lorenzoni, P. Alencar, N. Nascimento, and D. Cowan, ‘‘Machine
[1] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, learning model development from a software engineering perspective:
no. 7553, pp. 436–444, 2015. A systematic literature review,’’ 2021, arXiv:2102.07574.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification [26] L. E. Lwakatare, A. Raj, J. Bosch, H. H. Olsson, and I. Crnkovic, ‘‘A tax-
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. Pro- onomy of software engineering challenges for machine learning systems:
cess. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105. An empirical investigation,’’ in Proc. Int. Conf. Agile Softw. Develop.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, Cham, Switzerland: Springer, 2019, pp. 227–243.
G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, [27] W. Maass and V. C. Storey, ‘‘Pairing conceptual modeling with machine
M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, learning,’’ Data Knowl. Eng., vol. 134, Jul. 2021, Art. no. 101909.
I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and [28] H. Villamizar, M. Kalinowski, and H. Lopes, ‘‘A catalogue of concerns for
D. Hassabis, ‘‘Mastering the game of go with deep neural networks and specifying machine learning-enabled systems,’’ 2022, arXiv:2204.07662.
tree search,’’ Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016. [29] H. Villamizar, M. Kalinowski, and H. Lopes, ‘‘Towards perspective-
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. based specification of machine learning-enabled systems,’’ 2022,
Cambridge, MA, USA: MIT Press, 2016. [Online]. Available: http://www. arXiv:2206.09760.
deeplearningbook.org [30] Z. Pei, L. Liu, C. Wang, and J. Wang, ‘‘Requirements engineering for
[5] A. Karpathy. (2017). Software 2.0. Accessed: May 15, 2023. [Online]. machine learning: A review and reflection,’’ in Proc. IEEE 30th Int.
Available: https://karpathy.medium.com/software-2-0-a64152b37c35 Requirements Eng. Conf. Workshops (REW), Aug. 2022, pp. 166–175.

ANA GJORGJEVIKJ received the bachelor's degree in computer science and engineering and the master's degree in computer networks and e-technologies from Ss. Cyril and Methodius University in Skopje, in 2010 and 2014, respectively, where she is currently pursuing the Ph.D. degree in computer science and engineering, with a particular focus on deep learning and natural language processing. She has more than ten years of experience as a software engineer. Her research interests include data science, machine learning, and natural language processing.

KOSTADIN MISHEV received the master's degree in computer networks and e-technologies and the Ph.D. degree in computer science and engineering from Ss. Cyril and Methodius University in Skopje, in 2016 and 2023, respectively. He is currently an Assistant Professor with the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje. His research interests include data science, semantic web, natural language processing, enterprise application architectures, web technologies, and computer networks.

LJUPCHO ANTOVSKI is currently a Professor in software engineering with the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, teaching courses related to project management, architecture of computers, mobile applications and platforms, design and architecture of software, and software requirements engineering. He has more than 20 years of experience in the IT area, with vast consultancy experience in projects related to the application of technology in elections, electronic and mobile government, public and corporate information systems, software project management, and secure IT systems.

DIMITAR TRAJANOV (Member, IEEE) received the Ph.D. degree in computer science. From March 2011 until September 2015, he was the founding Dean of the Faculty of Computer Science and Engineering. He is currently a Full Professor with the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, and a Visiting Research Professor with Boston University. He is also the Leader of the Regional Social Innovation Hub, established in 2013 as a cooperation between UNDP and the Faculty of Computer Science and Engineering. He is the author of more than 190 journal and conference papers and seven books. He has been involved in more than 70 research and industry projects, serving as a project leader in more than 40 of them. His research interests include data science, machine learning, natural language processing, FinTech, semantic web, open data, social innovation, e-commerce, technology for development, and climate change.
