
Preprint · March 2023
DOI: 10.48550/arXiv.2304.04309


Large Language Models for Business Process
Management: Opportunities and Challenges⋆

Maxim Vidgof⋆⋆1[0000−0003−2394−2247], Stefan Bachhofner⋆⋆1[0000−0001−7785−2090], and Jan Mendling1,2,3[0000−0002−7260−524X]
1 Vienna University of Economics and Business, Welthandelsplatz 1, 1020 Vienna, Austria {first_name.last_name}@wu.ac.at
2 Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany {jan.mendling}@hu-berlin.de
3 Weizenbaum Institute, Hardenbergstraße 32, 10623 Berlin, Germany

Abstract. Large language models are deep learning models with a large
number of parameters. These models have made noticeable progress on a
wide range of tasks, which allows them to serve as valuable and versatile
tools for a diverse range of applications. Their capabilities also offer
opportunities for business process management; however, these opportunities
have not yet been systematically investigated. In this paper, we address
this research problem by foregrounding various management tasks of the
BPM lifecycle. We investigate six research directions highlighting problems
that need to be addressed when using large language models, including
usage guidelines for practitioners.

Keywords: Natural language processing · Large language models · Generative Pre-Trained Transformer · Deep learning · Research challenges

1 Introduction

Recent releases of applications building on Large Language Models (LLM) have
been quickly adopted by a large circle of users. ChatGPT stands out, reaching
100 million users within two months [31]. The key factor explaining this fast
uptake is their general applicability, which makes them a general-purpose
technology. Many tasks in research can also be approached with LLM applications,
including finding peer reviewers, evaluating manuscripts and grants, improving
prose in manuscripts, and summarizing texts [32]. For this reason, some argue
that LLM – especially conversational LLM – are a "game-changer for science" [6].

⋆ This research received funding from the Teaming.AI project, which is part of the Eu-
ropean Union’s Horizon 2020 research and innovation program under grant agree-
ment No 957402. The research by Jan Mendling was supported by the Einstein
Foundation Berlin under grant EPP-2019-524 and by the German Federal Ministry
of Education and Research under grant 16DII133.
⋆⋆ Equal contribution
Preprint submitted to arXiv

Much of the current discussion of applications like ChatGPT is concerned
with the question of how well they work now and in the future. We believe that
this question needs to be approached with a clearly defined task in mind.
Starting with a task focus moves the discussion away from funny or disturbing
errors and biases [31] towards how the collaboration between human experts and
LLM applications can be organized. Furthermore, this bears the chance to learn
about specific categories of failures, which will eventually help to refine the
technology in a systematic way.
In this paper, we address the research challenge of how LLM applications can
be integrated at different stages of business process management. To this end,
we refer to the BPM lifecycle [8] and its various management tasks [13]. Our
research approach is exploratory in the sense that we developed strategies for
how LLM applications can be integrated into specific BPM tasks. We observe
various promising usage scenarios and identify challenges for future research.
The paper is structured as follows. Section 2 discusses the essential concepts
of Deep Learning (DL) and LLM in relation to Business Process Management
(BPM) practice. In Section 3, we identify and discuss LLM applications within
BPM along the different BPM lifecycle phases. Based on these applications,
Section 4 describes six core research directions, ranging from how LLM change
the dynamics and execution of BPM projects to data sets and benchmarks specific
to BPM. Section 5 identifies challenges when using LLM. Furthermore, we provide
an outlook on how LLM might evolve in the future.

2 Background

The advent of LLM applications paves the way towards a plethora of new BPM-
related applications. So far, BPM has adopted natural language processing [1],
artificial intelligence [7], and knowledge graphs [14] to support various applica-
tion scenarios. In this section, we discuss the foundations of DL (Section 2.1)
and LLMs (Section 2.2). In this way, we aim to clarify their specific capabilities.

2.1 Deep learning

Recent LLM applications build on machine learning and deep learning models,
such as recurrent neural networks (RNNs) and transformer networks. Machine
Learning (ML) studies algorithms that are "capable of learning to improve their
performance of a task on the basis of their own previous experience" [15]. In
essence, ML techniques use supervised learning, unsupervised learning, or
reinforcement learning as a paradigm. Several of them are relevant for LLM.
In supervised learning, the ML algorithm receives as an input a collection
of pairs, where one pair consists of features representing a concept, along with
a label. Importantly, this label is task specific and encodes what the algorithm
should learn about the concepts. Such labels can be, for instance, spam and no
spam for a spam classifier, or bounding boxes with annotations for an image.
There are two cases of supervised learning that are relevant for LLM: few-shot
and zero-shot learning. Few-shot learning is when an ML algorithm adapts to
a new situation with a small amount of labelled data, and zero-shot learning is
when the algorithm can do so with no labelled data at all. For example, a
language model can be provided with a few input-output pairs, and the model
can infer the mapping function without any parameter changes. In unsupervised
learning, the algorithm only receives a feature tensor of a concept as an
input and the desired output is unknown. The algorithm then finds structural
properties of the concepts present in the feature tensor. A typical application
is dimensionality reduction, for instance using auto-encoders. In reinforcement
learning, the algorithm receives a feature tensor of a concept as an input for
which an output is produced, which is then evaluated through rewards. The
algorithm then uses this feedback to improve its parameters. ChatGPT uses a
form of reinforcement learning known as deep reinforcement learning to improve
its language generation capabilities, following the approach of "Learning to
summarize from human feedback" [29]. ChatGPT is fine-tuned using a reward
signal that assesses
the quality of its generated responses, with the goal of maximizing the reward
signal over time. The model’s ability to learn from the reward signal allows it to
generate increasingly relevant and coherent responses.
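To make the few-shot idea concrete, the sketch below assembles a few-shot prompt from input-output pairs; the `build_few_shot_prompt` helper, the example pairs, and the spam task are hypothetical illustrations, not part of any particular LLM API.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: demonstrations followed by the new input.

    The model is expected to infer the mapping from the demonstrations
    alone, without any parameter updates (in-context learning).
    """
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Hypothetical task: map an e-mail subject to a spam/no-spam label.
examples = [
    ("You won a free cruise!!!", "spam"),
    ("Agenda for Monday's process review", "no spam"),
]
prompt = build_few_shot_prompt(examples, "Claim your prize now")
print(prompt)
```

The resulting string would be sent to a language model as-is; with zero examples, the same format degenerates into a zero-shot prompt.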

Deep learning (DL) is an ML method based on Neural Networks (NNs). In
general, DL models are NNs with many layers stacked on top of each other, which
enables them to learn multiple layers of representations [10]. Importantly, these
representations can be learned without supervision. Networks with only one
hidden layer are called shallow. Deep networks are able to handle more complex
problems than shallow networks. The availability of large amounts of data,
improvements in how to speed up optimization, and powerful computing resources
enable deep networks to be trained effectively. In the context
of natural language processing, deep learning has been particularly effective in
tasks such as machine translation, sentiment analysis, and named entity recog-
nition. The ability of deep learning to learn multiple layers of representations
from input data has proven to be particularly powerful for these tasks. This is
because natural language processing involves dealing with sequences of words
and characters, and the relationships between these sequences are often com-
plex and multi-layered. The use of large amounts of labeled training data and
powerful computational resources has enabled deep learning models to achieve
state-of-the-art results in many Natural Language Processing (NLP) tasks. For
example, the transformer architecture, introduced in the paper "Attention is All
You Need" by Vaswani et al. [33], has become the standard architecture for many
NLP tasks, including language translation and language modeling.
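The core operation of the transformer architecture mentioned above is scaled dot-product attention. A minimal sketch in plain Python (toy dimensions and values chosen for illustration):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # one weight per key, summing to 1
        # Weighted sum of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Toy example: 1 query, 2 keys/values, dimension 2.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

A real transformer applies this with learned projection matrices, multiple heads, and batched tensor operations; the arithmetic per head is the same.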

In recent years, BPM research has integrated the capabilities of deep learning
to a large extent for process prediction. For an overview, see [16]. There are also
recent applications for automatic process discovery [28], for generating process
models from hand-written sketches [27], and for anomaly detection [17].

2.2 Large language models

LLM are DL models trained on vast amounts of text data to perform various
natural language processing tasks. These models, which typically range from
hundreds of millions to billions of parameters, are designed to capture the
complexities and nuances of human language. Models such as GPT-1 and GPT-3
are capable of generating human-like text, answering questions, translating
languages, and writing computer code. The training process of these models
involves processing massive amounts of text data, from which they learn
patterns and relationships between words and phrases. These models then use this
information to predict the likelihood of a given token, or sequence of tokens,
in a specific context. This allows them to generate coherent and contextually
relevant text or perform other language-related tasks. The rise of large language
models has resulted in significant advancements in the field of NLP, and they
are widely used in various applications, including chatbots, virtual assistants,
and text generation systems. One of their strengths is their ability to perform
few-shot and zero-shot learning with prompt-based learning [11].
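The next-token objective described above can be illustrated, in miniature, by a bigram model that estimates token probabilities from counts; real LLMs learn these statistics with transformer networks rather than explicit counting, so this is only a conceptual sketch with an invented corpus.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, how often each successor token follows it."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, token):
    """P(next | token) as relative frequencies of observed successors."""
    total = sum(counts[token].values())
    return {t: c / total for t, c in counts[token].items()}

corpus = "the process starts the process ends the order arrives"
model = train_bigram(corpus)
print(next_token_probs(model, "the"))  # 'process' is twice as likely as 'order'
```

An LLM does the same kind of conditional prediction, but conditions on a long context window rather than a single preceding token.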
In 2018, Radford et al. introduced GPT-1 (also sometimes called simply
GPT) in their paper on ”Improving language understanding by generative pre-
training” [23]. Generative Pre-trained Transformer (GPT)-1 refers to the largest
model the authors have trained (110 million parameters). In the paper, the au-
thors studied the ability of transformer networks trained in two phases for lan-
guage understanding. In the first phase, they trained a transformer network to
predict the next token given a set of tokens that appeared before (also called un-
supervised pre-training, generative pre-training, or in statistics auto-regressive).
In the second phase, the transformer network was fine-tuned on tasks with
supervised learning (also called discriminative fine-tuning). In summary, their
major finding is that combining task-agnostic unsupervised learning in the first
phase with supervised fine-tuning on tasks in the second phase can lead to
performance gains – from 1.5% on textual entailment to 8.9% on commonsense
reasoning.
In 2019, Radford et al. introduced GPT-2 in their paper ”Language Models
are Unsupervised Multitask Learners" [24]. Again, GPT-2 refers to the largest
model they have trained. GPT-2 is hence a scaled-up version of GPT-1 in both
model size (1.5 billion parameters) and training data size. In particular, GPT-2
has more than ten times the number of parameters of GPT-1, and is trained on
roughly ten times the amount of training data. They report two major findings.
First, the unsupervised GPT-2 can outperform language models that are trained
on task-specific data sets, without these data sets being part of GPT-2's
training data. Second, GPT-2 seems to learn
tasks (for example question answering) from unlabeled text data. In both cases,
however, the performance did not reach the state-of-the-art. In summary, their
major finding is that LLMs can learn tasks without the need to train them on
these tasks, given that they have sufficient unlabeled training data.
In 2020, Brown et al. introduced GPT-3 with the paper "Language Models
are Few-Shot Learners" [5]. Unlike the above two cases, GPT-3 refers to all the
models the authors have trained, i.e. it refers to a family of models. The largest
model the authors have trained is GPT-3 175B, a model with 175 billion parameters.
In their paper, the authors showed that language models like GPT-3 can
learn tasks with only a few examples, hence the title includes "few-shot
learners". The authors demonstrated this ability by prompting GPT-3 with only a
small number of examples for various tasks, including question answering and
language translation, without updating the model's parameters.
In 2023, OpenAI introduced GPT-4 [21]. Contrary to previous versions of
GPT, this version is a multimodal model, as it can process both text and images
as input to produce text. The model is a major step forward as it improves
on numerous benchmarks; however, like previous GPT models, it suffers from
reliability issues, a limited context window, and an inability to learn from
experience. This release, however, diverges from previous GPT models as OpenAI is
secretive about "the architecture (including model size), hardware, training
compute, dataset construction, training method, or similar". We only know that
the model is a transformer-style model, pre-trained to predict the next token
on publicly available data and undisclosed licensed data, and then fine-tuned
with Reinforcement Learning from Human Feedback (RLHF). Notwithstanding
this departure, the authors include in their report findings on predicting model
scalability. In particular, they report on predicting the loss as a function of
compute, and the mean log pass rate (a measure of how many code samples pass a
unit test) as a function of compute, given a training methodology. In both cases,
they find that they could predict the respective measure with high accuracy
using data generated with significantly less compute (1,000 to 10,000 times less).
They also report on an inverse scaling prize task, where the performance
on the task first decreases as a function of model size and then increases after a
particular model size.
In 2022, OpenAI introduced a conversational LLM called ChatGPT [19].
As a model, the first version of ChatGPT was based on GPT-3.5 and is a sibling
of InstructGPT. GPT-3.5 is a GPT-3.0 model trained on a data set that contains
text and software code up to the fourth quarter of 2021 [18]. InstructGPT was
introduced in "Training language models to follow instructions with human
feedback" [22], and is a GPT-3 model fine-tuned with supervised learning in a
first step and with reinforcement learning from human feedback [29] in a second
step. ChatGPT is hence a GPT-3.5 model fine-tuned for conversational interaction
with the user. In other words, the user interacts with the model via a sequence
of text messages (the conversation) to accomplish a task. For example, we can
copy and paste a text into ChatGPT's input field and ask it to summarize it. We
can even be more specific: we can say that the summary should be 10 sentences
long and written in a preferred style. Importantly, if we are unhappy with the
result, we can ask ChatGPT to refine its own summary without copying and
pasting the text again. At the time of this writing, ChatGPT can be used with
GPT-4 as the backend LLM.
There are also other large language models. In 2022, Zhang et al. introduced
Open Pre-trained Transformer (OPT) with the paper "OPT: Open Pre-trained
Transformer Language Models" [34]. The main contribution of that paper is
that it makes all artifacts, including the nine models, available for interested
researchers. These models are GPT-3-class models in parameter size and
performance. Another open LLM is the BigScience Large Open-science Open-access
Multilingual Language Model (BLOOM) (176 billion parameters), which was
developed in the BigScience Workshop [26].

2.3 Uptake of large language models

Above, we briefly discussed LLMs, focusing particularly on the GPT model
family as these are, we hypothesise, the most popular LLMs. It is important to
recognize the transition from GPT-3 to GPT-4, as it brought a massive increase
on a variety of benchmarks, particularly on academic and professional exams [21].
These performance increases in NLP tasks are a result of natural language
understanding and have, as we argue, massive implications for what can be
automated – the automation frontier. This frontier is arguably shifted further
when natural language understanding is combined with plugin software components.
In fact, at the time of writing, the company behind GPT is experimenting with
ChatGPT plugins. Among the currently offered plugins are Klarna, Wolfram,
integrations with vector databases for information retrieval, and an embedded
code interpreter for Python [20]. This has an impact on Robotic Process
Automation (RPA), and more broadly on business process automation, including
Business Process Management Systems (BPMSs), and more generally on how
work is carried out.

3 Large language models and the BPM lifecycle

In this section we identify applications of LLM within BPM. We systematically
explore these applications along the phases of the BPM lifecycle, namely
identification, discovery, analysis, redesign, implementation, and monitoring [8].
In this way, we complement recent efforts to build an overarching inventory of
LLM applications in other fields, such as data mining.4

3.1 Identification

The BPM lifecycle starts with Identification. Normally, at this stage there is not
much structured process knowledge available in the company, and the relevant
information has to be extracted from heterogeneous internal documentation. This
is exactly where LLM shine, as they can quickly scan and summarize large
volumes of text, highlighting important documents or directly outputting the
required information.
4 See, for example, the OpenAI Cookbook GitHub repository, which provides code examples for the OpenAI API.

Identifying processes from documentation The idea is to give the LLM all relevant
documentation existing in the organization as input. This can include legal
documents, job descriptions, advertisements, internal knowledge bases and
handbooks. The LLM is then tasked to identify which processes take place in
the organization. It can further be instructed to classify the input documents
according to the processes they describe. Multimodal LLMs can improve the
results even further, as charts, presentations and photos can also be used
directly as information sources.

Process selection LLM can further be asked to assess the strategic importance of
processes based on, e.g., the number and types of documents that refer to them,
as well as to extract this information from process descriptions. If given access
to information systems supporting the process or to other KPIs, LLM can also
assess process health. Finally, assessing feasibility is also theoretically
achievable, as long as the necessary information, e.g. recent technology reports,
is given as input as well. Based on these criteria, LLM can prioritize the
processes for further improvement.

3.2 Discovery
The second stage of the BPM lifecycle is Process Discovery. At this stage, one or
a combination of process discovery methods is selected to produce process models.
When one speaks of automated process discovery, one usually means process
mining – a technique for extracting process models and other relevant data from
the event logs left by information systems supporting the execution of a process.
However, with LLM, other discovery techniques can also benefit from (at least
partial) automation.

Process discovery from documentation Apart from process mining, documentation
analysis is an established process discovery method. In this method, the process
analyst uses information found in heterogeneous sources such as internal
documentation, job advertisements, handbooks, etc. Searching these documents
might require a lot of time and effort. LLM are highly suitable for this task,
as they can summarize large volumes of text in a concise and structured way.
More precisely, they can output process descriptions in a desired format (plain
text, numbered lists, etc.). One can also specify the level of detail, i.e. whether
the output should include only the activities and events or also resources and
additional information. Finally, as some LLM are also capable of working with
structured document formats such as XML, in fact even BPMN models can be
produced automatically.
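Since BPMN models are serialized as XML, the output of such a discovery prompt can be checked mechanically before being handed to an analyst. The sketch below parses a hypothetical LLM-generated fragment with Python's standard `xml.etree`; the element names follow the BPMN 2.0 namespace, but the model content itself is invented for illustration.

```python
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"

# Hypothetical fragment, as an LLM might return it for a small ordering process.
llm_output = f"""
<definitions xmlns="{BPMN_NS}">
  <process id="order_process">
    <startEvent id="start"/>
    <task id="check_order" name="Check order"/>
    <endEvent id="end"/>
    <sequenceFlow id="f1" sourceRef="start" targetRef="check_order"/>
    <sequenceFlow id="f2" sourceRef="check_order" targetRef="end"/>
  </process>
</definitions>
"""

def extract_tasks(xml_text):
    """Parse the XML and return the task names; raises if the XML is malformed."""
    root = ET.fromstring(xml_text)
    return [t.get("name") for t in root.iter(f"{{{BPMN_NS}}}task")]

print(extract_tasks(llm_output))  # ['Check order']
```

A failed parse is a cheap, automatic signal that the model's output needs to be regenerated or repaired before any semantic review.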

Process discovery from communication logs Another information source that can
be used in evidence-based discovery is communication logs, i.e. e-mails and chats
between process participants: internal employees but also external partners and
customers. LLM can extract patterns from these communication logs, which can
be seen as the various steps of a process. They can then similarly produce process
descriptions or models.

Interview chat bot Possible applications of LLM in process discovery can also go
beyond evidence-based discovery. Another common discovery method is interviews
with domain experts. In these interviews, the process analyst asks questions
about the process and produces a process model based on several interviews.
Typically, several separate interviews with different domain experts are required
to produce the first version of the process model. Afterwards, additional rounds
of interviews are conducted in order to collect and incorporate feedback and to
perform validation. In the worst case, domain experts might have conflicting
perceptions of the process; resolving such conflicts then becomes a very difficult
and time-consuming task for both the process analyst and the domain experts.
LLM can solve parts of this problem by providing a chat bot interface for
domain experts, so that the domain experts answer questions in the chat.
This can bring several advantages. First, the domain experts do not have to
allocate lengthy time slots for interviews but can instead talk with the chat bot
at their desired pace. Second, the feedback loop gets shorter, as LLM can produce
process models directly after or even during the conversation with the domain
expert and also update the model, so that validation can happen simultaneously
with model creation. Finally, the benefits will only grow if multiple domain
experts interact with the chat bot simultaneously (and independently) while the
chat bot can use all of this input in the conversations. The latter option is,
however, more difficult to implement.

Combined process discovery All process discovery methods have their advantages
and drawbacks. Often, a combination of these methods is used to achieve the best
results. However, this combination is limited by the resources allocated to the
process discovery task. The discovery methods presented above give valuable
output while requiring much fewer resources. Thus, it is possible to apply more
of them simultaneously for an even better result. The combination of these
methods can be used in addition to traditional process mining or "manual"
process discovery, which will provide the richest insights. While the results
of different methods may have some inconsistencies that will have to be fixed,
fixing them can also be done in a (semi-)automated manner.

Process model querying As LLM seem to "understand" process models serialized
as XML, they can be used to answer questions about a model. This can be
very useful for quality assurance. First of all, it can be used for checking
syntactic quality. While there are tools that can do this already, and with much
less overhead, it is still convenient to have this feature in LLM because LLM, in
contrast to other methods, may be able to check other quality aspects as well. For
instance, they can also check semantic quality. Indeed, a process analyst can give
the LLM both an interview transcript and a process model as input, and the LLM
can check validity and completeness based on this interview. It must be noted,
of course, that this will only work under the assumption that the interview
transcript itself has these features of validity and completeness. Another way of
checking the semantic quality of the model would be via process simulation, e.g.
to explicitly ask the LLM whether the given process model could have produced
a given execution sequence, or to ask the LLM to give possible execution
sequences that can be generated by the model. LLM are known to be able to
simulate a Linux shell, for instance, so they might also be able to simulate a
BPMS execution engine as long as enough input is provided. Finally, LLM can
also (at least to some extent) check the pragmatic quality of the models, as long
as some definition of guidelines, e.g. 7PMG, is provided as input as well. It must
also be noted that LLM can not only spot these quality issues but also suggest
fixes.
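Whether a model could have produced a given execution sequence can also be checked deterministically; the sketch below replays a trace against a sequential model's allowed transitions (a strong simplification: no gateways or concurrency, and the activity names are invented):

```python
def could_produce(transitions, start, end, trace):
    """Replay a trace against allowed activity-to-activity transitions.

    `transitions` maps each node to the set of nodes that may follow it.
    This handles only sequences and exclusive choices, not concurrency.
    """
    current = start
    for activity in trace:
        if activity not in transitions.get(current, set()):
            return False  # this step is not allowed by the model
        current = activity
    return end in transitions.get(current, set())  # trace must reach the end

# Hypothetical order process: start -> Check order -> (Ship | Reject) -> end
transitions = {
    "start": {"Check order"},
    "Check order": {"Ship", "Reject"},
    "Ship": {"end"},
    "Reject": {"end"},
}

print(could_produce(transitions, "start", "end", ["Check order", "Ship"]))  # True
print(could_produce(transitions, "start", "end", ["Ship", "Check order"]))  # False
```

Such a deterministic check can serve as a cross-validation for the LLM's simulated answers, since the LLM may hallucinate execution sequences.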

3.3 Analysis
The next stage is Process Analysis. At this stage, the discovered processes are
analyzed to find problems and bottlenecks. While this is a cognitively loaded
task, LLM can be used to help human analysts in some regard.

Issue discovery If an issue exists in a process, chances are high that somebody
has already complained about it. Depending on the company, product, and process,
this can be a customer, partner or employee, and it can happen on different
platforms, including social media, support services or internal communication
tools. LLM are good at summarizing large volumes of unstructured text as well
as finding patterns, and this capability can be used for this task. It is as easy as
scraping the text from these platforms and giving it as input to the LLM
with a simple prompt like "find all things customers have complained about".

Issue spotting After an issue in the process is found, the next step is to spot the
part of the process that creates this issue. In some cases, especially in a complex
process, this can be a difficult task. The idea here is to give the LLM all process
models (or only the model of the relevant process, in case it is known exactly
which process causes the issue) together with the spotted problems. The task of
the LLM is then, by analyzing task names and descriptions, to make suggestions
as to which tasks may be responsible for the issue. In advanced cases, LLM might
even be capable of suggesting fixes. This might be something as simple as
suggesting to automate a manual task that takes too long, but it might also be a
more complex process redesign suggestion, as long as the LLM is given redesign
methods as additional input or is trained on redesign methods as well.

3.4 Redesign
The fourth phase of the BPM lifecycle is Process Redesign. In this stage, process
improvement suggestions are developed based on discovered issues and general
process improvement methods. These suggestions are evaluated, and a to-be
process model is developed at the end of this stage.

Business process improvement An obvious yet very promising use case is to just
ask the LLM to redesign the process. As already mentioned, simple issues arising
from just one activity can be fixed by the LLM. However, it does not stop there;
theoretically, it only depends on the quality of the input given to the LLM.
Indeed, if it is given exhaustive information about the process (a detailed process
model as well as a description of the process or tasks) and a detailed description
of some redesign method (or it is trained on some redesign methods), redesigning
the business process is as simple as telling the LLM to apply the method to
the process. This can, however, be improved even further. First, the description
of the issues discovered in the previous phase can be given as additional input
to guide process redesign to fix those first. Second, the LLM can be instructed to
apply different redesign methods and to give separate lists of suggestions
produced by each of them, so that the analyst can then select the best options.
Moreover, the LLM itself can be asked to choose the best suggestions and motivate
its choice. It must be noted, of course, that this will only work if sufficient
input is given. For instance, for inward-looking redesign methods, the methods
themselves as well as detailed process information are required. For
outward-looking methods, in addition to that, there should be enough outside
information and/or a way for the LLM to properly communicate with the outside
world.

3.5 Implementation
The next phase of the BPM lifecycle is Process Implementation. It covers the
organizational and technical changes required to change the way of working of
process participants as well as the IT support for the to-be process.

BPMN model explanations with plain text As mentioned, LLM can work with
BPMN models serialized in XML. We have already discussed how LLM can
manipulate process models in order to increase quality as well as to suggest or
incorporate redesign ideas. To close the circle, LLM can also produce textual
explanations of BPMN models. What is more interesting, one can control the level
of detail as well. Thus, depending on the target audience, the LLM can produce a
textual overview but also detailed descriptions of the models. It can even
transform a model into requirements for software developers if enough details
are contained in the BPMN model itself.

BPMN model chatbot Building on top of the previous use case, the model
description can also be tailored to each specific user. This way, given a model
or – better – a model repository with additional documentation, the LLM can
prepare specific descriptions for, e.g., the process owner, but also for individual
participants, for whom all the specific tasks they are responsible for are
described and explained in detail. Furthermore, in this use case one can add
interaction between the user and the LLM. This way, the user may ask for
clarification of parts they did not understand or generally ask for more details
wherever guidance is required.

Process orchestrator LLMs can be accessed via APIs and can at the same time
access APIs themselves, which opens up a huge variety of opportunities. While
the former means an LLM can be used for automated tasks and be called by the
orchestrator, the latter means that it could theoretically be an orchestrator
itself: given an executable process model and additional constraints as context,
as well as the required instance data as input, it can theoretically execute a
process by calling other APIs and assigning tasks in a more flexible way than a
traditional orchestrator.
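A minimal sketch of this orchestration idea: a loop walks a sequential process model and dispatches each step to a handler (a stubbed service API here). All names (`handlers`, the step list, the order data) are hypothetical; an LLM-driven orchestrator would choose the next step and handler dynamically instead of following a fixed table.

```python
def run_process(steps, handlers, instance):
    """Execute a sequential process model by dispatching each step to its handler.

    `steps` is the ordered list of activities; `handlers` maps each activity
    to a callable that takes and returns the process instance data.
    """
    log = []
    for step in steps:
        instance = handlers[step](instance)  # call the service / worklist API
        log.append(step)                     # record the executed step
    return instance, log

# Hypothetical order process with stubbed handlers.
handlers = {
    "check_order": lambda d: {**d, "checked": True},
    "ship_order": lambda d: {**d, "shipped": True},
}
result, log = run_process(["check_order", "ship_order"], handlers, {"order_id": 42})
print(result)  # {'order_id': 42, 'checked': True, 'shipped': True}
```

In the envisioned LLM setting, the fixed `steps` list would be replaced by the model reading the executable process definition and constraints, and deciding the next call itself.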

3.6 Monitoring
The last phase of the BPM lifecycle is Process Monitoring. In this phase, the
implemented processes are executed and their performance is monitored. The
observations collected here are used for operational management and serve as
input for further iterations of the lifecycle.

Process dashboard chatbot Dashboards are a powerful tool that provides an
overview of the most important KPIs of a process on a single screen. However,
their ultimate goal is to tell the viewer whether the status of the process is good
or not; the numbers and colors are mostly an intermediary medium. An LLM
can take away this intermediate step and let the user directly query the status
of their processes.
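As a sketch, removing the intermediate step could be as simple as rendering the dashboard KPIs into a prompt that asks for a direct verdict; the KPI names below are invented for illustration, and the resulting prompt would be sent to any LLM endpoint:

```python
# Sketch: turning dashboard KPIs into a prompt that asks the LLM for a
# direct natural-language status verdict instead of numbers and colors.

def status_prompt(kpis: dict) -> str:
    """Render the KPI values as a bullet list inside an instruction."""
    lines = "\n".join(f"- {name}: {value}" for name, value in kpis.items())
    return (
        "Given the following process KPIs, state in one sentence whether the "
        "process status is good or needs attention, and why:\n" + lines
    )
```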

4 Research directions
In this section we propose research directions, which we categorize into three
groups. The first group studies the use of LLMs, and their applications, in
practice. This includes their use within BPM projects in companies or as part
of an Information System (IS) (Section 4.1), the development of usage guidelines
for practitioners and researchers (Section 4.2), and the derivation of BPM tasks
(Section 4.3) and their corresponding data sets (Section 4.4). The second group
studies how LLMs can be combined with existing BPM tools, and more generally
BPM technologies, to improve the user experience (Section 4.5). Crucially, this
group draws on findings from the first group. The
third and final group develops large language models specifically for business
process management, so these models can understand the context and language
of business processes and support various tasks, such as process discovery, mon-
itoring, analysis, and optimization (Section 4.6). Again, this group builds upon
the findings of the first group.

4.1 The use of large language models in BPM practice


The first research direction studies the use of LLMs in practice. One major
question to answer is for which tasks LLMs can be used. In Section 3, we present
a list of such tasks. However, this list might not be complete; in addition, some
of the tasks might turn out to be of little use. Tied to this is the question which
tasks will bring value, and ultimately the most value, for an organization. The
next big question is the relation between a task and the model properties needed
to achieve a pre-defined value. One question here is which tasks can be achieved
with already existing models. Another question to study is whether we always
need the largest, and hence most accurate, model for each task. We hypothesize
that this might not be the case. Finally, and most importantly, the next big
question is how LLMs will change how work is carried out within BPM projects,
and within processes that are actively managed. We hypothesize, for example,
that conversational LLMs might take the spot of the duck in the famous rubber
duck approach5. This is a socio-technical systems question, and we hence strongly
believe that the BPM community, and the information systems community more
broadly, is especially well equipped to contribute to it.

4.2 Usage guidelines for researchers and practitioners

The second research direction builds usage guidelines for BPM researchers and
practitioners. One question such guidelines have to answer is which LLM to
suggest, given an organizational context, the lifecycle phase, and the process
context of a task, to achieve an expected value. In addition, such guidelines
should systematically collect best practices for creating prompts. For example,
for the BPM lifecycle phases process implementation and monitoring and
controlling, a company might consider using an LLM within a managed process.
Let us assume this company is a bank and wants to automate the task of replying
to customer inquiries with an LLM. For process implementation, the guideline
then proposes a specific LLM, with the number of parameters it has, gives
examples of how to create a prompt template, how to fill the template with
customer background information, and finally how to integrate the customer
inquiry into the prompt template. For process monitoring and controlling, the
guidelines might propose a different model for analyzing different inquiry
clusters, as the lifecycle phase context is different. As an example, consider that
the LLM first categorizes each inquiry into positive and negative sentiment, and
then lists the top five inquiry reasons for each. This research direction builds
upon the first one, as the first direction, among others, determines the tasks for
which LLMs can be used in principle.
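A minimal sketch of such a prompt template for the bank example, with hypothetical field names, could look as follows:

```python
# Sketch of the guideline example above: a prompt template for automated
# replies to bank customer inquiries, filled with customer background and
# the inquiry text. Template wording and field names are assumptions.

REPLY_TEMPLATE = (
    "You are a customer support assistant of a bank.\n"
    "Customer background: {background}\n"
    "Customer inquiry: {inquiry}\n"
    "Write a polite reply that only uses facts from the background above."
)

def build_reply_prompt(background: str, inquiry: str) -> str:
    return REPLY_TEMPLATE.format(background=background, inquiry=inquiry)
```

A guideline would additionally specify which model to send this prompt to, and how much background information may be included for privacy reasons.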

4.3 Creation, release, and maintenance of task variants specific to BPM

This research direction builds and maintains two different task lists. The first
list maps general NLP tasks to tasks within BPM. As an example, consider
the general NLP task of text summarization. Within BPM, text summarization
can relate to summarizing a set of process descriptions or task descriptions. We
can think of this list as a one-to-many mapping between NLP tasks on the one
hand and BPM tasks on the other. The second list enumerates tasks that are
unique to BPM. This research direction uses the findings from the directions
presented in Section 4.1 and Section 4.2.
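As a sketch, the first list could be represented as a simple one-to-many mapping; the entries below are examples taken from this paper, not a complete list:

```python
# Sketch of the first task list as a one-to-many mapping from general NLP
# tasks to BPM task variants. The entries shown are illustrative examples.

NLP_TO_BPM_TASKS = {
    "text summarization": [
        "summarize a set of process descriptions",
        "summarize task descriptions",
    ],
    "question answering": [
        "answer participant questions about a process model",
    ],
}

def bpm_tasks_for(nlp_task: str) -> list:
    """Return the BPM task variants of a general NLP task (empty if none mapped)."""
    return NLP_TO_BPM_TASKS.get(nlp_task, [])
```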
5 Rubber duck debugging

4.4 Creation, release, and maintenance of data sets and benchmarks


Public data sets and benchmarks are crucial for the progress of LLM research,
as they allow researchers to measure progress. In addition, they are also
important for practitioners, as they define data set properties (such as meta-
information) that practitioners are likely to need themselves when they fine-tune
a model. As a result, data sets and benchmarks need to be properly aligned with
the automation needs of BPM. Blagec et al. argue similarly, but for the clinical
profession [3]. In their study, they analyzed 450 NLP data sets and found that
"AI benchmarks of direct clinical relevance are scarce and fail to cover most
work activities that clinicians want to see addressed". A research direction for
the BPM community is hence to conduct the same analysis for BPM. One
question worth studying is whether existing NLP data sets and benchmarks are
of relevance to BPM, for example, whether they cover the activities of BPM
researchers and practitioners. This research direction builds upon the research
direction in Section 4.3.

4.5 LLM and BPM artifacts


This research direction studies the interplay of LLMs, BPM artifacts, and BPM
tasks. The goal is to understand which artifacts are necessary for LLMs, and their
multimodal successors, to create useful outputs. It can hence be understood as
a special case of prompt engineering, which we might call multimodal prompt
engineering for BPM. This is an important research direction, as the output
quality of an LLM depends heavily on the quality and quantity of the context it
is given. In other words, the more context, and the higher the quality of that
context, the higher the output quality of the LLM. For this reason, we believe
it should be considered a research direction of its own. As an example, consider
again the customer inquiry process from above. In this case, we can imagine
that the context of the LLM depends on the inquiry. In one case, the customer
might include an image in the inquiry. Or think of the redesign phase of the
inquiry process. During this phase, artifacts are created, for example drawings
of processes on a board, comments on these processes in a word processor, and
remarks on data availability and access in an audio file. This information might
be useful when we ask a possibly different LLM why a customer inquiry on
current special offers cannot yet be answered. The reason here might be that
a central system which stores special offers does not yet exist. This research
direction builds upon the directions presented in Section 4.1 and Section 4.2.
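As a sketch, such heterogeneous artifacts could be collected in a simple data structure before being handed to a (future multimodal) LLM; the `Artifact` structure and the text-first ordering heuristic are our own assumptions:

```python
# Sketch of multimodal prompt engineering for BPM: gathering heterogeneous
# BPM artifacts into one context object. Structure and ordering are
# illustrative assumptions, not an existing API.

from dataclasses import dataclass

@dataclass
class Artifact:
    modality: str  # e.g. "text", "image", "audio"
    source: str    # e.g. "whiteboard drawing", "word processor comment"
    content: object

def build_multimodal_context(artifacts: list, question: str) -> dict:
    """Put textual artifacts first, assuming current models handle text best."""
    ordered = sorted(artifacts, key=lambda a: a.modality != "text")
    return {"artifacts": ordered, "question": question}
```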

4.6 Development and release of Large Language Models for Business Process Management
This research direction studies how LLMs are built for BPM tasks; all previously
discussed research directions are the foundation for it. The goal is to build LLMs
that are attuned to the specific challenges and requirements of BPM, compared
to general-purpose language models. This includes models built exclusively for
BPM, and also general-purpose language models that are fine-tuned on the
BPM domain. An important aspect of this direction is to open-source the created
LLMs, as is done for OPT [34]. This is important so that researchers can use
these models in their studies, and for practice, as companies can use them free
of charge for their use cases.

5 Discussion
In this section we discuss the challenges of LLMs, the power of combination and
the managing of inflated expectations, and close with an outlook on future work.

Challenges The use of LLMs entails opportunities and challenges. For example,
they can help to understand difficult research, but they also carry over
deficiencies (including factual errors) in the training data set to the texts they
generate [32]. In a systematic study of these errors, Borji analyzes and
categorizes errors of ChatGPT; the author further outlines and discusses the
risks, limitations, and societal implications of such models6 [4]. The failure
categories identified by the author include reasoning, factual, math, and coding
errors. A similar deficiencies study was done in [2], but those authors focus on
LLMs in general. A news feature in Nature discusses these issues and the risks
of using LLMs [9]. One consequence for education might be that essays as an
assignment should be reconsidered [30].

The power of combination and managing expectations The major innovation of
ChatGPT was not the introduction of a new technology, but the combination of
already existing ones with an easy-to-use user interface [12]. This effect of
combination extends beyond LLM, NLP, or ML innovations. For example, OpenAI
is currently experimenting with integrating ChatGPT with software plugins,
which might even in the short run lead to a software marketplace for their
platform7. For this reason, we suggest and advocate in our research directions
above to study and build such combinations with existing BPM technologies,
instead of solely focusing on developing new ones. In this paper, we have so far
made the case for the opportunities LLMs open up, briefly discussed their
shortcomings, and pointed out how important it is to combine technologies
within a field and across field boundaries. However, we also stress how important
it is to manage the, possibly even overshooting, expectations driven by these
very recent developments. For example, speculation about the possible
capabilities of the successor of GPT-3 was driven up by the hype to a point
where "people are begging to be disappointed" [12].

Outlook and future work LLMs are used, and will continue to be used, in
commercial products with huge numbers of users. We speculate that this will
have an effect on research, as funding agencies might increase the number of
grants for this research field. An ever-increasing user base that interacts with
LLMs (directly or indirectly) is therefore, in our view, inevitable. For future
work, we plan to develop research directions that are beyond the scope of this
paper. We expect that LLMs will have an effect on how work is carried out (see
Section 2.3 and Section 4.1). But this may have far greater impacts than what
we cover here, for example on the BPM capabilities, which are strategy,
governance, information technology, people, and culture [25].

6 See the ChatGPT failure archive (GitHub) for an up-to-date list
7 https://openai.com/blog/chatgpt-plugins

6 Conclusion

In this paper we present six research directions for studying and building LLMs
for BPM. We use the BPM lifecycle to propose applications of LLMs that
showcase the impact of these models.

References
1. Van der Aa, H., Carmona Vargas, J., Leopold, H., Mendling, J., Padró, L.: Chal-
lenges and opportunities of applying natural language processing in business pro-
cess management. In: COLING 2018: The 27th International Conference on Com-
putational Linguistics: Proceedings of the Conference: August 20-26, 2018 Santa
Fe, New Mexico, USA. pp. 2791–2801. Association for Computational Linguistics
(2018)
2. Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dan-
gers of stochastic parrots: Can language models be too big? In: Proceedings
of the 2021 ACM Conference on Fairness, Accountability, and Transparency.
p. 610–623. FAccT ’21, Association for Computing Machinery, New York, NY,
USA (2021). https://doi.org/10.1145/3442188.3445922
3. Blagec, K., Kraiger, J., Frühwirt, W., Samwald, M.: Benchmark datasets driving
artificial intelligence development fail to capture the needs of medical professionals.
Journal of Biomedical Informatics p. 104274 (2022)
4. Borji, A.: A categorical archive of ChatGPT failures (2023). https://doi.org/10.48550/arXiv.2302.03494
5. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C.,
Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are
few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin,
H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1877–
1901. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
6. van Dis, E.A.M., Bollen, J., Zuidema, W., van Rooij, R., Bockting, C.L.: ChatGPT:
five priorities for research. Nature 614(7947), 224–226 (Feb 2023)
7. Dumas, M., Fournier, F., Limonad, L., Marrella, A., Montali, M., Rehse, J.R.,
Accorsi, R., Calvanese, D., De Giacomo, G., Fahland, D., et al.: AI-augmented
business process management systems: a research manifesto. ACM Transactions
on Management Information Systems 14(1), 1–19 (2023)
8. Dumas, M., La Rosa, M., Mendling, J., Reijers, H.A.: Fundamentals of Business
Process Management, vol. 2. Springer (2018)
9. Hutson, M.: Robo-writers: the rise and risks of language-generating AI. Nature
591(7848), 22–25 (Mar 2021)
10. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444
(2015)
11. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and
predict: A systematic survey of prompting methods in natural language processing.
ACM Computing Surveys 55(9), 1–35 (2023)
12. Loizos, C.: StrictlyVC in conversation with Sam Altman, part two (OpenAI). https://www.youtube.com/watch?v=ebjkD1Om4uw (January 2023), YouTube channel of Connie Loizos
13. Malinova, M., Mendling, J.: Identifying do’s and don’ts using the integrated
business process management framework. Business Process Management Journal
(2018)
14. Miller, J.A., Mahmud, R.: Research directions in process modeling and mining
using knowledge graphs and machine learning. In: Qingyang, W., Zhang, L.J. (eds.)
Services Computing – SCC 2022. pp. 86–100. Springer Nature Switzerland, Cham
(2022)
15. Mjolsness, E., DeCoste, D.: Machine learning for science: state of the art and future
prospects. Science 293(5537), 2051–2055 (2001)
16. Neu, D.A., Lahann, J., Fettke, P.: A systematic literature review on state-of-the-
art deep learning methods for process prediction. Artificial Intelligence Review pp.
1–27 (2022)
17. Nolle, T., Seeliger, A., Thoma, N., Mühlhäuser, M.: Deepalign: alignment-based
process anomaly correction using recurrent neural networks. In: Advanced Informa-
tion Systems Engineering: 32nd International Conference, CAiSE 2020, Grenoble,
France, June 8–12, 2020, Proceedings. pp. 319–333. Springer (2020)
18. OpenAI: Model index for researchers. https://platform.openai.com/docs/model-index-for-researchers/model-index-for-researchers
19. OpenAI: ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/ (November 2022)
20. OpenAI: ChatGPT plugins. https://openai.com/blog/chatgpt-plugins (March 2023)
21. OpenAI: GPT-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf (March 2023)
22. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang,
C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow
instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022)
23. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training. https://openai.com/blog/language-unsupervised/ (2018)
24. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. https://openai.com/blog/better-language-models/ (2019)
25. Rosemann, M., vom Brocke, J.: The six core elements of business process man-
agement. In: Handbook on business process management 1, pp. 105–122. Springer
(2015)
26. Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R.,
Luccioni, A.S., Yvon, F., Gallé, M., et al.: Bloom: A 176b-parameter open-access
multilingual language model. arXiv preprint arXiv:2211.05100 (2022)
27. Schäfer, B., Van der Aa, H., Leopold, H., Stuckenschmidt, H.: Sketch2Process: End-to-end BPMN sketch recognition based on neural networks. IEEE Transactions on Software Engineering (2022)
28. Sommers, D., Menkovski, V., Fahland, D.: Supervised learning of process discovery
techniques using graph neural networks. Information Systems p. 102209 (2023)
29. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford,
A., Amodei, D., Christiano, P.F.: Learning to summarize with human feed-
back. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.)
Advances in Neural Information Processing Systems. vol. 33, pp. 3008–3021.
Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf
30. Stokel-Walker, C.: AI bot ChatGPT writes smart essays-should academics worry?
Nature (December 2022)
31. Teubner, T., Flath, C.M., Weinhardt, C., van der Aalst, W., Hinz, O.: Welcome
to the era of chatgpt et al. the prospects of large language models. Business &
Information Systems Engineering pp. 1–7 (2023)
32. Van Noorden, R.: How language-generation AIs could transform science. Nature
605(7908), 21 (2022)
33. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information
Processing Systems. pp. 5998–6008 (2017)
34. Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab,
M., Li, X., Lin, X.V., et al.: OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)

