
2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C)

979-8-3503-2498-3/23/$31.00 ©2023 IEEE | DOI: 10.1109/MODELS-C59198.2023.00094

Encoding Conceptual Models for Machine Learning: A Systematic Review

Syed Juned Ali, Business Informatics Group, TU Wien, [email protected]
Aleksandar Gavric, Business Informatics Group, TU Wien, [email protected]
Henderik Proper, Business Informatics Group, TU Wien, [email protected]
Dominik Bork, Business Informatics Group, TU Wien, [email protected]

Abstract—Conceptual models are essential in Software and Information Systems Engineering to meet many purposes since they explicitly represent the subject domains. Machine Learning (ML) approaches have recently been used in conceptual modeling to realize, among others, intelligent modeling assistance, model transformation, and metamodel classification. These works encode models in various ways, making the encoded models suitable for applying ML algorithms. The encodings capture the models' structure and/or semantics, making this information available to the ML model during training. Therefore, the choice of the encoding for any ML-driven task is crucial for the ML model to learn the relevant contextual information. In this paper, we report findings from a systematic literature review which yields insights into the current research in machine learning for conceptual modeling (ML4CM). The review focuses on the various encodings used in existing ML4CM solutions and provides insights into i) which are the information sources, ii) how is the conceptual model's structure and/or semantics encoded, iii) why is the model encoded, i.e., for which conceptual modeling task, and iv) which ML algorithms are applied. The results aim to structure the state of the art in encoding conceptual models for ML.

Index Terms—Machine learning, Model-driven engineering, Model Encoding, Systematic Literature Review

I. INTRODUCTION

Conceptual modeling (CM) explicitly captures (descriptive and/or prescriptive) domain knowledge, where a domain, in an enterprise and information systems engineering context, is anything that is being modeled, including—but not limited to—business processes, information structures, business transactions, and value exchanges, enabling domain understanding and communication among stakeholders [1]. Model-driven engineering (MDE) is a software development approach that emphasizes the use of models1 as the primary artifacts throughout the entire software development lifecycle. These models can be automatically transformed and refined to generate executable code, documentation, and other artifacts [2].

Applying Machine Learning (ML) techniques, i.e., Deep Learning (DL) and Natural Language Processing (NLP), on data provided by conceptual models has gained much attention in supporting various conceptual modeling tasks such as intelligent modeling assistants [3], model completion [4], model transformation [5], metamodel repository management, and model domain classification [6], [7]. Furthermore, there is a potential to apply ML to publicly available sources of high-quality (F.A.I.R. principles [8]) models to enable reuse, adaptation, and (collaborative) learning, as well as empirical modeling research.

ML on conceptual models aims to "learn" generalized patterns that capture the explicit mapping between the conceptual model's elements and the domain concepts represented by them. In other words, the trained ML model should be able to answer what the conceptual model represents in terms of the "meaning" of the domain concepts and model elements. ML-based solutions for conceptual modeling follow a specific pattern of first encoding the conceptual model's semantics in a representation suitable for training ML models. Then, the ML models are trained to learn the knowledge encoded in conceptual models to support CM tasks like metamodel element prediction and domain classification. ML models typically aim to learn generalized patterns from an input dataset by utilizing a certain encoding of the knowledge represented by the conceptual models. Therefore, the encoding constrains what an ML model can learn from the available knowledge in the model. The contextual information that captures representative semantics of the data needs to be accessible to the ML model during training for the ML model to learn semantically rich patterns. Current ML-based CM solutions primarily rely on the lexical terms (i.e., names) used as labels on modeling language primitives (e.g., classes, relations, attributes) to capture the models' contextual semantics. This leads to a situation where the natural language (NL) semantics of the primitives are encoded. However, additional sources of semantics, such as structural semantics, the metamodel semantics, and the CM elements' ontological semantics, are left implicit.

Therefore, various issues arise depending upon the requirements that need to be addressed before applying ML to CM tasks. Firstly, the sources of relevant information need

1 Throughout the paper, we will use the term 'model' to relate to a conceptual model and 'ML model' to relate to machine learning models.

to be decided, i.e., which information sources need to be made available to the ML model to learn. The source of information could be structural, i.e., graph-based properties of the conceptual model, and/or semantic, e.g., lexical terms, metamodel, and ontological semantics. Secondly, how should the model structure and semantics be encoded to be used by the ML model during training? Finally, based on the selected information for a task and the selected encoding, which ML model should be used to train on the encoded models? This topic still needs to be well understood and has not been explored in depth. Therefore, we conducted a Systematic Literature Review (SLR) to comprehensively analyze how the issues mentioned above are dealt with by the state of the art and to find crucial insights that allow us to draw associations between the different encodings on the one hand and the different purposes, modeling languages, and ML models used on the other. Finally, we make our complete results available2, including the links to the model datasets used.

The remainder of this paper is structured as follows: Section II presents the related work. Section III describes our SLR research methodology, including the research questions we address. In Section IV we present the responses to the research questions. We discuss our overall findings in Section V before we conclude this paper with Section VI.

II. RELATED WORK

In recent years, there has been a surge in works combining AI with conceptual modeling. Based on a study [9], machine learning has been the area of artificial intelligence most applied with conceptual modeling. These works focus on using AI for conceptual model processing, i.e., the automated processing of the information present in a conceptual model to assist the modeler in modeling tasks.

Lopez et al. [7] present a comparative study of different ML classification techniques that automatically label models stored in model repositories. They compare different ML models (e.g., Feed-Forward Neural Networks, Graph Neural Networks, and K-Nearest Neighbors) with varying model encodings (TF-IDF, word embeddings, graphs, and paths). However, several differences to our work need to be pointed out. Firstly, they do not discuss the source of information, i.e., which information is made accessible to the model, and do not differentiate how the structure and semantics of the model are encoded. In our work, we separate the structural and semantic sources of information and subdivide the semantic sources further into linguistic, metamodel, and ontology-based sources. Secondly, their study focuses on model domain classification applications, which does not comprehensively cover the relationship between the encoding and the applications requiring model encoding. E.g., they report that even though structural encoding schemes based on graphs should be superior, based on the rationale that they are a good match for the graph-based nature of software models, simpler encodings that do not require graph-based encoding perform better. However, it is not surprising that domain classification would not require model structure information, because the lexical terms of the model sufficiently capture the information required for the domain classification task. Therefore, the choice of encoding is task-dependent, and encodings should be selected based on the task details.

The research area of ML4CM and model-driven engineering (MDE) is still recent. Therefore, there is a lack of related work. Many papers pragmatically use ad-hoc model encodings specific to their application requirements, often lacking a systematic and comprehensible elaboration on the encoding choice and its alternatives. For example, Clarisó et al. [10] present graph kernels as a generalized model encoding for clustering software modeling artifacts and for improving the efficiency and usability of various software modeling activities, e.g., design space exploration, testing, verification, and validation, but without a systematic review of other encodings. It is important to note that our study is not a comparative study of encodings. Instead, with our SLR, we aim to provide a better understanding of the literature in relation to model encoding such that researchers and practitioners interested in applying ML to conceptual models can make an informed decision for encoding their models depending on the task they need to solve.

III. RESEARCH METHOD

Our Systematic Literature Review (SLR) followed the research method introduced by Kitchenham and Charters [11]. The SLR aims to analyze the state of the art in the context of model encodings in ML4CM. In the remainder of this section, we describe the steps involved in our SLR.

A. Defining the research scope

The paper at hand aims to respond to the following main research questions: RQ1: Which information present in the conceptual model is used for ML training?; RQ2: How is the information encoded for ML training?; RQ3: How does the ML purpose correlate with the used encoding?; and RQ4: How does the ML model correlate with the encoding?

In responding to RQ1, we investigate which information provided by a conceptual model (e.g., structural, semantic) is incorporated in current ML4CM approaches and which sources of relevant data are used during ML training. For responding to RQ2, we zoom in on the different encodings available to represent the model information in a format suitable for ML. RQ3 is responded to by separately investigating the correlation between the purpose, i.e., the CM task to be solved, and the encoding and modeling language used. RQ4 is responded to by determining how the choice of ML models relates to the purpose and the chosen encoding.

B. Conduct Search

We conducted a larger and much more generic and inclusive Systematic Mapping Study (SMS) about Conceptual Modeling and Artificial Intelligence (results are reported in [9]) using

2 https://goo.by/HEA7D

the following logically structured search query, shown in eq. (1). Instead of starting from scratch and developing a separate query, we filtered out the documents relevant to our study from the SMS. We understand that this approach can be seen as a limitation. However, we chose this alternative for two reasons: i) our query in the SMS is very inclusive (see eq. (1)), and ii) we had already done a detailed review and were able to easily exclude papers which had the contribution in the direction of AI towards CM (AI4CM). Note that due to the nature of our query, we do not include works that do not apply ML in their approach. This implies that works that, e.g., propose non-ML-based similarity metrics based on the structural features of the model graph or the semantics of the model elements are not within the scope of this work. We aim to conduct a broader review to cover such cases in the future.

Q = (∨ CMi) ∧ (∨ AIj), where (1)

CMi ∈ {"conceptual modeling", "metamodel", "meta-model", "domain specific language", "modeling formalism", "modeling tool", "modeling language", "modeling method", "model driven", "model-driven", "mde"}

AIj ∈ {"artificial intelligence", "ai", "machine learning", "ml", "deep learning", "dl", "neural network", "genetic algorithm", "smart", "intelligent"}

C. Screening papers

After executing the query on January 16, 2023, we followed the steps shown in Fig. 1 to screen the papers relevant to our study. The top portion of Fig. 1 shows the screening steps of the SMS, from which we obtained 647 relevant papers with mappings. In the current work, we focus on the literature that encodes models for applying ML methods. We applied four exclusion criteria for further filtering: exclude papers that EC-1: contribute from CM to AI; EC-2: focus on conceptual model creation using text-to-model or image-to-model transformation approaches, because we need the model as the starting artifact for model encoding; therefore, this excludes papers that, e.g., apply NLP to generate domain-specific models from text; EC-3: involve "genetic algorithms", because genetic algorithms do not learn from the models' data, unlike ML approaches; and EC-4: did not report any model encoding-related information. Furthermore, we ensured we did not miss important papers by doing a forward/backward search. After the screening, 37 relevant papers remained.

Fig. 1: SLR relevant papers screening

D. Search for keywords in abstracts

After filtering the 37 papers, we carefully reviewed all the papers and applied a classification over several attributes of our RQs. Table I shows all the attribute dimensions for classifying papers and describes each attribute with its possible values. A conceptual model captures information in its structural and semantic data3. We consider that the structural and semantic data can be encoded explicitly or implicitly, depending on the encoding. For example, N-grams encoding, or specific metrics such as the number of classes and the number of cyclic dependencies in a model, can implicitly encode the model structure [12] without explicitly encoding the graph as a network of nodes and edges. Similarly, metrics related to the semantic data, e.g., the type of model elements, like the number of relations of type generalization, can implicitly capture the semantic information [13]. Therefore, we classify the model structure and semantic data in the encoding as explicit, implicit, or not encoded. The source of structural data is the model's graph structure; semantic data, however, comes from the lexical terms associated with the model's elements (entities, relationships, attributes) in natural language and from external sources. We restrict the external sources to a model's metamodel (e.g., EClassifier, EAttribute for ECore; Aspect, Layer for ArchiMate models), external domain ontologies (e.g., WordNet), and foundational ontologies (e.g., UFO). Furthermore, in Table I, we show the classification of ML models based on the different types of ML models. However, we also focus on individual models in our data analysis (see Section IV-C). Based on our classification scheme, Fig. 2 shows the overall CM encoding process. We aim to get insights into this process using our SLR.

Fig. 2: Steps involved in ML4CM

3 We refer to semantics as non-structural data like lexical terms in the model.
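For illustration, the boolean inclusion logic of eq. (1) can be sketched in Python. This is not the authors' tooling: real digital libraries evaluate such queries over indexed metadata fields, and the whole-token matching for short terms like "ai" and "ml" is our assumption to avoid accidental substring hits.

```python
import re

# Keyword sets mirroring CMi and AIj from eq. (1).
CM_TERMS = ["conceptual modeling", "metamodel", "meta-model",
            "domain specific language", "modeling formalism", "modeling tool",
            "modeling language", "modeling method", "model driven",
            "model-driven", "mde"]
AI_TERMS = ["artificial intelligence", "ai", "machine learning", "ml",
            "deep learning", "dl", "neural network", "genetic algorithm",
            "smart", "intelligent"]

def _matches_any(text: str, terms: list[str]) -> bool:
    """Phrases are matched as substrings; single terms as whole tokens."""
    lowered = text.lower()
    tokens = set(re.findall(r"[a-z][a-z-]*", lowered))
    for term in terms:
        if " " in term:
            if term in lowered:
                return True
        elif term in tokens:
            return True
    return False

def matches_query(text: str) -> bool:
    """Q = (at least one CM term) AND (at least one AI term)."""
    return _matches_any(text, CM_TERMS) and _matches_any(text, AI_TERMS)

print(matches_query("Deep learning for metamodel classification"))  # True
print(matches_query("A survey of modeling tools"))                  # False
```

Run against a title or abstract string, the function reproduces the conjunction-of-disjunctions structure of the query; how each library tokenizes and indexes terms will differ in practice.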

TABLE I: Classification scheme keywords description

Attribute | Description | Values
Model Structure | If the model's graph structure is encoded or not. | Explicit, Implicit, Not Used
Structural Encoding | The model structure encoding type. | Raw Graph, Tree-based, Graph Kernel, Bag of Paths, Axiomatic, N-grams, Manual Metrics
Semantic Data | If the semantic data in the model is encoded. | Explicit, Implicit, Not Used
Metamodel Semantics | If the metamodel semantics are captured in the encoding. | Yes, No
Ontological Semantics | If the model terms are annotated with ontological semantics and further used in model encoding. | Yes, No
Semantic Encoding | The model semantics encoding type. | BoW Word Embeddings, BoW TF-IDF, Raw BoW, One-hot, Raw String, Manual Metrics
Modeling Purpose | The ML-based application for which the model is encoded. | Analysis, Classification, Completion, Refactoring, Repair, Transformation
ML Model | The ML model used in the paper to train on the encoded models' data. | Classical Machine Learning, Deep Learning without Graph, Deep Learning with Graph, Reinforcement Learning
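The classification scheme of Table I can be held as a simple data structure when coding papers. The following sketch is our illustration, not the authors' review tooling, and the example paper coding is hypothetical; it merely checks that every coded attribute takes one of the allowed values.

```python
# Table I as a dict from attribute name to its allowed values (assumed field
# names; the coded example paper below is hypothetical).
SCHEME = {
    "model_structure": {"Explicit", "Implicit", "Not Used"},
    "structural_encoding": {"Raw Graph", "Tree-based", "Graph Kernel",
                            "Bag of Paths", "Axiomatic", "N-grams",
                            "Manual Metrics"},
    "semantic_data": {"Explicit", "Implicit", "Not Used"},
    "metamodel_semantics": {"Yes", "No"},
    "ontological_semantics": {"Yes", "No"},
    "semantic_encoding": {"BoW Word Embeddings", "BoW TF-IDF", "Raw BoW",
                          "One-hot", "Raw String", "Manual Metrics"},
    "modeling_purpose": {"Analysis", "Classification", "Completion",
                         "Refactoring", "Repair", "Transformation"},
    "ml_model": {"Classical Machine Learning", "Deep Learning without Graph",
                 "Deep Learning with Graph", "Reinforcement Learning"},
}

def validate(coding: dict) -> bool:
    """True if every attribute of the coding takes an allowed value."""
    return all(coding[attr] in allowed for attr, allowed in SCHEME.items())

paper = {"model_structure": "Explicit", "structural_encoding": "Raw Graph",
         "semantic_data": "Explicit", "metamodel_semantics": "No",
         "ontological_semantics": "No",
         "semantic_encoding": "BoW Word Embeddings",
         "modeling_purpose": "Completion",
         "ml_model": "Deep Learning with Graph"}
print(validate(paper))  # True
```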

IV. FINDINGS

In the following, we present the results of our data analysis and, in effect, respond to the RQs defined above.

A. Response to RQ1 – Which information present in the conceptual model is used for ML training?

In Fig. 3 we show which model information is used to encode models. The figure shows that explicit information sources, i.e., the lexical terms of the model elements and the model structure, are most commonly used to encode models. Several works also encode metamodel information (10 papers) and ontological semantics (5 papers) explicitly. We see that the natural language lexical terms of model elements have the highest contribution to the semantic data (26 papers), followed by metamodel-based semantics (11 papers; 10 explicit and 1 implicit), and the least used are ontological semantics (6 papers; 5 explicit and 1 implicit). Some works use metamodel-level information, e.g., element types like EPackage and EClassifier, but in most cases, the ML model does not use the metamodel information. The model's labels are user-defined labels not rooted in any domain or foundational ontology. Rooting the model in an external ontology requires ontology alignment, which requires additional effort. Therefore, these aspects are consistent with the results in Fig. 3.

Fig. 3: Data source distribution for model encoding

Only a few cases are implicitly encoded, i.e., using metrics and keywords. Four papers implicitly encode the model structure, and only two implicitly capture the metamodel and ontological semantics. Encoding the graph structure directly allows ML models to jointly consider global and local structure [14], rather than using selected metrics focusing only on global graph structural information. Similarly, using lexical terms directly seems preferable to using a manually curated set of metrics: ML models can capture the latent correlations between model elements' terms that a manually curated set of metrics might miss. Fig. 3 also shows that the literature focused on the model's semantics, because the model's structure is not used in a model's encoding in almost half the cases (15 papers), whereas only five papers lack semantic data encoding.

B. Response to RQ2 – How is the model information encoded?

In the following, we zoom in on how the model's structure and semantic data are encoded. After classifying all the papers with the corresponding structural and semantic encoding, we found seven different types of structural and semantic encodings, as shown in Fig. 4.

1) Structural Encodings: Graph structural encodings, i.e., encodings that explicitly capture the model's structure using the graph structure information (cf. Fig. 4a), include: i) Raw Graph; ii) Tree-based encoding, where each model is represented as an independent tree whose root contains the keyword MODEL and whose children are the model elements, which can be either OBJECTs or ASSOCIATIONs [5]; and iii) Bag of Paths (BoP), where the paths of fixed lengths capturing the nodes and edges between two nodes of the model graph are stored [15]. The implicit encodings include i) manually selected metrics depending on the user requirements; ii) n-grams, which capture sequences of vertices' labels in the model of length n-1 [12]; iii) Axiomatic representation, which represents the model in terms of a set of axioms [16]; and iv) Graph kernels, which embed sub-structures of models into a set of features [17].

2) Semantic Encodings: Semantic data encodings are visualised in Fig. 4b. The lexical terms of the models are encoded in the following ways: i) Model serialization, which uses the model directly in its XML format; ii) One-hot encoding, where each lexical term in the encoding is represented in the form of a fixed-sized vector with all zeros except a single one

(a) Structural Encodings (b) Semantic Encodings
Fig. 4: Visualization of different model encodings

corresponding to the lexical term, where the size of the vector is the total size of the vocabulary; iii) Raw Bag-of-Words (BoW), where the model is represented as a vector containing all lexical model terms; iv) Term Frequency-Inverse Document Frequency (TF-IDF) vector, where the term frequency along with the inverse document frequency of the lexical terms is calculated and the model is then represented with the TF-IDF value of each lexical term in the model; v) BoW embeddings, where each word is represented in the form of a fixed-sized vector of arbitrary length whose values are produced by a language model pre-trained on a general or domain-specific data corpus; vi) manual metrics, which use some specific keyword metrics (e.g., the keywords "set" and "get" in the model serialization) to implicitly capture the model semantics; and vii) Axiomatic representation, which is the same as in the case of structural encoding.

3) Encodings' usage analysis: We show the analysis of the different encoding pairs used in both the structural and semantic dimensions in Table II. We note several key things from the table. Firstly, the "No Encodings" column for the model structure has the most papers. This is consistent with the fact that 15 out of 37 papers did not include structural encodings (see Fig. 3) and only used semantic data encoding, with TF-IDF as the most common encoding. BoW word embeddings and TF-IDF are vector-embeddings-based encodings and are the most common choice to embed the semantic data (19 out of 37). This choice seems logical because if one needs to capture the correlations between the lexical terms of the model, its metamodel, and any ontological semantics associated with the model, then techniques like TF-IDF and pre-trained language models (LMs) can capture these correlations more effectively, as these techniques learn (in the case of LMs) or compute (in the case of TF-IDF) the word embeddings over a large vocabulary, thereby providing a more contextual encoding. Several works apply NLP normalization techniques like stop word removal, stemming, and lemmatization before encoding the lexical terms with TF-IDF or LMs. Interestingly, we found that some works do not even apply NLP techniques while using lexical terms. There are different reasons for this, such as i) using string comparison metrics like the Levenshtein Distance4; ii) defining own metrics for calculating the similarity between labels of elements [27], [28], which does not require any normalization using NLP; iii) renaming all the tokens that are not keywords to a closed set of words (for instance, classes' names are A, B, C, ..., attribute names are x, y, z, ..., etc.) [5]; or iv) using only specific keywords as lexical terms where the NL semantics of the terms are not relevant [16], [27]. Word embeddings from large language models (LLMs) (GPT5, BERT6) can encode lexical terms to get nuanced, contextualized word embeddings. Therefore, several recent works use LLM-based word embeddings.

Thirdly, the model encoded as a raw graph is the most common encoding technique for capturing the model's structural information. However, the number of such works overall is significantly lower (8 out of 37). Moreover, we see that a raw graph as the structural encoding and BoW Embeddings as the semantic encoding is a frequent combination. Further analysis showed that cases that use this combination benefit from capturing both the structural and semantic information, e.g., learning a vector representation of a model [44], and characterizing a model generator [45]. Other path-based encodings, such as N-grams and Bag-of-Paths (BoP), that can encode the model as a set of paths are not frequent. This seems to be due to these encodings' limitations in sufficiently capturing the model structure. Finally, several works have used manually selected metrics and axiomatic representation to encode the model's structure and semantics. In these cases, the authors, instead of using the ML model, design their task-specific metrics without applying any encoding.

C. Response to RQ3 – How does the ML purpose correlate with the used encoding and modeling language?

In the following, we elaborate on the different purposes of model encodings for ML4CM.

4 https://en.wikipedia.org/wiki/Levenshtein_distance
5 https://openai.com/research/gpt-4
6 https://huggingface.co/docs/transformers/model_doc/bert
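Two of the structural encodings classified in Section IV-B1, Bag of Paths and n-grams over vertex labels, can be sketched on a toy model graph. The "Library" model and both helper functions below are our hypothetical illustrations, not taken from any of the surveyed works.

```python
# Toy model graph as a labelled edge list (hypothetical example).
edges = [("Library", "Book"), ("Book", "Author"), ("Library", "Member")]

def bag_of_paths(edges, length=2):
    """Collect all directed paths with `length` edges as label tuples."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    paths = []
    def walk(node, path):
        if len(path) == length + 1:   # path of `length` edges has length+1 nodes
            paths.append(tuple(path))
            return
        for nxt in adj.get(node, []):
            walk(nxt, path + [nxt])
    for node in adj:
        walk(node, [node])
    return paths

def label_ngrams(edges, n=2):
    """n-grams of vertex labels, read off chains of n-1 consecutive edges."""
    # For n=2 this degenerates to the labelled edge list itself.
    return [tuple(e) for e in edges] if n == 2 else bag_of_paths(edges, n - 1)

print(bag_of_paths(edges, length=2))  # [('Library', 'Book', 'Author')]
print(label_ngrams(edges, n=2))
```

Published approaches differ in details such as path length, directedness, and whether edge labels are interleaved with vertex labels; the sketch only shows the shape of the resulting feature sets.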

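The one-hot and TF-IDF semantic encodings described in Section IV-B2 can likewise be sketched in a few lines of pure Python. The toy models and the unsmoothed TF-IDF variant (natural log, no normalization) are our assumptions; real pipelines would typically first apply the NLP normalization steps mentioned above.

```python
import math

# Lexical terms of two hypothetical models forming a shared vocabulary.
models = [["library", "book", "author"],
          ["book", "order", "customer"]]
vocab = sorted({t for m in models for t in m})

def one_hot(term):
    """Fixed-size vector: all zeros except a single one for the term."""
    return [1 if v == term else 0 for v in vocab]

def tf_idf(model):
    """TF-IDF vector of one model over the shared vocabulary."""
    vec = []
    for v in vocab:
        tf = model.count(v) / len(model)           # term frequency
        df = sum(1 for m in models if v in m)      # document frequency
        idf = math.log(len(models) / df)           # inverse document frequency
        vec.append(tf * idf)
    return vec

print(vocab)                                  # ['author', 'book', 'customer', 'library', 'order']
print(one_hot("book"))                        # [0, 1, 0, 0, 0]
print([round(x, 3) for x in tf_idf(models[0])])
```

Note how the term "book", which occurs in every model, receives a TF-IDF weight of zero, while discriminative terms like "author" keep a positive weight; this is the property that makes TF-IDF useful for tasks such as domain classification.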
TABLE II: Structural and Semantic Encodings across all relevant papers
(rows: semantic encoding; columns: structural encoding)

Semantic Encoding | No Encodings | Manual Metrics | Axiomatic | N-grams | BoP | Graph Kernel | Tree-based | Raw Graph | Total
No Encoding | N/A | [18] | [19] | | [20] | [17] | | [21], [22] | 6
Manual Metrics | | [13], [23], [24] | | | | | | | 3
Axiomatic | [16] | | | | | | | | 1
Serialized model | [25], [26] | | | | | | | | 2
Raw BoW | [27], [28] | | | | | | | | 2
One-hot | [29], [30] | [31] | | | | | | [32] | 4
TF-IDF | [6], [33]–[37] | | | [12], [38] | [15] | | | [39] | 10
BoW Embeddings | [40], [41] | | | | [42] | | [4], [5] | [39], [43]–[45] | 9
Total | 15 | 5 | 1 | 2 | 3 | 1 | 2 | 8 | 37

1) Extracted Purposes: After carefully reading the papers, we classified each paper into one of the following categories: i) Analysis—if ML is applied to do some model analysis, e.g., discovering patterns in the model [20]; ii) Classification—if ML is applied to classify the encoded model, based on user-defined model similarity criteria, into one of the user-defined classes, e.g., domain classification of metamodels [7], [34]; iii) Completion—if ML is used to autocomplete a partial model or recommend elements to the modeler, e.g., NLP-based model autocompletion [41]; iv) Refactoring—if ML is used to support model refactoring, e.g., model-driven bug report visualisation [43]; v) Repair—if ML is used to repair a partially broken model, e.g., [16]; and vi) Transformation—if ML is used for model transformation, e.g., [5].

2) Purpose Analysis: Fig. 5 shows classification and completion as the most common ML-based applications to conceptual models. ML methods can efficiently find patterns in conceptual models, which can be used to characterize and classify models. Therefore, classification has been applied, e.g., for conceptual model search and automatic metamodel clustering. In the case of model completion, which involves predicting the next element given a partial model, BoW Embeddings is the most used encoding. Model encodings that capture the contextual information of model elements and further allow similarity comparison help predict the next element more accurately due to the available contextual information. Word embeddings capture information locally (i.e., within the model) and globally across a large dataset of models, even across domains, acting as a rich contextual data source. Therefore, it is consistent that most works used word embeddings to capture semantic data for model completion. Moreover, structural encoding (as a raw graph) is also mostly used for model completion. Predicting the next element in a model involves predicting the next node or edge in the graph; therefore, knowing the structural information during prediction helps the ML model. Furthermore, a model-driven project involves several consecutive transformations, and automating model transformation operations can reduce the time-to-market of project development and improve its quality [5]. This explains the next highest frequency of model transformation as the ML4CM purpose.

Fig. 5: Relationship of Purpose with model encoding

D. Response to RQ4 – How does the ML model correlate with the encoding and purpose?

In this research question, we focus on the used ML model and potential correlations with the model encoding.

1) ML model classification: We divided the ML models into four classes depending on the type of learning architecture, as follows: i) Classical Machine Learning, which does not involve any Deep Learning architecture; the ML models in this category include XGBoost7, CatBoost8, Support Vector Machine (SVM), Random Forest, Apriori association rules, K-nearest neighbors, Integer Linear Programming, and Naive Bayes; ii) Deep Learning without Graph, which includes DL architectures that do not explicitly capture the graph structure of the data, with the Transformer9, Long Short-Term Memory (LSTM), and Feed-Forward Neural Networks as ML models; iii) Deep Learning with Graph, which includes DL architectures that capture the graph structure, with Graph Neural Networks (GNN) and Graph-aware Attention Networks (GaAN); and iv) Reinforcement Learning (RL), which includes ML models like the Markov Decision Process (MDP).

2) ML model usage analysis: Fig. 6 shows the relationship between the used ML models and the model encodings. Fig. 6 (left) shows the different ML categories and the models used

7 https://xgboost.readthedocs.io, last accessed 04.07.2023
8 https://catboost.ai/, last accessed 04.07.2023
9 https://huggingface.co/docs/transformers/, last accessed 24.08.2023

Fig. 6: Structural (left) and semantic (right) encodings with ML models
These models are used to encode the model structure. The plot shows that GNN and FFNN are the most used models to capture the model structure. Moreover, GNNs use a raw graph as model encoding, making GNNs suitable for learning the structural and semantic information from the conceptual model's graph encoding. Furthermore, tree-based encodings are used to serialize the model as a sequence of tokens [4], [5] to make the model encoding suitable for DL models such as Transformers and LSTM (which do not explicitly capture the model structure) for model completion and transformation. However, tree-based encodings in the current works do not capture longer dependencies, i.e., dependencies exceeding an element's direct neighbors. In contrast, a raw graph with GNNs allows capturing such information to larger depths, which explains the higher frequency of the combination of raw graphs with GNNs. Multiple models are used with the Graph Kernel encoding, where Graph Kernels transform the model into a set of features that feed graph similarity metrics computed with ML models such as SVM, Naive Bayes, and Random Forest. The BoP encoding stores the model as a collection of paths such that the paths (or parts of models) allow model similarity comparisons using different ML approaches such as KNN, Apriori association rules, or even complex DL models like Transformers, as shown in Fig. 6. Note that N-grams are quite similar to BoP in capturing sequences of vertices and can thus capture relationships, but they do not capture complex relationships as well as BoP does [46]. It is, however, interesting to note that the N-grams encoding is also used with GaAN, where GaAN compensates for the N-grams' inability to capture complex relationships by capturing longer graph structural dependencies.

Fig. 6 shows that KNN and GNNs are frequently used for semantic encoding. KNN seems to be a common choice due to its simplicity: the nearest neighbor of a model, determined from the model encoding, is considered a similar model. The similarity measure enables efficient model comparison and, thereby, classification. Furthermore, KNN is most frequently used with the BoW TF-IDF encoding. The common encodings used with GNNs are BoW word embeddings. This shows that GNNs can capture structural semantics by encoding the graph's structural aspects and semantic data using generalized, semantically rich word embeddings. Moreover, FFNN and KNN are frequently used as general-purpose ML models to encode semantic data with different kinds of encodings. Transformers have been used only with BoW word embeddings because of the Transformer architecture's capability of fine-tuning generalized word embeddings for a given context. Therefore, the papers that use Transformers first take the generalized word embeddings from pre-trained language models like BERT and then fine-tune the embeddings for their task [4]. Finally, user- and content-based collaborative filtering (UBCF and CBCF) use a one-hot encoding for model element recommendation [30] and K-Means with a BoW TF-IDF encoding for model classification [36]. There are other ML approaches, such as Random Forest (RF), Naive Bayes (NB), and Inductive Logic Programming, that are not usually used in ML4CM research.

V. DISCUSSION

This section summarizes our findings, discusses insights, and reflects on the remaining research gaps we observed.

We see in Fig. 3 that there is a lack of metamodel and ontological semantics contributing to the "meaning" of model elements. We consider this lack a first research gap. The relationship of the model elements with the domain is not sufficiently captured by lexical terms alone, represented as BoW, TF-IDF, or word embeddings. Using only the NL semantics of words means missing out on the contextual semantics provided by the model's metamodel, which captures model elements' types, and by ontological semantics, which capture domain concepts. Moreover, providing only type-level information, i.e., that an element is a Class or a Relationship, but not the relationships between the types at the metamodel level, also hides information. Without encoding metamodel or ontological semantics, the ML model misses out on learning type-level semantics, the relationships between types, the properties of types (why a class is abstract, when a class needs to be abstract), common software design patterns, and which foundational ontological stereotype a class should have.
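As a toy illustration of a path-based (BoP-style) encoding of the kind discussed above, the sketch below collects vertex/edge-label sequences up to two hops from two small conceptual models and compares the resulting bags with the Jaccard index. The models, labels, and helpers (`paths_up_to`, `jaccard`) are illustrative assumptions, not the encoding of any specific surveyed approach.

```python
# Two tiny conceptual models as directed, labelled edge lists (illustrative names).
MODEL_A = [("Customer", "places", "Order"), ("Order", "contains", "Item")]
MODEL_B = [("Client", "places", "Order"), ("Order", "contains", "Product")]

def paths_up_to(edges, max_len=2):
    """Bag of Paths: collect vertex/edge-label sequences of up to max_len hops."""
    adj = {}
    for src, lbl, dst in edges:
        adj.setdefault(src, []).append((lbl, dst))
    bag = set()
    def walk(node, path, depth):
        bag.add(tuple(path))          # every prefix is itself a path in the bag
        if depth == max_len:
            return
        for lbl, nxt in adj.get(node, []):
            walk(nxt, path + [lbl, nxt], depth + 1)
    for src, _, _ in edges:
        walk(src, [src], 0)
    return bag

def jaccard(a, b):
    """Set-overlap similarity between two bags of paths."""
    return len(a & b) / len(a | b) if a | b else 1.0

bag_a = paths_up_to(MODEL_A)
bag_b = paths_up_to(MODEL_B)
print(sorted(bag_a & bag_b))               # → [('Order',)] — the only shared path
print(round(jaccard(bag_a, bag_b), 2))     # → 0.11
```

Replacing the Jaccard index with a kernel value, or feeding the bags into KNN, follows the same pattern: the bag turns a graph into a set of comparable features, which is what enables the similarity-based model comparison and classification described here.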
All of this is important information that makes the conceptual model a semantically rich artifact.

We further see in Fig. 3 and Table II that in many cases structural encodings are not used. We consider this a second research gap. Moreover, graph encoding techniques like Graph Kernels, which capture local and global neighborhood structures, are underrepresented and could be used to add more structural information to the encoding.

We acknowledge that in our SLR we have not provided a performance analysis of each of the encodings relative to the different purposes. However, a comparative performance evaluation is difficult because of the lack of standardized datasets for specific purposes and specific modeling languages. There are, furthermore, no baselines against which to test the performance of different encodings systematically. In our analysis, we found that, of all the works that make their dataset public, all the datasets are different except for [47], which shows a lack of standardized datasets. Recently, López et al. [7] performed a comparative analysis of ML encodings for the domain classification task. However, similar studies are needed for other ML4CM tasks because, as our analysis shows, the choice of encoding is task-dependent.

VI. CONCLUSION

In this paper, we provided an SLR-based detailed analysis of the various encodings used in the context of machine learning for conceptual modeling (ML4CM), i.e., using ML methods to support CM tasks. We zoomed into what information from the model is encoded, i.e., its semantics and/or structure. We then analyzed how the information is encoded, thereby identifying 14 different encodings for structural and semantic aspects. Then we analyzed why the model information is encoded, i.e., to solve what task. Finally, we analyzed the relationship between the ML models used, the proposed encodings, and the purposes reported in the literature. Based on these findings, as part of our future work, we plan to conduct a systematic comparative study of different encodings for various ML4CM purposes and to use a specific dataset to produce benchmarks for other researchers to use.

REFERENCES

[1] H. A. Proper and G. Guizzardi, "Modeling for enterprises; let's go to rome via rime," hand, vol. 1, p. 3, 2022.
[2] M. Brambilla, J. Cabot, and M. Wimmer, "Model-driven software engineering in practice," Synthesis lectures on software engineering, vol. 3, no. 1, pp. 1–207, 2017.
[3] R. Saini, G. Mussbacher, J. L. Guo, and J. Kienzle, "Domobot: a bot for automated and interactive domain modelling," in Proceedings of the 23rd ACM/IEEE international conference on model driven engineering languages and systems: companion proceedings, 2020, pp. 1–10.
[4] M. Weyssow, H. Sahraoui, and E. Syriani, "Recommending metamodel concepts during modeling activities with pre-trained language models," Softw. Syst. Model., vol. 21, no. 3, pp. 1071–1089, 2022.
[5] L. Burgueño, J. Cabot, S. Li, and S. Gérard, "A generic lstm neural network architecture to infer heterogeneous model transformations," Softw. Syst. Model., vol. 21, no. 1, pp. 139–156, 2022.
[6] P. T. Nguyen, J. Di Rocco, D. Di Ruscio, A. Pierantonio, and L. Iovino, "Automated classification of metamodel repositories: a machine learning approach," in 22nd International Conference on Model Driven Engineering Languages and Systems, 2019, pp. 272–282.
[7] J. A. H. López, R. Rubei, J. S. Cuadrado, and D. Di Ruscio, "Machine learning methods for model classification: a comparative study," in Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems, 2022, pp. 165–175.
[8] A. Jacobsen, R. de Miranda Azevedo, N. Juty, D. Batista, S. Coles, R. Cornet, M. Courtot, M. Crosas, M. Dumontier, C. T. Evelo et al., "Fair principles: interpretations and implementation considerations," pp. 10–29, 2020.
[9] D. Bork, S. J. Ali, and B. Roelens, "Conceptual modeling and artificial intelligence: A systematic mapping study," arXiv preprint arXiv:2303.06758, 2023.
[10] R. Clarisó and J. Cabot, "Applying graph kernels to model-driven engineering problems," in 1st International Workshop on Machine Learning and Software Engineering in Symbiosis, 2018, pp. 1–5.
[11] B. Kitchenham, S. Charters et al., "Guidelines for performing systematic literature reviews in software engineering," 2007.
[12] Ö. Babur and L. Cleophas, "Using n-grams for the automated clustering of structural models," in 43rd Int. Conf. on Current Trends in Theory and Practice of Computer Science, 2017, pp. 510–524.
[13] B. K. Sidhu, K. Singh, and N. Sharma, "A machine learning approach to software model refactoring," International Journal of Computers and Applications, vol. 44, no. 2, pp. 166–177, 2022.
[14] G. Lin, X. Kang, K. Liao, F. Zhao, and Y. Chen, "Deep graph learning for semi-supervised classification," Pattern Recognition, vol. 118, p. 108039, 2021.
[15] J. A. H. López and J. S. Cuadrado, "An efficient and scalable search engine for models," Softw. Syst. Model., vol. 21, no. 5, pp. 1715–1737, 2022.
[16] M. Fumagalli, T. P. Sales, and G. Guizzardi, "Towards automated support for conceptual model diagnosis and repair," in Advances in Conceptual Modeling: ER 2020 Workshops. Springer, 2020, pp. 15–25.
[17] A. Khalilipour, F. Bozyigit, C. Utku, and M. Challenger, "Categorization of the models based on structural information extraction and machine learning," in Proceedings of the INFUS 2022 Conference, Volume 2, 2022, pp. 173–181.
[18] A. D. P. Lino and A. Rocha, "Automatic evaluation of erd in e-learning environments," in 2018 13th Iberian Conference on Information Systems and Technologies (CISTI). IEEE, 2018, pp. 1–5.
[19] M. Essaidi, A. Osmani, and C. Rouveirol, "Model-driven data warehouse automation: A dependent-concept learning approach," in Advances and Applications in Model-Driven Engineering, 2014, pp. 240–267.
[20] M. Fumagalli, T. P. Sales, and G. Guizzardi, "Pattern discovery in conceptual models using frequent itemset mining," in 41st International Conference on Conceptual Modeling, 2022, pp. 52–62.
[21] J. Yu, M. Gao, Y. Li, Z. Zhang, W. H. Ip, and K. L. Yung, "Workflow performance prediction based on graph structure aware deep attention neural network," J. Ind. Inf. Integr., vol. 27, p. 100337, 2022.
[22] D. Bork, S. J. Ali, and G. M. Dinev, "Ai-enhanced hybrid decision management," Bus. Inf. Syst. Eng., vol. 65, no. 2, pp. 1–21, 2023.
[23] M. H. Osman, M. R. Chaudron, and P. Van Der Putten, "An analysis of machine learning algorithms for condensing reverse engineered class diagrams," in 2013 IEEE International Conference on Software Maintenance. IEEE, 2013, pp. 140–149.
[24] A. Burattin, P. Soffer, D. Fahland, J. Mendling, H. A. Reijers, I. Vanderfeesten, M. Weidlich, and B. Weber, "Who is behind the model? classifying modelers based on pragmatic model features," in 16th Int. Conference on Business Process Management, 2018, pp. 322–338.
[25] A. Barriga, R. Heldal, L. Iovino, M. Marthinsen, and A. Rutle, "An extensible framework for customizable model repair," in Proceedings of the 23rd ACM/IEEE International conference on model driven engineering languages and systems, 2020, pp. 24–34.
[26] F. Basciani, J. Di Rocco, D. Di Ruscio, L. Iovino, and A. Pierantonio, "Automated clustering of metamodel repositories," in 28th Int. Conf. on Advanced Information Systems Engineering, 2016, pp. 342–358.
[27] A. Adamu, S. M. Abdulrahman, W. M. N. W. Zainoon, and A. Zakari, "Model matching: Prediction of the influence of uml class diagram parameters during similarity assessment using artificial neural network," Deep Learning Approaches for Spoken and Natural Language Processing, pp. 97–109, 2021.
[28] A. Elkamel, M. Gzara, and H. Ben-Abdallah, "An uml class recommender system for software design," in 13th International Conference of Computer Systems and Applications. IEEE, 2016, pp. 1–8.
[29] X. Dolques, M. Huchard, C. Nebut, and P. Reitz, "Learning transformation rules from transformation examples: An approach based on
relational concept analysis,” in 14th IEEE Int. Enterprise Distributed
Object Computing Conference Workshops. IEEE, 2010, pp. 27–32.
[30] L. Almonte, S. Pérez-Soler, E. Guerra, I. Cantador, and J. de Lara,
“Automating the synthesis of recommender systems for modelling
languages,” in Proceedings of the 14th ACM SIGPLAN International
Conference on Software Language Engineering, 2021, pp. 22–35.
[31] M. Eisenberg, H.-P. Pichler, A. Garmendia, and M. Wimmer, “Towards
reinforcement learning for in-place model transformations,” in 2021
ACM/IEEE 24th International Conference on Model Driven Engineering
Languages and Systems (MODELS). IEEE, 2021, pp. 82–88.
[32] J. Di Rocco, C. Di Sipio, D. Di Ruscio, and P. T. Nguyen, “A gnn-
based recommender system to assist the specification of metamodels
and models,” in ACM/IEEE 24th International Conference on Model
Driven Engineering Languages and Systems. IEEE, 2021, pp. 70–81.
[33] P. T. Nguyen, D. Di Ruscio, A. Pierantonio, J. Di Rocco, and L. Iovino,
“Convolutional neural networks for enhanced classification mechanisms
of metamodels,” J. Syst. Softw., vol. 172, p. 110860, 2021.
[34] R. Rubei, J. Di Rocco, D. Di Ruscio, P. T. Nguyen, and A. Pierantonio,
“A lightweight approach for the automated classification and clustering
of metamodels,” in ACM/IEEE Int. Conf. on Model Driven Engineering
Languages and Systems Companion (MODELS-C), 2021, pp. 477–482.
[35] P. T. Nguyen, J. Di Rocco, L. Iovino, D. Di Ruscio, and A. Pierantonio,
“Evaluation of a machine learning classifier for metamodels,” Softw.
Syst. Model., vol. 20, no. 6, pp. 1797–1821, 2021.
[36] Ö. Babur, L. Cleophas, T. Verhoeff, and M. van den Brand, “Towards
statistical comparison and analysis of models,” in 2016 4th International
Conference on Model-Driven Engineering and Software Development
(MODELSWARD). IEEE, 2016, pp. 361–367.
[37] Ö. Babur, L. Cleophas, and M. van den Brand, “Hierarchical clustering
of metamodels for comparative analysis and visualization,” in European
Conference on Modelling Foundations and Applications, 2016, pp. 3–18.
[38] ——, “Metamodel clone detection with samos,” Journal of Computer
Languages, vol. 51, pp. 57–74, 2019.
[39] V. Borozanov, S. Hacks, and N. Silva, “Using machine learning tech-
niques for evaluating the similarity of enterprise architecture models,”
in Int. Conf. on Advanced Information Systems Engineering, 2019, pp.
563–578.
[40] P. Danenas and T. Skersys, “Exploring natural language processing in
model-to-model transformations,” IEEE Access, vol. 10, pp. 116 942–
116 958, 2022.
[41] L. Burgueño, R. Clarisó, S. Gérard, S. Li, and J. Cabot, “An nlp-
based architecture for the autocompletion of partial domain models,”
in Advanced Information Systems Engineering: 33rd International Con-
ference, CAiSE 2021. Springer, 2021, pp. 91–106.
[42] M. Goldstein and C. González-Álvarez, “Augmenting modelers with se-
mantic autocompletion of processes,” in Business Process Management
Forum: BPM Forum 2021. Springer, 2021, pp. 20–36.
[43] G. M. Lahijany, M. Ohrndorf, J. Zenkert, M. Fathi, and U. Kelter,
“Identibug: Model-driven visualization of bug reports by extracting class
diagram excerpts,” in 2021 IEEE International Conference on Systems,
Man, and Cybernetics (SMC). IEEE, 2021, pp. 3317–3323.
[44] S. J. Ali, G. Guizzardi, and D. Bork, “Enabling representation learning
in ontology-driven conceptual modeling using graph neural networks,”
in Int. Conf. on Advanced Information Systems Engineering, 2023.
[45] J. A. H. López and J. S. Cuadrado, “Towards the characterization
of realistic model generators using graph neural networks,” in 2021
ACM/IEEE 24th International Conference on Model Driven Engineering
Languages and Systems (MODELS). IEEE, 2021, pp. 58–69.
[46] B. Li, T. Liu, Z. Zhao, P. Wang, and X. Du, “Neural bag-of-ngrams,”
in AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[47] J. A. H. López, J. L. Cánovas Izquierdo, and J. S. Cuadrado, “Modelset:
a dataset for machine learning in model-driven engineering,” Softw. Syst.
Model., pp. 1–20, 2022.