An Introduction To ML Lifecycle Ontology and Its Applications
Abstract. Machine Learning (ML) adoption is rising rapidly, with a projected compound annual growth rate of nearly 40% over the next decade. In other words, companies will be flooded with ML models developed with different datasets and software. Having information readily at hand about which datasets were used, how these ML models were developed, what they were used for, what their performance and uncertainties are, and what their internal structures look like can have several benefits. These pieces of ML metadata are what we collectively call ML lifecycle data. In this paper, we explain our current research into developing an ML Lifecycle Ontology (MLLO) to capture such data in a knowledge graph. The motivation is not only to make such data available in a standard, queryable representation across different ML software, but also to be able to connect it with other domain knowledge. We introduce MLLO at a high level and outline basic and advanced use case scenarios in which the data, MLLO, and domain knowledge may be used to improve the development and usage of ML models and associated datasets. We then describe the future work we are undertaking to demonstrate this hypothesis.
1 Introduction
2 Previous work
based format [7]. However, since its publication in 1998, PMML's limitations in supporting cutting-edge and customized models have become apparent, owing both to the complexity it has accumulated and to the inherent difficulty of editing and debugging XML manually [8]. Regardless, the objective of the two data standards in question also implies that they have limited scope for other aspects, such as logging model artifacts post-deployment and efficiently capturing metadata during the model development and data preprocessing stages.
ISO has also attempted to standardize terminology surrounding ML. The first
standard, ISO/IEC 22989:2022, aims to describe the key concepts surrounding ML
models. The second standard, ISO/IEC 23053:2022, provides the terminology
necessary to describe ML systems in general (e.g., training modalities, model
parametrization, evaluation). While the ISO standards add value by providing common terminology, they are not supported by a formal data model or logical rigor, which may lead to implementation ambiguity. The next section will describe how ontologies help establish terminological rigor and provide connectivity across different areas.
3 Use Cases
Use cases are one of the fundamental steps in ontology validation. Namely, use cases
test the ontology for real-world applicability, help identify inconsistencies, and assess
the completeness of the ontology concerning a particular application area.
In this section, we present two use cases instrumental in demonstrating the applicability and value of MLLO. The first use case revolves around capturing the key
aspects of ML models and data preprocessing. Moreover, the first use case aims to validate MLLO's capability to assist with particular tasks during model development and deployment. Later in this section, we introduce the second use case, which focuses on the biopharmaceutical industry. We explain how MLLO, in combination with domain-specific ontologies, could help with regulatory compliance and improve model development.
3.1 Basic Use Case
The aim of the basic use case is to validate that the MLLO ontology has sufficient coverage to capture the architecture and input requirements of ML models, as well as the data processing steps applied prior to ML model training or execution. We also aim to assess MLLO's capability to capture various ML training runs and to support analysis of the impact of training configuration on the performance of different models. Finally, with the basic use case, we aim to demonstrate how MLLO can be utilized to track model performance on datasets of varying quality (e.g., noise level), which emulates deployment scenarios in which the captured data might vary due to instrument deterioration, physical phenomena, or changes in the measurement capabilities of the instrument utilized.
To achieve our objectives, we are using several models developed for the MNIST (Modified National Institute of Standards and Technology) dataset. The dataset contains 70,000 pre-labeled grayscale images of handwritten digits of 28x28 pixels [9]. We have decided to use this dataset because it is well understood and easily accessible. Additionally, as the MNIST dataset is often used as a benchmark for machine learning, numerous well-documented MNIST-trained models are publicly available (e.g., on Kaggle). In addition to the standard MNIST dataset, we have created several datasets that contain varying degrees of Gaussian and Poisson noise (Fig. 1).
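As a minimal sketch of how such noisy variants can be generated, the following NumPy snippet adds Gaussian and Poisson noise to the MNIST test images; the exact noise-generation procedure, random seed, and clipping behavior used in our experiments are not detailed here, so the sampling details below are illustrative assumptions.

import numpy as np
from tensorflow.keras.datasets import mnist

# Load the original MNIST images (pixel intensities 0-255).
(x_train, y_train), (x_test, y_test) = mnist.load_data()
rng = np.random.default_rng(seed=0)

def add_gaussian_noise(images, std):
    # Zero-mean Gaussian noise with the given standard deviation, clipped to the valid pixel range.
    noisy = images.astype(np.float32) + rng.normal(0.0, std, images.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_poisson_noise(images):
    # Each pixel is replaced by a Poisson sample whose mean equals the original intensity.
    noisy = rng.poisson(images.astype(np.float64))
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Noisy variants analogous to those shown in Fig. 1.
test_gaussian_16 = add_gaussian_noise(x_test, 16)
test_gaussian_48 = add_gaussian_noise(x_test, 48)
test_gaussian_96 = add_gaussian_noise(x_test, 96)
test_poisson = add_poisson_noise(x_test)
test_poisson_gaussian_48 = add_gaussian_noise(add_poisson_noise(x_test), 48)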
We used two neural network models. The first model, a convolutional neural
network, comprises eight different layers, namely Convolutional Layer 1, Average
Pooling Layer 1, Convolutional Layer 2, Average Pooling Layer 2, Convolutional
Layer 3, Flattening Layer, Fully Connected Layer 1, and Fully Connected Layer 2.
Rectified linear units (ReLU) have been employed as the activation function for every
layer except for the final layer, where softmax has been used. The other model, a multi-layer perceptron (MLP), is composed of three dense layers. ReLU has also been used for the first two layers, while softmax has been utilized for the last layer.
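For concreteness, the following Keras sketch mirrors the two architectures just described; the filter counts and layer widths are illustrative assumptions, since the exact values are not listed here.

from tensorflow import keras
from tensorflow.keras import layers

# Convolutional neural network with the eight layers described above
# (filter counts and unit sizes are illustrative, not the exact values used).
cnn = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(6, kernel_size=5, padding="same", activation="relu"),   # Convolutional Layer 1
    layers.AveragePooling2D(pool_size=2),                                 # Average Pooling Layer 1
    layers.Conv2D(16, kernel_size=5, activation="relu"),                  # Convolutional Layer 2
    layers.AveragePooling2D(pool_size=2),                                 # Average Pooling Layer 2
    layers.Conv2D(120, kernel_size=5, activation="relu"),                 # Convolutional Layer 3
    layers.Flatten(),                                                     # Flattening Layer
    layers.Dense(84, activation="relu"),                                  # Fully Connected Layer 1
    layers.Dense(10, activation="softmax"),                               # Fully Connected Layer 2
])

# Multi-layer perceptron with three dense layers over flattened 28x28 images.
mlp = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])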
All the models have been trained and validated with the original MNIST dataset and
subsequently tested with the noisy datasets. Hyperparameter tuning was also conducted
for each model.
All models have been implemented in Python. We have chosen Python for the initial validation because it is the dominant language used in machine learning. It has gained popularity because of its simplicity, extensive library ecosystem, and strong community support. As such, popular libraries like TensorFlow, Keras, PyTorch, and scikit-learn make data manipulation, analysis, and model deployment much easier.
Fig. 1. Example MNIST data with varying degrees of noise: original data; original data with Gaussian noise of S.D. 16, 48, and 96; original data with Poisson noise; and original data with Poisson noise plus Gaussian noise of S.D. 48.
demonstrate that 1) ontologically encoded knowledge that combines both the domain expert and ML expert perspectives can increase the accuracy of a model utilized in biopharma, and 2) model development and selection of the optimal model can be accelerated by using MLLO.
The MLLO development process is guided by the hub-and-spokes principle, with its foundation provided by the BFO and IOF-Core ontologies. Both top-down and bottom-up
methodologies are utilized to construct the ontology. Existing ML standards (e.g.,
PMML, ONNX and ISO/IEC 22989:2022) and ontologies (e.g., STATO, ML-Schema)
are leveraged, with constructs adapted and reused where applicable. Throughout the
development, competency questions (Table 1) derived from real-world ML challenges
and use case scenarios serve as pivotal guides, ensuring that the ontology addresses
crucial industry needs.
The resulting MLLO ontology contains x classes and y object properties and is
composed of three integral areas (Fig. 2).
Fig. 2. Representation of the connections between three integral areas of the MLLO
In this section, we delve into the validation process of our basic use case, aiming to achieve the objectives specified in section 3.1 by using MLLO. First, we describe the methodology used to extract and ingest the metadata into MLLO, as well as the competency-question-driven use case validation. Next, the results of the SPARQL queries that answer the competency questions are provided and analyzed.
5.1 Methodology
The metadata associated with the trained models has been extracted and saved into JSON that conforms to the JSON schema derived from MLLO. The metadata extraction was done using our in-house Python script, which extracts the metadata through a combination of the Python frameworks' built-in save methods and user-provided inputs. For example, in the case of TensorFlow/Keras, the get_config() method was used to retrieve the model's layer metadata, which consist of the layer name, layer type, layer configuration, activation function, and initializer. Additionally, the optimization configuration is obtained using the get_compile_config() method. Metadata elements that cannot be extracted from the default model-saving methods are hardcoded. These include model_name, create_for_project, hyperparameters and evaluation_score.
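As a minimal sketch of this extraction step, the snippet below collects layer and optimizer metadata from a compiled Keras model and serializes it to JSON; the function name, output field names, and the exact structure of the MLLO-derived JSON schema are illustrative assumptions rather than our in-house script.

import json

def extract_keras_metadata(model, model_name, project, hyperparameters, evaluation_score):
    # Layer and optimizer details come from Keras' own configuration methods;
    # the remaining fields are supplied by the user (hardcoded in our script).
    layer_records = []
    for index, layer in enumerate(model.layers):
        config = layer.get_config()
        layer_records.append({
            "index": index,
            "layer_name": layer.name,
            "layer_type": type(layer).__name__,
            "activation": config.get("activation"),
            "initializer": config.get("kernel_initializer"),
            "configuration": config,
        })
    return {
        "model_name": model_name,
        "create_for_project": project,
        "hyperparameters": hyperparameters,
        "evaluation_score": evaluation_score,
        "layers": layer_records,
        "optimization": model.get_compile_config(),  # optimizer, loss, metrics
    }

# Example usage: write the metadata for the CNN model sketched earlier to disk.
# metadata = extract_keras_metadata(cnn, "ConvNet1", "MNIST basic use case",
#                                   {"batch_size": 128, "epochs": 10}, test_accuracy)
# with open("convnet1_metadata.json", "w") as f:
#     json.dump(metadata, f, indent=2, default=str)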
The JSONs were converted into JSON-LD and mapped to MLLO by using a SPARQL CONSTRUCT query. Elements pertaining to data preprocessing were entered manually into the knowledge graph. The knowledge graph was validated for consistency and coherency using HermiT 1.4.3. Additionally, the knowledge graph was assessed manually to ensure that all the metadata was properly transferred and that all interrelations are represented. Fig. 3 shows the visualization of the CNN model architecture in MLLO. In the figure, it can be seen that the model has the correct type and that the proper layer ordering (determined by indices) is preserved. Also, the connections between activation functions and particular layers are present. It should be noted that, while not explicitly shown in the figure, the ontology managed to capture the correct layer dimensions and all the configuration variables (e.g., padding) pertaining to a particular layer.
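A minimal sketch of this mapping step with rdflib is shown below; the namespaces, class names, and property names are hypothetical stand-ins, since the actual MLLO IRIs and mapping rules are not reproduced here.

from rdflib import Graph

# Load the JSON-LD produced from the extracted model metadata.
source = Graph()
source.parse("convnet1_metadata.jsonld", format="json-ld")

# Illustrative CONSTRUCT query: rewrite flat metadata terms into (hypothetical) MLLO terms.
construct_query = """
PREFIX meta: <http://example.org/metadata#>
PREFIX mllo: <http://example.org/mllo#>
CONSTRUCT {
    ?model a mllo:MachineLearningModel ;
           mllo:hasLayer ?layer .
    ?layer a mllo:NeuralNetworkLayer ;
           mllo:hasIndex ?index ;
           mllo:usesActivationFunction ?activation .
}
WHERE {
    ?model meta:layers ?layer .
    ?layer meta:index ?index .
    OPTIONAL { ?layer meta:activation ?activation . }
}
"""

# Build the MLLO-conformant graph from the CONSTRUCT results and save it as Turtle.
mllo_graph = Graph()
for triple in source.query(construct_query):
    mllo_graph.add(triple)
mllo_graph.serialize("convnet1_mllo.ttl", format="turtle")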
Finally, the resulting knowledge graph was used to answer the competency questions
(CQ4, CQ6, CQ3, and CQ9 from Table 1). The competency questions were chosen to
reflect objectives identified at the beginning of section 3.1.
5.2 Validation Results
CQ4 was addressed through a query that finds the hyperparameters which differ in value across different training instances, together with the trained model's performance on the test dataset (the original MNIST test dataset). The query results were then used to construct a scatter plot (Fig. 4). The scatter plot suggests that the CNN model performs slightly better with a smaller epoch number and batch size. In the case of the MLP model, the best performance can be achieved by reducing the batch size while keeping the number of epochs the same. It also demonstrates that MLLO is capable of establishing the connection between a performance metric (in this case, classification accuracy) and the variation in hyperparameters relative to some baseline run (e.g., batch size 128 and epoch number 10). This capability of MLLO can make performance comparisons during hyperparameter tuning more straightforward and streamlined.
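A minimal sketch of the kind of SELECT query behind CQ4, run with rdflib over the knowledge graph, is shown below; as before, the mllo: vocabulary and the file name are hypothetical stand-ins rather than the actual MLLO terms.

from rdflib import Graph

g = Graph()
g.parse("mnist_models_mllo.ttl", format="turtle")

# Illustrative query: for each training run, return the hyperparameter values
# and the trained model's classification accuracy on the test dataset.
cq4_query = """
PREFIX mllo: <http://example.org/mllo#>
SELECT ?run ?hyperparameter ?value ?accuracy
WHERE {
    ?run a mllo:TrainingRun ;
         mllo:hasHyperparameterSetting ?setting ;
         mllo:producesTrainedModel ?model .
    ?setting mllo:specifiesHyperparameter ?hyperparameter ;
             mllo:hasValue ?value .
    ?model mllo:hasEvaluation ?evaluation .
    ?evaluation mllo:evaluatedOnDataset ?testSet ;
                mllo:hasClassificationAccuracy ?accuracy .
    ?testSet a mllo:TestDataset .
}
ORDER BY ?run
"""

# Rows such as these can be fed directly into a scatter plot (cf. Fig. 4).
for row in g.query(cq4_query):
    print(row.run, row.hyperparameter, row.value, row.accuracy)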
Fig. 3. Visualization of the nodes representing the architecture of the Convolutional Neural
Network described in the basic use case section. Indices in the IRI of the parameter layer
correspond to index instances connected to the nodes, which were omitted to preserve image
clarity.
Fig. 4. Classification accuracy of the CNN model and MLP model depending on the training
hyperparameters. The cross indicates the baseline run.
CQ6 was tackled by creating a query that retrieves all the datasets used as inputs to
the execution of a particular trained model. Within the query, the classification accuracy
on the test dataset was explicitly marked as a baseline. The classification accuracy of
the model execution on other datasets (in this case, classified as production datasets)
was compared to the baseline to estimate model robustness, that is, the change in model accuracy with respect to different noise types and noise degrees. The results of the query were then plotted as bar plots (Fig. 5). The results show that the CNN model's performance is robust to Poisson noise and to a combination of Poisson and medium Gaussian noise. The only significant performance change was observed when applying the model to highly noisy Gaussian datasets. In the case of the MLP model, performance drops significantly for all but the lowest amount of Gaussian noise, indicating that the model is not robust to any noise fluctuation in the data. The findings indicate that MLLO is capable of comparing the performance of various models on a specific dataset and linking it to the dataset's features, such as the type and degree of noise. This could offer valuable insights to machine learning experts regarding which model to use in a specific setting and help pinpoint potential reasons if model performance changes during production.
Fig. 5. Robustness of the CNN Model and MLP Model to varying degrees of noise and
different noise types. The green bar represents the dataset with no noise.
CQ9 was answered via a query that retrieves the input requirements associated with a model and any data elements associated with them. The results are displayed in Table 3. The results show that, for each model, the expected data type and input dimensions are specified. While both models expect images to be encoded as “Float32”, the MLP also requires the 28x28 images to be flattened. This kind of information can help determine what kind of preprocessing a data source might need before being utilized with a particular model. It is worth mentioning that the requirements specified here are relatively simple. However, MLLO can also capture model assumptions as requirements, which can potentially be used to infer whether the characteristics of datasets satisfy the assumptions of a particular model. The availability of such information could accelerate model and feature selection decisions for a particular task or ease identification of the required data preprocessing (e.g., feature engineering) steps. The full extent of these capabilities will be explored in the future.
Table 3. Result of SPARQL query based on CQ9

   model                           input_requirement            associated_data
1  ex:ConvNet1                     ex:InputRequirements0        “specifies data type: Float32 ; specifies dimension shape: [28, 28, 1]”
2  ex-mlp:MultiLayerPerceptron1    ex-mlp:InputRequirements0    “specifies data type: Float32 ; specifies dimension shape: [784]”
adding support for MATLAB and R, it can be expanded to include a wide range of ML
models developed across various programming languages used by different domain
experts. This expansion can be achieved by developing language-specific adapters or
wrappers that extract metadata in a standardized format compatible with the existing
JSON-based solution. The benefits are enhanced reproducibility and streamlined knowledge transfer across communities.
To fully leverage the capabilities of the MLLO, it will also be necessary to have a
comprehensive platform as a tool for evaluating ML models. Therefore, we are
developing a practical MLLO-based application termed the MLLO Editor. The MLLO Editor's design is centered on seamlessly integrating features aimed at facilitating
the input, organization, and analysis of information pertaining to ML models and
dataset characteristics. It will allow users to capture relevant information according to
the MLLO ontology. At its core, the editor offers a user-friendly interface that
simplifies the process of capturing details about machine learning models, including
their architectures, hyperparameters, and training configurations. It facilitates
annotation and incorporation of dataset characteristics, enabling users to establish
explicit relationships between their models and the data on which they were trained.
Furthermore, the editor offers robust tools and tailored visualizations for conducting
comparative analyses, enabling users to compare various model information based on
performance metrics, hyperparameters, or other relevant factors. Finally, the MLLO
Editor includes features such as versioning and history tracking, enabling users to
maintain a comprehensive record of changes made to their models and associated
information over time.
7 References