A R T I C L E  I N F O

Keywords:
Clinical decision support software
Data labeling
Human-centered Design for Embedded Machine Learning Solutions
Machine learning models

A B S T R A C T

Background: Nurses are essential for assessing and managing acute pain in hospitalized patients, especially those who are unable to self-report pain. Given their role and subject matter expertise (SME), nurses are also essential for the design and development of a supervised machine learning (ML) model for pain detection and clinical decision support software (CDSS) in a pain recognition automated monitoring system (PRAMS). Our first step for developing PRAMS with nurses was to create SME-friendly data labeling software.
Purpose: To develop an intuitive and efficient data labeling software solution, Human-to-Artificial Intelligence (H2AI).
Method: The Human-centered Design for Embedded Machine Learning Solutions (HCDe-MLS) model was used to engage nurses. In this paper, HCDe-MLS is explained using H2AI and PRAMS as illustrative cases.
Findings: Using HCDe-MLS, H2AI was developed and facilitated labeling of 139 videos (mean = 29.83 min) with 3189 images labeled (mean = 75 s) by 6 nurses. OpenCV was used for video-to-image pre-processing, and MobileFaceNet was used for default landmark placement on images. H2AI randomly assigned videos to nurses for data labeling, tracked labelers' inter-rater reliability, and stored labeled data to train ML models.
Conclusions: Nurses' engagement in CDSS development was critical for ensuring the end-product addressed nurses' priorities, reflected nurses' cognitive and decision-making processes, and garnered nurses' trust for technology adoption.
Abbreviations: CDSS, clinical decision support software; H2AI, Human-to-Artificial Intelligence; HCDe-MLS, Human-centered Design for Embedded Machine Learning Solutions; IRR, inter-rater reliability; NFCS, Neonatal Facial Coding System; NICU, neonatal intensive care unit; PRAMS, Pain Recognition Automated Monitoring System; SME, subject matter expert; sQuaRE, System and Software Quality Requirements and Evaluation; VAS, visual analog scale.
* Corresponding author at: Ann & Robert H. Lurie Children’s Hospital of Chicago, 225 E. Chicago Avenue, Box 101, Chicago, IL 60611-2991, USA.
https://doi.org/10.1016/j.ijmedinf.2023.105337
Received 14 August 2023; Received in revised form 16 December 2023; Accepted 30 December 2023
Available online 6 January 2024
The three core characteristics of human-centered design are understanding users, stakeholder engagement, and a systems approach [3]. Understanding nurses and nursing workflows is essential for designing CDSS that meets nurses' needs. Engaging nurses throughout the design and development lifecycle ensures that the end-product addresses nurses' priorities and solves pragmatic problems [4]. By prioritizing nurses' needs and human–machine interactions, CDSS can be designed to efficiently accomplish tasks, while ensuring nurses' experiences are intuitive and meaningful. Ideally, CDSS reflects nurses' actual cognitive and decision-making processes [5].

1.1. HCDe-MLS and software quality

According to the System and Software Quality Requirements and Evaluation (sQuaRE) standards, software evaluation should include five quality-in-use and eight product quality characteristics [6,7]. Usability is the sQuaRE characteristic most often evaluated [8]; however, individual factors are most important for influencing perceptions of software quality [9]. The HCDe-MLS model ensures users' perceptions of software quality are influenced by individual, technological, and organizational factors.

Trust, or the certainty that a system will not fail, is a critical driver of technology usage behaviors and is essential for user adoption of technology [10]. Product-related factors, such as perceived usefulness, helpfulness, functionality, reliability, and ease of use, as well as security/service-related and social factors, influence trust and CDSS adoption. Design elements of the user interface are an important predictor of users' trust in CDSS [11]. Thus, the HCDe-MLS model of including users in software development is critical for influencing nurses' involvement, training, knowledge, competency, resistance to change, and overall perceptions of CDSS quality.

1.2. Purpose

The purpose of this study was to develop an intuitive and efficient data labeling software solution, Human-to-Artificial Intelligence (H2AI). This paper describes the HCDe-MLS model and its use to develop the H2AI software solution. H2AI facilitates data labeling by subject matter experts (SMEs), enables tracking of data labeling progress, and allows monitoring of data labeling quality with inter-rater reliability (IRR) dashboards. We leveraged the HCDe-MLS model to maximize neonatal intensive care unit (NICU) nurses' engagement, experience, and productivity to develop H2AI, an intuitive ML model data labeling software solution. In this case, H2AI was developed and used to train an ML neonatal pain classification model, a Pain Recognition Automated Monitoring System (PRAMS).

2. Methods: HCDe-MLS model

The HCDe-MLS model combines human thinking and ML lifecycles (Fig. 1).

Fig. 1. Human-Centered Design for Embedded Machine Learning Solutions (HCDe-MLS): Human thinking (left side of diagram) from Empathize to Test, and then the Machine Learning lifecycle (right side of diagram) from Analyze to Evaluate. When both are complete, the solution is deployed (bottom of diagram). Reprinted with permission from ©2019 Kavi Global.

2.1. Human thinking lifecycle

The human thinking lifecycle seeks to understand human needs (Table 1). Human thinking requires imagination, logic, and systematic reasoning to artfully create user-focused outcomes [12]. The ML lifecycle seeks to find patterns in existing data, apply these patterns to new data, and embed ML in solutions. The ML lifecycle follows the standard agile and iterative software development stages [13].
Table 1
Human Thinking Lifecycle Stages and Respective Toolbox.

Empathize. Description: Collect information; gather insights. Tools: Interviews; focus groups; surveys; storytelling; generative technique.
Define. Description: Synthesize insights; microtheory of user problem and needs; validate with users. Tools: Empathy mapping; user persona; journey mapping.
Ideate. Description: Generate ideas for possible solutions to defined problems and needs. Tools: Brainstorming; mind-mapping; affinity diagram; co-creation.
Prototype. Description: Build low- and high-fidelity tactile representations of solutions. Tools: Feature v1/v2 sketches; visual prototypes.
Test. Description: Generate performance data; gather feedback from users and stakeholders. Tools: Feedback grid.
2.1.1. Empathize

In the human thinking lifecycle, empathizing focuses on learning about target users. The purpose of empathizing is to set aside assumptions and instead gain insights into users' actual physical, cognitive, and emotional needs to complete tasks [3,14]. However, Boy [2] clarifies that tasks, such as pain assessment, are technology-centered prescriptions to humans, and activities are what humans really do. Empathizing requires listening, engaging, observing, and understanding users to gain insight into human activities [12,15].

2.1.2. Define

Empathizing is followed by defining the problem and user requirements. Through qualitative research methods, insights are synthesized, and user personas are developed. Personas are detailed descriptions of target users developed from highly specific data about real people [16]. The aim of using personas is to create the users' point of view, reframe the problem, and effectively focus design efforts on users' needs and preferences. Defining the problem brings clarity to ensure the solution solves the true problem in the best way [12].

2.1.3. Ideate

Engaging users in brainstorming, mind-mapping, or co-creation sessions initiates ideation [3]. The aim of the ideate stage is to channel empathy, familiarity, creativity, and collective situational awareness to address the shared purpose by developing a broad range of possible solutions that are unbounded by the limitations and status quo of the current state [2]. Then, all the possibilities must be evaluated against the constraints of resources and context to prioritize and finalize the most feasible solution [12,15].
2.1.4. Prototype

The best idea is then built as a prototype. Prototypes may range from low-fidelity sketches to high-fidelity working artifacts [3,4]. Effective prototypes communicate concepts and test ideas through iterative feedback from users and stakeholders. Prototypes are important for implementing possibilities and for maintaining a solution-building approach [12].

2.1.5. Test

The final stage of the human thinking lifecycle involves testing and refining of the software created in the prototype stage [12]. Essential components of testing include representative users, stakeholders, tasks, and environments. Qualitative and quantitative methods are used to identify problems, capture recommendations for improvement, and statistically support qualitative concerns [15]. Tangible metrics should be developed with users and stakeholders to improve the assessment of complex system interoperability [2].

2.2. Machine learning lifecycle

The first stage of the ML lifecycle is to analyze user needs and translate needs into requirements.

2.2.1. Analyze

The analyze stage advances tasks to activity and complexity analysis [2]. The scope and boundaries of the software, the functional and technical requirements, the nonfunctional requirements, data sources, data collection, and integration in a format that can later be consumed by the ML model, application, and user are defined [17]. Functional requirements, including inputs, calculations, and processes, are then translated into technical requirements of how the software performs its actions. Nonfunctional requirements are the look and feel of the product, the user interface and experience. A core focus in this stage is data management, including obtaining data essential for the process of training, testing, and validating the ML model [13]. Data collection requires gathering data samples of the real-world system, process, or phenomenon for which the ML model is being built. The data collected may be heterogeneous because of various disparate sources; thus, pre-processing the data to ensure consistency is inevitable. When data samples are unavailable or their collection is too costly, time-consuming, unethical, or dangerous, augmentation methods are used to add these critical data samples to collected data sets [18].
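Augmentation of the kind cited above [18] can be as simple as geometric and photometric perturbations of existing images. The following is a minimal sketch under our own assumptions (OpenCV in Python; the paper does not specify an augmentation implementation):

```python
# Illustrative only: simple augmentations that add plausible variants of a
# scarce face image (mirror, small rotation, brightness shift).
import cv2
import numpy as np

def augment(image: np.ndarray) -> list:
    """Return simple augmented variants of one face image."""
    h, w = image.shape[:2]
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)  # 10 degrees
    return [
        cv2.flip(image, 1),                              # horizontal mirror
        cv2.warpAffine(image, rotation, (w, h)),         # rotated copy
        cv2.convertScaleAbs(image, alpha=1.0, beta=25),  # brighter copy
    ]
```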
2.2.2. Design

The design stage is the most creative in the ML lifecycle. Here, focus transitions from the problem to the solution, designing an optimal solution architecture that leverages technologies to solve problems efficiently and effectively. The goal is to transform the requirement specifications into structure. Creativity, system thinking, risk taking, agile approaches, and knowledge of human systems integration architecture are required [2]. An outline of the solution is generated, including the technical approach, solution architecture, ML models, evaluation metrics, capability of the team, project constraints, risks, timeline, and budget. Solution features are prioritized based on complexity, speed to value, and cost to determine the optimal minimal viable product. Buy versus build decisions are made for technology and components, as well as leveraging accelerators, for example, existing pre-trained models like the Convolutional Architecture for Fast Feature Embedding (Caffe)-based convolutional neural network (CNN). A CNN is a feed-forward neural network that uses filters to effectively extract information from images. Hsu et al. [19] introduced a CNN-based model to detect 68 facial landmarks on facial images.

Visual appeal and usability can override trust in information quality; however, accuracy is one factor that stimulates reflection and motivation for information quality [11]. CDSS performance requires access to data sets and multimodal healthcare data that can be assessed cognitively and longitudinally to make dynamic predictions and reflect the timing of clinical decision making [5]. Predictive models must know the dimensionality of the data, for example, the strong predictive value of International Classification of Disease codes (ICD-10-CM) and Diagnosis-Related Groups (DRGs) and a priori interactions of clinical data. When large de-identified data sets are used to train predictive ML models, historical mistakes in datasets, known as "historical decision bias," are carried forward in the model [5,13]. ML model performance improves when temporal changes and trends of repeated measurements are considered. For dynamic predictions of clinical outcomes, models can be trained "on-the-fly" [5]. Unfortunately, on-the-fly training of Bayesian models results in reduced model performance, and on-the-fly training of computationally expensive complex algorithms (e.g., support vector machines [SVM] and CNN) results in slow responses and limited CDSS utility. Thus, to manage performance, most clinical decision support models are trained in nightly or weekly batches, and only the scoring of a new patient record is done on-the-fly in real time.
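As a concrete illustration of the batch-versus-on-the-fly pattern just described, consider the following sketch; it is our own simplification (scikit-learn, with hypothetical helper and file names), not the H2AI implementation:

```python
# A minimal sketch, not a clinical system: the model is re-trained in a
# scheduled batch job, and only scoring of a new record happens in real time.
import joblib
from sklearn.ensemble import RandomForestClassifier

MODEL_PATH = "pain_model.joblib"  # hypothetical location

def nightly_batch_train(load_labeled_frames):
    """Scheduled job: fit on the full labeled set and persist the model."""
    X, y = load_labeled_frames()  # features and nurse-provided labels
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y)
    joblib.dump(model, MODEL_PATH)

def score_on_the_fly(features):
    """Real-time path: load the last batch-trained model, score one record."""
    model = joblib.load(MODEL_PATH)
    return model.predict_proba([features])[0, 1]  # probability of 'pain'
```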
2.2.3. Build

Model building is the process of implementing ML models (e.g., logistic regression, SVM, random forest, and deep learning models like CNN) to solve the identified problem. Model building follows data pre-processing and encompasses feature engineering, splitting of data into training data and test data, and running various models on the training set. ML models are broadly classified as supervised and unsupervised; the learning process is defined as classification or regression [20]. Supervised learning algorithms learn to map inputs to outputs based on labeled input–output training data pairs. Supervised learning may define outputs by classification (resulting in a finite set of output categories) or by regression (defining the probability of the output based on the input). Model selection is based on the type of problem, volume, and availability of training data, as well as the need for model transparency and explainability [13,18,21].

2.2.4. Tune

Some ML models have hyperparameters, which are used to control the learning process and can be iteratively tuned to optimize model performance and results. Tuning is the stage of improving ML model performance by choosing and optimizing the hyperparameters of the training algorithm to control for overfitting, underfitting, and model complexity [13,18,21]. ML models lack design specifications; instead, algorithms are developed by learning parameters from mathematically derived data. With models that do not require hyperparameter tuning, i.e., pre-trained models like Caffe-based CNN and MobileFaceNet, the tuning stage is unnecessary. However, in general, model performance can be improved by iterating on the features fed into the model.

2.2.5. Evaluate

The performance of the chosen model is then evaluated against the original use requirements and acceptance criteria on previously unseen test data [13]. Model evaluation demonstrates the robustness and generalizability of the model and enables comparison to other existing methods. Performance metrics should be quantifiable and reflect data characteristics and the CDSS [13,22]. For supervised models, performance metrics typically include accuracy, precision, recall, and speed. Especially in the healthcare context, it is important to evaluate tradeoffs between types of error (i.e., false positives and false negatives) to ensure patients are not misclassified or incorrectly treated. Evaluation metrics must also consider human interpretation of what the algorithm does and means [22].
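The build, tune, and evaluate stages above can be illustrated with a short, generic sketch; we assume scikit-learn and synthetic data, since these stages are framework-agnostic and the paper does not publish its modeling code:

```python
# Build: split labeled data; Tune: cross-validated hyperparameter search;
# Evaluate: metrics on held-out data, including the FP/FN tradeoff.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Tune: search hyperparameters that control over- and underfitting.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X_train, y_train)

# Evaluate on previously unseen test data.
y_pred = search.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # exposes FP vs. FN counts
print(classification_report(y_test, y_pred))  # precision, recall, accuracy
```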
2.2.6. Deploy

The last stage of the HCDe-MLS model is deployment to the production environment. Deployment refers to configuring the CDSS for integration with other applications to serve as designed at scale. Built-in mechanisms to integrate feedback and support CDSS may be required [13]. Human–machine interfaces must enhance operator automation-related situational awareness [23]. Failing to attend to the knowledge, expertise, and training needed to optimize human–machine interactions results in automation errors. In addition, deviation from test data to operational data must be monitored to identify covariate shift or concept drift [13,21]. Therefore, it is essential that, in safety-critical systems like healthcare, any deployed model is transparent, explainable, interpretable, and continuously monitored to meet clinical decision support needs [21].

3. Stakeholders and setting

After receiving approval from the Institutional Review Board (study #2021-4348), small focus groups were conducted by video calls for 1 h each week from February to May 2021 to empathize and identify user needs. Stakeholders included NICU nurses (n = 6), nurse scientists (n = 2) with expertise in neonatal development and pain management, human-centered design specialists, architects, data scientists, and product managers. Nurses had a mean of 18.7 years of NICU nursing experience (ranging from 5 to 42 years) and worked in a 64-bed level IV NICU, part of a 364-bed, free-standing, university-affiliated, not-for-profit urban children's hospital in Illinois that cares for neonates with complex medical needs.

4. Results: H2AI development case

Our cross-functional team identified an opportunity to train a variety of ML models by labeling data. Models could then be compared against the nurses' benchmark to gain clinical trust and encourage CDSS adoption. We developed user personas to define user tasks and needs through thematic analysis by the human-centered design specialists and verification from all focus group members (Table 2). These personas provided real-life context to reframe the problem and focus design efforts toward efficiently leveraging nurses' expertise for data labeling and, eventually, development of an effective PRAMS.

Then, our cross-functional team identified novel ideas and disruptive innovations to optimize user workflows, maximize productivity, and minimize user burden. Our resulting mind map (Fig. 2) illustrates the key H2AI product features identified.

4.1. Data labeling tasks

Six data labeling tasks were defined based on clinical neonatal pain assessment standards [24], a review of the literature [25,26], and ML modeling needs [6,7,13,18,21]. First, nurses used the Neonatal Facial Coding System (NFCS) to label each video frame. NFCS is a valid and reliable objective measure of pain [24,27,28]. Second, nurses rated their perception of pain intensity on a visual analog scale (VAS) of 0–100, with 0 indicating no pain and 100 indicating the worst possible pain. Third, nurses identified and labeled facial landmarks to help the computer vision model identify facial action units from movement of facial features. Fourth, nurses identified occlusions, where neonates' hands or blankets obstruct facial landmarks. Fifth, nurses classified pain by frame image and, sixth, at the video level.
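To make the six tasks concrete, a per-frame label record might look like the following sketch; the field names are our illustration, not the H2AI schema:

```python
# Hypothetical label records implied by the six data labeling tasks.
from dataclasses import dataclass, field

@dataclass
class FrameLabel:
    video_id: str
    frame_number: int
    nurse_id: str
    nfcs_items: dict = field(default_factory=dict)   # task 1, e.g. {"brow_bulge": True}
    vas_intensity: int = 0                           # task 2: 0-100 VAS rating
    landmarks: list = field(default_factory=list)    # task 3: (x, y) pixel points
    occluded_landmarks: list = field(default_factory=list)  # task 4: hidden points
    frame_pain: bool = False                         # task 5: frame-level class

@dataclass
class VideoLabel:
    video_id: str
    nurse_id: str
    video_pain: bool                                 # task 6: video-level class
```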
4.2. User workflows

Four user workflows were developed.

4.2.1. Practice workflow

The purpose of the Practice workflow was to educate nurses in the tasks and features of the application. Since users had identified that they would need to access the application from a variety of computers, the Practice workflow was also used to test their equipment.

4.2.2. Training workflow

The Training workflow was created to ensure consistency of labeling among nurses. A nurse scientist labeled five random frames in parallel with each nurse, then the two met to reconcile any labeling differences. If agreement thresholds were reached before meeting, the nurse was "passed" on to the Labeling workflow. If thresholds were not attained, parallel labeling continued in repeated sets of five additional frames until the thresholds were reached.

4.2.3. Labeling and review workflows

The Labeling workflow was identical to the Training workflow, except the generated data labels were stored for later use to train the ML pain classification model. To ensure nurses consistently met IRR thresholds throughout the labeling of thousands of video frames, the nurse scientists were randomly assigned to label up to 10 % of the videos each nurse labeled. Given the volume of frames to be labeled, the Review workflow was created to allow real-time monitoring of data labeling progress (i.e., how many frames/videos were labeled, how long each task takes, and IRR for each nurse).
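The agreement check that gates a nurse from the Training workflow into the Labeling workflow can be sketched as percent agreement over the five parallel-labeled frames. The thresholds shown (88 % on binary items, ±10 VAS points) are those reported in Section 4.4.3; the requirement that every frame's VAS fall within tolerance is our simplifying assumption:

```python
# A minimal percent-agreement IRR sketch, not the H2AI implementation.
def binary_agreement(nurse, scientist):
    return sum(a == b for a, b in zip(nurse, scientist)) / len(nurse)

def vas_agreement(nurse, scientist, tol=10):
    return sum(abs(a - b) <= tol for a, b in zip(nurse, scientist)) / len(nurse)

def passes_training(nfcs_n, nfcs_s, vas_n, vas_s):
    return (binary_agreement(nfcs_n, nfcs_s) >= 0.88
            and vas_agreement(vas_n, vas_s) == 1.0)  # assumed: all frames within tolerance

# Example with five parallel-labeled frames (prints False: one NFCS mismatch
# and one VAS difference of 12 points).
print(passes_training([True, True, False, True, True],
                      [True, True, False, True, False],
                      [40, 55, 10, 70, 60],
                      [45, 50, 15, 65, 72]))
```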
Table 2
User Personas.

We are: Nurse scientists. We are trying to: automate pain classification in neonates. But: we need an ML model that healthcare professionals will trust. Because: in healthcare, risk from false positives and false negatives is high. We need to create solutions that: innovate and involve direct care nurses in the rigorous development of a continuous pain monitoring system for vulnerable neonates.
We are: Architects and data scientists. We are trying to: build a supervised ML model to automate pain classification. But: we need nurse-labeled data and a method to collect the data labels to train the ML model. Because: we want the model to be trustworthy, and therefore comparable to SME benchmarks and validated methods of pain classification. We need to create solutions that: inspire and partner with healthcare professionals to develop an efficient solution for ML modeling.
We are: NICU nurses. We are trying to: label neonatal facial landmarks and facial action data to train the ML model. But: variability in assessments among nurses is normal; documenting each assessment and decision is time-consuming. Because: there are so many landmark points, NFCS pain classification results, pain intensity ratings, and overall pain classifications to capture for each frame. We need to create solutions that: empower nurses to engage in designing the labeling system and the development of a clinical decision support solution to provide better care for my patients.

ML, machine learning; NFCS, Neonatal Facial Coding System; NICU, neonatal intensive care unit; SME, subject matter expert.
Fig. 2. Mind map of key features for embedded ML solutions development. AI, artificial intelligence; API, application programming interface.
4.3. H2AI prototype

When ready to create a prototype, the data scientist first conducted a buy versus build comparison to determine whether data labeling capabilities already existed in the market. Image annotation solutions already existed; common features were pixel identification, bounding boxes, region detection, and text/object tagging. However, none provided the ability to upload data based on human interactions with video images. Mid- and high-fidelity prototypes were then built (Fig. 3) and tested by nurses. Our feedback grid itemizes both improvement opportunities and positive feedback (Table 3).

Table 3
Feedback Grid.

Feedback: We need the data labeling to be more efficient. Current human–machine interaction paradigm: Medical records allow nurses to copy forward results from tasks across frames. Solution feature enhancement: Copied forward previously selected tasks on each image and frame from each video.
Feedback: Increase size of dots on user interface (facial landmarking dots are too small to see, select, and move; pain VAS slider is too small). Current paradigm: Nurses use a variety of computer brands, monitor sizes, trackpads, mouse, etc. Enhancement: Increased dot size (Tasks 2 and 3); increased pain intensity slider granularity (Task 4).
Feedback: Need fine-grain control to move facial landmark dot(s). Enhancement: Enabled single select, multi-select, rotation, space expansion, and space contraction of a group of landmark dots at once.
Feedback: Limit risk of user pain intensity score bias. Current paradigm: Numeric pain scales have inherent bias; VAS is a valid pain intensity measure with more rigor. Enhancement: Hid numbers on pain intensity slider.
Feedback: Need practice and training workflows for training nurse data labelers and ensuring data quality. Current paradigm: Goal is to maximize IRR. Enhancement: Created workflows for practice, training, and labeling, with user-specified IRR thresholds that need to be passed in training before entering labeling workflows.
Feedback: Cannot identify chin quivering due to use of still image. Current paradigm: May negatively influence IRR. Enhancement: Removed chin quiver from NFCS.
Feedback: Need a way to monitor nurses' progress in training and labeling efforts. Current paradigm: Variable schedules due to patient demands; encourage and reward efforts; track paid time; need to identify data drift. Enhancement: A Power BI dashboard was embedded into the user interface to summarize the progress of nurses.
Feedback: Need better default landmark placement at the start of each labeling task. Current paradigm: To improve efficiency by having to move fewer landmark points into place. Enhancement: Updated the pre-trained facial landmarking default placement AI model from OpenCV to MobileFaceNet.

AI, artificial intelligence; BI, business intelligence; IRR, inter-rater reliability; NFCS, Neonatal Facial Coding System; VAS, visual analog scale.

4.4. H2AI machine learning lifecycle

Feedback was analyzed and mapped to the product backlog to optimize functionality, user experience, and productivity. Data security agreements required user authentication and secondary verification. This greatly influenced the architectural design and ML modeling approach. This solution was funded on a time-limited grant that required a software solution to be in production within three months. User requirements were translated into technical requirements, mapped to the appropriate technology and ML model solution (Table 4), and then consolidated into a single cohesive solution. The solution architecture (Fig. 4) encompasses a holistic software solution from the front-end user interface to the embedded ML models' output, back-end data storage, and service calls to pass data between the front and back ends.

Table 4
Technical Requirements and Appropriate Technology and ML Model Solution.

User requirement: User needs to provide data labels for six tasks on each image and frame from each video. Technical requirement: Video data needs to be pre-processed into images and made available in the data labeling solution for users to label. Solution: Video data can be stored in a blob format; OpenCV is the most popular image processing library to capture images from videos and detect faces.
User requirement: User wants landmarks to be as precise as possible for optimal landmarking efficiency. Technical requirement: Pre-trained landmark models can be run in the back end to place default landmarks as close as possible to outline facial features (there are several options; the Caffe model and MobileFaceNet can be compared), and the image and default landmark positions need to be made available in the user interface. Solution: The image path and default landmark positions can be sent via a RESTful API call as a JSON file to the user interface, which can then display the coordinates on the UI over the image file.
User requirement: Users want to automate pain detection using the validated NFCS and pain scale measures. Technical requirement: Labeled data from the users needs to be collected and stored in a format that can later be used to train the supervised computer vision pain detection model. Solution: User labels are stored in a relational Azure SQL database to be accessed easily from Python when doing model training and benchmarking.
User requirement: Users want to track the labeling process. Technical requirement: Reporting on top of the SQL relational data store of labeled data needs to be provided and visualized. Solution: Power BI can be used to provide reporting on data labeling progress by displaying counts.
User requirement: Users want to see the IRR across users. Technical requirement: IRR is required at each frame level. Solution: IRR calculations can use Python code on Azure Cloud Databricks to compute.
User requirement: Funding allows a 30-day timeline. Technical requirement: Users require functionality, security, and authentication. Solution: Cloud Native solution architecture can enable rapid delivery by leveraging pre-built components.
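For example, the service call in the landmark row of Table 4 might carry a payload like the following; the field names and values are hypothetical illustrations, not the H2AI API:

```python
# Hypothetical JSON payload: image path plus default landmark coordinates
# returned by the back end for the UI to draw over the image.
import json

payload = {
    "video_id": "V0042",
    "frame": 137,
    "image_path": "/blobs/V0042/frame_00137.png",
    "landmarks": [  # (x, y) points in pixel coordinates; 3 of 68 shown
        {"id": 0, "x": 112, "y": 214},
        {"id": 1, "x": 115, "y": 240},
        {"id": 2, "x": 121, "y": 265},
    ],
}
print(json.dumps(payload, indent=2))
```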
4.4.1. H2AI build

To build the ML model, neonatal pain and no-pain video images from the iCOPEvid Neonatal Pain Video Database were obtained with permission and used for this study [25]. Videos and images needed to be labeled by nurses in the data labeling solution. H2AI utilizes pre-trained models that are optimized to extract facial features from video frames with high efficiency and capture labels at the lowest level of granularity. Intel's open-source framework, OpenCV, has a built-in face detector that is reliable in 90–95 % of clear, forward- and camera-facing human photos [29]. OpenCV was selected to convert video to images, crop the face, and place the bounding box on the face to position facial landmarks within the acceptable level of confidence (Fig. 5). The default OpenCV model cropped the outline of the face, especially by the ears and chin; thus, additional padding of 20 pixels was added before cropping the image. This ensured that all facial features were available for landmarks that might otherwise be lost.
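A minimal sketch of this pre-processing step, assuming OpenCV's stock Haar-cascade face detector and hypothetical file names (the paper does not publish its pipeline code), is:

```python
# Extract frames, detect the face, pad the box by 20 pixels, and crop.
import cv2

PAD = 20  # extra pixels so chin/ear landmarks are not cut off

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("neonate_video.mp4")  # hypothetical file name
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Expand the bounding box, clamping to the frame borders.
        x0, y0 = max(x - PAD, 0), max(y - PAD, 0)
        x1 = min(x + w + PAD, frame.shape[1])
        y1 = min(y + h + PAD, frame.shape[0])
        cv2.imwrite(f"frame_{frame_idx:05d}.png", frame[y0:y1, x0:x1])
    frame_idx += 1
cap.release()
```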
4.4.2. H2AI landmark model comparison

Two pre-trained facial landmark models were implemented, and the precision of their respective landmark placements was compared. First, a Caffe-based CNN model was implemented. Caffe is a deep learning framework that defines a net layer-by-layer in its own model schema; the network defines the model in a bottom-to-top approach from input data to loss. The model was composed of 24 layers: 8 convolutional layers, 4 pooling layers, 2 dense layers, 9 batch-normalization layers, and 1 flatten layer. Using the Keras Functional Application Programming Interface (API), the pre-processed frames of images were fed into the model.

The second model, MobileFaceNet, uses a more streamlined architecture with depthwise separable convolutions [30]. Chen et al. [31] developed MobileFaceNet, using ArcFace [30] loss to achieve > 99.5 % accuracy for the face detection task on the Labeled Faces in the Wild (LFW) dataset [32]. MobileFaceNet is also effective as a general facial feature extractor [33]. MobileFaceNet is specifically designed for the face recognition task by replacing the global average pooling layer with a global depthwise convolution (GDConv) layer, which enhances the discriminative ability of the model. The first layer of each sequence uses a stride s, and all other layers use stride = 1 to preserve the same output feature map size as the original layer. All spatial convolutions in the bottlenecks use 3 × 3 kernels. The expansion factor t is always applied to the input size, and GDConv7×7 denotes GDConv with 7 × 7 kernels. A downsampling strategy is used at the beginning of the network, and a linear 1 × 1 convolution layer follows a linear global depthwise convolution layer as the feature output layer. During training, batch normalization is used, and batch normalization folding is applied before deployment.
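The GDConv idea can be illustrated in a few lines; this is our Keras sketch of the concept, not the MobileFaceNet implementation used in H2AI. A depthwise convolution whose kernel spans the whole final 7 × 7 feature map replaces global average pooling, so spatial positions receive learned, unequal weights:

```python
# Sketch of a GDConv output head (assumed channel sizes for illustration).
import tensorflow as tf
from tensorflow.keras import layers

x = layers.Input(shape=(7, 7, 512))                   # final 7x7 feature map
gdconv = layers.DepthwiseConv2D(kernel_size=7,        # kernel == map size
                                strides=1,
                                padding="valid")(x)   # -> 1x1x512, linear
embedding = layers.Conv2D(128, kernel_size=1)(gdconv) # linear 1x1 conv output
model = tf.keras.Model(x, layers.Flatten()(embedding))
model.summary()
```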
Since Caffe-based CNN and MobileFaceNet are pre-trained models, we did not tune hyperparameters for landmark detection. However, we adjusted the size and color tone of input images to achieve the best results. Models were then compared and evaluated using visual inspection across several images, including challenging images with occlusions. As seen in Fig. 6, the Caffe-based CNN model lacked precision, and MobileFaceNet better captured the outline of the upper lip (versus the tongue), nose, and eyes. Users agreed that MobileFaceNet was the better solution for default facial landmarking and was more robust at handling occlusions and blurry images from movement. Therefore, we integrated and deployed this pre-trained model into the production environment, and nurses who had met IRR thresholds then began data labeling workflows.

4.4.3. H2AI efficiency evaluation

Using HCDe-MLS, H2AI was developed and facilitated labeling of 139 videos with 3189 images labeled by 6 nurses. Nurses began labeling data after IRR thresholds of 88 % agreement were attained on NFCS items and binary pain classification, and after agreement on pain intensity scores was within ±10 points across 5 random test frames [34]. NFCS labeling took nurses a mean of 12.23 s per image and 4.67 min per video. Landmark labeling took nurses a mean of 51.24 s per image and 20.36 min per video. In total, NFCS and landmark labeling took nurses a mean of 75 s per image and 29.83 min per video. The best performing ML model trained on nurses' labeling of this data in H2AI had 97.7 % precision, 98 % accuracy, 98.5 % recall, and an area under the receiver operating characteristic curve (AUC) of 0.98 [34]. HCDe-MLS and the development of H2AI were a critical first step in the development of a trustworthy PRAMS.
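For reference, the four metrics reported above are typically computed as follows for a binary pain classifier; the predictions here are made up for illustration, not the study's data:

```python
# Computing accuracy, precision, recall, and AUC with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # nurse-labeled ground truth
y_prob = [0.9, 0.2, 0.8, 0.7, 0.4, 0.95, 0.1, 0.3, 0.6, 0.85]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 decision threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # few false positives
print("recall   :", recall_score(y_true, y_pred))     # few missed pain events
print("AUC      :", roc_auc_score(y_true, y_prob))    # threshold-independent
```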
5. Discussion

Our cross-functional team leveraged the HCDe-MLS model to develop the H2AI solution. H2AI is a data labeling solution that facilitates efficient labeling of video image data by SMEs and stores the user-generated data labels for later development of and access by ML models. In this case, data labeled by nurses was used to train a highly precise and accurate model with excellent recall. With further refinement, H2AI will now be used to train an ML model to continuously monitor neonatal facial actions for pain, a Pain Recognition Automated Monitoring System (PRAMS).
5.1. ML models and efficacy comparison for pain classification

With 98 % accuracy, 97.7 % precision, 98.5 % recall, and an AUC of 0.98, our supervised ML pain classification model far exceeded previously reported models developed with the same video dataset (highest AUC = 0.93) and performed better than all except one model developed with a smaller dataset (AUC = 0.98, 15 videos) [25,26,34]. As Zamzmi et al. [26] suggested, incorporating clinical and contextual information is necessary to refine and develop a context-sensitive PRAMS. Using HCDe-MLS and H2AI, we have demonstrated a method to effectively incorporate nurses' clinical and contextual knowledge to advance development of effective pain recognition models and PRAMS [34].

Using HCDe-MLS and H2AI also improved data labeling efficiency. Researchers using other methods in their attempts to automate pain assessment based on facial expressions have reported that data labeling was time- and labor-intensive, taking up to 3 h for every minute of video [35]. Brahnam et al. [25] used iCOPEvid video images and a Gaussian of Local Descriptors (GOLD) approach to extract facial features. This is a time-consuming four-step process that involves dense scale-invariant feature transform (SIFT) descriptors and probability density estimation. SIFT is computed based on the histogram of the gradient, making it mathematically complicated and computationally heavy.

Ashraf et al. [36] utilized the Active Appearance Model (AAM) to identify shape and appearance variations of adult faces but identified a lack of ground truth at the individual frame level.
Fig. 6. Facial landmarking comparison by model: Caffe-based CNN (left) and MobileFaceNet (right).
Also in contrast to our approach, Brahnam et al. [25] achieved ground truth at the frame level but validated their neonatal pain classification ML model based on assessments by 185 college students with no appreciable healthcare or neonatal pain assessment experience. By having a frame-level ground truth based on data labeled by nurse SMEs, our model can learn and improve its performance. This level of data labeling granularity is needed to ensure nurses will trust PRAMS, a CDSS solution for pain detection.

5.2. Limitations

H2AI and our best performing ML model were developed using the iCOPEvid neonatal pain database. This database is small and lacks racial and ethnic diversity [25], which may influence MobileFaceNet detection of facial landmarks [33]. Therefore, the time required for data labeling by SMEs may be longer with a more diverse dataset. Recent federal data sharing requirements may facilitate access to more diverse video and clinical datasets that may accelerate further development of models that promote healthcare equity in CDSS and PRAMS.

The iCOPEvid database contained video that we then converted to frame images for data labeling granularity [25,26]. However, the resulting ML model may fail to capture dynamic patterns of facial expressions that may be important for discriminating pain or other conditions. To date, only one novel multimodal spatiotemporal approach for assessing neonatal postoperative pain has been reported, with an AUC of 0.87 and 79 % accuracy, exceeding many other unimodal facial coding approaches [37].

5.3. Future potential H2AI applications

H2AI can be utilized to label data and develop ML models to detect pain in other vulnerable patients who cannot provide self-report [24], to detect other human conditions associated with facial actions, such as depression and anxiety [38], and to detect potential threats by differentiating anger from hostility using micro-expressions [39]. With customization, H2AI can also be extended to Natural Language Processing (NLP) models, where the model is trained to deliver sentiment analysis, named entity recognition, and optical character recognition. Audio tagging is also a potential area of development for H2AI, such that information pertaining to sound bites from the videos, such as cry, could assist in the model's learning process. We are moving forward to develop PRAMS with a clinical trial of continuous video facial monitoring for pain. Determining the latency of alert, specifically, the length of time or number of consecutive images that classify a condition before a clinician is alerted, is a feature we must add to H2AI.
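A hypothetical sketch of such an alert-latency rule, requiring N consecutive pain-classified frames before alerting (N is a tunable parameter, not a value from the study), is:

```python
# Only alert after N consecutive frames are classified as pain.
from collections import deque

class PainAlerter:
    def __init__(self, n_consecutive: int = 30):
        self.n = n_consecutive
        self.recent = deque(maxlen=n_consecutive)

    def update(self, frame_is_pain: bool) -> bool:
        """Feed one frame classification; return True when an alert should fire."""
        self.recent.append(frame_is_pain)
        return len(self.recent) == self.n and all(self.recent)

alerter = PainAlerter(n_consecutive=3)
for flag in [True, True, False, True, True, True]:
    if alerter.update(flag):
        print("alert clinician")  # fires on the sixth frame
```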
6. Conclusions

When training computer vision algorithms for healthcare CDSS, ML models must be explainable and validated against the expertise of healthcare professionals. We have demonstrated that HCDe-MLS can be used to generate a user-centric software solution with embedded ML. We engaged nurses in the design, building, and deployment of H2AI, a first step in our development of a PRAMS. To meet nurses' needs and deliver the best user experience, we used Cloud Native, a serverless architecture, to accelerate time to solution delivery. OpenCV provided efficient video-to-image data pre-processing for data labeling. MobileFaceNet demonstrated superior results for default landmark placement on neonatal video images. We found that H2AI facilitates efficient data labeling and stores labeled training data for future access to train ML models. H2AI also tracks IRR and compares ML model performance to SMEs. The H2AI solution can be generalized to other industry uses.

Summary Table:

What is already known on the topic:

• Individual factors are most important for influencing perceptions of software quality.
• User-interface design, perceived usefulness, helpfulness, functionality, reliability, and ease of use, as well as security/service-related and social factors, influence trust and clinical decision support software adoption.

What this study added to our knowledge:

• The Human-Centered Design for Embedded Machine Learning Solutions (HCDe-MLS) model provides a systematic approach for engaging nurses to develop patient monitoring clinical decision support software solutions.
• Nurses informed the development of Human-to-Artificial Intelligence (H2AI), an intuitive and efficient data labeling software solution for healthcare professionals' use.

7. Financial disclosure statement

This study was supported by a Perinatal Origins of Disease Grant from Stanley Manne Children's Research Institute and The National Science Foundation grant number #2205472. Dr. Manworren is supported by The Posey and Fred Love Endowment of Nursing Research at Ann & Robert H. Lurie Children's Hospital of Chicago.
CRediT authorship contribution statement

Naomi A. Kaduwela: Conceptualization, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – review & editing, Visualization, Supervision, Project administration, Funding acquisition. Susan Horner: Validation, Formal analysis, Investigation, Resources, Writing – review & editing, Visualization, Supervision, Project administration. Priyansh Dadar: Software, Validation, Formal analysis, Data curation, Writing – review & editing. Renee C.B. Manworren: Validation, Formal analysis, Investigation, Resources, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We would like to thank Keri Benbrook, Ashley Entler, Taylor Greene, Catherine Myler, Karah Riley, Kim Rindeau, and Rebecca Zuravel, baccalaureate degree prepared NICU registered nurses at Ann & Robert H. Lurie Children's Hospital of Chicago, for their participation in focus groups to develop H2AI and for labeling data to train the ML model for PRAMS. We would also like to thank Rajesh Inbasekaran and Balakumaran Manoharan for their guidance in overall software solution architecture and application design and development, and Rahul Dhanasiri for his help in building the MobileFaceNet model. Lastly, we would like to acknowledge Jun Hua Wong, Kosha Soni, Trishla Mishra, and Stacey Tobin for their assistance in literature review, manuscript preparation, and editing.

References

[1] R.C.B. Manworren, Nurses' management of children's acute postoperative pain: a theory of bureaucratic caring deductive study, J. Ped. Nurs. 64 (2022) 42–55, https://doi.org/10.1016/j.pedn.2022.01.021.
[2] G.A. Boy, Human-centered design of complex systems: an experience-based approach, Design Sci. 3 (2017) e8, https://doi.org/10.1017/dsj.2017.8.
[3] M. Melles, A. Albayrak, R. Goossens, Innovating health care: key characteristics of human-centered design, Int. J. Quality Health Care 33 (1) (2020) 37–44, https://doi.org/10.1093/intqhc/mzaa127.
[4] T.M. Ward, M. Skubic, M. Rantz, A. Vorderstrasse, Human-centered approaches that integrate sensor technology across the lifespan: opportunities and challenges, Nurs. Outlook 2020 (2020) 734–744, https://doi.org/10.1016/j.outlook.2020.05.004.
[5] D. Zikos, A framework to design successful clinical decision support systems, in: Proceedings of the 10th International Conference on Pervasive Technologies Related to Assistive Environments (PETRA 2017), Association for Computing Machinery, New York, NY, USA, 2017, pp. 185–188, https://doi.org/10.1145/3056540.3064960.
[6] Canadian Standards Association, CAN/CSA-ISO/IEC 25050:12, Systems and software engineering: systems and software quality requirements and evaluation (SQuaRE system and software quality models), 2012. http://scc.ca/en/standadsdb/standards/28356.
[7] ISO 25000 STANDARDS, 2019. https://iso25000.com/index.php/en/iso-25000-standards.
[8] L. Souza-Pereira, S. Ouhbi, N. Pombo, Quality-in-use characteristics for clinical decision support system assessment, Comput. Methods Programs Biomed. 207 (2021) 106169, https://doi.org/10.1016/j.cmpb.2021.106169.
[9] K. Curcio, A. Malucelli, S. Reinehr, M.A. Paludo, An analysis of the factors determining software product quality: a comparative study, Comput. Standards Interfaces 48 (2016) 10–18, https://doi.org/10.1016/j.csi.2016.04.002.
[10] M.B. Garcia, N.U. Pilueta, M.F. Jardiniano, VITAL APP: development and user acceptability of an IoT-based patient monitoring device for synchronous measurements of vital signs, in: IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), Laoag, Philippines, 2019, pp. 1–6, https://doi.org/10.1109/HNICEM48295.2019.9072724.
[11] S. Pengate, P. Antonenko, A multimethod evaluation of online trust and its interaction with metacognitive awareness: an emotional design perspective, Int. J. Human-Comput. Interaction 29 (2013) 582–593, https://doi.org/10.1080/10447318.2012.735185.
[12] A. Anand, A.S. Mishra, A. Deep, K. Alse, Generation of educational technology research problems using design thinking framework, in: 2015 IEEE Seventh International Conference on Technology for Education (T4E), 2015, pp. 69–72, https://doi.org/10.1109/T4E.2015.28.
[13] H. Suresh, J. Guttag, A framework for understanding sources of harm throughout the machine learning life cycle, 2021. arXiv:1901.10002v4, https://doi.org/10.48550/arXiv.1901.10002.
[14] E.V. Eikey, M.C. Reddy, C.E. Kuziemsky, Examining the role of collaboration in studies of health information technologies in biomedical informatics: a systematic review of 25 years of research, J. Biomed. Inform. 57 (2015) 263–277, https://doi.org/10.1016/j.jbi.2015.08.006.
[15] K. Thoring, R. Muller, Understanding design thinking: a process model based on method engineering, in: 13th International Conference on Engineering and Product Design Education, London, UK, 8–9 September 2011, The Design Society, 2011, pp. 493–498. https://www.designsociety.org/publication/30932/.
[16] A.A. Abahussin, R.M. West, M.J. Allsop, D.C. Wong, L.E. Ziegler, A pain recording system based on mobile health technology for cancer patients in a home setting: a user-centred design, in: 2020 IEEE International Conference on Healthcare Informatics (ICHI), 2020, pp. 1–10, https://doi.org/10.1109/ICHI48887.2020.9374388.
[17] G. Tobias, A.B. Spanier, Developing a mobile app (iGAM) to promote gingival health by professional monitoring of dental selfies: user-centered design approach, J. Med. Internet Res. 8 (8) (2020) 17, https://doi.org/10.2196/19433.
[18] R. Ashmore, R. Calinescu, C. Paterson, Assuring the machine learning lifecycle: desiderata, methods, and challenges, ACM Comput. Surv. 54 (5) (2021) 111, https://doi.org/10.1145/3453444.
[19] C.F. Hsu, C.C. Lin, T.Y. Hung, C.L. Lei, K.T. Chen, A detailed look at CNN-based approaches in facial landmark detection, 2020. arXiv:2005.08649, https://doi.org/10.48550/arXiv.2005.08649.
[20] F. Liang, W.G. Hatcher, W. Liao, W. Gao, W. Yu, Machine learning for security and the internet of things: the good, the bad, and the ugly, IEEE Access (2019), https://doi.org/10.1109/ACCESS.2019.2948912.
[21] H. Kuwajima, H. Yasuoka, T. Nakae, Engineering problems in machine learning systems, Mach. Learn. 109 (2020) 1103–1126, https://doi.org/10.1007/s10994-020-05872-w.
[22] E.P.S. Baumer, Toward human-centered algorithm design, Big Data Soc. 4 (2) (2017) 1–12, https://doi.org/10.1177/2053951717718854.
[23] B. Strauch, The automation-by-expertise-by-training interaction: why automation-related accidents continue to occur in sociotechnical systems, Hum. Factors 59 (2) (2017) 204–228, https://doi.org/10.1177/001872081666545.
[24] K. Herr, P. Coyne, B. Ely, C. Gelinas, R.C.B. Manworren, Pain assessment in the patient unable to self-report: clinical practice recommendations in support of the ASPMN 2019 position statement, Pain Manag. Nurs. 20 (5) (2019) 404–417, https://doi.org/10.1016/j.pmn.2019.07.005.
[25] S. Brahnam, L. Nanni, S. McMurtrey, A. Lumini, R. Brattin, M.R. Slack, T. Barrier, Neonatal pain detection in videos using the iCOPEvid dataset and an ensemble of descriptors extracted from gaussian of local descriptors, Appl. Comput. Inf. 19 (2020) 122–143, https://doi.org/10.1016/j.aci.2019.05.003.
[26] G. Zamzmi, R. Kasturi, D. Goldgof, R. Zhi, T. Ashmeade, Y. Sun, A review of automated pain assessment in infants: features, classification tasks, and databases, IEEE Rev. Biomed. Eng. 11 (2018) 77–94, https://doi.org/10.1109/RBME.2017.2777907.
[27] R.V.E. Grunau, K.D. Craig, Pain expression in neonates: facial action and cry, Pain 28 (1987) 395–410, https://doi.org/10.1016/0304-3959(87)90073-X.
[28] L.M. Relland, A. Gehred, N.L. Maitre, Behavioral and physiological signs for pain assessment in preterm and term neonates during nociception specific response: a systematic review, Pediatr. Neurol. 90 (2019) 13–23, https://doi.org/10.1016/j.pediatrneurol.2018.10.001.
[29] S. Emami, V.P. Suciu, Facial recognition using OpenCV, J. Mobile Embedded Distrib. Syst. 4 (2012). https://www.researchgate.net/publication/267426877_Facial_Recognition_using_OpenCV.
[30] J. Deng, J. Guo, N. Xue, S. Zafeiriou, ArcFace: additive angular margin loss for deep face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 4685–4694, https://doi.org/10.1109/CVPR.2019.00482.
[31] S. Chen, Y. Liu, X. Gao, Z. Han, MobileFaceNets: efficient CNNs for accurate real-time face verification on mobile devices, in: Chinese Conference on Biometric Recognition, Springer, Cham, 2018, pp. 428–438, https://doi.org/10.1007/978-3-319-97909-0_46.
[32] G.B. Huang, M. Mattar, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst, 2007. http://vis-www.cs.umass.edu/lfw/lfw.pdf.
[33] H. Wang, H. Zhang, L. Yu, L. Wang, X. Yang, Facial feature embedded CycleGAN for VIS–NIR translation, in: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1903–1907, https://doi.org/10.1109/ICASSP40776.2020.9054007.
[34] R.C.B. Manworren, S. Horner, R. Joseph, P. Dadar, N. Kaduwela, Performance evaluation of a supervised machine learning pain classification model developed by neonatal nurses, Adv. Neonatal Care (2023), accepted for publication.
[35] J.W.B. Peters, H.M. Koot, R.E. Grunau, et al., Neonatal Facial Coding System for assessing postoperative pain in infants: item reduction is valid and feasible, Clin. J. Pain 19 (2003) 353–363.
[36] A.B. Ashraf, S. Lucey, J.F. Cohn, T. Chen, Z. Ambadar, K.M. Prkachin, P.E. Solomon, The painful face – pain expression recognition using active appearance models, Image Vis. Comput. 27 (12) (2009) 1788–1796, https://doi.org/10.1016/j.imavis.2009.05.007.
[37] M.S. Salekin, G. Zamzmi, D. Goldgof, R. Kasturi, T. Ho, Y. Sun, Multimodal spatio-temporal deep learning approach for neonatal postoperative pain assessment, Comput. Biol. Med. 129 (2021) 104150.
[38] M. Gavrilescu, N. Vizireanu, Predicting depression, anxiety, and stress levels from videos using the Facial Action Coding System, Sensors (Basel) 19 (17) (2019) 3693, https://doi.org/10.3390/s19173693.
[39] M. Tsikandilakis, P. Bali, J. Derrfuss, P. Chapman, Anger and hostility: are they different? An analytical exploration of facial-expressive differences, and physiological and facial-emotional responses, Cogn. Emotion 34 (3) (2019) 581–595, https://doi.org/10.1080/02699931.2019.1664415.