
SR 11-7, Validation and Machine Learning Models

By Dong (Tony) Yang1

November 2020

Table of Contents
I. Overview
II. Machine Learning! Machine Learning! Machine Learning!
III. What Are the New Wrinkles?
IV. Stumbling Blocks to SR 11-7 Execution
V. What Should Validations Cover, and How?
VI. Detailed Recommendations on Machine Learning Model Validations
VII. Conclusion
VIII. Appendix – Detailed Recommendations (by ML Algorithm)

I. Overview

In this article I attempt to explore the changes and challenges that machine learning models bring to model risk management, especially model validation, for institutions under the Fed SR 11-7 / OCC 2011-12 ("SR 11-7") regime. I also provide detailed recommendations on validation focuses, considerations and procedures, following the SR 11-7 principles and requirements, for 16 prevailing machine learning algorithms.

This article can be considered as an extension of the discussions in Chapter 9, “Model Risk Management
in the Current Environment”, authored by me, of “Commercial Banking Risk Management: Regulation in
the Wake of the Financial Crisis” (Palgrave Macmillan, 2017, ISBN 978-1-137-59442-6).

II. Machine Learning! Machine Learning! Machine Learning!

It seems that overnight we are seeing this juggernaut of machine learning coming down the pike – but is it really such a new and novel wonder? The answer is: not necessarily.

There is no consensus on the purview of "machine learning" models; however, it is commonly recognized that plenty of different modeling algorithms and techniques can be classified as "machine learning", including some fairly "traditional" approaches, such as linear and logistic regressions, which have been quite prevalent in financial modeling and widely applied by banks over the past decades. On the other hand, breakthroughs in computing and data management technologies did pave the way for the huge upsurge of other machine learning algorithms that we have seen in recent years.

1 All views and opinions expressed in this article are my own ([email protected])

While machine learning has been a predominant topic of discussion in industries such as healthcare, automotive, energy, entertainment, mass media, and so on, the banking industry started testing the waters only recently, in areas like fraud detection, loss forecasting, marketing, portfolio management, trading, underwriting, etc. In many (if not most) cases, such explorations take the form of Proof of Concept (PoC) or challenger models, and validation by the second line of defense, applying the principles and requirements of SR 11-7, has not yet become part of the bread and butter for Model Risk Management (MRM) teams. This, however, may change soon, along with the growing use and application of machine learning models, and the need for such models to be less esoteric.

So – what is machine learning bringing to model validation under SR 11-7?

III. What Are the New Wrinkles?

As mentioned, there is a plethora of machine learning models, which may be based on distinct algorithms
and methodologies; and as such, they bring along divergent and unique characteristics and nuances. But
by and large, the most conspicuous features of machine learning models that differentiate them from
traditional modeling practices include the following:

1. “Black-Box” – most, if not all, machine learning algorithms are opaque and hard to interpret. Indeed,
whether the algorithms are based on greedy optimization (such as gradient descent), or matrix
decomposition, or forward / backward propagation, or what have you, they all share one trait in
common: imperviousness to outsiders, even to those who actually run or use (or, as in many cases,
develop) the models.

2. Model Assumptions – in machine learning models, business assumptions usually play a much smaller
role, as compared to traditional models, especially the assumption-based or rule-driven models (such
as many regulatory compliance models). This is also quite different from models that rely heavily on
market information (such as valuation or interest rate term structure models). The key assumptions
for machine learning models are usually the hyper-parameters developed and tuned to be applied in
the model. Accordingly, the “business intuition” of the variables is also blurred, to say the least.

3. Data Immensity – while not all algorithms require large amounts of data to build on, the exponential growth of data now available for modeling is undeniably one of the main drivers of the proliferation of machine learning models. Such data is usually massive, complex, and often has multifarious features. For supervised learning, data labeling is oftentimes a head-scratcher, too.

4. Outcome Analysis – machine learning models do give more weight to outcome analysis as the most pivotal control in assessing model performance. However, such outcome analysis is focused on cross validation (and its variants), while some other conventional procedures (such as back-testing) may no longer be relevant.

5. Model Implementation – similar to traditional models, coding and implementation of machine learning models can truly present risks. But unlike traditional models, the computational load, including occupied memory and processing time, as well as the capability (or lack thereof) to accommodate huge amounts of data, are more important considerations when assessing machine learning model implementation.

6. Ongoing Monitoring and Tracking (M&T) – simply put, ongoing M&T is not feasible for certain machine learning models. For certain algorithms (such as PCA), there is pretty much no effective way (or, quite frankly, necessity) to continuously monitor the model on a stand-alone basis. Another example – for neural networks that suffer from catastrophic interference, traditional ongoing monitoring procedures are almost certain to be ineffective without re-training the model.

So what does all this mean for model validation, and are we supposed to take the rough with the smooth?

IV. Stumbling Blocks to SR 11-7 Execution

Needless to say, machine learning is no magic bullet, nor does it give anyone the Midas touch. It does come with a panoply of risks that need to be effectively managed. At the same time, it ushers in a number of new challenges to model validation, especially under the SR 11-7 framework. Most notably, such challenges include:

1. Conceptual Soundness – given the opaque model training process, black-box algorithms, nontransparent variable selection procedures, and a number of other factors, assessing the conceptual soundness of machine learning models may be quite an uphill battle. This is especially challenging for unsupervised learning algorithms – imagine having to explain the "business intuition" when a K-means algorithm automatically groups all the (unlabeled) data into different clusters, which carry no business interpretation but are based purely on the "distance" among the data points…a real bear, I guess.

Another challenge is the lack of repeatability of many machine learning models, that is, the infeasibility of precisely replicating the model's outputs. Unlike, say, an OLS or ARIMA model, for which a validator can reasonably expect to exactly match the model's coefficients if given the same data, many machine learning models spit out different results every time they are run, and such differences are non-deterministic. Such models can be tested only indirectly, e.g. by testing the conceptual soundness of the selected features and assumptions, and by evaluating the process and outputs.

2. Inputs – as mentioned earlier, machine learning models oftentimes are built on humongous data that is disorganized and unstructured. That not only poses challenges to the data cleaning and parsing process, which can be truly burdensome and irksome, but also raises the risk of an unbefitting model choice, especially for algorithms that are sensitive to noisy data (such as AdaBoost, GBM, SVM, KNN, etc.). Accordingly, validators need to gauge both risks when assessing model inputs, which requires a thorough understanding and comprehensive consideration of not only the training data, but also the model limitations.

3. Assumptions – the challenges pertaining to model assumptions are two-fold: a lack of business reasoning, and difficulty in evaluating the hyper-parameters. In a traditional validation, the former is usually critical in determining whether the model is out of line with its business purposes – for example, is the assumption that "any recoveries from credit losses will be collected in X months after default" reasonable, based on common business practice? Such review and assessment, however, becomes infeasible when business inferences are absent from machine learning model assumptions.

As to the hyper-parameters, the challenge is that they are too "subjective", even though they are usually tuned before model training is completed. There are, of course, some common practices (or "rules of thumb") for determining certain hyper-parameters, such as the learning rate and regularization, which at the same time raises the bar for the validators' familiarity with the corresponding models and algorithms.

4. Processing – for machine learning models, it is essential to keep in mind the processing time constraints and the model's limitations in handling large amounts of data. For certain models, performance may deteriorate by an order of magnitude as training data increases, which may be reflected not only in unbearable computation time but also in unacceptable model results. Therefore it is imperative for the validator to know the algorithms and techniques inside and out, especially their limitations.

5. Implementation – although there are many excellent programming languages and packages, such as Python (with all its handy libraries), to facilitate many types of machine learning modeling, coding is still a key and high-risk part of model development and implementation. In other words, machine learning models, given their nuanced nature and characteristics, are highly subject to implementation errors.

6. Ongoing M&T – it goes without saying that various commonly-seen ongoing M&T procedures, such as "ongoing monitoring of pricing errors", "regular reconciliation to market information", or "periodic back-testing against new deals", may not be effective in capturing the model risk associated with many types of machine learning algorithms. There is no one-size-fits-all ongoing M&T solution for all models, and this is particularly true for machine learning models.

Other model validation components are not listed here, not because they are not affected by the new
challenges from machine learning models, but because running the gamut here does not add much more
value – for example, will the validation procedures and activities need to be adjusted for machine
learning model output? How about model reporting? Model documentation? Model governance and
controls?... The answer is of course yes to all these – but you’ve already got the point.

Nevertheless, machine learning models are still models, and are subject to model risks that need to be
effectively identified, measured, and controlled. SR 11-7, as a high-level guidance for practitioners to
control model risk, is still largely applicable (although some level of modifications and updates to SR 11-7
might become inevitable somewhere down the road). Accordingly, the SR 11-7 guidelines can, and
should, be applied to validations of machine learning models.

Now, how can model validators apply SR 11-7 components to machine learning model validations?

V. What Should Validations Cover, and How?

First of all, there are a lot of model validation components, based on SR 11-7, that do not have to be
shaken up just to fit into machine learning models, as the traditional validation procedures and activities
can still be effective to gauge the corresponding model risks. On the other hand, one has to admit that,
given certain unique features of many machine learning algorithms, some of the SR 11-7 requirements
cannot be executed without significant variations and interpretations.

All in all, this is largely a case-by-case situation, and fitting the machine learning model validation into a
Procrustean bed does not help anyone and should never be the solution. Throughout the course of a
validation, validators should not deviate from the objective of identifying, measuring, and reporting true
risks in models, and that is not different for machine learning models.

Some general considerations and discussion points specific to machine learning model validation are
provided below – this is not meant to be all-inclusive and cover all the validation components, given the
limited space. More detailed recommendations related to the topics below, based on different machine
learning algorithms, are also provided in the next section and the Appendix.

1. Conceptual Soundness (Theory, Design, Assumptions)

Under SR 11-7, conceptual soundness assessment usually covers the theoretical background of the
modeling framework and approaches, benchmarking analysis to potential alternatives, business
intuition, business interpretation of the selected variables (including signs of coefficients), and overall
fit of the model to its designated business purposes and use. These are still executable for many
machine learning models, especially supervised learning algorithms.

For example, most ensemble methods, such as Bagging, Random Forest, GBM, AdaBoost, and XGBoost, all start with a predetermined set of features / variables. The models are based on the concept that the modeled objects (classification, regression, or other problems) are driven by certain selected features that are assumed to be relevant, regardless of the way the data is selected and the training methodology is tweaked (such as randomly selecting a subset of data for each tree, changing the weights of weak learners in each run, shifting subsets of features iteratively, or whatnot). Therefore, to assess the conceptual soundness of such models, the validators should review the full set of variables and focus on the assumed relevance.

Another case in point is the variable selection process, which sometimes happens completely behind the scenes for machine learning models. Some algorithms incorporate automatic feature selection, such as via regularization (SVM), which is a key functionality for ensuring model performance when there is a large number of features. In such cases, validators should clearly document such functionalities, and ensure that the audience understands that model integrity is not necessarily undermined by this lack of interpretability under the specific conditions – as we know, automatic variable selection is not totally new to many validators (remember the stepwise and similar methods?). On the other hand, it would be ideal to have a qualitative variable selection process in addition to the machine-pick procedure – if there is one, it certainly should be vetted in the validation; if not, it may be worth a discussion with the model developers.

Many machine learning packages provide standard routines to generate feature importance measurements, which should always be reviewed and assessed, from both the quantitative and business intuition perspectives. In certain cases, feature importance measurements should be assessed in tandem with other metrics, such as counts of co-occurrences, to better understand and interpret the business inferences of the important variables.
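
As a minimal sketch of how such a review might look (assuming a scikit-learn style workflow; the data, model and importance values are placeholders for illustration only), a validator could compare the model's built-in importance measure against a permutation-based one:

```python
# Hedged sketch: compare built-in (impurity-based) feature importances with
# permutation importance on held-out data; all data and models are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances can overstate high-cardinality or noisy features
print("Impurity-based:", model.feature_importances_.round(3))

# Permutation importance on held-out data offers a useful cross-check
perm = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
print("Permutation:   ", perm.importances_mean.round(3))
```

Large disagreements between the two measures would be a prompt to dig further into the business intuition of the affected features.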

Conceptual soundness should also cover the overall fit of the modeling framework and algorithm to the designated business use and purposes, taking into consideration the constraints and limitations of the chosen approach. For example – if the model is used for credit scoring that is subject to fair lending regulations, but is built on a completely opaque algorithm (such as an unsupervised neural network), then this should definitely get the validators' attention, since there is a risk that the institution may be put in a bind when asked to explain to the regulators how the model does not lead to discriminatory lending decisions.

2. Model Input

The first roadblock that a validator may encounter in validating machine learning model inputs is the huge volume of data that needs to be mined through. As a first step, it is important to review the data cleansing, mining and transformation process that the developer conducted. For instance, machine learning algorithms generally do not take units into account; therefore feature scaling is important to many models and should be kept under the validators' close scrutiny. Another example is that certain models (such as XGBoost) only take numerical values / vectors as inputs, and thus how the categorical features are treated needs to be examined.

It is important to review and verify that the data distribution of each feature matches expectations (or largely so), especially for algorithms that are very sensitive to deviations from the expected distributions and to outliers (such as AdaBoost or SVM).

The validators should also attempt to replicate, or partially replicate, the data handling process, and parse out the relevant information that is valuable to the model. In many cases, exploratory data analysis is a key step in revealing potential deficiencies related not only to the suitability of the data, but also to the appropriateness of the model choice. For example, for many machine learning algorithms, using correlated features is not a good idea, as it sometimes makes prediction less accurate, and often makes interpretation of the model almost impossible. In such cases, it is worthwhile to review and examine the correlations among the features.
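
A minimal sketch of such exploratory checks, assuming the features sit in a pandas DataFrame (the data here is simulated purely for illustration):

```python
# Hedged sketch: flag highly correlated feature pairs and large scale differences
# in the training features (df is a placeholder DataFrame).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
df["f4"] = 0.95 * df["f1"] + 0.05 * df["f4"]   # simulate a highly correlated pair
df["f3"] = df["f3"] * 1000                      # simulate a very different scale

# Highest absolute pairwise correlations (upper triangle only)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(3))

# Scale differences that would call for feature scaling
print(df.describe().loc[["mean", "std", "min", "max"]])
```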

3. Outcome Analysis

One prominent change in machine learning model risk management is the heavier reliance on outcome analysis. One thorny issue that is frequently raised is over-fitting, as the complex nonparametric and nonlinear methods used in machine learning algorithms are likely to contribute to an over-fitted model. More specifically, the challenge is the trade-off between over-fitting and under-fitting (or variance and bias). While this has always been a problem for statistical / regression models, machine learning models do bring a lot of nuances to it. For example, AdaBoost, like other ensemble methods, may conceptually be subject to high over-fitting risk, yet some empirical evidence does not support this assertion (i.e. certain experimental data indicate that AdaBoost does not appear to have serious over-fitting problems), while there is no solid theoretical explanation for this observation. However miraculous this may appear to be, shall a validator take it for granted that AdaBoost is immune to over-fitting? My recommendation is no.

Without doubt, it is imperative to perform outcome analysis on machine learning models (probably
more so than other types of models). Luckily, it has pretty much become a matter of course for the
machine learning model developers to perform cross validations to assess model outputs, as part of
the model development process, which gives the validators something good to start with. However,
the outcome analysis in a validation may still be no cakewalk or smooth sailing.

First of all, the validators need to review and assess whether appropriate procedures have been
followed in the model development process to ensure outcome analysis is meaningfully conducted,
including the appropriate allocation of the data into the training, cross validation and testing sets –
although not prescribed, the rule of thumb is 60%, 20% and 20% for the three sets, respectively, to
strike a good balance and fit the needs.
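
As a minimal sketch (placeholder data; scikit-learn's splitting utility is only one way to do this), the 60% / 20% / 20% allocation mentioned above can be produced with two successive splits:

```python
# Hedged sketch: a 60/20/20 training / cross-validation / test allocation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_cv), len(X_test))                   # 600 200 200
```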

Second, the validators should construct and review the validation curve, which usually is built based
on a scoring function such as accuracy for classifiers. The scores from both the training set and test
set are then compared to assess the generalization error of the model – if the training score and the
test score are both low, the model demonstrates under-fitting; if the training score is high and the
test score is low, the model may be over-fitted (a low training score and a high test score is usually
not possible).
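
A minimal sketch of such a validation curve, assuming a scikit-learn estimator and simulated data (the hyper-parameter and its range are illustrative only):

```python
# Hedged sketch: training vs. cross-validated scores over one hyper-parameter.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
depths = np.arange(1, 16)
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy")

for d, tr, cv in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    # A large train-vs-CV gap points to over-fitting; both scores low, to under-fitting.
    print(f"max_depth={int(d):2d}  train={tr:.3f}  cv={cv:.3f}  gap={tr - cv:.3f}")
```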

Another useful test is the learning curve, which shows the test and training scores for varying
numbers of training samples. It can help to measure the benefit from adding more training data, and
assess whether the model suffers more from a variance error or a bias error. If the training score is
much greater than the test score for the maximum number of training samples, adding more training samples may improve generalization; but if both the test score and the training score converge to a
value that is too low with increasing size of the training set, more training data may not help.
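
A minimal sketch of such a learning curve check, again with placeholder data and estimator:

```python
# Hedged sketch: learning curve to gauge whether more training data would help.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, cv in zip(sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"n={int(n):5d}  train={tr:.3f}  cv={cv:.3f}")
# A persistent train-vs-CV gap at the largest size suggests more data may help
# (a variance problem); two low, converged scores suggest a bias problem instead.
```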

Third, the validators should attempt to independently design and perform certain tests and outcome
analysis. One example is sensitivity tests on the impact of tunable hyper-parameters (applying such
methods as a grid search) to identify potential issues related to model performance and/or stability.
Other tests like K-fold, tests on sliced data, and so on, may also be effective measures to reveal
hidden deficiencies.
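
As one minimal sketch of such an independent sensitivity test (placeholder data; the estimator and grid are illustrative), a grid search over a few hyper-parameters makes the performance surface visible:

```python
# Hedged sketch: grid search used as a sensitivity test on tunable hyper-parameters.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 3, 5]},
    cv=5, scoring="roc_auc")
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
# A flat score surface suggests stability; sharp cliffs around the chosen values
# warrant a closer look at model stability and the tuning process.
print(results[["param_learning_rate", "param_max_depth",
               "mean_test_score", "std_test_score"]])
```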

Last but not least, the measurements and metrics used in the cross validation need to be carefully examined, since common ones like ROC/AUC may not be the best or most effective diagnostics for certain algorithms or model purposes – for example, if the costs of false negatives vs. false positives are distinct, and it is more critical to minimize one type of classification error (for email spam detection, say, one might want to emphasize minimizing false positives, even if that results in a significant increase in false negatives), ROC/AUC might not be the best criterion. Another example is the choice between ROC and precision-recall curves – usually ROC curves are more appropriate when the observations are balanced across classes, whereas precision-recall curves are more appropriate for imbalanced datasets. Validators need to consider and evaluate such information in determining the appropriateness of the outcome analysis conducted and used by the model developers. Also, depending on the algorithm, there may be other effective measurements (such as Gini and MSE for ensemble methods) that should be considered – if not done by the developers, this should be done by the validators.
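
A minimal sketch of the imbalanced-data point above, with simulated data and a simple classifier standing in for the model under validation:

```python
# Hedged sketch: ROC AUC vs. average precision (a precision-recall summary)
# on a heavily imbalanced classification problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:          ", round(roc_auc_score(y_te, scores), 3))
print("Average precision:", round(average_precision_score(y_te, scores), 3))
# On imbalanced data, ROC AUC can look comfortable while the precision-recall
# view reveals much weaker performance on the minority class.
```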

4. Processing and Implementation

As is well known in the modeling world, the devil is often in the details, and risks are frequently embedded in processing and implementation. This is especially true for machine learning models, given that their results are highly sensitive to tweaks in the code (even subtle ones). For example, in gradient descent, one wants to make sure that the parameters of all the features are updated simultaneously in each iteration to reduce the cost function, and it is an easy coding mistake to breach this requirement (that is, to update the parameters of some features before others), which will impact gradient descent convergence and ultimately the model results.
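
A minimal sketch of the point above (plain NumPy; linear regression is used only for illustration): the correct step updates the whole parameter vector at once, while the commented-out variant shows the mistake of updating parameters in place one by one:

```python
# Hedged sketch: simultaneous parameter update in one gradient-descent step.
import numpy as np

def gradient_step(theta, X, y, lr=0.01):
    m = len(y)
    grad = X.T @ (X @ theta - y) / m     # gradient of the mean squared-error cost
    return theta - lr * grad             # all parameters updated at the same time

# Incorrect pattern (avoid): updating theta[j] in place inside a loop means the
# later components are computed from a partially updated parameter vector.
# for j in range(len(theta)):
#     theta[j] -= lr * ((X @ theta - y) @ X[:, j]) / m
```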

Consequently, processing and implementation of the model, including the code, need to be thoroughly inspected. Such review and test activities should include both direct review of the code (if available) and supplementary tests. For instance, the implementation of the cost function should be carefully reviewed, as there could be embedded risks such as the cost function not reaching the global minimum (i.e. not optimized), not converging, or not decreasing with every iteration. In addition to direct review of the cost function code, validators should also perform procedures such as convergence tests and a review of the learning rate to verify the appropriateness of the cost function implementation.
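
A minimal sketch of such a convergence check (simulated data; in practice the cost history of the actual model run would be used):

```python
# Hedged sketch: verify the cost decreases on (nearly) every iteration and converges.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=200)

theta, lr, costs = np.zeros(3), 0.1, []
for _ in range(500):
    err = X @ theta - y
    costs.append(err @ err / (2 * len(y)))
    theta -= lr * X.T @ err / len(y)

increases = sum(later > earlier for earlier, later in zip(costs, costs[1:]))
print(f"final cost = {costs[-1]:.4f}, iterations where cost increased = {increases}")
# Repeated increases, or a plateau far above expectations, points to a learning
# rate or implementation issue worth raising with the developers.
```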

Implementation should also cover the appropriateness of the specific methodologies selected for the
model (assuming the overall modeling framework and approach have been covered in the conceptual soundness review) – for example, if the chosen model approach is Bagging, it is worth checking
whether the appropriate aggregation methodology (that is, plurality voting for classification, and
averaging for regression) has been implemented.
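
A minimal sketch of the two aggregation rules (a toy array of base-learner outputs stands in for the actual ensemble members):

```python
# Hedged sketch: plurality voting for classification vs. averaging for regression.
import numpy as np

member_predictions = np.array([[1, 0, 1],   # rows: predictions of 4 base learners
                               [1, 1, 1],   # columns: 3 observations
                               [0, 0, 1],
                               [1, 0, 0]])

# Classification: plurality vote (binary here, so a simple majority threshold)
voted = (member_predictions.mean(axis=0) >= 0.5).astype(int)
print("Voted classes:   ", voted)

# Regression: simple average of the base learners' outputs
print("Averaged outputs:", member_predictions.mean(axis=0))
```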

In contrast to traditional models, it is important to consider occupied memory and space, as well as
running time constraints, when assessing the processing and implementation of machine learning
models. If risks are identified (such as too many trees in a Random Forest model, which may lead to
unbearably slow model run and operation), they should be comprehensively evaluated, not only
based on the current model but also taking the long view, with considerations of ongoing model
operations and potential data increase in future model runs.

Although model specifications for machine learning models may seem like "configuration", unit testing can be a critical part of model development and implementation testing, to verify such assertions as "the model can restore from a checkpoint after a mid-training job crash". Honestly, depending on the width and depth of the unit tests, this may be out of a validation's scope; however, validators should at least obtain an understanding of the assertions concerning model processing, implementation and ongoing operation that the model developers are comfortable making (including the evidence).
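
As one hedged sketch of such a unit-style test (an incrementally trained scikit-learn model and joblib serve purely as stand-ins for an institution's actual training and checkpointing stack):

```python
# Hedged sketch: test the assertion "the model can restore from a checkpoint
# after a mid-training crash" with an incremental learner and a saved checkpoint.
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, random_state=0)
classes = np.unique(y)

model = SGDClassifier(random_state=0)
model.partial_fit(X[:1000], y[:1000], classes=classes)   # first half of training
joblib.dump(model, "checkpoint.joblib")                   # checkpoint to disk

restored = joblib.load("checkpoint.joblib")               # simulate a restart
restored.partial_fit(X[1000:], y[1000:])                  # resume training
assert restored.score(X, y) > 0.5, "restored model failed to keep learning"
print("checkpoint restore test passed")
```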

5. Ongoing M&T

Similar to other models, the performance of machine learning models needs to be monitored and tracked, to ensure the models' continued applicability and appropriateness given ever-changing business and market conditions. Obviously, the models need to be retrained / rebuilt at a certain frequency; but there are other ongoing M&T procedures that should be performed in between. For validators, it is important to review and verify that 1) a robust ongoing M&T plan is in place with appropriate procedures designed; and 2) the ongoing M&T plan has been properly executed (if the validation is the initial validation before the model is used in production, only 1) is relevant).

In evaluating both of these, the validators may want to make sure that the following are considered
as part of the ongoing M&T review:

a. Model performance is usually tracked through model errors. Many models produce a set of scores, which oftentimes represent certain probability estimates. One procedure that can be performed is to monitor the score distribution produced by the model (see the sketch after this list) – if there are dramatic and unexpected changes to the distribution, then there should be a process to investigate such occurrences, as this may indicate important changes to the business environment (in the form of changed patterns of input data, different features, and so on) that can undermine the appropriateness of continuing to use the model as is. One example is Autoencoder models, where the reconstruction error should be continuously monitored, and an increase in the error may indicate changes in the subject that the model is applied to (such as transactions, equipment, etc.), which may warrant a revision of the model. In performing such monitoring procedures, the probability distribution of the reconstruction error can be used to identify whether the changes are normal or anomalous.
b. The serving input data run through the models needs to be continuously monitored, and
significant deviations in the data distributions, units, missing values, etc. for all features should be
noted and investigated, to ensure that the model, trained on the training dataset, still captures
the consistent features demonstrated in the serving data.
c. Given that machine learning models usually handle big data with a heavy computational burden, it is also important to monitor computational performance in speed, capacity, and efficiency, through such measurements as latency, throughput and RAM usage. This is especially important for models that receive real-time requests and generate instant outputs. One example of such an ongoing M&T procedure is to log and monitor the elapsed time between the request and the model response, and follow up if the log shows a pattern of lengthening response times.
d. For newly-developed models, it is also advisable to keep a benchmarking / challenger / baseline model and compare performance as part of the ongoing M&T procedures, since deviations in the model results can be a sensitive indicator of malfunctions in the current model. As machine learning models are often developed to replace an old model, this sometimes makes it easy to keep a benchmarking model (conveniently, the old model).
e. In some cases, it is possible and preferable to utilize some machine learning algorithms to help
monitor model performance, such as via continuous anomaly detection, where the model
performance is constantly monitored based on various performance indicators, and abnormal
changes in the number of anomalies are investigated to identify the causes (e.g. algorithm
tuning, bugs, shifts related to certain inputs, etc.).
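
For the score-distribution monitoring mentioned in item a, here is a minimal sketch using a Population Stability Index (PSI) against the training-time score distribution (the scores are simulated; the 0.1 / 0.25 thresholds are a common industry rule of thumb, not a prescription):

```python
# Hedged sketch: PSI between the training-time ("baseline") score distribution
# and recent production scores, binned on baseline quantiles.
import numpy as np

def psi(baseline, current, n_bins=10):
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))[1:-1]  # inner cut points
    b = np.bincount(np.searchsorted(edges, baseline), minlength=n_bins) / len(baseline)
    c = np.bincount(np.searchsorted(edges, current), minlength=n_bins) / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)               # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 5, 50_000)   # scores at model build time (placeholder)
serving_scores = rng.beta(2.5, 5, 5_000)   # recent production scores (placeholder)

value = psi(baseline_scores, serving_scores)
print(f"PSI = {value:.3f}")   # roughly: <0.1 stable, 0.1-0.25 monitor, >0.25 investigate
```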

6. Effective Challenge

Effective challenge aims to identify deficiencies and risks in the model, and to question the
appropriateness to use the model for its designated purposes, rather than supporting the current
model (or “window-dressing”). And these challenges need to be “effective”, that is, any deficiencies
and issues identified during the effective challenge process should reflect, in a meaningful way, the
true risks associated with the model development, implementation and use (MDIU). Effective
challenges may be raised on all the key model components, and in various ways such as
benchmarking and exploratory analysis.

One important way to provide the model developers with effective challenge is benchmarking and
comparison to alternative modeling approaches, and exploration of selected alternative approaches
by performing analysis, tests, and/or re-building the model if possible. With a comprehensive
understanding of machine learning algorithms, the validators should evaluate the use of the selected
model, taking into account the model use, business requirements, identified patterns in the data, as
well as the limitations and constraints of different modeling algorithms under consideration, and
then conclude on whether the selected model is the optimal choice. For example, if the model is used for real-time prediction, and one of the key business requirements is fast running time, then the use of a Random Forest algorithm (which is known to be slow with a large number of trees) may be questionable, and the validators should consider and delve into potential alternatives (such as XGBoost) that may better fit the business use and essentials. Or if the base learner is known to be stable but a Bagging model was used, then the effectiveness of this model choice should be challenged.

Such benchmarking should be conducted at all levels and not just for the choice of overall model approach – it may pertain to the choice of optimization algorithms (e.g. Gradient Descent vs. Newton's Method), detailed methodologies (e.g. ID3, C4.5, C5.0, or CART, for Decision Trees), hyper-parameters (e.g. # of classifiers / iterations for a Bagging model), model performance measurements (e.g. AUC, RMSE, MAE, Log Loss, and Classification Error, for an XGBoost model), and so on.
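
A minimal sketch of a simple challenger comparison on both discriminatory power and scoring latency (placeholder data; scikit-learn's HistGradientBoostingClassifier stands in here for a boosted-tree challenger such as XGBoost):

```python
# Hedged sketch: compare the selected model against an alternative on AUC and latency.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = [("Random Forest", RandomForestClassifier(n_estimators=300, random_state=0)),
              ("Boosted trees", HistGradientBoostingClassifier(random_state=0))]
for name, model in candidates:
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    scores = model.predict_proba(X_te)[:, 1]
    elapsed = time.perf_counter() - start
    print(f"{name}: AUC = {roc_auc_score(y_te, scores):.3f}, scoring time = {elapsed:.2f}s")
```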

Implementation of machine learning models should also be effectively challenged, considering the critical role played by computational efficiency in such models. Compared with the validation of traditional models, more inquiries should be made by the validators into the technical details of model implementation and operation, and discussions should be held on potential enhancements (such as batch vs. stochastic gradient descent, map-reduce and data parallelism, etc.).

Additionally, effective challenge should be conducted with ongoing model operation and performance in mind, as certain machine learning models (such as KNN) may be easy to implement, but as the dataset grows, the efficiency and speed of the algorithm may decline very quickly. Validators should evaluate such risks and raise a flag where warranted.

Effective challenge is not poking around, or splitting hairs. The work needs to be grounded in true risks, and lead to meaningful questions and, better yet, valid recommendations. This requires the validators to be not only familiar with different types of machine learning modeling approaches and algorithms, but also experienced in common industry practice – needless to say, this is a very high bar, and that is why I always consider effective challenge to be the tallest order in a validation.

OK, I have rambled a lot…you may now ask – do you have any brass-tacks recommendations on the
specific focuses, considerations and procedures in validations of machine learning models?

VI. Detailed Recommendations on Machine Learning Model Validations

In addition to the high-level recommendations above, I also attempt to provide more detailed guidance, in the form of suggested focuses, considerations and procedures, on the validation of machine learning models, in accordance with SR 11-7 principles and requirements. Obviously, such detailed discussion can only be meaningful in light of the particular modeling technique and algorithm, given the vastly different characteristics and idiosyncratic risks associated with each type of machine learning model. As we clearly do not have enough space to enumerate all the models and algorithms that are considered "machine learning", I selected the following 16 algorithms that have received the most attention in the market (note that I excluded linear and logistic regressions, since they have prevailed for a long time and already fit SR 11-7 quite well):

• Supervised Learning
✓ Decision Trees
✓ Ensemble Methods
- Bootstrap Aggregating (Bagging)
- Random Forest
- Adaptive Boosting (AdaBoost)
- Gradient Boosting Machine (GBM)
- Extreme Gradient Boosting (XGBoost)
✓ Support Vector Machine (SVM)
✓ K-Nearest Neighbors (KNN)
✓ Neural Network (Supervised)
• Unsupervised Learning
✓ K-means
✓ Principal Component Analysis (PCA)
✓ Multivariate Anomaly Detection
✓ Neural Network (Unsupervised)
- Recurrent Neural Network2 (RNN)
- Self-Organizing Map (SOM)
- Autoencoder
✓ Hierarchical Clustering

The discussion and recommendations follow four themes:


1) Conceptual Soundness
2) Inputs and Parameters
3) Processing and Implementation
4) Outcome Analysis

The detailed content is provided in the Appendix.

VII. Conclusion

In today's world, one has to stay ahead of the curve to thrive, or at least stay on the curve to survive. Banks will have no choice but to embrace the inexorable march of technology, including the development and application of machine learning models, to rev up profits, drive down costs, and control risks. While machine learning models do bring about paradigm-altering changes and unprecedented challenges to model risk management, especially to model validation under the SR 11-7 framework, there are ways to effectively conduct validations and control model risks in compliance with SR 11-7.

And just like machine learning model progression itself, undoubtedly, validation of such models will be an
ever-evolving process.

2 Recurrent Neural Network can be either supervised or unsupervised.

VIII. Appendix – Detailed Recommendations (by ML Algorithm)3

I. SUPERVISED LEARNING
i. Decision Trees

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. The boundary is linear (and therefore linear models may be a better fit)
2. Model purpose not appropriately determined: classification vs. regression
3. Variables are not appropriately selected
Validation Procedures and Considerations:
1. Review the fitting data and assess the linearity of the boundary
2. Review and assess whether the correct model purpose is reflected in modeling (including the aggregation method)
3. Review (and/or independently construct) the reasonableness of the selected variables, both from the importance plot and from business intuition

Inputs and Parameters
Potential Weaknesses and Risks:
1. Prediction accuracy and interpretability are not well balanced (the trade-off driven by tree size)
2. Inappropriate tree size is selected
Validation Procedures and Considerations:
Review the reasonableness of the key assumptions, including the selection of predictors, cut points, the stopping criterion (e.g. # obs in any region), etc.

Processing and Implementation
Potential Weaknesses and Risks:
1. Inappropriate algorithm (e.g. ID3, C4.5, C5.0, CART) is applied
2. Binary split is not appropriately implemented given the model purpose
Validation Procedures and Considerations:
1. Based on the model use, training data, and differences among algorithms (e.g. minimization subject), determine if the rationale of the chosen algorithm stands to reason
2. Review and assess if the correct criterion is used for the binary split, e.g. Residual Sum of Squares (RSS) for regression; Classification Error Rate, Gini Index or Cross-Entropy/Deviance for classification

Outcome Analysis
Potential Weaknesses and Risks:
Model may be over-fitting
Validation Procedures and Considerations:
1. Determine if cost complexity pruning has been applied appropriately, including a review of the tuning parameter α and of the process for determining its optimal value
2. Independently perform cross validation, and assess the model performance and the determination of the optimal α
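
A minimal sketch of the pruning-related procedures above, with placeholder data (scikit-learn's ccp_alpha plays the role of the tuning parameter α):

```python
# Hedged sketch: review the cost complexity pruning path and cross-validate alpha.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)[:-1]   # drop the alpha that prunes to the root

cv_means = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                            X, y, cv=5).mean() for a in alphas]
best = alphas[int(np.argmax(cv_means))]
print(f"{len(alphas)} candidate alphas, cross-validated optimum around {best:.5f}")
```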

3 For reference only – recommendations are not meant to be exhaustive, or applicable to all models with the algorithms in question.

ii. Ensemble Method – Bootstrap Aggregating (Bagging)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. The base learner is stable (e.g. Naïve Bayes, K-Nearest Neighbors with high K), which undermines the reduction of variance
2. Model is not interpretable (less so than the underlying model or decision trees)
3. Variables are not appropriately selected
Validation Procedures and Considerations:
1. Assess the effectiveness of using bagging based on the base learner
2. Understand the base learner, and assess the conceptual soundness (incl. business intuition of the features) of the base model
3. Assess the reasonableness of the selected variables by such procedures as reviewing the variable importance (e.g. through PD plot, MDA, etc.), visual inspection, and/or LIME
4. Assess the business intuition of the selected variables
5. If high correlation exists among trees, and thus the reduction of variance is ineffective, consider and explore potential alternatives (e.g. Random Forest) and try different features in each tree

Inputs and Parameters
Potential Weaknesses and Risks:
1. The data subsets ("bags") are not drawn randomly and with replacement
2. The bagged models are highly correlated, and therefore generate similar errors and undermine the reduction of variance
3. Unreasonable rules applied in modeling (e.g. for ties in plurality voting)
Validation Procedures and Considerations:
1. Replicate the bagging process (using the same basic model, # of trees, # of bags, etc.), and compare and assess the results
2. Review the variance density chart and assess the similarities of errors
3. Understand and assess the rules applied in modeling

Processing and Implementation
Potential Weaknesses and Risks:
1. Inappropriate aggregation (classification should be done by plurality voting, regression should be done by averaging)
2. Inappropriate model complexity hyper-parameters, e.g. # of classifiers / iterations
Validation Procedures and Considerations:
1. Review the model code, and assess the aggregation based on the model purpose
2. Perform prediction error analysis (using the same training and testing sets), review the variance / bias trade-off, and assess the level of complexity

Outcome Analysis
Potential Weaknesses and Risks:
Model may be subject to high bias, high variance (inadequate reduction), or both
Validation Procedures and Considerations:
1. Perform out-of-bag tests (replication at least, and independently design and perform if feasible)
2. Perform quantitative measurements (e.g. AUC, Gini, MSE, etc.), and review the learning curve
3. Perform back tests if feasible
4. Design and perform sensitivity / scenario tests

iii. Ensemble Method – Random Forest

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. Random Forest may not be the best model choice if there are not many features / variables but abundant samples
2. Variables are not appropriately selected, and the variable importance measurements are not reliable (when the data includes categorical variables that vary in their scale of measurement or their number of categories)
3. The modeling process is relatively black-box and hard to interpret
Validation Procedures and Considerations:
1. Review the model fitting data and assess the model choice in light of the model purpose and fitting process (predictor candidates)
2. Study the potential variables, and determine the level of risk of unreliable variable importance measurement and variable selection based on the characteristics of the variables
3. Assess the business intuition of the selected variables – focus on all the variables rather than the selected predictors at each split
4. If variable selection is subject to high risk, consider and explore potential solutions (such as conditional permutations)

Inputs and Parameters
Potential Weaknesses and Risks:
1. Random Forest is less interpretable and transparent; as such, the model fitting process is in a black box with little assurance of a proper fitting process (or too much randomness)
2. The hyper-parameters are not appropriately determined, which impacts model performance
Validation Procedures and Considerations:
1. Replicate / re-run the model fitting process (if the random_state hyper-parameter was not defined in the original model, define it and run twice, ideally on different devices and in separate code)
2. Understand and review the model developer's process to set up and fine-tune the hyper-parameters; assess the reasonableness of the process; re-perform the fine-tuning process; and re-run the model using different hyper-parameters and compare model performance

Processing and Implementation
Potential Weaknesses and Risks:
1. The model is over-sized and contains too many trees, causing large occupancy of memory and long running time
2. A large number of trees may make the algorithm slow for real-time prediction (i.e. the model can be fast to train, but slow to use)
Validation Procedures and Considerations:
1. Review and assess the parameters that determine the number and size of the trees (such as the number of estimators / trees and the max number of features to split a node), and assess the impact on model processing
2. If the model is used for real-time prediction, evaluate the model run (that is, prediction) time against the business requirements

Outcome Analysis
Potential Weaknesses and Risks:
Although Random Forest is less subject to the risk of over-fitting, it remains a theoretical risk
Validation Procedures and Considerations:
1. Perform out of sample / cross validation tests (replication at least, and independently design and perform if feasible)
2. Perform quantitative measurements (e.g. AUC, Gini, MSE, etc.), and review the learning curve
3. Perform back tests if feasible
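
A minimal sketch of the replication step above (placeholder data): with random_state pinned, two refits of the same Random Forest specification should agree exactly:

```python
# Hedged sketch: fix random_state and refit twice to confirm reproducibility.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

fit_a = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
fit_b = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

print("identical predictions:", np.array_equal(fit_a.predict_proba(X),
                                               fit_b.predict_proba(X)))
print("importances match:    ", np.allclose(fit_a.feature_importances_,
                                            fit_b.feature_importances_))
```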

iv. Ensemble Method – AdaBoost

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
Variables are not appropriately selected
Validation Procedures and Considerations:
1. Assess the reasonableness of the selected variables by such procedures as reviewing the variable importance (e.g. through PD plot, MDA, etc.), visualization, and/or LIME
2. Assess the business intuition of the selected variables

Inputs and Parameters
Potential Weaknesses and Risks:
Noisy data (especially with uniform noise) and outliers may undermine model performance
Validation Procedures and Considerations:
Review the training data and assess the noisiness of the data

Processing and Implementation
Potential Weaknesses and Risks:
1. As a sequential algorithm based on re-weighting of misclassified samples from the previous weak learner, the model can be prohibitively slow if the data size (n) and/or the number of features (d) is too large
2. Inappropriate aggregation (classification should be done by plurality voting, regression should be done by averaging)
3. Weak classifiers are too weak, leading to low margins
Validation Procedures and Considerations:
1. Review n and d, and assess if the model choice is optimal
2. Independently re-run the model and observe the running time, if feasible
3. Review the model code, and assess the aggregation based on the model purpose
4. Calculate and review margins (e.g. % of margins < 0.5, minimum margin) by # of iterations, and assess the confidence of classifications
5. Consider / explore possible solutions for acceleration (e.g. XGBoost)

Outcome Analysis
Potential Weaknesses and Risks:
Although AdaBoost usually does not lead to over-fitting in practice, it remains a theoretical risk, especially when the data is noisy and/or the weak learners are complex
Validation Procedures and Considerations:
1. Perform out of sample / cross validation tests (replication at least, and independently design and perform if feasible)
2. Perform quantitative measurements (e.g. AUC, Gini, MSE, etc.), and review the learning curve
3. Perform back tests if feasible
4. If feasible, design and perform sensitivity / scenario tests

v. Ensemble Method – Gradient Boosting Machines (GBM)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
Variables are not appropriately selected
Validation Procedures and Considerations:
1. Assess the reasonableness of the selected variables by such procedures as reviewing (and/or replicating) the variable importance (e.g. through PD plot, MDA, etc.), visualization, and/or LIME
2. Assess the business intuition of the selected variables

Inputs and Parameters
Potential Weaknesses and Risks:
1. GBMs are harder to tune than other ensemble algorithms (such as Random Forest), and inappropriate selection of the hyper-parameters (typically number of trees, depth of trees and learning rate) may lead to model under-performance
2. The way that GBM handles missing data may lead to under-performance of the model. For example, GBM is usually based on CART, which commonly handles missing data through surrogate splitting; this may not be appropriate given the specific data
3. GBM is sensitive to noisy data
Validation Procedures and Considerations:
1. Review the selection process and methodology for the hyper-parameters
2. Replicate the hyper-parameter selection (depending on the specific methodologies)
3. If feasible, perform sensitivity tests on hyper-parameters and assess the model performance
4. Review the methodology to handle missing data, and assess its appropriateness

Processing and Implementation
Potential Weaknesses and Risks:
1. As a sequential algorithm based on residuals from the previous predictor, it can be prohibitively slow if the data size (n) and/or the number of features (d) is too large
2. Inappropriate aggregation (classification should be done by plurality voting, regression should be done by averaging)
3. Training generally takes longer (as the trees are built sequentially), which may make model fitting / re-fitting impractical
Validation Procedures and Considerations:
1. Review n and d, and assess if the model choice is optimal
2. Independently re-run the model and observe the running time, if feasible
3. Review the model code, and assess the aggregation based on the model purpose
4. Replicate the model and assess the appropriateness of the hyper-parameters (usually number of trees, depth of trees and learning rate) given the trade-off with running time
5. Consider / explore possible solutions for acceleration (e.g. XGBoost)

Outcome Analysis
Potential Weaknesses and Risks:
Model may be over-fitting (especially when data is noisy)
Validation Procedures and Considerations:
1. Perform out of sample / cross validation tests (replication at least, and independently design and perform if feasible)
2. Perform quantitative measurements (e.g. AUC, Gini, MSE, etc.), and review the learning curve
3. Perform back tests if feasible
4. If feasible, design and perform sensitivity / scenario tests

vi. Ensemble Method – Extreme Gradient Boosting (XGBoost)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
Variables are not appropriately selected
Validation Procedures and Considerations:
1. Assess the reasonableness of the selected variables by such procedures as reviewing (and/or replicating) the variable importance (e.g. through PD plot, MDA, etc.), visualization, and/or LIME
2. Assess the business intuition of the selected variables

Inputs and Parameters
Potential Weaknesses and Risks:
1. As XGBoost models only take numerical values / vectors as inputs, categorical features may be inappropriately treated
2. The hyper-parameters are not set appropriately, which leads the model astray from optimal performance
Validation Procedures and Considerations:
1. Review the procedure and methods used to treat categorical features, including transformation (incl. addition, simplification, etc., if applicable), data cleaning, and encoding
2. Understand and review the model developer's process to set up and fine-tune the hyper-parameters, such as the learning rate, number of estimators, tree-specific parameters (e.g. max depth), subsample size, and regularization parameter; assess the reasonableness of the process; re-perform the fine-tuning process; and re-run the model using different hyper-parameters and compare model performance

Processing and Implementation
Potential Weaknesses and Risks:
The custom optimization objectives and evaluation criteria defined by the developer do not fit the purpose and use of the model, or are not appropriately coded
Validation Procedures and Considerations:
1. Review the cost functions and assess their reasonableness based on the objectives
2. Review the code and assess if the functions are correctly coded

Outcome Analysis
Potential Weaknesses and Risks:
Model may be over-fitting
Validation Procedures and Considerations:
1. Perform out of sample / cross validation tests (replication at least, and independently design and perform if feasible)
2. Perform quantitative measurements (e.g. AUC, RMSE, MAE, Log Loss, Classification Error, etc.), and review the learning curve
3. Perform back tests if feasible
4. If an early stopping approach is applied, review the approach (including the code and parameters like early stopping rounds); if not, consider applying the approach, independently re-run the model, and compare results

vii. Support Vector Machine (SVM)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. Model performance may be significantly impacted by the # of features (n) and the amount of training data (m)
2. The SVM model may not produce the outputs needed to fit the designated model use (as the SVM hypothesis is a discriminator function producing binary outputs)
3. Automatic feature selection may lead to a lack of interpretability of business intuition, especially when the number of features is very large
Validation Procedures and Considerations:
1. Review the training data and the features, and assess:
a. the appropriateness of using SVM (e.g. vs. Logistic Regression), as well as the use of kernels (e.g. if n is large, use of kernels may not be appropriate; if n is small and m is intermediate, SVM with a Gaussian kernel may be appropriate)
b. whether there is a need to create more features if n is small and m is large
2. Assess the use of SVM vs. other algorithms (e.g. Logistic Regression) against the model use and purposes (alternative methods should be considered if estimation of probabilities with moderate confidence is needed)
3. Document the key functionality of the regularization embedded in the SVM algorithm, and explain the lack of effectiveness of pre-modeling feature selection / reduction for model performance (however, if feasible, perform exploratory analysis, including refitting the model with feature selection, and assess the result)

Inputs and Parameters
Potential Weaknesses and Risks:
1. Noisy data (not completely separable by a hyperplane) may lead to a poor solution for the maximal-margin classifier
2. Data is not linearly separable, and the hard margin problem is not solvable
Validation Procedures and Considerations:
1. Review and assess the fitting data
2. Determine if application of a soft margin and/or kernels is needed and has been applied
3. Assess the need to expand the features and include transformations (e.g. polynomials)
4. Review the appropriateness of the hyper-parameters, including the regularization parameter, slack variable, and standard deviation (for the Gaussian kernel)

Processing and Implementation
Potential Weaknesses and Risks:
The model choice (such as the use of kernels) is not appropriately reflected in the implementation
Validation Procedures and Considerations:
1. Review the software package (e.g. LIBLINEAR, LIBSVM) used in implementation, and assess its appropriateness based on the understood model choice
2. Review the implementation of the selected methodology, including data preparation (e.g. whether feature scaling has been performed before using the Gaussian kernel)
3. If possible, replicate the model fitting process

Outcome Analysis
Potential Weaknesses and Risks:
Non-optimal balance between over-fitting and under-fitting
Validation Procedures and Considerations:
1. Perform out of sample / cross validation tests (replication at least, and independently design and perform if feasible)
2. Design and perform quantitative measurements (e.g. ROC/AUC), and review the learning curve
3. Perform back tests if feasible
4. Review such hyper-parameters as the regularization parameter and standard deviation (when using the Gaussian kernel), and test the performance of the chosen parameters (e.g. by trialing different ones)
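
A minimal sketch of the feature-scaling check above (placeholder data; the pipeline ensures the scaler is fit only on training folds):

```python
# Hedged sketch: effect of feature scaling before an SVM with a Gaussian (RBF) kernel.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
X[:, 0] *= 1000.0   # simulate one feature on a very different scale

unscaled = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=5).mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
                         X, y, cv=5).mean()
print(f"CV accuracy without scaling: {unscaled:.3f}, with scaling: {scaled:.3f}")
```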

viii. K-Nearest Neighbors (K-NN)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. As an instance-based learning algorithm, KNN does not perform explicit generalization or create an abstraction from specific instances (i.e. "not modeled", per se); therefore it lacks clarity and interpretability
2. KNN does not perform well with a large number of features
Validation Procedures and Considerations:
1. Focus on the features when assessing the business intuition
2. Review the number of features and determine if KNN is the optimal model choice

Inputs and Parameters
Potential Weaknesses and Risks:
1. Independent variables in the training data are measured in different units and not standardized, which will lead to inaccurate calculation of distances
2. KNN is sensitive to outliers, and therefore model performance may be undermined by outliers
3. Missing values may not be treated appropriately
Validation Procedures and Considerations:
1. Determine whether the variables are in the same units, and if not, whether feature scaling and standardization have been applied
2. Examine the fitting data for outliers, and assess the risk posed by outliers
3. Understand and assess the approach taken by the model developers to treat missing values

Processing and Implementation
Potential Weaknesses and Risks:
1. As an instance-based learner, KNN might be easy to implement; but as the dataset grows, the efficiency or speed of the algorithm may decline very quickly to an unacceptable level
2. The prediction is implemented erroneously and does not fit the model purpose (classification vs. regression)
3. The entire training dataset needs to be utilized (and stored), which may cause inefficiencies in implementation
Validation Procedures and Considerations:
1. Assess the potential for additional data that will need to be fit into the model, and the resulting challenge of model running time
2. Review the code and verify that the prediction method implemented (e.g. plurality voting vs. averaging) fits the model purpose
3. Consider and explore potential algorithms (such as Learning Vector Quantization, or LVQ) which may better fit the data and business requirements

Outcome Analysis
Potential Weaknesses and Risks:
Non-optimal balance between over-fitting and under-fitting (e.g. due to an inappropriate k-value)
Validation Procedures and Considerations:
1. Review and/or independently construct error curves, and assess the appropriateness of the k-value
2. Perform out of sample / cross validation tests (replication at least, and independently design and perform if feasible)
3. Design and perform quantitative measurements (e.g. ROC/AUC), and review the learning curve
4. Perform back tests if feasible

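As an illustration of the error-curve review above, the sketch below (a minimal example assuming scikit-learn and a synthetic dataset rather than the institution's data) traces cross-validated error against the k-value, with features standardized before distances are computed.

# Hypothetical sketch: cross-validated error curve over k for a K-NN classifier,
# with standardization applied so that distance calculations are not unit-driven.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=10, random_state=1)

k_values = range(1, 31)
cv_errors = []
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    cv_errors.append(1.0 - accuracy)   # error rate = 1 - accuracy

# The validator would review this curve (and ideally compare it with the
# developer's) to judge whether the chosen k sits near the error minimum.
best_k = k_values[int(np.argmin(cv_errors))]
print("k with lowest cross-validated error:", best_k)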
ix. Neural Network (Supervised)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
Relatively "black box" with little interpretability and a lack of theoretical underpinning
Validation Procedures and Considerations:
1. Assess the importance of interpretability in modeling – if it is crucial, alternative algorithms should be considered
2. Review the labels applied to the training samples and interpret the business intuition of the labels; review the training data and assess the accuracy of the data labels

Inputs and Parameters
Potential Weaknesses and Risks:
1. Inadequately or inappropriately labeled data – neural networks usually require many more labeled samples than other supervised algorithms
2. Inappropriately tuned hyper-parameters may cause issues such as over-fitting, unbearable running time, etc.
3. Inappropriate / insufficient treatment of certain features (such as the need to encode categorical integer features as a one-hot numeric array) may lead to errors in model fitting or performance measurement
Validation Procedures and Considerations:
1. Review and assess the abundance of the training data; if data is inadequate, consider and explore potential alternatives (e.g. Naive Bayes) that better handle smaller datasets
2. Review the hyper-parameters – at this stage, focus on the predefined ranges (for such parameters as # of hidden units per layer, dropout, learning rates, kernel / filter size, padding, etc.) and the choices provided by the developer (for such parameters as # of layers, # of epochs, activation function, max # of models to train, etc.) for optimization. Assess whether the ranges and choices provided to tune the hyper-parameters are in line with normal practice.

Processing and Implementation
Potential Weaknesses and Risks:
1. The training outcome can be nondeterministic and depend crucially on the choice of initial parameters (e.g. the starting point for gradient descent when training back-propagation networks)
2. The model may suffer from catastrophic interference / forgetting
Validation Procedures and Considerations:
1. Inquire about and understand the model implementation process, including any trial-and-error efforts, parameter tuning considerations, improvement from the base model, etc.
2. Consider and explore potential solutions to alleviate catastrophic forgetting (e.g. elastic weight consolidation)
3. Review and verify that there is an ongoing monitoring plan to ensure model performance as new information is obtained and entered into the model
4. Review the model code used to tune and optimize the hyper-parameters
5. Based on the specific features of the modeled objects, identify the needed treatments (e.g. conversion of the categorical cross-entropy output to one-hot encoded form) in implementation, and verify whether these treatments have been appropriately coded

Outcome Analysis
Potential Weaknesses and Risks:
1. The metrics chosen to measure performance (e.g. classification error / accuracy, kappa coefficient, precision, recall, F-measure, etc.) do not match the problem under discussion, and therefore do not provide indicative information
2. The model may be over-fitting
Validation Procedures and Considerations:
1. Evaluate the metrics used to measure model performance in model development, and assess their appropriateness
2. Perform / re-perform cross validation; review and assess the metrics in the training, validation and test sets
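To make the feature-treatment and metric review concrete, here is a minimal sketch assuming scikit-learn, synthetic data, and a small multi-layer perceptron as a stand-in for the institution's network; it one-hot encodes a categorical integer feature and reports several of the metrics named above on a held-out test set.

# Hypothetical sketch: one-hot encoding of a categorical feature plus a comparison
# of several performance metrics on a held-out test set.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 1000
X_num = rng.normal(size=(n, 5))                 # numeric features
X_cat = rng.integers(0, 4, size=(n, 1))         # categorical integer feature
X = np.hstack([X_num, X_cat])
y = (X_num[:, 0] + 0.5 * (X_cat[:, 0] == 2)
     + rng.normal(scale=0.5, size=n) > 0).astype(int)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), list(range(5))),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [5]),   # one-hot numeric array
])
model = Pipeline([
    ("prep", preprocess),
    ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy :", round(accuracy_score(y_test, pred), 3))
print("precision:", round(precision_score(y_test, pred), 3))
print("recall   :", round(recall_score(y_test, pred), 3))
print("F1       :", round(f1_score(y_test, pred), 3))
print("kappa    :", round(cohen_kappa_score(y_test, pred), 3))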
II. UNSUPERVISED LEARNING

i. K-means

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. The number of clusters is not optimal (from quantitative and/or business intuition perspectives)
2. The automatically determined optimal cluster number does not fit business needs or requirements
Validation Procedures and Considerations:
1. Review the selected set of clusters, and the business interpretation and use of the clusters (if available)
2. Construct and review the elbow chart to determine if the # of clusters is reasonable

Inputs and Parameters
Potential Weaknesses and Risks:
1. The clusters are non-spherical
2. The clusters have uneven numbers of points
3. The inputs include non-numerical data, outliers, or missing values
(K-means either cannot appropriately handle, or is sensitive to, input data with the above features)
Validation Procedures and Considerations:
1. Review the input data and determine whether the assumptions required for K-means (clusters are spherical and of a similar size) are met
2. Review and assess the data cleaning and/or transformation process (e.g. transforming non-uniform data sets to uniform data sets), including the risk of losing the original features of the data

Processing and Implementation
Potential Weaknesses and Risks:
The number of iterations for big data sets may be huge and generate an unbearable computational load that exceeds RAM limits
Validation Procedures and Considerations:
1. Review the size of the dataset and the number of iterations, and assess the computational load (and the risk of a computer crash when running / re-running the model)
2. If the number of iterations appears unreasonably high, consider and explore potential alternative implementation details (such as adding criteria to stop iterations)

Outcome Analysis
Potential Weaknesses and Risks:
The model outputs are inconsistent across iterations due to implementation errors (although some inconsistency is expected for K-means)
Validation Procedures and Considerations:
1. Run multiple iterations and review the magnitude of the difference in results
2. Fix the random starting value (e.g. by fixing the random seed) and check whether the inconsistencies persist
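The elbow-chart and seed-fixing checks above can be replicated with a few lines of code; the following is a minimal sketch assuming scikit-learn and an illustrative blob dataset in place of the institution's data.

# Hypothetical sketch: elbow chart data (within-cluster sum of squares vs. k)
# and a fixed-seed consistency check for K-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # placeholder data

# Elbow chart: inspect where the decline in inertia flattens out.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  within-cluster SSE={km.inertia_:.1f}")

# Consistency check: with the random seed fixed, repeated runs should give
# identical assignments; remaining differences would point to implementation issues.
labels_a = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print("Identical assignments with fixed seed:", bool((labels_a == labels_b).all()))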
ii. Principal Component Analysis (PCA)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. Data vectors after transformation lose their original business meaning and are hard to interpret
2. PCA is used for purposes other than dimension reduction (e.g. for selecting fewer features to prevent overfitting)
3. The approximation (loss of certain original information) of the training data is not fully disclosed to, and understood by, stakeholders
4. Non-linear feature representations exist (which cannot be learned by PCA)
Validation Procedures and Considerations:
1. The business interpretation of the principal components resulting from PCA (e.g. the main drivers of interest rate curve movement) should be discussed with the model owner, reviewed for reasonableness, and benchmarked to common industry recognition if possible
2. The purpose and use of PCA should be carefully examined; this often includes the combination of, and dependencies between, PCA and other models used to achieve the business objectives
3. Review the model documentation and verify the adequacy of disclosure of the impact of the PCA application; ensure full documentation in the validation report
4. Consider alternatives (such as an Autoencoder) if the data necessitates a highly non-linear feature representation

Inputs and Parameters
Potential Weaknesses and Risks:
1. Noisy data / outliers may distort PCA's identification of the most significant variables
2. Classes are too fine-grained (which usually does not work well with PCA)
3. There is no linear relationship among the variables
4. There are categorical variables
5. The data is not adequate to perform effective and reliable PCA
Validation Procedures and Considerations:
1. Perform tests (e.g. entropy measurement) to quantify the dataset homogeneity
2. Review the input data features and assess the suitability for PCA
3. Review and test the linear relationship among variables (e.g. via a scatterplot matrix); explore potential alternatives (such as non-linear PCA with kernels)
4. Review and determine sample adequacy (e.g. does the dataset include a minimum of 150 cases, or 5 to 10 cases per variable); and/or perform related sample adequacy tests (e.g. the KMO test)

Processing and Implementation
Potential Weaknesses and Risks:
1. Features are on different scales (on which PCA does not perform well)
2. Reconstruction from the compressed representation is not appropriately performed
Validation Procedures and Considerations:
1. Verify that mean normalization and feature scaling have been performed before processing
2. Review the code for reconstruction and verify that the correct inputs (i.e. the selected eigenvectors and the corresponding compressed representations) are used

Outcome Analysis
Potential Weaknesses and Risks:
1. The selected number of principal components (k) is not optimal
2. The resulting PCs are not appropriately reviewed and interpreted in association with the original variables (a risk that should be considered under Conceptual Soundness as well)
3. The model output cannot be directly validated
Validation Procedures and Considerations:
1. k should be selected based on a pre-determined threshold (i.e. a target proportion of explained variance) – review both this process and the threshold
2. Validate the selected k using the diagonal of the matrix obtained from the SVD of the covariance matrix
3. Review the scree plot and verify the selection of the optimal k
4. Verify the interpretation of the PCs based on the coefficients (magnitude and direction) of the original variables in the eigenvectors
5. If PCA is applied in combination with other methodologies / algorithms (e.g. Mahalanobis distance, or MD) to achieve a certain business objective (e.g. anomaly detection), the outcome analysis should cover the final outputs that address the actual business needs (e.g. monitor the detected anomalies based on PCA+MD)
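As a concrete companion to the explained-variance and reconstruction checks above, the sketch below (assuming scikit-learn and placeholder data) standardizes the inputs, selects k from an illustrative target proportion of explained variance, and measures the resulting reconstruction error.

# Hypothetical sketch: choose k from a target explained-variance threshold,
# review the (scree-style) variance profile, and check the reconstruction error.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))                      # placeholder for the actual data
X_scaled = StandardScaler().fit_transform(X)        # mean normalization + feature scaling

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
threshold = 0.90                                    # illustrative target proportion
k = int(np.searchsorted(cumulative, threshold)) + 1
print("Explained variance by component:", np.round(pca.explained_variance_ratio_, 3))
print(f"Smallest k reaching {threshold:.0%} explained variance: {k}")

# Reconstruction check: compress to k components and map back to the original space.
pca_k = PCA(n_components=k).fit(X_scaled)
X_reconstructed = pca_k.inverse_transform(pca_k.transform(X_scaled))
reconstruction_mse = np.mean((X_scaled - X_reconstructed) ** 2)
print("Mean squared reconstruction error:", round(float(reconstruction_mse), 4))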
iii. Multivariate Anomaly Detection

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. Depending on the specific algorithm, certain assumptions regarding the input distribution (e.g. multivariate Gaussian) may be applied
2. The cause of an anomaly needs to be identified (which cannot be revealed through this algorithm)
Validation Procedures and Considerations:
1. Perform tests on the distribution assumptions (e.g. skewness/kurtosis tests, the Henze-Zirkler test, etc.)
2. Obtain a clear understanding of the business needs and requirements, and assess the suitability of the model
3. Focus on the selected features in assessing business intuition

Inputs and Parameters
Potential Weaknesses and Risks:
1. There is a relatively large number of anomalies
2. Anomalies are homogeneous in types and causes
3. Future anomalies are expected to be similar to existing ones (those in the training samples)
(Multivariate anomaly detection is not the best fit for the above types of anomalies)
Validation Procedures and Considerations:
1. Review the training samples and summarize the characteristics of the anomalies in question
2. Clearly understand the expectations of future anomalies
3. Assess the suitability of the algorithm based on the understanding of the characteristics of the anomalies in question, and consider alternatives that may better fit such anomalies (such as supervised learning)

Processing and Implementation
Potential Weaknesses and Risks:
1. The model may be computationally expensive if there are many features
2. The covariance matrix may be non-invertible (if the # of features exceeds the sample size)
Validation Procedures and Considerations:
1. Review the number of features and assess the computational load
2. Verify that the model can be processed (the covariance matrix is invertible)

Outcome Analysis
Potential Weaknesses and Risks:
The model may be over-fitted or under-fitted due to various factors (e.g. the distribution assumption, an inappropriate threshold, etc.)
Validation Procedures and Considerations:
1. Review and/or independently perform outcome analysis (e.g. an out-of-sample test) and assess model performance via such metrics as the confusion matrix, precision/recall curve, F-score, etc.
2. Consider additional measurements (such as Mahalanobis distance) when certain distribution assumptions are not strictly met
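For the threshold and precision/recall review above, the following is a minimal sketch assuming SciPy and scikit-learn, a multivariate-Gaussian density model, and a small labeled hold-out set for threshold selection; the data, threshold grid, and variable names are all illustrative rather than taken from any specific implementation.

# Hypothetical sketch: fit a multivariate Gaussian to "normal" training data,
# flag low-density points as anomalies, and pick the density threshold that
# maximizes the F-score on a labeled validation set.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))                       # assumed non-anomalous data
X_val = np.vstack([rng.normal(size=(200, 3)),
                   rng.normal(loc=4.0, size=(10, 3))])    # validation set with outliers
y_val = np.r_[np.zeros(200), np.ones(10)]                 # 1 = anomaly

mu = X_train.mean(axis=0)
cov = np.cov(X_train, rowvar=False)                       # must be invertible
density = multivariate_normal(mean=mu, cov=cov)

p_val = density.pdf(X_val)
best = None
for eps in np.quantile(p_val, np.linspace(0.01, 0.20, 20)):   # candidate thresholds
    pred = (p_val < eps).astype(int)
    score = f1_score(y_val, pred, zero_division=0)
    if best is None or score > best[0]:
        best = (score, eps, pred)

score, eps, pred = best
print(f"Chosen threshold epsilon={eps:.2e}, F1={score:.2f}, "
      f"precision={precision_score(y_val, pred, zero_division=0):.2f}, "
      f"recall={recall_score(y_val, pred, zero_division=0):.2f}")

If the multivariate-Gaussian assumption is not strictly met, the same threshold-selection logic can be applied to Mahalanobis distances instead of density values, in line with the additional measurement suggested above.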
iv. Neural Network (Unsupervised) – Overall

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
Nearly complete "black box" with no interpretability
Validation Procedures and Considerations:
1. Assess the importance of interpretability in modeling – if it is crucial, alternative algorithms should be considered
2. Focus on the characteristics of, and/or differences among, the examples provided to train the models, i.e. the features. For example, in a Hamming-MAXNET algorithm, explain the feature vectors and their business intuition (impact on the different clusters)

Inputs and Parameters
Potential Weaknesses and Risks:
Inappropriately tuned hyper-parameters causing issues like over-fitting, unbearable running time, etc.
Validation Procedures and Considerations:
Review the hyper-parameters for optimization – at this stage, focus on the ranges pre-defined, and the choices provided, by the developer. Assess whether the ranges and choices provided to tune the hyper-parameters are in line with normal practice.

Processing and Implementation
Potential Weaknesses and Risks: Depending on the specific algorithms
Validation Procedures and Considerations: Depending on the specific algorithms

Outcome Analysis
Potential Weaknesses and Risks: Depending on the specific algorithms
Validation Procedures and Considerations: Depending on the specific algorithms
v. Neural Network (Unsupervised) – Recurrent Neural Network (RNN)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. RNNs require significant computational power and time, making them an appropriate choice only in specific situations where their long-term memory can and needs to be utilized
2. There are a large number of structures which can be used for RNNs, and an incorrect structure, if selected, may lead to model under-performance (given the specific business needs and requirements)
Validation Procedures and Considerations:
1. Assess the appropriateness of the RNN application – e.g. is the application a temporal progression prediction? If not, an alternative algorithm should be considered
2. Assess the appropriateness of the selected RNN structure for the given application

Inputs and Parameters
Potential Weaknesses and Risks:
1. RNNs can become chaotic, especially when run without input, or settle on point attractors
2. As a deep learning technique, RNNs can have an extreme number of parameters and as such require significant input data
3. Inappropriate / insufficient treatment of certain features (such as the need to encode categorical integer features as a one-hot numeric array) may lead to errors in model fitting or performance measurement
Validation Procedures and Considerations:
1. Perform sensitivity analysis and assess the model's propensity to become chaotic or settle on point attractors
2. Review and assess the abundance of the training data; if data is inadequate, consider and explore potential alternatives that better handle less data
3. Review and assess the treatment of feature formatting to ensure that all data has been properly encoded for neural network processing

Processing and Implementation
Potential Weaknesses and Risks:
1. Recurrent neural networks are prone to exploding or vanishing gradients
2. The training outcome can be nondeterministic and depend crucially on the choice of initial parameters
3. RNNs are very expensive to train and should therefore only be run for as many epochs as absolutely necessary
Validation Procedures and Considerations:
1. Examine gradient values throughout the training process to ensure that training has converged to a reasonable solution without a vanishing / exploding gradient
2. Start the training process with a different set of initial parameters to ensure that a similar training solution (i.e., roughly global) has been reached
3. Inquire about and understand the model implementation process, including any trial-and-error efforts, parameter tuning considerations, improvement from the base model, etc.
4. Review the model code used to tune and optimize the hyper-parameters
5. Plot the learning curve to verify that the model is not being trained over more epochs than is absolutely necessary to continue to make meaningful improvements in the loss function

Outcome Analysis
Potential Weaknesses and Risks:
1. Given their depth and ability to remember over very long periods of time, RNNs are extremely prone to over-fitting
2. The metrics chosen to measure performance (e.g. classification error / accuracy, kappa coefficient, precision, recall, F-measure, etc.) do not match the problem under discussion, and therefore do not provide indicative information
Validation Procedures and Considerations:
1. Examine learning curves for in-sample and cross-validation data to assess model over-fitting
2. Evaluate the metrics used to measure model performance in model development, and assess their reasonableness
3. Review and verify that there is an ongoing M&T plan to ensure model performance as new information is obtained and entered into the model
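Several of the checks above (bounding gradients, reviewing the learning curve, and stopping once the loss stops improving) can be scripted. The sketch below is a minimal illustration assuming a Keras-based implementation and randomly generated sequence data rather than the institution's actual inputs.

# Hypothetical sketch: train a small RNN with gradient clipping and early stopping,
# then inspect the learning curve for signs of over-fitting or wasted epochs.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 20, 4)).astype("float32")   # (samples, timesteps, features)
y = (X[:, :, 0].mean(axis=1)
     + rng.normal(scale=0.1, size=512)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(16, input_shape=(20, 4)),
    tf.keras.layers.Dense(1),
])
# clipnorm bounds the gradient norm, mitigating exploding gradients.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),
              loss="mse")

history = model.fit(
    X, y,
    validation_split=0.2,
    epochs=100,
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                restore_best_weights=True)],
    verbose=0,
)

# Learning-curve review: a widening gap between the two losses suggests over-fitting;
# a flat tail suggests the model was run for more epochs than necessary.
for epoch, (tr, va) in enumerate(zip(history.history["loss"],
                                     history.history["val_loss"]), start=1):
    print(f"epoch {epoch:3d}  train loss {tr:.4f}  validation loss {va:.4f}")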
vi. Neural Network (Unsupervised) – Self-Organizing Map (SOM)

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. The model is used for inappropriate objectives – SOM should mainly be used for initial data analysis, visualization and preparation (e.g. dimension and noise reduction)
2. The model training (mapping) process is hard to interpret
Validation Procedures and Considerations:
1. Carefully review the model use and purpose, including the combination with, and dependencies on, other models / algorithms in achieving the business objectives
2. Explain the algorithm at a high level (e.g. based on "similarities"), which may include the features of the input vectors

Inputs and Parameters
Potential Weaknesses and Risks:
1. The input data contains categorical data and/or mixed types of data (which SOM does not work well with)
2. Key hyper-parameters (e.g. learning parameters, map topology, map size) are not appropriately selected
Validation Procedures and Considerations:
1. Review the types of the samples and assess the homogeneity
2. Inquire about the process used to determine the hyper-parameters, and assess the appropriateness of the selections (benchmark to common practice if possible)

Processing and Implementation
Potential Weaknesses and Risks:
The model may be very computationally expensive, especially for a large map (the complexity scales quadratically with the number of map units)
Validation Procedures and Considerations:
1. Review the number of map units and assess the computational load
2. Consider and inquire about the potential increase in the future entropy of the samples and the need for larger maps (which may lead to implementation challenges)

Outcome Analysis
Potential Weaknesses and Risks:
1. The (topological) properties of the input data are not well preserved through the SOM
2. Overfitting is possible if the quantization error is very low (with a very low threshold, and with adequate neurons and time)
Validation Procedures and Considerations:
1. Perform tests on the model output to validate and verify the preservation of the input properties, such as cluster visual inspection (via a distance matrix) and measurement of quantization and topological errors
2. Perform tests (such as 2-sample tests) to compare the distributions of the training data and the set of neurons (and assess the convergence of the maps based on that comparison)
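The quantization-error and topological-error measurements above can be computed directly from a trained map's weight grid. The sketch below is a minimal NumPy illustration in which `codebook` stands in for the trained SOM weights and `data` for the input vectors, both of which are assumptions since no particular SOM library or dataset is prescribed here.

# Hypothetical sketch: quantization error (mean distance to the best matching unit)
# and topographic error (share of samples whose two best units are not neighbors),
# computed from a trained SOM weight grid.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))              # placeholder input vectors
codebook = rng.normal(size=(8, 8, 5))         # assumed trained 8x8 SOM weight grid

rows, cols, dim = codebook.shape
flat = codebook.reshape(rows * cols, dim)
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])

quantization, topographic_misses = [], 0
for x in data:
    dist = np.linalg.norm(flat - x, axis=1)   # distance to every map unit
    first, second = np.argsort(dist)[:2]      # best and second-best matching units
    quantization.append(dist[first])
    # Units are treated as "neighbors" if adjacent (including diagonally) on the grid.
    if np.abs(grid[first] - grid[second]).max() > 1:
        topographic_misses += 1

print("Quantization error :", round(float(np.mean(quantization)), 4))
print("Topographic error  :", round(topographic_misses / len(data), 4))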
vii. Neural Network (Unsupervised) – Autoencoders

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. Extremely uninterpretable – understanding and explaining the latent features of non-visual data is almost impossible
2. Certain variants of the Autoencoder (e.g. Variational Autoencoders) apply assumptions pertaining to the data distributions (e.g. multivariate Gaussian)
Validation Procedures and Considerations:
1. The business purposes and use should be thoroughly understood and carefully examined – if the business requirements include transparency and interpretation, alternative approaches should be used
2. If distribution assumptions are applied, such assumptions should be reviewed against the actual data and other conditions (incl. business considerations)

Inputs and Parameters
Potential Weaknesses and Risks:
1. The Autoencoder does not work well with certain types of data (e.g. image compression)
2. Key hyper-parameters (e.g. learning parameters, # of hidden layers, latent dimensions) are not appropriately selected, which will impact model performance
Validation Procedures and Considerations:
1. Review the types of the samples and assess the suitability of the model
2. Inquire about the process used to determine the hyper-parameters, and assess the appropriateness of the selections (benchmark to common practice if possible)

Processing and Implementation
Potential Weaknesses and Risks:
The model is computationally expensive to train
Validation Procedures and Considerations:
1. Review the number of features and the data size, and assess the computational load
2. Understand the ongoing M&T plan, given the business needs and conditions, and assess the practicality of re-training the model on a periodic basis, considering the frequency and magnitude of the increase in the size of the training samples in the future

Outcome Analysis
Potential Weaknesses and Risks:
1. The model may be over-fitted
2. The dimension of the latent representation is not appropriate to achieve the proper level of property retention
3. When used as a feature learning algorithm, the model may become an unstable predictor and lead to generalization error
Validation Procedures and Considerations:
1. Review and understand the methods (such as regularization) applied by the model developer to mitigate overfitting, and test the impact on model performance (e.g. via a sensitivity test)
2. Perform outcome analysis tests, e.g. monitoring the reconstruction error from a testing set (which could be based on the distribution of the reconstruction errors)
3. Consider and explore potential solutions to reduce generalization error (e.g. integrating autoencoder-based feature learning with ensemble methods like Bagging)
4. If certain methods (e.g. regularization) are used to control overfitting, such methods should be examined
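The reconstruction-error monitoring described above can be set up along the lines of the following minimal sketch, which assumes a small Keras dense autoencoder and synthetic data, and flags test observations whose reconstruction error exceeds a high percentile of the training-error distribution (the 99th percentile here is purely illustrative).

# Hypothetical sketch: monitor autoencoder reconstruction errors on a test set
# against a threshold taken from the training-error distribution.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 12)).astype("float32")   # placeholder data
X_test = rng.normal(size=(200, 12)).astype("float32")

n_features, latent_dim = 12, 3
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(latent_dim, activation="relu"),   # latent representation
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(n_features),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=20, batch_size=32, verbose=0)

def reconstruction_errors(model, X):
    reconstructed = model.predict(X, verbose=0)
    return np.mean((X - reconstructed) ** 2, axis=1)

train_errors = reconstruction_errors(autoencoder, X_train)
test_errors = reconstruction_errors(autoencoder, X_test)

threshold = np.percentile(train_errors, 99)     # illustrative monitoring threshold
breaches = int((test_errors > threshold).sum())
print(f"Monitoring threshold (99th pct of training errors): {threshold:.4f}")
print(f"Test observations breaching the threshold: {breaches} of {len(test_errors)}")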
viii. Hierarchical Clustering

Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
The cluster dendrogram resulting from the model is not clearly explainable, or is counter-intuitive
Validation Procedures and Considerations:
Review the interpretation of the resulting dendrogram and hierarchy based on business intuition; if such an effort was not made by the model owner, the validator should attempt to explore the features of the clusters (based on the records in the clusters) and align them with business intuition

Inputs and Parameters
Potential Weaknesses and Risks:
The training data does not have a hierarchical structure, and the model results may not be meaningful or explainable
Validation Procedures and Considerations:
Review the data structure, and if warranted, consider and explore potential alternatives that may better fit the data (e.g. K-means for spherical data)

Processing and Implementation
Potential Weaknesses and Risks:
1. Inappropriate similarity functions are used for the input data
2. Constructing a dendrogram on a large dataset may be very computationally expensive (as the time complexity of hierarchical clustering is quadratic), and thus may lead to very slow computation or even a computer crash
Validation Procedures and Considerations:
1. Review the similarity functions in the code, and verify that appropriate ones are applied according to the type of data (e.g. Jaccard similarity or cosine similarity for categorical variables)
2. Review the size of the dataset along with computer performance metrics and assess the computational load (and the risk of a computer crash when running / re-running the model)
3. Understand the potential increase in the dataset for the next run of the model, and assess the risk of an unbearable computational load in future model runs

Outcome Analysis
Potential Weaknesses and Risks:
The model outputs are not replicable due to implementation errors (consistent results are expected for hierarchical clustering)
Validation Procedures and Considerations:
Replicate the model training process and investigate if consistent results cannot be obtained
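As a brief illustration of the replication and similarity-function checks above, the sketch below (assuming SciPy and placeholder numeric data; the linkage method and number of clusters are illustrative) builds the linkage matrix behind the dendrogram, cuts it into flat clusters, and confirms that a repeated run reproduces the same assignments.

# Hypothetical sketch: build a hierarchical clustering, extract flat clusters,
# and confirm that re-running the deterministic procedure reproduces the results.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))          # placeholder numeric records

# Linkage matrix behind the dendrogram; the distance metric and linkage method are
# what the validator reviews (a Jaccard-type measure would suit categorical data).
Z = linkage(X, method="ward", metric="euclidean")
labels_first = fcluster(Z, t=3, criterion="maxclust")

# Replication check: hierarchical clustering is deterministic, so an independent
# re-run on the same data should return identical cluster assignments.
labels_second = fcluster(linkage(X, method="ward", metric="euclidean"),
                         t=3, criterion="maxclust")
print("Cluster sizes:", np.bincount(labels_first)[1:])
print("Replication consistent:", bool((labels_first == labels_second).all()))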