SR 11-7, Validation and Machine Learning Models
November 2020

All views and opinions expressed in this article are my own ([email protected])
Table of Contents
I. Overview
II. Machine Learning! Machine Learning! Machine Learning!
III. What Are the New Wrinkles?
IV. Stumbling Blocks to SR 11-7 Execution
V. What Should Validations Cover, and How?
VI. Detailed Recommendations on Machine Learning Model Validations
VII. Conclusion
VIII. Appendix – Detailed Recommendations (by ML Algorithm)
I. Overview
In this article I attempt to explore the changes and challenges brought by machine learning models to
model risk management, especially model validations, for institutions under the Fed SR 11-7 /
OCC 2011-12 (“SR 11-7”) regime. I also provide detailed recommendations on validation focuses,
considerations and procedures, following the SR 11-7 principles and requirements, for 16 prevailing
machine learning algorithms.
This article can be considered as an extension of the discussions in Chapter 9, “Model Risk Management
in the Current Environment”, authored by me, of “Commercial Banking Risk Management: Regulation in
the Wake of the Financial Crisis” (Palgrave Macmillan, 2017, ISBN 978-1-137-59442-6).
II. Machine Learning! Machine Learning! Machine Learning!
It seems overnight that we are seeing this juggernaut of machine learning coming down the pike – but is
it really such a new and novel wonder? The answer is, not necessarily.
There is no consensus about the purview of “machine learning” models; however, it is commonly
recognized that plenty of different modeling algorithms and techniques can be classified as
“machine learning”, including some fairly “traditional” approaches, such as linear and logistic
regressions, which have been quite prevalent in financial modeling and widely applied by banks in the
past decades. On the other hand, the breakthroughs in computing and data management
technologies did pave the way for a huge upsurge of other machine learning algorithms, as we have seen
in recent years.
While machine learning has been a predominant discussion in such industries as healthcare, automotive,
energy, entertainment, mass media, and so on, the banking industry started testing the waters only
recently, in areas like fraud detection, loss forecasting, marketing, portfolio management, trading,
underwriting, etc. In many (if not most) of the cases, such explorations are in the form of Proof of
Concept (PoC) or challenger models, and the validation from the second line of defense, applying the
principles and requirements from SR 11-7, has not become part of the bread and butter for Model Risk
Management (MRM) teams. This, however, may change soon, along with the growing use and application
of machine learning models, and the need for such models to be less esoteric.
III. What Are the New Wrinkles?
As mentioned, there is a plethora of machine learning models, which may be based on distinct algorithms
and methodologies; and as such, they bring along divergent and unique characteristics and nuances. But
by and large, the most conspicuous features of machine learning models that differentiate them from
traditional modeling practices include the following:
1. “Black-Box” – most, if not all, machine learning algorithms are opaque and hard to interpret. Indeed,
whether the algorithms are based on greedy optimization (such as gradient descent), or matrix
decomposition, or forward / backward propagation, or what have you, they all share one trait in
common: imperviousness to outsiders, even to those who actually run or use (or, as in many cases,
develop) the models.
2. Model Assumptions – in machine learning models, business assumptions usually play a much smaller
role, as compared to traditional models, especially the assumption-based or rule-driven models (such
as many regulatory compliance models). This is also quite different from models that rely heavily on
market information (such as valuation or interest rate term structure models). The key assumptions
for machine learning models are usually the hyper-parameters developed and tuned to be applied in
the model. Accordingly, the “business intuition” of the variables is also blurred, to say the least.
3. Data Immensity – while not all algorithms require large amounts of data to build on, the
exponential growth of data now available for modeling is undeniably one of the main drivers of
the proliferation of machine learning models. Such data is usually massive, complex, and often has
multifarious features. For supervised learning, data labeling is oftentimes a head-scratcher, too.
4. Outcome Analysis – machine learning models do give more weight to outcome analysis as the most
pivotal control in assessing the performance of the models. However, such outcome analysis is
focused on cross validation (and other variants), while some other conventional procedures (such as
back-testing) may not be relevant anymore.
5. Ongoing Monitoring and Tracking (M&T) – simply put, ongoing M&T is not feasible for certain
machine learning models. For certain algorithms (such as PCA), there is pretty much no effective way
(or, quite frankly, necessity) to continuously monitor the model on a stand-alone basis. Another
example – for neural networks that suffer from catastrophic interference, traditional ongoing
monitoring procedures are almost certain to be ineffective without re-training the model.
IV. Stumbling Blocks to SR 11-7 Execution
Then what do all these mean for model validation, and are we supposed to take the rough with the smooth?
Needless to say, machine learning is no magic bullet, nor does it bring the Midas touch to anyone. It does
come with a panoply of risks that need to be effectively managed. In the meantime, a number of new
challenges to model validation, especially under the SR 11-7 framework, are ushered in. Most notably,
such challenges include:
1. Conceptual Soundness – given the opaque model training process, black-box algorithms,
nontransparent variable selection procedures, and a number of other factors, assessment of the
conceptual soundness of machine learning models may be quite an uphill battle. This is especially
challenging for unsupervised learning algorithms – imagine that one has to explain the “business
intuition” when a K-means algorithm automatically groups all the data (unlabeled) into different
clusters, which do not represent any business interpretations but are purely based on the “distance”
among the data…a real bear I guess.
Another challenge is the lack of repeatability of many machine learning models, that is, the infeasibility
of precisely replicating the model’s outputs. Unlike, say, an OLS or ARIMA model, for which a validator
can reasonably expect to be able to exactly match the model’s coefficients if given the same data,
many machine learning models spit out different results every time they are run, and such differences are
non-deterministic. Such models can be tested only indirectly, e.g. by testing the conceptual
soundness of the selected features and assumptions, and by evaluating the process and outputs.
2. Inputs – as mentioned earlier, machine learning models oftentimes are built on humongous data that
is disorganized and unstructured. That not only poses challenges to the data cleaning and parsing
process, which can be truly burdensome and irksome, but also raises the risk of unbefitting model
choice, especially for algorithms that are sensitive to noisy data (such as AdaBoost, GBM, SVM, KNN,
etc.). Accordingly, the validators need to gauge both risks when assessing model inputs, which
requires thorough understanding and comprehensive considerations of not only the training data,
but also the model limitations.
3. Assumptions – the challenges pertaining to model assumptions are two-fold: lack of business
reasoning, and difficulty in evaluating the hyper-parameters. In a traditional validation, the former is
usually critical in determining whether the model is out of line with its business purposes – for
example, is the assumption that “any recoveries from credit losses will be collected in X months after
default” reasonable, based on common business practice? Such review and assessment, however, is
now infeasible if business inferences are lacking from machine learning model assumptions.
As to the hyper-parameters, the challenge is that they are too “subjective”, even though they are
usually tuned before completion of model training. There are, of course, some common practices (or
“rules of thumb”) for determining certain hyper-parameters such as the learning rate and
regularization, which in the meantime raise the bar for the validators’ familiarity with the
corresponding models and algorithms.
4. Processing – for machine learning models, it is essential to keep in mind the processing time
constraints and the model’s limitations in handling large amounts of data. For certain models,
performance may deteriorate by an order of magnitude as training data increases, which may be
reflected not only in unbearable computation time but also in unacceptable model results. Therefore it
is imperative for the validator to know the algorithms and techniques inside and out, especially their
limitations.
5. Implementation – although there are a lot of wonderful programming languages and packages, such
as Python (with all its handy libraries), to facilitate many types of machine learning modeling, coding is
still a key and high-risk part of model development and implementation. In other words, machine learning
models, given their nuanced nature and characteristics, are highly subject to implementation errors.
6. Ongoing M&T – it goes without saying that various commonly-seen ongoing M&T procedures, such
as “ongoing monitoring of pricing errors”, or “regular reconciliation to market information”, or
“periodic back-test to new deals”, and so on, may not be effective procedures to capture the model
risk associated with many types of machine learning algorithms. There is no one-size-fits-all ongoing
M&T solution for all models, and this is particularly true for machine learning models.
Other model validation components are not listed here, not because they are not affected by the new
challenges from machine learning models, but because running the gamut here does not add much more
value – for example, will the validation procedures and activities need to be adjusted for machine
learning model output? How about model reporting? Model documentation? Model governance and
controls?... The answer is of course yes to all these – but you’ve already got the point.
V. What Should Validations Cover, and How?
Nevertheless, machine learning models are still models, and are subject to model risks that need to be
effectively identified, measured, and controlled. SR 11-7, as a high-level guidance for practitioners to
control model risk, is still largely applicable (although some level of modifications and updates to SR 11-7
might become inevitable somewhere down the road). Accordingly, the SR 11-7 guidelines can, and
should, be applied to validations of machine learning models.
Now, how can model validators apply SR 11-7 components to machine learning model validations?
First of all, there are a lot of model validation components, based on SR 11-7, that do not have to be
shaken up just to fit into machine learning models, as the traditional validation procedures and activities
can still be effective to gauge the corresponding model risks. On the other hand, one has to admit that,
given certain unique features of many machine learning algorithms, some of the SR 11-7 requirements
cannot be executed without significant variations and interpretations.
All in all, this is largely a case-by-case situation, and fitting the machine learning model validation into a
Procrustean bed does not help anyone and should never be the solution. Throughout the course of a
validation, validators should not deviate from the objective of identifying, measuring, and reporting true
risks in models, and that is not different for machine learning models.
Some general considerations and discussion points specific to machine learning model validation are
provided below – this is not meant to be all-inclusive and cover all the validation components, given the
limited space. More detailed recommendations related to the topics below, based on different machine
learning algorithms, are also provided in the next section and the Appendix.
1. Conceptual Soundness
Under SR 11-7, conceptual soundness assessment usually covers the theoretical background of the
modeling framework and approaches, benchmarking analysis to potential alternatives, business
intuition, business interpretation of the selected variables (including signs of coefficients), and overall
fit of the model to its designated business purposes and use. These are still executable for many
machine learning models, especially supervised learning algorithms.
For example, most ensemble methods, such as Bagging, Random Forest, GBM, AdaBoost, and
XGBoost, all need to start with a predetermined set of features / variables. The models are
based on the concept that the modeled objects (classification, regression, or other problems) are
driven by certain selected features that are assumed to be relevant, regardless of the way by which
the data is selected and the training methodology is tweaked (such as randomly selecting a subset of
data for each tree, or changing the weights of weak learners in each run, or shifting subsets of
features iteratively, or whatnot). Therefore, to assess the conceptual soundness of such models, the
validators should review the full set of variables and focus on the assumed relevance.
Another case in point is the variable selection process, which sometimes happens completely behind the
scenes for machine learning models. There are algorithms that incorporate automatic feature
selection, such as via regularization (e.g. SVM), which is a key functionality to ensure model
performance when there are a large number of features. In such cases, validators should clearly
document such functionalities, and ensure that the audience understands that model integrity is not
necessarily undermined by such lack of interpretability given the specific condition – as we know,
automatic variable selection is not totally new to many validators (remember stepwise and similar
methods?). On the other hand, it would be ideal to have a qualitative variable selection
process in addition to the machine-pick procedure – if there is one, it certainly should be vetted in
the validation; if not, it may be worth a discussion with the model developers.
Many machine learning libraries do provide standard functions to generate feature importance
measurements, which should always be reviewed and assessed, from both the quantitative and
business intuition perspectives. In certain cases, feature importance measurements should be
assessed in tandem with other metrics, such as counts of co-occurrences, to better understand
and interpret the business inferences of the important variables.
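As an illustration, the minimal sketch below (Python with scikit-learn and pandas assumed; the data and feature names are synthetic stand-ins, not from any actual model) compares a model's built-in impurity-based importances with permutation importances on a hold-out set, so that both rankings can be reviewed against business intuition rather than relying on a single measure.

```python
# Minimal sketch (scikit-learn / pandas assumed): compare built-in impurity-based
# importances with permutation importances on a hold-out set. Data is synthetic.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])  # illustrative names
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
perm = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)

report = pd.DataFrame({
    "impurity_importance": model.feature_importances_,
    "permutation_importance": perm.importances_mean,
}, index=X.columns).sort_values("permutation_importance", ascending=False)
print(report)  # review the ranking of each feature against business intuition
```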
Conceptual soundness should also cover the overall fit of the modeling framework and algorithm to
the designated business use and purposes, taking into consideration the constraints and limitations
of the chosen approach. For example, if the model is used for credit scoring that is subject to fair
lending regulations, but is built on a completely opaque algorithm (such as an unsupervised neural
network), then this should definitely draw the validators’ attention, since there is a risk that the
institution may be put in a bind when asked to explain to the regulators how the model does
not lead to discriminatory lending decisions.
2. Model Input
The first roadblock that a validator may encounter in validating machine learning model inputs is the
huge volume of data that needs to be mined through. It is important to review, as the first step, the
data cleansing, mining and transformation process that the developer conducted. For instance, machine
learning algorithms generally do not take units into account; therefore feature scaling is important to
many models, and should be kept under the validators’ close scrutiny. Another example is that
certain models (such as XGBoost) only take numerical values / vectors as inputs, and thus how the
categorical features are treated needs to be examined.
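To illustrate the kind of treatment a validator would look for, here is a minimal sketch (pandas and scikit-learn assumed, with xgboost as the downstream learner if it is installed; the column names and data are purely illustrative) that one-hot encodes categorical features and scales numeric ones before they reach an algorithm that only accepts numerical inputs.

```python
# Minimal sketch (pandas / scikit-learn assumed; xgboost assumed installed as an example
# of a numeric-only learner). Column names and data are illustrative only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

data = pd.DataFrame({
    "loan_amount": [100_000, 250_000, 80_000, 150_000],   # numeric
    "ltv":          [0.80, 0.95, 0.60, 0.75],              # numeric
    "product_type": ["fixed", "arm", "fixed", "arm"],      # categorical
    "default_flag": [0, 1, 0, 1],
})
X, y = data.drop(columns="default_flag"), data["default_flag"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amount", "ltv"]),                  # feature scaling
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["product_type"]),  # one-hot encoding
])
pipeline = Pipeline([("prep", preprocess), ("model", XGBClassifier())])
pipeline.fit(X, y)  # the validator verifies this treatment matches the developer's code
```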
It is important to review and verify that the data distribution of each feature matches expectations
(or largely so), especially for algorithms that are very sensitive to certain deviations from the
expected distributions and to outliers (such as AdaBoost or SVM).
The validators should also attempt to replicate, or partially replicate, the data handling process, and
parse out the relevant information that is valuable to the model. In many cases, exploratory data
analysis is a key step to reveal potential deficiencies related not only to the suitability of the data, but
also to the appropriateness of the model choice. For example, for many machine learning algorithms,
using correlated features is not a good idea, as it sometimes makes prediction less accurate, and
often makes interpretation of the model almost impossible. In such cases, it is worthwhile to review
and examine the correlations among the features.
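A minimal exploratory sketch along these lines (pandas and numpy assumed; the 0.9 threshold is illustrative, not a standard) summarizes each feature's distribution and flags highly correlated feature pairs for follow-up with the model developers.

```python
# Minimal EDA sketch (pandas / numpy assumed): summarize feature distributions and flag
# highly correlated feature pairs. The correlation threshold is illustrative only.
import numpy as np
import pandas as pd

def eda_checks(X: pd.DataFrame, corr_threshold: float = 0.9) -> dict:
    summary = X.describe().T          # count, mean, std, min, quartiles, max per feature
    summary["skew"] = X.skew()        # large values hint at heavy tails / outliers
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle
    flagged = [(r, c, upper.loc[r, c])
               for r in upper.index for c in upper.columns
               if pd.notna(upper.loc[r, c]) and upper.loc[r, c] >= corr_threshold]
    return {"summary": summary, "high_correlation_pairs": flagged}

# Illustrative usage with synthetic, deliberately correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
X_demo = pd.DataFrame({"f1": base,
                       "f2": base + rng.normal(scale=0.1, size=1000),
                       "f3": rng.normal(size=1000)})
print(eda_checks(X_demo)["high_correlation_pairs"])  # f1 / f2 should be flagged
```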
3. Outcome Analysis
One prominent change in machine learning model risk management is the heavier reliance on
outcome analysis. One thorny issue that is frequently raised is the over-fitting problem, as complex
nonparametric and nonlinear methods used in machine learning algorithms are likely to contribute to
an over-fitted model. But more specifically, it is the challenge of the trade-off between over-fitting
and under-fitting (or variance and bias). While this has always been a problem for statistical /
regression models, machine learning models do bring a lot of nuances to it. For example,
conceptually AdaBoost, like other ensemble methods, may be subject to high over-fitting risk, yet
some empirical evidence does not support this assertion (i.e. certain experimental data indicates that
AdaBoost does not appear to have serious over-fitting problems), while there is no solid theoretical
explanation for this observation. However miraculous this may appear to be, should a validator
take it for granted that AdaBoost is immune to over-fitting? My recommendation is no.
Without doubt, it is imperative to perform outcome analysis on machine learning models (probably
more so than other types of models). Luckily, it has pretty much become a matter of course for the
machine learning model developers to perform cross validations to assess model outputs, as part of
the model development process, which gives the validators something good to start with. However,
the outcome analysis in a validation may still be no cakewalk or smooth sailing.
First of all, the validators need to review and assess whether appropriate procedures have been
followed in the model development process to ensure outcome analysis is meaningfully conducted,
including the appropriate allocation of the data into the training, cross validation and testing sets –
although not prescribed, the rule of thumb is 60%, 20% and 20% for the three sets, respectively, to
strike a good balance and fit the needs.
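As a minimal sketch of that rule of thumb (scikit-learn assumed, with synthetic data standing in for the developer's dataset), the 60% / 20% / 20% allocation can be replicated with two chained splits:

```python
# Minimal sketch of a 60% / 20% / 20% train / validation / test split (scikit-learn
# assumed). Stratification preserves the class mix for classification problems.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the developer's modeling dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.40, random_state=42, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)
# Result: 60% training, 20% validation (for tuning), 20% held-out testing.
```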
Second, the validators should construct and review the validation curve, which usually is built based
on a scoring function such as accuracy for classifiers. The scores from both the training set and test
set are then compared to assess the generalization error of the model – if the training score and the
test score are both low, the model demonstrates under-fitting; if the training score is high and the
test score is low, the model may be over-fitted (a low training score and a high test score is usually
not possible).
Another useful test is the learning curve, which shows the test and training scores for varying
numbers of training samples. It can help to measure the benefit from adding more training data, and
assess whether the model suffers more from a variance error or a bias error. If the training score is
much greater than the test score for the maximum number of training samples, adding more training
samples may improve generalization; but if both the test score and the training score converge to a
value that is too low with increasing size of the training set, more training data may not help.
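Both curves can be built with standard tooling; the minimal sketch below (scikit-learn assumed, with a Random Forest on synthetic data standing in for the model under validation) prints the paired training and cross-validated scores that the validator would then interpret as described above.

```python
# Minimal sketch of validation and learning curves (scikit-learn assumed); a Random
# Forest classifier on synthetic data stands in for the model under validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve, validation_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# Validation curve: training vs. cross-validated score across a hyper-parameter range.
depths = [2, 4, 8, 16, None]
train_scores, cv_scores = validation_curve(
    model, X, y, param_name="max_depth", param_range=depths, cv=5, scoring="accuracy")
for d, tr, cv in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f}, cv={cv:.3f}")  # a large gap suggests over-fitting

# Learning curve: scores for increasing training-set sizes.
sizes, train_scores, cv_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")
for n, tr, cv in zip(sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f}, cv={cv:.3f}")  # both converging low: more data won't help
```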
Third, the validators should attempt to independently design and perform certain tests and outcome
analysis. One example is sensitivity tests on the impact of tunable hyper-parameters (applying such
methods as a grid search) to identify potential issues related to model performance and/or stability.
Other tests like K-fold, tests on sliced data, and so on, may also be effective measures to reveal
hidden deficiencies.
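For instance, a minimal independent sensitivity test (scikit-learn assumed; the model and hyper-parameter grid are illustrative) can be run as a K-fold grid search, with the spread of cross-validated scores across the grid indicating how sensitive performance is to the tuning choices.

```python
# Minimal independent sensitivity test (scikit-learn assumed): grid-search a few key
# hyper-parameters with K-fold cross validation and inspect the spread of scores.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 3, 5]},
    cv=5, scoring="roc_auc")
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)[
    ["param_learning_rate", "param_max_depth", "mean_test_score", "std_test_score"]]
print(results.sort_values("mean_test_score", ascending=False))
# A wide spread of mean_test_score across the grid indicates performance that is highly
# sensitive to the tuning choices, which the validator should follow up on.
```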
Last but not least, the measurements and metrics used in the cross validation need to be carefully
examined, since common ones like ROC/AUC may not be the best and most effective diagnostics
given certain algorithms or model purposes – for example, if the costs of false negatives vs. false
positives are distinct, and it is more critical to minimize one type of classification error (for
email spam detection, say, one might want to emphasize minimizing false positives, even if that results in
a significant increase of false negatives), ROC/AUC might not be the best criterion. Another example is
the choice between ROC and precision-recall curve – usually ROC curves are more appropriate when
the observations are balanced between each class, whereas precision-recall curves are more
appropriate for imbalanced datasets. Validators need to consider and evaluate such information in
determining the appropriateness of the outcome analysis conducted and used by the model
developers. Also, depending on the algorithms, there may be other effective measurements (such as
Gini and MSE for ensemble methods) that should be considered – if not done by the developers, it
should be done by the validators.
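The point can be made concrete with a small comparison (scikit-learn assumed; the data is synthetic and heavily imbalanced by construction) of ROC AUC against average precision, the area under the precision-recall curve.

```python
# Minimal sketch (scikit-learn assumed): on an imbalanced dataset, compare ROC AUC with
# average precision (area under the precision-recall curve). Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20, weights=[0.98, 0.02],
                           random_state=0)  # roughly 2% positive class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

print("ROC AUC:          ", round(roc_auc_score(y_test, scores), 3))
print("Average precision:", round(average_precision_score(y_test, scores), 3))
# On imbalanced data the two metrics can tell very different stories; the validator should
# check that the chosen metric matches the business cost of each type of error.
```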
4. Processing and Implementation
As is well known in the modeling world, the devil is often in the detail, and risks are frequently
embedded in processing and implementation. This is especially true for machine learning models,
given that their results are highly sensitive to tweaks in the code (even subtle ones). For example, in
gradient descent, one wants to make sure that the parameters of all the features are updated
simultaneously in each iteration to reduce the cost function; it is an easy mistake in coding to breach
this requisite (that is, to update the parameters of some features before others), which will impact
gradient descent convergence and ultimately the model results.
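To make the pitfall concrete, here is a minimal sketch (plain numpy, for a linear regression with squared-error cost; it is not drawn from any particular model's code) contrasting the correct simultaneous update with the buggy sequential pattern described above.

```python
# Minimal sketch (numpy) of one gradient-descent step for linear regression with a
# squared-error cost. The correct version updates all parameters simultaneously; the
# buggy version updates them one at a time, reusing already-updated values.
import numpy as np

def gd_step_correct(theta, X, y, lr=0.01):
    grad = X.T @ (X @ theta - y) / len(y)   # full gradient at the current theta
    return theta - lr * grad                # all parameters updated together

def gd_step_buggy(theta, X, y, lr=0.01):
    theta = theta.copy()
    for j in range(len(theta)):
        grad_j = X[:, j] @ (X @ theta - y) / len(y)  # uses partially-updated theta
        theta[j] -= lr * grad_j                      # breaks the simultaneous-update rule
    return theta
```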
Consequently, the processing and implementation of the model, including the code, needs to be
thoroughly inspected. Such review and test activities should include both direct review of the code (if
available) and supplementary tests. For instance, the implementation of the cost function should be
carefully reviewed, as there could be embedded risks such as the cost function not reaching the
global minimum (i.e. not optimized), not converging, or not decreasing with every iteration. In addition
to direct review of the cost function code, validators should also perform such procedures as
convergence tests and a review of the learning rate to verify the appropriateness of the cost function
implementation.
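One simple supplementary test is to log the cost at every iteration and confirm that it is non-increasing and eventually flattens; a minimal sketch (plain numpy, again using linear regression with a squared-error cost as a stand-in) is shown below.

```python
# Minimal convergence check (numpy): run gradient descent for linear regression, log the
# cost per iteration, and flag any iteration where the cost increases (a sign of an
# implementation error or an overly large learning rate). Data is synthetic.
import numpy as np

def check_convergence(X, y, lr=0.05, n_iter=1000):
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / len(y)        # full gradient at current theta
        theta = theta - lr * grad                    # simultaneous update
        costs.append(np.mean((X @ theta - y) ** 2) / 2)
    increases = sum(later > earlier for earlier, later in zip(costs, costs[1:]))
    print(f"final cost={costs[-1]:.6f}, iterations where cost increased={increases}")
    return costs

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 3))]   # intercept + 3 standardized features
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
costs = check_convergence(X, y)
# A non-zero increase count, or a cost still falling steeply at the last iteration,
# suggests the learning rate or iteration count deserves a closer look.
```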
Implementation should also cover the appropriateness of the specific methodologies selected for the
model (assuming the overall modeling framework and approach have been covered in the conceptual
soundness review) – for example, if the chosen model approach is Bagging, it is worth checking
whether the appropriate aggregation methodology (that is, plurality voting for classification, and
averaging for regression) has been implemented.
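The aggregation rule itself is straightforward to replicate as part of such a check; a minimal sketch (numpy assumed, with the individual learners' predictions taken as given) of the two conventions:

```python
# Minimal sketch (numpy) of Bagging aggregation: plurality voting for classification and
# averaging for regression, given a (n_learners x n_observations) array of predictions.
import numpy as np

def aggregate_classification(pred_matrix: np.ndarray) -> np.ndarray:
    # Plurality vote across learners; class labels assumed to be non-negative integers.
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, pred_matrix)

def aggregate_regression(pred_matrix: np.ndarray) -> np.ndarray:
    # Simple average across learners for each observation.
    return pred_matrix.mean(axis=0)

# Illustrative usage: 3 learners, 3 observations.
preds = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [1, 1, 0]])
print(aggregate_classification(preds))  # -> [0 1 0]
```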
In contrast to traditional models, it is important to consider memory and storage footprint, as well as
running-time constraints, when assessing the processing and implementation of machine learning
models. If risks are identified (such as too many trees in a Random Forest model, which may lead to
unbearably slow model run and operation), they should be comprehensively evaluated, not only
based on the current model but also taking the long view, with considerations of ongoing model
operations and potential data increase in future model runs.
Although model specifications for machine learning models may seem like “configuration”, unit
testing can be a critical part of model development and implementation tests, to ensure such
assertions as “a model can restore from a checkpoint after a mid-training job crash”. Honestly,
depending on the width and depth of the unit tests, this may be out of a validation’s scope; however,
validators should at least obtain an understanding of such assertions concerning model processing,
implementation and ongoing operation that the model developers are comfortable to make
(including the supporting evidence).
5. Ongoing M&T
Similar to other models, the performance of machine learning models needs to be monitored and
tracked, to ensure the models’ continued applicability and appropriateness given the ever-changing
business and market conditions. Obviously, the models need to be retrained / rebuilt at a certain
frequency; but there are other ongoing M&T procedures that should be performed in between. For
validators, it is important to review and verify that 1) a robust ongoing M&T plan is in place with
appropriate procedures designed; and 2) the ongoing M&T plan has been properly executed (if the
validation is the initial validation before the model is used in production, only 1) is relevant).
In evaluating both of these, the validators may want to make sure that the following are considered
as part of the ongoing M&T review:
a. The model performance is usually tracked through model errors. Many models produce a set of
scores, which oftentimes represent certain probability estimates. One procedure that can be
performed is to monitor the score distribution produced by the model – if there are dramatic and
unexpected changes to the distribution, then there should be a process to investigate such
occurrences, as this may indicate some important changes to the business environment (in the
form of changed patterns of input data, different features, and so on) that can undermine the
appropriateness of continuing to use the model as is (a minimal sketch of such a distribution check
is provided after this list). One example is Autoencoder models, where the reconstruction error
should be continuously monitored, and an increase in the error may indicate changes in the subject
that the model is applied to (such as transactions, equipment, etc.), which may warrant a revision of
the model. In performing such monitoring procedures, the probability distribution of the
reconstruction error can be used to identify whether the changes are normal or anomalous.
b. The serving input data run through the models needs to be continuously monitored, and
significant deviations in the data distributions, units, missing values, etc. for all features should be
noted and investigated, to ensure that the model, trained on the training dataset, still captures
the consistent features demonstrated in the serving data.
c. Given that machine learning models usually handle big data with heavy computational burden, it
is also important to monitor the computational performance in speed, capacity, and efficiency,
through such measurements as latency, throughput and RAM usage. This is especially important
for models that receive real-time requests and generate instant outputs. One example of such
ongoing M&T procedures is to log and monitor the elapsed time between the request and the model
response, and follow up if the records demonstrate a pattern of increasing latency.
d. For newly-developed models, it is also advisable to keep a benchmarking / challenger /
baseline model and compare the performance as part of the ongoing M&T procedures, since the
deviation in the model results can be a sensitive indicator of malfunctions in the current model.
As machine learning models are often developed to replace an old model, this sometimes makes
it easier to keep a benchmarking model (that is, conveniently, the old model).
e. In some cases, it is possible and preferable to utilize some machine learning algorithms to help
monitor model performance, such as via continuous anomaly detection, where the model
performance is constantly monitored based on various performance indicators, and abnormal
changes in the number of anomalies are investigated to identify the causes (e.g. algorithm
tuning, bugs, shifts related to certain inputs, etc.).
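As a concrete illustration of points a. and b. above, the minimal sketch below (numpy and scipy assumed; the data, distributions and alert threshold are illustrative only) compares a current batch of model scores, or a feature's serving values, against a training-time baseline using a two-sample Kolmogorov-Smirnov test.

```python
# Minimal monitoring sketch (numpy / scipy assumed): compare a current batch of model
# scores (or a feature's serving values) against a training-time baseline using a
# two-sample Kolmogorov-Smirnov test. The alert threshold is illustrative only.
import numpy as np
from scipy import stats

def drift_check(baseline: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the current distribution differs materially from the baseline."""
    stat, p_value = stats.ks_2samp(baseline, current)
    print(f"KS statistic={stat:.3f}, p-value={p_value:.4f}")
    return p_value < alpha

# Illustrative usage with synthetic data: current scores drawn from a shifted distribution.
rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 5, size=5000)   # stands in for training-time scores
current_scores = rng.beta(3, 5, size=1000)    # stands in for this period's scores
if drift_check(baseline_scores, current_scores):
    print("Score distribution shift detected - investigate inputs and retraining needs.")
```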
6. Effective Challenge
Effective challenge aims to identify deficiencies and risks in the model, and to question the
appropriateness to use the model for its designated purposes, rather than supporting the current
model (or “window-dressing”). And these challenges need to be “effective”, that is, any deficiencies
and issues identified during the effective challenge process should reflect, in a meaningful way, the
true risks associated with the model development, implementation and use (MDIU). Effective
challenges may be raised on all the key model components, and in various ways such as
benchmarking and exploratory analysis.
One important way to provide the model developers with effective challenge is benchmarking and
comparison to alternative modeling approaches, and exploration of selected alternative approaches
by performing analysis, tests, and/or re-building the model if possible. With a comprehensive
understanding of machine learning algorithms, the validators should evaluate the use of the selected
model, taking into account the model use, business requirements, identified patterns in the data, as
well as the limitations and constraints of different modeling algorithms under consideration, and
then conclude on whether the selected model is the optimal choice. For example, if the model is used
for real-time prediction, and one of the key business requirements is fast running time, then the use
of a Random Forest algorithm (which is known to be slow with a large number of trees) may be
questionable, and the validators should consider and delve into potential alternatives (such as
XGBoost) that may better fit the business use and needs. Or if the base learner is known to be
stable but a Bagging model was used, then the effectiveness of this model choice should be
challenged.
Such benchmarking should be conducted at all levels and not just for the choice of overall model
approach – it may pertain to the choice of optimization algorithms (e.g. Gradient Descent vs.
Newton's Method), detailed methodologies (e.g. ID3, C4.5, C5.0, or CART, for Decision Trees), hyper-
parameters (e.g. # of classifiers / iterations for a Bagging model), model performance measurements
(e.g. AUC, RMSE, MAE, Log Loss, and Classification Error, for an XGBoost model), and so on.
Implementation of machine learning models should also be effectively challenged, considering the
critical role played by the computational efficiency in such models. Compared with the validation of
traditional models, more inquiries should be made by the validators on the technical details of model
implementation and operation, and discussions should be held on potential enhancements (such as
batch vs. stochastic gradient descent, map-reduce and data parallelism, etc.).
Additionally, effective challenge should be conducted with ongoing model operation and
performance in mind, as certain machine learning models (such as KNN) may be easy to implement,
but as the dataset grows, the efficiency and speed of such an algorithm may decline very fast. Validators
should evaluate such risks and raise a flag where warranted.
Effective challenge is not poking around, or splitting hairs. The work needs to be grounded in true
risks, and lead to meaningful questions and better yet, valid recommendations. This requires the
validators to be not only familiar with different types of machine learning modeling approaches and
algorithms, but also experienced in common industry practice – needless to say, this is a very high
bar, and that is why I always consider effective challenge as the tallest order in a validation.
VI. Detailed Recommendations on Machine Learning Model Validations
OK, I have rambled a lot…you may now ask – do you have any brass-tacks recommendations on the
specific focuses, considerations and procedures in validations of machine learning models?
In addition to the high-level recommendations above, I also attempt to provide some more detailed
guidance, in the form of suggested focuses, considerations and procedures, on the validation of machine
learning models, in accordance with SR 11-7 principles and requirements. Obviously, such detailed
discussion can only be meaningful in light of the particular modeling technique and algorithm, given the
vastly different characteristics and idiosyncratic risks associated with each type of machine learning
model. As we clearly do not have enough space for an enumeration of all the models and algorithms
that are considered “machine learning”, I selected the following 16 algorithms that have received the most
attention in the market (note that I excluded linear and logistic regressions, since they have prevailed for
a long time and already fit into SR 11-7 quite well):
• Supervised Learning
✓ Decision Trees
✓ Ensemble Methods
- Bootstrap Aggregating (Bagging)
- Random Forest
- Adaptive Boosting (AdaBoost)
- Gradient Boosting Machine (GBM)
- Extreme Gradient Boosting (XGBoost)
✓ Support Vector Machine (SVM)
✓ K-Nearest Neighbors (KNN)
✓ Neural Network (Supervised)
• Unsupervised Learning
✓ K-means
✓ Principal Component Analysis (PCA)
✓ Multivariate Anomaly Detection
✓ Neural Network (Unsupervised)
- Recurrent Neural Network (RNN)²
- Self-Organizing Map (SOM)
- Autoencoder
✓ Hierarchical Clustering
VII. Conclusion
In today’s world, one has to stay ahead of the curve to thrive, or at least stay on the curve to survive. Banks
will have no choice but to embrace the inexorable march of technologies, including the development and
application of machine learning models, to rev up profits, drive down costs, and control risks. While
machine learning models do bring about paradigm-altering changes and unprecedented challenges to
model risk management, especially to model validation under the SR 11-7 framework, there are ways to
effectively conduct validations and control model risks, in compliance with SR 11-7.
And just like machine learning model progression itself, undoubtedly, validation of such models will be an
ever-evolving process.
² Recurrent Neural Network can be either supervised or unsupervised.
VIII. Appendix – Detailed Recommendations (by ML Algorithm)³
I. SUPERVISED LEARNING
i. Decision Trees
³ For reference only – recommendations are not meant to be exhaustive, or applicable to all models with the algorithms in question.
ii. Ensemble Method – Bootstrap Aggregating (Bagging)
Processing and Implementation
Potential Weaknesses and Risks:
1. Inappropriate aggregation (classification should be done by plurality voting, regression should be done by averaging)
2. Inappropriate model complexity hyper-parameters, e.g. # of classifiers / iterations
Validation Procedures and Considerations:
1. Review the model code, and assess the aggregation based on model purpose
2. Perform prediction error analysis (using the same training and testing sets), review the variance / bias trade-off, and assess the level of complexity
iii. Ensemble Method – Random Forest
Processing and Implementation
Potential Weaknesses and Risks:
1. The model is over-sized and contains too many trees, causing large occupancy of memory and long running time
2. A large number of trees may make the algorithm slow for real-time prediction (i.e. the model can be fast to train, but slow to use)
Validation Procedures and Considerations:
1. Review and assess the parameters that determine the number and size of the trees (such as number of estimators / trees and max number of features to split a node), and assess the impact on model processing
2. If the model is used for real-time prediction, evaluate the model run (that is, prediction) time against the business requirements.
iv. Ensemble Method – AdaBoost
v. Ensemble Method – Gradient Boosting Machines (GBM)
vi. Ensemble Method – Extreme Gradient Boosting (XGBoost)
Processing and Implementation
Potential Weaknesses and Risks:
The custom optimization objectives and evaluation criteria defined by the developer do not fit the purpose and use of the model, or are not appropriately coded.
Validation Procedures and Considerations:
1. Review the cost functions and assess their reasonableness based on the objectives
2. Review the code and assess if the functions are correctly coded.
vii. Support Vector Machine (SVM)
viii. K-Nearest Neighbors (K-NN)
ix. Neural Network (Supervised)
Inputs and Parameters
Potential Weaknesses and Risks:
1. Inadequately or inappropriately labeled data - Neural Networks usually require many more labeled samples than other supervised algorithms
2. Inappropriately-tuned hyper-parameters may cause issues like over-fitting, unbearable running time, etc.
3. Inappropriate / insufficient treatment of certain features (such as the need to encode categorical integer features as a one-hot numeric array) may lead to errors in model fitting or performance measurement
Validation Procedures and Considerations:
1. Review and assess the abundance of the training data; if data is inadequate, consider and explore potential alternatives (e.g. Naive Bayes) that better handle less data
2. Review the hyper-parameters – at this stage, focus on the ranges predefined (for such parameters as # of hidden units per layer, dropout, learning rates, kernel / filter size, padding, etc.), and the choices provided by the developer (for such parameters as # of layers, # of epochs, activation function, max # of models to train, etc.), for optimization. Assess whether the ranges and choices provided to tune the hyper-parameters are in line with normal practice.

Processing and Implementation
Potential Weaknesses and Risks:
1. Training outcome can be nondeterministic and depend crucially on the choice of initial parameters (e.g. the starting point for gradient descent when training back propagation networks)
2. Model may suffer from catastrophic interference / forgetting
Validation Procedures and Considerations:
1. Inquire and understand the model implementation process, including any trial-and-error efforts, parameter tuning considerations, improvement from the base model, etc.
2. Consider and explore potential solutions to alleviate catastrophic forgetting (e.g. elastic weight consolidation)
3. Review and verify that there is an ongoing monitoring plan to ensure model performance as new information is obtained and entered into the model
4. Review the model code used to tune and optimize the hyper-parameters
5. Based on the specific features of the modeled objects, identify the needed treatments (e.g. conversion of categorical cross-entropy output to one-hot encoded form) in implementation, and verify whether these treatments have been appropriately coded

Outcome Analysis
Potential Weaknesses and Risks:
1. The metrics chosen to measure the performance (e.g. classification error / accuracy, kappa coefficient, precision, recall, F-Measure, etc.) do not match the problem under discussion, and therefore do not provide indicative information
2. Model may be over-fitting
Validation Procedures and Considerations:
1. Evaluate the metrics used to measure model performance in model development, and assess their appropriateness
2. Perform / re-perform cross validation; review and assess the metrics in the training, validation and test sets
II. UNSUPERVISED LEARNING
i. K-means
Outcome Analysis
Potential Weaknesses and Risks:
The model outputs are inconsistent in each iteration due to implementation errors (although inconsistencies are expected for K-means)
Validation Procedures and Considerations:
1. Run multiple iterations and review the magnitude of the difference in results
2. Fix the random starting value (e.g. by fixing the random seed) and check if inconsistencies persist
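A minimal sketch of the second procedure (scikit-learn assumed; the data is synthetic) fixes the random seed, re-runs the clustering, and confirms that the assignments are reproducible once randomness is controlled, so that any remaining differences point to implementation issues rather than random initialization.

```python
# Minimal sketch (scikit-learn assumed): with a fixed random seed, two K-means runs on
# the same data should produce identical assignments; remaining differences point to
# implementation issues rather than random initialization. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)  # stand-in data

labels_1 = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
labels_2 = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
print("Identical assignments with fixed seed:", np.array_equal(labels_1, labels_2))
```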
ii. Principal Component Analysis (PCA)
Processing and Implementation
Potential Weaknesses and Risks:
1. Features are on different scales (on which PCA does not perform well)
2. Reconstruction from the compressed representation is not appropriately performed
Validation Procedures and Considerations:
1. Verify that mean normalization and feature scaling have been performed before processing
2. Review the code for reconstruction and verify that the correct inputs (i.e. selected eigenvectors and the corresponding compressed representations) are used
iii. Multivariate Anomaly Detection
iv. Neural Network (Unsupervised) – Overall
Processing and Implementation
Potential Weaknesses and Risks: Depending on the specific algorithms
Validation Procedures and Considerations: Depending on the specific algorithms

Outcome Analysis
Potential Weaknesses and Risks: Depending on the specific algorithms
Validation Procedures and Considerations: Depending on the specific algorithms
v. Neural Network (Unsupervised) – Recurrent Neural Network (RNN)
vi. Neural Network (Unsupervised) – Self-Organizing Map (SOM)
Processing and Implementation
Potential Weaknesses and Risks:
The model may be very computationally expensive, especially for a large map (the complexity scales quadratically with the number of map units)
Validation Procedures and Considerations:
1. Review the number of map units and assess the computational load
2. Consider and inquire about the potential increase of future entropy of samples and the need for larger maps (which may lead to implementation challenges)
vii. Neural Network (Unsupervised) – Autoencoders
Conceptual Soundness (Theory, Design, Assumptions)
Potential Weaknesses and Risks:
1. Extremely uninterpretable - understanding and explaining the latent features of non-visual data is almost impossible
2. Certain variants of Autoencoder (e.g. Variational Autoencoders) apply assumptions pertaining to data distributions (e.g. Multivariate Gaussian)
Validation Procedures and Considerations:
1. The business purposes and use should be thoroughly understood and carefully examined - if business requirements include transparency and interpretation, alternative approaches should be used
2. If distribution assumptions are applied, such assumptions should be reviewed against the actual data and other conditions (incl. business considerations)
viii. Hierarchical Clustering