Abstract. Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look at the field to see if this is actually true. We find flaws in the experimental methodology of numerous metric learning papers, and show that when these flaws are fixed, the actual improvements over time have been marginal at best.
(Note that in many implementations, $m_{pos}$ is set to 0.) The theoretical downside
of this method is that the same distance threshold is applied to all pairs, even
though there may be a large variance in their similarities and dissimilarities.
The triplet margin loss [63] theoretically addresses this issue. A triplet con-
sists of an anchor, positive, and negative sample, where the anchor is more similar
to the positive than the negative. The triplet margin loss attempts to make the
anchor-positive distances ($d_{ap}$) smaller than the anchor-negative distances ($d_{an}$),
by a predefined margin (m):
$$L_{triplet} = [d_{ap} - d_{an} + m]_+$$
This theoretically places fewer restrictions on the embedding space, and allows
the model to account for variance in interclass dissimilarities.
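As a rough illustration, the following PyTorch-style sketch implements the contrastive loss described above and the triplet margin loss; the margin defaults and the assumption that pair/triplet distances are precomputed are ours, not taken from any particular paper.

```python
import torch

def contrastive_loss(d, is_pos, m_pos=0.0, m_neg=0.5):
    # d: distances for a batch of sampled pairs; is_pos: 1 for positive pairs, 0 for negatives.
    # Positive pairs are pulled within m_pos; negative pairs are pushed beyond m_neg.
    is_pos = is_pos.float()
    pos_term = torch.relu(d - m_pos) * is_pos
    neg_term = torch.relu(m_neg - d) * (1.0 - is_pos)
    return (pos_term + neg_term).mean()

def triplet_margin_loss(d_ap, d_an, m=0.2):
    # Encourages d_ap + m <= d_an for each (anchor, positive, negative) triplet.
    return torch.relu(d_ap - d_an + m).mean()
```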
A wide variety of losses has since been built on these fundamental concepts.
For example, the angular loss [60] is a triplet loss where the margin is based
on the angles formed by the triplet vectors. The margin loss [65] modifies the
contrastive loss by setting $m_{pos} = \beta - \alpha$ and $m_{neg} = \beta + \alpha$, where $\alpha$ is fixed,
and $\beta$ is learnable via gradient descent. More recently, Yuan et al. [70] proposed
a variation of the contrastive loss based on signal-to-noise ratios, where each
embedding vector is considered signal, and the difference between it and other
vectors is considered noise. Other pair losses are based on the softmax function
and LogSumExp, which is a smooth approximation of the maximum function.
Specifically, the lifted structure loss [37] is the contrastive loss but with Log-
SumExp applied to all negative pairs. The N-Pairs loss [50] applies the softmax
function to each positive pair relative to all other pairs. (The N-Pairs loss is
also known as InfoNCE [38] and NT-Xent [6].) The recent multi-similarity loss
[62] applies LogSumExp to all pairs, but is specially formulated to give weight
to different relative similarities among each embedding and its neighbors. The
tuplet margin loss [69] combines LogSumExp with an implicit pair weighting
method, while the circle loss [52] weights each pair’s similarity by its deviation
from a pre-determined optimal similarity value. In contrast with these pair and
triplet losses, FastAP [3] attempts to optimize for average precision within each
batch, using a soft histogram binning technique.
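To make the softmax/LogSumExp connection concrete, here is a minimal sketch of an NT-Xent / N-Pairs style loss in a label-supervised form: each positive pair's similarity is passed through a softmax over all other pairs for that anchor. The temperature value and the use of cosine similarity are illustrative assumptions, not the settings of any specific paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(embeddings, labels, temperature=0.1):
    # embeddings: (N, D) tensor; labels: (N,) class labels.
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                       # scaled cosine similarities
    n = sim.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float('-inf'))     # never pair a sample with itself
    # log softmax over all other samples for each anchor (LogSumExp in the denominator)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    has_pos = pos_mask.any(dim=1)                       # skip anchors with no positives
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    loss = -pos_log_prob[has_pos].sum(dim=1) / pos_mask[has_pos].sum(dim=1)
    return loss.mean()
```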
theoretically, it has the tendency to include a large number of easy negatives and
positives, causing performance to plateau quickly. Thus, one intuitive strategy
is to select only the most difficult positive and negative samples [20], but this
has been found to produce noisy gradients and convergence to bad local optima
[65]. A possible remedy is semihard negative mining, which finds the negative
samples in a batch that are close to the anchor, but still further away than the
corresponding positive samples [47]. On the other hand, Wu et al. [65] found
that semihard mining makes little progress as the number of semihard negatives
drops. They claim that distance-weighted sampling results in a variety of neg-
atives (easy, semihard, and hard), and improved performance. Online mining
can also be integrated into the structure of models. Specifically, the hard-aware
deeply cascaded method [71] uses models of varying complexity, in which the
loss for the complex models only considers the pairs that the simpler models
find difficult. Recently, Wang et al. [62] proposed a simple pair mining strategy,
where negatives are chosen if they are closer to an anchor than its hardest posi-
tive, and positives are chosen if they are further from an anchor than its hardest
negative.
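The mining rule of Wang et al. [62] can be sketched as follows; note that this simplification drops the margin term of their full formulation, and the distance-matrix and label conventions are assumptions.

```python
import torch

def mine_pairs(dist, labels):
    # dist: (N, N) pairwise distance matrix; labels: (N,) class labels.
    n = dist.size(0)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(n, dtype=torch.bool, device=dist.device)
    pos_mask = same & ~eye
    neg_mask = ~same

    # Hardest positive: the positive furthest from the anchor.
    hardest_pos = dist.masked_fill(~pos_mask, float('-inf')).max(dim=1).values
    # Hardest negative: the negative closest to the anchor.
    hardest_neg = dist.masked_fill(~neg_mask, float('inf')).min(dim=1).values

    # Keep negatives that are closer than the hardest positive,
    # and positives that are further than the hardest negative.
    mined_neg = neg_mask & (dist < hardest_pos.unsqueeze(1))
    mined_pos = pos_mask & (dist > hardest_neg.unsqueeze(1))
    return mined_pos, mined_neg
```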
test set. Concurrent with our work is Roth et al. [44], which addresses many
of the same flaws that we find, and does an extensive analysis of various loss
functions. But again, they do not address the problem of training with test set
feedback, and their hyperparameters are tuned using a small grid search around
values proposed in the original papers. In contrast, we use cross-validation and
Bayesian optimization to tune hyperparameters. We find that this significantly
reduces the performance differences between loss functions. See section 3 for
a complete explanation of our experimental methodology.
In order to claim that a new algorithm outperforms existing methods, it's im-
portant to keep as many parameters constant as possible. That way, we can be
certain that it was the new algorithm that boosted performance, and not one
of the extraneous parameters. This has not been the case with metric learning
papers.
One of the easiest ways to improve accuracy is to upgrade the network ar-
chitecture, yet this fundamental parameter has not been kept constant across
papers. Some use GoogleNet, while others use BN-Inception, sometimes referred
to as “Inception with Batch Normalization.” Choice of architecture is important
in metric learning, because the networks are typically pretrained on ImageNet,
and then finetuned on smaller datasets. Thus, the initial accuracy on the smaller
datasets varies depending on the chosen network. One widely-cited paper from
2017 used ResNet50, and then claimed huge performance gains. This is question-
able, because the competing methods used GoogleNet, which has significantly
lower initial accuracies (see Table 1). Therefore, much of the performance gain
likely came from the choice of network architecture, and not their proposed method.
All three embedding spaces in Figure 1 achieve nearly 100% Recall@1, even though they have different characteristics. (Note
that 100% Recall@1 means that Recall@K for any K>1 is also 100%.) More im-
portantly, Figure 1(c) shows a better separation of the classes than Figure 1(a),
yet they receive approximately the same score. F1 and NMI also return roughly
equal scores for all three embedding spaces. Moreover, they require the embed-
dings to be clustered, which introduces two factors of variability: the choice of
clustering algorithm, and the sensitivity of clustering results to seed initializa-
tion. Since we know the ground-truth number of clusters, k-means clustering is
the obvious choice and is what is typically used. However, as Figure 1 shows,
this results in uninformative NMI and F1 scores. Other clustering algorithms
could be considered, but each one has its own drawbacks and subtleties. In-
troducing a clustering algorithm into the evaluation process is simply adding a
layer of complexity between the researcher and the embedding space. Instead,
we would like an accuracy metric that operates directly on the embedding space,
like Recall@K, but that provides more nuanced information.
[Fig. 1 panel scores: (a) NMI 95.6%, F1 100%, R@1 99%, R-Precision 77.4%, MAP@R 71.4%; (b) NMI 100%, F1 100%, R@1 99.8%, R-Precision 83.3%, MAP@R 77.9%; (c) NMI 100%, F1 100%, R@1 100%, R-Precision 99.8%, MAP@R 99.8%.]
NMI also tends to give high scores on datasets that have many classes, regard-
less of the model’s true accuracy (see Table 2). Adjusted Mutual Information
[55] removes this flaw, but still requires clustering to be done first.
Table 2. NMI of embeddings from randomly initialized convnets. CUB200 and Cars196
have about 200 classes, while SOP has about 20,000.
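For reference, the clustering-based evaluation criticized above is commonly implemented along the lines of the sketch below (assuming scikit-learn); the dependence on the clustering seed is exactly the source of variability noted earlier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def nmi_score(embeddings, labels, seed=0):
    # embeddings: (N, D) array; labels: (N,) ground-truth class ids.
    n_classes = len(np.unique(labels))
    kmeans = KMeans(n_clusters=n_classes, random_state=seed, n_init=10)
    cluster_ids = kmeans.fit_predict(embeddings)
    # Different seeds (and different clustering algorithms) can yield different scores.
    return normalized_mutual_info_score(labels, cluster_ids)
```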
In most papers, test set accuracy is checked at regular intervals during training, and
the best test set accuracy is reported. In other words, there is no validation set,
and model selection and hyperparameter tuning are done with direct feedback
from the test set. Some papers do not check performance at regular intervals,
and instead report accuracy after training for a predetermined number of iter-
ations. In this case, it is unclear how the number of iterations is chosen, and
hyperparameters are still tuned based on test set performance. This breaks one
of the most basic commandments of machine learning. Training with test set
feedback leads to overfitting on the test set, and therefore brings into question
the steady rise in accuracy over time, as presented in metric learning papers.
– R-Precision is defined as follows: For each query, let R be the total number
of references that are the same class as the query. Find the R nearest refer-
ences to the query, and let r be the number of those nearest references that
are the same class as the query. The score for the query is $\frac{r}{R}$.
– One weakness of R-precision is that it does not account for the ranking of
the correct retrievals. So we instead use MAP@R, which is Mean Average
Precision with the number of nearest neighbors for each sample set to R. For
a single query:
$$\text{MAP@R} = \frac{1}{R} \sum_{i=1}^{R} P(i) \qquad (3)$$

$$P(i) = \begin{cases} \text{precision at } i, & \text{if the } i\text{th retrieval is correct} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
The benefits of MAP@R are that it is more informative than Recall@1 (see Fig-
ure 1 and Table 3), it can be computed directly from the embedding space (no
clustering step required), it is easy to understand, and it rewards well-clustered
embedding spaces. MAP@R is also more stable than Recall@1. Across our ex-
periments, we computed the lag-one autocorrelation of the validation accuracy
during training: Recall@1 = 0.73 and MAP@R = 0.81. Thus, MAP@R is less
noisy, making it easier to select the best performing model checkpoints.
In our results tables in section 4, we present R-precision and MAP@R. For the
sake of comparisons to previous papers, we also show Precision@1 (also known
as “Recall@1” in the previous sections and in metric learning papers).
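A minimal NumPy sketch of R-Precision and MAP@R as defined in equations (3) and (4); the brute-force nearest-neighbor search and the convention of excluding each query from its own reference set are assumptions about the evaluation setup, not a description of our exact tooling.

```python
import numpy as np

def r_precision_and_map_at_r(embeddings, labels):
    # embeddings: (N, D) array; labels: (N,) class ids. Each sample is a query
    # against all other samples. Brute-force distances: fine for small N.
    dist = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # exclude the query itself
    r_precisions, maps_at_r = [], []
    for i in range(len(labels)):
        R = int(np.sum(labels == labels[i])) - 1   # same-class references for this query
        if R == 0:
            continue
        nearest = np.argsort(dist[i])[:R]          # the R nearest references
        correct = (labels[nearest] == labels[i]).astype(float)
        r_precisions.append(correct.mean())        # r / R
        precision_at_i = np.cumsum(correct) / (np.arange(R) + 1)
        maps_at_r.append(np.sum(precision_at_i * correct) / R)   # eq. (3)-(4)
    return float(np.mean(r_precisions)), float(np.mean(maps_at_r))
```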
– The first half of classes are used for cross-validation, and the 4 partitions
are created deterministically: the first 0-12.5% of classes make up the first
partition, the next 12.5-25% of classes make up the second partition, and so
on. The training set comprises 3 of the 4 partitions, and cycles through all
leave-one-out possibilities. As a result, the training and validation sets are
always class-disjoint, so optimizing for validation set performance should be
a good proxy for accuracy on open-set tasks. Training stops when validation
accuracy plateaus.
– The second half of classes are used as the test set. This is the same setting
that metric learning papers have used for years, and we use it so that results
can be compared more easily to past papers.
We do 10 training runs using the best hyperparameters, and report the average
across these runs, as well as confidence intervals. This way our results are less
subject to random seed noise.
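A sketch of the deterministic, class-disjoint split described above, assuming classes are indexed 0 to num_classes − 1; the helper below is illustrative only.

```python
import numpy as np

def class_disjoint_splits(num_classes, num_partitions=4):
    # First half of classes: cross-validation; second half: held-out test set.
    classes = np.arange(num_classes)
    cv_classes, test_classes = np.array_split(classes, 2)
    # Deterministic partitions: the first 12.5% of classes, the next 12.5%, and so on.
    partitions = np.array_split(cv_classes, num_partitions)
    folds = []
    for val_idx in range(num_partitions):
        val = partitions[val_idx]
        train = np.concatenate([p for i, p in enumerate(partitions) if i != val_idx])
        folds.append((train, val))     # train and val classes are always disjoint
    return folds, test_classes
```

Model selection uses the validation accuracy of these folds; the test classes are only touched for the final reported runs.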
4 Experiments
Table 7. The losses covered in our experiments. Note that NT-Xent is the name we
used in our code, but it is also known as N-Pairs or InfoNCE. For the Margin loss,
we tested two versions: “Margin” uses the same β value for all training classes, and
“Margin / class” uses a separate β for each training class. In both versions, β is learned
during training. Face verification losses have been consistently left out of metric learning
papers, so we included two losses (CosFace and ArcFace) from that domain. (We used
only the loss functions from those two papers. We did not train on any face datasets
or use any model trained on faces.)
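As an illustration of the face-verification losses included in Table 7, here is a minimal sketch of a CosFace-style loss [59] used as a classification head over the training classes; the scale and margin defaults are illustrative and are not the hyperparameters used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceLoss(nn.Module):
    # Large-margin cosine loss: subtract a margin from the target-class cosine
    # before applying a scaled softmax cross-entropy.
    def __init__(self, embedding_size, num_classes, scale=64.0, margin=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embedding_size))
        self.scale, self.margin = scale, margin

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and normalized class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))  # (N, C)
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.scale * (cosine - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```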
[Figure 2: two bar charts of Precision@1 (0 to 90%) on CUB200, Cars196, and SOP. Panel (a), “The trend according to papers”, shows reported results for Triplet*, Contrastive*, Semihard Mining*, LiftedStructure [37], Histogram [54], N-Pairs [50], Spectral [28], Angular [60], ProxyNCA [35], HDC [71], Stochastic Hard Mining [51], DAML [12], HTL [14], HDML [73], DeML [5], Asymmetric ML [66], Divide and Conquer [46], MIC [43], SoftTriple [41], HORDE [22], EPSHN [67], and FastAP [3]. Panel (b), “The trend according to reality”, shows our results for the losses in Table 7: Contrastive, Triplet, NT-Xent, ProxyNCA, Margin, Margin / class, Normalized Softmax, CosFace, ArcFace, FastAP, SNR Contrastive, MS, MS+Miner, and Soft Triple.]
Fig. 2. Papers versus Reality: the trend of Precision@1 of various methods over the
years. In (a), the baseline methods are marked with *, which indicates that their num-
bers are the average reported accuracy from all papers that included those baselines.
[Figure 3: two bar charts (y-axis from 10% to 190%) with bars for Papers, Reality (128-dim), and Reality (512-dim). (a) Relative improvement over the contrastive loss. (b) Relative improvement over the triplet loss.]
Fig. 3. Papers versus Reality: we look at the results tables of all methods presented
in Figure 2(a). 11 of these include the contrastive loss, and 12 include the triplet
loss (without semihard mining). For each paper, we compute the relative percentage
improvement of their proposed method over their reported result for the contrastive
or triplet loss, and then take the average improvement across papers (grey bars in the
above figures). The green and red bars are the average relative improvement that we
obtain, in the separated 128-dim and concatenated 512-dim settings, respectively. For
the “reality” numbers in (a) we excluded the FastAP loss from the calculation, since
it was a poor performing outlier in our experiments, and we excluded the triplet loss
since we consider it a baseline method. Likewise for the “reality” numbers in (b), we
excluded the FastAP and contrastive losses from the calculation.
cutting-edge papers not covered in our experiments. It also raises doubts about the
value of the hand-wavy theoretical explanations in metric learning papers. If a
paper attempts to explain the performance gains of its proposed method, and it
turns out that those performance gains are non-existent, then its explanation
must be invalid as well.
5 Conclusion
In this paper, we uncovered several flaws in the current metric learning literature,
namely:
– unfair comparisons caused by changes in extraneous parameters such as the network architecture;
– training with test set feedback, i.e. model selection and hyperparameter tuning without a validation set;
– accuracy metrics (Recall@K, NMI, and F1) that do not capture the nuances of the embedding space.
We then ran experiments with these issues fixed, and found that state-of-the-art loss functions perform marginally better than, and sometimes on par with,
classic methods. This is in stark contrast with the claims made in papers, in
which accuracy has risen dramatically over time.
Future work could explore the relationship between optimal hyperparameters
and dataset/architecture combinations, as well as the reasons why different
losses perform similarly to one another. Of course, pushing the state-
of-the-art in accuracy is another research direction. If proper machine learning
practices are followed, and comparisons to prior work are done in a fair manner,
the results of future metric learning papers will better reflect reality, and will be
more likely to generalize to other high-impact areas like self-supervised learning.
6 Acknowledgements
References
18. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised
visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
20. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-
identification. arXiv preprint arXiv:1703.07737 (2017)
21. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: International Conference on Machine Learning.
pp. 448–456 (2015)
22. Jacob, P., Picard, D., Histace, A., Klein, E.: Metric learning with horde: High-
order regularizer for deep embeddings. In: Proceedings of the IEEE International
Conference on Computer Vision. pp. 6539–6548 (2019)
23. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot,
A., Liu, C., Krishnan, D.: Supervised contrastive learning. arXiv preprint
arXiv:2004.11362 (2020)
24. Kim, S., Kim, D., Cho, M., Kwak, S.: Proxy anchor loss for deep metric learning.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 3238–3247 (2020)
25. Kim, W., Goyal, B., Chawla, K., Lee, J., Kwon, K.: Attention-based ensemble for
deep metric learning. In: Proceedings of the European Conference on Computer
Vision (ECCV). pp. 736–751 (2018)
26. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-
grained categorization. In: 4th International IEEE Workshop on 3D Representation
and Recognition (3dRR-13). Sydney, Australia (2013)
27. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in neural information processing systems.
pp. 1097–1105 (2012)
28. Law, M.T., Urtasun, R., Zemel, R.S.: Deep spectral clustering learning. In: Inter-
national Conference on Machine Learning. pp. 1985–1994 (2017)
29. Lin, X., Duan, Y., Dong, Q., Lu, J., Zhou, J.: Deep variational metric learning.
In: Proceedings of the European Conference on Computer Vision (ECCV). pp.
689–704 (2018)
30. Lipton, Z.C., Steinhardt, J.: Troubling trends in machine learning scholarship.
arXiv preprint arXiv:1807.03341 (2018)
31. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere
embedding for face recognition. In: Proceedings of the IEEE conference on com-
puter vision and pattern recognition. pp. 212–220 (2017)
32. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 3431–3440 (2015)
33. Lucic, M., Kurach, K., Michalski, M., Gelly, S., Bousquet, O.: Are gans created
equal? a large-scale study. In: Advances in neural information processing systems.
pp. 700–709 (2018)
34. Luo, L., Xiong, Y., Liu, Y.: Adaptive gradient methods with dynamic bound of
learning rate. In: International Conference on Learning Representations (2019),
https://openreview.net/forum?id=Bkg3g2R9FX
35. Movshovitz-Attias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S.: No fuss dis-
tance metric learning using proxies. In: Proceedings of the IEEE International
Conference on Computer Vision. pp. 360–368 (2017)
36. Oh Song, H., Jegelka, S., Rathod, V., Murphy, K.: Deep metric learning via facility
location. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. pp. 5382–5390 (2017)
37. Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted
structured feature embedding. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition. pp. 4004–4012 (2016)
38. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predic-
tive coding. arXiv preprint arXiv:1807.03748 (2018)
39. Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisen-
sory features. European Conference on Computer Vision (ECCV) (2018)
40. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-
performance deep learning library. In: Advances in Neural Information Processing
Systems. pp. 8024–8035 (2019)
41. Qian, Q., Shang, L., Sun, B., Hu, J., Li, H., Jin, R.: Softtriple loss: Deep met-
ric learning without triplet sampling. In: Proceedings of the IEEE International
Conference on Computer Vision. pp. 6450–6458 (2019)
42. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. In: Advances in neural information processing
systems. pp. 91–99 (2015)
43. Roth, K., Brattoli, B., Ommer, B.: Mic: Mining interclass characteristics for im-
proved metric learning. In: Proceedings of the IEEE International Conference on
Computer Vision. pp. 8000–8009 (2019)
44. Roth, K., Milbich, T., Sinha, S., Gupta, P., Ommer, B., Cohen, J.P.: Revisiting
training strategies and generalization performance in deep metric learning (2020)
45. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog-
nition challenge. International journal of computer vision 115(3), 211–252 (2015)
46. Sanakoyeu, A., Tschernezki, V., Buchler, U., Ommer, B.: Divide and conquer the
embedding space for metric learning. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 471–480 (2019)
47. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face
recognition and clustering. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 815–823 (2015)
48. Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S.,
Brain, G.: Time-contrastive networks: Self-supervised learning from video. In: 2018
IEEE International Conference on Robotics and Automation (ICRA). pp. 1134–
1141. IEEE (2018)
49. Smirnov, E., Melnikov, A., Novoselov, S., Luckyanets, E., Lavrentyeva, G.: Dop-
pelganger mining for face representation learning. In: Proceedings of the IEEE
International Conference on Computer Vision Workshops. pp. 1916–1923 (2017)
50. Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In:
Advances in Neural Information Processing Systems. pp. 1857–1865 (2016)
51. Suh, Y., Han, B., Kim, W., Lee, K.M.: Stochastic class-based hard example mining
for deep metric learning. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 7251–7259 (2019)
52. Sun, Y., Cheng, C., Zhang, Y., Zhang, C., Zheng, L., Wang, Z., Wei, Y.: Circle
loss: A unified perspective of pair similarity optimization. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6398–
6407 (2020)
53. Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural
networks. arXiv preprint arXiv:1905.11946 (2019)
54. Ustinova, E., Lempitsky, V.: Learning deep embeddings with histogram loss. In:
Advances in Neural Information Processing Systems. pp. 4170–4178 (2016)
55. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings
comparison: Variants, properties, normalization and correction for chance. The
Journal of Machine Learning Research 11, 2837–2854 (2010)
56. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD
Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, California Institute of
Technology (2011)
57. Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification.
IEEE Signal Processing Letters 25(7), 926–930 (2018)
58. Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: Normface: L2 hypersphere embedding
for face verification. In: Proceedings of the 25th ACM international conference on
Multimedia. pp. 1041–1049 (2017)
59. Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface:
Large margin cosine loss for deep face recognition. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 5265–5274 (2018)
60. Wang, J., Zhou, F., Wen, S., Liu, X., Lin, Y.: Deep metric learning with angular
loss. In: Proceedings of the IEEE International Conference on Computer Vision.
pp. 2593–2601 (2017)
61. Wang, X., Hua, Y., Kodirov, E., Hu, G., Garnier, R., Robertson, N.M.: Ranked list
loss for deep metric learning. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 5207–5216 (2019)
62. Wang, X., Han, X., Huang, W., Dong, D., Scott, M.R.: Multi-similarity loss with
general pair weighting for deep metric learning. In: Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition. pp. 5022–5030 (2019)
63. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large mar-
gin nearest neighbor classification. In: Advances in neural information processing
systems. pp. 1473–1480 (2006)
64. Wilber, M., Kwak, S., Belongie, S.: Cost-effective hits for relative similarity
comparisons. In: Human Computation and Crowdsourcing (HCOMP). Pitts-
burgh (2014), http://arxiv.org/abs/1404.3291
65. Wu, C.Y., Manmatha, R., Smola, A.J., Krahenbuhl, P.: Sampling matters in deep
embedding learning. In: Proceedings of the IEEE International Conference on Com-
puter Vision. pp. 2840–2848 (2017)
66. Xu, X., Yang, Y., Deng, C., Zheng, F.: Deep asymmetric metric learning via rich
relationship mining. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. pp. 4076–4085 (2019)
67. Xuan, H., Stylianou, A., Pless, R.: Improved embeddings with easy positive triplet
mining. In: The IEEE Winter Conference on Applications of Computer Vision. pp.
2474–2482 (2020)
68. Yang, W., Lu, K., Yang, P., Lin, J.: Critically examining the “neural hype”: weak
baselines and the additivity of effectiveness gains from neural ranking models. In:
Proceedings of the 42nd International ACM SIGIR Conference on Research and
Development in Information Retrieval. pp. 1129–1132 (2019)
69. Yu, B., Tao, D.: Deep metric learning with tuplet margin loss. In: The IEEE
International Conference on Computer Vision (ICCV) (October 2019)
70. Yuan, T., Deng, W., Tang, J., Tang, Y., Chen, B.: Signal-to-noise ratio: A robust
distance metric for deep metric learning. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. pp. 4815–4824 (2019)
71. Yuan, Y., Yang, K., Zhang, C.: Hard-aware deeply cascaded embedding. In: Pro-
ceedings of the IEEE international conference on computer vision. pp. 814–823
(2017)
72. Zhai, A., Wu, H.Y.: Classification is a strong baseline for deep metric learning.
arXiv preprint arXiv:1811.12649 (2018)
73. Zheng, W., Chen, Z., Lu, J., Zhou, J.: Hardness-aware deep metric learning. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
pp. 72–81 (2019)