PR-422
Wenzel, Florian, et al. "Hyperparameter ensembles for robustness and uncertainty quantification." Advances in Neural Information Processing Systems 33 (2020): 6514-6527.
Sunghoon Joo, VUNO Inc.
2023. 2. 19.
1. Research Background
2. Methods
3. Experimental Results
4. Conclusions
1. Research Background
Ensembles of neural networks
•Neural networks can form ensembles of models that are diverse and perform well on held-out data.
•Diversity is induced by the multi-modal nature of the loss landscape and randomness in initialization
and training.
•Many mechanisms exist to foster diversity; this paper focuses on combining networks trained with different weight initializations and different hyperparameters.
http://florianwenzel.com/files/neurips_poster_2020.pdf
1. Research Background
Approach
•Hyper-deep ensembles
• This approach utilizes a greedy algorithm to create neural network ensembles that leverage diverse hyperparameters and random
initialization for improved performance.
•Hyper-batch ensembles
• We propose a parameterization combining that of 'batch ensemble' and self-tuning networks, which enables both weight and hyperparameter diversity.
1. Research Background
Previous works
•Combining the outputs of several neural networks to improve over their single-model performance
• Since the quality of an ensemble hinges on the diversity of its members, many mechanisms were developed to generate diverse ensemble
members.
• Cyclical learning-rate schedules (R. Zhang, et al. "Cyclical stochastic gradient MCMC for Bayesian deep learning." ICLR, 2020.)
• MC dropout (Y. Gal. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." ICML, 2016.)
• Random initialization (deep ensembles)
1. Research Background
Previous works
•Batch ensemble (Wen et al., ICLR, 2020)
• Each member's weight matrix W_k is the Hadamard product of a shared matrix W and a member-specific rank-1 matrix r_k s_k^⊤.
• Y. Wen, D. Tran, and J. Ba. "BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning." ICLR, 2020.
•Not only does batch ensemble lead to a memory saving, but it also allows for efficient minibatching, where each datapoint may use a different ensemble member, thanks to the identity (checked in the sketch below):

X [W ∘ (r_k s_k^⊤)] = [(X ∘ r_k^⊤) W] ∘ s_k^⊤
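A minimal NumPy check of this identity (a sketch; the shapes X ∈ R^{n×d}, W ∈ R^{d×m}, r_k ∈ R^d, s_k ∈ R^m and the random values are assumptions made for illustration):

```python
import numpy as np

# Check the batch-ensemble identity: X [W ∘ (r s^T)] = [(X ∘ r^T) W] ∘ s^T,
# where r^T and s^T broadcast over the rows of the minibatch X.
rng = np.random.default_rng(0)
n, d, m, K = 8, 5, 4, 3                        # batch size, fan-in, fan-out, members
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, m))                    # shared weight matrix
r, s = rng.normal(size=d), rng.normal(size=m)  # rank-1 factors of one member

lhs = X @ (W * np.outer(r, s))                 # materializes the member weight W_k
rhs = ((X * r) @ W) * s                        # never materializes W_k
assert np.allclose(lhs, rhs)

# Efficient minibatching: each datapoint may use a different ensemble member.
R = rng.normal(size=(K, d))                    # r_k stacked for all K members
S = rng.normal(size=(K, m))                    # s_k stacked for all K members
member = rng.integers(0, K, size=n)            # per-example member assignment
Y = ((X * R[member]) @ W) * S[member]          # one shared matmul serves all members
```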
2. Methods
Hyper-deep ensembles
•Train κ models by random search, i.e., random weight init and random hparams (line 1).
•Apply hyper_ens to extract K models out of the κ available ones, with K ≪ κ (line 2).
•For each selected hparam (line 3), retrain with K different weight inits, i.e., stratification (lines 4-8); finally, hyper_ens is re-applied over the stratified pool. A sketch of the greedy hyper_ens step follows below.
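A minimal sketch of the greedy hyper_ens selection step (line 2). The candidate validation probabilities `val_probs` and NLL scoring are illustrative assumptions; selection with replacement follows the Caruana-style greedy procedure the paper builds on:

```python
import numpy as np

def ensemble_nll(member_probs, labels):
    """NLL of the averaged predictive distribution of the chosen members."""
    probs = member_probs.mean(axis=0)                       # [N, C]
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def hyper_ens(val_probs, labels, K):
    """Greedily pick K of the kappa candidates (with replacement) so that the
    validation NLL of the growing ensemble is minimized at every step.

    val_probs: [kappa, N, C] validation probabilities of the kappa candidates.
    """
    selected = []
    for _ in range(K):
        scores = [ensemble_nll(val_probs[selected + [j]], labels)
                  for j in range(len(val_probs))]
        selected.append(int(np.argmin(scores)))
    return selected
```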
2. Methods
Hyper-batch ensembles
•This combines the ideas of batch ensembles (Wen et al., 2020) and self-tuning networks (STNs) (MacKay et al., 2019).
• Y. Wen, D. Tran, and J. Ba. "BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning." ICLR, 2020.
•Ensemble member k ∈ {1,…, K}
•Weight diversity: rank-1 factors r_k s_k^⊤ and u_k v_k^⊤
2. Methods
Hyper-batch ensembles
•Can capture multiple hyperparameters (STNs only cover one hparam).
•Ensemble member k ∈ {1,…, K}; weight diversity again comes from the rank-1 factors r_k s_k^⊤ and u_k v_k^⊤ (see the schematic layer sketch below).
•STNs provide scalable local approximations of the best-response function of the hyperparameters.
• M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse. "Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions." ICLR, 2019.
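A schematic sketch of one dense hyper-batch ensemble layer, under simplifying assumptions (a scalar embedding e(λ_k) instead of the paper's structured embedding, and no bias terms); it only illustrates how the two rank-1 factors modulate the shared matrices W and Δ:

```python
import numpy as np

def hyper_batch_dense(X, W, Delta, r_k, s_k, u_k, v_k, e_lam_k):
    """Member-k forward pass of a dense layer (illustrative sketch only).

    W, Delta : shared [d, m] matrices (Delta carries the hyperparameter response)
    r_k, s_k : rank-1 factors giving weight diversity, as in batch ensemble
    u_k, v_k : rank-1 factors modulating the hyperparameter-dependent term
    e_lam_k  : embedding of member k's hyperparameters (scalar placeholder here)
    """
    W_k = W * np.outer(r_k, s_k) + (Delta * np.outer(u_k, v_k)) * e_lam_k
    return X @ W_k
```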
2. Methods
Hyper-batch ensembles
•Model parameters are optimized on the training set using the average member cross entropy (= the usual loss for single
models).
•Hyperparameters (more precisely the hyperparameter distribution parameters ξ) are optimized on the validation set
using the ensemble cross entropy. This directly encourages diversity between members.
•[Slide equations: the training objective, and the hyperparameter distribution (parameterized by ξ) for ensemble member k; a training-step sketch follows below.]
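A PyTorch-style sketch of one alternating training step under these two objectives. This is a sketch under assumptions, not the paper's code: `model(x, lam)` returning per-member logits of shape [K, batch, classes] and `model.sample_lambdas()` drawing one λ_k per member from the ξ-parameterized distribution are hypothetical helpers, and the paper's entropy regularization of the λ-distribution is omitted:

```python
import torch
import torch.nn.functional as F

def train_step(model, opt_theta, opt_xi, train_batch, val_batch):
    # (1) Model parameters theta: average member cross entropy on the training set.
    x, y = train_batch
    logits = model(x, model.sample_lambdas())             # [K, B, C]
    member_ce = F.cross_entropy(
        logits.flatten(0, 1), y.repeat(logits.shape[0]))  # mean over members and batch
    opt_theta.zero_grad(); member_ce.backward(); opt_theta.step()

    # (2) Distribution parameters xi: ensemble cross entropy on the validation set,
    #     which directly rewards diversity between members.
    xv, yv = val_batch
    logits_v = model(xv, model.sample_lambdas())
    ens_probs = torch.softmax(logits_v, dim=-1).mean(dim=0)  # average member probs
    ens_ce = F.nll_loss(torch.log(ens_probs + 1e-12), yv)
    opt_xi.zero_grad(); ens_ce.backward(); opt_xi.step()
    return member_ce.item(), ens_ce.item()
```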
3. Experimental Results
Focus on small-scale models - MLP and LeNet on Fashion MNIST & CIFAR-100
Table 1: Comparison over CIFAR-100 and Fashion MNIST with MLP and LeNet models. We report means ± standard errors (over the 3 random seeds and pooled over the 2 tuning settings). "single" stands for the best between rand search and Bayes opt. "fixed init ens" is a shorthand for fixed init hyper ens, i.e., a "row" in Figure 2-(left). We separately compare the efficient methods (3 rightmost columns) and we mark in bold the best results (within one standard error). Our two methods hyper-deep/hyper-batch ensembles improve upon deep/batch ensembles respectively (in Appendix C.7.2, we assess the statistical significance of those improvements with a Wilcoxon signed-rank test, paired along settings, datasets and model types).
• Mean ± standard errors (over the 3 random seeds and pooled over the 2 tuning settings)
• Deep ens, single: take the best hyperparameter configuration found by the random search
procedure
•Metrics that depend on the predictive uncertainty: negative log-likelihood (NLL) and expected calibration error (ECE). A small sketch of both metrics follows below.

NLL = −(1/N) ∑_{i=1}^{N} log p(y_i | x_i; θ)

ECE = (1/n) ∑_{i=1}^{n} | (1/B) ∑_{j∈B_i} a_j − (1/B) ∑_{j∈B_i} y_j |

• a_j: the model's predicted probability, y_j: the true probability
• B: bin size used for binning (B_i: the i-th bin)
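A small NumPy sketch of both metrics (assumptions: integer labels, top-1 confidence as a_j, and the common bin-weighted ECE variant rather than the slide's equal-weight form):

```python
import numpy as np

def nll(probs, labels):
    """Negative log-likelihood; probs: [N, C], labels: [N] integer classes."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def ece(probs, labels, n_bins=15):
    """Expected calibration error (bin-weighted variant)."""
    conf = probs.max(axis=1)                                  # a_j: top-1 confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)  # y_j: 1 if correct
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return err
```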
3. Experimental Results
Large-scale setting
Table 2: Performance of ResNet-20 (upper table) and Wide ResNet-28-10 (lower table) models on CIFAR-10/100. We separately compare the efficient methods (2 rightmost columns) and we mark in bold the best results (within one standard error). Our two methods hyper-deep/hyper-batch ensembles improve upon deep/batch ensembles.
• Hyper-deep ens: 100 trials of random search
• Deep ens, single: take the best hyperparameter
configuration found by the random search procedure
•Performance improves across the whole range of ensemble sizes
•Fixing the ensemble size to four:
3. Experimental Results
Large-scale setting
•Average ensemble-member metrics on CIFAR-100: (NLL, ACC) = (0.904, 0.788)
•The joint training in ‘hyper-batch ens’ leads to complementary ensemble members
3. Experimental Results
Training time and memory cost
•In terms of both parameter count and training time, hyper-batch ens is about twice as costly as batch ens.
3. Experimental Results
Calibration on out-of-distribution data
•30 types of corruptions applied to the images of CIFAR-10 (CIFAR-10-C)
• D. Hendrycks and T. Dietterich. "Benchmarking neural network robustness to common corruptions and perturbations." ICLR, 2019.
3. Experimental Results
Calibration on out-of-distribution data
•The mean accuracies are similar for all ensemble methods, whereas hyper-batch ens shows more robustness than batch ens, as it typically leads to smaller worst-case values (see the summary sketch below).
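To make the "mean vs. worst values" reading concrete, a tiny sketch of how per-corruption results could be summarized (the accuracy dictionary is a hypothetical input, not the paper's data):

```python
import numpy as np

def summarize(acc_by_corruption):
    """Summarize per-corruption accuracies into the mean and worst-case values
    compared in the robustness plots (input dict is hypothetical)."""
    accs = np.array(list(acc_by_corruption.values()))
    return {"mean": float(accs.mean()), "worst": float(accs.min())}
```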
4. Conclusions
• Main contributions
• Hyper-deep ensembles.
• We define a greedy algorithm to form ensembles of neural networks exploiting two sources of diversity:
varied hyperparameters and random initialization. It is a simple, strong baseline that we hope will be used in
future research.
• Hyper-batch ensembles.
• Both the ensemble members and their hyperparameters are learned end-to-end in a single training
procedure, directly maximizing the ensemble performance.
• It outperforms batch ensembles while keeping their original memory compactness and efficient
minibatching for parallel training and prediction.
• Future work
• Towards a more compact parametrization
• Architecture diversity
Thank you.