Taiji Suzuki
The University of Tokyo / AIP-RIKEN
NeurIPS2020
Generalization bound of globally optimal
non-convex neural network training:
Transportation map estimation by
infinite dimensional Langevin dynamics
1
Summary
Neural network optimization
• We formulate NN training as an infinite dimensional
gradient Langevin dynamics in RKHS.
➢“Lift” of noisy gradient descent trajectory.
• Global optimality is ensured.
➢Geometric ergodicity + time discretization error
• Generalization error bound + Excess risk bound.
➢(i) 1/√𝑛 gen error. (ii) Fast learning rate of excess risk.
2
• Finite/infinite width can be treated in a unifying manner.
• Good generalization error guarantee
→ Different from NTK and mean field analysis.
Difficulty of NN optimization
Optimization of neural networks is “difficult” because of
3
Nonconvexity + High-dimensionality
•Neural tangent kernel:
➢ Take infinite-width asymptotics (width → ∞).
➢ The benefit of NNs over kernel methods is lost.
•Mean field analysis:
➢ Take infinite width asymptotics to guarantee convergence.
➢ Its generalization error is not well understood.
•(Usual) gradient Langevin dynamics:
➢ Suffers from the curse of dimensionality.
Our formulation:
Infinite dimensional gradient Langevin dynamics.
Infinite dim neural network
• 2-layer NN: direct expression
4
(training loss)
• 2-layer NN: transportation map expression
(infinite width)
(integral representation)
Also includes
• DNN
• ResNet
etc.
𝑎𝑚 = 0 (𝑚 > 𝑀)
→ finite width network
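The two expressions referred to on this slide are shown only as images in this transcript. As a minimal sketch with assumed (standard) notation, not copied from the slide:

    f_W(x) = \frac{1}{M}\sum_{m=1}^{M} a_m\, \sigma(w_m^\top x)          (direct expression, width M)
    f(x) = \int a(w)\, \sigma(w^\top x)\, d\rho(w)                        (integral representation, infinite width)
    \widehat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big)  (training loss)

Setting 𝑎𝑚 = 0 for 𝑚 > 𝑀 in the series form recovers a finite-width network, as noted above.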
Mean field model 5
Expectation w.r.t. prob. density 𝜌 of (𝑎, 𝑤):
Optimization of 𝑓 ⇔ Optimization of 𝜌
Continuity equation (infinite width): 𝜌𝑡 is the distribution and 𝑣𝑡 is the gradient field driving the movement of each particle.
Convergence is guaranteed for 𝜌𝑡 with a density.
[Nitanda&Suzuki, 2017][Chizat&Bach, 2018][Mei, Montanari&Nguyen, 2018]
Each neuron corresponds to one particle.
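The continuity equation itself is not reproduced in this transcript; in the cited mean-field analyses it takes the standard form (assumed here, not copied from the slide)

    \partial_t \rho_t + \nabla \cdot (v_t\, \rho_t) = 0,

where each particle (neuron) moves along the velocity field 𝑣𝑡 given by the gradient of the objective's first variation, and 𝜌𝑡 is the resulting distribution.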
“Lift” of neural network training 6
Transportation map formulation:
(finite width)
𝜌0 has a finite discrete support
→ finite width network
Finite/Infinite width can be treated
in a unifying manner.
(unlike existing frameworks such as NTK and mean field)
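The formulation itself appears only as a figure. Roughly, and with assumed notation rather than the slide's exact expression, the network is parameterized by a transport map W applied to the fixed base distribution 𝜌0:

    f_W(x) = \int a_W(\tau)\, \sigma\big(w_W(\tau)^\top x\big)\, d\rho_0(\tau),

so training the network means estimating the map W = (a_W, w_W) in a function space rather than the individual weights.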
Infinite-dim non-convex optimization 7
Ex.
• ℋ: 𝐿²(𝜌)
• ℋ𝐾: RKHS (e.g., a Sobolev space)
The objective is nonconvex; we seek its optimal solution.
We utilize gradient Langevin dynamics
in a Hilbert space to optimize the objective.
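As a hedged sketch of the objective (assumed notation), the training problem is empirical risk minimization over an element of the Hilbert space,

    \min_{W \in \mathcal{H}_K}\; \widehat{L}(W) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f_W(x_i)\big),

which is nonconvex in W even though ℋ𝐾 itself is a linear space.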
Infinite-dim. Langevin dynamics 8
ℋ𝐾: RKHS with kernel 𝐾.
Cylindrical Brownian motion:
Time discretization
Analogous to Gaussian process estimator.
Stationary distribution = likelihood × prior
(the prior is the Gaussian measure associated with the RKHS)
(more precisely, we consider a semi-implicit Euler scheme)
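The update rule and the cylindrical-Brownian-motion SDE are shown only as images here. Below is a minimal Python sketch (an illustrative assumption, not the paper's implementation) of one semi-implicit Euler step of gradient Langevin dynamics in a truncated RKHS eigenbasis; gld_step, grad_loss, and the truncation to K modes are hypothetical names and choices.

    import numpy as np

    def gld_step(w, grad_loss, mu, eta=1e-2, beta=1e3, rng=None):
        """One semi-implicit Euler step of (truncated) infinite-dimensional GLD.

        w         : current iterate expressed in the RKHS eigenbasis, shape (K,)
        grad_loss : callable returning the gradient of the empirical loss at w
        mu        : eigenvalues mu_k of the prior covariance (Gaussian measure on the RKHS)
        eta, beta : step size and inverse temperature
        """
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal(w.shape) * np.sqrt(2.0 * eta / beta)
        # Loss gradient treated explicitly; the linear prior drift -w_k / mu_k treated implicitly,
        # which keeps the scheme stable as the number of retained modes K grows.
        return (w - eta * grad_loss(w) + noise) / (1.0 + eta / mu)

For instance, mu = np.arange(1.0, K + 1.0) ** -2.0 would match the 𝜇𝑘 ≲ 𝑘^−2 eigenvalue decay assumed on the next slide.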
Infinite dimensional setting
Hilbert space
9
RKHS structure
Assumption (eigenvalue decay)
(not essential; can be relaxed to 𝜇𝑘 ∼ 𝑘^−𝑝 for 𝑝 > 1)
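The displayed eigen-expansion is not in the transcript; a sketch of the structure it presumably refers to (assumed, standard notation):

    \mathcal{H}_K = \Big\{ f = \sum_{k\ge 1} \alpha_k e_k \;:\; \sum_{k\ge 1} \alpha_k^2/\mu_k < \infty \Big\},
    \qquad \mu_k \lesssim k^{-2},

where (𝜇𝑘, 𝑒𝑘) are the eigenvalues and eigenfunctions of the kernel integral operator of 𝐾 in the ambient Hilbert space ℋ.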
Risk bounds of NN training 10
Gen. error:
Excess risk:
Time discretization
Optimization method (Infinite dimensional GLD):
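The two quantities behind these labels are standard; as a sketch with assumed notation, writing \widehat{L}(W) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_W(x_i)) for the empirical risk and L(W) = \mathbb{E}_{(X,Y)}[\ell(Y, f_W(X))] for the population risk,

    \text{generalization error:}\quad L(W) - \widehat{L}(W),
    \qquad
    \text{excess risk:}\quad L(W) - \inf_{W'} L(W').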
Error bound 11
Thm (Generalization error bound)
with probability 1 − 𝛿.
Opt. error:
[Muzellec, Sato, Massias, Suzuki, arXiv:2003.00306 (2020)]
Ο(1/√𝑛)
PAC-Bayesian stability bound [Rivasplata, Kuzborskij, Szepesvári, and Shawe-Taylor, 2019]
Assumption
• Loss function ℓ is “sufficiently smooth.”
• Loss and its gradients are bounded.
(geometric ergodicity + time discretization)
Λ𝜂∗: spectral gap
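Only the skeleton of the theorem survives in this transcript. Schematically (a hedged reading of the listed ingredients, not the exact statement), the risk of the k-th iterate is bounded by the empirical risk, a PAC-Bayesian stability term of order 1/√𝑛, and an optimization error that decays geometrically in k at a rate governed by the spectral gap Λ𝜂∗, up to a time-discretization term in the step size 𝜂:

    L(W_k) \;\le\; \widehat{L}(W_k) \;+\; O\!\big(1/\sqrt{n}\big)
    \;+\; \underbrace{C\, e^{-\Lambda_\eta^* \eta k} + (\text{discretization error in } \eta)}_{\text{opt. error}}
    \quad \text{with probability } 1-\delta.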
Fast rate: general result 12
Thm (Excess risk bound: fast rate)
Let and .
Can be faster than Ο(1/√𝑛)
Example: classification & regression 13
Strong low noise condition:
For sufficiently large 𝑛 and any 𝛽 ≤ 𝑛,
Classification
Regression
Model:
Excess classification error
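The condition is displayed only as an equation image; in the classification literature the strong low noise (hard-margin) condition is usually stated as follows (assumed form, 𝑌 ∈ {±1}):

    \exists\, \delta_0 > 0 \ \text{such that}\ \big|\mathbb{P}(Y=1 \mid X=x) - \tfrac{1}{2}\big| \ge \delta_0 \quad \text{for almost every } x,

under which the excess classification error can decay much faster than Ο(1/√𝑛).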
Summary
Neural network optimization
• We formulate NN training as an infinite dimensional
gradient Langevin dynamics in RKHS.
➢“Lift” of noisy gradient descent trajectory.
• Global optimality is ensured.
➢Geometric ergodicity + time discretization error
• Generalization error bound + Excess risk bound.
➢(i) 1/√𝑛 gen error. (ii) Fast learning rate of excess risk.
14
• Finite/infinite width can be treated in a unifying manner.
• Good generalization error guarantee
→ Different from NTK and mean field analysis.
