Exploring Optimization in Vowpal
Wabbit
-Shiladitya Sen
Vowpal Wabbit
• Online
• Open Source
• Machine Learning Library
• Has achieved record-breaking speed by implementation of
• Parallel Processing
• Caching
• Hashing, etc.
• A “true” library:
offers a wide range of
machine learning and
optimization algorithms
Machine Learning Models
• Linear Regressor ( --loss_function squared)
• Logistic Regressor (--loss_function logistic)
• SVM (--loss_function hinge)
• Neural Networks ( --nn <arg> )
• Matrix Factorization
• Latent Dirichlet Allocation ( --lda <arg> )
• Active Learning ( --active_learning)
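For reference, a minimal sketch (not VW's internal code) of the per-example losses behind the three --loss_function choices above, assuming a raw score s = w·x and, for the logistic and hinge cases, labels y ∈ {−1, +1}:

```python
import numpy as np

def squared_loss(s, y):      # --loss_function squared  (linear regression)
    return (s - y) ** 2

def logistic_loss(s, y):     # --loss_function logistic (logistic regression)
    return np.log1p(np.exp(-y * s))

def hinge_loss(s, y):        # --loss_function hinge    (SVM-style)
    return max(0.0, 1.0 - y * s)
```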
Regularization
• L1 Regularization ( --l1 <arg> )
• L2 Regularization ( --l2 <arg> )
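A small illustrative sketch of how these penalties enter the objective; lambda1 and lambda2 stand in for the values passed to --l1 / --l2, w is the weight vector, and the 1/2 factor on the L2 term is a common convention rather than a claim about VW's exact scaling:

```python
import numpy as np

def regularized_objective(data_loss, w, lambda1=0.0, lambda2=0.0):
    """data_loss: unregularized loss summed over examples (illustrative)."""
    return (data_loss
            + lambda1 * np.sum(np.abs(w))       # L1 penalty (--l1)
            + 0.5 * lambda2 * np.dot(w, w))     # L2 penalty (--l2)
```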
Optimization Algorithms
• Online Gradient Descent ( default )
• Conjugate Gradient ( --conjugate_gradient )
• L-BFGS ( --bfgs )
Optimization
The Convex Definition
Convex Sets
Definition:
A subset $C$ of $\mathbb{R}^n$ is said to be a convex set if
$(1-\lambda)x + \lambda y \in C$ for all $x, y \in C$ and $\lambda \in (0,1)$.
Convex Functions:
A real-valued function $f : X \to \mathbb{R}$, where $X \subseteq \mathbb{R}^n$ is a convex set,
is said to be a convex function if its epigraph, defined as
$\{(x, \mu) \mid x \in X \text{ and } \mu \ge f(x)\}$, is a convex set.
It can be proved from the definition of convex functions
that such a function can have no maxima.
In other words…
It has at most one minimum value,
i.e. a local minimum is a global minimum
Loss functions which are convex help in optimization for
Machine Learning
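As a quick check of the definition (an added illustration, not from the original slides), the squared loss $f(x) = x^2$ is convex. Using the chord inequality, which is equivalent to convexity of the epigraph:

```latex
\lambda f(x) + (1-\lambda) f(y) - f\big(\lambda x + (1-\lambda) y\big)
  = \lambda (1-\lambda)\,(x - y)^2 \;\ge\; 0, \qquad \lambda \in (0,1)
```

Since every chord lies on or above the graph, the epigraph is a convex set.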
Optimization
Algorithm I : Online Gradient Descent
What the batch implementation of
Gradient Descent (GD) does
How does Batch-version of GD work?
• Expresses total loss J as a function of a set of
parameters : x
• Calculates $\nabla J(x)$ as the direction of steepest ascent,
so $-\nabla J(x)$ is the direction of steepest descent
• Takes a calculated step α in that direction to
reach a new point, with new co-ordinate values
of x
Algorithm: $x_{t+1} = x_t - \alpha \nabla J(x_t)$
This continues until the required tolerance is achieved.
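A minimal sketch of this batch update for a squared-error objective; the fixed step size alpha, the tolerance test and all names are assumptions for illustration, not VW internals:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, tol=1e-6, max_iters=1000):
    """Minimize J(x) = 0.5 * ||X @ x - y||^2 with a fixed step size alpha."""
    x = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = X.T @ (X @ x - y)              # gradient of J over the *whole* dataset
        x_new = x - alpha * grad              # x_{t+1} = x_t - alpha * grad J(x_t)
        if np.linalg.norm(x_new - x) < tol:   # required tolerance achieved
            return x_new
        x = x_new
    return x
```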
What is the online implementation of GD?
How does online GD work?
1. Takes a point from the dataset: $p_t$
2. Using existing hypothesis, predicts value
3. True value is revealed
4. Calculates error J as a function of parameters x for point $p_t$
5. Evaluates $\nabla J(x_t)$
6. Takes a step in the direction of steepest descent: $-\nabla J(x_t)$
7. Updates parameters as: $x_{t+1} = x_t - \eta \nabla J(x_t)$
8. Moves on to the next point $p_{t+1}$
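An illustrative sketch of these eight steps for squared loss; the stream, eta and other names are assumptions, and the step size is held fixed here (VW decays η as described two slides later):

```python
import numpy as np

def online_gradient_descent(stream, n_features, eta=0.1):
    """`stream` yields (features, label) pairs one at a time."""
    x = np.zeros(n_features)            # model parameters
    for features, label in stream:      # 1. take a point p_t
        pred = features @ x             # 2. predict with current hypothesis
        err = pred - label              # 3./4. true value revealed, error J computed
        grad = err * features           # 5. gradient of J for this single point
        x -= eta * grad                 # 6./7. step in direction of steepest descent
    return x                            # 8. (loop) move on to the next point
```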
Looking Deeper into Online GD
• Essentially calculates the error function J(x)
independently for each point, as opposed to
calculating J(x) as the sum of errors over all points,
as in the batch (offline) implementation of GD
• To achieve accuracy, Online GD takes multiple
passes through the dataset
(Continued…)
Still deeper…
• So that convergence is reached, the step size η is
reduced across examples and passes. In VW, this is
implemented as:

$\eta_{e} = \lambda \cdot d^{\,e} \cdot \left( \dfrac{i}{i + n_{e}} \right)^{p}$

where e is the pass number, $n_{e}$ is the number of examples seen so far, and (defaults in brackets):
• λ : -l [--learning_rate] [10]
• p : --power_t [0.5]
• i : --initial_t [1]
• d : --decay_learning_rate [1]
• Cache file used for multiple passes (-c)
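A small sketch of the decay schedule reconstructed above; parameter names mirror the flags listed, but this is an illustration, not VW's exact internal code:

```python
def vw_style_learning_rate(pass_idx, examples_seen,
                           l=10.0, p=0.5, i=1.0, d=1.0):
    """eta = l * d**pass_idx * (i / (i + examples_seen))**p"""
    return l * (d ** pass_idx) * (i / (i + examples_seen)) ** p

# The rate shrinks both across examples and across passes, e.g.:
# vw_style_learning_rate(0, 0)            -> 10.0
# vw_style_learning_rate(2, 1000, d=0.9)  -> much smaller
```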
So why Online GD?
• It takes less space…
• And my system needs its space!
Optimization
Algorithm II: Method of Conjugate
Gradients
What is wrong with Gradient Descent?
• Often takes steps in the same direction
• Convergence issues
Convergence Problems:
The need for Conjugate Gradients:
Wouldn’t it be wonderful if, having taken a step in a direction, we never
needed to step in that direction again to minimize the error?
This is where Conjugate Gradient comes in…
Method of Orthogonal Directions
• In the (n+1)-dimensional space where J is
defined over n parameters, there exist at most n linearly
independent directions for the parameters
• The error may have a component in at most n
linearly independent (orthogonal) directions
• Intended: a step in each of these directions,
i.e. at most n steps to minimize the error
• Not directly solvable with orthogonal directions: the required
step lengths cannot be computed without already knowing the solution
Conjugate Directions:
$d_i, d_j$ are search directions.
Orthogonal: $d_i^{T} d_j = 0$
Conjugate (with respect to A): $d_i^{T} A\, d_j = 0$
How do we get the conjugate
directions?
• We first choose n mutually orthogonal
directions: $u_1, u_2, \ldots, u_n$
• We calculate $d_i$ as:

$d_i = u_i + \sum_{k=1}^{i-1} \beta_{ik}\, d_k, \qquad \beta_{ik} = -\dfrac{u_i^{T} A\, d_k}{d_k^{T} A\, d_k}$

• This subtracts out any components of $u_i$ which are not
A-orthogonal to $d_1, d_2, \ldots, d_{i-1}$, to calculate $d_i$.
So what is Method of Conjugate Gradients?
• If we set $u_i$ to $r_i$, the gradient (residual) in the i-th step,
we have the Method of Conjugate Gradients;
$r_i, r_j$ are linearly independent for $i \neq j$.
• The step size in the direction $d_i$ is found by an
exact line search.
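A small sketch of the Gram-Schmidt style A-conjugation from the formula two slides back; the matrix A and all names here are hypothetical, chosen only to illustrate the construction:

```python
import numpy as np

def conjugate_directions(U, A):
    """Turn linearly independent directions u_1..u_n (rows of U) into mutually
    A-conjugate directions d_i by subtracting out the components that are not
    A-orthogonal to d_1..d_{i-1}."""
    D = []
    for u in U:
        d = u.copy()
        for dk in D:
            beta = -(u @ A @ dk) / (dk @ A @ dk)   # beta_ik from the formula above
            d += beta * dk
        D.append(d)
    return np.array(D)

# Quick check with a hypothetical symmetric positive-definite A:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
D = conjugate_directions(np.eye(2), A)
print(D @ A @ D.T)   # off-diagonal entries ~ 0, i.e. d_i^T A d_j = 0
```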
The Algorithm for Conjugate Gradient:
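The slide's figure is not reproduced here; as a stand-in, a minimal sketch of the classical (linear) conjugate gradient iteration for minimizing $\tfrac{1}{2}x^{T}Ax - b^{T}x$, i.e. solving $Ax = b$ with A symmetric positive definite. VW applies a nonlinear variant to the loss, so treat this as illustrative only:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iters=None):
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x                      # residual = negative gradient
    d = r.copy()                       # first search direction
    for _ in range(max_iters or n):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)     # exact line search along d
        x += alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d           # new direction, A-conjugate to the old ones
        r = r_new
    return x
```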
Requirement for Preconditioning:
• Round-off errors lead to slight deviations
from conjugate directions
• As a result, Conjugate Gradient is
implemented iteratively
• To minimize the number of iterations,
preconditioning is done on the vector space
What is Pre-conditioning?
• The vector space is modified by multiplying by a
matrix $M^{-1}$, where M is a symmetric,
positive-definite matrix.
• This leads to a better clustering of the
eigenvalues and faster convergence.
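One common, simple choice (not specific to VW) is the Jacobi preconditioner M = diag(A); a sketch of how it is built and applied:

```python
import numpy as np

def jacobi_preconditioner(A):
    """M = diag(A): symmetric and positive definite whenever A is."""
    return np.diag(np.diag(A))

def apply_preconditioner(M, r):
    """z = M^{-1} r; in preconditioned CG the search directions are built
    from z instead of r, and the products r @ r become r @ z."""
    return np.linalg.solve(M, r)
```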
Optimization
Algorithm III: L-BFGS
Why think linearly?
Newton’s Method proposes a step based on a local quadratic model of the
objective, rather than the purely linear (gradient) information used in GD
and CG.
This leads to faster convergence…
Newton’s Method:
2nd order Taylor's series expansion of $J(x + \Delta x)$:

$J(x + \Delta x) \approx J(x) + \nabla J(x)^{T} \Delta x + \tfrac{1}{2}\, \Delta x^{T}\, \nabla^{2} J(x)\, \Delta x$

Minimizing with respect to $\Delta x$, we get:

$\Delta x = -\,[\nabla^{2} J(x)]^{-1} \nabla J(x)$

In iterative form:

$x_{n+1} = x_n - [\nabla^{2} J(x_n)]^{-1} \nabla J(x_n)$
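A minimal sketch of the iterative update above; grad and hess are assumed callables returning the gradient vector and Hessian matrix, purely for illustration:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton iteration: x_{n+1} = x_n - [H(x_n)]^{-1} grad(x_n)."""
    return x - np.linalg.solve(hess(x), grad(x))   # solve H dx = grad, don't invert

# e.g. for J(x) = 0.5 x^T A x - b^T x, a single step lands on the minimizer:
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, 1.0])
x_star = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(np.allclose(A @ x_star, b))   # True
```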
What is this BFGS Algorithm?
• Complications in calculating $B = H^{-1} = [\nabla^{2} J(x)]^{-1}$
led to Quasi-Newton Methods, the most popular
among which is BFGS
• Named after Broyden–Fletcher–Goldfarb–Shanno
• Maintains an approximate matrix B and
updates B upon each iteration
BFGS Algorithm:
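The slide's figure is not reproduced here; for reference, the standard BFGS update of the inverse-Hessian approximation B, with $s_k = x_{k+1} - x_k$ and $y_k = \nabla J(x_{k+1}) - \nabla J(x_k)$, is:

```latex
\rho_k = \frac{1}{y_k^{\top} s_k}, \qquad
B_{k+1} = \left(I - \rho_k\, s_k y_k^{\top}\right) B_k \left(I - \rho_k\, y_k s_k^{\top}\right)
          + \rho_k\, s_k s_k^{\top}
```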
Memory is a limited asset
• In Vowpal Wabbit, the version of BFGS
implemented is L-BFGS
• In L-BFGS, not all previous updates to B are
stored in memory
• At a particular iteration i, only the last m
updates are stored and used to make the new
update
• Also, the step size η in each step is calculated
by an inexact line search satisfying the Wolfe
conditions.
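A minimal sketch of the L-BFGS two-loop recursion, which applies the approximate inverse Hessian to a gradient using only the last m stored (s, y) pairs; this is an illustration, not VW's implementation:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Return the search direction -H_k * grad from the stored pairs
    s_i = x_{i+1} - x_i and y_i = grad_{i+1} - grad_i (most recent last)."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):   # first loop: newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a, s, y))
    if s_list:                                         # initial scaling H_0
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for rho, a, s, y in reversed(alphas):              # second loop: oldest to newest
        b = rho * (y @ q)
        q += (a - b) * s
    return -q                                          # descent direction
```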