
Closing Remarks

compiled by Alvin Wan from Professor Benjamin Recht's lecture

1 Empirical Risk Minimization


Risk is the expected loss of our prediction function or, in colloquial terms, a measure of how
well we'll predict:

$$R[f] = \mathbb{E}[\operatorname{loss}(f(x), y)]$$

Machine learning is only as good as your data, because data is the only access we have to the
underlying distribution. Samples expose the actual distribution, but only partially. Our empirical
risk, or training error, is the following:

$$R_T[f] = \frac{1}{n} \sum_{i=1}^{n} \operatorname{loss}(f(x_i), y_i)$$
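To make this concrete, here is a minimal sketch in Python with NumPy (the squared loss and the toy linear model are illustrative choices; the notes leave the loss generic) of measuring empirical risk:

```python
import numpy as np

def squared_loss(y_pred, y_true):
    # One common choice of loss; the notes leave loss(., .) generic.
    return (y_pred - y_true) ** 2

def empirical_risk(f, X, y, loss=squared_loss):
    """R_T[f]: average loss of prediction function f over the training set."""
    predictions = np.array([f(x) for x in X])
    return np.mean(loss(predictions, y))

# Toy data from a linear model with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

f = lambda x: x @ w_true            # pretend this is our learned model
print(empirical_risk(f, X, y))      # small, since f matches the data
```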

We can measure and optimize this empirical risk. However, we can only guess how well this
estimator serves as a proxy for our actual risk. Let us consider the Fundamental Theorem
of Machine Learning:

$$R[f] = (R[f] - R_T[f]) + R_T[f]$$

where the last term $R_T[f]$ is the training error, and $R[f] - R_T[f]$ is our generalization error.
Only our training error is guaranteed to be observable. We can measure generalization only
by assuming that we have some holdout data that is representative of all possible data. In
effect, our assumption is the following:

$$R_V[f] - R_T[f] \approx R[f] - R_T[f]$$

If your data is i.i.d., $(x_1, y_1), \ldots, (x_n, y_n)$, then the scale of our error is roughly

$$R_T[f] - R[f] \approx O\left(\sqrt{\frac{d}{n}}\right)$$

If $d > n$, then we need to regularize.
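To see why regularization matters when $d > n$, here is a small sketch (Python/NumPy; ridge regression is one standard regularizer, used here for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                       # d > n: underdetermined problem
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ordinary least squares is ill-posed here: X^T X is d x d but has
# rank <= n < d, so it is singular and the solution is not unique.
lam = 0.1                           # regularization strength (illustrative value)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge.shape)                # (50,) -- a unique, well-defined solution
```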
It's much more important that

$$R[f] - R_V[f] \leq O\left(\sqrt{\frac{\log K}{M}}\right)$$

is small, where $K$ is the number of models used for cross validation and $M$ is the number of
validation samples. The ideal validation is run on newly-generated data, produced in
exactly the same way the old data was generated and independent of the training set.
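A minimal sketch of this holdout workflow (Python/NumPy; the split sizes, candidate models, and regularization strengths are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=300)

# Hold out a validation set V, independent of the training set T.
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# K candidate models (here, one per regularization strength).
lams = [0.01, 0.1, 1.0, 10.0]
models = [ridge_fit(X_train, y_train, lam) for lam in lams]

# R_V[f] proxies R[f]; pick the model with the lowest validation risk.
val_risks = [mse(w, X_val, y_val) for w in models]
best = int(np.argmin(val_risks))
print(f"best lambda = {lams[best]}, validation risk = {val_risks[best]:.3f}")
```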

2 Pipeline

Data → Features → Train Model

We have some rules of thumb for features:

- Text: bag of words, n-grams, histograms
- Vision: pixels, gradients, histograms of gradients, wavelets
- Medicine: age, gender, family history, blood tests

Here is a survey of methods:

- Linear predictors: $w^T x$
- Lifting: $x \mapsto [x_1, x_2, x_1 x_2, \ldots]^T$ (nonlinear)
- Kernel trick: $x \mapsto \Phi(x)$, where $\langle \Phi(x), \Phi(z) \rangle = k(x, z)$
- Neural networks: $f(x, \theta)$, optimized with respect to $\theta$
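As a sketch of the lifting and kernel ideas (Python/NumPy; the quadratic lift and polynomial kernel are standard textbook choices, not necessarily the lecture's exact examples):

```python
import numpy as np

def lift(x):
    # Explicit lift: original coordinates plus a cross term, as in the notes.
    x1, x2 = x
    return np.array([x1, x2, x1 * x2])

def poly_kernel(x, z, degree=2, c=1.0):
    # Kernel trick: k(x, z) equals <Phi(x), Phi(z)> for an implicit
    # polynomial feature map Phi that we never construct explicitly.
    return (x @ z + c) ** degree

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# A linear predictor in the lifted space is a nonlinear predictor in x.
w = np.ones(3)
print(w @ lift(x))

# The kernel gives a lifted inner product without forming the features.
print(poly_kernel(x, z))
```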

Another way to build features is to use binning, histograms, or trees. With binning, we split the
data into ranges of values, such as $[a > 0, a < 0]$. Always compare against nearest neighbors.
In general, you should not be hyper-sensitive to continuous-valued features, e.g., the model
should not depend on the third or fourth decimal place. A model is stable (i.e., robust) if
$f_T \approx f_{T \cup \{(x_i, y_i)\}}$, that is, if adding a single training point barely changes
the learned model. One theorem states that if this stability property holds, we know
that generalization error is fairly low.
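A minimal binning sketch (Python/NumPy; the bin edges are arbitrary choices for illustration):

```python
import numpy as np

a = np.array([-3.2, -0.5, 0.1, 2.7, 5.0])

# Split a continuous feature into ranges, e.g. negative vs. non-negative,
# then one-hot encode the bin index as the new feature.
edges = np.array([0.0, 2.0])            # hypothetical bin boundaries
bins = np.digitize(a, edges)            # 0: a < 0, 1: 0 <= a < 2, 2: a >= 2
one_hot = np.eye(len(edges) + 1)[bins]  # binned feature representation
print(bins)                             # [0 0 1 2 2]
print(one_hot)
```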

3 Unsupervised Learning

We can either reduce dimensions (PCA) or cluster (k-means, agglomerative, spectral).
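For instance, a minimal PCA sketch (Python/NumPy; PCA via the SVD is a standard construction, shown here reducing to two dimensions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))

# PCA via SVD: center the data, then project onto the top-k right
# singular vectors (the principal components).
k = 2
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:k].T
print(X_reduced.shape)   # (200, 2)
```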

Stochastic gradient descent is one algorithm that is well-suited for all the methods presented
in this course. $n$ should be fairly large; pick a learning rate just large enough that you
do not diverge, and decrease it as the algorithm runs. 200 epochs is a good upper bound,
beyond which overfitting becomes a concern.
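A minimal SGD sketch following this advice (Python/NumPy; the least-squares objective, initial step size, and decay schedule are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 5                          # n fairly large, as advised
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
lr0 = 0.01                              # just large enough not to diverge
for epoch in range(20):                 # well under the 200-epoch upper bound
    lr = lr0 / (1 + epoch)              # decrease the step size as we run
    for i in rng.permutation(n):
        grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of one squared loss
        w -= lr * grad

print(np.linalg.norm(w - w_true))       # should be small
```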

After this course, you can take the following to further your knowledge in this domain:

- Optimization: EE127A, EE227C
- Probability: EE126, Stat 134, Stat 210A, Stat 210B
- Applications: NLP, Vision, Robotics

Learning theory teaches active learning and experiment design. Scalable machine learning
is one field to explore; safety, reliability, and robustness are others.
