Machine learning is only as good as your data, because data is the only way we can assess the model. Samples expose the actual distribution of our data, but only partially. Our empirical risk, or training error, is the following:
$$R_T[f] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(f(x_i), y_i)$$
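A minimal sketch of computing this average loss; the data, the predictor $f$, and the squared-error loss are illustrative assumptions.

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda pred, target: (pred - target) ** 2):
    """Average loss of predictor f over the n training samples, i.e. R_T[f]."""
    preds = np.array([f(x) for x in X])
    return np.mean(loss(preds, y))

# Hypothetical data and a linear predictor w^T x (illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
y = X @ w + 0.1 * rng.normal(size=100)
print(empirical_risk(lambda x: x @ w, X, y))
```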
We can measure and optimize this empirical risk. However, we can only guess how well this estimator serves as a proxy for our actual risk. Let us consider the Fundamental Theorem of Machine Learning:
$$R[f] = (R[f] - R_T[f]) + R_T[f]$$
where the last term $R_T[f]$ is the training error, and $R[f] - R_T[f]$ is our generalization error.
Only our training error is guaranteed to be observable. We can measure generalization only
assuming that we have some holdout data that is representative of all possible data. In
effect, our assumption is the following:
$$R_V[f] - R_T[f] \approx R[f] - R_T[f]$$
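A minimal sketch of this assumption in practice: hold out part of the data and use the validation risk to estimate the unobservable gap. The split ratio, the model, and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data (illustrative assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=500)

# Hold out 20% of the data as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
f = LinearRegression().fit(X_tr, y_tr)

train_risk = np.mean((f.predict(X_tr) - y_tr) ** 2)   # R_T[f]
val_risk = np.mean((f.predict(X_val) - y_val) ** 2)   # R_V[f]
print("estimated generalization gap:", val_risk - train_risk)
```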
If your data is i.i.d., $(x_1, y_1), \ldots, (x_n, y_n)$, then the scale of our error is roughly $R_T[f] - R[f] \approx O\!\left(\sqrt{d/n}\right)$. If $d > n$, then we need to regularize.
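A minimal sketch of the $d > n$ regime under these assumptions: an unregularized least-squares fit can drive the training error to zero, so we compare it against ridge regression (an $\ell_2$ penalty) on held-out data. The problem sizes and penalty strength are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical sparse-signal data with more features than samples (d > n).
rng = np.random.default_rng(2)
n, d = 50, 200
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0
y = X @ w_true + 0.1 * rng.normal(size=n)

# Fresh data drawn the same way, used only for evaluation.
X_val = rng.normal(size=(n, d))
y_val = X_val @ w_true + 0.1 * rng.normal(size=n)

for name, model in [("OLS", LinearRegression()), ("ridge", Ridge(alpha=10.0))]:
    model.fit(X, y)
    print(name, "validation MSE:", np.mean((model.predict(X_val) - y_val) ** 2))
```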
It is much more important that $R[f] - R_V[f] \approx O\!\left(\sqrt{\tfrac{\log K}{M}}\right)$ is small, where $K$ is the number of models used for cross-validation and $M$ is the size of the validation set. The ideal validation is run on newly generated data, produced in exactly the same way the old data was generated, independent of the training set.
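A minimal sketch of comparing $K$ candidate models on a holdout set and keeping the one with the lowest validation risk; the candidates (ridge penalties) and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical data and a 70/30 train/validation split (illustrative assumptions).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + 0.5 * rng.normal(size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=3)

# K = 4 candidate models, distinguished here only by their ridge penalty.
candidates = {f"alpha={a}": Ridge(alpha=a) for a in [0.01, 0.1, 1.0, 10.0]}
val_risk = {name: np.mean((m.fit(X_tr, y_tr).predict(X_val) - y_val) ** 2)
            for name, m in candidates.items()}

best = min(val_risk, key=val_risk.get)
print("best model:", best, "validation MSE:", val_risk[best])
```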
2 Pipeline
The pipeline: Data → Features → Train Model, e.g., linear predictors $w^T x$.
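A minimal sketch of this pipeline, assuming standardization as the feature step and a linear predictor $w^T x$ trained with stochastic gradient descent; all of these choices are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Data: a hypothetical regression problem (illustrative assumption).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + 0.2 * rng.normal(size=200)

pipeline = make_pipeline(
    StandardScaler(),                 # Features: standardize each column
    SGDRegressor(max_iter=1000),      # Train Model: linear predictor w^T x
)
pipeline.fit(X, y)
print("learned weights w:", pipeline[-1].coef_)
```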
Another way to build features is to use binning, histograms, or trees. With binning, we split the data into ranges of values, such as $a > 0$ versus $a < 0$. Always compare against nearest neighbors. In general, a model should not be hyper-sensitive to continuous-valued features, e.g., depending on the third or fourth decimal place. A model is stable (i.e., robust) if $f_T \approx f_{T \setminus \{(x_i, y_i)\}}$, meaning that dropping a single training point barely changes the learned predictor; such stability indicates a good model. One theorem states that if this holds, the generalization error is fairly low.
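A minimal sketch of binning a continuous feature into ranges, which blunts sensitivity to the third or fourth decimal place; the bin edges are illustrative assumptions.

```python
import numpy as np

# A hypothetical continuous feature (illustrative assumption).
rng = np.random.default_rng(5)
a = rng.normal(size=10)

sign_bin = (a > 0).astype(int)                 # two bins: a > 0 vs. a <= 0
edges = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # histogram-style bin edges
hist_bin = np.digitize(a, edges)               # index of the bin each value falls into

one_hot = np.eye(edges.size + 1)[hist_bin]     # binned feature as a one-hot encoding
print(sign_bin, hist_bin, one_hot.shape)
```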
3 Unsupervised Learning
Stochastic gradient descent is one algorithm that is well suited for all the methods presented in this course. $n$ should be fairly large; pick a learning rate just large enough that you do not diverge, and decrease it as the algorithm runs. 200 epochs is a good upper bound before overfitting becomes a concern.
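A minimal sketch of stochastic gradient descent for a linear predictor with a decaying learning rate and an epoch cap; the step size, decay schedule, and data are illustrative assumptions.

```python
import numpy as np

# Hypothetical least-squares problem (illustrative assumption).
rng = np.random.default_rng(6)
n, d = 1000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
lr0 = 0.01                                   # small enough that the iterates do not diverge
for epoch in range(200):                     # upper bound on the number of epochs
    lr = lr0 / (1 + 0.01 * epoch)            # decrease the step size as the algorithm runs
    for i in rng.permutation(n):
        grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the squared loss at sample i
        w -= lr * grad
print("parameter error:", np.linalg.norm(w - w_true))
```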
After this course, you can pursue the following to further your knowledge in this domain: learning theory, which teaches active learning and experiment design; scalable machine learning; and safety, reliability, and robustness.