Machine Learning
Lecture 1: Introduction to Machine Learning
盛律 / School of Software
2024 Fall/Winter Semester

Instructor
盛律
- Email: [email protected]
- Office: Room 419 East, Engineering Training Center
- Homepage: https://lucassheng.github.io

Teaching Assistants
Lead TAs: 黄泽桓, 文皓, 樊红兴, 王立芃, 陈泽人
[email protected], [email protected], [email protected], [email protected], [email protected]
- For any questions about the course, feel free to contact the TAs

About the Course
- A newly established core course for professional degree programs (48 class hours), jointly offered by more than 10 schools
- Schools offering it this semester: Computer Science, Instrumentation & Optoelectronics, Software, Artificial Intelligence, Cyber Security
- Teaching content and assessment methods may differ slightly across schools, and grading standards may also vary
- Students from the Schools of Computer Science, Instrumentation & Optoelectronics, Artificial Intelligence, Cyber Security, etc. are strongly advised to contact their own school's academic affairs office and re-enroll in the Machine Learning course offered by their own school
Course Communication and Resources
- Course WeChat group
  - After joining, please change your group nickname to "StudentID-Name"
- Course resources: 智学北航

Course Goals
- Master the basic theory and current progress of machine learning
- Be able to apply machine learning to solve practical problems
- Build a solid foundation for related scientific research and engineering practice

Reference Books
- Pattern Recognition and Machine Learning, Christopher M. Bishop
- 统计学习方法 (Statistical Learning Methods), 李航
- Machine Learning, Tom M. Mitchell
- 机器学习 (Machine Learning), 周志华

Syllabus
1. Overview of Machine Learning (3 hours)
2. Foundations of Machine Learning (3 hours)
3. Linear Models (3 hours)
4. Regularization and Sparse Learning (2 hours)
5. Support Vector Machines and Kernel Methods (3 hours)
6. Neural Networks (3 hours)
7. Deep Neural Networks (6 hours)
8. Clustering (3 hours)
9. Dimensionality Reduction (3 hours)
10. Association Rule Learning (2 hours)
11. Probabilistic Graphical Models (3 hours)
12. Sampling Methods (2 hours)
13. Decision Trees (2 hours)
14. Ensemble Learning (2 hours)
15. Semi-supervised Learning (2 hours)
16. Reinforcement Learning (3 hours)
17. Case Studies of Applications (3 hours)
- Prerequisites: Mathematical Analysis for Engineering, Advanced Algebra, Probability and Statistics, etc.
Assessment
- Coursework (65%)
  - 4 assignments (10% + 15% + 20% + 20%)
    - Theoretical calculation/derivation + code implementation
- Final exam (35%)
  - One short paper, completed individually
    - Topic: chosen by yourself; it must combine machine learning algorithms with your own research direction
    - Format: CVPR template, 6 pages of double-column main text (excluding references)
    - Defense for an excellence grade: a 5-minute PPT video with voice narration
  - Submit the topic and a brief introduction at mid-term; submit the paper at the end of the term
- No make-up exam!!!

What is Machine Learning?

Herbert A. Simon (司马贺)
- Economist, Political Scientist, Cognitive Psychologist
- Awarded the Nobel Prize in Economics in 1978 and the Turing Award in 1975

"Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task (or tasks drawn from a population of similar tasks) more effectively the next time."
-- Machine Learning: An Artificial Intelligence Approach

Why Machine Learning?
"Machine Learning denotes automatic changes in the AI system that are adaptive in the sense that they enable the system to do the same task (or tasks drawn from a population of similar tasks) more effectively the next time."

Learning = improve with experience at some task

https://en.wikipedia.org/wiki/Herbert_A._Simon
Machine Learning (Tom M. Mitchell)
- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- Machine learning depends on the nature of
  - the tasks T we wish the system to learn
  - the performance measure P we use to evaluate the system
  - the training signal or experience E we give it

Quiz
- Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?
  - Classifying emails as spam or not spam
  - Watching you label emails as spam or not spam
  - The number (or fraction) of emails correctly classified as spam/not spam
  - None of the above; this is not a machine learning problem

ML in a Probabilistic Perspective
- Treat all unknown quantities as random variables
- A probability distribution describes a weighted set of possible values the variable may have
- Why?
  - It is the optimal approach to decision making under uncertainty
  - Probabilistic modeling is the language used by most other areas of science and engineering, and thus provides a unifying framework between these fields

ML and AI?
Machine Learning is Everywhere

ML for Daily Life: Optical Character Recognition (OCR)
- Mail digit recognition, AT&T Labs
  http://www.research.att.com/~yann/
- License plate readers
  http://en.wikipedia.org/wiki/Automatic_number_plate_recognition

ML for Daily Life: Applications on Faces
- Face unlock on Apple iPhone X
- See also http://www.sensiblevision.com/

ML for Daily Life: Biometrics
- Fingerprint scanners on many new smartphones and other devices
ML for Daily Life: Speech Recognition & Translation
- https://www.microsoft.com/en-us/research/video/speech-recognition-breakthrough-for-the-spoken-translated-word/?from=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttp%2Fresearch.microsoft.com%2Fapps%2Fvideo%2F%3Fid%3D175450

ML for Social Media & Industry & E-Commerce
- Predict stock prices, improve search, reduce spam, improve advertiser return on investment, ...
- Machine learning typically generates 10+% improvements

ML for Social Media & Industry & E-Commerce
- Search, photo tagging, ranking articles in your news feed
- Product recommendation, eCommerce fraud detection, forecasting demand, pricing
- Predicting whether a customer will cancel a service and jump to a competitor

ML for Energy
- Predict how efficiently data centers consume electricity (Google Data Center)
- Input: total server IT load, total number of condenser water pumps running, mean heat exchange approach temperature, outdoor wind speed, ...
ML for Touching New Horizons
- Identifying stars, supernovae, clusters, galaxies, quasars, exoplanets, etc.
- AlphaGo/AlphaZero series: surpassing the best human players
- AlphaFold: 98.5% of protein structures predicted

New Breakthrough: Generative Foundation Models
- Generative Pre-trained Transformer (GPT), generative image modeling, etc.
- "Large Models + Generative Modeling" makes a big difference!

New Breakthrough: Generative Foundation Models

An AI That We Expect
The State of ML & AI: We are Really, Really Far
- CS231n: Convolutional Neural Networks for Visual Recognition
- http://karpathy.github.io/2012/10/22/state-of-computer-vision/

Machine Learning Pipelines and Algorithms

General Pipeline
- Training: training samples are fed to learning algorithms, which produce a system
- Testing: input samples are fed to the learned system
Basic Concepts of Data
- Observations/samples are drawn i.i.d.
- Labels (supervised): e.g., the digit classes 0-9
- Training set and testing set
- Example: the MNIST dataset, Yann LeCun (杨立昆)
  http://yann.lecun.com/exdb/mnist/

Features/Attributes/Representation
- Raw data signals may be too difficult to learn from directly
  - Image pixels, audio waveforms, etc.
- Features, also called attributes or representations, are extracted to describe the samples conceptually
- Different types of fish differ in length, lightness, width, number of fins, shape of fins, position of mouth, ...
  f = [length, lightness, width, ...] (see the code sketch below)
  [Figure: fish samples plotted by width versus length]

Common Learning Algorithms
- Supervised Learning
  - Discrete outputs: Classification
  - Continuous outputs: Regression
- Unsupervised Learning
  - Discrete: Clustering
  - Continuous: Dimension Reduction
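
As a small illustration (not from the slides), the sketch below builds the feature vector f = [length, lightness, width, ...] for a couple of hypothetical fish samples; the measurement values and attribute names are made up for the example.

```python
# Minimal sketch: turning raw samples into feature vectors, assuming
# hypothetical fish measurements (values are illustrative, not real data).
import numpy as np

fish = [
    {"length": 23.0, "lightness": 0.4, "width": 7.1},   # one fish sample
    {"length": 31.5, "lightness": 0.9, "width": 9.8},   # another sample
]

attributes = ["length", "lightness", "width"]
X = np.array([[sample[a] for a in attributes] for sample in fish])
print(X.shape)   # (2, 3): two samples, each described by three features
```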
Supervised Learning
- The learner is provided with a set of inputs together with the corresponding desired outputs
- Has a teacher!
  - Teaching kids to recognize different animals
  - Graded examinations with corrected answers provided

Unsupervised Learning
- Training examples are given as input patterns, with no associated outputs
- No teacher!
  - A similarity measure exists to detect groups/clusters

How are They Different?
- Supervised Learning
  - Predictive model
  - Classification (discrete labels)
  - Regression (continuous values)
  - Decision boundaries over features of sample points
- Unsupervised Learning
  - Clustering (finite groups)
  - Probability distribution estimation
  - Finding associations in features
  - Dimension reduction

Pipeline of Supervised Learning (rewriting the general pipeline; a sketch follows below)
- Training phase: raw data (texts, images, videos, audios, ...) -> feature vectors + labels -> machine learning algorithm
- Testing phase: new sample (text, image, video, audio, ...) -> feature vector -> predictive model -> expected label/value
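
To make the two phases concrete, here is a minimal sketch of the supervised pipeline using scikit-learn on a toy spam task; the tiny dataset and the choice of TF-IDF features plus logistic regression are illustrative assumptions, not part of the lecture.

```python
# Training phase: raw texts -> feature vectors + labels -> learning algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["cheap pills buy now", "meeting at noon", "win money fast", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                       # 1 = spam, 0 = not spam (toy labels)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts)   # feature vectors
model = LogisticRegression().fit(X_train, labels)

# Testing phase: new sample -> feature vector -> predictive model -> expected label.
X_new = vectorizer.transform(["buy cheap money now"])
print(model.predict(X_new))                 # likely [1] given the overlapping spam words
```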
Pipeline of Unsupervised Learning (rewriting the general pipeline; a clustering sketch follows after the next slide)
- Training phase: raw data (texts, images, videos, audios, ...) -> feature vectors, with no labels -> machine learning algorithm
- Testing phase: new sample (text, image, video, audio, ...) -> feature vector -> model -> likelihood, cluster id, or a better representation

More Learning Algorithms
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
  - Partially labelled and unlabeled samples
  - Find decision boundaries using the complete set of training samples
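
As mentioned above, here is a minimal sketch of the unsupervised pipeline: unlabeled feature vectors go into a clustering algorithm, and a new sample is mapped to a cluster id. The synthetic 2-D blobs and the choice of k-means with k = 2 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Training phase: unlabeled feature vectors drawn from two synthetic blobs.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Testing phase: new feature vector -> model -> cluster id.
print(model.predict([[2.9, 3.1]]))   # cluster id assigned to the new sample
```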

More Learning Algorithms
- Reinforcement Learning
  - Training examples as input-output pairs
  - Trying to increase the reinforcement it receives
  - Like graded examinations with only overall scores but no correct answers

More Learning Algorithms
- Multi-label Learning: tags, text categorization, gene functions, ...
- Multi-instance Learning: content-based image retrieval, ...
- Multi-task Learning: ask a set of related subtasks for help
- Transfer Learning: transferring what is learned in one context to another context
- Federated Learning: privacy-preserving, distributed machine learning
- Lifelong Learning: learning without forgetting previous knowledge
- In-context Learning: in the scenario of large models
- ...
Machine Learning: Training, Evaluations and Goals

Training and Testing
- The full dataset is split into a training set (observed; used to train learning models) and a testing set (UNSEEN in the training phase; used to evaluate the learned models)
- Data are drawn from the same distribution
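
A minimal sketch of this split, assuming synthetic data and an arbitrary 80/20 ratio (neither comes from the lecture):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (100, 1))                            # full dataset: inputs
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, 0.1, 100)    # targets

# The training set is observed by the learner; the testing set stays unseen.
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)                             # (80, 1) (20, 1)
```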

Example: Polynomial Curve Fitting
- Supervised learning, continuous regression
- Ground-truth curve: t = sin(2πx); the samples are noisy observations of it
- Polynomial estimation: y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j, where the order M is user-defined
- Sum-of-squares error as the training objective:
  E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2
- Optimize it!
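
A minimal NumPy sketch of this example, with the sample size N, the order M, and the noise level chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)   # noisy samples of sin(2*pi*x)

w = np.polyfit(x, t, deg=M)        # least-squares fit of the degree-M polynomial y(x, w)
y = np.polyval(w, x)               # predictions on the training inputs
E = 0.5 * np.sum((y - t) ** 2)     # E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
print(E)
```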
Example: Polynomial Curve Fitting
- Evaluation metric: the root-mean-square error
  E_RMS(w) = sqrt( 2 E(w) / N )
- Larger dimensions of the model parameter give a smaller RMS error on the training samples
- How do we fairly evaluate the model performance?

Example: Polynomial Curve Fitting
- Evaluate the learned model on the testing set
- Over-fitting occurs when the dimension of the model parameter is too large
- No free lunch!! We need to make some assumptions or biases
- [Figure: training and test E_RMS versus the polynomial order M, from 0 to 9]
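
The over-fitting behaviour in the figure can be reproduced with a short sweep over M, comparing E_RMS on the training and test sets; the data sizes and noise level below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

def e_rms(w, x, t):
    e = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)   # E(w)
    return np.sqrt(2.0 * e / len(x))                # E_RMS = sqrt(2 E(w) / N)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)
for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)
    print(M, e_rms(w, x_train, t_train), e_rms(w, x_test, t_test))
# Training E_RMS keeps shrinking as M grows, while test E_RMS typically
# rises again once the polynomial starts fitting the noise.
```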

No Free Lunch Theorem
- "All models are wrong, but some models are useful."
- There is no single best model that works optimally for all kinds of problems
- A set of assumptions (also called an inductive bias) that works well in one domain may work poorly in another
- The best way to pick a suitable model is based on domain knowledge and/or trial and error

Evaluation Criteria
- Several factors affect the performance:
  - The type of training provided
  - The form and extent of any initial background knowledge
  - The type of feedback provided
  - The learning algorithms used
- Modeling & Optimization
Goals
- Supervised: minimize E_out, or maximize probabilistic terms
  - Error rate: error = (1/N) Σ_{n=1}^{N} I[ y_n ≠ g(x_n) ]
- Unsupervised: minimum quantization error, minimum distance, MAP, MLE

Goals
- Generalization bound: E_out(g) ≤ E_in(g) ± O( sqrt( (d_vc / N) ln N ) )
- What happens as d_vc increases?
- [Figure: E_in (training) and E_out (testing) versus model capacity, showing the under-fitting and over-fitting regimes]
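
For the supervised error rate above, a one-line NumPy version (the label arrays are illustrative):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])       # ground-truth labels y_n
y_pred = np.array([0, 1, 0, 0, 1])       # model outputs g(x_n)
error_rate = np.mean(y_true != y_pred)   # (1/N) * sum_n I[y_n != g(x_n)] = 0.2 here
print(error_rate)
```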

Components of Generalization Error
- Bias
  - How much does the average model, over all training sets, differ from the true model?
  - Error due to inaccurate assumptions/simplifications made by the model
- Variance
  - How much do models estimated from different training sets differ from each other?

Example: Mean-squared Error
- E(MSE) = noise^2 + bias^2 + variance
  - noise^2: unavoidable error
  - bias^2: error due to incorrect assumptions
  - variance: error due to the variance of the training samples
- See the following for explanations of bias-variance:
  - Z. Zhou, Machine Learning
  - C. M. Bishop, Pattern Recognition and Machine Learning
  - http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf
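
A minimal Monte Carlo sketch of this decomposition for polynomial fits to t = sin(2*pi*x) plus Gaussian noise; the degree M, training-set size, noise level, and number of trials are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sigma, trials = 3, 20, 0.2, 500
x_test = np.linspace(0.0, 1.0, 100)
f_true = np.sin(2 * np.pi * x_test)             # the true (noiseless) function

preds = np.empty((trials, x_test.size))
for i in range(trials):                         # many independent training sets
    x = rng.uniform(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, N)
    preds[i] = np.polyval(np.polyfit(x, t, deg=M), x_test)

bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # average model vs. true model
variance = np.mean(preds.var(axis=0))                 # spread across training sets
print(bias2, variance, sigma ** 2)   # expected test MSE ~ bias2 + variance + sigma^2
```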
Quiz
- y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j
- Given a hyperparameter M, if the model is trained on different training sets, how will the predicted curves vary?

Under-fitting vs. Over-fitting
- Under-fitting
  - The model is too "simple" to represent all the relevant class characteristics
  - High bias and low variance
  - High training error and high testing error
- Over-fitting
  - The model is too "complex" and fits irrelevant characteristics (noise) in the data
  - Low bias and high variance
  - Low training error and high testing error

Regularization
- Consider prior knowledge about the model and the task
  - Examine the values of w when over-fitting occurs
  - Small magnitude? Sparsity? The numerical stability of solving linear systems?

Regularization
- Punish coefficients with large values
  Ẽ(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2 + (λ/2) ‖w‖^2
Regularization
- Punish coefficients with large values (see the sketch below)
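
A minimal sketch of the penalized objective above: ridge regression on polynomial features, solved in closed form from the normal equations w = (Φ^T Φ + λI)^(-1) Φ^T t. The values of N, M, and λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 10, 9, 1e-3
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

Phi = np.vander(x, M + 1, increasing=True)     # design matrix with columns x^0 ... x^M
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)   # penalized least squares
print(np.abs(w).max())   # increasing lam suppresses large-magnitude coefficients
```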

Regularization
- Increase the number of training samples: the bound E_out(g) ≤ E_in(g) ± O( sqrt( (d_vc / N) ln N ) ) tightens as N grows
  Ẽ(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2 + (λ/2) ‖w‖^2
  - Data augmentation
- Maximum Likelihood

Cross Validation
- If the training set is small, how do we choose reasonable hyperparameters?
- Split the dataset into training, testing & validation sets
Cross Validation
- Split the data into folds, try each fold as the validation set, and average the results (a sketch follows below)
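
A minimal sketch of k-fold cross validation for picking a hyperparameter (here the polynomial order M); the candidate orders, k = 5, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 30)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for M in (1, 3, 9):
    scores = []
    for train_idx, val_idx in kf.split(x):            # each fold serves once as validation
        w = np.polyfit(x[train_idx], t[train_idx], deg=M)
        resid = np.polyval(w, x[val_idx]) - t[val_idx]
        scores.append(np.sqrt(np.mean(resid ** 2)))   # validation RMS error
    print(M, np.mean(scores))                         # pick the M with the lowest average error
```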

Questions?
