1 Intro
Machine Learning
Lu Sheng (盛律), School of Software
Fall/Winter Semester 2024
TA Introduction

Responsible TAs:
- 黄泽桓 ([email protected])
- 文皓 ([email protected])
- 樊红兴 ([email protected])
- 王立芃 ([email protected])
- 陈泽人 ([email protected])

For any questions about the course, feel free to contact the TAs.

Course Description

- A newly established core course for professional degrees (48 credit hours), jointly offered by more than ten schools
- Schools offering the course this semester: Computer Science, Instrumentation & Optoelectronic Engineering, Software, Artificial Intelligence, and Cyber Science and Technology
- Teaching content and assessment may differ slightly across schools, and grading standards may also vary
- Students from the Computer Science, Instrumentation & Optoelectronic Engineering, Artificial Intelligence, and Cyber Science and Technology schools are strongly advised to contact their own school's academic affairs office and re-enroll in the Machine Learning course offered by their school
Course Communication and Resources

- Course WeChat group (QR code on the slide): after joining, please change your group nickname to "StudentID-Name"
- Course resources: Zhixue Beihang (智学北航)

Course Objectives

- Master the fundamental theory and current progress of machine learning
- Be able to apply machine learning to solve practical problems
- Build a solid foundation for related scientific research and engineering practice
Reference Books

- Pattern Recognition and Machine Learning, Christopher M. Bishop
- 统计学习方法 (Statistical Learning Methods), 李航 (Hang Li)
- Machine Learning, Tom M. Mitchell
- 机器学习 (Machine Learning), 周志华 (Zhi-Hua Zhou)

Course Outline (lecture hours in parentheses)

1. Overview of Machine Learning (3)
2. Fundamentals of Machine Learning (3)
3. Linear Models (3)
4. Regularization and Sparse Learning (2)
5. Support Vector Machines and Kernel Methods (3)
6. Neural Networks (3)
7. Deep Neural Networks (6)
8. Clustering (3)
9. Dimensionality Reduction (3)
10. Association Rule Learning (2)
11. Probabilistic Graphical Models (3)
12. Sampling Methods (2)
13. Decision Trees (2)
14. Ensemble Learning (2)
15. Semi-supervised Learning (2)
16. Reinforcement Learning (3)
17. Application Case Studies (3)

- Prerequisites: Engineering Mathematical Analysis, Advanced Algebra, Probability and Statistics, etc.
Assessment

- Coursework grade (65%)
  - 4 assignments (10% + 15% + 20% + 20%)
  - Theoretical calculation/derivation + code implementation
- Final exam grade (35%)
  - One short paper, completed individually
  - Topic: chosen by yourself; it should combine machine learning algorithms with your own research direction
  - Format: CVPR template, 6 pages of double-column main text (excluding references)
  - Defense for an "excellent" grade: a 5-minute PPT video with voice narration
  - Submit the topic and a brief introduction at midterm; submit the full paper at the end of the term
- No make-up exam!!!

What is Machine Learning?
"Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task (or tasks drawn from a population of similar tasks) more effectively the next time."

"Machine Learning denotes automatic changes in the AI system that are adaptive in the sense that they enable the system to do the same task (or tasks drawn from a population of similar tasks) more effectively the next time."

-- Machine Learning: An Artificial Intelligence Approach

Why Machine Learning?
- Why?
  - Optimal approach to decision making under uncertainty
  - Probabilistic modeling is the language used by most other areas of science and engineering, and thus provides a unifying framework between these fields
Machine Learning is Everywhere

ML for Daily Life: Optical Character Recognition (OCR)

- Mail digit recognition, AT&T Labs
- http://www.research.att.com/~yann/
ML for Daily Life: Speech Recognition & Translation

- https://www.microsoft.com/en-us/research/video/speech-recognition-breakthrough-for-the-spoken-translated-word/?from=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttp%2Fresearch.microsoft.com%2Fapps%2Fvideo%2F%3Fid%3D175450

ML for Social Media, Industry & E-Commerce

- Predict stock prices, improve search, reduce spam, improve advertiser return on investment, ...
- Machine learning typically generates 10+% improvements
ML for Touching New Horizons

- Identifying stars, supernovae, clusters, galaxies, quasars, exoplanets, etc.

New Breakthroughs

- AlphaGo/AlphaZero series: surpassed the best human players
- AlphaFold: structures predicted for 98.5% of proteins
- Generative foundation models
The State of ML & AI: We are Really, Really Far

- http://karpathy.github.io/2012/10/22/state-of-computer-vision/
- See also CS231n: Convolutional Neural Networks for Visual Recognition
Machine Learning Pipelines and Algorithms

General Pipeline

[Diagram: training phase: samples → learning algorithms → system; testing phase: input → system → output]
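A minimal sketch of this pipeline in Python (not from the slides: the data is synthetic, and the nearest-centroid "learning algorithm" is only an illustrative stand-in):

```python
import numpy as np

# --- Training phase: samples + learning algorithm -> learned system ---
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)),    # class 0 samples
                     rng.normal(3, 1, (50, 2))])   # class 1 samples
y_train = np.array([0] * 50 + [1] * 50)

# "Learning algorithm": estimate one centroid per class
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

# --- Testing phase: new input -> learned system -> prediction ---
x_new = np.array([2.5, 2.8])
prediction = np.argmin(np.linalg.norm(centroids - x_new, axis=1))
print("predicted class:", prediction)
```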
Basic Concepts of Data

- Observations/samples, assumed i.i.d. (independent and identically distributed)
- Labels (supervised setting)
- Training set and testing set

[Figure: sample digits 0-9 from the MNIST dataset, Yann LeCun (杨立昆), http://yann.lecun.com/exdb/mnist/]
Basic Concepts of Data

- Raw data: image pixels, audio waveforms, etc.
- Features, also called attributes or representations, are extracted to describe the samples conceptually
  - e.g., length, width, lightness, position of mouth, ...
  - f = [length, lightness, width, ...] (see the toy sketch below)

Taxonomy of Learning Problems

- Supervised, discrete output: Classification
- Supervised, continuous output: Regression
- Unsupervised, discrete structure: Clustering
- Unsupervised, continuous structure: Dimensionality Reduction
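A toy illustration of assembling such a feature vector (the field names and values are hypothetical; a real system would measure them from the raw data):

```python
import numpy as np

def extract_features(sample):
    # Hypothetical measurements taken from a raw sample
    # (e.g., a segmented object in an image).
    return np.array([sample["length"],     # overall size
                     sample["lightness"],  # average pixel intensity
                     sample["width"]])     # extent along the second axis

f = extract_features({"length": 14.2, "lightness": 0.63, "width": 5.1})
print(f)  # a feature vector describing the sample conceptually
```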
Supervised Learning

- The learner is provided with a set of inputs together with the corresponding desired outputs
- Has a teacher!
  - Teaching kids to recognize different animals
  - Graded examinations with corrected answers provided

Unsupervised Learning

- Training examples as input patterns, with no associated outputs
- No teacher!
  - A similarity measure exists to detect groups/clusters
Testing Phase

[Diagram: new sample (text, image, video, audio, ...) → features of sample points → feature vector → predictive model → expected label/value]
Pipeline of Unsupervised Learning

- Rewrite the general pipeline: texts, images, videos, audios, ... → feature vectors → machine learning algorithm, with no labels involved (see the sketch after this list)

More Learning Algorithms

- Supervised Learning
- Semi-supervised Learning
  - Partially labelled and unlabelled samples
  - Find decision boundaries for the complete set of training samples
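A minimal sketch of the label-free pipeline, with a tiny k-means loop standing in for the learning algorithm (synthetic feature vectors; k and the iteration count are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
# Feature vectors extracted from texts/images/videos/audios would go here;
# we use two synthetic blobs instead.
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
for _ in range(10):                                     # fixed iterations
    # Assign each sample to its nearest center (no labels involved)
    assign = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    # Move each center to the mean of its assigned samples
    centers = np.stack([X[assign == c].mean(axis=0) if np.any(assign == c)
                        else centers[c]   # keep a center if its cluster empties
                        for c in range(k)])

print(centers)  # discovered group structure
```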
Training and Testing

- Training and testing data are drawn from the same distribution
- The full dataset is split into a training set and a testing set
- Evaluate learned models on the testing set (UNSEEN in the training phase)
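One common way to realize this split (the 80/20 ratio is an arbitrary choice; shuffling relies on the i.i.d. assumption above):

```python
import numpy as np

rng = np.random.default_rng(42)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)  # toy data

perm = rng.permutation(len(X))          # shuffle once, then split
n_train = int(0.8 * len(X))
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]   # used to fit the model
X_test, y_test = X[test_idx], y[test_idx]       # UNSEEN until evaluation
```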
Goals

[Figure: root-mean-square error $E_{\mathrm{RMS}}$ on the training and testing sets versus polynomial order $M$, from 0 to 9]

- A smaller RMS error on the training samples does not by itself mean a better model
- No free lunch! We need to make some assumptions or biases
- How can we fairly evaluate the model performance?
- Supervised learning: the error rate (computed in the sketch after this section),
  $\mathrm{error} = \frac{1}{N}\sum_{n=1}^{N} I[\,y_n \ne g(x_n)\,]$
- Criteria: minimum quantization error, minimum distance, MAP, MLE

Under-fitting vs. Over-fitting

- [Figure: training error $E_{\mathrm{in}}$ and testing error $E_{\mathrm{out}}$ versus model complexity: the training curve keeps decreasing while the testing curve first decreases and then increases]
- Generalization bound: $E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + O\!\left(\sqrt{\frac{d_{\mathrm{vc}}}{N}\ln N}\right)$
- What happens as $N$ increases?
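The error-rate formula above translates directly into code (the predictions here are placeholders for the outputs $g(x_n)$ of some trained model):

```python
import numpy as np

def error_rate(y_true, y_pred):
    # (1/N) * sum of I[y_n != g(x_n)]: the fraction of misclassified samples
    return np.mean(y_true != y_pred)

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])   # e.g., outputs g(x_n) of some model
print(error_rate(y_true, y_pred))    # 0.4
```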
Bias and Variance

- Bias: how much does the average model, taken over all training sets, differ from the true model?
  - Error due to inaccurate assumptions/simplifications made by the model
- Total error = error due to incorrect assumptions (bias) + error due to the variance of training samples (variance) + unavoidable error (noise); see the simulation sketch below
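One way to see the first two error sources numerically: refit the same model class on many resampled training sets and compare the average fit with the true function. A small simulation sketch (the sine target, noise level, and degrees are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin                      # the "true model"
x_eval = np.linspace(0, np.pi, 50)   # points at which we compare fits

def fit_once(degree):
    x = rng.uniform(0, np.pi, 20)             # a fresh training set
    t = true_f(x) + rng.normal(0, 0.2, 20)    # noisy targets
    coef = np.polyfit(x, t, degree)
    return np.polyval(coef, x_eval)

for degree in (1, 9):
    fits = np.stack([fit_once(degree) for _ in range(200)])
    bias2 = np.mean((fits.mean(axis=0) - true_f(x_eval)) ** 2)  # avg model vs truth
    var = np.mean(fits.var(axis=0))                             # spread across sets
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={var:.4f}")
```

Typically the low-degree fit shows higher bias and lower variance, and the high-degree fit the opposite.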
Quiz

$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j$

- Given a fixed hyperparameter $M$, if the model is trained on different training sets, how will the predicted curves vary?

Under-fitting vs. Over-fitting

- Under-fitting
  - The model is too "simple" to represent all the relevant class characteristics
  - High bias and low variance
  - High training error and high testing error
- Over-fitting
  - The model is too "complex" and fits irrelevant characteristics (noise) in the data
  - Low bias and high variance
  - Low training error and high testing error

(A quick numerical check of both regimes follows this list.)
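A sketch fitting polynomials of different order $M$ to noisy sine data (the data-generating setup is assumed, mirroring the running example):

```python
import numpy as np

rng = np.random.default_rng(3)
def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_tr, t_tr = make_data(10)    # small training set
x_te, t_te = make_data(100)   # held-out testing set

def rms(coef, x, t):
    return np.sqrt(np.mean((np.polyval(coef, x) - t) ** 2))

for M in (0, 3, 9):
    coef = np.polyfit(x_tr, t_tr, M)
    print(f"M={M}: train RMS={rms(coef, x_tr, t_tr):.3f}, "
          f"test RMS={rms(coef, x_te, t_te):.3f}")
```

Typically $M=0$ under-fits (high error on both sets), while $M=9$ drives the training RMS to nearly zero but inflates the testing RMS.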
Regularization

- Consider the prior knowledge about the model and the task
- Punish coefficients with large values (see the sketch below):

$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$
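Minimizing $\tilde{E}(\mathbf{w})$ for a model that is linear in its features has a closed-form solution; a sketch using a polynomial design matrix (the data and the value of $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

M, lam = 9, 1e-3
Phi = np.vander(x, M + 1)   # design matrix with columns [x^M, ..., x^0]

# Setting the gradient of E~(w) to zero gives the ridge normal equations:
# (Phi^T Phi + lam * I) w = Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
print(np.abs(w).max())   # large-magnitude coefficients are suppressed
```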
Regularization

- Increase the number of training samples: the gap between training and testing error in
  $E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + O\!\left(\sqrt{\frac{d_{\mathrm{vc}}}{N}\ln N}\right)$
  shrinks as $N$ grows
- Regularized error, for reference:
  $\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$

Maximum Likelihood

Cross Validation

- If the training set is small, how do we choose reasonable hyperparameters? (see the sketch below)
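A minimal K-fold cross-validation loop for choosing $\lambda$ in the ridge model above ($K = 5$ and the candidate grid are arbitrary; `fit_ridge` repeats the closed-form solver from the previous sketch):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)
M, K = 9, 5

def fit_ridge(x, t, lam):
    Phi = np.vander(x, M + 1)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

folds = np.array_split(rng.permutation(len(x)), K)
for lam in (0.0, 1e-6, 1e-3, 1.0):
    errs = []
    for k in range(K):
        val = folds[k]                                    # held-out fold
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit_ridge(x[trn], t[trn], lam)
        pred = np.polyval(w, x[val])
        errs.append(np.mean((pred - t[val]) ** 2))
    print(f"lambda={lam:g}: CV error={np.mean(errs):.3f}")
```

Each candidate $\lambda$ is scored on data its model never saw; the one with the lowest average validation error would be picked.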
Questions?