Level 2 Quantitative Methods Course Slides
Level 2 -- 2019
Instructor: Feng
1
Brief Introduction
Topic weights:
Study Session 1-2 Ethics & Professional Standards 10 - 15%
Study Session 3 Quantitative Methods 5 - 10%
Study Session 4 Economics 5 - 10%
Study Session 5-6 Financial Reporting and Analysis 10 - 15%
Study Session 7-8 Corporate Finance 5 - 10%
Study Session 9-11 Equity Investment 10 - 15%
Study Session 12-13 Fixed Income 10 - 15%
Study Session 14 Derivatives 5 - 15%
Study Session 15 Alternative Investments 5 - 10%
Study Session 16-17 Portfolio Management 5 - 10%
Weights: 100%
2
Brief Introduction
Contents:
➢ Study session 3: Quantitative Methods for Valuation
✓ Reading 6: Fintech in Investment Management (★)
✓ Reading 7: Correlation and Regression (★★)
✓ Reading 8: Multiple Regression and Issues in Regression
Analysis (★★★)
(Note: annotated alternative reading titles: Reading 4 Introduction to Linear Regression; Reading 5 Multiple Regression; Reading 7 Machine Learning.)
4
Brief Introduction
Level I vs. Level II:
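➢ Level I mainly covers descriptive statistics and the estimation and
testing parts of inferential statistics. Level II mainly covers
regression, which is the prediction part of inferential statistics.
➢ Level II draws heavily on Hypothesis Testing from Level I, so it is
worth reviewing that material before starting the Level II
content.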
5
Brief Introduction
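Course characteristics and study advice:
➢ The content is quantitative, but it is tested conceptually;
➢ The main syllabus topics have seen essentially no major changes;
➢ The course builds strongly on itself, so understand each point
before moving on;
➢ Combine lectures with practice questions, but mechanical question-drilling is not recommended;
➢ Most importantly, follow the lectures carefully and attentively.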
6
Happiness is having someone to love, something to do,
something to believe in, and something to look forward to!
7
Fintech in Investment Management
Basics of Fintech
Tasks:
➢ Describe “fintech”;
➢ Describe big data, artificial intelligence, and machine
learning.
8
Basics of Fintech
Definition of Fintech
➢ Fintech refers to technological innovation in the design and
delivery of financial services and products.
✓Drivers underlying fintech development include extremely
rapid growth in data and technological advances that
enable the capture and extraction of information from
them.
9
Basics of Fintech
Areas of Fintech development
➢ Analysis of large datasets.
➢ Analytical tools.
✓E.g., artificial intelligence (AI), machine learning (ML).
➢ Automated trading.
➢ Automated advice.
✓E.g., Robo-advisers.
➢ Financial record keeping.
✓E.g., Distributed ledger technology (DLT).
10
Basics of Fintech
Big data
➢ Big data typically refers to datasets having the following
characteristics:
✓Volume: many millions, or even billions, of data points.
✓Velocity: real-time or near-real-time data have become the
norm in many areas.
✓Variety: include structured data (e.g. SQL tables or CSV
files), semi-structured data (e.g. HTML code), and
unstructured data (e.g. video messages).
11
Basics of Fintech
Big data (Cont.)
➢ Traditional data: stock exchanges, financial statements,
economic indicators.
➢ Non-traditional data: electronic devices, social media,
sensor networks.
✓ Obtain data from smart phones, cameras, microphones,
radio-frequency identification (RFID) readers, wireless
sensors, and satellites.
✓ Also named alternative data.
12
Basics of Fintech
Artificial intelligence (AI)
➢ Artificial intelligence technology enables the development of
computer systems that exhibit cognitive and decision-making
ability comparable or superior to that of human
beings.
➢ By the late 1990s, AI had been deployed in logistics, data
mining, financial analysis, medical diagnosis, and other
areas.
13
Basics of Fintech
Machine learning (ML)
➢ ML algorithms are computer programs that are able to
“learn” how to complete tasks, improving their performance
over time with experience.
➢ ML involves splitting the dataset into a training dataset and
a validation dataset.
✓The training dataset allows the algorithm to identify relationships
between inputs and outputs based on historical data.
✓The validation dataset is used to test those relationships.
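The split described above can be sketched in a few lines. This is only an illustration: the function name `train_validation_split`, the 70/30 proportion, and the fixed seed are assumptions, not something the slides prescribe.

```python
import random

def train_validation_split(rows, validation_share=0.3, seed=42):
    """Shuffle rows and return (training, validation) lists.

    validation_share and seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_share))
    # The first n_validation shuffled rows are held out for validation.
    return shuffled[n_validation:], shuffled[:n_validation]

observations = list(range(100))  # hypothetical data points
training, validation = train_validation_split(observations)
print(len(training), len(validation))
```

The model would be fitted on `training` only, and its relationships checked against `validation`, which the model has not seen.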
14
Basics of Fintech
Machine learning (Cont.)
➢ ML still requires human judgement in understanding data
and selecting the appropriate techniques for data analysis.
✓Before they can be used, the data must be clean and free
of biases and spurious data.
➢ Errors from overfitting may lead to prediction errors and
incorrect output forecasts.
✓Overfitting occurs when the ML model learns the input
and target dataset too precisely, and treats noise in the
data as true parameters.
15
Basics of Fintech
Types of machine learning
➢ Supervised learning: computers learn to model
relationships based on labeled training data.
✓Inputs and outputs are labeled, or identified, for the
algorithm.
(Note: if the data have labels, the learning is supervised; if not, it is unsupervised, where the unlabeled data are then clustered, and so on.)
16
Basics of Fintech
Types of machine learning (Cont.)
➢ Deep learning: computers use neural networks, often with
many hidden layers, to perform multistage, non-linear data
processing to identify patterns.
✓Deep learning may use supervised or unsupervised
machine learning approaches.
17
Summary
➢ Importance: ☆
➢ Content:
✓ Definition of Fintech;
✓ Definitions of big data, AI, ML;
✓ Types of ML.
➢ Exam tips:
✓ Not a key exam topic.
18
Fintech in Investment Management
Fintech Application
Tasks:
➢ Describe fintech applications to investment
management;
➢ Describe financial applications of distributed ledger
technology.
19
Fintech Application
Fintech application
➢ Text analytics and natural language processing
➢ Robo-advisory services
➢ Risk analysis
➢ Algorithmic trading
➢ Distributed ledger technology (DLT)
20
Fintech Application
Text analytics and natural language processing (NLP)
➢ Text analytics involves the use of computer programs to
analyze and derive meaning typically from large,
unstructured text- or voice-based datasets.
✓May be used to help identify indicators of future
performance, such as consumer sentiment.
➢ Natural language processing focuses on developing (note: speech recognition, etc.)
22
Fintech Application
Robo-advisory services (Cont.)
➢ Most Robo-advisers follow a passive investment approach,
and two types of wealth management services dominate
the robo-advice sector:
✓Fully automated digital wealth managers
✓Adviser-assisted digital wealth manager
23
Fintech Application
Risk analysis
➢ Big data may provide insights into real-time and changing
market circumstances to help identify weakening market
conditions and adverse trends in advance.
➢ Machine learning can help validate data quality by identifying
questionable data, potential errors, and data outliers before
integration with traditional data for use in risk models and
in risk management applications.
➢ Advanced AI-based techniques can be used for scenario
analysis and back-testing simulation which are often
computationally intense.
24
Fintech Application
Algorithmic trading (note: speed is the key feature)
28
Fintech Application
Distributed ledger technology (Cont.)
➢ Permissionless networks: open to any user who wishes to
make a transaction, and all users within the network can see
all transactions that exist on the blockchain.
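(Note: anyone can participate.)
✓Network participants can perform all network functions. (Note: by contrast, permissioned networks give participants different levels of permission.)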
31
Fintech Application
Applications of DLT to investment management (Cont.)
➢ Post-trading clearing and settlement
✓DLT provides near-real-time trade verification,
reconciliation, and settlement, thereby reducing the
complexity, time, and costs associated with post-trade
processes.
32
Fintech Application
Applications of DLT to investment management (Cont.)
➢ Compliance
✓Allow regulators and firms to maintain near-real-time
review over transactions and other compliance-related
processes.
✓Could help uncover fraudulent activity and reduce
compliance costs associated with know-your-customer
and anti-money-laundering regulations, which entail
verifying the identity of clients and business partners.
(Note: records on the ledger are hard to falsify.)
33
Summary
➢ Importance: ☆
➢ Content:
✓ Fintech application: text analytics and natural language
processing, robo-advisory services, risk analysis,
algorithmic trading;
✓ DLT application: cryptocurrency, tokenization, post-
trading clearing and settlement, compliance.
➢ Exam tips:
✓ Not a key exam topic.
34
Correlation and Regression
Correlation Analysis
Tasks:
➢ Calculate and interpret a sample covariance and a
sample correlation coefficient;
➢ Formulate a hypothesis test of population correlation
coefficient;
(Note: in mathematics, distinguish a correlation relationship from a functional relationship.)
➢ Describe limitations to correlation analysis.
35
Correlation Analysis
Scatter plots
➢ A graph that shows the relationship between the
observations for two data series in two dimensions.
[Figure: scatter plot with points labeled South Korea, Australia, U.K., U.S., Switzerland, and Japan]
(Note: for quantitative analysis, use the covariance.)
36
Correlation Analysis
Sample covariance
➢ A statistical measure of the degree to which two
variables move together; it captures the linear
relationship between two variables.
$$\mathrm{Cov}(X,Y)=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$
➢ Range of Cov(X,Y): -∞ < Cov(X,Y) < +∞.
✓ Cov(X,Y) > 0: the two variables tend to move together;
✓ Cov(X,Y) < 0: the two variables tend to move in
opposite directions.
37
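As a quick check of the sample covariance formula (sum of cross-deviations divided by n-1), here is a minimal sketch. The function name and the five data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_deviations = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross_deviations / (n - 1)

# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
cov_xy = sample_covariance(x, y)
print(cov_xy)  # positive: the two series tend to move together
```

Here the cross-deviations sum to 6 and n-1 = 4, so the covariance is 1.5.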
Correlation Analysis
Sample correlation coefficient (note: standardizes the covariance)
38
Correlation Analysis
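Sample correlation coefficient (Cont.)
(Note: r is not the slope! r = +1 only means the points lie on an upward-sloping line, and r = -1 on a downward-sloping line.)
[Figure panels: r = +1 (perfect positive linear correlation); r = -1 (perfect negative linear correlation)]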
39
Correlation Analysis
Sample correlation coefficient (Cont.)
[Figure panels: 0 < r < 1 (positive linear correlation); -1 < r < 0 (negative linear correlation)]
40
Correlation Analysis
Sample correlation coefficient (Cont.)
[Figure: r = 0 (no linear correlation)]
(Note: two possibilities: there is no relationship at all, or there is no linear relationship.)
41
(Note: how large must r be for the linear relationship to count as strong? A hypothesis test is needed.)
Correlation Analysis
43
Correlation Analysis
Answer:
➢ Step 1: H0: µ = 0.01 and Ha: µ ≠ 0.01.
➢ Step 2: with known population variance (standard
deviation), use a two-tailed z-test. (Note: the null hypothesis is an equality, not > or <, so the test is two-tailed.)
➢ Step 3: The critical z-values for 5% significance level (95%
confidence interval) are +/- 1.96.
➢ Step 4: decision rule: if the z-statistic is outside the range
of critical values (-1.96 to +1.96), reject H0.
44
Correlation Analysis
Answer:
➢ Step 5: calculate the test statistic. (Note: test statistic = (sample value - hypothesized value) / standard error, where standard error = standard deviation / √n.)
$$z=\frac{0.015-0.01}{0.014/\sqrt{45}}=2.396$$
➢ Step 6: reject H0 (mean return = 1%), because the z-
statistic (2.396) is outside the range of critical values
(-1.96 to +1.96).
[Figure: standard normal distribution with the central 95% region between -1.96 and +1.96; z-stat = 2.396 falls in the right-tail "Reject H0" region]
45
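The z-test in the answer above can be reproduced as a short sketch. The inputs (sample mean 0.015, hypothesized mean 0.01, population standard deviation 0.014, n = 45) are read off the worked formula; the function and variable names are illustrative.

```python
import math

def z_statistic(sample_mean, hypothesized_mean, population_std, n):
    """(sample value - hypothesized value) / standard error, with SE = std / sqrt(n)."""
    standard_error = population_std / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

z = z_statistic(sample_mean=0.015, hypothesized_mean=0.01,
                population_std=0.014, n=45)
critical_value = 1.96               # two-tailed, 5% significance level
reject_h0 = abs(z) > critical_value
print(round(z, 3), reject_h0)
```

Because the test is two-tailed, the decision compares the absolute value of z with the critical value.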
Correlation Analysis
Hypothesis testing of correlation
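➢ Test whether the correlation coefficient between two variables is
equal to zero.
✓ H0: ρ=0, Ha: ρ≠0;
✓ t-test (note: degrees of freedom are usually n-1, but here df = n-2; the denominator is the standard error of the correlation coefficient):
$$t=\frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}},\quad df=n-2$$
✓ Two-tailed test;
✓ Decision rule: reject H0 if t > + tcritical , or t < - tcritical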
46
Correlation Analysis
Example:
An analyst wants to test the correlation between variable X
and variable Y. The sample size is 20, and the analyst finds that the
covariance between X and Y is 16. The standard deviation of
X is 4 and the standard deviation of Y is 8. At the 5%
significance level, test the significance of the correlation
coefficient between X and Y.
47
Correlation Analysis
Answer:
➢ H0: ρ=0, Ha: ρ≠0;
➢ Sample correlation coefficient r = 16/(4×8) = 0.5;
➢ t-statistic (note: plug into the formula):
$$t=\frac{0.5\times\sqrt{20-2}}{\sqrt{1-0.25}}=2.45$$
➢ The critical value of two-tailed t-test with df=18 and
significance level of 5% is 2.101; (Note: look up the critical value in the t-table with df = 18.)
linear relationship.
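The example can be checked numerically with a minimal sketch. The inputs come from the example (covariance 16, standard deviations 4 and 8, n = 20, critical value 2.101); the function name is an assumption made for illustration.

```python
import math

def correlation_t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

covariance, std_x, std_y, n = 16.0, 4.0, 8.0, 20
r = covariance / (std_x * std_y)   # correlation = covariance / (std_x * std_y)
t = correlation_t_statistic(r, n)
t_critical = 2.101                 # two-tailed, 5% level, df = 18 (from the t-table)
reject_h0 = abs(t) > t_critical
print(round(r, 2), round(t, 2), reject_h0)
```

Since the t-statistic exceeds the critical value, the null hypothesis of zero correlation is rejected at the 5% level.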
49
Correlation Analysis
Limitation to correlation analysis (Cont.)
➢ Spurious correlation: statistically significant correlation
exists when in fact there is no relation (no economic
explanation).
50
Correlation Analysis
Limitation to correlation analysis (Cont.)
➢ Nonlinear relationships: two variables can have a strong
nonlinear relation and still have a very low correlation.
51
Summary
➢ Importance: ☆☆
➢ Content:
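✓ Covariance and correlation coefficient; (Note: be able to calculate the covariance and the hypothesis test statistic, and judge whether a linear relationship exists.)
✓ Hypothesis testing of correlation coefficient;
✓ Limitation of correlation analysis.
➢ Exam tips:
✓ This part is the foundation for the later material; it is tested often
and in flexible forms.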
52
Correlation and Regression
55
Simple Linear Regression
Simple linear regression model
➢ $Y_i = b_0 + b_1 X_i + \varepsilon_i,\quad i = 1, \ldots, n$
where:
Yi = ith observation of the dependent variable, Y;
Xi = ith observation of the independent variable, X;
b0 = intercept;
b1 = slope coefficient;
εi = error term for the ith observation (also referred to as the
residual or disturbance term).
56
Simple Linear Regression
Assumptions of simple linear regression model
➢ The relationship between the dependent variable (Y) and
the independent variable (X) is linear;
➢ The independent variable (X) is not random;
➢ The expected value (mean) of the error term is 0: E(ε)=0;
✓ Calculation: $\hat{b}_1 = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X^{2}}$
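The slope calculation (covariance of X and Y divided by the variance of X) can be sketched as follows. The intercept formula b0 = mean(Y) - b1 × mean(X) is the standard least-squares result and is added here as an assumption, since it does not appear in the surviving slide text; the data are hypothetical.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1) for Y = b0 + b1*X using b1 = Cov(X, Y) / Var(X)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    b1 = cov_xy / var_x
    b0 = mean_y - b1 * mean_x  # assumed: standard OLS intercept
    return b0, b1

# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_simple_regression(x, y)
print(round(b0, 2), round(b1, 2))
```

A predicted value is then obtained by plugging a value of X into b0 + b1 × X.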
60
Simple Linear Regression
Predicted value of dependent variable (Cont.)
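➢ The confidence interval for a predicted value of
dependent variable is:
$$\hat{Y} \pm (t_c \times s_f), \quad \text{or} \quad \hat{Y} - (t_c \times s_f) \le Y \le \hat{Y} + (t_c \times s_f)$$
(Note: predicted value plus or minus the critical value times the standard error.)
where:
t_c: two-tailed critical t-value with df=n-2;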
61
Simple Linear Regression
Significance test for a regression coefficient
➢ H0: b1= hypothesized value; Ha: b1≠ hypothesized value;
✓ Typically, H0: b1= 0; Ha: b1≠ 0, which means to test
whether an independent variable explains the variation
in the dependent variable.
➢ Test statistic: $t = \dfrac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}$, df=n-2;
63
Simple Linear Regression
Confidence interval for a regression coefficient (note: point estimate vs. interval estimate)
where:
t c : two-tailed critical t-value with df=n-2;
64
Simple Linear Regression
Confidence interval for a regression coefficient (Cont.) (note: check whether the hypothesized value lies inside the confidence interval)
65
Practice
There are a number of assumptions for linear regression.
Which of the following is NOT an assumption?
A. The independent variables are not correlated with the
error term.
B. There is at least some correlation between the error
terms from one observation to the next. (Note: the residuals must not be serially correlated with one another.)
Answer: B
66
Summary
➢ Importance: ☆☆☆
➢ Content:
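✓ Underlying assumptions of linear regression;
✓ Prediction of dependent variable; (Note: plug x into the equation to compute it.)
✓ Interpretation of hypothesis testing for regression
coefficient.
➢ Exam tips:
✓ Frequently tested point 1: underlying assumptions;
✓ Frequently tested point 2: hypothesis testing of regression coefficients.
67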
Correlation and Regression
Tasks:
➢ Describe and interpret ANOVA;
➢ Calculate and interpret SEE, R2, and F-statistics;
➢ Describe limitations of regression analysis.
68
ANOVA Analysis (1)
Analysis of variance (ANOVA)
➢ A statistical procedure for dividing the total variability of
a variable into components that can be attributed to
different sources.
✓ Total variation = explained variation (explained by the regression) + unexplained
variation
• Total sum of squares (SST) = Regression sum of
squares (RSS) + Sum of squared errors (SSE)
69
ANOVA Analysis (1)
Analysis of variance (Cont.)
➢ A graphic explanation of the components of total
variation:
[Figure: components of total variation, measured relative to the mean of Y]
70
ANOVA Analysis (1)
Analysis of variance (Cont.)
➢ Total sum of squares (SST): measures the total variation
in the dependent variable.
$$SST=\sum_{i=1}^{n}(Y_i-\bar{Y})^{2}$$
➢ Regression sum of squares (RSS): measures the variation
in the dependent variable that is explained by the
independent variable.
$$RSS=\sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^{2}$$
(Note: $\hat{Y}_i$ is the point on the regression line.)
71
ANOVA Analysis (1)
Analysis of variance (Cont.)
➢ Sum of squared errors (SSE): measures the unexplained
variation in the dependent variable.
$$SSE=\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^{2}$$
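The decomposition SST = RSS + SSE can be verified numerically with a minimal sketch. The data and function name are hypothetical; the line is fitted by ordinary least squares, which is the setting in which the identity holds.

```python
def anova_decomposition(x, y):
    """Fit Y = b0 + b1*X by least squares and return (SST, RSS, SSE)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - mean_y) ** 2 for yi in y)                # total variation
    rss = sum((fi - mean_y) ** 2 for fi in fitted)           # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    return sst, rss, sse

# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
sst, rss, sse = anova_decomposition(x, y)
print(round(sst, 2), round(rss, 2), round(sse, 2))
```

For these points the three values are 6.0, 3.6, and 2.4, so the explained and unexplained parts add up to the total.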
72
ANOVA Analysis (1)
Analysis of variance (Cont.)
➢ ANOVA table
(Note: MS = SS / df; T stands for Total, R for Regression, E for Error.)
                          df     Sum of Squares (SS)   Mean Sum of Squares (MS)
Regression (explained)    1      RSS                   MSR=RSS/1
Error (unexplained)       n-2    SSE                   MSE=SSE/(n-2)
Total                     n-1    SST                   -
75
ANOVA Analysis (1)
F-statistic (note: mainly used in multiple regression)
77
ANOVA Analysis (1)
Limitations of regression analysis
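➢ Regression relations can change over time (parameter
instability). (Note: b itself may change; what held over the past two years may differ from the five years before.)
➢ In investment contexts, public knowledge of regression
relationships may negate their future usefulness. (Note: once everyone knows the relationship, it no longer helps.)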
79
Practice
A. The first regression equation has more explanatory
power than the second regression equation. (Note: compare R².)
Answer: C
80
Summary
➢ Importance: ☆☆☆
➢ Content:
✓ ANOVA;
✓ SEE, R2, and F-statistic.
➢ Exam tips:
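✓ Frequently tested point 1: given an ANOVA table, calculate a blank cell;
✓ Frequently tested point 2: calculation and interpretation of R²; both calculation and
conceptual questions may appear.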
81
Multiple Regression and Issues in Regression Analysis
Multiple Regression
Tasks:
➢ Formulate a multiple regression and explain the
assumptions of a multiple regression model;
➢ Interpret estimated regression coefficients, formulate
hypothesis tests for them and interpret the results.
➢ Calculate and interpret the predicted value for the
dependent variable.
82
Multiple Regression
Multiple regression
➢ Regression analysis with more than one independent
variable.
✓ Multiple linear regression model
$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + \varepsilon_i$$
where:
Yi = the ith observation of the dependent variable Y
Xji = the ith observation of the jth independent variable Xj
bj = slope coefficient of independent variables
83
Multiple Regression
Assumptions of multiple linear regression
➢ The relationship between the dependent variable and
the independent variables is linear;
➢ The independent variables are not random. Also, no (note: there is no linear relationship among the independent variables)
85
Multiple Regression
Intercept term (b0)
➢ The value of the dependent variable when the
independent variables are all equal to zero.
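➢ Hypothesis:
✓ Two-tailed: H0: bj = hypothesized value; Ha: bj ≠ hypothesized value
✓ One-tailed: H0: bj ≤ hypothesized value; Ha: bj > hypothesized value
✓ One-tailed: H0: bj ≥ hypothesized value; Ha: bj < hypothesized value
➢ Test statistic (note: (estimate - hypothesized value) / standard error):
$$t=\frac{\hat{b}_j-b_j}{s_{\hat{b}_j}},\quad df=n-k-1$$
➢ Confidence interval for a regression coefficient:
$$\hat{b}_j \pm (t_c \times s_{\hat{b}_j})$$
where:
t_c: two-tailed critical t-value with df=n-k-1;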
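A small sketch shows how the t-statistic for a coefficient and its confidence interval lead to the same conclusion. All numbers below (estimate 0.76, standard error 0.33, critical value 2.03) are hypothetical, and the critical value would in practice be looked up in a t-table for df = n-k-1.

```python
def coefficient_t_statistic(estimate, hypothesized, standard_error):
    """(estimate - hypothesized value) / standard error of the coefficient."""
    return (estimate - hypothesized) / standard_error

def coefficient_confidence_interval(estimate, standard_error, t_critical):
    """estimate +/- t_critical * standard error."""
    margin = t_critical * standard_error
    return estimate - margin, estimate + margin

# Hypothetical regression output.
b_hat, se_b, t_critical = 0.76, 0.33, 2.03
t = coefficient_t_statistic(b_hat, 0.0, se_b)
lower, upper = coefficient_confidence_interval(b_hat, se_b, t_critical)
significant_by_t = abs(t) > t_critical
significant_by_ci = not (lower <= 0.0 <= upper)
print(round(t, 3), round(lower, 4), round(upper, 4))
```

Both routes agree here: the t-statistic exceeds the critical value, and the confidence interval does not contain zero.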
90
Multiple Regression
Confidence interval for a regression coefficient (Cont.)
➢ Confidence interval can be applied to significance test for
a regression coefficient.
✓ If the confidence interval does not include zero, the null
hypothesis (H0: bj=0) is rejected, and the coefficient is
said to be statistically significantly different from zero.
91
Multiple Regression
Predicting the dependent variable
➢ The regression equation can be used to predict the value
of the dependent variable based on assumed values of
the independent variables.
$$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + \ldots + \hat{b}_k X_{ki}$$
where:
Ŷi = predicted value of the dependent variable
b̂ j = estimated slope coefficient for jth independent
variable
92
Summary
➢ Importance: ☆☆
➢ Content:
✓ Assumptions of multiple linear regression;
✓ Interpretation and hypothesis testing of regression
coefficients;
✓ Prediction of dependent variable.
➢ Exam tips:
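✓ Frequently tested point: hypothesis testing of regression coefficients; the questions are
flexible and cover calculating the test statistic and judging and interpreting the test result.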
93
Multiple Regression and Issues in Regression Analysis
94
ANOVA Analysis (2)
ANOVA table of multiple regression
              df       SS     MS (mean sum of squares)
Regression    k        RSS    MSR=RSS/k
Error         n-k-1    SSE    MSE=SSE/(n-k-1)
Total         n-1      SST    -
➢ R² = RSS/SST
➢ F = MSR/MSE with df of k and n-k-1
➢ SEE = √MSE
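The three statistics can be computed directly from the ANOVA table entries. The sketch below uses hypothetical inputs (RSS = 60, SSE = 40, n = 25, k = 4) and an illustrative function name.

```python
import math

def anova_summary(rss, sse, n, k):
    """Return R^2, the F-statistic, and SEE from ANOVA table entries."""
    sst = rss + sse
    msr = rss / k              # mean regression sum of squares
    mse = sse / (n - k - 1)    # mean squared error
    return {"R2": rss / sst, "F": msr / mse, "SEE": math.sqrt(mse)}

summary = anova_summary(rss=60.0, sse=40.0, n=25, k=4)
print(summary)
```

With these inputs MSR is 15 and MSE is 2, so R² is 0.6, F is 7.5, and SEE is about 1.414.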
95
ANOVA Analysis (2)
F-statistics
➢ Test whether all regression coefficients are
simultaneously equal to zero; or test whether the
independent variables, as a group, help explain the
dependent variable (note: treat all the regression coefficients as a whole and ask whether together they explain the dependent variable); or assess the effectiveness of the
97
ANOVA Analysis (2)
R² (Coefficient of determination)
➢ $R^2 = \dfrac{\text{Explained variation}}{\text{Total variation}} = \dfrac{RSS}{SST} = \dfrac{SST-SSE}{SST}$
➢ Test the overall effectiveness (goodness of fit) of the
entire set of independent variables (regression model) in
explaining the dependent variable.
✓ For example, an R² of 0.7 indicates that the model, as a
whole, explains 70% of the variation in the dependent
variable.
98
ANOVA Analysis (2)
R² (Cont.)
➢ For multiple regression, however, R² will increase simply
by adding independent variables that explain even a
slight amount of the previously unexplained variation.
✓ Even if the added independent variable is not
statistically significant, R² will increase.
(Note: R² rises whenever an independent variable is added, which is not a desirable property.)
99
ANOVA Analysis (2)
Adjusted R²
➢ $\text{adjusted } R^2 = 1 - \left[\dfrac{n-1}{n-k-1}\times(1-R^2)\right]$
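A short sketch illustrates why adjusted R² is preferred when comparing models with different numbers of independent variables. The inputs are hypothetical: adding a fifth variable nudges R² from 0.600 to 0.605, yet adjusted R² falls.

```python
def adjusted_r_squared(r_squared, n, k):
    """adjusted R^2 = 1 - (n - 1) / (n - k - 1) * (1 - R^2)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r_squared)

# Hypothetical models estimated on the same 25 observations.
adj_four_vars = adjusted_r_squared(r_squared=0.600, n=25, k=4)
adj_five_vars = adjusted_r_squared(r_squared=0.605, n=25, k=5)
print(round(adj_four_vars, 4), round(adj_five_vars, 4))
```

The penalty for the extra variable outweighs the small gain in R², so the five-variable model scores lower on adjusted R².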
101
Summary
➢ Importance: ☆☆☆
➢ Content:
✓ ANOVA table;
✓ Calculation and interpretation of F-statistics;
✓ R2 and adjusted R2.
➢ Exam tips:
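✓ Frequently tested point 1: interpretation of F-statistics (conceptual questions);
✓ Frequently tested point 2: comparison of R² and adjusted R² (conceptual questions).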
102
Multiple Regression and Issues in Regression Analysis
Violations of Assumptions
Tasks:
➢ Explain the types of heteroskedasticity and how
heteroskedasticity and serial correlation affect
statistical inference;
➢ Describe multicollinearity and explain its causes and
effects in regression analysis.
103
Heteroskedasticity
Definition of heteroskedasticity
➢ The variance of the errors differs across observations
(i.e., the error terms are not homoskedastic). (Note: the question is whether the residual variance is the same for different observations.)
✓ Unconditional heteroskedasticity: heteroskedasticity
of the error variance is not correlated with the
independent variables. (Note: the distinction is whether it is related to the independent variables.)
• Creates no major problems for statistical inference.
104
Heteroskedasticity
Definition of heteroskedasticity (Cont.)
✓ Conditional heteroskedasticity: heteroskedasticity of
the error variance is correlated with (conditional on)
the values of the independent variables.
• Does create significant problems for statistical
inference.
105
Heteroskedasticity
Effects of heteroskedasticity
➢ The coefficient estimates ($\hat{b}_j$) aren't affected.
➢ The standard errors of coefficient ($s_{\hat{b}_j}$) are usually
unreliable.
✓ With financial data, the standard errors are most likely
underestimated, and the t-statistics ($t = \dfrac{\hat{b}_j - b_j}{s_{\hat{b}_j}}$) will be
inflated.
[Figure: residual plot; the residual variance is small on the left and large on the right]
107
Heteroskedasticity
Testing for heteroskedasticity (Cont.)
➢ Breusch-Pagan χ² test.
✓ H0: no heteroskedasticity;
(Note: the exam tests the points above.)
✓ $BP\ \chi^2 = n \times R^2_{resid}$ with df = k (the number of
independent variables) and one-tailed test;
• n = the number of observations;
• $R^2_{resid}$ = R² of a second regression of the squared
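The Breusch-Pagan statistic itself is a one-line calculation once the R² of the second regression is known. The sketch below uses hypothetical inputs (n = 60, R² of the residual regression = 0.09, k = 2); the 5% chi-square critical value for 2 degrees of freedom, 5.991, is taken from a standard table.

```python
def breusch_pagan_statistic(n, r_squared_resid):
    """BP chi-square statistic: n times the R^2 of the residual regression."""
    return n * r_squared_resid

# Hypothetical inputs.
bp = breusch_pagan_statistic(n=60, r_squared_resid=0.09)
chi2_critical = 5.991            # 5% one-tailed chi-square critical value, df = k = 2
reject_h0 = bp > chi2_critical   # H0: no heteroskedasticity
print(round(bp, 2), reject_h0)
```

In this hypothetical case the statistic is below the critical value, so the null of no conditional heteroskedasticity is not rejected.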
109
Serial Correlation (autocorrelation)
Definition of serial correlation (autocorrelation)
➢ The residuals (error terms) are correlated with one
another; this typically arises in time-series regressions.
✓ Positive serial correlation: a positive/negative error for
one observation increases the chance of a
positive/negative error for another observation. (Note: if one residual is positive, the probability that another is positive goes up.)
✓ Negative serial correlation: a positive/negative error
for one observation increases the chance of a
negative/positive error for another observation.
110
Serial Correlation
Effects of serial correlation
➢ The coefficient estimates aren't affected.
➢ The standard errors of coefficient are usually unreliable.
✓ Positive serial correlation: standard errors
underestimated and t-statistics inflated, suggesting
significance when there is none (type I error);
✓ Negative serial correlation: vice versa (type II error).
➢ The F-test is also unreliable.
111
Serial Correlation
Testing for serial correlation
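➢ Residual scatter plots
[Figure panels: residuals plotted against time T; left panel shows positive serial correlation (a positive residual tends to be followed by positive residuals), right panel shows negative serial correlation]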
114
Multicollinearity
Definition of multicollinearity (note: the independent variables are correlated with one another)
115
Multicollinearity
Effects of multicollinearity
➢ Estimates of regression coefficients become extremely
imprecise and unreliable;
➢ Standard errors of regression coefficients will be inflated,
then t-test on the coefficients will have little power
(more type II error).
✓ Greater probability we will incorrectly conclude that a
variable is not statistically significant. (Note: a type II error means failing to reject a false null hypothesis.)
116
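Multicollinearity
Testing for multicollinearity
(Note: suppose y = 5x1 and x1 = 2x2. Then the same relation can be written as y = 4x1 + 2x2. Splitting it further, each coefficient can be made very small, so a hypothesis test would conclude that it equals 0.)
➢ The t-tests indicate that none of the regression
coefficients is significant, while R² is high and F-test
indicates overall significance; (Note: no single independent variable is significant, but the model as a whole is meaningful.)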
117
Multicollinearity
Correcting for multicollinearity
➢ Excluding one or more of the correlated independent
variables.
118
Summary of Assumption Violations
Violation                         Effects          Testing
Conditional heteroskedasticity    Type I error     • Residual scatter plots
                                                   • Breusch-Pagan χ²-test: BP = n×R²
Positive serial correlation       Type I error     • Residual scatter plots
Negative serial correlation       Type II error    • Durbin-Watson test: DW≈2×(1−r)
Multicollinearity                 Type II error    • t-tests indicate no significance when F-test
                                                     indicates overall significance and R² is high
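The approximation DW ≈ 2×(1−r) in the table can be illustrated with a small sketch. The Durbin-Watson formula used here (sum of squared changes in the residuals divided by the sum of squared residuals) is the standard definition and is an assumption in the sense that it does not appear in the surviving slide text; the alternating residual series is hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def lag1_correlation(residuals):
    """Simplified lag-1 autocorrelation of residuals assumed to have mean zero."""
    numerator = sum(residuals[t] * residuals[t - 1]
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that flip sign every period (negative serial correlation).
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
dw = durbin_watson(alternating)
r = lag1_correlation(alternating)
print(round(dw, 2), round(2 * (1 - r), 2))
```

A DW value well above 2 points to negative serial correlation, a value well below 2 to positive serial correlation, and a value near 2 to none.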
119
Practice
Feng, CFA, runs a regression of portfolio returns on three
independent variables. Feng discovers that the p-values
for each independent variable are relatively high, but the
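F-test has a very small p-value. Feng is puzzled and tries
to figure out the reasons. What violation of regression
(Note: a high p-value means H0: bj = 0 cannot be rejected, which suggests that no individual independent variable has explanatory power.)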
➢ Content:
✓ Definition, effects, testing, and correcting of
heteroskedasticity, serial correlation, and
multicollinearity.
➢ Exam tips:
✓ Frequently tested point: effects and testing of heteroskedasticity and serial
correlation (conceptual questions).
121
Multiple Regression and Issues in Regression Analysis
123
Dummy Variable
Example
➢ Yi = b0 + b1X1i + b2X2i + b3X3i + ɛi
where: Yi = quarterly value of EPS of a stock
Y         X1   X2   X3
Q1 EPS    1    0    0
Q2 EPS    0    1    0
Q3 EPS    0    0    1
Q4 EPS    0    0    0    (omitted category)
124
Dummy Variable
Interpretation of coefficient
➢ Intercept coefficient (b0): the average value of
dependent variable for the omitted category.
➢ Regression coefficient (bj): the difference in dependent
variable (on average) between the category represented
by the dummy variable and the omitted category.
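The interpretation above can be sketched without running a regression: when the only regressors are a full set of category dummies, the least-squares intercept equals the mean of the omitted category and each slope equals the gap between a category mean and that mean. The EPS figures and the function name below are hypothetical.

```python
def dummy_coefficients(values_by_category, omitted):
    """Return (intercept, slopes) implied by a regression on category dummies only."""
    means = {cat: sum(vals) / len(vals) for cat, vals in values_by_category.items()}
    intercept = means[omitted]    # b0: average of the omitted category
    slopes = {cat: means[cat] - intercept for cat in means if cat != omitted}
    return intercept, slopes

# Hypothetical quarterly EPS observations, two years of data.
eps = {"Q1": [1.0, 1.2], "Q2": [1.5, 1.7], "Q3": [0.9, 1.1], "Q4": [2.0, 2.2]}
b0, b = dummy_coefficients(eps, omitted="Q4")
print(round(b0, 2), {q: round(v, 2) for q, v in b.items()})
```

Here b0 is the average Q4 EPS, and the coefficient on the Q1 dummy says how far average Q1 EPS sits below it.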
125
Model Misspecification
Definition of model misspecification
➢ Model specification refers to the set of variables included in the
regression and the regression equation’s functional form; a model is
misspecified when either of these is wrong.
126
Model Misspecification
Categories of model misspecification
➢ Misspecified functional form (note: the form of the model equation is wrong)
128
Model Misspecification
Avoiding model misspecification
➢ The model should be grounded in cogent economic
reasoning;
➢ The functional form chosen for the variables should be
appropriate given the nature of the variables;
➢ The model should be parsimonious;
➢ The model should be examined for violations of
regression assumptions before being accepted;
➢ The model should be tested and be found useful out of
sample before being accepted. (Note: split the data 7:3, using 70% to estimate the model and 30% to validate it.)
129
Qualitative Dependent Variable
Qualitative dependent variable
➢ Dummy variables used as dependent variables instead
of as independent variables.
✓ Probit and logit model
✓ Discriminant models (note: these produce a score)
130
Practice
Consider the following model of earnings (EPS) regressed
against dummy variables for the quarters:
EPSt = α + β1Q1t + β2Q2t + β3Q3t
where:
EPSt: quarterly observation of EPS;
Q1t : 1 for the second quarter, 0 otherwise;
Q2t : 1 for the third quarter, 0 otherwise;
Q3t : 1 for the fourth quarter, 0 otherwise.
Which of the following statements regarding this model
is most accurate? The:
131
Practice
A. coefficient on each dummy tells us about the
difference in earnings per share between the respective
quarter and the one left out (first quarter in this case)
B. EPS for the first quarter is represented by the residual.
C. significance of the coefficients cannot be interpreted
in the case of dummy variables.
Answer: A
132
Summary
➢ Importance: ☆
➢ Content:
✓ Dummy variable;
✓ Model misspecification;
✓ Qualitative dependent variable.
➢ Exam tips:
✓ Not a key exam topic.
133
Multiple Regression and Issues in Regression Analysis
Machine Learning
Tasks:
➢ Distinguish between supervised and unsupervised
machine learning;
➢ Describe machine learning algorithms used in
prediction, classification, clustering, and dimension
reduction;
➢ Describe the steps in model training.
134
Machine Learning
Machine learning (ML)
➢ Machine learning comprises diverse approaches by
which computers are programmed to improve
performance in specified tasks with experience. (Note: the computer does the learning.)
✓Types of machine learning
✓Machine learning algorithms
✓Steps in model training
135
Types of Machine Learning
Supervised learning vs. unsupervised learning
➢ Supervised learning: makes use of labeled training
data.
✓More formally, supervised learning is the process of
training an algorithm to take a set of inputs X and
find a model that best relates them to the output Y.
✓E.g., transactions are labeled “fraudulent” or “non-
fraudulent”, and an ML program uses these labels to train a model to
predict fraud more accurately in new credit card
transactions.
136
Types of Machine Learning
Supervised learning vs. unsupervised learning
➢ Unsupervised learning does not make use of labeled
training data.
✓More formally, in unsupervised learning, we have
inputs X that are used for analysis without any
targets Y being supplied; the ML program has to
discover structure within the data themselves.
✓E.g., based on financial statement data, ML program
clusters firms into groups based on their attributes.
137
Machine Learning Algorithms
Machine learning algorithms
[Diagram: supervised learning algorithms: penalized regression, CART, random forests, neural networks]
138
Machine Learning Algorithms
Supervised learning algorithms
➢ Supervised learning can be divided into two categories:
regression and classification. The distinction is
determined by the nature of the target variable (Y).
✓If the target variable is continuous, then the task is
one of regression.
139
Machine Learning Algorithms
Supervised learning algorithms (Cont.)
✓If the target variable is categorical or ordinal (e.g., a
firm’s rating), then it is a classification problem. (Note: a discrete target means classification.)
• Classification algorithms include classification and regression
trees (CART), random forests, and neural networks.
140
Machine Learning Algorithms
Supervised learning algorithms (Cont.)
➢ Penalized regression: choose the regression coefficients
to minimize the sum of squared residuals plus a penalty
term.
144
Machine Learning Algorithms
Supervised learning algorithms (Cont.)
[Figure: neural network; note: neurons (nodes) connected with weights]
145
Machine Learning Algorithms
Unsupervised learning algorithms
➢ Clustering algorithms group data solely
on the basis of information found in the data.
✓In classification, data are assigned to classes
determined by the researcher (fraudulent or non-
fraudulent). In clustering, the groups are determined
by the data themselves. (Note: classification, covered earlier, sorts data toward a predefined target.)
146
Machine Learning Algorithms
Unsupervised learning algorithms (Cont.)
➢ Dimension reduction focuses on reducing the number
of independent variables while retaining variation
across observations (to preserve the information
contained in that variation). (Note: variables are combined to find the important elements.)
147
Steps in Model Training
Supervised machine learning: training
➢ Process of training ML models involves 5 steps:
✓Specify the ML technique/algorithm.
149
Time-Series Analysis
Tasks:
➢ Calculate and evaluate the predicted trend value for a
time series;
➢ Describe factors to determine trend model selection;
➢ Evaluate limitations of trend models.
150
Trend Models
Linear trend models
151
Trend Models
Log-linear trend models
152
Trend Models
Linear trend model vs. Log-linear trend model
➢ If the data plot with a linear shape (constant change
amount), a linear trend model may be appropriate.
➢ If the data plot with a non-linear (curved) shape (constant
growth rate), a log-linear model may be more suitable.
153
Trend Models
Limitations of trend models
➢ The trend model is not appropriate for time series when
data exhibit serial correlation.
154
Summary
➢ Importance: ☆
➢ Content:
✓ Linear trend model & log-linear trend model;
✓ Limitation of trend models.
➢ Exam tips:
✓ Not a key exam topic.
155
Time-Series Analysis
Tasks:
(Note: there is no independent variable, so one is created from the series' own previous value: x_{t+1} = f(x_t).)
➢ Describe the structure of an AR model, explain the
testing of autocorrelations of the residuals;
➢ Calculate one- and two-period-ahead forecasts given
the estimated coefficients of an AR model;
➢ Explain mean reversion and calculate a mean-
reverting level.
156
Autoregressive Model (AR)
Covariance stationary
159
Autoregressive Model
Detecting autocorrelation
(Note: the DW test can only be used for models with true independent variables.)
➢ Step 1: Estimate the AR(1) model using linear regression;
✓ xt = b0 + b1xt-1 + ɛt
➢ Step 2: Compute the autocorrelations ($\rho_{\varepsilon_t,\varepsilon_{t-k}}$) of the
residuals;
166
Time-Series Analysis
Tasks:
(Note: asset prices follow a random walk.)
169
Random Walk
Random walk vs. Covariance stationary
a random walk.
170
Random Walk
Unit root
➢ The time series is said to have a unit root if the lag (note: this refers to the coefficient on the lagged variable)
172
Random Walk
Dickey-Fuller test for unit root
➢ Step 1: start with an AR(1) model: xt=b0+b1xt-1+εt ;
➢ Step 2: subtract xt-1 from both sides:
xt-xt-1 = b0 + (b1 –1)xt-1 + εt ;
✓ Or: xt-xt-1 =b0+g1xt-1+εt where: g1=b1-1
➢ Step 3: test if g1=0.
✓ H0: g1=0; Ha: g1<0;
✓ Calculate t-statistic and use revised critical values; (Note: these are somewhat larger than the usual t-table values.)
✓ If we fail to reject H0, there is a unit root and the time
series is non-stationary.
173
Random Walk
First differencing
➢ A random walk (i.e., a series with a unit root) can be transformed
to a covariance stationary time series by first
differencing.
✓ Subtract xt-1 from both sides of random walk model:
xt-xt-1=xt-1-xt-1+εt=εt
(Note: for example, first differencing 1, 2, 3, 4, 5, ... gives 1, 1, 1, 1, ..., which is stationary.)
✓ Define yt=xt-xt-1, so yt=εt ;
Or yt=b0+b1yt-1+εt ; where: b0=b1=0
(Note: second differencing is also possible: 1, 2, 4, 7, 11 becomes 1, 2, 3, 4 and then 1, 1, 1.)
✓ Then, yt is a covariance stationary variable with a finite
mean-reverting level of 0/(1-0)=0.
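The differencing transformation is easy to sketch. The helper name `difference` is illustrative, and the two input sequences are the small examples from the margin notes.

```python
def difference(series):
    """Return the first difference y_t = x_t - x_(t-1)."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

first_diff = difference([1, 2, 3, 4, 5])
second_diff = difference(difference([1, 2, 4, 7, 11]))
print(first_diff, second_diff)
```

Each round of differencing shortens the series by one observation, which is why the second difference of a five-point series has only three values.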
174
Practice
One choice a researcher can use to test for
nonstationarity is to use a:
A. Dickey-Fuller test, which uses a modified t-statistic.
B. Breusch-Pagan test, which uses a modified t-statistic.
C. Dickey-Fuller test, which uses a modified χ² statistic.
(Note: the Dickey-Fuller test for nonstationarity uses a modified t-test; the Breusch-Pagan test is for heteroskedasticity.)
Answer: A
175
Summary
➢ Importance: ☆☆☆
➢ Content:
✓ Random walk;
✓ Testing of unit roots;
✓ First differencing.
➢ Exam tips:
✓ Frequently tested points: how to test for unit roots, how to interpret the test result, and how to
transform a random walk into a stationary series (first differencing).
176
Time-Series Analysis
Tasks:
➢ Contrast in-sample and out-of-sample forecasts;
➢ Explain ARCH model;
➢ Determine and justify an appropriate time-series
model.
177
Model Evaluation
Comparing forecasting model performance
➢ In-sample forecast errors: the residuals within the sample
period used to estimate the model;
➢ Out-of-sample forecast errors: the forecast errors outside the
sample period used to estimate the model. (Note: for example, with 10,000 data points split 7:3, the 3,000 held-out points are used for back-testing.)
➢ Root mean squared error (RMSE) criterion: the model
with the smallest RMSE for the out-of-sample data is
typically judged most accurate. (Note: RMSE plays the same role as SEE; the smaller, the better.)
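The RMSE criterion can be sketched as follows. The out-of-sample actual values and the two sets of forecasts are hypothetical, as is the function name.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared forecast error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical out-of-sample data and forecasts from two competing models.
actual = [1.0, 2.0, 3.0, 4.0]
forecast_a = [1.1, 1.9, 3.2, 3.8]
forecast_b = [1.5, 2.5, 2.5, 4.5]
rmse_a = rmse(actual, forecast_a)
rmse_b = rmse(actual, forecast_b)
better_model = "A" if rmse_a < rmse_b else "B"
print(round(rmse_a, 4), round(rmse_b, 4), better_model)
```

The comparison should use the held-out data only; in-sample errors tend to flatter the model that was fitted to them.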
178
Model Evaluation
Instability of regression coefficients
➢ Financial and economic relationships are inherently
dynamic, so the estimates of regression coefficients of
the time-series model can change substantially across
different sample periods.
➢ There is a tradeoff between reliability and stability.
✓ Models estimated with shorter time series are usually (note: with less data the estimates are more stable but less reliable)
182
Model Evaluation
Steps in time series forecasting
Does series have a trend? (plotting)
Yes
✓ ARCH model;
✓ Regression with two time series.
➢ Exam tips:
✓ Not a key exam topic.
187
Excerpt from “Probabilistic approaches: scenario
analysis, decision trees, and simulations”
Simulation
Tasks:
➢ Describe steps of simulation and treatment of
correlation;
➢ Describe advantage, constraints, and issues of
simulation;
➢ Compare scenario analysis, decision trees, and
simulations.
188
Simulation
Steps in running a simulation
➢ Determine “probabilistic” variables;
➢ Define probability distributions for these variables;
➢ Check for correlation across variables;
➢ Run the simulation.
189
Simulation
Define probability distributions for variables
➢ Historical data
➢ Cross-sectional data
190
Simulation
Treatment of correlation across variables
➢ When there is strong correlation, positive or negative,
across inputs, we have two choices:
✓ Pick only one that has the bigger impact on value;
✓ Building the correlation explicitly into the simulation.
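The second choice, building the correlation explicitly into the simulation, can be sketched with two standard normal inputs. This is only one possible approach and rests on assumptions not in the slides: both inputs are normal, the target correlation is 0.8, and the second input is constructed as rho × z1 + sqrt(1 − rho²) × independent noise.

```python
import math
import random

def simulate_correlated_inputs(n_trials, rho, seed=0):
    """Draw pairs of standard normal inputs whose correlation is approximately rho."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_trials):
        z1 = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mix z1 with independent noise so that corr(z1, z2) equals rho in expectation.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * noise
        draws.append((z1, z2))
    return draws

def sample_correlation(draws):
    """Sample correlation coefficient of the simulated pairs."""
    n = len(draws)
    mean_1 = sum(a for a, _ in draws) / n
    mean_2 = sum(b for _, b in draws) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in draws)
    var_1 = sum((a - mean_1) ** 2 for a, _ in draws)
    var_2 = sum((b - mean_2) ** 2 for _, b in draws)
    return cov / math.sqrt(var_1 * var_2)

draws = simulate_correlated_inputs(n_trials=20000, rho=0.8)
print(round(sample_correlation(draws), 2))
```

Each simulated pair would then feed the valuation model, so that the distribution of outcomes reflects the dependence between the two inputs.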
191
Simulation
Advantages of using simulations
➢ Better input estimation;
➢ It yields a distribution for expected value rather than a
point estimate.
192
Simulation
Constraints on simulations
➢ Book value constraints;
➢ Earnings and cash flow constraints;
➢ Market value constraints.
193
Simulation
Issues in using simulations in risk assessment
➢ Garbage in, garbage out;
➢ Real data may not fit distributions;
➢ Non-stationary distributions;
➢ Changing correlation across inputs.
194
Simulation
Comparing probabilistic approaches
➢ How to choose among probabilistic approaches: scenario
analysis, decision trees, and simulation:
✓ Selective vs. full risk analysis;
✓ Type of risk;
• Discrete vs. continuous.
✓ Correlations across risks;
✓ Quality of information.
195
Summary
➢ Importance: ☆
➢ Content:
✓ Steps of simulation and ways to define probability
distribution;
✓ Advantages, constraints, and issues of simulation;
✓ Comparison of scenario analysis, decision tree and
simulation.
➢ Exam tips:
✓ Not a key exam topic.
196