CFA Level II: Quantitative Methods
http://www.zejicert.cn
CONTENTS
Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches:
Scenario Analysis, Decision Trees, and
Simulations”
Review
The simple linear regression model
The assumptions of the linear regression
Review
Four steps of hypothesis testing
Four steps of hypothesis testing
Test-Statistic
Tests concerning a single mean
• H0: μ=μ0; Ha: μ≠μ0
• z = (X̄ − μ0)/(σ/√n);  t(n−1) = (X̄ − μ0)/(s/√n)
Test-Statistic
Tests concerning correlation
• r = Cov(X,Y)/(sX × sY)
• H0: ρ=0; Ha: ρ≠0
• t = r√(n−2)/√(1−r²), df = n−2
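As a quick check, the correlation and its t-statistic can be computed directly. The sketch below uses the height/weight sample that appears later in this deck; any paired data would do.

```python
import math

# Paired observations (n = 10): heights (X) and weights (Y)
x = [170, 165, 168, 155, 173, 180, 185, 171, 167, 176]
y = [65, 52, 60, 48, 59, 75, 85, 70, 68, 72]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample correlation: r = Cov(X, Y) / (sX * sY)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
r = cov / (sx * sy)

# t-test of H0: rho = 0, df = n - 2
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(r, 4), round(t, 2))  # 0.9003 5.85
```

Since 5.85 exceeds the two-tailed 5% critical value for df = 8 (2.306), H0: ρ = 0 is rejected.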
Four steps of hypothesis testing
Test-Statistic
Tests concerning correlation
[Figure: scatter plots of Y against X illustrating different strengths and signs of correlation]
Student’s t-distribution
Four steps of hypothesis testing
Fail to reject H0 if |test statistic|<critical value
• μ is not significantly different from μ0.
[Figure: rejection regions — a two-tailed test (2.5% in each tail, 95% in the middle) and a one-tailed test (5% in one tail)]
P-value
If P–value<alpha, we reject null hypothesis.
[Figure: two-tailed test with α/2 = 2.5% in each tail (95% in the middle); here P/2 = 1.07% in each tail, so P-value = 2.14% < 5%]
Decision | H0 is true | H0 is false
Do not reject H0 | Correct decision | Type II error
Reject H0 | Type I error | Correct decision
Type I and type II errors
P(I) + P(II) ≠ 1
Holding n constant: as P(Type I) increases, P(Type II) decreases.
As n increases, both P(Type I) and P(Type II) decrease.
The simple linear regression model
Yi=b0+b1Xi+εi, i=1,…,n
• Yi = ith observation of the dependent variable, Y;
• Xi = ith observation of the independent variable, X;
• b0 = intercept; b1 = slope coefficient; εi = error term.
The simple linear regression model
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = Cov(X,Y)/Var(X);  b0 = Ȳ − b1X̄
The estimated intercept coefficient (b0): the point (X̄, Ȳ) is on the regression line.
The simple linear regression model
The estimated slope coefficient (b1 ): the
sensitivity of Y to a change in X.
The expected value of the error term is 0 (i.e., E(εi)=0);
The variance of the error term is constant (i.e., the error terms are homoscedastic);
The error term is uncorrelated across
observations (i.e., E(εiεj)=0 for all i≠j);
The error term is normally distributed.
Example
Establish the regression relation between height (X) and weight (Y):
Heights, entered as X01 through X10: 170, 165, 168, 155, 173, 180, 185, 171, 167, 176.
Weights, entered as Y01 through Y10: 65, 52, 60, 48, 59, 75, 85, 70, 68, 72.
Example
Calculator keystrokes and results:
Keystroke | Explanation | Display
[2ND] [7] [2ND] [CLR WORK] | Clear the stored DATA worksheet memory | X01=0.0000
170 [ENTER] | Enter X01 | X01=170.0000
[↓] 65 [ENTER] | Enter Y01 | Y01=65.0000
[↓] 165 [ENTER] | Enter X02 | X02=165.0000
[↓] 52 [ENTER] | Enter Y02 | Y02=52.0000
Enter X03, Y03, …, X10, Y10 in the same way.
Example
Keystroke | Explanation | Display
[2ND] [8] → [STAT] | Enter the STAT function | LIN
[↓] | 10 data pairs entered in total | n=10.0000
[↓] | The mean of X is 171.0000 | X̄=171.0000
[↓] | For a sample, the sample standard deviation of X is 8.3267 | Sx=8.3267
[↓] | For a population, the population standard deviation of X is 7.8994 | σx=7.8994
Example
Keystroke | Explanation | Display
[↓] | The mean of Y is 65.4000 | ȳ=65.4000
[↓] | For a sample, the sample standard deviation of Y is 11.0574 | Sy=11.0574
[↓] | For a population, the population standard deviation of Y is 10.4900 | σy=10.4900
[↓] | The regression intercept is −139.0326 | a=−139.0326
[↓] | The regression slope is 1.1955 | b=1.1955
[↓] | The correlation between height and weight is 0.9003 | r=0.9003
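The calculator's output can be verified with the slope and intercept formulas from the earlier slide; a minimal Python sketch:

```python
# Verify the calculator's regression output using b1 = Cov(X,Y)/Var(X).
heights = [170, 165, 168, 155, 173, 180, 185, 171, 167, 176]  # X
weights = [65, 52, 60, 48, 59, 75, 85, 70, 68, 72]            # Y

n = len(heights)
mx = sum(heights) / n
my = sum(weights) / n

sxy = sum((x - mx) * (y - my) for x, y in zip(heights, weights))
sxx = sum((x - mx) ** 2 for x in heights)

b1 = sxy / sxx         # slope (sample covariance / sample variance, df terms cancel)
b0 = my - b1 * mx      # intercept: the point (X-bar, Y-bar) lies on the line
print(round(b1, 4), round(b0, 4))  # 1.1955 -139.0327
```

This matches the calculator's a and b up to display rounding.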
Analysis of Variance (ANOVA)
[Figure: decomposition of total variation — for each observation, (Yi − Ȳ) splits into the explained part (Ŷi − Ȳ) and the unexplained residual (Yi − Ŷi), where Ŷi = b0 + b1Xi]
Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) Table
Source | df | SS | MSS
Regression | k=1 | RSS | RSS/k
Residual | n−2 | SSE | SSE/(n−2)
Total | n−1 | TSS | –
Standard error of estimate (SEE, sometimes called the standard error of the regression).
Standard error of estimate(SEE)
The SEE gauges the "fit" of the regression
line. The smaller the standard error, the
better the fit.
SEE will be low(relative to total variability)
if the relationship is very strong, or will be
high if the relationship is weak.
The coefficient of determination(R2)
• In a simple linear regression, the square root of R² is the correlation coefficient (the multiple R) between X and Y.
R² = RSS/TSS = 1 − SSE/TSS
The coefficient of determination(R2)
The coefficient of determination can apply to an equation with several independent variables, and it implies explanatory power, while the correlation coefficient only applies to two variables and does not imply explanatory power between the variables.
Hypothesis testing for the regression coefficient
The larger the standard error of an estimated coefficient, the less confidence we have in that estimate.
Hypothesis testing
We test the population value of the intercept or slope coefficient of a regression model.
• H0: b1=0; Ha: b1≠0
• t-statistic: t = (b̂1 − b1)/s_b̂1; df = n−2.
• Reject H0 if t > +t_critical or t < −t_critical.
Determining a prediction interval
Prediction interval for the dependent variable, given a forecasted value of the independent variable.
Point estimate: Ŷ = b̂0 + b̂1X;
• sf = SEE × √(1 + 1/n + (X − X̄)² / ((n−1)s²X))
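A sketch of the standard error of the forecast. The SEE value here is assumed for illustration; the other inputs are borrowed from the height/weight example.

```python
import math

# Assumed/illustrative inputs (SEE is made up; the rest come from the height data)
see = 5.0          # assumed standard error of estimate
n = 10
x_forecast = 175.0
x_bar = 171.0
s_x = 8.3267       # sample standard deviation of X

# Standard error of the forecast: grows as X moves away from its mean
s_f = see * math.sqrt(1 + 1 / n + (x_forecast - x_bar) ** 2 / ((n - 1) * s_x ** 2))
print(round(s_f, 4))  # ~5.3048, always larger than SEE itself
```

The prediction interval is then the point estimate ± t_critical × sf.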
Limitations of regression analysis
Regression relations can change over time (parameter instability).
Public knowledge of a relationship may limit its future usefulness.
Regression results are not reliable if the regression assumptions are violated:
Heteroskedasticity (non-constant variance of the error terms)
Autocorrelation (error terms are not independent)
Example
An analyst ran a regression and got the
following result:
Coefficient t-statistic p-value
Intercept -0.5 -0.91 0.18
Example
1. Fill in the blanks of the ANOVA Table.
2. What is the standard error of estimate?
3. What is the 95% confidence interval result of
the slope coefficient significance test?
4. What is the result of the sample correlation?
5. What is the 95% confidence interval of the
slope coefficient?
CONTENTS
Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches:
Scenario Analysis, Decision Trees, and
Simulations”
Multiple Regression and Machine Learning
Models with qualitative dependent variables
Multiple linear regression
Multiple linear regression
Assumptions of the multiple linear regression
model:
The relationship between the dependent
variable, Y, and the independent variables,
X1, X2, …, Xk, is linear.
The independent variables (X1, X2, …, Xk)
are not random. Also, no exact linear
relation exists between two or more of the
independent variables.
Multiple linear regression
Source | df | SS | MSS
Regression | k | RSS | RSS/k
Residual | n−k−1 | SSE | SSE/(n−k−1)
Total | n−1 | TSS | –
SEE = √(SSE/(n−k−1)), where k is the number of slope coefficients.
R2 and adjusted R2
R2 almost always increases as variables are
added to the model, even if the marginal
contribution of the new variables is not
statistically significant.
Multiple linear regression
The adjusted R² is a modified version of R² that does not necessarily increase when a new independent variable is added.
Adjusted R² is given by: R̄² = 1 − [(n−1)/(n−k−1)] × (1 − R²)
R̄² ≤ R²; the adjusted R̄² may be less than zero.
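Using the ANOVA numbers from the sales example that appears later in this section (n = 22, k = 2), R² and adjusted R² can be computed as:

```python
# Adjusted R-squared from the ANOVA numbers of the sales example (n = 22, k = 2).
rss, sse, tss = 236.30, 116.11, 352.41
n, k = 22, 2

r2 = rss / tss                                   # equivalently 1 - sse/tss
adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)

print(round(r2, 4), round(adj_r2, 4))  # 0.6705 0.6358
```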
Multiple linear regression
• The F-test is used to test whether at
least one slope coefficient is significantly
different from zero.
H0: b1=b2=b3=…=bk=0;
Ha: at least one bj≠0 (j = 1 to k).
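With the ANOVA numbers from the sales example later in this section, the F-statistic is MSR/MSE with df = (k, n−k−1):

```python
# F-statistic for overall significance: F = MSR / MSE, df = (k, n-k-1).
rss, sse = 236.30, 116.11   # from the sales example
n, k = 22, 2

msr = rss / k               # 118.15
mse = sse / (n - k - 1)     # 116.11 / 19
f = msr / mse
print(round(f, 2))          # ~19.33 (the slide's 19.34 uses the rounded MSE of 6.11)
```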
F-table at 5 percent (Upper tail)
Example
In an attempt to estimate a regression equation
that can be used to forecast Feild’s future sales,
22 years of Feild’s annual sales were regressed
against two independent variables:
Example
Based on the regression results (see next page),
the regression equation can be stated as: Sales
= 6.000 + 0.004(GDP) − 20.500(∆I)
Fill in the missing data and interpret the results of the regression at a 5% level of significance with respect to:
• The significance of the individual independent
variables.
• The utility of the model as a whole.
Example
Variable | Coefficient | Standard Error | t-statistic | p-value
Level of gross domestic product (GDP) | 0.004 | 0.003 | ? | 0.20
Changes in 30-year mortgage rates (∆I) | −20.500 | 3.560 | ? | <0.001
Example
ANOVA df SS MS F Significance F
Regression ? 236.30 ? ? p < 0.005
Error ? 116.11 ?
Total ? ?
R² | ?
Adjusted R² | ?
Answer
Variable | Coefficient | Standard Error | t-statistic | p-value
Level of gross domestic product (GDP) | 0.004 | 0.003 | 1.333 | 0.20
Changes in 30-year mortgage rates (∆I) | −20.500 | 3.560 | −5.758 | <0.001
Answer
ANOVA df SS MS F Significance F
Regression 2 236.30 118.15 19.34 p < 0.005
Error 19 116.11 6.11
Total 21 352.41
R² | 67.05%
Adjusted R² | 63.58%
Multiple linear regression
Example of dummy variables
GDPt=b0+b1Q1t+b2Q2t+b3Q3t+εt
• GDPt = the observed GDP in period t;
• Q1t = 1 if period t is the first quarter, 0 otherwise;
• Q2t = 1 if period t is the second quarter, 0 otherwise;
• Q3t = 1 if period t is the third quarter, 0 otherwise.
Multiple Regression and Machine Learning
Models with qualitative dependent variables
Heteroskedasticity
Serial correlation (autocorrelation)
Multicollinearity
Heteroskedasticity
Homoskedasticity and heteroskedasticity
The error term εi is homoskedastic if the variance of the conditional distribution of εi given Xi is constant for i = 1,…,n and, in particular, does not depend on Xi.
Otherwise, the error term is heteroskedastic.
Heteroskedasticity
Unconditional heteroskedasticity
The heteroskedasticity is not related to the
level of the independent variables.
• This causes no major problems for statistical inference.
Heteroskedasticity
Conditional heteroskedasticity
Heteroskedasticity is related to the level of (i.e., conditional on) the independent variable.
The residual variance will be larger when the value of the independent variable X is larger.
Conditional heteroskedasticity does create
significant problems for statistical inference.
Heteroskedasticity
[Figure: scatter plot around the fitted line Ŷ = b0 + b1X, with residual dispersion increasing as X increases]
Heteroskedasticity
Detecting heteroskedasticity
Method one: Residual scatter plot.
[Figure: residual scatter plot — residuals plotted against the independent variable]
Heteroskedasticity
Method two: the Breusch-Pagan χ² test (df = k).
• H0: Squared error term is uncorrelated with
the independent variables.
• Ha: Squared error term is correlated with the
independent variables.
• BP = n × R²residual, df = k, one-tailed test
Heteroskedasticity
• Note: R²residual is the coefficient of determination from regressing the squared residuals on the independent variables, not the R² of the original regression.
Only large values of the test statistic are a concern (one-tailed test).
Decision rule: reject H0 (conclude conditional heteroskedasticity) if BP exceeds the critical value from the χ² table.
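A minimal numpy sketch of the Breusch-Pagan procedure; the data below are made up for illustration.

```python
import numpy as np

# Breusch-Pagan sketch: regress squared residuals from the original fit on X,
# then BP = n * R^2 of that auxiliary regression. Data are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 2.3, 2.8, 4.5, 4.2, 6.9, 6.1, 9.0])
n = len(x)

# Original regression and its residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Auxiliary regression: squared residuals on X
e2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
e2_hat = X @ gamma
r2_resid = 1 - np.sum((e2 - e2_hat) ** 2) / np.sum((e2 - e2.mean()) ** 2)

bp = n * r2_resid   # compare with the chi-square critical value, df = k = 1
print(round(bp, 3))
```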
Chi-squared (χ²) table
Heteroskedasticity
Correcting for heteroskedasticity
Method one:Robust standard error
• Corrects the standard errors (White-
corrected) of the linear regression
model’s estimated coefficients to account
for the conditional heteroscedasticity.
Heteroskedasticity
Method two:Generalized least squares
• Modifies the original equation in an attempt
to eliminate the heteroscedasticity.
Violations of regression assumptions
Heteroskedasticity
Serial correlation (autocorrelation)
Multicollinearity
Serial correlation (autocorrelation)
Serial correlation (autocorrelation) refers to
the situation that error terms are correlated
with one another.
Serial correlation (autocorrelation)
Serial correlation is often found in time series data.
Positive serial correlation exists when a
positive regression error in one time period
increases the probability of observing a positive regression error in the next time period.
Negative serial correlation occurs when a
positive error in one period increases the
probability of observing a negative error in the
next period.
Serial correlation (autocorrelation)
Positive serial correlation → Type Ⅰ error & F-
test unreliable
• Not affect the consistency of estimated
regression coefficients.
Serial correlation (autocorrelation)
• Because of the tendency of the data to cluster together from observation to observation, positive serial correlation typically results in coefficient standard errors that are too small, which leads to t-statistics that are too large.
Serial correlation (autocorrelation)
Negative serial correlation → Type Ⅱ error
• Because of the tendency of the data to diverge from observation to observation, negative serial correlation typically results in coefficient standard errors that are too large, which leads to t-statistics that are too small.
Serial correlation (autocorrelation)
Durbin-Watson test
H0: No positive serial correlation
DW ≈ 2 × (1 − r), where r is the correlation between consecutive residuals
Decision rule:
Decision bands, on a scale from 0 to 4:
0 to dl — reject H0, conclude positive serial correlation;
dl to du — inconclusive;
du to 4−du — do not reject H0;
4−du to 4−dl — inconclusive;
4−dl to 4 — reject H0, conclude negative serial correlation.
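The DW statistic itself is simple to compute from the residual series; a sketch with made-up residuals:

```python
import numpy as np

# Durbin-Watson statistic from a series of regression residuals (made-up values
# that drift slowly, i.e. positively serially correlated).
resid = np.array([0.5, 0.6, 0.4, -0.2, -0.5, -0.3, 0.1, 0.4])

dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# DW ~ 2*(1 - r): near 0 -> positive serial correlation, near 2 -> none,
# near 4 -> negative serial correlation
print(round(dw, 3))
```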
Serial correlation (autocorrelation)
Methods to correct serial correlation
Method one: adjusting the coefficient standard errors (e.g., the Hansen method).
• Hansen method also corrects for
conditional heteroskedasticity.
• The white-corrected standard errors are
preferred if only heteroskedasticity is a
problem.
Method two: Modify the regression equation
itself to eliminate the serial correlation.
Heteroskedasticity
Serial correlation (autocorrelation)
Multicollinearity
Multicollinearity
Multicollinearity refers to the condition when
two or more of the independent variables, or
linear combinations of the independent
variables are highly correlated with each other.
In practice, multicollinearity is often a matter of
degree rather than of absence or presence.
High pairwise correlations among the
independent variables are not a necessary
condition for multicollinearity.
Multicollinearity
Effect of multicollinearity
Does not affect the consistency of slope
coefficients;
Coefficients themselves tend to be unreliable;
Multicollinearity
Two methods to detect multicollinearity:
Method one: t-tests indicate that none of the individual coefficients is significantly different from zero (every t-test fails to reject H0, i.e., each bj appears to be zero), while the F-test indicates overall significance (the F-test rejects H0, so at least some bj ≠ 0) and the R² is high.
Multicollinearity
Method two: the absolute value of the sample correlation between any two independent variables is greater than 0.7 (|r| > 0.7).
Summary of assumption violations
Problem | Effect | Detection | Solution
Heteroskedasticity | Incorrect standard errors | 1. Residual scatter plots; 2. Breusch-Pagan χ²-test (BP = n×R²residual) | Use robust standard errors (corrected for conditional heteroskedasticity)
Serial correlation | Incorrect standard errors | 1. Residual scatter plots; 2. Durbin-Watson test (DW ≈ 2×(1−r)) | Use robust standard errors (corrected for serial correlation)
Multicollinearity | High R² and low t-statistics | 1. t-tests fail to reject H0 while the F-test rejects H0 and R² is high; 2. High correlation among independent variables | Remove one or more independent variables
Model specification and errors in specification
Principles of model specification
The model should be grounded in cogent
economic reasoning;
The functional form chosen for the variables in the regression should be appropriate given the nature of the variables;
Model specification and errors in specification
Misspecified functional form
One or more important variables could be
omitted from regression.
One or more of the regression variables
may need to be transformed before
estimating the regression.
The regression model pools data from
different samples that should not be pooled.
Model specification and errors in specification
Other Types of Time-Series Misspecification
Relations among time series with trends;
Relations among time series that may be
random walks.
Models with qualitative dependent variables
Qualitative dependent variables are dummy
variables used as dependent variables
instead of as independent variables.
Probit and logit model
Application of these models results in
estimates of the probability that the event
occurs(e.g., probability of default).
Models with qualitative dependent variables
Discriminant models
Discriminant analysis yields a linear function, similar to a regression equation, which can then be used to create an overall score. Based on the score, an observation can be classified into the bankrupt or not-bankrupt category (Z-score).
Example
Example
(FORT = 1 if the stock is that of a Fortune 500 firm, FORT = 0 if not a Fortune 500 stock)
Example
Variable | Coefficient | Standard Error | t-statistic | p-value
Market Capitalization | 0.0460 | 0.0150 | 3.090 | 0.021
Industry Ranking | 0.7102 | 0.2725 | 2.610 | 0.040
Fortune 500 | 0.9000 | 0.5281 | 1.700 | 0.139
Example
Total 9 24.0000
Test Test-Statistic
Breusch-Pagan 17.7
Durbin-Watson 1.8
Example
Example
with a rank of 3 is closest to:
A. 2.88%.
B. 3.98%.
C. 1.42%.
Example
Example
respectively. Based on this data and the
information in the tables, there is evidence of:
A. only serial correlation.
B. serial correlation and heteroskedasticity.
C. only heteroskedasticity.
Example
1. Answer = C
2. Answer = A
3. Answer = C
4. Answer = C
CONTENTS
Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches:
Scenario Analysis, Decision Trees, and
Simulations”
Time-series analysis
Trend models
Autoregressive(AR) Time-series models
Random walk and unit root
Trend models
In a trend model, the independent variable is time t (t = 1, 2, 3, …).
Trend models
Trend models
Use the Durbin Watson statistic to detect
autocorrelation.
Trend models
[Figure: raw data (Yt, shown as dots) plotted against time with a fitted trend line]
Trend models
[Figure: annual sales (in millions), 1987–2015, plotted with fitted exponential and linear trend lines]
Trend models
Factors that determine which model is best
A linear trend model may be appropriate if the
data points appear to be equally distributed
above and below the regression line.
Trend models
A trend model is not appropriate for a time series (b0 and b1 are inconsistent) when the mean and variance of the time series change over time.
Time-series analysis
Trend models
Autoregressive(AR) Time-series models
Random walk and unit root
Autoregressive(AR) Time-series models
An autoregressive model uses past values of
dependent variable as independent variables.
AR(p) model:
• Xt=b0+b1xt-1+b2xt-2+…+bpxt-p+εt
(εt ~ N(0, σ²), satisfying the white-noise assumptions)
AR(p): AR model of order p (p is the number of lagged terms included in the model).
• A one-period-ahead forecast for an AR(1) model: X̂t+1 = b̂0 + b̂1xt
• A two-period-ahead forecast for an AR(1) model: X̂t+2 = b̂0 + b̂1x̂t+1
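The chain rule of forecasting can be sketched with assumed AR(1) coefficients:

```python
# Chain rule of forecasting for an AR(1) model x_t = b0 + b1*x_{t-1}.
# Coefficients and the current observation are assumed for illustration.
b0, b1 = 1.0, 0.6
x_t = 10.0

x_t1 = b0 + b1 * x_t    # one-period-ahead forecast
x_t2 = b0 + b1 * x_t1   # two-period-ahead: plug the first forecast back in

print(x_t1, x_t2)  # 7.0 5.2
```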
Autoregressive(AR) Time-series models
Covariance-stationary series
No conditional heteroskedasticity
Autoregressive(AR) Time-series models
Covariance-stationary series
Three conditions for covariance stationary:
• Constant and finite expected value of
the time series;
• Constant and finite variance of the time
series;
• Constant and finite covariance with
leading or lagged values.
Stationary in the past does not guarantee
stationary in the future.
Autoregressive(AR) Time-series models
If xt > b0/(1−b1), the model predicts that xt+1 will be lower than xt;
If xt < b0/(1−b1), the model predicts that xt+1 will be higher than xt;
All covariance-stationary time series have a finite mean-reverting level, b0/(1−b1).
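A sketch showing convergence to the mean-reverting level, with assumed coefficients:

```python
# Mean-reverting level of an AR(1): x_mr = b0 / (1 - b1); forecasts converge to it.
b0, b1 = 1.0, 0.6          # assumed coefficients (|b1| < 1, so stationary)
x_mr = b0 / (1 - b1)       # ~2.5

x = 10.0                    # start above the mean-reverting level
path = []
for _ in range(20):
    x = b0 + b1 * x         # each forecast moves closer to x_mr
    path.append(x)

print(round(x_mr, 4), round(path[-1], 4))
```

Starting above the level, each forecast is lower than the last, consistent with the bullets above.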
Autoregressive(AR) Time-series models
Compare forecasting power with RMSE
Typically, we compare the out-of-sample
forecasting performance of forecasting models by
comparing their root mean squared error (RMSE).
• RMSE is the square root of the average
squared error.
RMSE = √( Σi=1..n (Predictedi − Actuali)² / n )
• The model with the smallest RMSE is the
most accurate.
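A sketch comparing two hypothetical forecast series by RMSE (all numbers made up):

```python
import math

# Compare out-of-sample forecasting accuracy of two models via RMSE.
actual = [2.0, 2.5, 3.0, 2.8, 3.2]
model_a = [2.1, 2.4, 3.2, 2.6, 3.1]   # hypothetical forecasts, small errors
model_b = [2.5, 2.0, 3.5, 2.2, 3.8]   # hypothetical forecasts, larger errors

def rmse(pred, act):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, act)) / len(act))

# The model with the smaller RMSE is the more accurate forecaster.
print(rmse(model_a, actual) < rmse(model_b, actual))  # True
```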
Time-series analysis
Trend models
Autoregressive(AR) Time-series models
Random walk and unit root
Random walks and units roots
Random walk
Simple random walk:
• xt=xt-1+εt, E(εt)=0, E(εt2)=σ2, Cov(εt,εs) =
E(εt,εs) = 0 if t≠s.
• A special AR(1) model with b0=0 and b1=1;
• The best forecast of xt is xt-1.
Random walks and units roots
A random walk is not covariance stationary: it has no finite mean-reverting level (b0/(1−b1) = 0/0 is undefined) and its variance grows with time.
• We cannot use standard regression analysis
on a time series that is a random walk. We
should convert the data to a covariance-
stationary time series.
Random walks and units roots
The unit root test of nonstationarity
A common t-test of the hypothesis that b1 = 1 is invalid for testing a unit root.
Dickey-Fuller test (DF test) for the unit root:
• Start with an AR(1) model xt = b0 + b1xt−1 + εt;
• xt − xt−1 = b0 + (b1 − 1)xt−1 + εt, or xt − xt−1 = b0 + g·xt−1 + εt, E(εt)=0;
• Test H0: g = 0 (i.e., b1 = 1, a unit root) against Ha: g < 0.
Random walks and units roots
If a time series appears to have a unit root,
how should we model it?
One method that is often successful is to first-difference the time series and model the first-differenced series as an autoregressive time series.
For a random walk, the first-differenced series yt = xt − xt−1 = εt is an AR(1) model with b0 = b1 = 0;
its mean-reverting level is b0/(1−b1) = 0, and the variance of yt is Var(εt) = σ².
Therefore, the first-differenced variable yt is covariance stationary.
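First-differencing can be illustrated directly: the differenced random walk recovers the white-noise shocks (the shock values below are assumed).

```python
import numpy as np

# A random walk x_t = x_{t-1} + eps_t is not covariance stationary, but its first
# difference y_t = x_t - x_{t-1} = eps_t recovers the (stationary) shocks.
eps = np.array([0.5, -0.3, 0.8, -0.1, 0.2, -0.6])  # assumed white-noise shocks
x = np.cumsum(eps)                                  # random walk built from them

y = np.diff(x)                                      # first-differenced series
print(np.allclose(y, eps[1:]))                      # True: y_t equals eps_t
```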
Example
B. A random-walk-with-drift time series.
C. A time series with unit root
Answer = A.
Time-series analysis
Trend models
Autoregressive(AR) Time-series models
Random walk and unit root
Autoregressive conditional
heteroskedasticity (ARCH)
Heteroskedasticity refers to the situation that the
variance of the error term is not constant.
Test whether a time series is ARCH(1) (in multiple regression, the BP test is used instead):
ARCH(1): εt ~ N(0, a0 + a1ε²t−1)
• The distribution of εt is conditional on its value in the previous period, εt−1.
• Regress the squared residuals on their lagged values: ε̂²t = a0 + a1ε̂²t−1 + μt
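A sketch of the ARCH(1) auxiliary regression. The residual values are made up, and a real test would also check the statistical significance of a1.

```python
import numpy as np

# ARCH(1) test sketch: regress squared residuals on their own lag, look at a1.
resid = np.array([0.2, -0.5, 0.7, -1.1, 1.5, -2.0, 2.6, -3.2])  # made-up residuals
e2 = resid ** 2

# e2_t = a0 + a1 * e2_{t-1} + mu_t  (fit by least squares)
a1, a0 = np.polyfit(e2[:-1], e2[1:], 1)

# If a1 is significantly different from 0, the series exhibits ARCH(1).
print(round(a0, 3), round(a1, 3))
```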
Autoregressive conditional
heteroskedasticity (ARCH)
If the coefficient a1 is significantly different
from 0, the time series is ARCH(1).
If ARCH exists, the standard errors for the regression parameters will not be correct; generalized least squares must be used.
Time-series analysis
Trend models
Autoregressive(AR) Time-series models
Random walk and unit root
Autoregressive conditional heteroskedasticity
(ARCH)
Seasonality in time-series models
Regression with more than one time series
Steps in time-series forecasting
Example
Using quarterly data from the first quarter of 1995 to
the last quarter of 2012, we estimate the following
AR(1) model using ordinary least squares:
(ln Salest-ln Salest-1)=b0+b1(ln Salest-1-ln Salest-2)+εt.
The table shows the results of the regression.
Example
Regression Statistics
R-squared 0.1548
Standard error 0.0762
Observations 74
Durbin-Watson 1.9165
Example
Lag | Autocorrelation | Standard Error | t-statistic
3 | 0.0294 | 0.1162 | 0.2530
4 | 0.7667 | 0.1162 | 6.5981
Example
Suppose we decide to use an autoregressive
model with a seasonal lag because of the
seasonal autocorrelation. We are modeling
quarterly data, so we estimate Equation:
Time-series analysis
Trend models
Autoregressive(AR) Time-series models
Random walk and unit root
Autoregressive conditional heteroskedasticity
(ARCH)
Seasonality in time-series models
Regression with more than one time series
Steps in time-series forecasting
Regression with more than one time series
If one time series has a unit root and the other does not: the error term in the linear regression would not be covariance stationary, so we cannot use linear regression.
Both time series have a unit root: We need to
establish whether the time series are
cointegrated.
Regression with more than one time series
Yt = b0 + b1xt + εt
H0: No cointegration; Ha: Cointegration;
If we cannot reject the null, the error term is not covariance stationary, so we cannot use linear regression;
If we can reject the null, the error term is
covariance stationary. we can use linear
regression to estimate the long-term relation
between the two series.
Time-series analysis
Trend models
Autoregressive(AR) Time-series models
Random walk and unit root
Steps in time-series forecasting
[Flowchart: steps in time-series forecasting]
1. Plot the series and determine whether it has a trend.
2. If the trend is linear, use a linear trend model; if the trend is exponential, use a log-linear trend model.
3. Use the Durbin-Watson statistic to test the trend-model residuals for serial correlation; if there is none, the trend model can be used.
4. If the residuals are serially correlated, use an autoregressive (AR) model; first confirm the series is covariance stationary (first-difference it if it is not).
5. Estimate the AR model and test its residuals for serial correlation; add lags, including a seasonal lag if seasonality is present, until no residual serial correlation remains.
6. Test the residuals for heteroskedasticity (ARCH) and correct the model if ARCH is present.
7. Finally, test the model's out-of-sample forecasting performance (e.g., with RMSE).
CONTENTS
Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches:
Scenario Analysis, Decision Trees, and
Simulations”
Machine Learning
What is machine learning
Deep learning and reinforcement learning
The goal of machine learning algorithms is to generate structure or predictions from data without restrictive a priori assumptions or explicit human programming.
Supervised learning
Supervised learning
Supervised ML algorithms use a labeled data set (training data set) of observed inputs and the associated outputs to infer patterns, and then use these inferred patterns to predict outputs for a new set of inputs (test data set).
Supervised learning
 | In Regression | In ML
Inputs (Xi) | Independent variables | Features
Output (Y) | Dependent variable | Target
Supervised learning
[Figure: supervised learning workflow — a training dataset of labeled inputs and outputs (Xi, Y) feeds the supervised ML algorithm, which infers a prediction rule; the rule is then applied to a test dataset of new inputs (Xi, i = 1,…,N) and the fit of its predictions is evaluated]
Supervised learning
Supervised learning
Classification problems sort observations into distinct categories.
Unsupervised learning
Deep learning and reinforcement learning
In reinforcement learning, a computer learns
from interacting with itself (or data generated by
the same algorithm).
Deep learning and reinforcement learning are
based on neural networks (NNs).
Example
Answer = A
Machine Learning
Unsupervised machine learning algorithms
Neural network, deep learning nets, and
reinforcement learning
Generalization and overfitting
Training sample: used to train (fit) the model;
Validation sample: used for validating and tuning the model;
Test sample: used to test the model’s ability to
predict well on new data.
Training sample is referred to as “in-sample”.
Validation and test samples are referred to as
“out-of-sample”.
Generalization and overfitting
[Figure: three fits of Y on X — underfit, overfit, and good fit]
Errors and overfitting
Low or no in-sample error but large out-of-sample error is indicative of poor generalization.
Total out-of-sample error from three sources:
Bias error: A model produce high bias with
poor approximation, causing underfitting and
high in-sample error.
Variance error: Unstable models pick up
noise and produce high variance, causing
overfitting and high out-of-sample error.
Base error due to randomness in the data.
[Figure: learning curves — accuracy rate versus number of training samples]
Errors and overfitting
[Figure: learning curve illustrating how in-sample and out-of-sample accuracy converge as the number of training samples grows]
Errors and overfitting
The trade-off between overfitting and generalization
is a trade-off between cost and complexity.
[Figure: fitting curve — total error versus model complexity, with the optimal complexity at the minimum of the total-error curve]
Overview of evaluating ML algorithm performance
Preventing overfitting in supervised
machine learning
Two methods to reduce overfitting:
Preventing the algorithm from getting too complex during selection and training;
Proper data sampling using cross-validation.
Preventing overfitting in supervised
machine learning
K-fold cross-validation: the data are shuffled and divided into k equal sub-samples; in each of k rounds, one sub-sample serves as the validation sample and the remaining k−1 sub-samples serve as the training samples.
[Figure: five folds, with a different sub-sample held out as the validation sample in each round]
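A from-scratch sketch of how k-fold index partitioning works (an illustration, not a library API):

```python
# K-fold cross-validation sketch: split n observations into k folds; each fold
# serves once as the validation sample while the rest form the training sample.
def k_fold_indices(n, k):
    fold_size, folds, start = n // k, [], 0
    for i in range(k):
        extra = 1 if i < n % k else 0      # spread any remainder across folds
        folds.append(list(range(start, start + fold_size + extra)))
        start += fold_size + extra
    return folds

n, k = 10, 5
folds = k_fold_indices(n, k)
for fold in folds:
    train = [i for i in range(n) if i not in fold]
    # ... fit on `train`, validate on `fold` ...
    print(fold, len(train))
```

Every observation appears in exactly one validation fold across the k rounds.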
Example
Example
By implementing which one of the following actions
can Anderson address the problem?
A. Estimate and incorporate into the model a
penalty that decreases in size with the number
http://www.zejicert.cn
of included features.
B. Use the k-fold cross-validation technique to estimate the model's out-of-sample error, and then adjust the model accordingly.
C. Use an unsupervised learning model.
Answer = B
Supervised machine learning algorithms
http://www.zejicert.cn
Classification and regression tree (CART)
Ensemble learning and random forest
Penalized regression
A special case of the generalized linear model (GLM).
LASSO is a popular regularization technique in which the penalty term drives the coefficients of the least important features to zero, effectively removing them from the model.
Penalized regression and LASSO
http://www.zejicert.cn
• Penalized regression ensures that a feature is included only if the sum of squared residuals declines by more than the penalty term increases.
The LASSO objective is to minimize:

  Σ(i=1..n) (Yi − Ŷi)² + λ · Σ(k=1..K) |b̂k|,  with λ > 0

The greater the number of included features, the larger the penalty term.
Lambda (λ) is a parameter whose value must be set by the researcher before learning begins; it is called a hyperparameter.
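The penalized objective can be checked numerically. The fitted values, coefficients, and λ below are invented for illustration: an extra feature is worthwhile only if its reduction in SSE exceeds the growth of the penalty term.

```python
# Sketch of the penalized-regression (LASSO) objective: sum of squared
# errors plus lambda times the sum of absolute coefficients.
def lasso_objective(y, y_hat, betas, lam):
    sse = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    penalty = lam * sum(abs(b) for b in betas)
    return sse + penalty

y = [1.0, 2.0, 3.0]
# Model A: one feature, rough fit
obj_a = lasso_objective(y, [1.2, 1.9, 3.3], betas=[0.9], lam=1.0)
# Model B: two features, slightly better fit but a larger penalty term
obj_b = lasso_objective(y, [1.1, 2.0, 3.1], betas=[0.9, 0.8], lam=1.0)
# Here the SSE improvement (0.14 -> 0.02) is smaller than the penalty
# increase (0.9 -> 1.7), so the extra feature is not worth including.
```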
Support vector machine (SVM)
http://www.zejicert.cn
[Figure: two classes of data points separated by a linear boundary, with the support vectors highlighted]
Support vector machine (SVM)
http://www.zejicert.cn
furthest away from all the observations
(maximum margin).
The margin is determined by the observations
closest to the boundary in each set, and these
observations are called support vectors.
Y Y
http://www.zejicert.cn
X X
100
K-nearest neighbor (KNN)
K-nearest neighbor (KNN) classifies a new observation by choosing the classification with the largest number of nearest (most similar) neighbors out of the k being considered.
KNN makes no assumptions about the distribution of the data.
K is a hyperparameter.
A distance metric is used to measure similarity.
KNN can be used directly for multi-class classification.
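A minimal KNN sketch, assuming Euclidean distance and majority voting; the labeled points are made up.

```python
# Minimal k-nearest-neighbor classifier: Euclidean distance, majority vote.
from collections import Counter
import math

def knn_predict(train, new_point, k=3):
    """train: list of ((x1, x2), label). Returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], new_point))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
label = knn_predict(train, (0.5, 0.5), k=3)   # the 3 nearest neighbors are all "A"
```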
Classification and regression tree (CART)
http://www.zejicert.cn
of the algorithm at each terminal node will be
the category with the majority of data points.
If the goal is regression, then the prediction at
each terminal node is the mean of the labeled
values.
f: feature
c: cutoff value
Root Node
(f1, c1)
http://www.zejicert.cn
102
Classification and regression tree (CART)
http://www.zejicert.cn
group error than before.
To avoid overfitting, regularization parameters,
such as the maximum depth of the tree, the
minimum population at a node, or the maximum
number of decision nodes can be added.
103
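A single CART split can be sketched as a scan over cutoff values for one feature, choosing the cutoff that minimizes misclassification; the feature values and labels are invented.

```python
# Sketch of one CART decision node: scan cutoffs for a single feature and
# pick the one minimizing classification error (majority label on each side).
def best_split(xs, labels):
    """Return (cutoff, error) for the rule 'x <= cutoff -> left majority'."""
    best = (None, float("inf"))
    for c in sorted(set(xs)):
        left = [l for x, l in zip(xs, labels) if x <= c]
        right = [l for x, l in zip(xs, labels) if x > c]
        err = 0
        for side in (left, right):
            if side:
                majority = max(set(side), key=side.count)
                err += sum(1 for l in side if l != majority)
        if err < best[1]:
            best = (c, err)
    return best

xs = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
cutoff, err = best_split(xs, labels)   # a clean split exists at x <= 3
```

A full tree would apply this search recursively to each resulting group.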
Ensemble learning and random forest
http://www.zejicert.cn
algorithms is known as the ensemble method.
104
Ensemble learning and random forest
http://www.zejicert.cn
have trained, the higher the accuracy of the
aggregated prediction up to a point. (the law
of large numbers)
While, there is an optimal number of models
beyond which performance would be
expected to deteriorate from overfitting.
105
Ensemble learning and random forest
http://www.zejicert.cn
regression.
106
Ensemble learning and random forest
http://www.zejicert.cn
It tends to protect against overfitting on the
training data.
It also reduces the ratio of noise to signal.
An important drawback of random forest is
that it lacks the ease of interpretability of
individual trees.(black box-type algorithm)
Machine Learning
107
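Majority-vote aggregation, the simplest ensemble scheme, can be sketched directly; the three models' predictions below are hypothetical.

```python
# Sketch of ensemble learning by majority vote across several classifiers.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of lists, one list of class labels per model."""
    n_obs = len(predictions[0])
    combined = []
    for i in range(n_obs):
        votes = Counter(model[i] for model in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

model_preds = [
    [1, 0, 1, 1],   # model 1
    [1, 1, 1, 0],   # model 2
    [0, 0, 1, 1],   # model 3
]
ensemble = majority_vote(model_preds)   # -> [1, 0, 1, 1]
```

Individual errors that are not shared across models are outvoted, which is why aggregation tends to improve accuracy.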
Unsupervised machine learning algorithms
http://www.zejicert.cn
Hierarchical Clustering
Dendrograms
108
Principal Components Analysis
http://www.zejicert.cn
An eigenvalue gives the proportion of total
variance in the initial data that is explained by
each eigenvector.
PCA based on the proportion of variation in
the data set to select the principal component.
PC2 X
Z
109
Principal Components Analysis
Scree plots
[Figure: scree plot — percent of variance explained by each principal component, declining from about 0.4 for the first component; the first few components together explain ≈ 80% of total variance]
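For two features, PCA can be sketched with the closed-form eigenvalues of the 2×2 covariance matrix; the data below are made up.

```python
# PCA sketch for two features: compute the sample covariance matrix and its
# eigenvalues in closed form, then the proportion of variance explained by
# the first principal component.
import math

xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance matrix [[a, b], [b, c]]
a = sum((x - mx) ** 2 for x in xs) / (n - 1)
c = sum((y - my) ** 2 for y in ys) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Eigenvalues of a 2x2 symmetric matrix, largest first
root = math.sqrt((a - c) ** 2 + 4 * b ** 2)
eig1, eig2 = (a + c + root) / 2, (a + c - root) / 2

# Proportion of total variance explained by the first principal component
explained_1 = eig1 / (eig1 + eig2)
```

Because the two features are highly correlated, the first component explains most of the total variance, which is exactly what a scree plot visualizes.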
Clustering
http://www.zejicert.cn
distance between two points.
K-Means Clustering
111
K-Means Clustering
http://www.zejicert.cn
C3 C3 C3
C2 C2 C2
C1
C1 C3 C1 C3
C3
C2 C2
C2
4
5 6
http://www.zejicert.cn
9
8
3
7
2
1 10
112
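The assignment/update iteration can be sketched as Lloyd's algorithm; the points and initial centroids below are arbitrary.

```python
# Minimal sketch of Lloyd's k-means algorithm on made-up 2-D points.
import math

def k_means(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                          # assignment step
            i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):    # update step: centroid = cluster mean
            if members:
                centroids[i] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = k_means(pts, centroids=[(0, 0), (10, 10)])
```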
Hierarchical Clustering
Agglomerative clustering (or bottom-up) hierarchical clustering begins with each observation as its own cluster and merges the two closest clusters at each step.
Divisive clustering (or top-down) hierarchical clustering begins with all observations in a single cluster and progressively splits it into smaller clusters.
[Figure: ten numbered observations grouped step by step into nested clusters]
Dendrogram
[Figure: dendrogram — clusters of observations A–K are joined by arches at increasing distances; cutting the diagram at different heights yields 2, 6, or 11 clusters]
Neural Networks
http://www.zejicert.cn
complex interactions among features.
Neural networks can be supervised or
unsupervised.
Neural Networks
X1
http://www.zejicert.cn
X2 Y
X3
Y=aX1+bX2+cX3
115
Neural Networks
[Figure: network with one hidden layer — inputs X1, X2, X3 feed hidden nodes A1 and A2, which feed output Y]

  Y = a·max(0, X1+X2+X3) + b·max(0, X2+X3) = a·A1 + b·A2

where the activation function is A = f(x) = max(0, x) (the rectified linear unit, ReLU).
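The toy network above translates directly into a forward pass; the weights a and b are the slide's placeholder output-layer weights.

```python
# Forward pass for the two-hidden-node toy network: ReLU activations, then a
# weighted sum at the output node. Weights a and b are illustrative.
def relu(x):
    return max(0.0, x)

def forward(x1, x2, x3, a=1.0, b=2.0):
    a1 = relu(x1 + x2 + x3)   # hidden node 1 sums X1, X2, X3
    a2 = relu(x2 + x3)        # hidden node 2 sums X2, X3
    return a * a1 + b * a2    # output: weighted sum of activations

y = forward(1.0, 2.0, 3.0)   # a1 = 6, a2 = 5 -> 1*6 + 2*5 = 16
```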
Neural Networks
http://www.zejicert.cn
units of the data.
Neural Networks
117
Neural Networks
http://www.zejicert.cn
Input #1
Input #2 Output
Input #3
Neural Networks
118
Neural Networks
http://www.zejicert.cn
a weight and sums the weighted values to form
the total net input.
Neural Networks
119
Neural Networks
Activation function
The activation function transforms the total net input into the node's output; a sigmoid function squashes it to a value between 0 and 1, operating like a dimmer switch that decreases or increases the strength of the input.
[Figure: sigmoid operator mapping total net input to an output between 0 and 1]
Neural Networks
http://www.zejicert.cn
(Partial derivative of the total error with
respect to the old weight)
121
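The update rule can be sketched on a one-weight model ŷ = w·x with squared error; repeated updates drive w toward the least-squares value. The learning rate and data are arbitrary.

```python
# Sketch of the backpropagation weight-update rule,
#   new_weight = old_weight - learning_rate * dE/dw,
# for the one-weight model y_hat = w * x with error E = (y - w*x)^2.
def update_weight(w, x, y, lr=0.1):
    grad = -2 * x * (y - w * x)   # partial derivative of the error w.r.t. w
    return w - lr * grad

w = 0.0
for _ in range(50):               # repeated updates drive w toward y/x = 2.0
    w = update_weight(w, x=1.0, y=2.0)
```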
Reinforcement Learning
http://www.zejicert.cn
rewards over time, taking into consideration
the constraints of its environment.
Example 1
122
Example 1
http://www.zejicert.cn
Answer = B
Example 2
A. K-Means Clustering
B. Principal Components Analysis (PCA)
C. Classification and Regression Trees (CART)
http://www.zejicert.cn
Answer = A
Summary of ML algorithms
Supervised learning, by target variable type:
Continuous target — Regression:
• Linear; Penalized Regression/LASSO
• Logistic
• CART
• Random Forest
Categorical target — Classification:
• Logit
• SVM
• KNN
• CART
Continuous or categorical target:
• Neural Networks; Deep Learning; Reinforcement Learning
How to choose among ML algorithms
Summary of ML algorithms
Unsupervised learning, by variable type:
Continuous or categorical variables:
Dimensionality Reduction
• PCA
Clustering
• K-Means
• Hierarchical
CONTENTS
Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations”
Big Data Projects
http://www.zejicert.cn
Data preparation and wrangling
Data exploration objectives and methods
Model Training
Characteristics of big data
http://www.zejicert.cn
different data sources.
These Vs have numerous implications for
financial technology fintech pertaining to
investment management.
127
Steps in executing a data analysis project:
Financial forecasting with big data
The traditional (with structured data) ML
model building steps:
Conceptualization of the modeling task.
http://www.zejicert.cn
• Determining the output of the model;
how this model will be used and by
whom, and how it will be embedded in
business processes.
128
Steps in executing a data analysis project:
Financial forecasting with big data
The text ML Model Building Steps:
Text problem formulation.
• How to formulate the text classification
http://www.zejicert.cn
problem, identifying the inputs and
outputs, how the text ML model’s
classification output will be utilized.
Data (text) curation.
• Gathering external text data via web
services or web spidering programs.
Text preparation and wrangling.
• Cleansing and preprocessing the raw text into a structured format.
Text exploration.
• Text visualization, text feature selection and engineering.
Model training.
Data preparation and wrangling
http://www.zejicert.cn
by dealing with outliers, extracting useful
variables from existing data points, and
scaling the data.
131
Data preparation and wrangling: Structured Data
http://www.zejicert.cn
mitigating all data errors.
• Incompleteness error: the data are not
present.
• Invalidity error: the data are outside of a
meaningful range.
132
Possible errors in a raw dataset
| ID | Name | Gender | Date of Birth | Salary | Province | Marital status |
|----|------|--------|---------------|--------|----------|----------------|
| 1 | Mr. Zhao | M | 12/5/1980 | ¥200,200 | Shanghai | Y |
| 2 | Ms. Qian | M | 15 Jan, 1975 | ¥260,500 | Guangdong | Yes |
| 3 | Sun | | 1/13/1989 | ¥265,000 | Beijing | No |
| 4 | Ms. Li | F | 1/1/1900 | NA | Fujian | Don't Know |
| 5 | Ms. Qian | F | 15/1/1976 | ¥160,500 | | Y |
| 6 | Mr. Zhou | M | 9/10/1971 | — | Sichuan | N |
| 7 | Mr. Wu | M | 2/27/1966 | ¥300,000 | Jiangsu | Y |
| 8 | Ms. Zheng | F | 4/4/1984 | ¥255,000 | SX | N |
Data preparation and wrangling: Structured Data
http://www.zejicert.cn
intuitively not needed for the project can be
removed.
• Conversion: The variables in the dataset must
be converted into appropriate types.
Scaling
Scaling is a process of adjusting the range
of a feature by shifting and changing the
http://www.zejicert.cn
scale of data.
It is important to remove outliers before
scaling is performed.
134
Data preparation and wrangling: Structured Data
http://www.zejicert.cn
trimming (also called truncation).
When extreme values and outliers are
replaced with the maximum (for large value
outliers) and minimum (for small value outliers)
values of data points that are not outliers, the
process is known as winsorization.
Xi(normalized) =
Xmax −Xmin
Standardization: both centering and scaling the
variables. (variable will have an arithmetic mean
of 0 and standard deviation of 1)
X −μ
Xi(standardized) = i
σ
135
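Both rescaling formulas translate directly into code; the sample values below are arbitrary.

```python
# Min-max normalization and standardization, matching the two formulas above.
import statistics

def normalize(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]   # rescales to [0, 1]

def standardize(xs):
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)               # population standard deviation
    return [(x - mu) / sigma for x in xs]       # mean 0, standard deviation 1

data = [10.0, 20.0, 30.0, 40.0, 50.0]
norm = normalize(data)       # -> [0.0, 0.25, 0.5, 0.75, 1.0]
std = standardize(data)
```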
Data preparation and wrangling: Structured Data
http://www.zejicert.cn
outliers. However, the data must be normally
distributed to use standardization.
136
Data preparation and wrangling:
Unstructured (Text) Data
Text processing is essentially cleansing and
transforming the unstructured text data into a
structured format.
http://www.zejicert.cn
Text Preparation (Cleansing)
Text cleansing involves use regular
expressions (regex) to clean the text by
removing unnecessary elements from the
raw text.
Remove Numbers
Remove white spaces
137
Data preparation and wrangling:
Unstructured (Text) Data
Example of raw text containing HTML tags to be removed:

<html>
<body>
<p><b>New U.S. claims for jobless benefits fell for the third week in a row, hitting their lowest level in nearly 49 years for the third straight week.</b></p>
<p>For the week ended September 12, new claims for unemployment insurance fell to 201,000, down 3,000 from the prior week. Economists had instead been expecting a result of 209,000.</p>
</body>
</html>
Data preparation and wrangling:
Unstructured (Text) Data
Text Wrangling (Preprocessing)
A token is equivalent to a word, and tokenization is the process of splitting a given text into separate tokens.
[Figure: cleaned texts are each split into their individual tokens]
Normalization includes converting all tokens to lower case.
Stop words are commonly used words such as "the," "is," and "a"; depending on the application, they will be kept or removed.
Data preparation and wrangling:
Unstructured (Text) Data
Stemming is the process of converting
inflected forms of a word into its base word
(known as stem).
Lemmatization is the process of converting inflected forms of a word into its morphological root (known as lemma).
Stemming or lemmatization decreases data sparseness.
After the cleansed text is normalized, a bag-of-words (BOW) is created.
Data preparation and wrangling:
Unstructured (Text) Data
Document term matrix (DTM)
Each row of the matrix belongs to a document (or text file), and each column represents a token (or term).
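The pipeline from cleansed text to a document term matrix can be sketched as follows; the stop-word list and the two sample sentences are illustrative, not from the curriculum.

```python
# Sketch of text wrangling: lowercase, tokenize, remove stop words, then
# build a bag-of-words vocabulary and a document term matrix (DTM).
import re

STOP_WORDS = {"the", "is", "a", "for", "of"}

def tokenize(text):
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)        # drops punctuation and numbers
    return [t for t in tokens if t not in STOP_WORDS]

docs = ["The market is up.", "The market is down, down for the week."]
tokenized = [tokenize(d) for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))   # bag-of-words
# DTM: one row per document, one column per token, cell = count
dtm = [[doc.count(term) for term in vocab] for doc in tokenized]
```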
Example 1
http://www.zejicert.cn
B. confusion matrix.
C. document term matrix
Answer = C
Example 2
A. stemming.
B. scaling.
C. data cleansing.
Answer = A
Example 3
http://www.zejicert.cn
B. most commonly performed at the character
level.
C. the process of splitting a given text into
separate tokens.
Answer = C
Data exploration objectives and methods
Data exploration
Investigate and comprehend data
distributions and relationships, involves:
exploratory data analysis, feature selection,
http://www.zejicert.cn
and feature engineering.
Exploratory data analysis (EDA)
Exploratory graphs, charts, and other
visualizations, such as heat maps and
word clouds, are designed to summarize
and observe data.
Feature selection
Only pertinent features from the dataset
are selected for ML model training.
http://www.zejicert.cn
Feature engineering
Creating new features by changing or
transforming existing features.
Data exploration objectives and methods:
Structured Data
Exploratory data analysis
For structured data, EDA can be performed
on a single feature (one-dimension) or on
multiple features (multi-dimension).
http://www.zejicert.cn
145
Data exploration objectives and methods:
Structured Data
http://www.zejicert.cn
Data exploration objectives and methods:
Structured Data
Feature Selection
The objective of the feature selection process is to assist in identifying significant features.
Statistical measures can be used to assign a rank to each feature's importance, from which the most relevant features can be selected.
Data exploration objectives and methods:
Structured Data
Feature Engineering
The feature engineering process attempts to
further optimize and improve the features.
http://www.zejicert.cn
• This action involves engineering an
existing feature into a new feature or
decomposing it into multiple features.
• For continuous data, a new feature may
be created by taking the logarithm of the
product of two or more features.
Data exploration objectives and methods:
Unstructured (Text) Data
Exploratory Data Analysis
Exploratory Data Analysis
Term frequency (TF): the ratio of the number of times a given token occurs in all the texts in the dataset to the total number of tokens in the dataset.
Data exploration objectives and methods:
Unstructured (Text) Data
The most common applications:
• Text classification uses supervised ML
approaches to classify texts into different classes.
http://www.zejicert.cn
• Topic modeling uses unsupervised ML
approaches to group the texts in the dataset into
topic clusters.
• Sentiment analysis predicts sentiment (negative,
neutral, or positive) of the texts in a dataset using
both supervised and unsupervised approaches.
Data exploration objectives and methods:
Unstructured (Text) Data
The general feature selection methods in text data
are as follows:
Frequency measures can be used for vocabulary pruning to remove noise features by:
• Filtering the tokens with very high and low TF values across all the texts.
Data exploration objectives and methods:
Unstructured (Text) Data
Chi-square test: The chi-square test is applied to
test the independence of two events: occurrence of
the token and occurrence of the class.
http://www.zejicert.cn
Tokens with the highest chi-square test statistic
values occur more frequently in texts associated
with a particular class and therefore can be
selected for use as features for ML model
training due to higher discriminatory potential.
Data exploration objectives and methods:
Unstructured (Text) Data
Feature Engineering
The goal of feature engineering is to maintain the semantic essence of the text while simplifying and converting it into structured data for ML.
Data exploration objectives and methods:
Unstructured (Text) Data
• Named entity recognition (NER): The named entity recognition algorithm analyzes individual tokens and their surrounding semantics while referring to a dictionary to tag an object class to each token.
• Parts of speech (POS): Similar to NER, parts-of-speech tagging uses language structure and dictionaries to tag every token in the text with a corresponding part of speech.
Model Training
http://www.zejicert.cn
of the model.
Number of Features
• A dataset with a small number of
features can lead to underfitting
• A dataset with a large number of
features can lead to overfitting.
Model Training
154
Model Training
Method Selection
Method selection is governed by the
following factors:
http://www.zejicert.cn
• Supervised or unsupervised learning.
• Type of data.
• Size of data.
Model Training
To deal with mixed data, the results from more
than one method can be combined. Sometimes,
the predictions from one method can be used as
predictors (features) by another.
http://www.zejicert.cn
155
Model Training
• Class imbalance can be handled by undersampling the majority class and oversampling the minority class.
[Figure: the majority class is undersampled and the minority class is oversampled until Class 0 and Class 1 are balanced]
Model Training
Performance Evaluation
Error analysis.
• For classification problems, error
http://www.zejicert.cn
analysis involves computing four basic
evaluation metrics: true positive (TP),
false positive (FP), true negative (TN),
and false negative (FN) metrics. FP is
also called a Type I error, and FN is
also called a Type II error.
Model Training
http://www.zejicert.cn
Trading off precision and recall is subject
to business decisions and model
application.
Model Training
158
Model Training
http://www.zejicert.cn
performance measure if the number of classes
is equal in the dataset, .
High scores on both of these metrics suggest
good model performance.
Model Training
159
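The four metrics can be computed directly from confusion-matrix counts; the counts below are made up.

```python
# Classification metrics from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)                  # of predicted positives, how many are right
    recall = tp / (tp + fn)                     # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=80, fp=20, tn=110, fn=7)
```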
Model Training
http://www.zejicert.cn
measures the area under the ROC curve.
An AUC close to 1.0 indicates near perfect
prediction, while an AUC of 0.5 signifies
random guessing.
Model Training
Model 2
AUC>0.7
http://www.zejicert.cn
Model 3
AUC=0.5
0
False Positive Rate(FPR) 1
160
Model Training
http://www.zejicert.cn
regression methods.
n
∑ (Predictedi −Actuali )2
• RMSE= i−1
n
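The RMSE formula translates directly into code; the predictions and actuals below are made up.

```python
# RMSE matching the formula above.
import math

def rmse(predicted, actual):
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

error = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])   # sqrt((1 + 0 + 4) / 3)
```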
Model Training
Tuning
Model fitting has two types of error: bias and variance. Bias error is associated with underfitting, and variance error is associated with overfitting.
Model Training
http://www.zejicert.cn
model parameters and are manually set and
tuned, not dependent on the training data.
Model Training
162
Model Training
http://www.zejicert.cn
needs to be tuned to improve the overall
accuracy of the larger model.
Ceiling analysis is a systematic process of
evaluating different components in the
pipeline of model building.
Model Training
Errortrain
Errorcv>>Errortrain
Error
Overfitting
Underfitting
Small
Error
Slight Regularization Large Regularization
163
Example
http://www.zejicert.cn
results
Class ‘0’ FN = 7 TN = 110
Example
B. 90%.
C. 98%.
Answer = A
Example
http://www.zejicert.cn
B. 81%.
C. 93%.
Answer = C
Example
B. 90%.
C. 83%.
Answer = B
CONTENTS
Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations”
Scenario analysis, decision trees,
and simulations
Scenario analysis, which applies probabilities to a small number of possible outcomes, and decision trees, which use tree diagrams of possible outcomes, are techniques used to assess risk; simulations are also used to assess risk.
Steps in running a simulation:
Determine the probabilistic variables;
Define probability distributions for these variables, based on:
• Historical data;
• Cross-sectional data;
• Statistical distribution and parameters.
Check for correlation across variables;
Run the simulation.
Scenario analysis, decision trees,
and simulations
Issues in simulation:
Garbage in, garbage out
Real data may not fit distributions
Non-stationary distribution
http://www.zejicert.cn
Changing correlation across inputs
http://www.zejicert.cn
扫扫加入CFA金融题库 扫扫获取更多考试资讯
THANKS
169