
CFA Level II
Quantitative Methods

Topic Area Weights in CFA Level II

Topic Area                          Weights (%)
Ethics and Professional Standards   10%-15%
Quantitative Methods                5%-10%
Economics                           5%-10%
Financial Reporting and Analysis    10%-15%
Corporate Finance                   5%-10%
Portfolio Management                5%-15%
Equity Investments                  10%-15%
Fixed Income                        10%-15%
Derivatives                         5%-10%
Alternative Investments             5%-10%
Total:                              100%

CONTENTS

Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from "Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations"

Introduction to Linear Regression

 Review
 The simple linear regression model
 The assumptions of the linear regression
 Analysis of Variance (ANOVA)
 Standard error of estimate (SEE)
 The coefficient of determination (R²)
 Hypothesis testing for the regression coefficient
 Determining a prediction interval
 Limitations of regression analysis

Review

 Four steps of hypothesis testing
 P-value
 Type I and type II errors

Four steps of hypothesis testing

 Stating the hypothesis
 Null hypothesis (H0) & alternative hypothesis (Ha)
• Two-tailed test
 H0: μ=0; Ha: μ≠0
• One-tailed test
 H0: μ≥0; Ha: μ<0
 H0: μ≤0; Ha: μ>0

Four steps of hypothesis testing

 Test-Statistic
 Tests concerning a single mean
• H0: μ=μ0; Ha: μ≠μ0
• z = (X̄ − μ0)/(σ/√n); t(n−1) = (X̄ − μ0)/(s/√n)

Four steps of hypothesis testing

 Test-Statistic
 Tests concerning correlation
• r = Cov(X,Y)/(sX·sY)
• H0: ρ=0; Ha: ρ≠0
• t = r√(n−2)/√(1−r²), df = n−2

Four steps of hypothesis testing

 Test-Statistic
 Tests concerning correlation
(Figure: scatter plots of Y against X illustrating strong and weak positive, strong and weak negative, and near-zero correlation.)

Student's t-distribution
(Table of critical t-values.)

Four steps of hypothesis testing

 Stating the decision rule
 Reject H0 if |test statistic| > critical value
• μ is significantly different from μ0.
 Fail to reject H0 if |test statistic| < critical value
• μ is not significantly different from μ0.

Four steps of hypothesis testing

 Stating the decision rule
(Figure: for a two-tailed test at the 5% level, 2.5% lies in each tail and H0 is rejected if the test statistic falls below −1.96 or above +1.96; for a one-tailed test, 5% lies in the upper tail and H0 is rejected if the test statistic exceeds +1.645.)

 Making the statistical decision
P-value

 P-value
 If the P-value < alpha, we reject the null hypothesis.
(Figure: for a two-tailed test at the 5% significance level, the critical values cut off α/2 = 2.5% in each tail; here the test statistic cuts off P/2 = 1.07% in each tail, so the P-value of 2.14% is less than 5% and H0 is rejected.)

Type I and type II errors

 Type I and type II errors

                    True Condition
Decision            H0 is true               H0 is false
Do not reject H0    Correct decision         Incorrect decision:
                                             Type II error
Reject H0           Incorrect decision:      Correct decision:
                    Type I error             power of the test
                    (significance level α    = 1 − P(Type II error)
                    = P(Type I error))
Type I and type II errors

 Type I and type II errors
 P(I) = α
 P(II) = 1 − power of the test
 P(I) + P(II) ≠ 1
 Holding n constant: as P(I) increases, P(II) decreases.
 As n increases, both P(I) and P(II) decrease.

The simple linear regression model

 Regression analysis is concerned with the study of the relationship between one variable (the dependent or explained variable) and other variables (independent or explanatory variables).
 Regression does not necessarily imply causation.
 The objective of regression analysis is to predict or forecast the dependent variable.
The simple linear regression model

 The simple linear regression model
 Yi=b0+b1Xi+εi, i=1,…,n
• Yi = ith observation of the dependent variable, Y;
• Xi = ith observation of the independent variable, X;
• b0 = regression intercept term;
• b1 = regression slope coefficient;
• εi = the residual for the ith observation (the disturbance term or error term).

The simple linear regression model

 The Ordinary Least Squares (OLS) estimation of the sample coefficients:
 Minimize ∑ei² = ∑[Yi − (b0+b1Xi)]²
 b1 = ∑(Xi−X̄)(Yi−Ȳ) / ∑(Xi−X̄)² = Cov(X,Y)/Var(X); b0 = Ȳ − b1X̄
 The estimated coefficients guarantee that the point (X̄, Ȳ) is on the regression line.
The simple linear regression model

 Interpretation of regression coefficients
 The estimated intercept coefficient (b0): the value of Y when X is equal to zero.
 The estimated slope coefficient (b1): the sensitivity of Y to a change in X.

The assumptions of the linear regression

 A linear relationship exists between X and Y;
 The independent variable, X, is not random (X is uncorrelated with the error term);
 The expected value of the error term is zero (i.e., E(εi)=0);
 The variance of the error term is constant (i.e., the error terms are homoscedastic);
 The error term is uncorrelated across observations (i.e., E(εiεj)=0 for all i≠j);
 The error term is normally distributed.
Example

Estimate the regression relationship between height (X) and weight (Y):
Heights, entered as X01 through X10: 170, 165, 168, 155, 173, 180, 185, 171, 167, 176.
Weights, entered as Y01 through Y10: 65, 52, 60, 48, 59, 75, 85, 70, 68, 72.

Example

Calculator keystrokes and results (see the Python sketch below for a cross-check):

Keystrokes                      Explanation                       Display
[2ND] [7] [2ND] [CLR WORK]      Clear the DATA worksheet memory   X01=0.0000
170 [ENTER]                     Enter X01                         X01=170.0000
[↓] 65 [ENTER]                  Enter Y01                         Y01=65.0000
[↓] 165 [ENTER]                 Enter X02                         X02=165.0000
[↓] 52 [ENTER]                  Enter Y02                         Y02=52.0000
Enter X03, Y03, …, X10, Y10 in the same way.

Example

Keystrokes          Display        Explanation
[2ND] [8] → [STAT]  LIN            Enter the STAT worksheet
[↓]                 n=10.0000      10 data pairs were entered
[↓]                 X̄=171.0000     The mean of X is 171.0000
[↓]                 SX=8.3267      For a sample, the sample standard deviation of X is 8.3267
[↓]                 σX=7.8994      For a population, the population standard deviation of X is 7.8994

Example

Keystrokes   Display        Explanation
[↓]          Ȳ=65.4000      The mean of Y is 65.4000
[↓]          SY=11.0574     For a sample, the sample standard deviation of Y is 11.0574
[↓]          σY=10.4900     For a population, the population standard deviation of Y is 10.4900
[↓]          a=-139.0326    The regression intercept is -139.0326
[↓]          b=1.1955       The regression slope is 1.1955
[↓]          r=0.9003       The correlation between height and weight is 0.9003
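For readers who prefer to verify the calculator output programmatically, here is a minimal Python sketch (not part of the original example) that reproduces the same statistics using only the standard library:

```python
from statistics import mean, stdev, pstdev

x = [170, 165, 168, 155, 173, 180, 185, 171, 167, 176]  # heights (X)
y = [65, 52, 60, 48, 59, 75, 85, 70, 68, 72]            # weights (Y)

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# OLS slope and intercept: b1 = Cov(X,Y)/Var(X), b0 = Y-bar - b1*X-bar
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sample correlation coefficient
r = sxy / ((n - 1) * stdev(x) * stdev(y))

print(f"n={n}, X-bar={x_bar:.4f}, SX={stdev(x):.4f}, sigma-X={pstdev(x):.4f}")
print(f"Y-bar={y_bar:.4f}, SY={stdev(y):.4f}, sigma-Y={pstdev(y):.4f}")
print(f"a={b0:.4f}, b={b1:.4f}, r={r:.4f}")
# Output matches the calculator: a=-139.0326, b=1.1955, r=0.9003
```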
Analysis of Variance (ANOVA)

(Figure: the fitted line Ŷi = b0 + b1Xi; for each observation, the total deviation (Yi − Ȳ) splits into the explained deviation (Ŷi − Ȳ) and the unexplained residual (Yi − Ŷi).)

Analysis of Variance (ANOVA)

 Total variation: total sum of squares (TSS) = ∑(Yi − Ȳ)²
 Explained variation: regression sum of squares (RSS) = ∑(Ŷi − Ȳ)²
 Unexplained variation: sum of squared errors (SSE) = ∑(Yi − Ŷi)²
 ∑(Yi − Ȳ)² = ∑(Ŷi − Ȳ)² + ∑(Yi − Ŷi)²  TSS = RSS + SSE

Analysis of Variance (ANOVA)

 Analysis of Variance (ANOVA) table

            df     SS    MSS
Regression  k=1    RSS   RSS/k
Residual    n-2    SSE   SSE/(n-2)
Total       n-1    TSS   -

Standard error of estimate (SEE)

 SEE
 The standard error of estimate (SEE, sometimes called the standard error of the regression) is the standard deviation of the error term.
• SEE = √[∑(Yi − Ŷi)²/(n−2)] = √[∑ei²/(n−2)] = √[SSE/(n−2)]
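The ANOVA decomposition and the SEE formula can be illustrated with a short, hedged Python sketch that reuses the height/weight fit from the earlier example (the coefficients are the rounded values printed above, so RSS + SSE matches TSS only approximately):

```python
x = [170, 165, 168, 155, 173, 180, 185, 171, 167, 176]
y = [65, 52, 60, 48, 59, 75, 85, 70, 68, 72]
b0, b1 = -139.0326, 1.1955           # coefficients estimated earlier
n = len(x)

y_bar = sum(y) / n
y_hat = [b0 + b1 * xi for xi in x]   # fitted values

tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
rss = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

see = (sse / (n - 2)) ** 0.5         # standard error of estimate
r2 = rss / tss                       # coefficient of determination

print(f"TSS={tss:.2f}  RSS={rss:.2f}  SSE={sse:.2f}  (RSS+SSE={rss + sse:.2f})")
print(f"SEE={see:.4f}  R^2={r2:.4f}")   # R^2 = r^2 = 0.9003^2 ~ 0.8105
```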
Standard error of estimate (SEE)

 SEE
 The SEE gauges the "fit" of the regression line. The smaller the standard error, the better the fit.
 SEE will be low (relative to total variability) if the relationship is very strong, and high if the relationship is weak.

The coefficient of determination (R²)

 R² (the coefficient of determination)
 R² (coefficient of determination, or goodness of fit): the percentage of the total variation in the dependent variable explained by the independent variable (0 ≤ R² ≤ 1).

The coefficient of determination (R²)

 R² (the coefficient of determination)
 R² = r²
• This identity cannot be used in multiple regression.
• In multiple regression, √R² is called multiple R.
 R² = RSS/TSS = 1 − SSE/TSS

The coefficient of determination (R²)

 The difference between R² and the correlation coefficient
 The correlation coefficient indicates the sign of the relationship between two variables, whereas the coefficient of determination does not.
 The coefficient of determination can apply to an equation with several independent variables, and it implies explanatory power, while the correlation coefficient only applies to two variables and does not imply an explanatory relationship between them.

Hypothesis testing for the regression coefficient

 The confidence interval for the regression coefficient, b1:
 b̂1 ± tc × sb1
• tc: the critical two-tailed t-value for the selected confidence level with the appropriate number of degrees of freedom (n−2).
• sb1: the standard error of the regression coefficient.

Hypothesis testing for the regression coefficient

 sb1 is a function of the SEE: as SEE rises, sb1 also increases, and the confidence interval widens.
 SEE measures the variability of the data about the regression line, and the more variable the data, the less confidence there is in the coefficient estimate.

Hypothesis testing for the regression coefficient

 Hypothesis testing
 We test the population value of the intercept or slope coefficient of a regression model.
• H0: b1=0; Ha: b1≠0
• t-statistic: t = (b̂1 − b1)/sb1; df = n−2.
• Reject H0 if t > +tcritical or t < −tcritical.
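As an illustration, the following hedged Python sketch runs the slope significance test and builds the confidence interval for the height/weight regression; the standard error sb1 = SEE/√∑(Xi−X̄)² is derived here rather than taken from the slides:

```python
from scipy.stats import t as t_dist

x = [170, 165, 168, 155, 173, 180, 185, 171, 167, 176]
y = [65, 52, 60, 48, 59, 75, 85, 70, 68, 72]
b0, b1, n = -139.0326, 1.1955, 10      # coefficients from the earlier example

# Standard error of the slope: s_b1 = SEE / sqrt(sum of squared X deviations)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
see = (sse / (n - 2)) ** 0.5
x_bar = sum(x) / n
s_b1 = see / sum((xi - x_bar) ** 2 for xi in x) ** 0.5

t_stat = (b1 - 0) / s_b1               # H0: b1 = 0
t_crit = t_dist.ppf(0.975, n - 2)      # two-tailed, 5% significance, df = n - 2

print(f"t = {t_stat:.3f} vs critical +/-{t_crit:.3f}")  # |t| > critical: reject H0
print(f"95% CI for b1: [{b1 - t_crit * s_b1:.4f}, {b1 + t_crit * s_b1:.4f}]")
```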
Determining a prediction interval

 Predicted values are values of the dependent variable based on the estimated regression coefficients and a prediction about the value of the independent variable.
 Point estimate: Ŷ = b̂0 + b̂1X

Determining a prediction interval

 Confidence interval estimate for the prediction
 Ŷ ± (tc × sf)
• tc = the critical t-value with df = n−2;
• sf = the standard error of the forecast:
 sf = SEE × √[1 + 1/n + (X − X̄)²/((n−1)sX²)]
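A minimal sketch of the prediction interval calculation, again reusing the height/weight example; the query point X = 175 and the SEE value are assumptions carried over from the earlier sketches:

```python
from scipy.stats import t as t_dist
from statistics import mean, variance

x = [170, 165, 168, 155, 173, 180, 185, 171, 167, 176]
b0, b1, see, n = -139.0326, 1.1955, 5.106, 10   # values from the earlier sketches

x_new = 175
y_hat = b0 + b1 * x_new                  # point estimate of weight at 175 cm

# sf = SEE * sqrt(1 + 1/n + (X - X_bar)^2 / ((n - 1) * s_x^2))
sf = see * (1 + 1 / n + (x_new - mean(x)) ** 2 / ((n - 1) * variance(x))) ** 0.5

t_c = t_dist.ppf(0.975, n - 2)           # df = n - 2
print(f"Point estimate: {y_hat:.2f} kg")
print(f"95% prediction interval: [{y_hat - t_c * sf:.2f}, {y_hat + t_c * sf:.2f}]")
```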
Limitations of regression analysis

 Regression relations can change over time (parameter instability).
 Public knowledge of regression relationships may limit their future usefulness.
 Regression results are not reliable if the regression assumptions are violated, e.g.:
 Heteroskedasticity (non-constant variance of the error terms)
 Autocorrelation (error terms are not independent)

Example

 An analyst ran a regression and got the following result:

            Coefficient   t-statistic   p-value
Intercept   -0.5          -0.91         0.18
Slope       2             12.00         <0.001

ANOVA Table   df   SS     MSS
Regression    1    8000   ?
Error         ?    2000   ?
Total         51   ?      -

Example

1. Fill in the blanks of the ANOVA table.
2. What is the standard error of estimate?
3. At the 95% confidence level, what is the result of the slope coefficient significance test?
4. What is the sample correlation?
5. What is the 95% confidence interval of the slope coefficient?

CONTENTS

Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from "Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations"

Multiple Regression and Machine Learning

 Multiple linear regression
 Violations of regression assumptions
 Model specification and errors in specification
 Models with qualitative dependent variables
Multiple linear regression

 The multiple linear regression model
 Yi=b0+b1X1i+b2X2i+…+bkXki+εi, i=1,…,n
• Yi = ith observation of the dependent variable, Y;
• Xji = ith observation of the jth independent variable, Xj;
• b0 = the regression intercept term;
• b1,…,bk = partial regression coefficients, the regression slope coefficients for each of the independent variables;
• εi = the residual for the ith observation (also referred to as the disturbance term or error term).

Multiple linear regression

 Interpretation of regression coefficients
 The estimated intercept coefficient (b0): the value of Y when all the independent variables are equal to zero.
 The partial regression coefficients or partial slope coefficients (b1, b2,…,bk): the sensitivity of Y to a change in the corresponding X.
• bk: the expected increase in Y for a 1-unit increase in Xk, holding the other independent variables constant.
Multiple linear regression

 Assumptions of the multiple linear regression model:
 The relationship between the dependent variable, Y, and the independent variables, X1, X2, …, Xk, is linear.
 The independent variables (X1, X2, …, Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.

Multiple linear regression

 The expected value of the error term is 0.
 The variance of the error term is the same for all observations: E(εi²)=σε².
 The error term is uncorrelated across observations: E(εiεj)=0, j≠i.
 The error term is normally distributed.

Multiple linear regression

 Analysis of variance (ANOVA) table

            df       SS    MSS
Regression  k        RSS   RSS/k
Residual    n-k-1    SSE   SSE/(n-k-1)
Total       n-1      TSS   -

Multiple linear regression

 Standard error of regression (SEE)
 SEE = √[SSE/(n−k−1)], where k is the number of slope coefficients.
 R² and adjusted R²
 R² almost always increases as variables are added to the model, even if the marginal contribution of the new variables is not statistically significant.

Multiple linear regression

 The adjusted R² is a modified version of R² that does not necessarily increase when a new independent variable is added.
 Adjusted R²: R̄² = 1 − [(n−1)/(n−k−1)] × (1−R²)
 R̄² ≤ R², and adjusted R² may be less than zero.

Multiple linear regression

 Hypothesis test for a partial slope coefficient
 H0: bj=0 (j=1,2,…,k)
 t = b̂j/sbj, df = n−k−1
 Regression coefficient confidence interval: b̂j ± tc × sbj
Multiple linear regression

 Testing whether all population regression coefficients equal zero
 Joint hypothesis testing:
• The F-test is used to test whether at least one slope coefficient is significantly different from zero.
 H0: b1=b2=b3=…=bk=0;
 Ha: at least one bj≠0 (j = 1 to k).

Multiple linear regression

 The F-Statistic
 F = (RSS/k) / (SSE/(n−k−1))
• The F-test here is always a one-tailed test.
• Decision rule: if F (test statistic) > F (critical value), reject H0.
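A minimal sketch of the F-test mechanics; the RSS/SSE values are borrowed from the stock-returns example later in this section, and scipy supplies the critical value:

```python
from scipy.stats import f as f_dist

rss, sse = 20.5969, 3.4031     # ANOVA values from a later example in this deck
k, n = 3, 10                   # slope coefficients, observations

F = (rss / k) / (sse / (n - k - 1))
F_crit = f_dist.ppf(0.95, k, n - k - 1)   # one-tailed, 5% significance

print(f"F = {F:.2f}, critical F(3, 6) = {F_crit:.2f}")
print("Reject H0: at least one slope differs from zero" if F > F_crit
      else "Fail to reject H0")
```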
F-table at 5 percent (upper tail)
(Table of F critical values.)

Example

In an attempt to estimate a regression equation that can be used to forecast Feild's future sales, 22 years of Feild's annual sales were regressed against two independent variables:
• GDP = the level of gross domestic product
• ∆I = changes in 30-year mortgage interest rates (expressed in percentage terms)

Example

Based on the regression results (see next page), the regression equation can be stated as:
Sales = 6.000 + 0.004(GDP) − 20.500(∆I)
Fill in the missing data and interpret the results of the regression at a 5% level of significance with respect to:
• The significance of the individual independent variables.
• The utility of the model as a whole.

Example

 Regression Results for Feild Sales Data

                                Coefficient   Standard Error   t-Statistic   p-Value
Intercept                       6.000         4.520            1.327         0.20
Level of gross domestic
product (GDP)                   0.004         0.003            ?             0.20
Changes in 30-year
mortgage rates (∆I)             -20.500       3.560            ?             <0.001

Example

ANOVA        df   SS       MS   F   Significance F
Regression   ?    236.30   ?    ?   p < 0.005
Error        ?    116.11   ?
Total        ?    ?

R²    ?
Ra²   ?

Answer

 Regression Results for Feild Sales Data

                                Coefficient   Standard Error   t-Statistic   p-Value
Intercept                       6.000         4.520            1.327         0.20
Level of gross domestic
product (GDP)                   0.004         0.003            1.333         0.20
Changes in 30-year
mortgage rates (∆I)             -20.500       3.560            -5.758        <0.001

Answer

ANOVA        df   SS       MS       F       Significance F
Regression   2    236.30   118.15   19.34   p < 0.005
Error        19   116.11   6.11
Total        21   352.41

R²    67.05%
Ra²   63.58%
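The filled-in answers can be verified with a short sketch (all inputs are taken from the tables above):

```python
rss, sse, n, k = 236.30, 116.11, 22, 2

tss = rss + sse                           # 352.41
ms_reg, mse = rss / k, sse / (n - k - 1)  # 118.15, 6.11
F = ms_reg / mse                          # 19.34

r2 = rss / tss                                   # 67.05%
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # 63.58%

print(f"TSS={tss:.2f}  MS_reg={ms_reg:.2f}  MSE={mse:.2f}  F={F:.2f}")
print(f"R^2={r2:.2%}  adjusted R^2={r2_adj:.2%}")

# t-statistic = coefficient / standard error
print(f"t(GDP) = {0.004 / 0.003:.3f}, t(dI) = {-20.500 / 3.560:.3f}")
```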

Multiple linear regression

 Dummy variables
 Dummy variables in a regression model can help analysts determine whether a particular qualitative independent variable explains the model's dependent variable.
 Dummy variables are assigned a value of "0" or "1".
 If we need to distinguish among n categories, we should include n − 1 dummy variables.

Multiple linear regression

 Example of dummy variables (see the sketch below)
 GDPt=b0+b1Q1t+b2Q2t+b3Q3t+εt
• GDPt = the observed value of GDP in period t;
• Q1t = 1 if period t is the first quarter, 0 otherwise;
• Q2t = 1 if period t is the second quarter, 0 otherwise;
• Q3t = 1 if period t is the third quarter, 0 otherwise.

Multiple linear regression

• The intercept term represents the average value of GDP for the fourth quarter.
• The slope coefficient on each dummy variable estimates the difference in GDP (on average) between the respective quarter and the omitted quarter.
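A minimal sketch (with hypothetical data) of how the three quarterly dummies would be constructed:

```python
# With four quarters we include n - 1 = 3 dummies; Q4 is the omitted category,
# so the intercept is the Q4 average and each slope is the difference vs. Q4.
quarters = [1, 2, 3, 4, 1, 2, 3, 4]          # quarter of each observation

rows = [{"Q1": int(q == 1), "Q2": int(q == 2), "Q3": int(q == 3)}
        for q in quarters]

for q, row in zip(quarters, rows):
    print(q, row)
# e.g. a Q4 observation gets Q1=0, Q2=0, Q3=0 (captured by the intercept)
```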
Multiple Regression and Machine Learning

 Multiple linear regression
 Violations of regression assumptions
 Model specification and errors in specification
 Models with qualitative dependent variables

Violations of regression assumptions

 Heteroskedasticity
 Serial correlation (autocorrelation)
 Multicollinearity
Heteroskedasticity

 Homoskedasticity and heteroskedasticity
 The error term εi is homoskedastic if the variance of the conditional distribution of εi given Xi is constant for i = 1,…,n and, in particular, does not depend on Xi.
 Otherwise, the error term is heteroskedastic.

Heteroskedasticity

 Unconditional heteroskedasticity
 The heteroskedasticity is not related to the level of the independent variables.
• This means that it does not systematically increase or decrease with changes in the value of the independent variable(s).
• While this is a violation of the equal variance assumption, it usually causes no major problems with the regression.

Heteroskedasticity

 Conditional heteroskedasticity
 Heteroskedasticity is related to the level of (i.e., conditional on) the independent variable.
 The residual variance will be larger when the values of the independent variable X are larger.
 Conditional heteroskedasticity does create significant problems for statistical inference.

Heteroskedasticity

(Figure: scatter of Y against X around the fitted line Ŷ = b0 + b1X; residual variance is low at small values of X and high at large values of X, illustrating conditional heteroskedasticity.)
Heteroskedasticity

 Effect of heteroskedasticity on regression analysis
 The coefficient estimates (the b̂i) are not affected;
 The standard errors are usually unreliable estimates;

Heteroskedasticity

 t-tests & F-tests are all affected.
• If the standard errors are too small, but the coefficient estimates themselves are not affected, the t-statistics will be too large and the null hypothesis of no statistical significance will be rejected too often. (Type I error)
• If the standard errors are too large, the t-statistics will be too small and significant relationships will be missed too often. (Type II error)

Heteroskedasticity

 Detecting heteroskedasticity
 Method one: residual scatter plot.
(Figure: residuals plotted against the independent variable; the spread of the residuals widens as the independent variable increases.)

Heteroskedasticity

 Method two: the Breusch-Pagan χ² test (df = k).
• H0: the squared error term is uncorrelated with the independent variables.
• Ha: the squared error term is correlated with the independent variables.
• BP = n × R²Residual, df = k, one-tailed test

Heteroskedasticity

• Note: R²Residual is the coefficient of determination from regressing the squared residuals on the independent variables, not the R² of the original regression equation.
 We are concerned only with large values of the test statistic.
 Decision rule: the BP test statistic should be small (compare it against the χ² distribution table).
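A hedged sketch of the BP statistic; n, k, and R²Residual are assumed illustrative inputs, with R²Residual coming from the auxiliary squared-residual regression described above:

```python
from scipy.stats import chi2

n, k = 100, 2            # observations, independent variables (assumed)
r2_resid = 0.08          # R^2 of the squared-residual regression (assumed)

bp = n * r2_resid                      # BP = n x R^2_residual
crit = chi2.ppf(0.95, df=k)            # one-tailed chi-squared test

print(f"BP = {bp:.2f}, chi^2 critical (df={k}) = {crit:.2f}")
print("Conditional heteroskedasticity detected" if bp > crit
      else "No evidence of conditional heteroskedasticity")
```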

Chi-squared (χ²) table
(Table of χ² critical values.)

Heteroskedasticity

 Correcting for heteroskedasticity
 Method one: robust standard errors
• Corrects the standard errors (White-corrected) of the linear regression model's estimated coefficients to account for the conditional heteroskedasticity.
 Method two: generalized least squares
• Modifies the original equation in an attempt to eliminate the heteroskedasticity.

Violations of regression assumptions

 Heteroskedasticity
 Serial correlation (autocorrelation)
 Multicollinearity
Serial correlation (autocorrelation)

 Serial correlation (autocorrelation) refers to the situation in which the error terms are correlated with one another.

Serial correlation (autocorrelation)

 Serial correlation is often found in time series data.
 Positive serial correlation exists when a positive regression error in one time period increases the probability of observing a positive regression error in the next time period.
 Negative serial correlation occurs when a positive error in one period increases the probability of observing a negative error in the next period.

Serial correlation (autocorrelation)

 Effect of serial correlation on regression analysis
 If one of the independent variables is a lagged value of the dependent variable, then serial correlation in the error term will cause all the parameter estimates from linear regression to be inconsistent. (Not discussed further here.)
 As long as none of the independent variables is a lagged value of the dependent variable, the estimated parameters themselves will be consistent.
Serial correlation (autocorrelation)

 Positive serial correlation → Type I error & unreliable F-test
• It does not affect the consistency of the estimated regression coefficients.
• Because of the tendency of the data to cluster together from observation to observation, positive serial correlation typically results in coefficient standard errors that are too small, which leads to t-statistics that are too large.
• Positive serial correlation is much more common in economic and financial data, so we focus our attention on its effects.

Serial correlation (autocorrelation)

 Negative serial correlation → Type II error
• Because of the tendency of the data to diverge from observation to observation, negative serial correlation typically results in coefficient standard errors that are too large, which leads to t-statistics that are too small.

Serial correlation (autocorrelation)

 Detecting serial correlation
 Two methods to detect serial correlation
• Method one: residual scatter plots.
• Method two: the Durbin-Watson test.
 H0: no serial correlation
 If the sample is very large, the Durbin-Watson statistic will be approximately equal to 2×(1−r), where r is the sample autocorrelation of the residuals.
Serial correlation (autocorrelation)

 Durbin-Watson test
 H0: no positive serial correlation
 DW ≈ 2×(1−r)
 Decision rule (bands of the DW statistic from 0 to 4):
• 0 to dl: reject H0, conclude positive serial correlation
• dl to du: inconclusive
• du to 4−du: do not reject H0
• 4−du to 4−dl: inconclusive
• 4−dl to 4: reject H0, conclude negative serial correlation

Critical values for the Durbin-Watson statistic (α=0.05)
(Table of dl and du critical values.)
Serial correlation (autocorrelation)

 Methods to correct serial correlation
 Method one: adjust the coefficient standard errors (e.g., the Hansen method)
• The Hansen method also corrects for conditional heteroskedasticity.
• The White-corrected standard errors are preferred if only heteroskedasticity is a problem.
 Method two: modify the regression equation itself to eliminate the serial correlation.

Violations of regression assumptions

 Heteroskedasticity
 Serial correlation (autocorrelation)
 Multicollinearity
Multicollinearity

 Multicollinearity refers to the condition in which two or more of the independent variables, or linear combinations of the independent variables, are highly correlated with each other.
 In practice, multicollinearity is often a matter of degree rather than of absence or presence.
 High pairwise correlations among the independent variables are not a necessary condition for multicollinearity.

Multicollinearity

 Effect of multicollinearity
 It does not affect the consistency of the slope coefficients;
 The coefficients themselves tend to be unreliable;
 The standard errors of the slope coefficients are artificially inflated. (Type II error)

Multicollinearity

 Two methods to detect multicollinearity:
 Method one: the t-tests indicate that none of the individual coefficients is significantly different from zero (every t-test fails to reject H0, as if every b equaled 0), while the F-test indicates overall significance (the F-test rejects H0: at least some b's are nonzero) and the R² is high.
 Method two: the absolute value of the sample correlation between any two independent variables is greater than 0.7 (|r| > 0.7).
 Method to correct multicollinearity
 Omit one or more of the correlated independent variables.
Summary of assumption violations

Problem              Effect              Detection                        Solution
Heteroskedasticity   Incorrect           1. Residual scatter plots        Use robust standard errors
                     standard errors     2. Breusch-Pagan χ²-test         (corrected for conditional
                                            (BP = n×R²)                   heteroskedasticity)
Serial correlation   Incorrect           1. Residual scatter plots        Use robust standard errors
                     standard errors     2. Durbin-Watson test            (corrected for serial
                                            (DW ≈ 2×(1−r))                correlation)
Multicollinearity    High R² and low     1. t-tests fail to reject H0;    Remove one or more
                     t-statistics           F-test rejects H0; R² high    independent variables
                                         2. High correlation among
                                            independent variables
Multiple Regression and Machine Learning

 Multiple linear regression
 Violations of regression assumptions
 Model specification and errors in specification
 Models with qualitative dependent variables

Model specification and errors in specification

 Principles of model specification
 The model should be grounded in cogent economic reasoning;
 The functional form chosen for the variables in the regression should be appropriate given the nature of the variables;
 The model should be parsimonious. We should expect each variable included in a regression to play an essential role;
 The model should be tested and found useful out of sample before being accepted.
Model specification and errors in specification

 Misspecified functional form
 One or more important variables could be omitted from the regression.
 One or more of the regression variables may need to be transformed before estimating the regression.
 The regression model pools data from different samples that should not be pooled.

Model specification and errors in specification

 Time-series misspecification (independent variables correlated with errors)
 Including lagged dependent variables as independent variables in regressions with serially correlated errors;
 Including a function of a dependent variable as an independent variable, sometimes as a result of the incorrect dating of variables;
 Independent variables that are measured with error.

Model specification and errors in specification

 Other types of time-series misspecification
 Relations among time series with trends;
 Relations among time series that may be random walks.

Multiple Regression and Machine Learning

 Multiple linear regression
 Violations of regression assumptions
 Model specification and errors in specification
 Models with qualitative dependent variables

Models with qualitative dependent variables

 Qualitative dependent variables are dummy variables used as dependent variables instead of as independent variables.
 Probit and logit models
 Application of these models results in estimates of the probability that an event occurs (e.g., the probability of default).

Models with qualitative dependent variables

• A probit model is based on the normal distribution, while a logit model is based on the logistic distribution.
• Both models must be estimated using maximum likelihood methods.
• The coefficients relate the independent variables to the likelihood of an event occurring, such as a merger, bankruptcy, or default.

Models with qualitative dependent variables

 Discriminant models
 Discriminant analysis yields a linear function, similar to a regression equation, which can then be used to create an overall score. Based on the score, an observation can be classified into the bankrupt or not-bankrupt category. (Z-score)

Example

Use the following information to answer Questions 1 through 4.
Multiple regression was used to explain stock returns using the following variables:
Dependent variable:
RET = annual stock returns (%)
Independent variables:
MKT = market capitalization / $1.0 million

Example

IND = industry quartile ranking (IND = 4 is the highest ranking)
FORT = Fortune 500 firm (FORT = 1 if the stock is that of a Fortune 500 firm, FORT = 0 if not)

Example

The regression results are presented in the tables below.

                        Coefficient   Standard Error   t-Statistic   p-Value
Intercept               0.5220        1.2100           0.430         0.681
Market Capitalization   0.0460        0.0150           3.090         0.021
Industry Ranking        0.7102        0.2725           2.610         0.040
Fortune 500             0.9000        0.5281           1.700         0.139

Example

ANOVA        df   SS        MSS      F        Significance F
Regression   3    20.5969   6.8656   12.100   0.006
Error        6    3.4031    0.5672
Total        9    24.0000

Test            Test-Statistic
Breusch-Pagan   17.7
Durbin-Watson   1.8

Example

1. Based on the results in the table, which of the following most accurately represents the regression equation?
A. 0.43 + 3.09(MKT) + 2.61(IND) + 1.70(FORT).
B. 0.681 + 0.021(MKT) + 0.04(IND) + 0.139(FORT).
C. 0.522 + 0.0460(MKT) + 0.7102(IND) + 0.9(FORT).

Example

2. The expected return on the stock of a firm that is not in the Fortune 500, has a market capitalization of $5 million, and is in an industry with a rank of 3 is closest to:
A. 2.88%.
B. 3.98%.
C. 1.42%.

Example

3. Does being a Fortune 500 stock contribute significantly to stock returns?
A. Yes, at a 10% level of significance.
B. Yes, at a 5% level of significance.
C. No, not at a reasonable level of significance.

Example

4. The p-value of the Breusch-Pagan test is 0.0005. The lower and upper limits for the Durbin-Watson test are 0.40 and 1.90, respectively. Based on this data and the information in the tables, there is evidence of:
A. only serial correlation.
B. serial correlation and heteroskedasticity.
C. only heteroskedasticity.

Example

1. Answer = C (the equation uses the estimated coefficients, not the t-statistics or p-values).
2. Answer = A (0.522 + 0.046×5 + 0.7102×3 + 0.9×0 ≈ 2.88%).
3. Answer = C (t = 1.700, p-value = 0.139: not significant even at the 10% level).
4. Answer = C (the BP p-value of 0.0005 indicates heteroskedasticity; DW = 1.8 lies between dl = 0.40 and du = 1.90, so the serial correlation test is inconclusive).
CONTENTS

Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from "Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations"

Time-series analysis

 Trend models
 Autoregressive (AR) time-series models
 Random walk and unit root
 Autoregressive conditional heteroskedasticity (ARCH)
 Seasonality in time-series models
 Regression with more than one time series
 Steps in time-series forecasting
Trend models

 Linear trend model
 yt=b0+b1t+εt
 Same as linear regression, except that the independent variable is time t (t=1,2,3,…).

Trend models

 Log-linear trend model
 If the residuals from a linear trend model are persistent, we need an alternative model satisfying the conditions of linear regression.
 Log-linear trends work well in fitting time series that have exponential growth (constant growth at a particular rate).
• yt = e^(b0+b1t)
 The constant growth rate is e^b1 − 1.

Trend models

• Model the natural log of the series using a linear trend:
 ln yt = b0 + b1t + εt
 Use the Durbin-Watson statistic to detect autocorrelation.
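A minimal sketch (with synthetic data growing 5% per period) showing how the log-linear trend is fitted and the growth rate recovered as e^b1 − 1:

```python
import math

t = list(range(1, 21))
y = [100 * 1.05 ** ti for ti in t]        # synthetic series growing 5% per period

ln_y = [math.log(v) for v in y]           # regress ln(y) on t
n = len(t)
t_bar, l_bar = sum(t) / n, sum(ln_y) / n

b1 = (sum((ti - t_bar) * (li - l_bar) for ti, li in zip(t, ln_y))
      / sum((ti - t_bar) ** 2 for ti in t))
b0 = l_bar - b1 * t_bar

print(f"b0={b0:.4f}, b1={b1:.4f}, growth rate = {math.exp(b1) - 1:.2%}")  # ~5.00%
```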

Trend models

(Figure: a linear trend model, yt = b0 + b1t + εt, fitted to raw data, and a log-linear trend model, ln yt = b0 + b1t + εt, fitted to log-transformed data.)

Trend models

(Figure: annual sales, 1987-2015, in millions, with fitted linear and exponential trend lines.)

Trend models

 Factors that determine which model is best
 A linear trend model may be appropriate if the data points appear to be equally distributed above and below the regression line.
 A log-linear model may be more appropriate if the data plot has a non-linear (curved) shape, so that the residuals from a linear trend model would be persistently positive or negative for a period of time.

Trend models

 Limitations of trend models
 Usually the time series data exhibit serial correlation, which means that the model is not appropriate for the time series, causing inconsistent b0 and b1.
 The mean and variance of the time series may change over time.

Time-series analysis

 Trend models
 Autoregressive (AR) time-series models
 Random walk and unit root
 Autoregressive conditional heteroskedasticity (ARCH)
 Seasonality in time-series models
 Regression with more than one time series
 Steps in time-series forecasting
Autoregressive (AR) time-series models

 An autoregressive model uses past values of the dependent variable as independent variables.
 AR(p) model:
• xt=b0+b1xt-1+b2xt-2+…+bpxt-p+εt
 εt ~ N(0,σ²), i.e., white noise
 AR(p): an AR model of order p (p is the number of lagged terms the model includes).

Autoregressive (AR) time-series models

 Forecasting with an autoregressive model
 Chain rule of forecasting (see the sketch below)
• A one-period-ahead forecast for an AR(1) model:
 x̂t+1=b0+b1xt
• A two-period-ahead forecast for an AR(1) model:
 x̂t+2=b0+b1x̂t+1
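A minimal sketch of the chain rule of forecasting for an AR(1) model; the coefficients are assumed for illustration:

```python
b0, b1 = 1.0, 0.6
x_t = 5.0                      # latest observed value

x_1 = b0 + b1 * x_t            # one-period-ahead forecast
x_2 = b0 + b1 * x_1            # two-period-ahead uses the forecast, not data

print(f"x(t+1) = {x_1:.3f}, x(t+2) = {x_2:.3f}")
print(f"mean-reverting level = {b0 / (1 - b1):.3f}")   # b0/(1-b1) = 2.5
# Since x_t = 5 is above 2.5, the forecasts decline toward the mean.
```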
Autoregressive (AR) time-series models

 Before forecasting with an autoregressive model, we should verify:
 No autocorrelation
 A covariance-stationary series
 No conditional heteroskedasticity

Autoregressive (AR) time-series models

 Detecting autocorrelation in an AR model
 Compute the autocorrelations of the residuals:
• t-statistic = r(εt, εt−k) / (1/√n)
 n is the number of observations in the time series.
 If the residual autocorrelations differ significantly from 0, the model is not correctly specified, so we may need to modify it.
 Correction: add lagged values.

Autoregressive (AR) time-series models

 Covariance-stationary series
 Three conditions for covariance stationarity:
• Constant and finite expected value of the time series;
• Constant and finite variance of the time series;
• Constant and finite covariance with leading or lagged values.
 Stationarity in the past does not guarantee stationarity in the future.

Autoregressive (AR) time-series models

 Mean reversion
 A time series exhibits mean reversion if it has a tendency to move towards its mean;
 For an AR(1) model (xt+1=b0+b1xt), setting xt+1=xt implies xt=b0+b1xt, so the mean-reverting level is xt = b0/(1−b1);

Autoregressive (AR) time-series models

 If xt > b0/(1−b1), the model predicts that xt+1 will be lower than xt;
 If xt < b0/(1−b1), the model predicts that xt+1 will be higher than xt;
 All covariance-stationary time series have a finite mean-reverting level.

Autoregressive (AR) time-series models

 Comparing forecasting model performance
 One way to compare the forecast performance of two models is to compare the variance of the forecast errors that the two models make.
 Distinguish between in-sample forecast errors and out-of-sample forecast errors.

Autoregressive (AR) time-series models

 Comparing forecasting power with RMSE
 Typically, we compare the out-of-sample forecasting performance of forecasting models by comparing their root mean squared error (RMSE).
• RMSE is the square root of the average squared error.
 RMSE = √[∑i=1..n (Predictedi − Actuali)² / n]
• The model with the smallest RMSE is the most accurate.
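A minimal sketch of an RMSE comparison between two models; the forecasts and actuals are assumed for illustration:

```python
def rmse(predicted, actual):
    # square root of the average squared forecast error
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual))
            / len(actual)) ** 0.5

actual = [1.2, 0.8, 1.5, 1.1]
model_a = [1.0, 0.9, 1.4, 1.3]
model_b = [1.5, 0.4, 1.9, 0.7]

print(f"RMSE A = {rmse(model_a, actual):.4f}")   # smaller RMSE: more accurate
print(f"RMSE B = {rmse(model_b, actual):.4f}")
```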

Time-series analysis

 Trend models
 Autoregressive (AR) time-series models
 Random walk and unit root
 Autoregressive conditional heteroskedasticity (ARCH)
 Seasonality in time-series models
 Regression with more than one time series
 Steps in time-series forecasting

Random walks and unit roots

 Random walk
 Simple random walk:
• xt=xt-1+εt, E(εt)=0, E(εt²)=σ², Cov(εt,εs)=E(εtεs)=0 if t≠s.
• A special AR(1) model with b0=0 and b1=1;
• The best forecast of xt is xt-1.

Random walks and unit roots

 Random walk with drift
 xt=b0+b1xt-1+εt (b0≠0, b1=1);
 The time series is expected to increase or decrease by a constant amount each period.
 If the lag coefficient is equal to 1, the time series has a unit root.
Random walks and unit roots

 A random walk is a time series that has an undefined mean-reverting level and a variance of xt that grows without an upper bound; it is not covariance stationary.
• We cannot use standard regression analysis on a time series that is a random walk. We should first convert the data to a covariance-stationary time series.

Random walks and unit roots

• We could not rely on the statistical results of an AR(1) model if the absolute value of the lag coefficient were greater than or equal to 1.0, because the time series would not be covariance stationary.

Random walks and unit roots

 The unit root test of nonstationarity
 A common t-test of the hypothesis that b1=1 is invalid for testing the unit root.
 Dickey-Fuller test (DF test) for the unit root:
• Start with an AR(1) model xt=b0+b1xt-1+εt;
• xt−xt-1=b0+(b1−1)xt-1+εt, or xt−xt-1=b0+g·xt-1+εt, E(εt)=0, where g=b1−1.

Random walks and unit roots

 H0: g=0 (unit root, nonstationary); Ha: g<0 (no unit root, stationary).
• Calculate the conventional t-statistic and use the revised t-table;
• If we can reject H0, the time series does not have a unit root and is stationary.
Random walks and unit roots

 If a time series appears to have a unit root, how should we model it?
 One method that is often successful is to first-difference the time series and try to model the first-differenced series as an autoregressive time series.

Random walks and unit roots

 First differencing
 Define yt as yt=xt−xt-1=εt;
• E(εt)=0, E(εt²)=σ², Cov(εt,εs)=E(εtεs)=0 if t≠s.
 This is an AR(1) model yt=b0+b1yt-1+εt, with b0=b1=0;
 The mean-reverting level is b0/(1−b1)=0, and the variance of yt is Var(εt)=σ².
 Therefore, the first-differenced variable yt is covariance stationary.

Example

Which of the following will always have a finite mean-reverting level?
A. A covariance-stationary time series.
B. A random-walk-with-drift time series.
C. A time series with a unit root.

Answer = A.

Time-series analysis

 Trend models
 Autoregressive (AR) time-series models
 Random walk and unit root
 Autoregressive conditional heteroskedasticity (ARCH)
 Seasonality in time-series models
 Regression with more than one time series
 Steps in time-series forecasting

Autoregressive conditional heteroskedasticity (ARCH)

 Heteroskedasticity refers to the situation in which the variance of the error term is not constant.
 Test whether a time series is ARCH(1) (in multiple regression, the BP test is used instead):
 ARCH(1): εt ~ N(0, a0+a1εt-1²)
• The distribution of εt is conditional on its value in the previous period, εt-1.
 εt² = a0 + a1εt-1² + μt

Autoregressive conditional heteroskedasticity (ARCH)

 If the coefficient a1 is significantly different from 0, the time series is ARCH(1).
 If ARCH exists, the standard errors for the regression parameters will not be correct.
 If ARCH exists, the variance of the errors in period t+1 can be predicted in period t using: σ̂t+1² = â0 + â1εt²
 If ARCH exists, generalized least squares (GLS) must be used to develop a predictive model.
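A hedged sketch of the ARCH(1) test idea — regress the squared residuals on their own first lag and examine a1; the residual series here is simulated, so a1 should be near zero:

```python
import random

random.seed(1)
resid = [random.gauss(0, 1) for _ in range(200)]   # stand-in residual series

e2 = [e * e for e in resid]
y, x = e2[1:], e2[:-1]                 # eps_t^2 regressed on eps_{t-1}^2

n = len(y)
x_bar, y_bar = sum(x) / n, sum(y) / n
a1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
a0 = y_bar - a1 * x_bar

print(f"a0={a0:.4f}, a1={a1:.4f}")
# If a1 is significantly different from 0 (via its t-statistic), the
# series exhibits ARCH(1) and GLS should be used for prediction.
```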
Time-series analysis

 Trend models
 Autoregressive (AR) time-series models
 Random walk and unit root
 Autoregressive conditional heteroskedasticity (ARCH)
 Seasonality in time-series models
 Regression with more than one time series
 Steps in time-series forecasting

Seasonality in time-series models

 Seasonality
 The time series shows regular patterns of movement within the year.
 The seasonal autocorrelation of the residuals will differ significantly from 0.
 We should use a seasonal lag in an AR model.
 For example: xt=b0+b1xt-1+b2xt-4+εt.
Example

Using quarterly data from the first quarter of 1995 to the last quarter of 2012, we estimate the following AR(1) model using ordinary least squares:
(ln Salest − ln Salest-1) = b0 + b1(ln Salest-1 − ln Salest-2) + εt.
The tables show the results of the regression.

Example

 Regression Statistics
R-squared        0.1548
Standard error   0.0762
Observations     74
Durbin-Watson    1.9165

                            Coefficient   Standard Error   t-Statistic
Intercept                   0.0669        0.0101           6.6238
ln Salest-1 − ln Salest-2   -0.3813       0.1050           -3.6314

Example

Autocorrelations of the Residual

Lag   Autocorrelation   Standard Error   t-Statistic
1     -0.0141           0.1162           -0.1213
2     -0.0390           0.1162           -0.3356
3     0.0294            0.1162           0.2530
4     0.7667            0.1162           6.5981

Example

Suppose we decide to use an autoregressive model with a seasonal lag because of the significant seasonal (lag-4) autocorrelation. We are modeling quarterly data, so we estimate:
(ln Salest − ln Salest-1) = b0 + b1(ln Salest-1 − ln Salest-2) + b2(ln Salest-4 − ln Salest-5) + εt
Time-series analysis

 Trend models
 Autoregressive (AR) time-series models
 Random walk and unit root
 Autoregressive conditional heteroskedasticity (ARCH)
 Seasonality in time-series models
 Regression with more than one time series
 Steps in time-series forecasting

Regression with more than one time series

 In linear regression, if any time series contains a unit root, OLS may be invalid.
 For two time series, there are several possible scenarios related to the outcome of the Dickey-Fuller test:

Regression with more than one time series

 Neither of the time series has a unit root: we can use linear regression.
 One of the time series has a unit root, but the other does not: the error term in the linear regression would not be covariance stationary, so we cannot use linear regression.
 Both time series have a unit root: we need to establish whether the time series are cointegrated.

Regression with more than one time series

 Cointegrated: a long-term financial or economic relationship exists between the series, so they do not diverge from each other without bound in the long run.
 If both time series have a unit root, use the (Engle-Granger) Dickey-Fuller test to test for cointegration.

Regression with more than one time series

 yt = b0 + b1xt + εt
 H0: no cointegration; Ha: cointegration;
 If we cannot reject the null, the error term is not covariance stationary, so we cannot use linear regression;
 If we can reject the null, the error term is covariance stationary, so we can use linear regression to estimate the long-term relation between the two series.

Time-series analysis

 Trend models
 Autoregressive (AR) time-series models
 Random walk and unit root
 Autoregressive conditional heteroskedasticity (ARCH)
 Seasonality in time-series models
 Regression with more than one time series
 Steps in time-series forecasting

Steps in time-series forecasting

(Flowchart, reconstructed as a numbered list:)
1. Plot the series and determine whether it has a trend.
2. If there is a trend, fit a trend model: a linear trend model if the series grows by a constant amount, or a log-linear trend model if the growth is exponential.
3. Use the Durbin-Watson statistic to test the trend-model residuals for serial correlation. If there is no serial correlation, use the trend model.
4. If there is serial correlation, use an AR model instead. First check whether the series is covariance stationary; if it is not, re-create the series by first-differencing until it is.
5. Begin estimation with an AR(1) model and test its residuals for autocorrelation. If the residuals are autocorrelated, add autoregressive lags until the residual autocorrelations are insignificant.
6. Test for seasonality; if it is present, add the corresponding seasonal lag.
7. Use an ARCH(1) model to test the residuals for conditional heteroskedasticity; if it is present, correct the model using generalized least squares.
8. The model is complete; test its out-of-sample forecasting power.
CONTENTS

Introduction to Linear Regression
Multiple Regression
Time-series Analysis
Machine Learning
Big Data Projects
Excerpt from "Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations"

Machine Learning

 What is machine learning
 Overview of evaluating ML algorithm performance
 Supervised machine learning algorithms
 Unsupervised machine learning algorithms
 Neural networks, deep learning nets, and reinforcement learning

What is machine learning

 Defining machine learning
 Supervised learning
 Unsupervised learning
 Deep learning and reinforcement learning

Defining machine learning

 Statistical approaches rely on restrictive a priori foundational assumptions and explicit models of structure. These assumptions are often not satisfied in reality.
 The goal of machine learning algorithms is to generate structure or predictions from data without relying on human help or restrictive a priori assumptions.

Supervised learning

 Supervised learning
 Supervised ML algorithms use a labeled data set (training data set) of observed inputs and the associated output to infer patterns, and then use these inferred patterns to predict outputs for a new set of inputs (test data set).

Supervised learning

 Terminology differences between regression and ML

              In Regression           In ML
Xi (inputs)   Independent variables   Features
Y (output)    Dependent variable      Target
Supervised learning

 Supervised learning training process
(Diagram: the training dataset (Xi, Y, i = 1,…,N1) is fed to the supervised ML algorithm, which learns a prediction rule; the test dataset inputs (Xj, j = 1,…,N2) are then run through the prediction rule, and the predicted values (YPredict) are compared with the actual values (YActual) to evaluate the fit.)

Supervised learning

 Supervised learning can be divided into two categories: regression problems and classification problems.
 If the target (Y) variable is continuous, then the task is a regression problem.
• Regression focuses on making predictions of continuous target variables.

Supervised learning

 If the target (Y) variable is categorical or ordinal, then it is a classification problem.
• Classification focuses on sorting observations into distinct categories.

Unsupervised learning

 Unsupervised learning is machine learning that does not make use of labeled data.
 We only have features (X's) that are used for analysis without any target (Y) being supplied.
 The unsupervised learning algorithm seeks to discover structure within the data themselves.
 Two important types of problems that are well suited to unsupervised machine learning are dimension reduction and clustering.

Deep learning and reinforcement learning

 Deep learning and reinforcement learning
 In deep learning, sophisticated algorithms address highly complex tasks.
 In reinforcement learning, a computer learns from interacting with itself (or with data generated by the same algorithm).
 Deep learning and reinforcement learning are based on neural networks (NNs).

Example

Which of the following statements is most accurate? When attempting to discover groupings of data without any target (Y) variable:
A. an unsupervised ML algorithm is used.
B. an ML algorithm that is given labeled training data is used.
C. a supervised ML algorithm is used.

Answer = A
Machine Learning

 What is machine learning
 Overview of evaluating ML algorithm performance
 Supervised machine learning algorithms
 Unsupervised machine learning algorithms
 Neural networks, deep learning nets, and reinforcement learning

Overview of evaluating ML algorithm performance

 Generalization and overfitting
 Errors and overfitting
 Preventing overfitting in supervised machine learning

Generalization and overfitting

 The data set is typically divided into three non-overlapping samples:
 Training sample: used to train the model;
 Validation sample: used for validating and tuning the model;
 Test sample: used to test the model's ability to predict well on new data.
 The training sample is referred to as "in-sample".
 The validation and test samples are referred to as "out-of-sample".

Generalization and overfitting

 A model that generalizes well is a model that retains its explanatory power when tested out-of-sample.
 Underfitting means the model does not capture the relationships in the data.
 Overfitting means training a model so precisely to the training data that it does not generalize well to new data.
 The main causes of overfitting are high noise levels (random fluctuations) in the data and too much complexity in the model.
 As model complexity increases, overfitting risk increases.
Generalization and overfitting

(Figure: three fits of Y on X — an underfit model that misses the relationship, an overfit model that chases the noise, and a good fit.)

Overview of evaluating ML algorithm performance

 Generalization and overfitting
 Errors and overfitting
 Preventing overfitting in supervised machine learning

Errors and overfitting

 Low or no in-sample error but large out-of-sample error is indicative of poor generalization.
 Total out-of-sample error comes from three sources:
 Bias error: a model with high bias approximates the data poorly, causing underfitting and high in-sample error.
 Variance error: unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error.
 Base error, due to randomness in the data.

Errors and overfitting

 Learning curves: accuracy rate as a function of the number of training samples
[Figure: two learning curves, one showing high bias error and one showing high variance error; y-axis: accuracy rate (%), x-axis: number of training samples]
Errors and overfitting

 Learning curves (continued)
[Figure: learning curve showing a good bias-variance tradeoff; y-axis: accuracy rate (%), x-axis: number of training samples]

Errors and overfitting

 Typically, linear functions are more susceptible to


bias error and underfitting, while non-linear functions
are more prone to variance error and overfitting.

 An optimal point of model complexity exists where


the bias and variance error curves intersect and in-
and out-of-sample error rates are minimized.

Errors and overfitting

 Managing overfitting risk (finding the optimal point) is a core part of the machine learning process and the key to successful generalization.

 The trade-off between overfitting and generalization
is a trade-off between cost and complexity.

Errors and overfitting

 Fitting curve: error rates as a function of model complexity
[Figure: fitting curve plotting in- and out-of-sample model error against model complexity; total error is minimized at the optimal complexity, where the falling bias error curve and the rising variance error curve intersect]
Overview of evaluating ML algorithm performance

 Generalization and overfitting


 Errors and overfitting
 Preventing overfitting in supervised machine learning

Preventing overfitting in supervised
machine learning
 Two methods to reduce overfitting
 Preventing the algorithm from getting too
complex during selection and training;

• ‘Occam’s razor’ principle.


 Proper data sampling achieved by using
cross-validation.
• K-fold cross-validation.

Preventing overfitting in supervised
machine learning
 K-fold cross-validation (here, k = 5)
[Figure: the data set is divided into five equal samples; in each of five rounds, a different sample serves as the validation sample while the remaining four serve as training samples]
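A minimal sketch of k-fold cross-validation using scikit-learn; the data set and the choice of a logistic regression model are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical labeled data set: 500 observations, 10 features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 5-fold cross-validation: each fold serves once as the validation sample
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())  # average out-of-sample accuracy across the 5 folds
```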

Example

After training a model, Anderson discovers that while


it is good at correctly classifying using the training
sample, it does not perform well using new data.

Anderson’s model is most likely being impaired by


which of the following:
A. Underfitting and bias error
B. Overfitting and variance error.
C. Overfitting and bias error.
Answer = B

Example
By implementing which one of the following actions
can Anderson address the problem?
A. Estimate and incorporate into the model a
penalty that decreases in size with the number

of included features.
B. Use the k-fold cross-validation technique to
estimate the model’s out of-sample error, and
then adjust the model accordingly.
C. Use an unsupervised learning model.
Answer = B

Machine Learning

 What is machine learning


 Overview of evaluating ML algorithm performance
 Supervised machine learning algorithms

 Unsupervised machine learning algorithms


 Neural network, deep learning nets, and
reinforcement learning

Supervised machine learning algorithms

 Penalized regression and LASSO


 Support vector machine (SVM)
 k-nearest neighbor (KNN)

 Classification and regression tree (CART)
 Ensemble learning and random forest

Penalized regression and LASSO

 Penalized regression
 A special case of generalized linear model (GLM).
 LASSO is a regularization technique used to remove less pertinent features and build parsimonious models (avoiding overfitting).

Penalized regression and LASSO

 The regression coefficients are chosen to minimize


the sum of squared residuals plus a penalty term.
• Min ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² + Penalty term

• Penalized regression ensures that a feature is
included only if the sum of squared residuals
declines by more than the penalty term increases.

Penalized regression and LASSO

 The penalty term of Least absolute shrinkage and


selection operator (LASSO)
 Penalty term = λ ∑ₖ₌₁ᴷ |bₖ|, λ > 0
 Min ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² + λ ∑ₖ₌₁ᴷ |bₖ|, λ > 0
 The greater the number of included features,
the larger the penalty term.
 Lambda (λ) is a parameter whose value must be set by the researcher before learning begins; such a parameter is called a hyperparameter.
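A minimal LASSO sketch in scikit-learn; the data are hypothetical, and note that scikit-learn calls the λ hyperparameter alpha.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical data: 200 observations, 20 candidate features
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

# alpha plays the role of lambda: larger values shrink more coefficients to zero
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Features whose coefficients survive the penalty are the ones LASSO keeps
print(sum(coef != 0 for coef in lasso.coef_), "features retained")
```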

Support vector machine (SVM)

[Figure: two scatter plots; the linear space boundary (discriminant boundary) separates the two sets of data points, the margin is the band on either side of the boundary, and the support vectors are the observations lying on the edge of the margin]

Support vector machine (SVM)

 Terminology: n-dimensional space; linear classifier; linear space boundary; n-dimensional hyperplane.

 Support vector machine (SVM) is a linear


classifier that determines the hyperplane that
optimally separates the observations into two
sets of data points.

Support vector machine (SVM)

 The intuitive idea behind the SVM algorithm is


maximizing the probability of making a correct
prediction by determining the boundary that is the

furthest away from all the observations
(maximum margin).
 The margin is determined by the observations
closest to the boundary in each set, and these
observations are called support vectors.
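A minimal linear SVM sketch in scikit-learn; the data are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical data: two linearly separable clusters
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# A linear SVM determines the maximum-margin separating hyperplane
clf = SVC(kernel="linear", C=1000)
clf.fit(X, y)

# The support vectors are the observations closest to the boundary
print(clf.support_vectors_)
```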

K-nearest neighbor (KNN)

[Figure: two scatter plots illustrating how a new observation is assigned to the class of the majority of its nearest neighbors]
K-nearest neighbor (KNN)
 K-nearest neighbor (KNN) classifies a new observation by choosing the classification held by the largest number of its k nearest (most similar) neighbors.
 KNN makes no assumptions about the distribution of the data.
 K is a hyperparameter.
 The distance metric is used to model similarity.
 KNN can be used directly for multi-class
classification.
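A minimal KNN sketch in scikit-learn; the data and the choice k = 5 are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled data
X, y = make_classification(n_samples=300, n_features=4, random_state=1)

# k is a hyperparameter; Euclidean distance is the default similarity metric
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# A new observation takes the class of the majority of its 5 nearest neighbors
print(knn.predict(X[:1]))
```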

Classification and regression tree (CART)

 It can be applied to predict either a categorical


target variable, producing a classification tree,
or a continuous target variable, producing a

regression tree.

Classification and regression tree (CART)

 Most commonly, CART is applied to binary


classification or regression.
 If the goal is classification, then the prediction

of the algorithm at each terminal node will be
the category with the majority of data points.
 If the goal is regression, then the prediction at
each terminal node is the mean of the labeled
values.

Classification and regression tree (CART)

f: feature; c: cutoff value
[Figure: example CART tree; the root node splits on feature f1 at cutoff c1 into two decision nodes, which split on further (feature, cutoff) pairs such as (f2, c2), (f2, c3), (f3, c4), (f4, c5), and (f5, c6); each branch ends in a terminal node]
Classification and regression tree (CART)

 After each decision node, the partition of the


feature space becomes smaller and smaller, so
observations in each group have lower within-

group error than before.
 To avoid overfitting, regularization parameters,
such as the maximum depth of the tree, the
minimum population at a node, or the maximum
number of decision nodes can be added.

Classification and regression tree (CART)

 Regularization can occur via a pruning


technique that can be used afterward to reduce
the size of the tree. Sections of the tree that

provide little classifying power are pruned.
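A minimal classification tree sketch in scikit-learn; the regularization settings (maximum depth, minimum node population) are hypothetical examples of the parameters mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=7)

# Regularization parameters cap the tree's complexity to avoid overfitting
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20)
tree.fit(X, y)

# The prediction at each terminal node is the majority class of its data points
print(tree.predict(X[:5]))
```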

Ensemble learning and random forest

 Ensemble learning: combining the predictions


from a collection of models.
 The combination of multiple learning

algorithms is known as the ensemble method.

Ensemble learning and random forest

 Ensemble learning can be divided into two main


categories:
• Different types of algorithms combined

together (an aggregation of heterogeneous


learners) with a voting classifier;
• A combination of the same algorithm (an aggregation of homogeneous learners), using different training data that are based on a bootstrap aggregating technique.

Ensemble learning and random forest

 A majority-vote classifier will assign to a new


data point the predicted label with the most votes.
 The more individual independent models you

have trained, the higher the accuracy of the
aggregated prediction up to a point. (the law
of large numbers)
 However, there is an optimal number of models beyond which performance would be expected to deteriorate from overfitting.

Ensemble learning and random forest

 Bootstrap aggregating (or bagging) is a technique


whereby the original training data set is used to
generate n new training data sets or bags of data.

 Each new bag of data is generated by random


re-sampling from the initial training set.
 The algorithm can be trained on n
independent data sets that will generate n
new models.

Ensemble learning and random forest

 Then, for each new observation, we can


aggregate the n predictions using a majority-vote
classifier for a classification or an average for a

regression.

Ensemble learning and random forest

 A random forest classifier is a collection of a


large number of decision trees trained via a
bagging method.

 To derive even more individual predictions,


added diversity can be generated in the
trees by randomly reducing the number of
features available during training.

Ensemble learning and random forest

 For any new observation, we let all the


classifier trees undertake classification by
majority vote.

 It tends to protect against overfitting on the
training data.
 It also reduces the ratio of noise to signal.
 An important drawback of random forest is that it lacks the ease of interpretability of individual trees (it is a black box-type algorithm).
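A minimal random forest sketch in scikit-learn; the number of trees and the per-split feature limit are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=3)

# 100 trees trained on bootstrap samples; "sqrt" randomly limits the features
# available at each split to generate added diversity across the trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)

# Each new observation is classified by majority vote across all the trees
print(forest.predict(X[:3]))
```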

Machine Learning

 What is machine learning


 Overview of evaluating ML algorithm performance
 Supervised machine learning algorithms

 Unsupervised machine learning algorithms


 Neural network, deep learning nets, and
reinforcement learning

Unsupervised machine learning algorithms

 Principal Components Analysis


 Clustering
 K-Means Clustering

 Hierarchical Clustering
 Dendrograms

Principal Components Analysis

 Dimension reduction aims to represent a data


set with many, typically correlated, features by
a smaller set of features that still do well in

describing the data.


 Principal components analysis (PCA) is used
to summarize or reduce highly correlated
features of data into a few main, uncorrelated
composite variables.

Principal Components Analysis

 The eigenvectors define new, mutually


uncorrelated composite variables that are
linear combinations of the original features.

 An eigenvalue gives the proportion of total
variance in the initial data that is explained by
each eigenvector.
 PCA selects principal components based on the proportion of variation in the data set that they explain.
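A minimal PCA sketch in scikit-learn; the correlated data and the choice of three components are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 features driven by 3 underlying factors, so correlated
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 3))
X = factors @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(100, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # 3 mutually uncorrelated composite variables

# Eigenvalue shares: proportion of total variance explained by each component
print(pca.explained_variance_ratio_)
```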

Principal Components Analysis


 Principal Components of a Hypothetical 3-Dimensional
Data Set
[Figure: hypothetical 3-dimensional data cloud with axes X, Y, and Z; PC1 points along the direction of greatest spread, PC2 is perpendicular to PC1, and the projection error is the distance from each data point to its projection onto a component]
Principal Components Analysis

 Scree plots
[Figure: scree plot; y-axis: percent of variance explained (0 to 0.4), x-axis: number of principal components (0 to 20); the first few principal components together explain ≈ 80% of total variance]

Clustering

 Clustering is used to organize data points into


similar groups.
 The observations inside each cluster are

similar or close to each other (cohesion) and


the observations in two different clusters are
as far away from one another or are as
dissimilar as possible (separation).

Clustering

 We use ‘distance’ to define ‘similar’.


• A commonly used definition of distance is the Euclidean distance, the straight-line distance between two points.

K-Means Clustering

 K-means clustering repeatedly partitions


observations into k non-overlapping clusters.
 Each observation is assigned by the

algorithm to the cluster with the centroid (i.e.,


center) to which that observation is closest.
 The k-means algorithm is fast and works well
on very large data sets with hundreds of
millions of observations.
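A minimal k-means sketch in scikit-learn; the data and the choice k = 3 are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with three natural groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=4)

# k is a hyperparameter; each observation goes to its nearest centroid
kmeans = KMeans(n_clusters=3, n_init=10, random_state=4)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid of each cluster
```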

K-Means Clustering

 K-means clustering algorithm


[Figure: six panels showing k-means iterations for k = 3; centroids C1, C2, and C3 are placed, each observation is assigned to its nearest centroid, centroids are recomputed as cluster means, and the steps repeat until assignments stabilize]

Agglomerative clustering (or bottom-up)


hierarchical clustering
 Agglomerative clustering (or bottom-up) hierarchical clustering

[Figure: observations 1-10 are merged step by step, starting from individual points and combining the closest clusters into progressively larger ones]

Divisive clustering (or top-down)
hierarchical clustering
 Divisive clustering (or top-down) hierarchical clustering
[Figure: observations 1-10 are split step by step, starting from one all-inclusive cluster and dividing into progressively smaller clusters]

Hierarchical Clustering

 The agglomerative clustering algorithm makes clustering decisions based on local patterns without initially accounting for the global structure of the data. As such, the agglomerative method is well suited for identifying small clusters.
 The divisive clustering algorithm starts with a
holistic representation of the data, is designed to
account for the global structure of the data and
thus is better suited for identifying large clusters.

Dendrogram

 For visualizing a hierarchical cluster analysis.

[Figure: dendrogram over items A-K; vertical lines (dendrites) represent individual clusters, horizontal lines (arches) show where two clusters merge, and cutting the diagram at different heights yields 2, 6, or 11 clusters]

Machine Learning

 What is machine learning


 Overview of evaluating ML algorithm performance
 Supervised machine learning algorithms

 Unsupervised machine learning algorithms


 Neural network, deep learning nets, and
reinforcement learning

Neural Networks

 Neural networks (artificial neural networks, or ANNs)


 Has been successfully applied to a variety of
tasks characterized by non-linearities and

complex interactions among features.
 Neural networks can be supervised or
unsupervised.

Neural Networks

 Rectified linear unit (or ReLU) function

[Figure: network with an input layer (X1, X2, X3) connected directly to a single output node Y]
Y = aX1 + bX2 + cX3

Neural Networks

 Rectified linear unit (or ReLU) function


[Figure: network with an input layer (X1, X2, X3), a hidden layer (A1, A2), and an output node Y]
Y = a*max(0, X1+X2+X3) + b*max(0, X2+X3) = a*A1 + b*A2, where A = f(x) = max(0, x)
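A minimal sketch of the forward pass shown above; the input values and output-layer weights are hypothetical.

```python
def relu(x):
    # Rectified linear unit: passes positive inputs, zeroes out negative ones
    return max(0.0, x)

x1, x2, x3 = 1.0, -2.0, 0.5   # hypothetical feature inputs
a, b = 0.7, 0.3               # hypothetical output-layer weights

# Hidden nodes apply ReLU to their total net input
a1 = relu(x1 + x2 + x3)
a2 = relu(x2 + x3)

y = a * a1 + b * a2           # the output node combines the hidden activations
print(y)
```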

Neural Networks

 Neural networks (Artificial neural networks, or ANNs)


 The feature inputs would be scaled (i.e.,
standardized) to account for differences in the

units of the data.

Neural Networks

 Neural networks have three types of layers


• Input layer (here with a node for each of the
three features);

• Hidden layers (here consisting of six hidden


nodes), where learning occurs in training and
inputs are processed on trained nets;
• Output layer (here consisting of a single node for
the target variable y): passes information to
outside the network;

Neural Networks

 Forward propagation Neural Network


[Figure: feed-forward neural network with three input nodes (Input #1, #2, #3), one hidden layer, and a single output node]

Neural Networks

 For this neural network, the numbers of nodes per layer (3, 6, and 1) are hyperparameters.
 The nodes are called “neurons.”

Neural Networks

 Each node in hidden layers has two functional parts:


 Once the node receives the three input values,
the summation operator multiplies each value by

a weight and sums the weighted values to form
the total net input.

Neural Networks

 The total net input is then passed to the activation


function, which transforms this input into the final
output of the node.

 The activation function is characteristically non-linear, such as an S-shaped (sigmoidal) function (with an output range of 0 to 1) or the rectified linear unit function, and it can decrease or increase the strength of the input.

Neural Networks

 Activation function
[Figure: two panels showing a sigmoid activation function mapping total net input to an output between 0 and 1; the sigmoid operator acts like a dimmer switch on the signal]

Neural Networks

 Predictions are compared to actual values of


labeled data and evaluated by a specified
performance measure (e.g., mean squared

error). Then, network weights are adjusted


(Learning takes place) to reduce total error of
the network.

Neural Networks

 If the process of adjustment works backward


through the layers of the network, this process
is called error backward propagation (BP).
 New weight = (Old weight) − (Learning rate) × (Partial derivative of the total error with respect to the old weight)
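A minimal sketch of this weight update rule; the weight, learning rate, and gradient values are hypothetical.

```python
old_weight = 0.80
learning_rate = 0.05
grad = 1.2  # hypothetical partial derivative of total error w.r.t. the weight

# New weight = old weight - learning rate * gradient
new_weight = old_weight - learning_rate * grad
print(new_weight)  # 0.74
```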

Deep Learning Nets

 Deep Learning Nets (DLNs)


 Neural networks with many hidden layers—at
least 3 but often more than 20 hidden layers.

 The DLN assigns the category based on the


category with the highest probability.
 Advances in DLNs have driven developments
in many complex activities, such as image,
pattern, and speech recognition.

Reinforcement Learning

 Reinforcement learning (RL)


 The RL algorithm involves an agent that
should perform actions that will maximize its

rewards over time, taking into consideration
the constraints of its environment.

Example 1

When analysts apply ML techniques to a model


including fundamental and technical variables to
predict next quarter’s return for each of the 100

stocks currently in a portfolio. Then, the 20 stocks


with the lowest estimated return are identified for
replacement. Assuming regularization is utilized in the machine learning technique used, which of the following ML models would be least appropriate?

Example 1

A. Regression tree with pruning.


B. LASSO with lambda (λ) equal to 0.
C. LASSO with lambda (λ) between 0.5 and 1

Answer = B

Example 2

When analysts utilize ML techniques to divide an


investable universe of about 10,000 stocks into 20
different groups, based on a wide variety of the

most relevant financial and non-financial


characteristics. The idea is to prevent unintended
portfolio concentration by selecting stocks from
each of these distinct groups. Which of the
following machine learning techniques is most
appropriate for this step:

Example 2

A. K-Means Clustering
B. Principal Components Analysis (PCA)
C. Classification and Regression Trees (CART)

Answer = A

How to choose among ML algorithms

 Summary of ML algorithms

Variables                 | Supervised
--------------------------|-----------------------------------------------------------
Continuous                | Regression: Linear; Penalized Regression/LASSO; Logistic; CART; Random Forest
Categorical               | Classification: Logit; SVM; KNN; CART
Continuous or categorical | Neural Networks; Deep Learning; Reinforcement Learning

How to choose among ML algorithms

 Summary of ML algorithms

Variables                 | Unsupervised
--------------------------|-----------------------------------------------------------
Continuous                | Dimensionality Reduction: PCA
Categorical               | Clustering: K-Means; Hierarchical
Continuous or categorical | Neural Networks; Deep Learning; Reinforcement Learning

CONTENTS
目录
Introduction to Linear Regression
Multiple Regression
Time-series Analysis

Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches:
Scenario Analysis, Decision Trees, and
Simulations”

Big Data Projects

 Characteristics of big data


 Steps in executing a data analysis project:
Financial forecasting with big data

 Data preparation and wrangling
 Data exploration objectives and methods
 Model Training

Characteristics of big data

 Big data is characterized by the presence of a set of characteristics commonly referred to as the 3Vs:
 Volume: the quantity of data.

 Variety: the array of available data sources.


 Velocity: the speed at which data are created.

Characteristics of big data

 When using big data for inference or


prediction, there is a “fourth V”:
 Veracity: the credibility and reliability of

different data sources.
 These Vs have numerous implications for financial technology (fintech) as it pertains to investment management.

Big Data Projects

 Characteristics of big data


 Steps in executing a data analysis project:
Financial forecasting with big data

 Data preparation and wrangling


 Data exploration objectives and methods
 Model Training

Steps in executing a data analysis project:
Financial forecasting with big data
 The traditional (with structured data) ML
model building steps:
 Conceptualization of the modeling task.

• Determining the output of the model;
how this model will be used and by
whom, and how it will be embedded in
business processes.

Steps in executing a data analysis project:


Financial forecasting with big data
 Data collection.
• Numeric data, including internal and external sources.
 Data preparation and wrangling.

• Involves cleansing and organizing raw data into a consolidated format.
 Data exploration.
• Exploratory data analysis, feature selection,
and feature engineering.
 Model training

Steps in executing a data analysis project:
Financial forecasting with big data
 The text ML Model Building Steps:
 Text problem formulation.
• Formulating the text classification problem, identifying the inputs and outputs, and specifying how the text ML model’s classification output will be utilized.
 Data (text) curation.
• Gathering external text data via web
services or web spidering programs.

Steps in executing a data analysis project:


Financial forecasting with big data
 Text preparation and wrangling.
• Converting streams of unstructured data to
structured inputs.

 Text exploration.
• Text visualization, text feature selection
and engineering.
 Model training.

Big Data Projects

 Characteristics of big data


 Steps in executing a data analysis project:
Financial forecasting with big data

 Data preparation and wrangling
 Data exploration objectives and methods
 Model Training

Data preparation and wrangling

 Data preparation and wrangling


 Domain knowledge is beneficial and often
necessary to successfully execute this step.

 Data Preparation (Cleansing) is the process of examining, identifying, and mitigating errors in raw data.

Data preparation and wrangling

 Data Wrangling (Preprocessing): Raw data


need to be presented in the appropriate
format for model consumption, and processed

by dealing with outliers, extracting useful
variables from existing data points, and
scaling the data.

Data preparation and wrangling

 Data preparation and wrangling


 Structured Data
 Unstructured (Text) Data

Data preparation and wrangling: Structured Data

 Data Preparation (Cleansing)


 In structured data, the data cleansing process mainly deals with identifying and mitigating all data errors.
• Incompleteness error: the data are not
present.
• Invalidity error: the data are outside of a
meaningful range.

Data preparation and wrangling: Structured Data

• Inaccuracy error: the data are not a measure


of true value.
• Inconsistency error: the data conflict with the corresponding data points or reality.
• Non-uniformity error: the data are not present in an identical format.
• Duplication error: duplicate observations are
present.

Possible errors in a raw dataset

ID | Name      | Gender | Date of Birth | Salary   | Province  | Marital status
1  | Mr. Zhao  | M      | 12/5/1980     | ¥200,200 | Shanghai  | Y
2  | Ms. Qian  | M      | 15 Jan, 1975  | ¥260,500 | Guangdong | Yes
3  | Sun       |        | 1/13/1989     | ¥265,000 | Beijing   | No
4  | Ms. Li    | F      | 1/1/1900      | NA       | Fujian    | Don’t Know
5  | Ms. Qian  | F      | 15/1/1976     | ¥160,500 |           | Y
6  | Mr. Zhou  | M      | 9/10/1971     | —        | Sichuan   | N
7  | Mr. Wu    | M      | 2/27/1966     | ¥300,000 | Jiangsu   | Y
8  | Ms. Zheng | F      | 4/4/1984      | ¥255,000 | SX        | N

Data preparation and wrangling: Structured Data

 Data Wrangling (Preprocessing)


 Data preprocessing primarily includes
transformations and scaling of the data.

• Extraction: A new variable can be


extracted from the current variable.
• Aggregation: Two or more variables can
be aggregated into one variable to
consolidate similar variables.

Data preparation and wrangling: Structured Data

• Filtration: The data rows that are not needed


for the project must be identified and filtered.
• Selection: The data columns that are

intuitively not needed for the project can be
removed.
• Conversion: The variables in the dataset must
be converted into appropriate types.

Data preparation and wrangling: Structured Data

 Scaling
 Scaling is a process of adjusting the range
of a feature by shifting and changing the

scale of data.
 It is important to remove outliers before
scaling is performed.

Data preparation and wrangling: Structured Data

 Several practical methods for handling outliers.


 When extreme values and outliers are simply
removed from the dataset, it is known as

trimming (also called truncation).
 When extreme values and outliers are
replaced with the maximum (for large value
outliers) and minimum (for small value outliers)
values of data points that are not outliers, the
process is known as winsorization.

Data preparation and wrangling: Structured Data

 Two of the most common ways of scaling


 Normalization: rescaling numeric variables in
the range of [0, 1].
Xi(normalized) = (Xi − Xmin) / (Xmax − Xmin)
 Standardization: both centering and scaling the variables (the variable will have an arithmetic mean of 0 and a standard deviation of 1).
Xi(standardized) = (Xi − μ) / σ
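A minimal sketch of both scaling methods in Python; the feature values are hypothetical.

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 12.0, 20.0])  # hypothetical feature values

# Normalization: rescales the values into the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: centers to mean 0 and scales to standard deviation 1
x_std = (x - x.mean()) / x.std()

print(x_norm, x_std)
```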

Data preparation and wrangling: Structured Data

 Normalization is sensitive to outliers, and it can be used when the distribution of the data is not known.
 Standardization is relatively less sensitive to

outliers. However, the data must be normally
distributed to use standardization.

Data preparation and wrangling

 Data preparation and wrangling


 Structured Data
 Unstructured (Text) Data

Data preparation and wrangling:
Unstructured (Text) Data
 Text processing is essentially cleansing and
transforming the unstructured text data into a
structured format.

 Text Preparation (Cleansing)
 Text cleansing involves using regular expressions (regex) to clean the text by removing unnecessary elements from the raw text.

Data preparation and wrangling:


Unstructured (Text) Data
 The text cleansing process steps.
 Remove HTML tags
 Remove punctuation
 Remove numbers
 Remove white spaces

Data preparation and wrangling:
Unstructured (Text) Data
<html>
<body>
<p><b>New U.S. claims for jobless benefits fell for

the third week in a row, hitting their lowest level in
nearly 49 years for the third straight week.</b></p>
<p>For the week ended September 12, new claims
for unemployment insurance fell to 201,000, down
3,000 from the prior week. Economists had instead
been expecting a result of 209,000.</p>

Data preparation and wrangling:


Unstructured (Text) Data
<p>The result was the lowest level since November
of 1969, whereas the prior week's level had been
the lowest since December 1969.</p>

</body>
</html>

Data preparation and wrangling:
Unstructured (Text) Data
 Text Wrangling (Preprocessing)
 A token is equivalent to a word, and
tokenization is the process of splitting a given

text into separate tokens.
[Figure: each cleaned text is split into its separate tokens]

Data preparation and wrangling:


Unstructured (Text) Data
 The normalization process in text processing
involves the following:
 Lowercasing the alphabet removes distinctions
among the same words due to upper and

lower cases.
 Stop words are such commonly used words as “the,” “is,” and “a”; they may be kept or removed.

Data preparation and wrangling:
Unstructured (Text) Data
 Stemming is the process of converting
inflected forms of a word into its base word
(known as stem).
 Lemmatization is the process of converting

inflected forms of a word into its morphological
root (known as lemma).
 Stemming or lemmatization decrease data
sparseness.
 After the cleansed text is normalized, a bag-of-
words (BOW) is created.

Data preparation and wrangling:


Unstructured (Text) Data
 BOW
[Figure: example bag-of-words, the collection of distinct tokens from all the cleansed and normalized texts]

Data preparation and wrangling:
Unstructured (Text) Data
 Document term matrix (DTM)
 Each row of the matrix belongs to a document
(or text file), and each column represents a

token (or term).
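A minimal sketch building a document term matrix with scikit-learn's CountVectorizer; the two sample texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleansed texts
docs = ["new claims for jobless benefits fell",
        "claims fell to the lowest level"]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Rows = documents (texts), columns = tokens (terms), cells = token counts
print(vectorizer.get_feature_names_out())
print(dtm.toarray())
```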

Data preparation and wrangling:


Unstructured (Text) Data
 N-grams: is a representation of word sequences.
 When one word is used, it is a unigram; a two-
word sequence is a bigram; and a 3-word

sequence is a trigram; and so on.

Example 1

The output produced by preparing and wrangling


textual data is best described as a:
A. data table.

B. confusion matrix.
C. document term matrix.

Answer = C

Example 2

When some words appear very infrequently in a


textual dataset, techniques that may address the
risk of training highly complex models include:

A. stemming.
B. scaling.
C. data cleansing.

Answer = A

Example 3

Which of the following statements concerning


tokenization is most accurate? Tokenization is
A. part of the text cleansing process.

B. most commonly performed at the character
level.
C. the process of splitting a given text into
separate tokens.

Answer = C

Big Data Projects

 Characteristics of big data


 Steps in executing a data analysis project:
Financial forecasting with big data

 Data preparation and wrangling


 Data exploration objectives and methods
 Model Training

Data exploration objectives and methods
 Data exploration
 Investigate and comprehend data
distributions and relationships, involves:
exploratory data analysis, feature selection,

and feature engineering.
 Exploratory data analysis (EDA)
 Exploratory graphs, charts, and other
visualizations, such as heat maps and
word clouds, are designed to summarize
and observe data.

Data exploration objectives and methods

 Feature selection
 Only pertinent features from the dataset
are selected for ML model training.

 Feature engineering
 Creating new features by changing or
transforming existing features.

Data exploration objectives and methods

 Data exploration objectives and methods


 Structured Data
 Unstructured (Text) Data

Data exploration objectives and methods:
Structured Data
 Exploratory data analysis
 For structured data, EDA can be performed
on a single feature (one-dimension) or on
multiple features (multi-dimension).

 The basic one-dimension exploratory


visualizations are as follows:
• Histograms
• Bar charts
• Box plots
• Density plots

Data exploration objectives and methods:
Structured Data
[Figure: example one-dimension visualizations: histogram, bar chart, box plot, and density plot]
Data exploration objectives and methods:
Structured Data
 Feature Selection
 The objective of the feature selection process
is to assist in identifying significant features.
 Statistical measures can be used to assign a

score gauging the importance of each feature.


 The dimensionality reduction method creates
new combinations of features that are
uncorrelated, whereas feature selection
includes and excludes features present in the
data without altering them.

Data exploration objectives and methods:
Structured Data
 Feature Engineering
 The feature engineering process attempts to
further optimize and improve the features.

• This action involves engineering an
existing feature into a new feature or
decomposing it into multiple features.
• For continuous data, a new feature may
be created by taking the logarithm of the
product of two or more features.

Data exploration objectives and methods:


Structured Data
• For categorical data, a new feature can be a
combination of two features or a decomposition
of one feature into many.

Data exploration objectives and methods

 Data exploration objectives and methods


 Structured Data
 Unstructured (Text) Data

Data exploration objectives and methods:
Unstructured (Text) Data
 Exploratory Data Analysis
 Term frequency (TF): the ratio of the number
of times a given token occurs in all the texts

in the dataset to the total number of tokens in


the dataset.
 The words with very high TF values (mostly stop words) are eliminated.

Data exploration objectives and methods:
Unstructured (Text) Data
 The most common applications:
• Text classification uses supervised ML
approaches to classify texts into different classes.

• Topic modeling uses unsupervised ML
approaches to group the texts in the dataset into
topic clusters.
• Sentiment analysis predicts sentiment (negative,
neutral, or positive) of the texts in a dataset using
both supervised and unsupervised approaches.

Data exploration objectives and methods:


Unstructured (Text) Data
 Feature Selection
 For text data, feature selection involves
selecting a subset of the terms or tokens

occurring in the dataset.


 Feature selection in text data aims to
eliminate noisy features [most frequent and
most sparse (or rare) tokens] from the dataset.

Data exploration objectives and methods:
Unstructured (Text) Data
 The general feature selection methods in text data
are as follows:
 Frequency measures can be used for

vocabulary pruning to remove noise features by:
• Filtering the tokens with very high and low
TF values across all the texts.

Data exploration objectives and methods:


Unstructured (Text) Data
• Using document frequency (DF) to discard the noise features that carry no specific information about the text class and are present across all texts.

 DF of a token: the number of documents (texts)


that contain the respective token divided by the
total number of documents.

Data exploration objectives and methods:
Unstructured (Text) Data
 Chi-square test: The chi-square test is applied to
test the independence of two events: occurrence of
the token and occurrence of the class.

 Tokens with the highest chi-square test statistic
values occur more frequently in texts associated
with a particular class and therefore can be
selected for use as features for ML model
training due to higher discriminatory potential.

Data exploration objectives and methods:


Unstructured (Text) Data
 Mutual information (MI) measures how much
information is contributed by a token to a class
of texts.

 The mutual information value will be equal to


0 if the token’s distribution in all text classes
is the same. The MI value approaches 1 as
the token in any one class tends to occur
more often in only that particular class of text.

Data exploration objectives and methods:
Unstructured (Text) Data
 Feature Engineering
 The goal of feature engineering is to
maintain the semantic essence of the text

while simplifying and converting it into
structured data for ML.

Data exploration objectives and methods:


Unstructured (Text) Data
 Techniques for feature engineering
• Numbers can be converted into a token with different lengths of digits representing different kinds of numbers, such as “/number4/” for a four-digit number.
• N-grams: Multi-word patterns that are
particularly discriminative can be identified
and their connection kept intact.

Data exploration objectives and methods:
Unstructured (Text) Data
• Name entity recognition (NER): The name entity
recognition algorithm analyzes the individual
tokens and their surrounding semantics while

referring to its dictionary to tag an object class
to the token.
• Parts of speech (POS): Similar to NER, parts of
speech uses language structure and
dictionaries to tag every token in the text with a
corresponding part of speech.

Big Data Projects

 Characteristics of big data


 Steps in executing a data analysis project:
Financial forecasting with big data

 Data preparation and wrangling


 Data exploration objectives and methods
 Model Training

Model Training

 Model fitting errors are caused by:


 Dataset Size
• Small datasets can lead to underfitting

of the model.
 Number of Features
• A dataset with a small number of
features can lead to underfitting
• A dataset with a large number of
features can lead to overfitting.

Model Training

 The three tasks of ML model training are


method selection, performance evaluation,
and tuning.
http://www.zejicert.cn

Model Training

 Method Selection
 Method selection is governed by the
following factors:

• Supervised or unsupervised learning.
• Type of data.
• Size of data.

Model Training
 To deal with mixed data, the results from more
than one method can be combined. Sometimes,
the predictions from one method can be used as
predictors (features) by another.

 Class imbalance, where the number of


instances for a particular class is significantly
larger than for other classes, may be a problem
for data used in supervised learning. Balancing
the training data can help alleviate such
problems.

Model Training
• Undersampling the majority class and oversampling the minority class
[Figure: undersampling the majority class; instances of the larger class in the original dataset are dropped until the final dataset has a balanced class distribution]

Model Training
[Figure: oversampling the minority class; instances of the smaller class in the original dataset are replicated until the final dataset has a balanced class distribution]

Model Training

 Performance Evaluation
 Error analysis.
• For classification problems, error

analysis involves computing four basic
evaluation metrics: true positive (TP),
false positive (FP), true negative (TN),
and false negative (FN) metrics. FP is
also called a Type I error, and FN is
also called a Type II error.

Model Training

• Confusion Matrix for Error Analysis

                  | Actual Class 1                        | Actual Class 0
Predicted Class 1 | True Positives (TP)                   | False Positives (FP) = Type Ⅰ Error
Predicted Class 0 | False Negatives (FN) = Type Ⅱ Error   | True Negatives (TN)
Model Training

• Precision and Recall


 Precision (P) = TP/(TP + FP).
 Recall (R) = TP/(TP + FN).

 Trading off precision and recall is subject
to business decisions and model
application.

Model Training

• The two overall performance metrics: Accuracy


and F1 score.
 Accuracy = (TP + TN)/(TP + FP + TN + FN).

 F1 score = (2 * P * R)/(P + R).
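A minimal sketch computing these four metrics from confusion-matrix counts; the counts are hypothetical.

```python
tp, fp, tn, fn = 80, 20, 90, 10  # hypothetical confusion-matrix counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f1)
```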

Model Training

 F1 score is more appropriate when there is unequal class distribution in the dataset.
 Accuracy is considered an appropriate performance measure if the distribution of classes in the dataset is equal.
 High scores on both of these metrics suggest
good model performance.

Model Training

 Receiver Operating Characteristic (ROC)


• Plot a curve showing the trade-off between
the false positive rate (x-axis) and true
http://www.zejicert.cn

positive rate (y-axis) for various cutoff points.


 False positive rate (FPR) = FP/(TN + FP)
 True positive rate (TPR) = TP/(TP + FN).

Model Training

• A more convex curve indicates better model


performance.
• Area under the curve (AUC) is the metric that

measures the area under the ROC curve.
 An AUC close to 1.0 indicates near perfect
prediction, while an AUC of 0.5 signifies
random guessing.

Model Training

 ROC Curves and AUCs
[Figure: ROC curves for three models; y-axis: true positive rate (TPR), x-axis: false positive rate (FPR); Model 1 (AUC > 0.9) is the most convex, Model 2 has AUC > 0.7, and Model 3 (AUC = 0.5) lies on the diagonal, equivalent to random guessing]
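A minimal sketch computing the ROC curve and AUC with scikit-learn; the labels and predicted probabilities are hypothetical.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # hypothetical actual labels
y_score = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]  # predicted probabilities

# FPR/TPR trade-off at each cutoff point, then the area under the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))
```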

Model Training

 Root Mean Squared Error (RMSE).


• This measure is appropriate for continuous
data prediction and is mostly used for

regression methods.
• RMSE = √( ∑ᵢ₌₁ⁿ (Predictedᵢ − Actualᵢ)² / n )
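A minimal RMSE sketch in Python; the predicted and actual values are hypothetical.

```python
import math

predicted = [2.5, 0.0, 2.1, 7.8]  # hypothetical model predictions
actual = [3.0, -0.5, 2.0, 8.0]    # hypothetical actual values

n = len(actual)
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
print(rmse)
```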

Model Training

 Tuning
 Model fitting has two types of error: bias and
variance. Bias error is associated with

underfitting, and variance error is associated


with overfitting.
 The bias–variance trade-off is critical to
finding an optimum balance where a model
neither underfits nor overfits.

Model Training

 Parameters are critical for a model and are


dependent on the training data.
 Hyperparameters are used for estimating

model parameters and are manually set and
tuned, not dependent on the training data.

Model Training

 Tuning heuristics techniques such as grid


search are used to obtain the optimum values
of hyperparameters.

 Grid search trains models using various combinations of hyperparameter values, cross-validates each model, and determines which combination of hyperparameter values ensures the best model performance.
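A minimal grid search sketch with scikit-learn; the KNN model and its hyperparameter grid are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=2)

# Try each k, cross-validate each candidate model, keep the best combination
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
                    cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```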

Model Training

 In the case of a complex model, where a large model is composed of sub-models, ceiling analysis can help determine which sub-model needs to be tuned to improve the overall accuracy of the larger model.
 Ceiling analysis is a systematic process of
evaluating different components in the
pipeline of model building.

Model Training

 Fitting Curve for Regularization Hyperparameter (λ)
[Figure: training error (Error_train) and cross-validation error (Error_cv) plotted against the amount of regularization; with slight regularization the model overfits (high variance, Error_cv >> Error_train), and with large regularization it underfits (high bias, both errors large)]
Example

The confusion matrix is presented in the table below.

                    | Actual Class ‘1’ | Actual Class ‘0’
Predicted Class ‘1’ | TP = 284         | FP = 35
Predicted Class ‘0’ | FN = 7           | TN = 110

Example

1. Based on confusion matrix , the model’s precision


metric is closest to:
A. 89%.

B. 90%.
C. 98%.

Answer = A

Example

2. Based on confusion matrix , the model’s F1 score


is closest to:
A. 77%.

B. 81%.
C. 93%.

Answer = C

Example

3. Based on confusion matrix , the model’s accuracy


metric is closest to:
A. 86%.

B. 90%.
C. 83%.

Answer = B
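A quick verification of the three answers from the confusion-matrix counts above:

```python
tp, fp, fn, tn = 284, 35, 7, 110

precision = tp / (tp + fp)                          # ~0.89 -> 89% (answer A)
recall = tp / (tp + fn)                             # ~0.98
f1 = 2 * precision * recall / (precision + recall)  # ~0.93 -> 93% (answer C)
accuracy = (tp + tn) / (tp + fp + fn + tn)          # ~0.90 -> 90% (answer B)

print(precision, recall, f1, accuracy)
```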

CONTENTS
目录
Introduction to Linear Regression
Multiple Regression
Time-series Analysis

Machine Learning
Big Data Projects
Excerpt from “Probabilistic Approaches:
Scenario Analysis, Decision Trees, and
Simulations”

Scenario analysis, decision trees,


and simulations
 Scenario analysis, which applies probabilities
to a small number of possible outcomes;
 Decision trees, which use tree diagrams of
possible outcomes, are techniques used to

assess risk;
 Simulations are also used to assess risk.


Scenario analysis, decision trees,


and simulations
 Steps in Simulation
 Determine “probabilistic” variables;
 Define probability distributions for these variables:

• Historical data;
• Cross sectional data;
• Statistical distribution and parameters.
 Check for correlation across variables;
 Run the simulation.

Scenario analysis, decision trees,
and simulations
 Issues in simulation:
 Garbage in, garbage out
 Real data may not fit distributions
 Non-stationary distribution

 Changing correlation across inputs

Scenario analysis, decision trees,


and simulations
 Risk Type and Probabilistic Approaches
Discrete/Continuous | Correlated/Independent | Sequential/Concurrent | Risk Approach
Discrete            | Independent            | Sequential            | Decision tree
Discrete            | Correlated             | Concurrent            | Scenario analysis
Continuous          | Either                 | Either                | Simulations

THANKS

