Econometrics 2
“Explain y in terms of x”: $y = \beta_0 + \beta_1 x + u$
Where:
1. x and y: the independent (explanatory) variable and the dependent variable
2. $\beta_1$: slope parameter: the relationship between x and y, holding the other
factors in u fixed
3. $\beta_0$: intercept parameter
The variable u, called the error term or disturbance in the relationship,
represents factors other than x that affect y. A simple regression analysis
effectively treats all factors affecting y other than x as being unobserved. You
can usefully think of u as standing for “unobserved.”
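If the population regression function (PRF) has the form
$E(Y \mid X) = \beta_1 + \beta_2 X$ (in this notation $\beta_1$ is the intercept
and $\beta_2$ the slope), the PRF describes the average relationship between the
dependent variable and the explanatory variable in the population.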
SESSION 1 (6/5/2021)
1. DEFINITION
1. Question of interest
2. Economic model
3. Econometric model
4. Data collection
5. Estimation of econometric model
6. Diagnosing model problems (for example: multicollinearity, heteroskedasticity,
normality)
7. Hypotheses postulated
8. Result analysis and policy implications
Step 1. Question of interest based on economic theories
In which
• C: Consumption of the households
• $\beta_0$, $\beta_1$: parameters (coefficients)
• For example, besides the variable “income”, other variables can affect
household consumption: the number of family members, the age of the
household head, and so on.
$C_i = \beta_0 + \beta_1 I_i + u_i$ (2)
The researchers cannot know all the factors that affect the dependent variable Y.
Even if they knew all the factors, it would be impossible to collect data on
every one of them.
• Primary vs. secondary data
• The Structure of Economic Data
– Cross-sectional Data
– Time-series Data
– Pooled Data
• Pooled cross-sectional data
• Panel Data
• These data are obtained by random sampling from the underlying population.
D, Panel data
• A panel data set consists of a time series for each cross-sectional member in
the data set.
• The same cross-sectional units are followed over a given time period.
Panel ID  Province          YEAR  FDI     ODA      POPU    IZ  MOUNTAIN
1         An Giang          2004  145     40.61    2170.1  0   0
1         An Giang          2005  139     41.51    2194    0   0
1         An Giang          2006  140     30.60    2210.4  0   0
2         Ba Ria Vung Tau   2004  64776   1220.01  897.6   7   0
2         Ba Ria Vung Tau   2005  71441   157.99   913.1   7   0
2         Ba Ria Vung Tau   2006  106618  11.55    926.3   7   0
....      ....              ....  ....    ....     ....    ....  ....
63        Vinh Phuc         2004  7340    5.24     1154.8  2   0
63        Vinh Phuc         2005  9340    7.36     1169    2   0
63        Vinh Phuc         2006  12776   27.73    1180.4  3   0
64        Yen Bai           2004  96      3.04     723.5   0   1
64        Yen Bai           2005  103     6.13     731.8   0   1
64        Yen Bai           2006  113     9.80     740.7   0   1
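As a rough illustration of how a table like this could be declared as panel data in Stata, the sketch below types in the twelve rows listed above (decimal commas rewritten as decimal points) and calls `xtset`. The lowercase variable names (`id`, `province`, `mountain`, and so on) are assumptions made for this example, as is the choice to enter only the four provinces shown.

```stata
clear
input id str20 province year fdi oda popu iz mountain
1  "An Giang"        2004 145    40.61   2170.1 0 0
1  "An Giang"        2005 139    41.51   2194   0 0
1  "An Giang"        2006 140    30.60   2210.4 0 0
2  "Ba Ria Vung Tau" 2004 64776  1220.01 897.6  7 0
2  "Ba Ria Vung Tau" 2005 71441  157.99  913.1  7 0
2  "Ba Ria Vung Tau" 2006 106618 11.55   926.3  7 0
63 "Vinh Phuc"       2004 7340   5.24    1154.8 2 0
63 "Vinh Phuc"       2005 9340   7.36    1169   2 0
63 "Vinh Phuc"       2006 12776  27.73   1180.4 3 0
64 "Yen Bai"         2004 96     3.04    723.5  0 1
64 "Yen Bai"         2005 103    6.13    731.8  0 1
64 "Yen Bai"         2006 113    9.80    740.7  0 1
end
* Declare the panel structure: id is the cross-sectional unit, year is the time variable
xtset id year
* Show how many periods each unit is observed for
xtdescribe
```

Each province appears once per year, so Stata should report the panel as balanced; a missing province-year row would make it unbalanced.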
• Data sources (the source of the data must be cited in any analysis):
– Experimental data
– Available data
Data -> Stata, Eviews, SPSS -> estimate parameters of the model (2)
$\hat{\beta}_0 = -184.08$ and $\hat{\beta}_1 = 0.7064$
The “hat” above the variable C shows that this is an estimate of that
variable.
Slope estimate = 0.7064: if income increases by 1 billion USD, consumption
is expected to increase by about 706 million USD.
- Normality of u
To see if the estimated results are consistent with (supportive of) the theories.
If the model is appropriate and the estimated results are consistent with the
theories -> provide policy implications
Step 9: Forecasting
Lecture 2: The Linear Regression Model 1
• The term “regression” was introduced by Galton (1886) when he studied the
relationship between the heights of sons and the heights of their fathers
• Given the height of the fathers, the heights of the sons are distributed around
a mean value
• On average, when the height of the fathers increases, the height of the sons
also increases
• This line is called the regression line; it shows the relationship between the
height of sons and the height of fathers on average
For example: Galton studied the relationship between the height of fathers and
the height of sons in one city. He collected data on all fathers who had adult
sons, so he could build the PRF.
So $E(Y \mid X_i)$ is a function of the independent variable $X_i$:
– If the PRF has 2 or more independent variables -> multiple regression function
$E(Y \mid X_i) = \beta_0 + \beta_1 X_i$ [2]
The individual values $Y_i$ are not necessarily the same as $E(Y \mid X_i)$, but
they lie around $E(Y \mid X_i)$.
• Note: $u_i$ is the difference between $Y_i$ and $E(Y \mid X_i)$ (the distance
from the actual value of observation i to its expected value, that is, the
mean), so we have:
$u_i = Y_i - E(Y \mid X_i)$ [3]
Or:
$Y_i = E(Y \mid X_i) + u_i$ [4]
• In reality, we cannot survey the whole population -> we cannot build the PRF
• We can only estimate the expected value of Y, or in other words, estimate
the PRF based on sample(s) taken from the population
Graph 2.03. Scatter plot and regression lines of the 2 samples, SRF1 and SRF2
• From the population, we can get many samples. With each sample, we can
have a SRF
• To have the “best” SRF, meaning the SRF that is the closest estimate of the
PRF, we have to rely on some criteria even though we do not have the PRF to
compare against.
$\hat{u}$: residual
• Assume that:
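SRF: $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i$ [3.02]
Where $\hat{Y}_i$ is the predicted (fitted) value of $Y_i$
Problem of opposite signs: positive and negative residuals cancel each other out
when summed, whereas the distances themselves do not.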
6. Formula of calculating :
SESSION 3 (13/5/2021)
Example
c, Write SRF
Example
X 5 4 2 8 8
Y 1 2 3 4 5
• Beta0 = 69/68; beta1 = 25/68
• SST = 10, SSE (explained sum of squares) = 125/34, SSR (residual sum of
squares) = 215/34
• R-square = SSE/SST = 0.3676
It means that income can explain 36.76% of the sample variation in people's
consumption. So 63.24% of the sample variation in consumption is explained by
other factors that are not included in the model.
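The figures in this example can be checked in Stata. The sketch below types the five (X, Y) pairs in by hand and runs OLS; the lowercase variable names `x` and `y` are chosen for the example. Stata labels the explained part "Model" SS (stored in `e(mss)`) and the unexplained part "Residual" SS (stored in `e(rss)`).

```stata
clear
input x y
5 1
4 2
2 3
8 4
8 5
end
* OLS regression of y on x
regress y x
* Slope and intercept, to compare with 25/68 and 69/68
display "beta1 = " _b[x]
display "beta0 = " _b[_cons]
* Explained, residual and total sums of squares
display "explained SS = " e(mss)
display "residual SS  = " e(rss)
scalar sst = e(mss) + e(rss)
display "total SS     = " sst
display "R-squared    = " e(r2)
```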
Example
X 6 5 2 4 4
Y 5 2 2 3 1
E(X)=21/5 , E(Y)=13/5
Var(X)=1.76 , Var(Y)=1.84
Beta0 = 1/44 ; beta1 = 27/44
SST = 9.2, SSE ≈ 3.31, SSR ≈ 5.89, R-square ≈ 36.02%
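The steps behind these numbers can be written out with the standard simple-regression OLS formulas. In this sketch SSE denotes the explained sum of squares and SSR the residual sum of squares, as in the example figures; $S_{XX}$ and $S_{XY}$ are shorthand introduced here. Carrying full precision gives an R-square of about 0.360, while rounding SSE to 3.3 before dividing gives about 0.359.

```latex
\begin{aligned}
\bar{X} &= \tfrac{21}{5} = 4.2, \qquad \bar{Y} = \tfrac{13}{5} = 2.6 \\
S_{XX} &= \sum_i (X_i - \bar{X})^2 = 8.8, \qquad
S_{XY} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = 5.4, \qquad
SST = \sum_i (Y_i - \bar{Y})^2 = 9.2 \\
\hat{\beta}_1 &= \frac{S_{XY}}{S_{XX}} = \frac{5.4}{8.8} = \frac{27}{44}, \qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 2.6 - \tfrac{27}{44}(4.2) = \tfrac{1}{44} \\
SSE &= \hat{\beta}_1 S_{XY} = \tfrac{27}{44}(5.4) \approx 3.314, \qquad
SSR = SST - SSE \approx 5.886 \\
R^2 &= \frac{SSE}{SST} \approx \frac{3.314}{9.2} \approx 0.360
\end{aligned}
```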
The OLS estimators are expressed solely in terms of the observable (i.e.,
sample) quantities.
They are point estimators; that is, given the sample, each estimator will
provide only a single (point) value of the relevant population parameter.
Once the OLS estimates are obtained from the sample data, the sample
regression line can be easily obtained.
2. The mean value of the estimated $\hat{Y}_i$ is equal to the mean value of the
actual $Y_i$: $\bar{\hat{Y}} = \bar{Y}$
3. The mean value of the residuals is zero.
$R^2$ is the fraction (percentage) of the sample variation in Y that is explained
by X.
Example:
Note: If the model violates assumptions 1-4 => we cannot run the model
• Assumption 5: The error term has an expected value of zero given any value
of the explanatory variable. In other words, E(u|X)=0.
This assumption simply says that the factors not explicitly included in the
model, and therefore subsumed in $u_i$, do not systematically affect the mean
value of Y; the positive $u_i$ values cancel out the negative $u_i$ values so
that their average effect on Y is zero.
Theorem 1: Unbiasedness of OLS
Given the assumptions, we have:
$E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$
for any values of $\beta_0$ and $\beta_1$. In other words, $\hat{\beta}_0$ is
unbiased for $\beta_0$, and $\hat{\beta}_1$ is unbiased for $\beta_1$.
Note: $R^2$
Formula for R-squared.
The formula for R-squared comes from the idea that the total variation in the
dependent variable is split into two parts: the variation due to the regression and
the variation not due to the regression (the residual part).
$R^2 = 1 - \dfrac{RSS}{TSS}$
Explained (Regression) Sum of Squares (ESS): the sum of squared deviations explained
by the regression model
Residual Sum of Squares (RSS): the sum of squared residuals
Total Sum of Squares (TSS): the total sum of squared deviations
(In the worked examples earlier, these are written SSE, SSR and SST respectively.)
7. R-squared ranges from 0 to 1. The closer R-squared is to 1, the better the model
fits the data used to run the regression. The closer R-squared is to 0, the worse
the model fits the data used to run the regression.
Special case: in a simple regression (only 1 independent variable), $R^2$ is the
square of the correlation coefficient r between the two variables.
$\mathrm{Var}(\hat{\beta}_j) = \dfrac{\sigma^2}{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2 \, (1 - R_j^2)}$
=> When roe increases by 1 percentage point, the CEO's annual salary is expected to
increase by 18.501 thousand USD
Case 1
salarydol = 1000*salary
$\widehat{salarydol} = 963191 + 18501\,roe$
=> If the dependent variable is multiplied or divided by a constant c, then the OLS
intercept and slope estimates are also multiplied or divided by c.
Case 2
roedec = roe/100
$\widehat{salary} = 963.191 + 1850.1\,roedec$
• The coefficient of roedec is 100 times greater than the coefficient of roe in [1]
=> If the independent variable is divided or multiplied by some nonzero constant c, then
the OLS slope coefficient is multiplied or divided by c, respectively. The intercept is
unchanged.
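Both rescaling rules can be checked on any data set. The sketch below uses six made-up observations (hypothetical values, not the CEO salary data behind the estimates above) and compares the coefficients before and after rescaling; `b0` and `b1` are scalar names chosen for the example.

```stata
clear
* Hypothetical data, for illustration only
input salary roe
900  10
1200 15
1000 12
1500 20
800  8
1300 18
end
regress salary roe
scalar b0 = _b[_cons]
scalar b1 = _b[roe]

* Case 1: multiply the dependent variable by 1000
generate double salarydol = 1000*salary
regress salarydol roe
* Both ratios should be 1000
display _b[_cons]/b0
display _b[roe]/b1

* Case 2: divide the independent variable by 100
generate double roedec = roe/100
regress salary roedec
* The intercept ratio should be 1 and the slope ratio 100
display _b[_cons]/b0
display _b[roedec]/b1
```

The new variables are stored as `double` so that rounding in the default `float` storage type does not blur the comparison.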
LECTURE 3: HYPOTHESIS TEST
20/5/2021. LECTURE 4
3 types of file:
Data: wage.dta
Log: stores the output of the software session (.smcl, .log)
Do-file: contains the commands
Log file: stores all the results and commands
log using "…"
Example: log using "D:\Practice_econometrics"
*Command
1. des: provides the meaning and the measurement of the variables
Obs: observations
Vars: variables
2 KINDS OF VARIABLE
- Quantitative and qualitative
a) Quantitative variable: a random variable whose values are numbers, and those
numbers are meaningful algebraically
Ex: educ: education level
Obs Educ
1 16
2 12
3 15
4 9
Example:
SOE: State-Owned Enterprise
FDI: Foreign Direct Investment
*Wage
Has 526 observations
Mean = 5.896
Mean: the average value. In statistics, it is a measure of the central
tendency of the data. It is also regarded as an expected value.
Standard deviation = 3.693
SD: standard deviation: the spread of the variable around its mean. The
smaller this value, the less the observations differ from the mean.
Conversely, a large value shows that the surveyed subjects differ widely on
that variable, so their scores are spread far apart.
(Dummy variables are usually not analyzed this way because their numeric values
have no quantitative meaning)
educ has the strongest effect on wage (measured by the difference between the 2
group means), followed by gender and then marital status (married vs.
unmarried). Being nonwhite or white has only a slight effect on wage
20/5/2021 SESSION 5
tab wage
wage is a continuous variable and takes many values
→ we should not use the command “tab” for a continuous variable
4. gen: generates (creates) a new variable when it does not already exist in the
data table
gen newvar =
After creating the variable, we should add the meaning of the new variable with
the command:
NOTE: Use straight quotation marks ("), not curly ones
Ex:
Create a dummy variable showing the education level of 2 groups: graduated
from university vs. not yet graduated from university
gen … if …
replace
Answer:
gen graduated = 1 if educ >= 16
replace graduated = 0 if educ < 16
We have a new variable: graduated
Create a dummy variable showing the experience of 2 groups: less than 20
years vs. 20 years or more
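One way to build both dummies in a single step each is sketched below, as an alternative to the gen … if / replace pattern. The five observations are made up for illustration, and the names `exper20` and the label texts are assumptions. The `if !missing()` qualifier matters because Stata treats a missing value as larger than any number, so `educ >= 16` is true when educ is missing. Attaching the meaning with `label variable` is also an assumption about which command the notes intend.

```stata
clear
* Hypothetical values of educ and exper, for illustration only
input educ exper
16 5
12 22
18 20
9  35
.  10
end
* A logical expression evaluates to 1 (true) or 0 (false);
* the if qualifier leaves the dummy missing when educ is missing
generate graduated = (educ >= 16) if !missing(educ)
label variable graduated "1 = at least 16 years of education"

generate exper20 = (exper >= 20) if !missing(exper)
label variable exper20 "1 = 20 or more years of experience"

* Check the result, showing missing values as their own category
tabulate graduated, missing
tabulate exper20
```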
5. list in/if
sort …
list … in (observation numbers)
Exercise: list the 10 people with the lowest wage and the 10 people with the
highest wage
sort wage
list wage in 1/10 (observations 1-10)
list wage in 517/526 (observations …)
Calculate the average wage of the 10 people with the lowest wage and of the 10
people with the highest wage
sort wage
sum wage in 1/10
sum wage in 517/526
drop in 1/20
7. Rename
ttest var
Then use the command rvfplot to plot the residuals against the fitted values of
the dependent variable in the model. Adding the option yline(0) draws a
horizontal line at residual = 0. The value 0 is the mean of the residuals.
Step 2: Set up the mathematical model (skip)
Data source
Number of observations
Years of survey
Check the correlation of Y and X: before running the regression, produce the
correlation matrix (and report it in the research)
corr Y X -> correlation matrix
corr wage educ exper nonwhite female married south
Topic 3: Analyze the relationship between the rice output and the rainfall of
Vietnam
Y: Rice output
X: Rainfall
reg Y X
So 67.75% of the sample variation in wage is explained by other variables that are
not included in the model. In theory, they are captured by u (the error term)
P-VALUE EXPLAINED
If we calculate t = 1.76 with alpha = 5% and n > 300 -> critical value = 1.96
When testing a single coefficient, we cannot reject H0 at alpha = 5% (because
|t| = 1.76 < 1.96)
p-value
In statistics there are 2 types of error: Type I error and Type II error
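For the t = 1.76 example, the critical value and the p-value can be computed directly with Stata's distribution functions. This is only a sketch: 300 degrees of freedom is an assumed figure standing in for "n > 300". Here alpha is the probability of a Type I error the test is allowed to make.

```stata
* Two-sided critical value at alpha = 5%, normal approximation
display invnormal(0.975)
* Same critical value from the t distribution with 300 degrees of freedom (assumed)
display invttail(300, 0.025)
* Two-sided p-value for t = 1.76
display 2*ttail(300, 1.76)
```

The p-value comes out at roughly 0.08, which is above 0.05, so the p-value approach and the critical-value approach agree: H0 is not rejected at the 5% level, although it would be rejected at 10%.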
27/5/2021 SESSION 7:
Cause:
Consequence
- Biased estimation
Theorems 4.1 and 4.2 (lecture 3) are not satisfied (reread the textbook, section
3.3: The Expected Value of the OLS Estimators)
The t statistic does not have a t distribution
Inexact hypothesis tests
If the test shows that the model has a misspecified functional form, we need to
change the functional form
H0: the model has no omitted variables (the functional form is not
misspecified)
P-value < alpha = 0.05 -> reject H0 -> the model has a misspecified
functional form -> we have to change the functional form
Generate a new variable
gen educsq = educ^2
reg lwage educ educsq exper nonwhite female married south
-> p-value > alpha -> do not reject H0 -> the model has no omitted variables
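The notes do not show which command produces this test. One Stata command whose null hypothesis is stated in these terms is Ramsey's RESET, run with `estat ovtest` after `regress`; treating it as the test used in class is an assumption. The sketch uses Stata's built-in `auto` data set instead of wage.dta so that it runs on its own, and `weightsq` is a name made up for the example.

```stata
sysuse auto, clear
regress price weight
* H0: the model has no omitted variables
estat ovtest
* Add a squared term and test again, mirroring the educsq step
generate weightsq = weight^2
regress price weight weightsq
estat ovtest
* The p-value of the last test is stored in r(p)
display r(p)
```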
Command: corr lwage educ educsq exper nonwhite female married south
+ If the model has heteroskedasticity -> the estimators are still linear and unbiased,
but no longer the best (minimum variance)
+ The usual estimates of the coefficient variances are biased -> the standard errors
(SE) are biased -> hypothesis tests are inexact
When heteroskedasticity is present and OLS is still used to estimate the model,
the OLS estimators are still linear and unbiased, but their estimated variances
are biased.
- The estimated variances are no longer accurate.
- Confidence intervals and the conclusions of hypothesis tests about the
regression coefficients are no longer valid.
- Forecasts are no longer reliable.
- Single-variable function
As the distribution of the residuals does not converge into any certain direction ->
predict that the model has heteroskedasticity
b, White test
Example (Question 3): If the original regression model has 4 independent variables,
how many independent variables does the auxiliary regression have when the White
test is run on the residuals of the estimated model?
14 variables
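The count of 14 follows from what the White auxiliary regression contains: the k original regressors, their k squares, and every pairwise cross product. This sketch assumes none of the 4 regressors is a dummy variable (a dummy's square equals the dummy itself and would be dropped).

```latex
\underbrace{k}_{\text{levels}} + \underbrace{k}_{\text{squares}}
+ \underbrace{\frac{k(k-1)}{2}}_{\text{cross products}}
= 4 + 4 + \frac{4 \cdot 3}{2} = 4 + 4 + 6 = 14
```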
STATA:
Command: imtest, white
If (Prob > chi2) > alpha = 5% -> do not reject H0 -> the model has no
heteroskedasticity
Example:
c, Breusch-Pagan test
Command: hettest
If (Prob > chi2) < alpha = 5% -> reject H0 at alpha = 5%
The model has heteroskedasticity
Note: If the methods give different results -> follow the test that concludes
there is heteroskedasticity.
Solution:
- Robust standard errors
reg Y X, robust
reg lwage educ educsq exper nonwhite female married south, robust
This method only adjusts the standard errors; it does not correct the model ->
people often use this method because it is simple
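The whole sequence (estimate, test with White and Breusch-Pagan, re-estimate with robust standard errors) can be sketched as follows. Stata's built-in `auto` data set stands in for wage.dta so the example is self-contained; `estat imtest, white` and `estat hettest` are the current spellings of the `imtest, white` and `hettest` commands used in these notes. The scalar names are chosen for the example.

```stata
sysuse auto, clear
regress price mpg weight
scalar b_mpg  = _b[mpg]
scalar se_mpg = _se[mpg]
* White test, H0: homoskedasticity
estat imtest, white
* Breusch-Pagan test, H0: constant variance
estat hettest
* Same model with heteroskedasticity-robust standard errors
regress price mpg weight, robust
* The coefficient is unchanged; only the standard error differs
display _b[mpg] - b_mpg
display _se[mpg]
display se_mpg
```

Comparing the two regressions shows that the robust option leaves the coefficient estimates alone and changes only the standard errors, t statistics and confidence intervals.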
- 3 methods of hypothesis testing
Critical value (t test)
p-value
Confidence interval
NOTE: carry out the hypothesis tests on the final model, after the problems
have been fixed
Test each regression coefficient and test the overall fit of the model
Note: We have to include the variables female and married from the model
METHOD 2:
We can generate an interaction variable of 2 dummy
variables:
STATA:
Singmale is base group
(3)
- The intercept for women is below that for men, but the slope on education is
larger for women.
- This means that women earn less than men at low levels of education, but the gap
narrows as education increases.
- At some point, a woman earns more than a man.
Stata: gen femaleeduc = female*educ
reg wage female educ femaleeduc
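The education level at which the two wage lines cross can be derived from the interaction model. The coefficient symbols below ($\delta_0$ on female, $\delta_1$ on the interaction) are notation assumed for this sketch, since the estimated equation itself is not reproduced in the notes.

```latex
\begin{aligned}
wage &= \beta_0 + \delta_0\, female + \beta_1\, educ + \delta_1\, female \cdot educ + u \\
E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ) &= \delta_0 + \delta_1\, educ \\
\delta_0 + \delta_1\, educ^{*} = 0 \quad &\Rightarrow \quad educ^{*} = -\frac{\delta_0}{\delta_1}
\end{aligned}
```

With $\delta_0 < 0$ (lower intercept for women) and $\delta_1 > 0$ (steeper slope for women), $educ^{*}$ is positive; women are predicted to earn more than men only above that level, which may lie outside the range of education observed in the sample.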
7/6/2021 SESSION 10
CHAPTER 10: MULTIPLE REGRESSION WITH A BINARY DEPENDENT
VARIABLE
For example:
$inlf = 0.5 + 0.038\,educ - 0.02\,female + \hat{u}_i$
+ Holding other factors fixed, another year of education increases the
probability of participating in the labor force by 3.8 percentage points
+ Holding other factors fixed, a female has a lower probability of participating
in the labor force, by 2 percentage points, compared with a male
1. Write the LPM.
2. Interpret the coefficients
3. Estimate the probability of getting a job when GS and EN equal (70, 80) and
(60, 60).
2. Interpretation:
- P-value < 0.05 -> both GS and EN are statistically significant for Y.
- Holding other factors fixed, when GS increases by 1 unit, the probability of
getting a job increases by about 2 percentage points.
- Holding other factors fixed, when EN increases by 1 unit, the probability of
getting a job increases by about 2.9 percentage points.
3. When
- GS = 70, EN = 80
$\hat{p}_{LPM}(Y=1 \mid X) = -3.01567 + 0.020158 \times 70 + 0.02922 \times 80 = 0.733$
- GS = 60, EN = 60
$\hat{p}_{LPM}(Y=1 \mid X) = -3.01567 + 0.020158 \times 60 + 0.02922 \times 60 = -0.053$
(a negative value, outside the [0, 1] range of a probability)
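The two predictions can be reproduced with Stata scalars, using the coefficients quoted in the calculation above. The scalar names are chosen for this sketch; with the estimation results still in memory, `_b[GS]` and `_b[EN]` could be used instead of typing the numbers.

```stata
* Coefficients copied from the estimated LPM
scalar b0  = -3.01567
scalar bGS = 0.020158
scalar bEN = 0.02922
* Predicted probability at GS = 70, EN = 80
display b0 + bGS*70 + bEN*80
* Predicted "probability" at GS = 60, EN = 60 (falls below zero)
display b0 + bGS*60 + bEN*60
```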
+ If we want G(z) to have value in the interval (0,1) => Logit and Probit model
are among the choices.
1. Logit Model
$G(z) = \dfrac{e^z}{1 + e^z}$
+ In the logit model, G(z) is the logistic function which is between 0 and 1 for all
real numbers z.
+ This is the cdf for a standard logistic random variable.
+ $P(y=1 \mid x) = G(z) = p_i$ and $P(y=0 \mid x) = 1 - G(z) = 1 - p_i$
At each value of $X_i$, the probability that event A happens is $p_i$.
When $X_j$ changes by 1 unit, the probability changes by approximately
$p_i(1 - p_i)\beta_j$.
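The marginal-effect expression follows from differentiating the logistic function. The derivation below writes the index as $z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$, a notation assumed here for the linear part of the model.

```latex
\begin{aligned}
G(z) &= \frac{e^{z}}{1 + e^{z}}, \qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \\
G'(z) &= \frac{e^{z}(1 + e^{z}) - e^{z} e^{z}}{(1 + e^{z})^{2}}
       = \frac{e^{z}}{(1 + e^{z})^{2}} = G(z)\,[1 - G(z)] \\
\frac{\partial p_i}{\partial x_j} &= G'(z)\,\beta_j = p_i (1 - p_i)\,\beta_j
\end{aligned}
```

Because $p_i(1 - p_i)$ depends on $z$, the effect of a one-unit change differs across observations and is largest where $p_i = 0.5$.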
b. Maximum likelihood estimation (MLE)
- Because the function E(y|x) is nonlinear, OLS is no longer appropriate.
- Maximum likelihood estimation is more appropriate because it is based on the
conditional distribution of y.
logit y x
probit y x
Example: logit coursechoice read math
Note:
Use scalar only for probit, not for logit
mfx can be used for both logit and probit
Command: mfx
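A self-contained sketch of the logit and probit commands is given below, using the binary variable `foreign` from Stata's built-in `auto` data set in place of coursechoice. It uses `margins`, the command that superseded `mfx` from Stata 11 onward; that substitution is an assumption about the Stata version in use.

```stata
sysuse auto, clear
* Binary dependent variable: foreign (1 = foreign car, 0 = domestic)
logit foreign mpg weight
* Average marginal effects on P(foreign = 1)
margins, dydx(*)
probit foreign mpg weight
* Marginal effects evaluated at the sample means, as mfx does by default
margins, dydx(*) atmeans
```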
Describing the predictive ability of the model
3.
14/6/2021 SESSION 12
Panel DATA
1. Definition
- Panel data: the same group of observations (N) (households, enterprises,
individuals, countries...) is observed over time (T)
- Panel data can contain:
+ Variables that differ across observations but do not change over time
(location, gender...)
+ Variables that differ across observations and change over time (exchange rate,
FDI, consumption, income...)
Problem:
• The units are observed over time, so the sample is larger and we can track
all the changes in the units over time.
data