
Econometrics 2

The document discusses simple linear regression models. It defines: 1) Population regression function (PRF) as the linear relationship between a dependent variable (y) and independent variable (x) based on the entire population. 2) PRF shows the average relationship between the dependent and explanatory variables in the population. 3) Sample regression function (SRF) is estimated based on a sample of data rather than the entire population. SRF is used to make inferences about the PRF.

The notion of ceteris paribus, which means “other (relevant) factors being equal”, plays an important role in causal analysis.

CHAPTER 2: Simple regression model

2.1. Definition of simple regression model

“Explain y in terms of x”: $y = \beta_0 + \beta_1 x + u$

Where:
1. y is the dependent variable and x is the independent (explanatory) variable

2. $\beta_1$: slope parameter: the relationship between x and y, holding the other factors in u fixed
3. $\beta_0$: intercept parameter
The variable u, called the error term or disturbance in the relationship,
represents factors other than x that affect y. A simple regression analysis
effectively treats all factors affecting y other than x as being unobserved. You
can usefully think of u as standing for “unobserved.”

Population regression function (PRF)

4. The PRF is a regression analysis based on the entire population.

5. $E(y|x)$ is a linear function of x. Linearity means that a one-unit increase in x changes the expected value of y by the amount $\beta_1$.

If the population regression function has the form $E(Y|X) = \beta_1 + \beta_2 X$ (in this notation the intercept is $\beta_1$ and the slope is $\beta_2$)

then $\beta_1 = E(Y|X = 0)$: intercept (INPT: intercept term)

$\beta_2 = \partial E(Y|X) / \partial X$: slope coefficient

→ The PRF shows the average relationship between the dependent variable and the explanatory variable in the population.
SESSION 1 (6/5/2021)

1. Write the sample regression model

$\hat{Y} = 3.108 + 0.305X_1 - 0.003X_2 - 0.127X_3 + 0.107X_4 - 0.092X_5$


SESSION 2 (10/5/2021) : Introduction

1. DEFINITION

Econometrics is based on the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating government and business policies.

• What is econometrics for?

– Quantifying relationships among economic variables

– Empirically testing economic theories: the law of demand, money supply and inflation

– Evaluating the impact of a change in one variable on another variable: measuring the return to education, or the effect of the minimum wage on unemployment

– Forecasting (demand for goods, stock/gold prices, ...)


2. ANALYSIS STEPS

• Steps in empirical economic analysis

1. Question of interest
2. Economic model
3. Econometric model
4. Data collection
5. Estimation of econometric model
6. Diagnosing model problems (for example: multicollinearity, heteroskedasticity, non-normality)
7. Hypotheses postulated
8. Result analysis and policy implications
Step 1. Question of interest based on economic theories

• The relationship among variables in theory

• Example: Keynes’s theory states that household consumption has a positive relationship with household income

Step 2. Set up mathematical model

Keynes’s theory in Step 1 can be modeled as follows:

$C = \beta_0 + \beta_1 I$; $\beta_1 > 0$ (1)

In which
• C: consumption of the households

• I: income of the households

• $\beta_0$, $\beta_1$: parameters/coefficients

– $\beta_0$: constant/intercept parameter

– $\beta_1$: slope parameter

Step 3. Set up econometric model

• The mathematical model in Step 2 describes an exact relationship between household consumption and household income.
• But the relationship among economic variables is in general not exact.

• For example, besides income, other variables can affect household consumption: the number of family members, the age of the family head, ...

• To measure inexact relationships between variables, we use an econometric model:

$C_i = \beta_0 + \beta_1 I_i + u_i$ (2)

• $u_i$: error term (or disturbance); it represents factors other than income that can affect household consumption

• The choice of variables to include in the model is based on economic theory and available data.

*The reasons for the existence of the error term:

• Researchers cannot know all the factors that affect the dependent variable Y.

• Even if they knew all the factors, it would be impossible to obtain data on all of them.

• The model becomes very complicated if we include all the variables.

Step 4. Data collection

• Primary vs. secondary data
• The structure of economic data
– Cross-sectional data

– Time-series data

– Pooled data
  • Pooled cross-sectional data

  • Panel data

A, Cross-sectional data

• A cross-sectional data set consists of a sample of individuals, households, firms, cities, ... taken at a given point in time.

• These data are obtained by random sampling from the underlying population.

• Sometimes, different variables in a cross-sectional data set correspond to different time periods.

B, Time-series data

A time-series data set consists of observations on one or several variables over time (stock prices, consumer price index, GDP, ...).

– The chronological ordering of observations conveys important information.

– Economic observations can rarely, if ever, be assumed to be independent across time.

[Table: time-series data on the consumption and income of one person]

C, Pooled cross-sectional data

• Some data sets have both cross-sectional and time-series features.

Purpose: increase the number of observations

→ The estimates are more precise

D, Panel data
• A panel data set consists of a time series for each cross-sectional member in
the data set.

• The same cross-sectional units are followed over a given time period.

A panel data set on provinces’ characteristics

Panel  ID               YEAR  FDI     ODA      POPU    IZ    MOUNTAIN
1      An Giang         2004  145     40.61    2170.1  0     0
1      An Giang         2005  139     41.51    2194    0     0
1      An Giang         2006  140     30.60    2210.4  0     0
2      Ba Ria Vung Tau  2004  64776   1220.01  897.6   7     0
2      Ba Ria Vung Tau  2005  71441   157.99   913.1   7     0
2      Ba Ria Vung Tau  2006  106618  11.55    926.3   7     0
....   ....             ....  ....    ....     ....    ....  ....
63     Vinh Phuc        2004  7340    5.24     1154.8  2     0
63     Vinh Phuc        2005  9340    7.36     1169    2     0
63     Vinh Phuc        2006  12776   27.73    1180.4  3     0
64     Yen Bai          2004  96      3.04     723.5   0     1
64     Yen Bai          2005  103     6.13     731.8   0     1
64     Yen Bai          2006  113     9.80     740.7   0     1

• The quality of data depends on:

– Errors in data collection

– Sampling methods

• Data sources (the source of the data must be stated when doing analysis):
– Experimental data

– Available data

Step 5. Estimate parameters of the model

→ Data -> Stata, EViews, SPSS -> estimate the parameters of model (2)
→ $\hat{\beta}_0 = -184.08$ and $\hat{\beta}_1 = 0.7064$

$\hat{C}_i = -184.08 + 0.7064 I_i$ (3)

→ The “hat” above the variable C shows that this is an estimate (fitted value) of that variable.
→ Slope estimate = 0.7064: if income increases by 1 billion USD, consumption is predicted to increase by about 706 million USD.
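
As a small illustration of how the fitted equation (3) is used, the sketch below plugs income values into the estimated line. The function name `predict_consumption` and the income values are hypothetical; only the two coefficients come from the notes.

```python
# Fitted consumption function from equation (3): C_hat = -184.08 + 0.7064 * I
BETA0_HAT = -184.08   # estimated intercept
BETA1_HAT = 0.7064    # estimated slope

def predict_consumption(income):
    """Return the fitted consumption for a given income (same units as the data)."""
    return BETA0_HAT + BETA1_HAT * income

# Raising income by one unit changes the prediction by the slope estimate.
change = predict_consumption(1001.0) - predict_consumption(1000.0)
print(round(change, 4))
```

The difference between two predictions one unit of income apart is the slope estimate, which is exactly the interpretation given for the slope above.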

Step 6: Diagnose problems in the model

• Test whether the assumptions of the model are violated:

- Multicollinearity

- Heteroskedasticity

- Normality of u

Step 7: Test hypotheses

• Test the appropriateness of the model and the estimated parameters

• Tests: Fisher, Durbin-Watson, Lagrange, Hausman, ...

Step 8: Analyze the estimated results and draw forecasts/policy implications

→ Check whether the estimated results are consistent with/supportive of the theory.
→ If the model is appropriate and the estimated results are consistent with the theory → provide policy implications

Step 9: Forecasting
Lecture 2: The Linear Regression Model 1

1. Introduction to regression model

• The term “regression” comes from “regression to mediocrity” (regression means reverting toward the mean value)

→ The regression line is the line that connects the mean points

• Defined by Galton (1886) when he studied the relationship between the heights of sons and the heights of their fathers

[Figure: distribution of the heights of sons with respect to the heights of their fathers]

The study shows that:

• Given the height of the fathers, the heights of the sons are distributed around a mean value

• On average, when the height of the fathers increases, the height of the sons also increases

→ The positive effect is the most important thing to note

• If we connect all the mean points, we obtain a straight line

• This line is called the regression line; it shows the average relationship between the height of sons and the height of fathers

2. Population Regression Function (PRF) and Sample Regression Function (SRF)

2.1. Definition of PRF

PRF is a regression function constructed from a survey of the entire population.

For example: Galton studied the relationship between the heights of fathers and the heights of sons in one city. He collected data on all fathers with adult sons, so he could build the PRF.
→ So $E(Y|X_i)$ is a function of the independent variable $X_i$:

$E(Y|X_i) = f(X_i) = \beta_0 + \beta_1 X_i$ [1]

→ Conditional expected value of Y = mean value: the regression line passes through the mean values, so the left-hand side of the regression function is the expected value of the dependent variable, because the expected value equals the mean.

• Equation [1] is called the Population Regression Function (PRF).

– The PRF shows how the expected value of Y changes at different values of X

– If the PRF has 1 independent variable -> simple regression function

– If the PRF has 2 or more independent variables -> multiple regression function

• Suppose that the PRF $E(Y|X_i)$ is a linear function; then:

$E(Y|X_i) = \beta_0 + \beta_1 X_i$ [2]

- $\beta_0$, $\beta_1$: regression coefficients/parameters
• $\beta_0$: constant coefficient (intercept)
• $\beta_1$: slope coefficient

• Equation [2] is a simple regression function

2.2. Error/ disturbance term


• Because $E(Y|X_i)$ is the expected value of Y given $X_i$, an individual value $Y_i$ is not necessarily equal to $E(Y|X_i)$, but it lies around $E(Y|X_i)$.

• Let $u_i$ denote the difference between $Y_i$ and $E(Y|X_i)$ (the distance from the actual value of the i-th observation to its expected value, i.e. the mean). We have:

$u_i = Y_i - E(Y|X_i)$ [3]

Or:

$Y_i = E(Y|X_i) + u_i$ [4]

-> $u_i$ is a random variable/component, or disturbance

2.3. Sample regression function (SRF)

• In reality, we cannot survey the whole population -> we cannot build the PRF

• We can only estimate the expected value of Y, or in other words, estimate the PRF based on sample(s) taken from the population

• Obviously the estimated function cannot be perfectly exact

→ The regression function that is constructed from a sample is called the Sample Regression Function (SRF).

Graph 2.03. Scatter graph and regression lines of the two samples, SRF1 and SRF2

• From the population, we can draw many samples. Each sample gives its own SRF

• To have the “best” SRF, meaning the SRF that is the closest estimate of the PRF, we have to rely on some criteria even though we do not have the PRF to compare with.

→ We study some techniques for this

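2.3. Sample Regression Function (SRF) – Simple (one X)

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, so that $Y_i = \hat{Y}_i + \hat{u}_i$

→ $\hat{Y}_i$ is an estimate of $E(Y|X_i)$ and is the fitted/predicted value of $Y_i$

→ $\hat{\beta}_0$, $\hat{\beta}_1$ are estimates of $\beta_0$, $\beta_1$

→ $\hat{u}_i$ is an estimate of $u_i$ and is called the residual

2.3. Sample Regression Function (SRF) – Multiple (two or more X)

$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} + \hat{u}_i$

Where: k is the number of independent variables X

i indexes the observations

$Y_i$: actual value of the i-th observation

$\hat{u}_i$: residual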

3. Ordinary Least Squares (OLS)

• The OLS method is attributed to the German mathematician Carl Friedrich Gauss.

• It is used to estimate parameters under some assumptions.

• The estimators have some desirable properties (linearity, unbiasedness, and efficiency).

• It is the most widely used estimation method today.

3.1. Ordinary Least Squares (OLS)

• Assume that the PRF has the form: $Y_i = \beta_0 + \beta_1 X_i + u_i$ [3.01]

• Because we cannot observe the PRF, we have to estimate it through the SRF

SRF: $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i$ [3.02]
where $\hat{Y}_i$ is the predicted/fitted value of $Y_i$

From [3.02], we have [3.03]: $\hat{u}_i = Y_i - \hat{Y}_i$

→ $\hat{u}_i$ is the difference between the actual value and the predicted value of $Y_i$

→ The smaller $|\hat{u}_i|$ is, the smaller the difference between $Y_i$ and $\hat{Y}_i$, and the closer the estimated value $\hat{Y}_i$ is to $Y_i$

• Suppose that we have n observations of Y and X. We try to find the SRF so that $\hat{Y}$ is as close as possible to Y.

• A first idea is to choose the SRF so that the sum of residuals, $\sum_{i=1}^{n} \hat{u}_i$, has the minimum value (that is, find the regression coefficients that make the sum of the $\hat{u}_i$ smallest)

• However, this is not a good criterion, for the following reason:

→ Residuals with opposite signs cancel each other out in the sum, even though the distances they represent do not cancel

6. Formula of calculating :
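
Because residuals of opposite sign cancel in a plain sum, OLS minimizes the sum of squared residuals instead. The standard least-squares derivation for the simple model is sketched below in the notation of [3.01] to [3.03]; $\bar{X}$ and $\bar{Y}$ denote the sample means, a notation assumed here.

```latex
\begin{align*}
\min_{\hat{\beta}_0,\,\hat{\beta}_1} \; \sum_{i=1}^{n} \hat{u}_i^2
  &= \sum_{i=1}^{n} \bigl(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\bigr)^2 \\
\text{First-order conditions:}\quad
  \sum_{i=1}^{n} \bigl(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\bigr) &= 0, \qquad
  \sum_{i=1}^{n} X_i \bigl(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\bigr) = 0 \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\end{align*}
```

The slope formula requires that the $X_i$ are not all equal, otherwise the denominator is zero.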
SESSION 3 (13/5/2021)

Example

A random sample as follows:

X: Personal income per day, in thousand VND
Y: Personal consumption per day, in thousand VND
a. Calculate the main descriptive statistics of X and Y (expected value, variance, median, mode)

b. Estimate the parameters of the SRF

c. Write the SRF

Example

X: Personal income/ day in thousand vnd


Y: Personal consumption/ day in thousand vnd

X 5 4 2 8 8
Y 1 2 3 4 5

1. Calculate the main descriptive statistics of X and Y (expected value, variance, median, mode)
2. Estimate the parameters of the SRF
3. Write the SRF
4. Calculate SST, SSE, SSR, R-squared
5. Explain the meaning of R-squared

• $\hat{\beta}_0 = 69/68$; $\hat{\beta}_1 = 25/68$
• SST = 10, SSE = 125/34, SSR = 215/34, R-squared = 0.3676 (here SSE is the explained sum of squares and SSR is the residual sum of squares)

It means that income explains 36.76% of the sample variation in consumption. The remaining 63.24% of the sample variation in consumption is left unexplained by the model; it is attributed to other factors that are not included in the model.
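
The answers for the sample X = (5, 4, 2, 8, 8), Y = (1, 2, 3, 4, 5) can be checked with a short script. This is a minimal sketch, assuming the usual simple-regression formulas (slope = S_xy / S_xx, intercept = mean of Y minus slope times mean of X); the function name `ols_simple` is hypothetical, and exact fractions are used so the results can be compared directly with the fractions above.

```python
from fractions import Fraction

# Sample from the example: X = income, Y = consumption (thousand VND per day)
X = [5, 4, 2, 8, 8]
Y = [1, 2, 3, 4, 5]

def ols_simple(x, y):
    """Return (beta0_hat, beta1_hat, SST, SSE, SSR) as exact fractions.

    SSE is the explained sum of squares and SSR the residual sum of squares,
    matching the values reported in this example.
    """
    n = len(x)
    x_bar = Fraction(sum(x), n)
    y_bar = Fraction(sum(y), n)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    sse = sst - ssr
    return beta0, beta1, sst, sse, ssr

beta0, beta1, sst, sse, ssr = ols_simple(X, Y)
print(beta0, beta1)        # 69/68 25/68
print(sst, sse, ssr)       # 10 125/34 215/34
print(float(sse / sst))    # R-squared, about 0.3676
```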

Example

X: Personal income/ day in thousand vnd


Y: Personal consumption/ day in thousand vnd

X 6 5 2 4 4
Y 5 2 2 3 1

1. Calculate the main descriptive statistics of X and Y (expected value, variance, median, mode)
2. Estimate the parameters of the SRF
3. Write the SRF
4. Calculate SST, SSE, SSR, R-squared
5. Explain the meaning of R-squared

• E(X) = 21/5, E(Y) = 13/5
• Var(X) = 1.76, Var(Y) = 1.84
• $\hat{\beta}_0 = 1/44$; $\hat{\beta}_1 = 27/44$
• SST = 9.2, SSE ≈ 3.31, SSR ≈ 5.89, R-squared ≈ 36.02%

It means that income explains 36.02% of the sample variation in consumption. The remaining 63.98% of the sample variation in consumption is left unexplained by the model; it is attributed to other factors that are not included in the model.
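
Question 1 and the R-squared of this second sample can be checked in the same spirit. The sketch below uses Python's standard `statistics` module for the descriptive statistics (with the population variance, dividing by n, which reproduces 1.76 and 1.84) and exact fractions for R-squared; the helper names `describe` and `r_squared` are hypothetical. Carrying exact fractions gives R-squared = 729/2024, about 0.3602; rounding SSE to 3.3 before dividing gives a slightly lower figure.

```python
import statistics
from fractions import Fraction

X = [6, 5, 2, 4, 4]
Y = [5, 2, 2, 3, 1]

def describe(values):
    """Mean, population variance (divide by n), median and mode of a sample."""
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

def r_squared(x, y):
    """Exact R-squared of the simple regression of y on x: S_xy^2 / (S_xx * S_yy)."""
    n = len(x)
    x_bar, y_bar = Fraction(sum(x), n), Fraction(sum(y), n)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy ** 2 / (s_xx * s_yy)

print(describe(X))
print(describe(Y))
print(r_squared(X, Y), float(r_squared(X, Y)))
```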
3.2. The statistical properties of OLS estimators

• The OLS estimators are expressed solely in terms of observable (i.e., sample) quantities.
• They are point estimators; that is, given the sample, each estimator provides only a single (point) value of the relevant population parameter.
• Once the OLS estimates are obtained from the sample data, the sample regression line can be easily obtained.

The regression line thus obtained has the following properties:

1. It passes through the sample means of Y and X ($\bar{Y}$ and $\bar{X}$).

2. The mean value of the estimated Y is equal to the mean value of the actual Y: $\bar{\hat{Y}} = \bar{Y}$

3. The mean value of the residuals is zero.
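
These three properties can be verified numerically. The sketch below reuses the five-observation sample X = (5, 4, 2, 8, 8), Y = (1, 2, 3, 4, 5) from the worked example and exact fractions, so the equalities hold exactly rather than up to rounding; variable names are illustrative.

```python
from fractions import Fraction

X = [5, 4, 2, 8, 8]
Y = [1, 2, 3, 4, 5]
n = len(X)
x_bar, y_bar = Fraction(sum(X), n), Fraction(sum(Y), n)

# OLS estimates for the simple regression of Y on X
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

fitted = [beta0 + beta1 * x for x in X]
residuals = [y - f for y, f in zip(Y, fitted)]

print(beta0 + beta1 * x_bar == y_bar)   # property 1: line passes through (x_bar, y_bar)
print(sum(fitted) / n == y_bar)         # property 2: mean of fitted values equals mean of Y
print(sum(residuals) == 0)              # property 3: residuals sum (and average) to zero
```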

3.3. The sum of squares

3.4. Determination Coefficient (R-squared)

$R^2$ is the fraction (percentage) of the sample variation in Y that is explained by X.
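
For reference, the sums of squares behind R-squared can be written out as follows. This is a sketch that assumes the convention used in the worked examples of these notes (SSE = explained sum of squares, SSR = residual sum of squares); other textbooks swap the two abbreviations.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2, &
SSE &= \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, &
SSR &= \sum_{i=1}^{n} \hat{u}_i^2 \\
SST &= SSE + SSR, &
R^2 &= \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
\end{align*}
```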

3.6. Assumptions of the OLS

Assumption 1 – Linear in parameters: In the PRF, the dependent variable, y, is related to the independent variable, x, and the error term, u, as $y = \beta_0 + \beta_1 x + u$.
“Linear” means linear in the parameters, not necessarily in the variables.

Assumption 2 – Random sampling: We have a random sample of size n.

Assumption 3 – Sample variation in the explanatory variable: The sample outcomes on x, namely $\{x_i, i = 1, \dots, n\}$, are not all the same value.

• Assumption 4 – No perfect collinearity (for multiple regression): In the sample, there are no exact linear relationships among the independent variables.

Note: If the model violates assumptions 1-4, we cannot estimate the model.

• Assumption 5: The error term has an expected value of zero given any value of the explanatory variable. In other words, $E(u|X) = 0$.

→ This assumption says that the factors not explicitly included in the model, and therefore subsumed in $u_i$, do not systematically affect the mean value of Y; the positive $u_i$ values cancel out the negative $u_i$ values, so that their average effect on Y is zero.

→ Geometrically, this assumption can be pictured as in Figure 3.3, which shows a few values of the variable X and the Y populations associated with each of them. As shown, each Y corresponding to a given X is distributed around its mean value.
Figure 3.3. Conditional distribution of the disturbances $u_i$

Assumption 6 – Homoskedasticity: The error term $u_i$ has the same variance given any value of the independent variable. In other words,

$Var(u_i|X_i) = E\{[u_i - E(u_i|X_i)]^2 \,|\, X_i\} = E(u_i^2|X_i) = \sigma^2$

→ $Var(u)$ reflects the spread of Y around $E(Y|X)$.

→ This assumption means that the Y populations corresponding to various X values have the same variance. The variance around the regression line is the same across the X values; it neither increases nor decreases as X varies.

Figure 3.4. The simple regression model under homoskedasticity

→ Consider Figure 3.5, where the conditional variance of the Y population varies with X.
→ Let Y represent weekly consumption expenditure and X represent weekly income.
→ Figures 3.4 and 3.5 show that as income increases, average consumption expenditure also increases.
→ In Figure 3.4, the variance of consumption expenditure remains the same at all levels of income.
→ In Figure 3.5, it increases as income increases.
→ Richer families on average consume more than poorer families. There is also more variability in the consumption expenditure of the former.

Figure 3.5. The simple regression model under heteroskedasticity

3.7. Properties of the OLS estimators – Gauss-Markov Theorem
• The OLS estimators are the best linear unbiased estimators (BLUE).
• The Gauss-Markov Theorem: Under the OLS assumptions, the estimators are BLUE (best linear unbiased estimators).
– Linear: the OLS estimators are linear functions of a random variable (the dependent variable Y)
– Unbiased: the expected value of $\hat{\beta}_j$ equals the population $\beta_j$
– Best: smallest variance (reflecting the precision/efficiency of the estimation) among the class of linear unbiased estimators

Theorem 1: Unbiasedness of OLS
Given the assumptions, we have:
$E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$
for any values of $\beta_0$ and $\beta_1$. In other words, $\hat{\beta}_0$ is unbiased for $\beta_0$, and $\hat{\beta}_1$ is unbiased for $\beta_1$.

Each variable has its own estimate (for example, one $\hat{\beta}_2$).

The expected value of $\hat{\beta}_2$ equals the population $\beta_2$.

3.9: The components of the OLS variances

Example: consider a model with two independent variables:
$wage = \beta_0 + \beta_1 educ + \beta_2 iq + u$
$R^2 = 0.3$
There are two ways to check whether the two independent variables educ and iq are strongly correlated with each other; the one shown here is:
• Run the auxiliary regression
$educ = a_0 + a_1 iq + u$ → If the coefficient of determination of this model, denoted $R_j^2$, is greater than 0.8
→ the two independent variables educ and iq are strongly correlated with each other → the model suffers from multicollinearity.
Note: $R^2$ is different from $R_j^2$. $R^2$ measures the relationship between the dependent variable and the independent variables. $R_j^2$ measures the relationship among the independent variables themselves.
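
The auxiliary-regression check can be sketched in a few lines. With a single regressor on the right-hand side, the R-squared of the auxiliary regression equals the squared correlation between the two variables, which is what the hypothetical helper `auxiliary_r_squared` computes. The educ and iq values are made up for illustration and do not come from any data set; the 0.8 threshold is the rule of thumb quoted above.

```python
from fractions import Fraction

def auxiliary_r_squared(xj, xk):
    """R-squared from regressing xj on xk (with an intercept)."""
    n = len(xj)
    mj, mk = Fraction(sum(xj), n), Fraction(sum(xk), n)
    s_jj = sum((a - mj) ** 2 for a in xj)
    s_kk = sum((b - mk) ** 2 for b in xk)
    s_jk = sum((a - mj) * (b - mk) for a, b in zip(xj, xk))
    return s_jk ** 2 / (s_jj * s_kk)

# Hypothetical values, for illustration only (not taken from any data set)
educ = [12, 16, 10, 14, 18, 12]
iq = [100, 120, 90, 105, 125, 95]

rj2 = auxiliary_r_squared(educ, iq)
print(float(rj2), rj2 > Fraction(8, 10))   # compare with the 0.8 rule of thumb
```

With more than two independent variables, the auxiliary regression of one regressor on all the others would be needed, which this single-regressor sketch does not cover.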

Note: $R^2$
Formula for R-squared.
The formula for R-squared comes from the idea that the total variation of the dependent variable is split into two parts: the variation explained by the regression and the variation not explained by the regression (the residual part).

$R^2 = 1 - \dfrac{\text{Residual Sum of Squares}}{\text{Total Sum of Squares}}$

Regression Sum of Squares: the sum of squared deviations explained by the regression model (SSE in the exercises above)
Residual Sum of Squares: the sum of squared residuals (SSR in the exercises above)
Total Sum of Squares: the total sum of squared deviations (SST in the exercises above)
Abbreviations differ between textbooks, so the formula is written here with the full names.
R-squared ranges from 0 to 1. The closer R-squared is to 1, the better the model fits the data used in the regression. The closer R-squared is to 0, the worse the model fits the data.
In the special case of a simple regression (only one independent variable), $R^2$ is the square of the correlation coefficient r between the two variables.

Meaning of R-squared

Specifically: suppose R-squared is 0.60. Then this linear regression model fits the data set at the 60% level. In other words, 60% of the variation in the dependent variable is explained by the independent variables. (The remaining 40% comes from measurement error, the way the data were collected, other independent variables that explain the dependent variable but were not included in the model, and so on.) A common rule of thumb is that $R^2$ should be above 50% for the model to be considered a good fit. However, this depends on the type of research: in financial models, for example, not every $R^2$ has to exceed 50% (it is very hard to predict gold or stock prices from independent variables such as GDP, ROA, ROE, ... alone).
Limitations of R-squared
The more variables are added to the model, whether or not they are meaningful, the higher $R^2$ becomes. The reason is that adding explanatory variables reduces the residual (everything that is not explained sits in the residual), so the Residual Sum of Squares falls while the Total Sum of Squares is unchanged, and $R^2$ never decreases.
A higher $R^2$ suggests greater explanatory power, but it says nothing about the importance of the added variable. Judging a model by $R^2$ alone is therefore misleading, because it encourages adding too many unnecessary variables and overcomplicating the model.
To prevent this, another measure of goodness of fit is used more often: the adjusted $R^2$, or $R^2$ adjusted for degrees of freedom.
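
The degrees-of-freedom adjustment can be made concrete with the standard definition, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The function name `adjusted_r_squared` and the input values below are illustrative assumptions, not results from these notes.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for a model with k regressors plus an intercept
    estimated on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)

# Illustrative values only: the same R-squared of 0.30 with n = 526 observations
for k in (1, 6, 50):
    print(k, round(adjusted_r_squared(0.30, 526, k), 4))
```

Holding R-squared fixed, the adjusted value falls as k grows, which is the penalty for adding variables that do not improve the fit.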

Theorem 2: Sampling variances of the OLS estimators

Under assumptions 1 through 6,

$Var(\hat{\beta}_j) = \dfrac{\sigma^2}{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2 \, (1 - R_j^2)}$

3.10. Units of measurement

• Example: data set “CEO Salary and Return on Equity”
salary: annual salary of the CEO, in thousands of dollars
roe: average return on equity, in percent

$\widehat{salary} = 963.191 + 18.501\,roe$ [1]

=> When roe increases by one percentage point, the CEO's annual salary is predicted to increase by 18.501 thousand USD

Case 1

• Salary is measured in USD: salarydol = 1000*salary

• The unit of roe is unchanged

$\widehat{salarydol} = 963191 + 18501\,roe$

=> If the dependent variable is multiplied or divided by a constant c, then the OLS intercept and slope estimates are also multiplied or divided by c.

Case 2

• The unit of salary is unchanged

• The unit of roe is changed: roedec = roe/100

$\widehat{salary} = 963.191 + 1850.1\,roedec$
• The coefficient on roedec is 100 times greater than the coefficient on roe in [1]

=> If the independent variable is divided or multiplied by some nonzero constant c, then the OLS slope coefficient is multiplied or divided by c, respectively. The intercept is unchanged.
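
Both rescaling rules can be checked on a toy sample. The data below are hypothetical (they are not the CEO salary data set), and the helper `ols` is an illustrative implementation of the simple-regression formulas using exact fractions, so the comparisons are exact.

```python
from fractions import Fraction

def ols(x, y):
    """Return (intercept, slope) of the simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = Fraction(sum(x), n), Fraction(sum(y), n)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope

# Hypothetical data: roe in percent, salary in thousand USD
roe = [10, 15, 20, 25, 30]
salary = [900, 1200, 1100, 1500, 1400]

b0, b1 = ols(roe, salary)
b0_dol, b1_dol = ols(roe, [1000 * s for s in salary])          # Case 1: y multiplied by 1000
b0_dec, b1_dec = ols([Fraction(r, 100) for r in roe], salary)  # Case 2: x divided by 100

print(b0_dol == 1000 * b0, b1_dol == 1000 * b1)  # True True
print(b0_dec == b0, b1_dec == 100 * b1)          # True True
```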
LECTURE 3: HYPOTHESIS TEST

20/5/2021. LECTURE 4

Three file types:
Data: wage.dta
Log: stores the output of the software session (.smcl, .log)
Do-file: contains the commands
Log file: stores all the results and commands
log using "…"
Example: log using "D:\Practice_econometrics"
*Commands
1. des: provides the meaning and the measurement of the variables
Obs: observations
Vars: variables
2 KINDS OF VARIABLE
- Quantitative and qualitative
a) Quantitative variable: a random variable whose values are numbers that are meaningful algebraically
Ex: educ: education level
Obs Educ
1 16
2 12
3 15
4 9

b) Qualitative variable: a random variable whose values may be coded as numbers, BUT the numbers have NO algebraic meaning
Qualitative variables are sometimes coded as numbers.
We have to convert a qualitative variable into dummy variables.

DUMMY VARIABLE (ZERO-ONE; BINARY): a variable that takes the value 0 or 1
NOTE
If a qualitative variable has n categories, we can create n dummy variables,
BUT we include only (n-1) dummy variables in the model. The category whose dummy is excluded from the model serves as the base group or benchmark (all included dummies = 0) for comparison.
Example: Gender has 2 categories: Male and Female -> We create 2 dummy variables: Male and Female
Variable Male = 1 if the obs is a Male, = 0 otherwise
Variable Female = 1 if the obs is a Female, = 0 otherwise
Obs Edu Gender Male Female
1 11 M 1 0
2 16 F 0 1
3 4 F 0 1
4 6 M 1 0
5 18 F 0 1

Example:
SOE: State-Owned Enterprise
FDI: Foreign Direct Investment

Enterprise Ownership Private FDI SOE


1 Private 1 0 0
2 Fdi 0 1 0
3 Soe 0 0 1
4 Private 1 0 0
5 Fdi 0 1 0
6 Fdi 0 1 0
7 Private 1 0 0
8 Soe 0 0 1
9 Fdi 0 1 0
10 Soe 0 0 1

The variable Private = 1 if the ownership of the firm is private, = 0 otherwise
The variable FDI = 1 if the ownership of the firm is FDI, = 0 otherwise
The variable SOE = 1 if the ownership of the firm is SOE, = 0 otherwise
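
The ownership example can be reproduced with a short sketch that builds the 0/1 columns and drops one category as the base group. The helper `make_dummies` is hypothetical, and choosing SOE as the base group is an assumption made only for illustration.

```python
# Ownership of the ten enterprises, in the order of the table above
ownership = ["Private", "Fdi", "Soe", "Private", "Fdi",
             "Fdi", "Private", "Soe", "Fdi", "Soe"]

def make_dummies(values, base):
    """One 0/1 column per category, omitting the base category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values]
            for c in categories if c != base}

dummies = make_dummies(ownership, base="Soe")
print(dummies["Private"])  # [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(dummies["Fdi"])      # [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]
```

An SOE firm is then the observation for which both included dummies equal 0, which is what makes it the benchmark in the regression.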
ANALYSIS

Step 1: Question of interest


Topic: Analysing factors affecting the income of individuals in the USA
Income is the dependent variable
Choose X and Y
Y: wage
X: educ exper nonwhite female married south

Descriptive statistics (Qualitative analysis)

The purpose of descriptive statistics is to provide an understanding of the data structure
sum
sum provides summary statistics of the variables (mean, standard deviation, min, max)

Example: sum wage educ exper nonwhite female married south

*Wage
Has 526 observations
Mean = 5.896
Mean: in statistics, a measure of the central tendency of the data. It is also regarded as an expected value.
Standard deviation = 3.693
SD: the deviation from the variable's mean value. A small value indicates that the observations do not differ much from the mean. Conversely, a large value shows that the observations differ considerably from one another (for survey data, respondents gave very different answers for that variable).
(Dummy variables are usually not analysed this way, because their numeric codes have no algebraic meaning)

sum can be combined with if or by

Arithmetic operators: + - * /
Comparison operators: >=, <=
| means “or”
& means “and”
!= means “not equal”
After if, use == to test equality

- Homework: Calculate the mean wage of these groups:

Female vs. Male
Married vs. Unmarried
Nonwhite vs. White
Graduated from university vs. Not yet graduated from university
Answer:

a, Female vs. Male

Female: sum wage if female == 1

Male: sum wage if female == 0

What does this tell us?

Mean -> The average wage of women is much lower than that of men

b, Married vs. Unmarried

Married: sum wage if married == 1

Unmarried: sum wage if married == 0

c, Nonwhite vs. White

Nonwhite: sum wage if nonwhite == 1
White: sum wage if nonwhite == 0

The gap in average wage is quite small

d, Graduated from university vs. Not yet graduated from university

Graduated: sum wage if educ >= 16

Not yet: sum wage if educ < 16
The average wage of people who graduated is about twice that of people who have not graduated yet

Education shows the largest gap in mean wages (the difference between the two group means), followed by gender and then marital status. The gap between nonwhite and white people is small. These are differences in group means, not yet estimated effects.
20/5/2021 SESSION 5

Another way to summarize by a dummy variable

bysort female: sum wage

bysort nonwhite: sum wage

bysort married: sum wage

Average wage of married women and married men

sum wage if female == 1 & married == 1
sum wage if female == 0 & married == 1

3. tab provides the distribution of the values of a variable, so that we can understand the structure of the data set more clearly
For the variable educ:
Mean = 12.562
SD = 2.762
→ Many values are concentrated around the mean because the SD is quite small
The sample has 526 observations in total
2 people did not go to school (educ = 0), accounting for 0.38%
Cum. percent (cumulative percentage) = 0.38
The most common value in the sample is educ = 12 (198 people) => average educ is around 12
(the cumulative percent at educ = 12 is 59.70, which is the percentage of people with 12 or fewer years of education)

tab wage
wage is a continuous variable and takes many distinct values
→ the command “tab” should not be used for continuous variables
4. gen: generates/creates a new variable when the variable does not yet exist in the data set

gen newvar = exp

gen educsq = educ^2

gen lneduc = ln(educ)

drop varname: deletes the variable varname

After creating a variable, we should document its meaning with the command:

label variable varname "…"

NOTE: Straight quotes " must be used, not curly quotes

Eg. label variable educsq "the squared value of educ"

Ex:
Create one dummy variable showing the education level of 2 groups: graduated from university vs. not yet graduated from university

gen … if …
replace
Answer:
gen graduated = 1 if educ >= 16
replace graduated = 0 if educ < 16
We have a new variable: graduated

Create one dummy variable showing the experience of 2 groups: less than 20 years, and 20 years or more

experience = 1 if exper >= 20, = 0 otherwise

gen experience = 1 if exper >= 20

replace experience = 0 if exper < 20

label variable experience "=1 if experience >=20"

5. list in/if

sort …
list … in (range of observation numbers)

Exercise: list the 10 people with the lowest wage and the 10 people with the highest wage

sort wage
list wage in 1/10 (observations 1-10)
list wage in 517/526 (observations 517-526)
Calculate the average wage of the 10 people with the lowest wage and of the 10 people with the highest wage

sort wage
sum wage in 1/10
sum wage in 517/526

6. drop/keep in/if

drop/keep varname
drop varname

drop in 1/20

7. rename

rename var newname

rename educ hocvan

- The distribution plots of science in the two groups are close to the normal distribution. Now suppose we want to know whether the means of the two groups are equal at the 5% significance level; we use the ttest command as follows:

ttest var

Then the rvfplot command can be used to plot the residuals against the fitted values of the dependent variable in the model. Adding the option yline(0) to the command draws a horizontal line at residual = 0. The value 0 is the mean of the residuals.
Step 2: Set up mathematics model ( skip)

Step 3: Set up econometric model

Step 4: Collect data

Data source
Number of observations
Years of survey

Step 5: Estimate the model

Econometric model: linear regression model (the data set is cross-sectional and the dependent variable is continuous)
Method to estimate coefficients: OLS

Check the correlation of Y and X: before running the regression, produce the correlation matrix (and report it in the research)
corr Y X -> correlation matrix
corr wage educ exper nonwhite female married south

(Correlation and a statistically significant effect are two different things: X being correlated with Y does not mean that X affects Y)

The correlation matrix presents:

Correlation between Y and X -> r(Y,X)
If r(Y,X) ≠ 0 -> X is correlated with Y -> we can include X in the model
Ex: r(wage, educ) = 0.4059 ≠ 0 -> educ is correlated with wage

Correlation between Xj and Xk (to check for a multicollinearity problem)
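
Each entry r(Y,X) of the correlation matrix is a Pearson correlation coefficient. A minimal sketch of that computation is given below; the function name `pearson_r` is hypothetical, and the five-observation sample from the worked OLS example is reused so that the result can be related to its R-squared of 25/68.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

X = [5, 4, 2, 8, 8]   # income, from the worked example
Y = [1, 2, 3, 4, 5]   # consumption, from the worked example

r = pearson_r(X, Y)
print(r, r ** 2)      # r squared equals the R-squared of the simple regression of Y on X
```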


Note: Choosing Y and X
Topic 1: Analyze the impact of foreign direct investment on GDP growth of VietNam

Y: GDP growth of VietNam

X1: FDI (main independent variable)
X2, X3, …, Xk: control variables

Topic 2: Analyze the relationship between FDI and GDP growth of VN

(1): GDP growth = f(FDI)

(2): FDI = f(GDP growth)

Topic 3: Analyze the relationship between the rice output and the rainfall of VietNam
Y: Rice output
X: Rainfall

If we have 2 variables A and B and we want to decide which of them should be Y and which X -> we have to check the nature of the correlation between A and B
+ Calculate the correlation of A and B -> r(A,B)
+ If r(A,B) ≠ 0 -> we can include A and B in the model
+ If A correlates with B -> check whether the correlation reflects causation or not
If causation -> Y: result/consequence (rice output)
X: cause (rainfall)

Regression (run the model)

reg Y X

E.g. reg wage educ exper nonwhite female married south

- What is the difference between Rsq and adjusted Rsq?

R-squared is a basic metric that tells you how much of the variance of Y is explained by the model.
+ Rsq: in a multiple linear regression, Rsq never decreases when you keep adding new variables, whether or not those variables are significant
+ Adjusted Rsq penalizes extra variables: it increases only when the added variable improves the fit enough (its t statistic is larger than 1 in absolute value), so an irrelevant variable can lower it
In this case, why is Rsq greater than adjusted Rsq?
The nonwhite variable has no statistically significant effect on wage (its p-value is large)
Type I error
Null hypothesis
In the analysis, why do we always report and discuss Rsq instead of adjusted Rsq?
Because some variables in the model have no significant effect, but that does not mean they are useless in terms of the information they provide.
We cannot simply exclude a variable because it has no significant effect on the dependent variable
24/5/2021. Session 6

The sums of squares

- df: degrees of freedom
- Model df = 6 = k
- k is the number of independent variables in the model (k = 6)
- Residual df = 519 = n - k - 1 = 526 - 7
- n: number of observations
- MS = SS/df: mean square (the sum of squares divided by its degrees of freedom)
- Root MSE: square root of the mean squared error (the residual mean square)
- _cons: constant (intercept) coefficient, Beta0 hat
- Coef. = coefficient: regression coefficient
- Std. Err.: standard error of the coefficient estimate

SRF: wage = -1.41 + 0.569educ + 0.55exper + 0.072nonwhite - 2.092female + 0.715married - 0.646south + u^

R2 = SSE/SST = 2309.32588/7160.41429 = 0.3225 (SSE here is the explained, or model, sum of squares)

It means that the independent variables in the model (educ, exper, nonwhite, female, married, south) explain 32.25% of the sample variation in wage

So 67.75% of the sample variation in wage is explained by other variables that are not included in the model. In theory, they are captured by u (the error term)
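
The R-squared arithmetic can be checked directly from the two sums of squares quoted above. The sketch below also computes adjusted R-squared with the standard textbook formula, which the notes mention but do not write out; treat that formula as an addition, not as part of the original notes.

```python
# Figures quoted in the notes (Stata ANOVA table for the wage regression).
sse_explained = 2309.32588   # explained (model) sum of squares
sst_total = 7160.41429       # total sum of squares
n, k = 526, 6                # observations and independent variables

r2 = sse_explained / sst_total
# Standard adjusted R-squared formula (an assumption: not written out in the notes).
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, unexplained share = {1 - r2:.4f}")
```

Because (n - 1)/(n - k - 1) is larger than 1 whenever k > 0, adjusted R-squared is below R-squared in any model with at least one regressor and an R-squared below 1.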
P-VALUE EXPLAINED

If we calculate t = 1.76 with alpha = 5% and n > 300 -> critical value = 1.96
When testing a single coefficient, |t| = 1.76 < 1.96, so we cannot reject H0 at alpha = 5%
p-value

In statistics there are two kinds of error: Type I error and Type II error
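
The critical-value rule and the p-value rule always give the same decision. A minimal sketch for the t = 1.76 case, assuming a two-sided test and the standard normal approximation (reasonable for n > 300); the function name is hypothetical.

```python
import math


def two_sided_p_value(t_stat):
    """Two-sided p-value using the standard normal approximation (large n)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))


t_stat, alpha, critical_value = 1.76, 0.05, 1.96
p_value = two_sided_p_value(t_stat)

reject_by_critical_value = abs(t_stat) > critical_value
reject_by_p_value = p_value < alpha
print(round(p_value, 4), reject_by_critical_value, reject_by_p_value)
```

The p-value is the smallest significance level at which H0 would be rejected, so here H0 is not rejected at 5% but would be rejected at 10%.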
27/5/2021 SESSION 7:

Step 6: Diagnosing the problems of the model

(1): E(u|X) = 0 (Assumption 5)
(2): Multicollinearity
(3): Homoskedasticity vs. heteroskedasticity (Assumption 6)
(4): Normal distribution of u (Assumption 7)
(5): Autocorrelation (only for time-series data)

(1): E(u|X) = 0 (Assumption 5)

This assumption is satisfied when:
- E(u) = 0
- Cov(Xj,u) = 0
This assumption ensures that u is white noise, so that the estimates of the regression coefficients are not affected by the factors contained in the error u
If either of these two conditions fails -> the assumption is violated

Cause:

- Important variables are not included in the model
- Misspecification of the functional form (see section 6.2)

Consequence
- Biased estimation
-> Theorems 4.1 and 4.2 (lecture 3) do not hold (reread textbook section 3.3: The Expected Value of the OLS Estimators)
-> The t statistic does not have a t distribution
-> Hypothesis tests are inexact

How to detect and correct the problem

- Omitted important variable: if we suspect that a variable Z has a statistically significant effect on Y, we should include Z in the model and apply the t-test to see whether its effect on Y is statistically significant

- Misspecification of the functional form: check with the Ramsey test

If the test shows that the functional form is misspecified, we need to change the functional form

Stata command: ovtest

reg wage educ exper nonwhite female married south
ovtest

H0: the model has no omitted variables (the functional form is not misspecified)
-> p-value < alpha = 0.05 -> reject H0 -> the functional form is misspecified -> we have to change the functional form
-> Generate a new variable
gen educsq = educ^2
reg lwage educ educsq exper nonwhite female married south
ovtest

-> p-value > alpha -> do not reject H0 -> the model has no omitted variables

(2) Multicollinearity

+ Problem: multicollinearity happens when independent variables have a strong (but not exact) correlation
This problem does not violate Assumption 4 (no perfect collinearity)
Assumption 4: in the sample, there are no exact linear relationships among the independent variables
-> the model is still valid, but this problem has some consequences.
Example:
wage = B0 + B1educ + B2iq + u
educ = a0 + a1iq + v
If 0.8 < Rj2 < 1, where Rj2 is the R-squared from regressing Xj on the other independent variables -> multicollinearity
- Consequence
NOTE: when the model has multicollinearity, the estimators are still BLUE
+ How to find out:
- Correlation matrix

Command: corr lwage educ educsq exper nonwhite female married south

- vif (variance inflation factor)

vif = 1/(1-Rj2)
If vif and mean vif > 10 -> multicollinearity
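
The formula vif = 1/(1-Rj2) links the two rules of thumb quoted above. A minimal sketch, with illustrative Rj2 values that are not taken from the data:

```python
def vif(r2_j):
    """Variance inflation factor for regressor j, given the R-squared of
    regressing x_j on the other independent variables."""
    if not 0 <= r2_j < 1:
        raise ValueError("R_j^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2_j)


for r2_j in (0.0, 0.5, 0.8, 0.9):   # illustrative values, not from the data
    print(r2_j, round(vif(r2_j), 2))
```

Note that Rj2 = 0.8 corresponds to vif = 5, while vif = 10 corresponds to Rj2 = 0.9, so the Rj2 > 0.8 rule is stricter than the vif > 10 rule.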
+ Solution
- Exclude the highly correlated variables from the model (but if we exclude one of them, we could create problem (1), an omitted important variable)
- Increase the sample size

-> If we have a large sample, we do not need to worry about multicollinearity
In this case the sample has 526 observations -> large enough -> we do not have to worry about the multicollinearity problem
(What counts as large enough? When the sample has about n = 384.16 observations -> large enough)
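
The notes do not say where 384.16 comes from. The number coincides with the standard survey sample-size formula for estimating a proportion, n = z^2 p(1-p)/e^2, with z = 1.96, p = 0.5 and a 5% margin of error. The sketch below only reproduces that arithmetic; reading it as the source of the threshold is an assumption.

```python
def sample_size_for_proportion(z=1.96, p=0.5, margin=0.05):
    """Sample size n = z^2 * p * (1 - p) / margin^2 for estimating a proportion."""
    return z ** 2 * p * (1 - p) / margin ** 2


print(round(sample_size_for_proportion(), 2))
```

If that reading is right, the threshold is a rule for survey precision rather than a guarantee about multicollinearity, so it is best treated as a rough guide.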
31/5/2021 SESSION 8

(3): Homoskedasticity vs. heteroskedasticity (Assumption 6)


+ Problem
Assumption 6: $Var(u_i | X_i) = \sigma^2$ for all i -> homoskedasticity (constant error variance)
Heteroskedasticity (non-constant error variance)
When this assumption is violated, $Var(u_i | X_i) = \sigma_i^2$ (for i = 1, 2, ..., n).
-> Heteroskedasticity
-> Violates assumption 6
+ Consequence
- (Multiple regression)
Homoskedasticity: $Var(u_i) = \sigma^2$ -> $Var(\hat{B}_j) = \sigma^2 / [SST_j (1 - R_j^2)]$
Heteroskedasticity: $Var(u_i) = \sigma_i^2$ -> the formula above for $Var(\hat{B}_j)$ no longer holds

+ If the model has heteroskedasticity -> the estimators are still linear and unbiased, but no longer the best (minimum variance)
+ The usual OLS formula for the variance of the coefficients is biased -> the standard errors (SE) are biased -> hypothesis tests are inexact

-> When heteroskedasticity is present and we still estimate the model by OLS, the OLS estimators remain linear and unbiased, but their estimated variances are biased.
- The estimated variances are no longer accurate.
- Confidence intervals and the conclusions of hypothesis tests on the regression coefficients are no longer valid.
- Forecasts are no longer reliable.
- Simple regression

Consider the following example:

Regressing household consumption Y on household income X, we have the following regression model:
- Case 1: constant error variance (homoskedasticity)
$Var(u_i) = \sigma^2$
-> The variability of spending is the same for households with different incomes.
- Case 2: non-constant error variance (heteroskedasticity)
$Var(u_i) = \sigma_i^2$
-> Households with higher incomes have more variable spending. This is what we usually see in practice.
-> Violates assumption 6 of the classical linear regression model. -> This is the strongest assumption, because most real-world data violate it

+ How can we find out?

a, Graphical method
STATA

reg (dependent variable) (independent variables)
rvfplot, yline(0) -> plot of the residuals against the fitted values

If the spread of the residuals around the zero line is not constant across the fitted values (for example, it widens as the fitted values grow) -> we suspect that the model has heteroskedasticity

b, White test
Example (Question 3): if the original regression has 4 independent variables, how many independent variables does the auxiliary regression of the White test (which uses the residuals from the estimated model) have?

We have X1, X2, X3, X4,
X1^2, X2^2, X3^2, X4^2
X1X2, X1X3, X1X4, X2X3, X2X4, X3X4

-> 14 variables
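
The count of 14 follows from k levels, k squares and k(k-1)/2 cross products. A short sketch that lists the terms for any set of regressors; the function name is hypothetical.

```python
from itertools import combinations


def white_test_regressors(names):
    """List the regressors of the White auxiliary regression:
    levels, squares, and pairwise cross products."""
    levels = list(names)
    squares = [f"{x}^2" for x in names]
    cross = [f"{a}*{b}" for a, b in combinations(names, 2)]
    return levels + squares + cross


terms = white_test_regressors(["X1", "X2", "X3", "X4"])
print(len(terms), terms)
```

When some regressors are dummy variables, the square of a dummy equals the dummy itself, so the number of distinct terms actually used is smaller than this count.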

STATA:
Command: imtest, white
H0: the model has homoskedasticity
If (Prob > chi2) > alpha = 5% -> do not reject H0 -> the model has no heteroskedasticity

c, Breusch-Pagan test

Command: hettest
If (Prob > chi2) < alpha = 5% -> reject H0 at alpha = 5%
-> The model has heteroskedasticity

Note: if the methods give different results -> follow the test that concludes there is heteroskedasticity.

Solution:
- Robust standard errors
reg Y X, robust
reg lwage educ educsq exper nonwhite female married south, robust

Weakness of this method: (people prefer to use it because it is simple)

C, How can we correct the problem?

Simple method

-> This method only adjusts the standard errors, it does not correct the model -> people often use this method because it is simple

(4): Normal distribution of u (Assumption 7)


+ Problem: u does not have a normal distribution -> violates assumption 7
+ Cause: the sample is not large enough
+ Consequence: the t statistic may not have a t distribution -> hypothesis tests are inexact
+ How to find out?
H0: u has a normal distribution
- Graph
predict u, residuals
histogram u, normal
- Jarque-Bera type test (Stata's sktest is a skewness and kurtosis test for normality)
H0: u has a normal distribution
predict u, residuals (skip this line if u was already created above)
sktest u
+ How to correct: increase the sample size
(5) Autocorrelation
3/6/2021 SESSION 9
Step 7: Test hypotheses and explain the effect of the independent variables on the dependent variable

- 3 methods of hypothesis testing
-> Critical value (t test)
-> p-value
-> Confidence interval
NOTE: run the hypothesis tests on the results of the final model, after the problems have been corrected

Test each regression coefficient and test the overall fit of the model

Step 8: Analysis of the estimated results and policy implications

For the educ variable:

As the p-value of educ > 0.05 but the p-value of educsq < 0.05, and the coefficient of educsq = 0.0044 > 0
+ Education has a nonlinear effect on wage, with an increasing marginal effect
Example:
- Policy implication: (for the government, on reducing the wage gap, ...)

-> Knowledge needed for research with cross-sectional data sets

CHAPTER 10: MULTIPLE REGRESSION WITH A BINARY DEPENDENT VARIABLE

1. A single dummy independent variable

$wage = \beta_0 + \delta_0\,female + \beta_1\,educ + u$

$\delta_0 = E(wage | female = 1, educ) - E(wage | female = 0, educ)$

female = 1 corresponds to females, female = 0 corresponds to males

$\delta_0 = E(wage | female, educ) - E(wage | male, educ)$

The level of education is the same in both expectations, so the difference, $\delta_0$, is due to gender only.
- When $\delta_0 < 0$, men have a higher wage than women at any level of education.
- The difference in wage comes from gender only, not from education (educ has a linear effect), because the slope is the same in both cases
- Higher education means a higher wage for both men and women.

Note: if a qualitative variable has n categories

-> include only n-1 dummy variables in the regression.

The category whose dummy variable is not included in the model

-> base group or benchmark group.

E.g.: gender has 2 categories, male and female -> use only 1 dummy variable, male or female

- If female is the base group, we have the model:

$wage = \beta_0 + \delta_0\,male + \beta_1\,educ + u$

- Using 2 dummy variables would introduce perfect collinearity because female + male = 1, which means that male is a perfect linear function of female.

2. Using multiple dummy variables in the model

- We can include more than 1 dummy variable in the model

- Weakness of the model: we only know the difference in wage between:

+ Female vs. male group
+ Married vs. unmarried group
-> We do not know the difference in wage among the 4 groups: married men, single men, married women, single women.
- Solution
METHOD 1:
We can overcome this disadvantage by generating 4 groups: married men, married women, single men, single women
- If the base group is single men, the model will be:
$wage = \beta_0 + \delta_1\,marrmale + \delta_2\,marrfemale + \delta_3\,singfemale + \beta_1\,educ + u$

Note: we have to exclude the variables female and married from the model
METHOD 2:
We can generate an interaction variable of the 2 dummy variables:
$wage = \beta_0 + \gamma_1\,female + \gamma_2\,married + \gamma_3\,(female \cdot married) + \beta_1\,educ + u$

-> The estimated results of the 2 models are the same.

STATA:
singmale is the base group

reg wage marrmale marrfemale singfemale educ (3)

gen femalemarried = female*married

reg wage female married femalemarried educ (4)

(3)

-> wage^ = -1.024421 + 2.641066marrmale - .5567235marrfemale - .368964singfemale + .493559educ

(4) (coefficients implied by (3), because both models fit the same four group intercepts)

-> wage^ = -1.024421 - .368964female + 2.641066married - 2.8288255femalemarried + .493559educ

Base group: singmale = 1 if female == 0 & married == 0

(3) singmale: wage^ = -1.024421 + .493559educ

marrmale: wage^ = -1.024421 + 2.641066 + .493559educ

(4) singmale: wage^ = -1.024421 + .493559educ

marrmale: wage^ = -1.024421 + 2.641066 + .493559educ
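
The claim that the two methods give the same results can be checked from the coefficients of regression (3) alone. The sketch below derives the coefficients that regression (4) must have in order to reproduce the same four group intercepts; the dictionary layout and variable names are illustrative, and the derived numbers are implied by (3) rather than copied from a Stata log.

```python
# Coefficients of regression (3) as reported in the notes (base group: single men).
b = {"_cons": -1.024421, "marrmale": 2.641066,
     "marrfemale": -0.5567235, "singfemale": -0.368964, "educ": 0.493559}

# Intercept of each group in model (3).
intercept = {
    "singmale": b["_cons"],
    "marrmale": b["_cons"] + b["marrmale"],
    "marrfemale": b["_cons"] + b["marrfemale"],
    "singfemale": b["_cons"] + b["singfemale"],
}

# Coefficients that model (4) must have to reproduce the same four intercepts.
female = intercept["singfemale"] - intercept["singmale"]
married = intercept["marrmale"] - intercept["singmale"]
femalemarried = (intercept["marrfemale"] - intercept["singmale"]) - female - married

print(round(female, 6), round(married, 6), round(femalemarried, 6))
```

The interaction coefficient is the extra difference for married women beyond the separate female and married effects, which is why model (4) needs it to match all four groups.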

3. Interaction between dummy and quantitative variables


Case 2

- The intercept for women is below that for men, but the slope on education is larger for women.
- This means that women earn less than men at low levels of education, but the gap narrows as education increases.
- At some level of education, a woman earns more than a man.
Stata: gen femaleeduc = female*educ
reg wage female educ femaleeduc
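
The level of education at which the gap closes in Case 2 can be written out explicitly. This is a sketch that assumes the model behind the Stata command is the usual one with a dummy and its interaction with educ; the symbols \delta_0 and \delta_1 are notation chosen here, not taken from the notes.

```latex
\begin{aligned}
wage &= \beta_0 + \delta_0\,female + \beta_1\,educ + \delta_1\,(female \cdot educ) + u \\
E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ) &= \delta_0 + \delta_1\,educ \\
\delta_0 + \delta_1\,educ^{*} = 0 \quad &\Longrightarrow \quad educ^{*} = -\frac{\delta_0}{\delta_1}
\end{aligned}
```

In Case 2, \delta_0 < 0 (lower intercept for women) and \delta_1 > 0 (steeper slope for women), so educ^{*} is positive and women are predicted to earn more than men only when educ > educ^{*}.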
7/6/2021 SESSION 10
CHAPTER 10: MULTIPLE REGRESSION WITH A BINARY DEPENDENT VARIABLE

Regression with a binary dependent variable

1. Linear Probability Model (LPM)
2. Logit Model
3. Probit Model
4. Comparing LPM, Logit, Probit

I. Linear Probability Model (LPM)

$P = E(Y_i | X_i) = P(Y_i = 1 | X_i) = \beta_0 + \beta_1 X_{1i}$

We have the model: $Y_i = \beta_0 + \beta_1 X_{1i} + u_i$

Let $p_i$ be the probability that event A happens given $X_i$: $p_i = P(A | X_i)$
* The difference when interpreting the coefficient:
- Linear regression model: when X changes by 1 unit, the average value of Y, E(Y|X), changes by $\beta_1$ units.
- LPM: when X changes by 1 unit, the probability that event A happens changes by $\beta_1$.

For example:
inlf = 0.5 + 0.038educ - 0.02female + u^i

+ Holding other factors fixed, another year of education increases the probability of being in the labor force by 3.8 percentage points

+ Holding other factors fixed, a female has a probability of being in the labor force that is 2 percentage points lower than a male's

* The weaknesses of LPM

1. The predicted probability can be smaller than 0 or greater than 1.
2. The probability is linear in the independent variables (a constant marginal effect), which is unrealistic because a probability is bounded between 0 and 1.
Exercise 1
A survey of 40 students 6 months after graduation, with variables GS (graduate score) and EN (English grade). Grades are on a 100-point scale.
Y = 1 if the student has a job, Y = 0 if the student has not got a job yet. (inlf)
Let the "probability of getting a job" be the probability that a student gets a job within 6 months of graduation; alpha = 5%

1. Write the LPM.
2. Interpret the coefficients.
3. Estimate the probability of getting a job when (GS, EN) equals (70, 80) and (60, 60).
2. Interpretation:
- p-value < 0.05 -> both GS and EN have a statistically significant effect on Y.
- Holding other factors fixed, when GS increases by 1 point, the probability of getting a job increases by about 2 percentage points.
- Holding other factors fixed, when EN increases by 1 point, the probability of getting a job increases by about 2.9 percentage points.

3. When
- GS = 70, EN = 80
p^lpm(Y=1|X) = -3.01567 + 0.020158*70 + 0.02922*80 = 0.733

- GS = 60, EN = 60
p^lpm(Y=1|X) = -3.01567 + 0.020158*60 + 0.02922*60 = -0.053

-> The negative value (-0.053) is not a valid probability, so we cannot draw a conclusion in this case
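
The two fitted values can be reproduced with a few lines of code, which also makes the first weakness of the LPM visible. The coefficients are the ones used in the worked answer; the function name is hypothetical.

```python
# Estimated LPM coefficients quoted in the worked answer.
B0, B_GS, B_EN = -3.01567, 0.020158, 0.02922


def lpm_probability(gs, en):
    """Fitted LPM 'probability' of having a job; not restricted to [0, 1]."""
    return B0 + B_GS * gs + B_EN * en


for gs, en in [(70, 80), (60, 60)]:
    p_hat = lpm_probability(gs, en)
    status = "valid" if 0 <= p_hat <= 1 else "outside [0, 1]"
    print(gs, en, round(p_hat, 3), status)
```

Nothing in a linear function keeps the fitted value inside the unit interval, which is the motivation for the Logit and Probit models.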
II. Logit and Probit models
- The LPM has 2 weaknesses:
+ P(y=1|x) can be smaller than 0 or greater than 1
+ the effect of X on P(y=1|x) is constant
- To overcome these weaknesses, we rewrite the model:

$P(y = 1 | x) = G(\beta_0 + \beta_1 X_1 + ... + \beta_k X_k) = G(z)$

+ We want G(z) to take values in the interval (0,1) => the Logit and Probit models are among the choices.

1. Logit Model
$G(z) = e^z / (1 + e^z)$
+ In the logit model, G(z) is the logistic function, which is between 0 and 1 for all real numbers z.
+ This is the cdf of a standard logistic random variable.
+ P(y=1|x) = G(z) = pi and P(y=0|x) = 1 - G(z) = 1 - pi

-> At each value of Xi, the probability that event A happens is pi.
When Xj changes by 1 unit, the probability changes by approximately $p_i (1 - p_i) \beta_j$.

(The coefficients cannot be interpreted as in a linear model)
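
A small numerical sketch of the logistic function and of the marginal effect p(1-p) times the coefficient. The index values and the coefficient 0.1 are hypothetical, chosen only to show how the effect depends on where it is evaluated.

```python
import math


def logistic(z):
    """Logistic cdf G(z) = e^z / (1 + e^z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit_marginal_effect(z, beta_j):
    """Approximate change in P(y=1|x) for a one-unit change in x_j."""
    p = logistic(z)
    return p * (1.0 - p) * beta_j


# Hypothetical index values and coefficient, chosen only for illustration.
for z in (-4.0, 0.0, 4.0):
    print(z, round(logistic(z), 4), round(logit_marginal_effect(z, 0.1), 4))
```

The marginal effect is largest at p = 0.5 and shrinks toward zero as p approaches 0 or 1, which is why a logit coefficient cannot be read as a constant change in probability.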


2. Estimating the Logit model

With the Logit and Probit models, we do not use OLS

b. Maximum likelihood estimation (MLE)

- Because E(y|x) is nonlinear, OLS is no longer appropriate.

- Maximum likelihood estimation is more appropriate because it is based on the conditional distribution of y.

- Conditional density of y:

$f(y = 1 | x; \beta) = P(y = 1 | x) = G(x\beta)$
$f(y = 0 | x; \beta) = P(y = 0 | x) = 1 - G(x\beta)$
Exercise
A survey of 40 students 6 months after graduation, with variables GS (graduate score) and EN (English grade). Grades are on a 100-point scale.
Y = 1 if the student has a job, Y = 0 if the student has not got a job yet.
Let the "probability of getting a job" be the probability that a student gets a job within 6 months of graduation; alpha = 5%

b, When GS increases by 1 unit, with EN fixed, the probability of having a job increases by 1.78 percentage points
ANSWER:
Because the coefficients are positive
STATA
Command:

logit y x
probit y x
Example: logit coursechoice read math
Note:
Use scalar only for Probit, not for Logit
mfx can be used for both Logit and Probit

Command: mfx
Describing the predictive ability of the model
3.
14/6/2021 SESSION 12
Panel data
1. Definition
- Panel data: the same units of observation (N) (households, enterprises, individuals, countries, ...) are observed over time (T)
- Panel data can contain:
+ Variables that differ across units but do not change over time (location, gender, ...)

+ Variables that differ across units and change over time (exchange rate, FDI, consumption, income, ...)

2. Advantages of Panel Data

For example (cross-sectional data set): analyze the relationship between the quantity of fertilizer and productivity

-> Problem:

– There are omitted unobserved variables (in u), and u is correlated with X
– These variables differ across units => biased OLS estimation

-> Advantages of panel data

• Overcome the problem of omitted unobserved variables

• The units are observed over time, so the sample is larger and we can track all the changes in the units over time.

The unobserved variables can:

– Vary or not vary across units (i)
– Vary or not vary over time (t)
– Vary across both units (i) and time (t)
3. Econometric Model for Panel Data

ai is unobserved -> it is included in the error term

• Let: vit = ai + uit
• Depending on the characteristics of ai, we have 3 models:
– Pooled OLS (POLS)
– Fixed effects (FE)
– Random effects (RE)
xtset id year: command that declares the data to be panel data
