Notes113
Course Description
This course provides a comprehensive introduction to basic econometric concepts and techniques. It
covers estimation and diagnostic testing of simple and multiple regression models. The course also covers
the consequences of and tests for misspecification of regression models.
Course Outline
2. Statistical Concepts: Normal distribution; chi-squared, t-, and F-distributions; estimation of parameters; properties of estimators; testing of hypotheses.
Econometrics is the social science in which the tools of economic theory, mathematics, and statistical inference are
applied to the analysis of economic phenomena. It is concerned with the empirical determination of
economic laws.
The method of econometric research aims, essentially, at a conjunction of economic theory and actual
measurements, using the theory and technique of statistical inference as a bridge.
Econometrics consists of the application of mathematical statistics to economic data to lend empirical support to the models constructed by mathematical economics and to obtain numerical results.
2. The main concern of mathematical economics is to express economic theory in terms of mathematical equations, e.g., the consumption function C = A + BY.
3. Economic statistics is mainly concerned with collecting, presenting, and processing economic data in the form of charts and tables.
Eg.) primary, secondary, qualitative, quantitative, seasonal, and time series data.
Q3: How is econometrics different from mathematical economics? Why does it need to be studied as a separate discipline? (2 BOOKS)
Econometrics is different from mathematical economics in its focus on empirical verification of economic
theories. Mathematical economics is concerned with the construction and analysis of mathematical models
of economic phenomena, without regard to the measurability or empirical verification of the models.
Econometrics, on the other hand, is concerned with using statistical and mathematical methods to test and
interpret economic theories using real-world data.
Econometrics needs to be studied as a separate discipline because it has its own set of methods and techniques. These are necessary to deal with the particular challenges of testing economic theories with real-world data, such as its non-experimental and noisy nature.
1. Statement of Theory or Hypothesis: Begin by stating the economic theory or hypothesis you want to
investigate. In this example, it's Keynesian theory of consumption.
2. Specification of the mathematical model of the theory: Create a mathematical model representing the
theory. In this case, a linear consumption function relating consumption (Y) to income (X) is proposed.
3. Specification of the statistical, or econometric, model: Modify the mathematical model to account for
inexact relationships between economic variables. Introduce a disturbance term (u) to capture
unaccounted factors affecting consumption.
4. Obtaining the Data: Collect relevant data that will be used for estimation and analysis. In this case, data
on personal consumption expenditure (PCE) and gross domestic product (GDP) for the period 1960-2005 is
gathered.
5. Estimation of the parameters of the econometric model: Use statistical techniques, like regression analysis, to estimate the model's parameters (β1 and β2) from the data. In this example, estimates of −299.5913 and 0.7218 are obtained for β1 and β2, respectively.
6. Hypothesis Testing: Evaluate whether the estimated parameters are statistically significant and
consistent with the theory. In this case, you would test if the MPC estimate of 0.72 is significantly less than
1 to support Keynesian theory.
7. Forecasting or Prediction: Use the estimated model to make predictions about future economic variables
based on expected values of the independent variable. For example, predict future consumption
expenditure based on forecasted GDP.
8. Using the model for control or policy purposes: Apply the estimated model for policy analysis and
control. Determine how changes in policy variables (e.g., tax policy) will impact economic outcomes (e.g.,
income and consumption).
Throughout these steps, it's crucial to consider the adequacy of the chosen model in explaining the data
and to compare it to alternative models or hypotheses when applicable. This ensures robust and reliable
econometric analysis.
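To make steps 3–6 concrete, here is a minimal sketch in Python using numpy and statsmodels; the income and consumption figures are hypothetical, not the 1960–2005 PCE/GDP series from the notes.

```python
# Minimal sketch of steps 3-6: estimating a Keynesian consumption function by OLS.
# The income/consumption figures below are hypothetical, not the PCE/GDP data.
import numpy as np
import statsmodels.api as sm

income = np.array([100, 120, 140, 160, 180, 200, 220, 240], dtype=float)      # X (hypothetical)
consumption = np.array([92, 105, 118, 130, 142, 157, 170, 181], dtype=float)  # Y (hypothetical)

X = sm.add_constant(income)            # adds the intercept term (beta1)
model = sm.OLS(consumption, X).fit()   # estimates Y = beta1 + beta2*X + u

print(model.params)                    # beta1 (intercept) and beta2 (the MPC)
print(model.t_test("x1 = 1"))          # test H0: beta2 = 1 (Keynesian theory expects MPC < 1)
```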
Statistical (Stochastic) Dependence vs. Functional (Deterministic) Dependence:
- Nature of Variables: Statistical dependence involves random or stochastic variables with probability distributions; functional dependence involves variables that are not random or stochastic.
- Predictability: Under statistical dependence, the dependent variable cannot be predicted exactly due to measurement errors and unidentifiable factors; functional dependence allows precise predictions, as the relationships are exact.
- Examples: Statistical — crop yield based on temperature, rainfall, etc.; social sciences data. Functional — Newton's law of gravity, Ohm's law in physics, Boyle's gas law, Kirchhoff's law, Newton's laws of motion.
- Transition from Functional to Statistical: Occurs when there are errors in measurement or disturbances in otherwise deterministic relationships; a deterministic relationship may become statistical when it is affected by errors or uncertainties.
Q4: What do you mean by primary and secondary data (with examples)? (SP GUPTA)
Primary Data
Primary data are measurements observed and recorded as part of an original study. When the data
required for a particular study can be found neither in the internal records of the enterprise, nor in
published sources, it may become necessary to collect original data, i.e., to conduct first hand investigation.
The work of collecting original data is usually limited by time, money and manpower available for the study.
When the data to be collected are very large in volume, it is possible to draw reasonably accurate
conclusions from the study of a small portion of the group called a sample. The actual procedures used in
collecting data are essentially the same whether all the items are to be included or only some items are
considered.
Secondary Data
When an investigator uses the data which has already been collected by others, such data are called
Secondary data. Secondary data can be obtained from journals, reports, government publications,
publications of research organisations, trade and professional bodies, etc. However, secondary data must
be used with utmost care. The user should be extra cautious in using secondary data and he should not
accept it at its face value. The reason is that such data may be full of errors because of bias, inadequate size
of the sample, substitution, errors of definition, arithmetical errors, etc. Even if there is no error, secondary
data may not be suitable and adequate for the purpose of the inquiry.
Qualitative Data
Qualitative Data, also referred to as Categorical data, is
data characterized by approximation and description. It
lacks numerical values and is instead observed and
recorded. For instance, determining whether a person is
male or female.
Nominal Data: Nominal Data is a form of qualitative data that encompasses two or more categories without
any inherent ranking or preference order. For example, a real estate agent might categorize properties as
flats, bungalows, penthouses, and so on. However, these categories do not imply any particular order of
preference.
Ordinal Data: Ordinal Data, similar to nominal data, includes two or more categories, but the key distinction
is that this data can be ranked.
Eg. rate the behaviour of bank staff: Friendly, Rude, or Indifferent
Eg. rating a product: Very Useful, Useful, Neutral, Not Useful, or Not at All Useful
Quantitative Data
It can be represented numerically
Discrete Data: It is data that can take only certain values. Also known as attribute data, it is information that can be categorized into classifications with no fractional or continuous values. It can take only a finite number of values; such values are obtained by counting and cannot be subdivided meaningfully.
Eg. Number of students in a class:
- You can count it.
- It can be 20 or 21, but not 21.2 or 20.4, etc.
Continuous Data: It is information that can be measured on a continuum or scale. It can have an infinite
number of different values depending upon the precision of the measuring instrument. Obtained by
measurement, it can take on any numeric value.
Eg. Weight of a person can be 65, 66, 65.1, 65.8, 65.231 etc.
Interval Scale:
Eg. The difference between 30° and 40°, and between 80° and 90°, on the Fahrenheit scale of temperature represents the same temperature difference.
- Properties:
o Equal intervals
o Can count, rank, and take differences
o No true zero point: zero does not mean the complete absence of the property being measured
Eg. The temperature in the house is 30° and the outside temperature is 2°. It is meaningful to say that the difference in temperature is 28°, but not meaningful to say that the house is 15 times hotter than outside.
Ratio Scale:
- It is an interval scale with the additional property of ‘Zero Point’
- Properties:
o Can Count
o Can Rank
o Take differences and differences are meaningful
o There is a zero point (ratios are meaningful)
Eg. The father's age is 50 years and the son's age is 25 years. It is meaningful to say that the difference in age is 25 years, and also meaningful to say that the father's age is 2 times the son's age.
1. Purposive or Subjective or Judgment Sampling: In this method, a desired number of sample units are
deliberately selected based on the objective of the inquiry. The goal is to include only important items that
represent the true characteristics of the population. However, this method is highly subjective, as it relies
on the personal convenience, beliefs, biases, and prejudices of the investigator.
2. Probability Sampling: Probability sampling is a scientific technique for selecting samples from a population according to specific laws of chance. In this method, each unit in the population has a predefined probability of being selected in the sample. There are different types of probability sampling, including simple random sampling, stratified sampling, systematic sampling, and cluster sampling.
3. Mixed Sampling: Mixed sampling involves a combination of probability-based sampling methods (as
mentioned in section 2) and fixed sampling rules (no use of chance). It is a hybrid approach to sampling.
The CV for distribution B is indeed higher than that of distribution A, indicating that distribution B has more
variation.
Both distribution A and distribution B have the same skewness value, which is 3.
12. Moments of Odd Order: All moments of odd order about the mean are zero (μ1 = μ3 = μ5 = ... = 0).
13. Moments of Even Order: The moments of even order about the mean are μ2n = 1·3·5···(2n−1)σ^(2n); in particular, μ2 = σ² and μ4 = 3σ⁴.
14. Asymptote: The x-axis is an asymptote to the curve as X moves numerically far from the mean.
15. Additivity Property: A linear combination of independent Normal random variables is also a Normal
random variable.
16. Mean Deviation: The mean deviation about the mean, median, or mode is approximately 0.7979 times
the standard deviation (σ).
17. Quartiles: The quartiles are Q1 = μ − 0.6745σ and Q3 = μ + 0.6745σ.
18. Quartile Deviation: The quartile deviation is approximately 0.6745 times the standard deviation (σ).
19. Relationship Between Q.D., M.D., and S.D.: Approximately, Q.D. = (2/3)σ and M.D. = (4/5)σ, so Q.D. : M.D. : S.D. = 10 : 12 : 15.
20. Relationship Between S.D., M.D., and Q.D.: Equivalently, 4 S.D. ≈ 5 M.D. ≈ 6 Q.D.
21. Points of Inflexion: Points of inflexion of the curve are at X = μ ± σ.
22. Area Property: The area under the curve between μ ± 1σ is about 68.27% of the total area, between μ ± 2σ about 95.45%, and between μ ± 3σ about 99.73%.
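As a quick numerical check of the constants quoted above (0.7979σ, 0.6745σ, and the area percentages), a small sketch using scipy; this is an illustration, not part of the source notes.

```python
# Quick numerical check (illustration only) of the normal-distribution constants above.
import numpy as np
from scipy.stats import norm

sigma = 1.0  # standard normal: mu = 0, sigma = 1

# Mean deviation about the mean: sqrt(2/pi)*sigma ~ 0.7979*sigma
print(np.sqrt(2 / np.pi) * sigma)

# Quartiles: Q1 = mu - 0.6745*sigma, Q3 = mu + 0.6745*sigma (so Q.D. ~ 0.6745*sigma)
print(norm.ppf(0.25), norm.ppf(0.75))

# Area property: P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # ~0.6827, 0.9545, 0.9973
```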
The Normal Distribution, also known as the Gaussian Distribution, is the most important probability
distribution in statistics and probability theory. It is a continuous probability distribution that is
symmetrical about the mean, showing that data near the mean are more frequent in occurrence than
data far from the mean. In graphical form, the normal distribution appears as a "bell curve".
The Normal Distribution holds the most honorable position in probability theory for several reasons:
The following are the conditions under which the Binomial Distribution and the Poisson Distribution can be approximated by a Normal Distribution:
- Binomial Distribution: n is large and p is not too close to 0 or 1.
- Poisson Distribution: the mean (λ) of the distribution is large.
Here are some examples of when the Binomial Distribution and the Poisson Distribution can be
approximated by a Normal Distribution:
• The probability of getting at least 10 heads in 20 flips of a fair coin can be approximated by a Normal
Distribution, since the number of trials (20) is large and the probability of success (0.5) is not too
close to 0 or 1.
• The probability of getting at least 10 cars passing through an intersection in a given hour can be
approximated by a Normal Distribution, since the rate of events (cars passing through the
intersection) is large.
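A minimal sketch of the coin-flip example, comparing the exact binomial probability with its normal approximation; the continuity correction is an added detail, not mentioned in the notes.

```python
# Sketch: normal approximation to P(X >= 10) for n = 20 fair-coin flips.
from scipy.stats import binom, norm
import math

n, p = 20, 0.5
mu = n * p                            # 10
sigma = math.sqrt(n * p * (1 - p))    # ~2.236

exact = binom.sf(9, n, p)             # exact P(X >= 10)
approx = norm.sf((9.5 - mu) / sigma)  # normal approximation with continuity correction

print(f"Exact binomial  P(X >= 10): {exact:.4f}")
print(f"Normal approx.  P(X >= 10): {approx:.4f}")
```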
The theory of hypothesis testing was introduced by Jerzy Neyman and Egon Pearson in the first half of the 20th century. Generalizations have to be drawn about the population parameters based upon the evidence obtained from the study of the sample.
Hypothesis testing is a statistical tool for testing a hypothesis about the parent population from which the sample is drawn, and it helps us make decisions in such a scenario.
We use null and alternative hypotheses. The null hypothesis (Ho) suggests there's no significant difference
between a sample and a population, while the alternative hypothesis (Ha) specifies the desired outcome.
The procedure of testing hypotheses follows five sequential steps, which are as follows:
4. Doing calculations:
- Calculate test statistics and standard error from the sample.
5. Making decisions:
- a. Test Statistic Approach:
- If |Calculated Statistic| > Table (Critical) Value, reject Ho.
- If |Calculated Statistic| ≤ Table Value, do not reject Ho.
- b. P-value Approach:
- If P-value > significance level α (e.g., 0.05), do not reject Ho.
- If P-value ≤ significance level α (e.g., 0.05), reject Ho.
Type-I Error: In a statistical hypothesis-testing experiment, a Type-I error is committed by rejecting the null hypothesis when it is true. The probability of committing a Type-I error is denoted by α.
Type-II Error: A Type-II error is committed by not rejecting the null hypothesis when it is false. The probability of committing a Type-II error is denoted by β.
Hypothesis Testing:
Since the calculated t-statistic (−0.62) is smaller in absolute value than the critical t-value (4.032), we fail to reject the null hypothesis.
At a 1% significance level, we do not have enough evidence to reject the null hypothesis.
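The quoted critical value (4.032) corresponds to a two-tailed test at the 1% level with 5 degrees of freedom, so df = 5 is assumed in the sketch below; this is an illustration, not part of the original solution.

```python
# Sketch of the decision rule above; df = 5 is an assumption inferred from the
# quoted critical value 4.032 (two-tailed, 1% level).
from scipy.stats import t

t_calc = -0.62        # calculated t-statistic from the notes
alpha, df = 0.01, 5

t_crit = t.ppf(1 - alpha / 2, df)     # two-tailed critical value ~ 4.032
p_value = 2 * t.sf(abs(t_calc), df)

if abs(t_calc) > t_crit:
    print(f"|t| = {abs(t_calc):.2f} > {t_crit:.3f}: reject H0")
else:
    print(f"|t| = {abs(t_calc):.2f} <= {t_crit:.3f}: fail to reject H0 (p = {p_value:.3f})")
```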
(BARD)
The seller can use the arithmetic mean and standard deviation of the demand for the product in August to
make decisions about the following: Production planning: The seller can use the mean demand to forecast
future demand and plan production accordingly. This will help to avoid overproduction and
underproduction.
• Inventory management: The seller can use the standard deviation of demand to determine
how much inventory to hold. A higher standard deviation indicates that demand is more volatile, so
the seller will need to hold more inventory to avoid stockouts.
• Pricing: The seller can use the mean and standard deviation of demand to set prices. If the
seller knows that demand is relatively stable, they can set a higher price. If demand is more volatile,
the seller may need to set a lower price to attract customers.
• Marketing: The seller can use the mean and standard deviation of demand to develop
targeted marketing campaigns. For example, if the seller knows that demand is higher in certain
regions or during certain times of the year, they can focus their marketing efforts on those areas
and times.
Example
Suppose a seller knows that the mean demand for a product in August was 100 units and the standard
deviation was 20 units. This means that the seller can expect to sell about 100 units in August. If demand is approximately normally distributed, there is roughly a 68% chance that demand will be between 80 and 120 units (within one standard deviation of the mean).
The seller can use this information to make decisions about production, inventory, pricing, and marketing.
For example, the seller may decide to produce 110 units per month in August to ensure that they do not
run out of stock. The seller may also decide to offer a discount on the product in August to attract customers
during a month when demand is typically lower.
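A small sketch of the arithmetic behind this example, assuming demand is approximately normal with mean 100 and standard deviation 20; the 90% service-level figure is an added illustration, not from the source.

```python
# Sketch: using the mean (100) and standard deviation (20) of August demand.
from scipy.stats import norm

mu, sigma = 100, 20

# ~68% interval: mu +/- 1 standard deviation
print("Approx. 68% demand range:", mu - sigma, "to", mu + sigma)   # 80 to 120

# Illustrative stock level that covers demand ~90% of the time (assumed normality)
stock_90 = norm.ppf(0.90, loc=mu, scale=sigma)
print(f"Stock needed for a 90% service level: about {stock_90:.0f} units")
```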
Conclusion
The arithmetic mean and standard deviation of demand are valuable tools that sellers can use to make
informed decisions about their businesses. By understanding how demand is likely to behave in the future,
sellers can better plan for production, inventory, pricing, and marketing.
UNIT 3
Q1: Explain OLS (no derivation) (Gujarati).
Ordinary Least Squares (OLS) is a statistical method for estimating
the coefficients of a linear regression model. It is the most
common method used in regression analysis, and is also the
foundation for many other statistical methods.
The OLS estimators are unbiased and efficient (BLUE) under the classical linear regression model, which assumes, among other things, that the errors have zero mean, constant variance, and are independent of one another; with the added assumption of normally distributed errors, exact hypothesis tests are possible. The OLS estimators are also consistent, meaning that they converge to the true population parameters as the sample size increases.
The simple linear regression model is y = β0 + β1x + u, where:
• y is the dependent variable
• x is the independent variable
• β0 is the intercept parameter
• β1 is the slope parameter
• u is the error term
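For this simple model, the OLS estimates have the closed form βˆ1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and βˆ0 = ȳ − βˆ1x̄. A minimal sketch with hypothetical data:

```python
# Minimal sketch of OLS by the closed-form formulas (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # independent variable (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # dependent variable (hypothetical)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
beta0_hat = y.mean() - beta1_hat * x.mean()                                         # intercept

residuals = y - (beta0_hat + beta1_hat * x)    # u_hat: what the fitted line cannot explain
print(beta0_hat, beta1_hat, residuals.sum())   # residuals sum to ~0 when an intercept is included
```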
Q3: List the assumptions of classical regression. What are the properties of the CLRM? (10 properties in Gujarati)
The SRF is estimated using the ordinary least squares (OLS) method. The OLS method minimizes the sum of
the squared residuals, which are the differences between the actual values of the dependent variable and
the predicted values of the dependent variable based on the SRF.
Q5: What do you mean by the Error/Stochastic/Noise term, and what is its importance in the classical
regression model? (7 Points in Gujarati)
The variable u, called the error term or disturbance in the relationship, represents factors other than x that
affect y. A simple regression analysis effectively treats all factors affecting y other than x as being
unobserved. You can usefully think of u as standing for “unobserved.”
It captures the random and unpredictable variations in the dependent variable that cannot be explained by
the independent variables.
The error term is important because it plays a crucial role in determining the accuracy and reliability of the
regression model. By assuming certain properties of the error term, such as its distribution and
independence, we can make statistical inferences about the regression coefficients and test hypotheses
about the relationship between the dependent and independent variables.
1. Unobserved Factors: Error terms (ui) account for unmeasured factors affecting the dependent variable.
2. Randomness: Error terms are assumed to follow a normal distribution, enabling statistical analysis.
3. Zero Mean: Errors have an average of zero, suggesting a well-specified model.
4. Homoscedasticity: Errors have constant variance across independent variables.
5. Independence: Errors for one observation are unrelated to others, ensuring unique information.
6. Central Limit Theorem: Normality assumption aids hypothesis testing.
7. Estimation and Inference: Errors are vital for OLS parameter estimation, hypothesis testing, and model
evaluation.
Q13: Explain Time Series Data, Cross Sectional Data, Panel Data, Pool Data, and Seasonal Data (NOTES
(lost) + Gujarati).
Time Series Data:
Time series data are observations on one or more variables collected at successive points in time, usually at regular intervals (e.g., daily stock prices, quarterly GDP, annual sales); the ordering of the observations matters, and such data often show trends and seasonality.
Cross-Sectional Data:
Cross-sectional data are collected at a single point in time and provide information on one or more
variables for different entities or individuals. These entities can be individuals, households, firms, or
any unit of interest. For example, a cross-sectional dataset might include data on income, education,
and age for a group of people surveyed in a specific year. Analyzing cross-sectional data often
involves addressing heterogeneity, as different entities may exhibit varying characteristics and
behaviours.
Eg. Scooter data from 2005
Pooled Data:
Pooled data refer to a combination of both time series and cross-sectional data. In this type of
dataset, observations are collected from multiple entities at different points in time. An example is
the table provided in the text, which contains data on egg production and prices for 50 states over
two years. Pooled data can provide insights into both cross-sectional variations among entities and
how those entities change over time.
Eg. Data collected from multiple sources over time
Panel Data:
Panel data, also known as longitudinal data or micro panel data, combine aspects of both time series
and cross-sectional data. In panel data, the same entities (e.g., households, firms) are observed or
surveyed over multiple time periods. This type of data is valuable for studying changes within
entities over time. For instance, a panel dataset might include annual income data for a group of
individuals tracked over several years, allowing researchers to examine how individual incomes
evolve over time and the factors influencing those changes.
Eg. Data collected from the same subjects at different points in time (2005 and 2021).
Seasonal Data:
Seasonal data are a specific subset of time series data that exhibit regular patterns or fluctuations
within a year, often associated with seasons or specific time periods. These patterns can include
repeating cycles, such as quarterly variations in retail sales or temperature fluctuations between
summer and winter. Seasonal data analysis typically involves identifying and accounting for these
recurrent patterns to understand underlying trends and make predictions or forecasts.
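To show how these data types differ in shape, a small sketch using pandas; the firms, years, and values are made up for illustration.

```python
# Sketch: cross-sectional vs. panel (long-format) data using pandas.
import pandas as pd

# Cross-sectional: many entities, one time point (hypothetical 2005 values)
cross_section = pd.DataFrame({
    "firm": ["A", "B", "C"],
    "sales_2005": [120, 95, 210],
})

# Panel: the same entities observed in several years (hypothetical values)
panel = pd.DataFrame({
    "firm":  ["A", "A", "B", "B", "C", "C"],
    "year":  [2005, 2021, 2005, 2021, 2005, 2021],
    "sales": [120, 180, 95, 140, 210, 260],
}).set_index(["firm", "year"])   # an (entity, time) MultiIndex marks it as panel data

print(cross_section)
print(panel)
```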
6. Specification Analysis
Omission of a relevant variable; inclusion of irrelevant variable; tests of specification errors.
UNIT 4
iv. Goodness of Fit (1 para each from Gujrati)
Goodness of fit is a statistical measure that assesses how well a sample of data fits a particular distribution
or model. It is a crucial concept in statistics, as it allows us to evaluate the validity and reliability of our
statistical models. A good goodness of fit indicates that the model accurately represents the underlying data,
while a poor goodness of fit suggests that the model may not be appropriate for the data.
There are many different ways to measure goodness of fit, but some of the most common include the coefficient of determination (R²), the mean squared error (MSE), and the adjusted coefficient of determination (adjusted R²). The coefficient of determination (R²) measures how much of the variation in the dependent variable (Y) is explained by the independent variable(s) (X). It is calculated as the square of the correlation coefficient between Y and the fitted values of Y. R² ranges from 0 to 1, with a higher value indicating a better fit.
The mean squared error (MSE) is a measure of the average squared difference between the fitted values
of Y and the actual values of Y. It is calculated as the sum of the squared residuals, divided by the number
of observations. MSE is measured in the same units as Y, and a lower value indicates a better fit.
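A minimal sketch of computing R² and MSE from actual and fitted values (the numbers are hypothetical):

```python
# Sketch: goodness-of-fit measures from actual (y) and fitted (y_hat) values.
import numpy as np

y     = np.array([3.0, 5.0, 7.5, 9.0, 11.2])   # actual values (hypothetical)
y_hat = np.array([3.2, 4.8, 7.1, 9.4, 10.9])   # fitted values (hypothetical)

rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares

r_squared = 1 - rss / tss               # share of variation in y explained by the model
mse = rss / len(y)                      # mean squared error (here divided by n)

print(f"R-squared: {r_squared:.3f}, MSE: {mse:.3f}")
```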
The Method of Maximum Likelihood (ML) is a method of statistical inference that estimates the parameters
of a probability distribution by maximizing the likelihood of the observed data. In the context of multiple linear
regression, the ML method can be used to estimate the regression coefficients and the variance of the error
term.
The ML approach is based on the idea that the best estimates of the parameters are those that make the
observed data as likely as possible. To find these estimates, the ML method involves maximizing the
likelihood function, which is a function of the parameters and the data. The likelihood function is proportional
to the probability of the observed data given the parameters.
In the case of multiple linear regression, the likelihood function is a product of normal density functions, one
for each observation. The mean of each normal density function is equal to the predicted value of the
dependent variable for that observation, and the variance of each normal density function is equal to the
variance of the error term. The ML estimates of the regression coefficients and the variance of the error
term are found by maximizing the likelihood function with respect to the parameters. This can be done using
a variety of methods, such as numerical optimization or gradient descent.
The ML method has several advantages over other methods of estimation, such as least squares. One
advantage is that the ML method is more general and can be applied to a wider range of models. Another
advantage is that the ML method is consistent, meaning that the ML estimates converge to the true
parameter values as the sample size increases.
However, the ML method also has some disadvantages. One disadvantage is that the ML method can be
computationally intensive, especially for models with a large number of parameters. Another disadvantage
is that the ML method can be sensitive to outliers, which can distort the estimates.
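A minimal sketch of the ML idea for simple linear regression: the normal log-likelihood is maximized numerically (here by minimizing its negative with scipy); the data and starting values are hypothetical.

```python
# Sketch: maximum likelihood estimation of y = b0 + b1*x + u, u ~ N(0, sigma^2),
# by numerically maximizing the normal log-likelihood (hypothetical data).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.8, 6.1, 7.9, 10.2, 12.1])

def neg_log_likelihood(params):
    b0, b1, log_sigma = params                 # log_sigma keeps sigma positive
    sigma = np.exp(log_sigma)
    mean = b0 + b1 * x                         # predicted value for each observation
    return -np.sum(norm.logpdf(y, loc=mean, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0, 0.0])
b0_ml, b1_ml, sigma_ml = result.x[0], result.x[1], np.exp(result.x[2])
print(b0_ml, b1_ml, sigma_ml)   # ML slope/intercept match OLS; sigma estimate uses divisor n
```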
vi. Significance of error term and what are its consequences (Mid, Gujrati)
=> 7/8 points given
The objective of seasonal analysis is to deseasonalize or seasonally adjust time series data, allowing
analysts to focus on other components such as trends. Seasonal adjustment is particularly important
for economic indicators like the unemployment rate, consumer price index (CPI), producer's price index
(PPI), and industrial production index, which are commonly published in seasonally adjusted form.
One method of deseasonalization involves the use of dummy variables. Dummy variables are binary
variables that take the value of 1 in a specific category and 0 otherwise. In the context of seasonal
analysis, dummy variables are assigned to each quarter of the year to capture the seasonal effects.
However, to avoid the dummy variable trap, an intercept term is omitted from the model.
The dummy variable coefficients are assessed for statistical significance, and a statistically
significant coefficient for a specific quarter indicates a significant seasonal effect during that
period. By including dummy variables for each quarter, the model allows for different intercepts
in each season, effectively capturing the mean sales of refrigerators in each quarter.
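A minimal sketch of the seasonal dummy-variable regression described above, with four quarterly dummies and no intercept (the sales figures are hypothetical):

```python
# Sketch: seasonal dummy-variable regression on quarterly data (hypothetical sales).
# Four quarter dummies are used and the intercept is dropped to avoid the dummy trap.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "sales":   [112, 98, 105, 140, 118, 101, 109, 150],   # hypothetical refrigerator sales
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
})

# 'C(quarter)' creates a dummy for each quarter; '- 1' removes the intercept,
# so each coefficient is simply the mean sales of that quarter.
model = smf.ols("sales ~ C(quarter) - 1", data=data).fit()
print(model.params)   # Q1..Q4 mean sales; significant coefficients indicate seasonal effects
```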
UNIT 5
v. Heteroscedasticity: (Gujrati)
Nature:
Several factors contribute to the presence of heteroscedasticity, including changes over time (as in
error-learning models where errors decrease with learning), increased discretionary income leading
to more varied choices, improvements in data collection techniques reducing errors, the presence of
outliers, specification errors in the regression model, and skewness in the distribution of regressors.
Additionally, heteroscedasticity may arise from incorrect data transformations or functional form
specifications.
It is noted that heteroscedasticity is more likely in cross-sectional data, where members of a population
differ in size or characteristics, compared to time series data where variables tend to be of similar
orders of magnitude over a period. Addressing heteroscedasticity is crucial for maintaining the validity
of classical linear regression model assumptions and obtaining accurate regression results.
Thus, in statistics, heteroscedasticity (or heteroskedasticity) occurs when the variance of the error term, observed over different values of an independent variable or over successive time periods, is not constant.
Consequence:
1. OLS Estimation Allowing for Heteroscedasticity
When using Ordinary Least Squares (OLS) estimation while acknowledging the presence of
heteroscedasticity, several consequences arise:
Confidence Intervals and Hypothesis Testing
• Unnecessarily Larger Confidence Intervals: Confidence intervals based on OLS may be
larger than necessary because the variance of the OLS estimator, denoted as βˆ2, is larger
than the variance of the efficient estimator, denoted as βˆ∗2.
• Inaccurate Testing: The conventional t and F tests may provide inaccurate results. The larger
variance of βˆ2 can lead to the misinterpretation of statistical significance, potentially
classifying a coefficient as insignificant when it might be significant with the correct confidence
intervals.
2. OLS Estimation Disregarding Heteroscedasticity
In scenarios where OLS estimation is performed without accounting for heteroscedasticity, more
serious issues arise:
Biased Estimation
• Bias in Variance Estimation: The standard variance formula for OLS (homoscedastic)
estimation is biased when heteroscedasticity is present. The bias arises because the
conventional estimator of the error variance, σˆ2, is no longer unbiased when
heteroscedasticity exists.
• Misleading Inferences: Continuing to use standard testing procedures without addressing
heteroscedasticity can lead to highly misleading inferences. The bias in variance estimation
affects the reliability of confidence intervals, t-tests, and F-tests, making conclusions drawn
from these methods unreliable.
Empirical Evidence from a Monte Carlo Study
• Overestimation by OLS: Empirical evidence from a Monte Carlo study shows that OLS
consistently overestimates standard errors, especially for larger values of α (the power
parameter determining the relationship between error variance and the explanatory variable
X).
• Superiority of GLS: The results emphasize the superiority of Generalized Least Squares
(GLS) over OLS, particularly in the presence of heteroscedasticity. GLS provides more
accurate standard errors compared to OLS.
In summary, ignoring heteroscedasticity in OLS estimation can lead to biased results, inflated standard
errors, and inaccurate statistical inferences. Addressing heteroscedasticity through methods like GLS
becomes crucial for obtaining reliable estimates and valid hypothesis tests.
Detection:
Formal Methods for Detecting Heteroscedasticity:
1. Park Test:
• Procedure: Regress the log of the squared residuals on the log of the explanatory variable (ln ûi² on ln Xi).
• Interpretation: A statistically significant slope coefficient implies the presence of heteroscedasticity, indicating that the variance of the residuals varies systematically with the regressor.
2. Goldfeld–Quandt Test:
• Procedure: Order the observations by the variable thought to cause heteroscedasticity, omit some central observations, fit separate regressions to the two remaining subsets, and compare their residual variances with an F test.
• Interpretation: A significantly higher residual variance in one subset suggests the existence of heteroscedasticity.
3. Breusch–Pagan–Godfrey Test:
• Procedure: Examine whether the error variance relates to specific explanatory
variables.
• Interpretation: Statistically significant coefficients of these variables indicate
heteroscedasticity, implying that certain factors contribute to the varying error
variances.
4. White’s General Heteroscedasticity Test:
• Procedure: Conduct a regression of squared residuals on the original regressors, their
squares, and cross-products.
• Interpretation: A significant model suggests that these variables explain
heteroscedasticity. Additionally, the test can identify specification errors in the model.
5. Koenker–Bassett (KB) Test:
• Procedure: Regress squared residuals on the squared estimated values of the
dependent variable.
• Interpretation: A statistically significant coefficient indicates the presence of
heteroscedasticity. Notably, this test remains applicable even without assuming
normality in the error terms.
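Two of these tests (Breusch–Pagan and White) are available in statsmodels; a minimal sketch on simulated heteroscedastic data (the data-generating process is an assumption for illustration):

```python
# Sketch: Breusch-Pagan and White tests on simulated heteroscedastic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
u = rng.normal(0, 0.5 * x)            # error variance grows with x (heteroscedastic)
y = 2.0 + 1.5 * x + u

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, X)
w_stat,  w_pvalue,  _, _ = het_white(resid, X)

print(f"Breusch-Pagan: LM = {bp_stat:.2f}, p = {bp_pvalue:.4f}")
print(f"White        : LM = {w_stat:.2f}, p = {w_pvalue:.4f}")
```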
Remedies(test ) :
There are two approaches to remedying heteroscedasticity: one when the error variance σi² is known and the other when it is not known.
1. When σi² Is Known: The Method of Weighted Least Squares (WLS)
• If σi² is known, the most straightforward method is Weighted Least Squares (WLS). This involves dividing the regression equation through by the standard deviation σi to transform the model.
• The WLS estimators obtained this way are Best Linear Unbiased Estimators (BLUE), providing efficient estimates even in the presence of heteroscedasticity.
2. When σi² Is Not Known: White's Heteroscedasticity-Consistent Variances and Standard Errors
• Since the true σi² is rarely known, White's heteroscedasticity-consistent variances and standard errors can be used to obtain asymptotically valid statistical inferences about the true parameter values.
• White's procedure is available in regression packages and is known for providing robust standard errors.
Additional Methods for Heteroscedasticity Remediation:
• Weighted Least Squares (WLS) with Unknown σi²: Use the WLS approach even when σi² is not known, by estimating it from the data.
• Ad Hoc Transformations (based on plausible assumptions about the pattern of heteroscedasticity):
• Assumption 1: If the error variance is believed to be proportional to Xi², transform the model by dividing through by Xi.
• Assumption 2: If the error variance is believed to be proportional to Xi itself, transform the model by dividing through by √Xi.
• Assumption 3: If the error variance is believed to be proportional to the square of the mean value of Y, [E(Yi)]², divide through by E(Yi) (in practice, by the fitted value Ŷi).
• Assumption 4: A log transformation (regressing ln Y on ln X) often reduces heteroscedasticity.
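A minimal sketch of two of these remedies in statsmodels: White's heteroscedasticity-consistent (robust) standard errors, and WLS under the assumption that the error variance is proportional to Xi² (so the weights are 1/Xi²); the data are simulated.

```python
# Sketch: remedies for heteroscedasticity -- robust (White) standard errors and WLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5 * x)   # error s.d. proportional to x
X = sm.add_constant(x)

# 1) OLS with White's heteroscedasticity-consistent standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")
print(robust.bse)            # robust standard errors

# 2) WLS assuming Var(u_i) proportional to x_i^2  =>  weights = 1 / x_i^2
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls.params, wls.bse)
```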
ix. Multicollinearity (Gujrati)
i. Multicollinearity means the presence of high correlation between two or more explanatory
variables in a multiple regression model.
ii. The classical linear regression model assumes that there is no perfect multicollinearity among the explanatory variables.
Nature Of Multicollinearity:
Causes:
There are several sources of multicollinearity. As Montgomery and Peck note, multicollinearity may be due to the following factors:
1. The data collection method employed, for example, sampling over a limited range of the
values taken by the regressors in the population.
2. Constraints on the model or in the population being sampled. For example, in the regression
of electricity consumption on income (X2) and house size (X3) there is a physical constraint in
the population in that families with higher incomes generally have larger homes than families
with lower incomes.
3. Model specification, for example, adding polynomial terms to a regression model, especially
when the range of the X variable is small.
4. An overdetermined model. This happens when the model has more explanatory variables
than the number of observations. This could happen in medical research where there may be
a small number of patients about whom information is collected on a large number of variables.
An additional reason for multicollinearity, especially in time series data, may be that the
regressors included in the model share a common trend, that is, they all increase or decrease
over time. Thus, in the regression of consumption expenditure on income, wealth, and
population, the regressors income, wealth, and population may all be growing over time at more
or less the same rate, leading to collinearity among these variables.
Consequences:
Detection
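One standard detection tool is the variance inflation factor (VIF), with VIF > 10 commonly taken as a sign of serious multicollinearity; a minimal sketch with simulated collinear regressors (an illustration, not from the notes):

```python
# Sketch: detecting multicollinearity with variance inflation factors (VIF).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
income = rng.normal(50, 10, size=100)
wealth = 5 * income + rng.normal(0, 5, size=100)   # highly correlated with income
X = sm.add_constant(pd.DataFrame({"income": income, "wealth": wealth}))

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # ignore the 'const' row; VIF well above 10 for income and wealth signals multicollinearity
```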
Remedies(Test )
Serial correlation, or autocorrelation, is the relationship between a variable and its lagged
version over various time intervals. It assesses how a variable's current value is influenced by
its past values.
Correcting Autocorrelation
Several methods exist for correcting serial correlation:
• Hansen Method:
• Estimates the degree of autocorrelation, adjusting standard errors accordingly.
• Newey-West Estimator:
• Creates a weighting matrix to adjust for autocorrelation, widely used in
econometrics and finance.
• Modifying Regression Equation:
• Adds lag terms to account for correlations between the dependent variable and
error terms.
• Other Correction Methods:
• Use of instrumental variables and panel data methods.
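A minimal sketch of detecting serial correlation with the Durbin–Watson statistic and of the Newey–West (HAC) correction mentioned above; the AR(1) errors are simulated for illustration.

```python
# Sketch: Durbin-Watson detection and Newey-West (HAC) correction for autocorrelation.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                      # AR(1) errors: u_t = 0.7*u_{t-1} + e_t
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
print("Durbin-Watson:", durbin_watson(ols.resid))   # well below 2 => positive autocorrelation

# Newey-West: same OLS coefficients, autocorrelation-robust standard errors
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(hac.bse)
```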
UNIT 6
x. How to select independent variables for any dependent variable in research (Seema mam notes)
xii. Criteria for selection of independent variables for any dependent variable
###############NUMERICAL####################
Chapters from SP Gupta -> 4, 5, 6, (14), 15, 16
i. AM, GM, HM, Median, Mode, Variance, and Standard Deviation. => SP Gupta