
Prepared by Madhav Gupta (2K21/CO/262)

HU307: Basic Econometrics (MidSem Compiled Notes)


Contact Hours- 04 (per week)
Examination Duration- 03 Hours
Credits- 04 (4-0-0)

Course Description
This course provides a comprehensive introduction to basic econometric concepts and techniques. It
covers estimation and diagnostic testing of simple and multiple regression models. The course also covers
the consequences of and tests for misspecification of regression models.

Course Outline

1. Nature and Scope of Econometrics


Meaning, Scope, Importance and Application

2. Statistical Concepts
Normal distribution;
chi-sq, t- and F-distributions;
estimation of parameters;
properties of estimators;
testing of hypotheses.

3. Simple Linear Regression Model: Two Variable Case


Estimation of model by method of ordinary least squares;
properties of estimators; goodness of fit; tests of hypotheses;
scaling and units of measurement;
confidence intervals;
Gauss-Markov theorem;
forecasting.

UNIT 1 (Gujarati and SP Gupta)


Q1: What do you mean by Econometrics?
Econometrics may be defined as the quantitative analysis of actual economic phenomena based on the
concurrent development of theory and observation, related by appropriate methods of inference.

It is the social science in which the tools of economic theory, mathematics, and statistical inference are
applied to the analysis of economic phenomena. It is concerned with the empirical determination of
economic laws.

The method of econometric research aims, essentially, at a conjunction of economic theory and actual
measurements, using the theory and technique of statistical inference as a bridge.

Q2: What is the scope of Econometrics?

Econometrics consists of the application of mathematical statistics to economic data to lend empirical support
to the models constructed by mathematical economics and to obtain numerical results.

Econometrics is an amalgamation of economic theory, mathematical economics, economic statistics and
mathematical statistics.

1. Economic theory makes a statement or hypothesis that is mostly qualitative in nature.


Eg.) As per the psychological law of consumption, let us make a statement that B.Tech students
of DTU spend 75% of their income on consumption.

2. The main concern of mathematical economics is to express economic theory in terms of some
mathematical equation, e.g. C = A + BY.

3. Economic statistics is mainly concerned with collecting, presenting and processing economic
data in the form of charts and tables.
Eg.) primary, secondary, qualitative, quantitative, seasonal, time series data.

Q3: How is it different from mathematical economics? What is the need for it to be studied as a separate
discipline? (2 BOOKS)

Econometrics is different from mathematical economics in its focus on empirical verification of economic
theories. Mathematical economics is concerned with the construction and analysis of mathematical models
of economic phenomena, without regard to the measurability or empirical verification of the models.
Econometrics, on the other hand, is concerned with using statistical and mathematical methods to test and
interpret economic theories using real-world data.

Econometrics needs to be studied as a separate discipline because it has its own unique set of methods and
techniques. These methods and techniques are necessary to deal with the unique challenges of testing
economic theories using real-world data, such as the non-experimental nature of economic data and the
noisy nature of economic data.

Q3: Explain the methodology of Econometrics. (GUJARATI, Wooldridge).

Traditional econometric methodology proceeds along the following lines:

1. Statement of Theory or Hypothesis: Begin by stating the economic theory or hypothesis you want to
investigate. In this example, it is the Keynesian theory of consumption.

2. Specification of the mathematical model of the theory: Create a mathematical model representing the
theory. In this case, a linear consumption function relating consumption (Y) to income (X) is proposed.

3. Specification of the statistical, or econometric, model: Modify the mathematical model to account for
inexact relationships between economic variables. Introduce a disturbance term (u) to capture
unaccounted factors affecting consumption.

4. Obtaining the Data: Collect relevant data that will be used for estimation and analysis. In this case, data
on personal consumption expenditure (PCE) and gross domestic product (GDP) for the period 1960-2005 is
gathered.

5. Estimation of the parameters of the econometric model: Use statistical techniques, like regression
analysis, to estimate the model's parameters (β1 and β2) from the data. In this example, estimates of
−299.5913 and 0.7218 are obtained for β1 and β2, respectively.

6. Hypothesis Testing: Evaluate whether the estimated parameters are statistically significant and
consistent with the theory. In this case, you would test if the MPC estimate of 0.72 is significantly less than
1 to support Keynesian theory.

7. Forecasting or Prediction: Use the estimated model to make predictions about future economic variables
based on expected values of the independent variable. For example, predict future consumption
expenditure based on forecasted GDP.

8. Using the model for control or policy purposes: Apply the estimated model for policy analysis and
control. Determine how changes in policy variables (e.g., tax policy) will impact economic outcomes (e.g.,
income and consumption).

Throughout these steps, it's crucial to consider the adequacy of the chosen model in explaining the data
and to compare it to alternative models or hypotheses when applicable. This ensures robust and reliable
econometric analysis.
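As a rough illustration of steps 4 to 6 (not taken from Gujarati's actual 1960-2005 PCE/GDP data), here is a minimal Python sketch, assuming numpy and statsmodels are available; the income and consumption figures are hypothetical.

```python
# Minimal sketch of steps 4-6: data, OLS estimation, a simple hypothesis check.
# The income/consumption numbers below are made up for illustration only.
import numpy as np
import statsmodels.api as sm

income = np.array([100, 120, 140, 160, 180, 200, 220, 240], dtype=float)       # X (hypothetical)
consumption = np.array([80, 95, 108, 122, 138, 151, 166, 180], dtype=float)    # Y (hypothetical)

X = sm.add_constant(income)              # adds the intercept column
model = sm.OLS(consumption, X).fit()     # step 5: estimate beta1, beta2 by OLS

b1, b2 = model.params                    # intercept and MPC estimate
print(f"intercept = {b1:.3f}, MPC = {b2:.3f}")

# step 6: is the estimated MPC significantly less than 1?  t = (b2 - 1) / se(b2)
t_stat = (b2 - 1) / model.bse[1]
print(f"t-statistic for H0: MPC = 1 -> {t_stat:.2f}")
```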

Statistical (stochastic) dependence vs. functional (deterministic) dependence:

Nature of Variables:
- Statistical: involves random or stochastic variables with probability distributions.
- Functional: involves variables that are not random or stochastic.

Predictability:
- Statistical: the dependent variable cannot be predicted exactly, owing to measurement errors and unidentifiable factors.
- Functional: allows for precise predictions, as the relationships are exact.

Examples:
- Statistical: crop yield explained by temperature, rainfall, etc.; social sciences data.
- Functional: Newton's law of gravity, Ohm's law, Boyle's gas law, Kirchhoff's law, Newton's laws of motion.

Intrinsic Variability:
- Statistical: contains inherent or "intrinsic" random variability that cannot be fully explained.
- Functional: typically lacks intrinsic random variability.

Transition from Functional to Statistical:
- A functional (deterministic) relationship becomes statistical when there are errors of measurement or disturbances affecting an otherwise exact relationship.

Q4: What do you mean by primary and secondary data (with examples)? (SP GUPTA)

Primary Data
Primary data are measurements observed and recorded as part of an original study. When the data
required for a particular study can be found neither in the internal records of the enterprise, nor in
published sources, it may become necessary to collect original data, i.e., to conduct first hand investigation.
The work of collecting original data is usually limited by time, money and manpower available for the study.
When the data to be collected are very large in volume, it is possible to draw reasonably accurate
conclusions from the study of a small portion of the group called a sample. The actual procedures used in
collecting data are essentially the same whether all the items are to be included or only some items are
considered.

Examples of primary data include:


• Data collected through surveys, questionnaires, and interviews
• Data collected through observation
• Data collected through experiments

Secondary Data
When an investigator uses the data which has already been collected by others, such data are called
Secondary data. Secondary data can be obtained from journals, reports, government publications,
publications of research organisations, trade and professional bodies, etc. However, secondary data must
be used with utmost care. The user should be extra cautious in using secondary data and he should not
accept it at its face value. The reason is that such data may be full of errors because of bias, inadequate size
of the sample, substitution, errors of definition, arithmetical errors, etc. Even if there is no error, secondary
data may not be suitable and adequate for the purpose of the inquiry.

Examples of secondary data include:


• Data collected by government agencies, such as the Census Bureau or the Bureau of Labor Statistics
• Data collected by private organizations, such as research firms or market research firms
• Data published in books, journals, and other academic publications
• Data found on websites and other online resources

Q5: What do you mean by qualitative and quantitative data?

Qualitative Data
Qualitative Data, also referred to as Categorical data, is
data characterized by approximation and description. It
lacks numerical values and is instead observed and
recorded. For instance, determining whether a person is
male or female.

Nominal Data: Nominal Data is a form of qualitative data that encompasses two or more categories without
any inherent ranking or preference order. For example, a real estate agent might categorize properties as
flats, bungalows, penthouses, and so on. However, these categories do not imply any particular order of
preference.

Ordinal Data: Ordinal Data, similar to nominal data, includes two or more categories, but the key distinction
is that this data can be ranked.
Eg. rate the behaviour of bank staff: Friendly, Rude, or Indifferent
Eg. rating a product: Very Useful, Useful, Neutral, Not Useful, or Not at All Useful

Binary Data: When a variable can take only two categories.


Eg. Yes or No
Eg. Pass or Fail

Quantitative Data
It can be represented numerically.

Discrete Data: It is data that can take only certain values. Also known as attribute data, it is
information that can be categorized into a classification with no fractional or continuous values. It can only
take a finite number of values; such values are obtained by counting and cannot be subdivided meaningfully.
Eg. Number of students in a class:
- You can count them
- It can be 20 or 21, but not 21.2 or 20.4 etc.

Continuous Data: It is information that can be measured on a continuum or scale. It can have an infinite
number of different values depending upon the precision of the measuring instrument. Obtained by
measurement, it can take on any numeric value.
Eg. Weight of a person can be 65, 66, 65.1, 65.8, 65.231 etc.

Q6: What are the different scales of measurement? or


Explain the concept of Unit of Measurement or Scale (notes).
Nominal Scale:
- Assigning numbers to different categories
- The numbers have no real meaning other than differentiating the categories
Eg.
• Flats – 1, Kothis – 2, Penthouse – 3
• Cricket jersey numbers – they provide no insight into a player's position
Ordinal Scale:
- Just like the nominal scale, numbers are assigned to different categories, but here they are meaningful
- The numbers indicate the rank or order of preference
Eg.
• Friendly – 1, Indifferent – 2, Rude – 3
• Very Useful – 1, Useful – 2, Neutral – 3, Not Useful – 4, Not at All Useful – 5
Interval Scale:
- These are numeric scales in which intervals have the same interpretation throughout.

Eg. The difference between 30° and 40°, or between 80° and 90°, on the Fahrenheit temperature scale
represents the same temperature difference.
- Properties:
o Equal intervals
o Can count, rank and take differences
o No true zero point: zero does not represent the complete absence of the property being measured
Eg. The temperature inside the house is 30° and the outside temperature is 2°. It is meaningful to say that
the difference in temperature is 28°, but not meaningful to say that the house is 15 times
hotter than outside.
Ratio Scale:
- It is an interval scale with the additional property of ‘Zero Point’
- Properties:
o Can Count
o Can Rank
o Take differences and differences are meaningful
o There is a zero point (ratios are meaningful)
Eg. The age of a father is 50 years and that of his son is 25 years. It is meaningful to say that
the difference in age is 25 years, and also meaningful to say that the father's age is 2 times the
age of the son.

Q7: What do you mean by Sampling? (S P Gupta)


Sampling refers to the process of selecting a finite subset of a population with the goal of studying and
investigating its properties. This selected subset is called a "sample," and the number of units included in
this sample is referred to as the "sample size." Sampling allows researchers to draw conclusions about the
characteristics of the entire population by studying only the objects or items within the sample. The primary
objectives of sampling theory are to maximize the information obtained about the population while
considering limitations in terms of time, money, and manpower, and to obtain the best possible estimates
of population parameters.
In summary, sampling involves selecting a representative subset of a population to study and make
inferences about the larger population based on the properties of the sample.

Q8: What are the different types/ techniques of sampling? (S P Gupta)


Sampling techniques can be broadly classified as follows:

1. Purposive or Subjective or Judgment Sampling: In this method, a desired number of sample units are
deliberately selected based on the objective of the inquiry. The goal is to include only important items that
represent the true characteristics of the population. However, this method is highly subjective, as it relies
on the personal convenience, beliefs, biases, and prejudices of the investigator.

2. Probability Sampling: Probability sampling is a scientific technique for selecting samples from a
population according to specific laws of chance. In this method, each unit in the population has a predefined
probability of being selected in the sample. There are different types of probability sampling, including:

- Each sample unit having an equal chance of selection.


- Sampling units having varying probabilities of selection.
- Probability of selection being proportional to the size of the unit (PPS); equal-probability and PPS selection are illustrated in the sketch after this list.

3. Mixed Sampling: Mixed sampling involves a combination of probability-based sampling methods (as
mentioned in section 2) and fixed sampling rules (no use of chance). It is a hybrid approach to sampling.
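A small illustrative sketch of two of the probability-sampling rules above (equal-probability selection and selection with probability proportional to size), assuming Python with numpy; the population of firms and their size measure are made up.

```python
# Sketch: simple random sampling vs. probability-proportional-to-size (PPS)
# selection from a hypothetical population of 10 firms.
import numpy as np

rng = np.random.default_rng(42)
firms = np.arange(1, 11)                                        # population units 1..10
sizes = np.array([5, 3, 8, 2, 9, 4, 7, 1, 6, 5], dtype=float)   # hypothetical size measure

# (a) simple random sampling: each unit has an equal chance of selection
srs = rng.choice(firms, size=4, replace=False)

# (b) PPS: larger units are more likely to be drawn
pps = rng.choice(firms, size=4, replace=False, p=sizes / sizes.sum())

print("equal-probability sample:", srs)
print("PPS sample:", pps)
```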

Some important types of sampling schemes covered by sections 2 and 3 are:

- Simple Random Sampling


- Stratified Random Sampling
- Systematic Sampling
- Multistage Sampling
- Quasi Random Sampling
- Area Sampling
- Simple Cluster Sampling
- Multistage Cluster Sampling
- Quota Sampling

(Worked example: distribution A has Mean = 100, Median = 90, SD = 10; distribution B has Mean = 90, Median = 80, SD = 10.)

(a) Coefficient of Variation (CV) for distribution A:


CV(A) = (100 * SD(A)) / Mean(A)
CV(A) = (100 * 10) / 100
CV(A) = 10%

Coefficient of Variation (CV) for distribution B:


CV(B) = (100 * SD(B)) / Mean(B)
CV(B) = (100 * 10) / 90
CV(B) = 11.11% (rounded to two decimal places)

The CV for distribution B is indeed higher than that of distribution A, indicating that distribution B has more
variation.

So, Statement (a) is incorrect.

(b) Skewness (Sk) for distribution A:


Sk(A) = [3 * (Mean(A) - Median(A))] / SD(A)
Sk(A) = [3 * (100 - 90)] / 10
Sk(A) = 3

Skewness (Sk) for distribution B:


Sk(B) = [3 * (Mean(B) - Median(B))] / SD(B)
Sk(B) = [3 * (90 - 80)] / 10
Sk(B) = 3

Both distribution A and distribution B have the same skewness value, which is 3.

So, Statement (b) is correct.
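The calculations above can be checked with a short script; this is only a convenience sketch of the two formulas already used (CV and Pearson's second coefficient of skewness).

```python
# Sketch reproducing the CV and skewness calculations for distributions A and B.
def cv(mean, sd):
    """Coefficient of variation, in percent."""
    return 100 * sd / mean

def pearson_skew(mean, median, sd):
    """Pearson's second coefficient of skewness: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd

print(cv(100, 10), cv(90, 10))                               # 10.0 and about 11.11
print(pearson_skew(100, 90, 10), pearson_skew(90, 80, 10))   # 3.0 and 3.0
```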



UNIT 2 (SP GUPTA)


Distributions (2 Marks Each)
Q1: What is normal distribution? (Gujarati, SP Gupta)
Normal probability distribution or commonly called the normal distribution is one of the most important
continuous theoretical distributions in Statistics. Most of the data relating to economic and business
statistics or even in social and physical sciences conform to this distribution.
If X is a continuous random variable following the normal probability distribution with mean μ and standard
deviation σ, then its probability density function (p.d.f.) is given by
f(x) = [1 / (σ√(2π))] · exp[−(x − μ)² / (2σ²)],  −∞ < x < ∞

When is it used?

What are its properties?


The properties of this distribution are as follows:

1. Bell-Shaped Curve: The graph of the Normal Distribution


is a bell-shaped curve with the peak at the mean (μ).
2. Symmetry: The curve is symmetrical about the mean (μ),
meaning it has the same shape on either side of the mean.
3. Mean, Median, and Mode: In a Normal Distribution, the
mean (average), median, and mode all coincide at the
same value (μ).
4. Equal Areas: The area under the curve is equal on both sides of the mean. Half of the total area lies to
the right of the mean, and half lies to the left.
5. Quartiles: The quartiles are equidistant from the median (μ). This relationship holds: Q1 + Q3 = 2μ.
6. Skewness: The Normal Distribution is symmetric, so the moment coefficient of skewness (β1) is zero.
7. Kurtosis: The coefficient of kurtosis (β2) is 3, indicating that it has no excess kurtosis.
8. Non-Negative Values: The curve never dips below the x-axis; the probability (p(x)) is always non-
negative.
9. Theoretical Range: The theoretical range extends from negative infinity to positive infinity, but
practically, the range is limited to 6 times the standard deviation (Range = 6σ).
10. Peak Probability: The highest point of the curve occurs at the mean (μ), where the maximum value of the
p.d.f. is 1/(σ√(2π)), which is inversely proportional to the standard deviation.
11. Unimodal: The Normal Distribution is unimodal, meaning it has only one mode, which is at the mean (μ).

12. Moments of Odd Order: All moments of odd order about the mean are zero (μ1 = μ3 = μ5 = ... = 0).
13. Moments of Even Order: The moments of even order are given by specific formulas.
14. Asymptote: The x-axis is an asymptote to the curve as X moves numerically far from the mean.
15. Additivity Property: A linear combination of independent Normal random variables is also a Normal
random variable.
16. Mean Deviation: The mean deviation about the mean, median, or mode is approximately 0.7979 times
the standard deviation (σ).
17. Quartiles: Q1 ≈ μ − 0.6745σ and Q3 ≈ μ + 0.6745σ.
18. Quartile Deviation: The quartile deviation is approximately 0.6745 times the standard deviation (σ).
19. Relationship Between Q.D., M.D., and S.D.: approximately 4 S.D. = 5 M.D. = 6 Q.D., i.e.,
Q.D. : M.D. : S.D. ≈ 10 : 12 : 15.
20. Relationship Between S.D., M.D., and Q.D.: equivalently, M.D. ≈ (4/5) S.D. and Q.D. ≈ (2/3) S.D.
21. Points of Inflexion: Points of inflexion of the curve are at X = μ ± σ.
22. Area Property: The area under the curve between specific ordinates represents the percentage of data
within those ranges.

The Normal Distribution, also known as the Gaussian Distribution, is the most important probability
distribution in statistics and probability theory. It is a continuous probability distribution that is
symmetrical about the mean, showing that data near the mean are more frequent in occurrence than
data far from the mean. In graphical form, the normal distribution appears as a "bell curve".

The Normal Distribution holds the most honorable position in probability theory for several reasons:

• It is the most common probability distribution observed in nature. Many naturally-occurring


phenomena tend to approximate the normal distribution, such as the heights of people, the weights
of apples, and the scores on intelligence tests.
• It is the distribution that is obtained by the Central Limit Theorem. The Central Limit Theorem states
that under certain conditions, the average of a large number of independent and identically
distributed random variables will be approximately normally distributed, regardless of the
distribution of the individual random variables. This means that the Normal Distribution can be used
to approximate the distribution of many different types of data, even if the underlying distribution
is unknown.
• It is mathematically convenient. The Normal Distribution has a number of desirable mathematical
properties, such as being symmetric about the mean and having a bell-shaped curve. This makes it
easy to perform calculations with the Normal Distribution and to develop statistical tests based on
it.

The following table shows the conditions under which the Binomial Distribution and the Poisson
Distribution can be approximated by a Normal Distribution:

- Binomial Distribution: n is large and p is not too close to 0 or 1.
- Poisson Distribution: λ is large.

Here are some examples of when the Binomial Distribution and the Poisson Distribution can be
approximated by a Normal Distribution (the first example is checked numerically in the sketch after the list):
• The probability of getting at least 10 heads in 20 flips of a fair coin can be approximated by a Normal
Distribution, since the number of trials (20) is large and the probability of success (0.5) is not too
close to 0 or 1.
• The probability of getting at least 10 cars passing through an intersection in a given hour can be
approximated by a Normal Distribution, since the rate of events (cars passing through the
intersection) is large.
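A quick numerical check of the first example, assuming Python with scipy; the exact binomial probability of at least 10 heads in 20 fair flips is compared with its continuity-corrected normal approximation.

```python
# Sketch: exact binomial probability of "at least 10 heads in 20 fair flips"
# versus the normal approximation with a continuity correction.
import math
from scipy.stats import binom, norm

n, p = 20, 0.5
exact = binom.sf(9, n, p)                      # P(X >= 10) exactly

mu = n * p
sigma = math.sqrt(n * p * (1 - p))
approx = norm.sf(9.5, loc=mu, scale=sigma)     # P(X >= 10) via N(mu, sigma^2)

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```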

Q2: What is chi-square distribution? (2 Marks, Numerical)

What are its properties?

When is it used?

Q3: What is t-distribution?

What are its properties?

When is it used?

Q4: What is f-distribution?

What are its properties?

When is it used?

Q7: Discuss Testing of Hypothesis (notes + Gujarati).

The theory of hypothesis testing was introduced by Jerzy Neyman and Egon Pearson in the first half of the 20th
century. Generalizations have to be drawn about the population parameters based upon the evidence
century. Generalizations have to be drawn about the population parameters based upon the evidence
obtained from the study of the sample.
Hypothesis testing is a statistical tool for testing a hypothesis about the parent population from which the
sample is drawn, and it helps us in decision making in such a scenario.
We use null and alternative hypotheses. The null hypothesis (Ho) suggests there's no significant difference
between a sample and a population, while the alternative hypothesis (Ha) specifies the desired outcome.
The procedure of testing hypotheses follows five sequential steps, which are as:

1. Set up the Hypothesis:


- Ho: No significant difference.
- Ha: Desired outcome.

2. Set up the significance level:


- Use a significance level (e.g., 5% or 1%) to
determine the confidence for accepting/rejecting
Ho.

3. Setting a test criterion:


- Choose a suitable statistical test (e.g., t-test, F-
test, Chi-square).

4. Doing calculations:
- Calculate test statistics and standard error from the sample.

5. Making decisions:
- a. Test Statistic Approach:
- If Calculated Statistic > Table (critical) Value, reject Ho.
- If Calculated Statistic ≤ Table (critical) Value, do not reject Ho.
- b. P-value Approach:
- If P-value > significance level (e.g., 0.05), do not reject Ho.
- If P-value ≤ significance level (e.g., 0.05), reject Ho.

Null Hypothesis (H0):
- Asserts that there is no real difference between the sample and the population; assumes that observed
differences are accidental and unimportant.
- Challenges the idea of a difference and is set up to be refuted (or not) by the experiment's data.
- Examples: for testing whether extra coaching benefits students, H0 is "extra coaching has not benefited
the students"; for testing whether a drug is effective in curing malaria, H0 is "the drug is not effective in
curing malaria."
- Acceptance of H0 implies that the observed differences are due to chance.
- Represented as Ho in statistical notation.

Alternative Hypothesis (H1):
- Specifies the values that the researcher believes to hold true and hopes the sample data will support.
- Aims to be supported by the experiment's data.
- Examples: "extra coaching has benefited the students"; "the drug is effective in curing malaria."
- Rejection of H0 indicates statistically significant differences.
- Represented as H1 or Ha in statistical notation.

Type-I Error: In a statistical hypothesis testing experiment, a Type-I error is committed by rejecting the null
hypothesis when it is true. Type-I error is denoted by ' α'.

Type-II Error: The Type-II error is committed by not rejecting the null hypothesis when it is false. The
probability of committing a Type-II error is denoted by 'β'.

Fisher's F-test is employed to compare two independent estimates of population variance.

It is typically used to test whether two samples come from populations with equal variances, the sample
variances being denoted S1² and S2².
This statistical test is suitable for small sample sizes and is based on the ratio of the sample variances,
S1²/S2², with the larger variance placed in the numerator by convention.
The null hypothesis in this context asserts that the two population variances are equal, i.e. H0: σ1² = σ2².
The degrees of freedom are ν1 = n1 − 1 for the numerator (the sample with the larger variance) and
ν2 = n2 − 1 for the denominator.

It is used in the following cases:

1. The F-test is used to compare the ratio of two variances, S1² and S2².
2. The samples must be independent.
3. The F statistic is never negative, since it is a ratio of variances; with the larger variance placed in the
numerator (S1²/S2²), F ≥ 1.
4. The F-test is used for testing the overall significance of a regression (a numerical sketch follows below).
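A minimal sketch of a variance-ratio F-test, assuming Python with numpy and scipy; the two samples are hypothetical, and the two-tailed p-value is the usual doubling of the upper-tail area.

```python
# Sketch: F-test for equality of two population variances on hypothetical samples.
import numpy as np
from scipy.stats import f

x1 = np.array([23.0, 25.0, 28.0, 30.0, 27.0, 26.0])
x2 = np.array([20.0, 22.0, 21.0, 23.0, 22.0])

s1_sq = np.var(x1, ddof=1)          # sample variance of sample 1
s2_sq = np.var(x2, ddof=1)          # sample variance of sample 2

num, den = max(s1_sq, s2_sq), min(s1_sq, s2_sq)                 # larger variance in numerator
v1 = (len(x1) if s1_sq >= s2_sq else len(x2)) - 1               # numerator df
v2 = (len(x2) if s1_sq >= s2_sq else len(x1)) - 1               # denominator df

F = num / den
p_value = 2 * f.sf(F, v1, v2)       # two-tailed p-value (upper-tail area doubled)
print(f"F = {F:.3f}, df = ({v1}, {v2}), p-value = {p_value:.3f}")
```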

Hypothesis Testing (worked example: a sample of N = 5 observations has mean 23.6 and standard deviation 5;
test whether the population mean is 25 at the 1% significance level):

- Null Hypothesis (Ho): μ = 25


- Alternative Hypothesis (Ha): μ ≠ 25
- Significance Level (α) = 0.01
- Test Statistic = T Test
- Degrees of Freedom (DF) = N - 1 = 4
- Critical t Value = ±4.604 (two-tailed, DF = 4)

Calculate the mean of the sample:


Mean (x̄) = 23.6

Calculate the Standard Error (SE):


SE = Standard Deviation (SD) / √N
SE = 5 / √5
SE ≈ 2.23

Calculate the t statistic:


t = (x̄ - μ) / SE
t = (23.6 - 25) / 2.23
t ≈ -0.62

Since the calculated t-statistic (-0.62) is smaller in absolute value than the critical t-value (4.604), we fail
to reject the null hypothesis.

At a 1% significance level, we do not have enough evidence to reject the null hypothesis.
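The same calculation can be verified with a short script using the stated summary numbers (n = 5, mean = 23.6, SD = 5, hypothesized mean = 25); this is only a check of the arithmetic, assuming Python with scipy.

```python
# Sketch verifying the one-sample t-test above from its summary statistics.
import math
from scipy.stats import t

n, xbar, sd, mu0, alpha = 5, 23.6, 5.0, 25.0, 0.01

se = sd / math.sqrt(n)                      # standard error of the mean
t_stat = (xbar - mu0) / se
t_crit = t.ppf(1 - alpha / 2, df=n - 1)     # two-tailed critical value, df = 4

print(f"t = {t_stat:.2f}, critical value = ±{t_crit:.3f}")
print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
```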

(BARD)
The seller can use the arithmetic mean and standard deviation of the demand for the product in August to
make decisions about the following:
• Production planning: The seller can use the mean demand to forecast future demand and plan
production accordingly. This will help to avoid overproduction and underproduction.
• Inventory management: The seller can use the standard deviation of demand to determine
how much inventory to hold. A higher standard deviation indicates that demand is more volatile, so
the seller will need to hold more inventory to avoid stockouts.
• Pricing: The seller can use the mean and standard deviation of demand to set prices. If the
seller knows that demand is relatively stable, they can set a higher price. If demand is more volatile,
the seller may need to set a lower price to attract customers.
• Marketing: The seller can use the mean and standard deviation of demand to develop
targeted marketing campaigns. For example, if the seller knows that demand is higher in certain
regions or during certain times of the year, they can focus their marketing efforts on those areas
and times.
Example
Suppose a seller knows that the mean demand for a product in August was 100 units and the standard
deviation was 20 units. This means that the seller can expect to sell an average of 100 units in August.
If demand is approximately normally distributed, there is about a 68% chance that demand will be between
80 and 120 units.
The seller can use this information to make decisions about production, inventory, pricing, and marketing.
For example, the seller may decide to produce 110 units per month in August to ensure that they do not
run out of stock. The seller may also decide to offer a discount on the product in August to attract customers
during a month when demand is typically lower.
Conclusion
The arithmetic mean and standard deviation of demand are valuable tools that sellers can use to make
informed decisions about their businesses. By understanding how demand is likely to behave in the future,
sellers can better plan for production, inventory, pricing, and marketing.

UNIT 3
Q1: Explain OLS (no derivation) (Gujarati).
Ordinary Least Squares (OLS) is a statistical method for estimating
the coefficients of a linear regression model. It is the most
common method used in regression analysis, and is also the
foundation for many other statistical methods.

OLS works by minimizing the sum of the squared residuals, which


are the differences between the actual and predicted values of the
dependent variable. This is done by finding the values of the
coefficients that produce the smallest possible sum of squared
residuals.

The OLS estimators are unbiased and efficient under the classical linear regression model, which assumes,
among other things, that the errors have zero mean and constant variance and are uncorrelated with one
another; normality of the errors is needed only for exact hypothesis testing. The OLS estimators are also
consistent, meaning that they converge to the true population parameters as the sample size increases.
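A minimal OLS sketch with hypothetical data, assuming Python with numpy and statsmodels; it fits a two-variable model and prints the residual sum of squares that OLS minimizes.

```python
# Sketch: fit y = b1 + b2*x by OLS and inspect the minimized sum of squared residuals.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])   # hypothetical observations

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

print("coefficients [intercept, slope]:", res.params)
print("sum of squared residuals:", np.sum(res.resid ** 2))
```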

Q2: What do you mean by the Classical Linear Regression Model,


The Classical Linear Regression Model is a statistical model that describes the relationship between a
dependent variable and one or more independent variables. It is a simple and powerful tool that can be
used to explain and predict a wide range of phenomena.

The simple two-variable form of the model is y = β0 + β1x + u, where:
• y is the dependent variable
• x is the independent variable
• β0 is the intercept parameter
• β1 is the slope parameter
• u is the error term

Q3: List the assumptions of classical regression. what are the properties of CLRM? (10 properties in
Gujarati)

Q4: Define the Sample Regression Function.


The Sample Regression Function (SRF) is a statistical concept used to estimate the relationship between
two variables based on a sample of data, when we have only one Y value corresponding to each fixed X
value in the sample. In other words, it is a way to estimate the population regression function (PRF) when
we have limited data.

The SRF is estimated using the ordinary least squares (OLS) method. The OLS method minimizes the sum of
the squared residuals, which are the differences between the actual values of the dependent variable and
the predicted values of the dependent variable based on the SRF.

Q5: What do you mean by the Error/Stochastic/Noise term, and what is its importance in the classical
regression model? (7 Points in Gujarati)
The variable u, called the error term or disturbance in the relationship, represents factors other than x that
affect y. A simple regression analysis effectively treats all factors affecting y other than x as being
unobserved. You can usefully think of u as standing for “unobserved.”
It captures the random and unpredictable variations in the dependent variable that cannot be explained by
the independent variables.
The error term is important because it plays a crucial role in determining the accuracy and reliability of the
regression model. By assuming certain properties of the error term, such as its distribution and
independence, we can make statistical inferences about the regression coefficients and test hypotheses
about the relationship between the dependent and independent variables.

1. Unobserved Factors: Error terms (ui) account for unmeasured factors affecting the dependent variable.
2. Randomness: Error terms are assumed to follow a normal distribution, enabling statistical analysis.
3. Zero Mean: Errors have an average of zero, suggesting a well-specified model.
4. Homoscedasticity: Errors have constant variance across independent variables.
5. Independence: Errors for one observation are unrelated to others, ensuring unique information.
6. Central Limit Theorem: Normality assumption aids hypothesis testing.
7. Estimation and Inference: Errors are vital for OLS parameter estimation, hypothesis testing, and model
evaluation.

Q9: Describe Goodness of Fit (paragraph from Gujarati).


“Goodness of fit” tells us how well the estimated regression line fits the actual Y values. The measure
developed for this purpose is known as the coefficient of determination, denoted by the symbol r-square (r^2).

Q10: Explain Confidence Interval (paragraph) from Gujarati.


In the context of hypothesis testing in a two-variable regression model, a confidence interval provides a
range of values within which the true value of a coefficient or parameter is likely to fall. It is calculated
based on the sample data and is used to make inferences about the population parameter.
Specifically, in the case of a two-variable regression model, a confidence interval can be constructed for the
slope coefficient (B2) of the independent variable. The confidence interval represents a range of values for
B2 within which we can be reasonably confident that the true value of B2 lies.
For example, a 95% confidence interval for B2 means that if we were to repeat the sampling process
multiple times, 95 out of 100 of these intervals would include the true value of B2. If the null hypothesis is
that B2 equals a specific value (e.g., B*2), we can use the confidence interval to determine whether or not
to reject the null hypothesis. If the hypothesized value falls within the confidence interval, we do not reject
the null hypothesis. However, if the hypothesized value lies outside the confidence interval, we reject the
null hypothesis.
In summary, a confidence interval in the context of hypothesis testing in a two-variable regression model
provides a range of values within which the true value of a coefficient is likely to fall, and it is used to make
inferences about the population parameter.
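A sketch of how such an interval can be computed for the slope B2, assuming Python with numpy, scipy and statsmodels; the data are hypothetical, and the manually built interval is compared with the one statsmodels reports.

```python
# Sketch: 95% confidence interval for the slope B2 in a two-variable regression.
import numpy as np
import statsmodels.api as sm
from scipy.stats import t

x = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 24.0, 27.0, 30.0])
y = np.array([7.0, 8.5, 10.0, 12.5, 13.0, 16.0, 18.5, 20.0])   # hypothetical data

res = sm.OLS(y, sm.add_constant(x)).fit()
b2, se_b2 = res.params[1], res.bse[1]
t_crit = t.ppf(0.975, df=len(x) - 2)          # two-variable model: n - 2 degrees of freedom

print(f"95% CI for B2: [{b2 - t_crit * se_b2:.3f}, {b2 + t_crit * se_b2:.3f}]")
print(res.conf_int(alpha=0.05)[1])            # the same interval reported by statsmodels
```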

Q6: Compare Regression vs Correlation

Q11: Discuss Gauss-Markov Theorem (Gujarati) or


Properties of OLS Estimator or
Why OLS are called BLUE Estimators (3 Points).
The method of OLS is used popularly not only because it is easy to use but also because it has some strong
theoretical properties, which are summarized in the well-known Gauss-Markov theorem: given the assumptions
of the classical linear regression model, the OLS estimators, in the class of linear unbiased estimators, have
minimum variance; that is, they are BLUE (Best Linear Unbiased Estimators).

Q13: Explain Time Series Data, Cross Sectional Data, Panel Data, Pool Data, and Seasonal Data (NOTES
(lost) + Gujarati).

Time Series Data:


Time series data consist of observations collected over a sequence of time intervals. These
intervals can be regular (e.g., daily, weekly, monthly) or irregular, and they track changes in a
specific variable or set of variables over time. Examples include stock prices, weather data,
economic indicators like GDP, and even real-time stock quotes. Analyzing time series data often
requires addressing issues related to stationarity, where the mean and variance of the data should
not exhibit systematic changes over time.
Eg. Sales data from 1991-2021.

Cross-Sectional Data:
Cross-sectional data are collected at a single point in time and provide information on one or more
variables for different entities or individuals. These entities can be individuals, households, firms, or
any unit of interest. For example, a cross-sectional dataset might include data on income, education,
and age for a group of people surveyed in a specific year. Analyzing cross-sectional data often
involves addressing heterogeneity, as different entities may exhibit varying characteristics and
behaviours.
Eg. Scooter data from 2005

Pooled Data:
Pooled data refer to a combination of both time series and cross-sectional data. In this type of
dataset, observations are collected from multiple entities at different points in time. An example is
the table provided in the text, which contains data on egg production and prices for 50 states over
two years. Pooled data can provide insights into both cross-sectional variations among entities and
how those entities change over time.
Eg. Data collected from multiple sources over time

Panel Data:
Panel data, also known as longitudinal data or micro panel data, combine aspects of both time series
and cross-sectional data. In panel data, the same entities (e.g., households, firms) are observed or
surveyed over multiple time periods. This type of data is valuable for studying changes within
entities over time. For instance, a panel dataset might include annual income data for a group of
individuals tracked over several years, allowing researchers to examine how individual incomes
evolve over time and the factors influencing those changes.
Eg. Data collected from the same subjects at different points in time (2005 and 2021).

Seasonal Data:
Seasonal data are a specific subset of time series data that exhibit regular patterns or fluctuations
within a year, often associated with seasons or specific time periods. These patterns can include
repeating cycles, such as quarterly variations in retail sales or temperature fluctuations between
summer and winter. Seasonal data analysis typically involves identifying and accounting for these
recurrent patterns to understand underlying trends and make predictions or forecasts.

i. What is Econometrics and its significance for engineers (mid)


ii. Type of data (mid)
iii. Classical Linear Regression Model (CLRM) along with its properties. (mid, Gujrati)

EndSem Compiled Notes


4. Multiple Linear Regression Model
Estimation of parameters; properties of OLS estimators; goodness of fit - R2 and adjusted R2; partial
regression coefficients; testing hypotheses – individual and joint; functional forms of regression models;
qualitative (dummy) independent variables.

5. Violations of Classical Assumptions: Consequences, Detection and Remedies


Multicollinearity; heteroscedasticity; serial correlation.

6. Specification Analysis
Omission of a relevant variable; inclusion of irrelevant variable; tests of specification errors.

UNIT 4
iv. Goodness of Fit (1 para each from Gujrati)
Goodness of fit is a statistical measure that assesses how well a sample of data fits a particular distribution
or model. It is a crucial concept in statistics, as it allows us to evaluate the validity and reliability of our
statistical models. A good goodness of fit indicates that the model accurately represents the underlying data,
while a poor goodness of fit suggests that the model may not be appropriate for the data.

There are many different ways to measure goodness of fit, but some of the most common include the
coefficient of determination (r-squared), the mean squared error (MSE), and the adjusted coefficient of
determination (R-squared). The coefficient of determination (r-squared) is a measure of how much of the
variation in the dependent variable (Y) is explained by the independent variable(s) (X). It is calculated as
the square of the correlation coefficient between Y and the fitted values of Y. R-squared ranges from 0 to 1,
with a higher value indicating a better fit.

The mean squared error (MSE) is a measure of the average squared difference between the fitted values
of Y and the actual values of Y. It is calculated as the sum of the squared residuals, divided by the number
of observations. MSE is measured in the same units as Y, and a lower value indicates a better fit.

The adjusted coefficient of determination (adjusted R-squared) is a modification of r-squared that takes into
account the number of independent variables in the model. It is calculated as
adjusted R² = 1 − (1 − R²)(n − 1)/(n − k), where n is the number of observations and k is the number of
estimated parameters; equivalently, 1 − [RSS/(n − k)] / [TSS/(n − 1)], where RSS is the residual sum of
squares and TSS is the total sum of squares. Like r-squared, it measures how much of the variation in Y is
explained by the X variables, but it adjusts for the fact that adding more X variables will always increase the
unadjusted r-squared, even if the new variables do not provide any additional explanatory power.

There are various methods for measuring goodness of fit, each with its own strengths and weaknesses. Some
common methods include the chi-square test, the Kolmogorov-Smirnov test, and the coefficient of
determination (R-squared). The choice of method depends on the specific distribution or model being
evaluated and the nature of the data.
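A short sketch computing r-squared, adjusted R-squared and MSE for a fitted line, assuming Python with numpy and statsmodels; the data and the choice k = 2 (intercept plus one slope) are illustrative.

```python
# Sketch: goodness-of-fit measures for a fitted two-variable regression.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.7, 12.2, 13.8, 16.1])   # hypothetical data

res = sm.OLS(y, sm.add_constant(x)).fit()
n, k = len(y), 2                              # k = number of estimated parameters

rss = np.sum(res.resid ** 2)                  # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
mse = rss / n                                 # average squared residual

print(f"R^2 = {r2:.4f} (statsmodels: {res.rsquared:.4f})")
print(f"adjusted R^2 = {adj_r2:.4f} (statsmodels: {res.rsquared_adj:.4f})")
print(f"MSE = {mse:.4f}")
```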

iv. Method of Maximum Likelihood (1 para each from Gujrati)

The Method of Maximum Likelihood (ML) is a method of statistical inference that estimates the parameters
of a probability distribution by maximizing the likelihood of the observed data. In the context of multiple linear
regression, the ML method can be used to estimate the regression coefficients and the variance of the error
term.

The ML approach is based on the idea that the best estimates of the parameters are those that make the
observed data as likely as possible. To find these estimates, the ML method involves maximizing the
likelihood function, which is a function of the parameters and the data. The likelihood function is proportional
to the probability of the observed data given the parameters.

In the case of multiple linear regression, the likelihood function is a product of normal density functions, one
for each observation. The mean of each normal density function is equal to the predicted value of the
dependent variable for that observation, and the variance of each normal density function is equal to the
variance of the error term. The ML estimates of the regression coefficients and the variance of the error
term are found by maximizing the likelihood function with respect to the parameters. This can be done using
a variety of methods, such as numerical optimization or gradient descent.

The ML method has several advantages over other methods of estimation, such as least squares. One
advantage is that the ML method is more general and can be applied to a wider range of models. Another
advantage is that the ML method is consistent, meaning that the ML estimates converge to the true
parameter values as the sample size increases.

However, the ML method also has some disadvantages. One disadvantage is that the ML method can be
computationally intensive, especially for models with a large number of parameters. Another disadvantage
is that the ML method can be sensitive to outliers, which can distort the estimates.
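A sketch of ML estimation for the two-variable model with normal errors, assuming Python with numpy, scipy and statsmodels; the data are simulated, and the log-likelihood is maximized numerically so the estimates can be compared with OLS.

```python
# Sketch: maximum likelihood estimation of y = b1 + b2*x + u with normal errors,
# done by numerically minimizing the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=50)    # simulated data

def neg_log_likelihood(params):
    b1, b2, log_sigma = params
    sigma = np.exp(log_sigma)                      # keep sigma positive
    resid = y - (b1 + b2 * x)
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

ml = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
ols = sm.OLS(y, sm.add_constant(x)).fit()

print("ML  estimates:", ml.x[:2], "sigma^2 =", np.exp(ml.x[2]) ** 2)
print("OLS estimates:", ols.params)   # coefficients coincide; the ML sigma^2 divides by n, not n - 2
```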

OLS Regression vs ML Regression:

Assumptions:
- OLS: relies on the classical assumptions that the errors have zero mean, constant variance
(homoscedasticity), and are independent of each other; no particular distribution has to be assumed for
point estimation.
- ML: requires the full distribution of the errors to be specified, typically independent and identically
distributed (i.i.d.) normal errors.

Estimation Technique:
- OLS: minimizes the sum of squared residuals.
- ML: maximizes the likelihood function.

Efficiency:
- OLS: provides unbiased estimates, but not always the most efficient.
- ML: asymptotically efficient (approaching minimum variance as the sample size increases).

Hypothesis Testing:
- OLS: uses t-tests, F-tests, etc., based on error distribution assumptions.
- ML: uses likelihood ratio tests, Wald tests, etc., based on the likelihood function.

vi. Coefficient of determination and Coefficient of correlation

Coefficient of Correlation (r):
- Definition: measures the strength and direction of a linear relationship between two variables (x and y),
with values between -1 and 1.
- Range of values: -1 (perfect negative correlation) to 1 (perfect positive correlation).
- Interpretation: positive correlation (closer to 1) indicates that the two variables rise and fall together;
negative correlation (closer to -1) indicates that the two variables move in opposite directions, one going up
as the other goes down; a value close to 0 indicates either no linear correlation or a weak linear correlation.
- Calculation: N/A

Coefficient of Determination (r2):
- Definition: gives the percentage of the variation in y explained by all the x variables together.
- Range of values: 0 to 1.
- Interpretation: higher values (closer to 1) indicate a stronger linear regression model with less scattered
data points; lower values (closer to 0) indicate a weaker linear regression model with more scattered data
points.
- Calculation: r2 = r x r; alternatively, R2 = 1 - (RSS/TSS), where RSS is the Residual Sum of Squares and
TSS is the Total Sum of Squares.

vi. Significance of error term and what are its consequences (Mid, Gujrati)
=> 7/8 points given

viii. Dummy (Gujrati)


Q12: What do you mean by Dummy Variable, and what is its importance in handling qualitative data?
Dummy variables, also known as indicator variables, categorical variables, qualitative variables, or binary
variables, are variables that take on the value of 0 or 1 to indicate the presence or absence of a particular
characteristic or attribute. In regression analysis, dummy variables are used to represent qualitative
variables that cannot be measured on a numerical scale, such as gender, race, or geographic region. By
converting qualitative variables into binary variables, regression models can account for the effects of
qualitative factors on the dependent variable.
The importance of dummy variables in handling qualitative data lies in their ability to allow qualitative
variables to be included in regression models. By converting qualitative variables into binary variables,
regression models can account for the effects of qualitative factors on the dependent variable. This is
particularly useful in social science research, where many variables of interest are qualitative in nature.
Without dummy variables, it would be difficult to include qualitative variables in regression models and
account for their effects on the dependent variable.

its uses in regional (not required) and seasonal analysis


The use of dummy variables in seasonal analysis is a common technique employed to identify and
remove seasonal patterns from economic time series data. Seasonal patterns refer to regular oscillatory
movements in data, often associated with specific quarters or periods of the year. Examples include
the seasonal spikes in sales of department stores during Christmas, increased demand for ice cream
in the summer, or fluctuations in air travel demand during holiday seasons.

The objective of seasonal analysis is to deseasonalize or seasonally adjust time series data, allowing
analysts to focus on other components such as trends. Seasonal adjustment is particularly important

for economic indicators like the unemployment rate, consumer price index (CPI), producer's price index
(PPI), and industrial production index, which are commonly published in seasonally adjusted form.
One method of deseasonalization involves the use of dummy variables. Dummy variables are binary
variables that take the value of 1 in a specific category and 0 otherwise. In the context of seasonal
analysis, dummy variables are assigned to each quarter of the year to capture the seasonal effects.
However, to avoid the dummy variable trap, an intercept term is omitted from the model.

The dummy variable coefficients are assessed for statistical significance, and a statistically
significant coefficient for a specific quarter indicates a significant seasonal effect during that
period. By including dummy variables for each quarter, the model allows for different intercepts
in each season, effectively capturing the mean sales of refrigerators in each quarter.
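A sketch of such a seasonal-dummy regression, assuming Python with numpy and statsmodels; the quarterly sales figures are made up, one dummy is included per quarter, and the intercept is dropped to avoid the dummy variable trap.

```python
# Sketch: regression on quarterly dummies (no intercept), so each coefficient
# is the mean of the dependent variable in that quarter.
import numpy as np
import statsmodels.api as sm

sales = np.array([120, 135, 128, 180,        # hypothetical sales, Q1..Q4 over 3 years
                  125, 140, 130, 190,
                  130, 145, 133, 200], dtype=float)
quarter = np.tile([1, 2, 3, 4], 3)

D = np.column_stack([(quarter == q).astype(float) for q in (1, 2, 3, 4)])
res = sm.OLS(sales, D).fit()                 # no constant: avoids the dummy variable trap

print("estimated mean sales by quarter:", res.params)
print("p-values:", res.pvalues)              # significance of each seasonal effect
```

Dropping the intercept while keeping all four dummies is one of the two standard ways to avoid the trap; the other is to keep the intercept and include only three dummies.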

Alternative To Chow Test


The Dummy Variable Alternative to the Chow Test is presented as a method for exploring the sources of
differences identified by the Chow test in regression models. While the Chow test can indicate structural
changes, it does not specify whether variations arise from differences in intercept terms, slope
coefficients, or both. The Dummy Variable Alternative introduces four possibilities: coincident
regressions, parallel regressions, concurrent regressions, and dissimilar regressions. To identify the
source of differences, a multistep procedure involves pooling observations and running a single multiple
regression with a dummy variable representing different periods. This approach allows for a more detailed
analysis of structural changes by distinguishing variations in intercepts and slope coefficients across
different time periods.
The Chow test is used to examine the structural stability of a regression model. Suppose data from 1974-75
to 1995-96 are used to find out the relationship between income and saving. The sample is divided into two
periods, 1974-75 to 1988-89 and 1989-90 to 1995-96, and on the basis of the Chow test it is shown that there
was a difference in the regression of saving on income between the two periods. However, the test cannot
tell whether the difference in the two regressions is because of a difference in the intercept terms, the slope
coefficients, or both. Very often this knowledge itself is very useful. There are four possibilities:

Caution And Its Uses



UNIT 5

v. Heteroscedasticity: (Gujrati)
Nature:

Heteroscedasticity refers to a violation of the


homoscedasticity assumption in classical linear regression
models, where the variance of the disturbance term (ui) is
expected to be constant across different values of the
explanatory variables. In homoscedasticity, the conditional
variance of the dependent variable (Yi) remains the same
regardless of the values taken by the independent variable
(Xi). However, in cases of heteroscedasticity, the conditional
variance of Yi is not constant and may vary with the values of
Xi.

This variability in conditional variance can be symbolically represented as E(ui²) = σi², indicating that the
assumption of equal spread is not met. The difference between homoscedasticity and heteroscedasticity is
illustrated using the example of a two-variable regression model where savings (Y) is regressed on
income (X). In a homoscedastic scenario, the variance of savings remains constant at all levels of
income, while in a heteroscedastic scenario, the variance of savings increases with income.

Several factors contribute to the presence of heteroscedasticity, including changes over time (as in
error-learning models where errors decrease with learning), increased discretionary income leading
to more varied choices, improvements in data collection techniques reducing errors, the presence of
outliers, specification errors in the regression model, and skewness in the distribution of regressors.
Additionally, heteroscedasticity may arise from incorrect data transformations or functional form
specifications.

It is noted that heteroscedasticity is more likely in cross-sectional data, where members of a population
differ in size or characteristics, compared to time series data where variables tend to be of similar
orders of magnitude over a period. Addressing heteroscedasticity is crucial for maintaining the validity
of classical linear regression model assumptions and obtaining accurate regression results.
Thus, in statistics, heteroscedasticity (or heteroskedasticity) occurs when the standard deviations of a predicted
variable, observed over different values of an independent variable or over successive time periods, are non-constant.

Consequence:
1. OLS Estimation Allowing for Heteroscedasticity
When using Ordinary Least Squares (OLS) estimation while acknowledging the presence of
heteroscedasticity, several consequences arise:
Confidence Intervals and Hypothesis Testing
• Unnecessarily Larger Confidence Intervals: Confidence intervals based on OLS may be
larger than necessary because the variance of the OLS estimator, denoted β̂2, is larger
than the variance of the efficient estimator, denoted β̂*2.
• Inaccurate Testing: The conventional t and F tests may provide inaccurate results. The larger
variance of β̂2 can lead to the misinterpretation of statistical significance, potentially
classifying a coefficient as insignificant when it might be significant with the correct confidence
intervals.
2. OLS Estimation Disregarding Heteroscedasticity
In scenarios where OLS estimation is performed without accounting for heteroscedasticity, more
serious issues arise:
Biased Estimation

• Bias in Variance Estimation: The standard variance formula for OLS (homoscedastic)
estimation is biased when heteroscedasticity is present. The bias arises because the
conventional estimator of the error variance, σ̂², is no longer unbiased when
heteroscedasticity exists.
• Misleading Inferences: Continuing to use standard testing procedures without addressing
heteroscedasticity can lead to highly misleading inferences. The bias in variance estimation
affects the reliability of confidence intervals, t-tests, and F-tests, making conclusions drawn
from these methods unreliable.
Empirical Evidence from a Monte Carlo Study
• Overestimation by OLS: Empirical evidence from a Monte Carlo study shows that OLS
consistently overestimates standard errors, especially for larger values of α (the power
parameter determining the relationship between error variance and the explanatory variable
X).
• Superiority of GLS: The results emphasize the superiority of Generalized Least Squares
(GLS) over OLS, particularly in the presence of heteroscedasticity. GLS provides more
accurate standard errors compared to OLS.
In summary, ignoring heteroscedasticity in OLS estimation can lead to biased results, inflated standard
errors, and inaccurate statistical inferences. Addressing heteroscedasticity through methods like GLS
becomes crucial for obtaining reliable estimates and valid hypothesis tests.

Detection:
Formal Methods for Detecting Heteroscedasticity:
1. Park Test:
• Procedure: Regress the log of the squared residuals on the log of the explanatory
variable: ln ûi² = α + β ln Xi + vi.
• Interpretation: A statistically significant slope coefficient β implies the presence of
heteroscedasticity, indicating that the variance of the residuals varies systematically
with Xi.

2. Goldfeld–Quandt Test:
• Procedure: Divide the dataset based on a chosen variable and compare the variances
of residuals in the subsets.
• Interpretation: A significant difference in variances between subsets suggests the
existence of heteroscedasticity.
3. Breusch–Pagan–Godfrey Test:
• Procedure: Examine whether the error variance relates to specific explanatory
variables.
• Interpretation: Statistically significant coefficients of these variables indicate
heteroscedasticity, implying that certain factors contribute to the varying error
variances (a code sketch of this test appears at the end of the Detection discussion below).
4. White’s General Heteroscedasticity Test:
• Procedure: Conduct a regression of squared residuals on the original regressors, their
squares, and cross-products.
• Interpretation: A significant model suggests that these variables explain
heteroscedasticity. Additionally, the test can identify specification errors in the model.
5. Koenker–Bassett (KB) Test:
• Procedure: Regress squared residuals on the squared estimated values of the
dependent variable.
• Interpretation: A statistically significant coefficient indicates the presence of
heteroscedasticity. Notably, this test remains applicable even without assuming
normality in the error terms.
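A minimal sketch of how several of these tests can be run with statsmodels (assumed to be available); the data are simulated purely for illustration, and the Park and Koenker–Bassett regressions can be run as ordinary auxiliary OLS regressions in the same way.

```python
# Minimal sketch of formal heteroscedasticity tests with statsmodels
# (data simulated for illustration; variable names are not from the notes).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white, het_goldfeldquandt

rng = np.random.default_rng(1)
X = np.linspace(1, 10, 80)
y = 2.0 + 0.8 * X + rng.normal(0.0, 1.0, 80) * X   # error spread rises with X
exog = sm.add_constant(X)
res = sm.OLS(y, exog).fit()

lm, lm_p, f, f_p = het_breuschpagan(res.resid, exog)       # Breusch-Pagan-Godfrey
print("Breusch-Pagan LM p-value :", lm_p)

w_lm, w_lm_p, w_f, w_f_p = het_white(res.resid, exog)      # White's general test
print("White test LM p-value    :", w_lm_p)

gq_f, gq_p, _ = het_goldfeldquandt(y, exog)                # Goldfeld-Quandt
print("Goldfeld-Quandt p-value  :", gq_p)
# Small p-values reject the null of homoscedasticity.
```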

Informal Methods for Detecting Heteroscedasticity:


1. Graphical Inspection (see the plotting sketch below):
• Procedure: Plot the residuals (or squared residuals) against the fitted values of the dependent variable or against an explanatory variable.
• Interpretation: A systematic pattern, such as a funnel that widens or narrows, suggests the presence of heteroscedasticity; a random scatter suggests homoscedasticity.
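A minimal plotting sketch, assuming matplotlib and statsmodels are available; the data are simulated so that the residual spread grows with the fitted values.

```python
# Minimal residuals-vs-fitted plot (illustrative data); a widening "funnel"
# is the visual signature of heteroscedasticity.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = np.linspace(1, 10, 80)
y = 2.0 + 0.8 * X + rng.normal(0.0, 1.0, 80) * X
res = sm.OLS(y, sm.add_constant(X)).fit()

plt.scatter(res.fittedvalues, res.resid)
plt.axhline(0.0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```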

Remedies:
There are two broad approaches to remedying heteroscedasticity: one for when the error variance σi² is known, and one for when it is not (a short code sketch of both appears after this section).
1. When σi² Is Known: The Method of Weighted Least Squares (WLS)
• If σi² is known, the most straightforward remedy is Weighted Least Squares (WLS): divide the regression equation through by the standard deviation σi (equivalently, weight each observation by 1/σi²) so that the transformed error term has constant variance.
• The WLS estimators obtained this way are Best Linear Unbiased Estimators (BLUE), so they remain efficient in the presence of heteroscedasticity.
2. When σi² Is Not Known: White's Heteroscedasticity-Consistent Variances and Standard Errors
• Since the true σi² is rarely known, White's heteroscedasticity-consistent variances and standard errors can be used to obtain asymptotically valid statistical inferences about the true parameter values.
• White's procedure is available in most regression packages and is known for providing robust standard errors.
Additional Methods for Heteroscedasticity Remediation:
• Weighted Least Squares (WLS) with Unknown σi²: Apply the WLS approach even when σi² is not known, by estimating it (or the pattern it follows) from the data.
• Ad Hoc Transformations (based on a plausible assumption about the error variance):
• Assumption 1: If the error variance is proportional to Xi², divide the model through by Xi.
• Assumption 2: If the error variance is proportional to Xi, divide the model through by √Xi.
• Assumption 3: If the error variance is proportional to the square of the mean value of Y, [E(Yi)]², divide the model through by the fitted value Ŷi.
• Assumption 4: A log transformation of the model (regressing ln Y on ln X) often reduces heteroscedasticity, because it compresses the scale of the variables.
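A minimal sketch of the two remedies above, assuming statsmodels is available; the weights used for WLS reflect the illustrative assumption that σi² is proportional to Xi², and the robust fit uses White's HC1 correction.

```python
# Minimal sketch of the two remedies (illustrative data and weights).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = np.linspace(1, 10, 80)
exog = sm.add_constant(X)
y = 2.0 + 0.8 * X + rng.normal(0.0, 1.0, 80) * X      # sd(u_i) proportional to X_i

# Remedy 1: WLS under the assumption sigma_i^2 proportional to X_i^2,
# so each observation gets weight 1 / X_i^2 (i.e., 1 / sigma_i^2).
wls_res = sm.WLS(y, exog, weights=1.0 / X**2).fit()
print("WLS slope, SE        :", wls_res.params[1], wls_res.bse[1])

# Remedy 2: keep the OLS point estimates but report White's
# heteroscedasticity-consistent (HC1) standard errors.
ols_robust = sm.OLS(y, exog).fit(cov_type="HC1")
print("OLS slope, robust SE :", ols_robust.params[1], ols_robust.bse[1])
```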
ix. Multicollinearity (Gujarati)
i. Multicollinearity means the presence of high (or exact) correlation between two or more explanatory variables in a multiple regression model.
ii. The classical linear regression model assumes that there is no perfect multicollinearity among the regressors.

Nature Of Multicollinearity:
• Perfect multicollinearity: one regressor is an exact linear function of the others (for example, X3 = 2X2). The OLS estimates cannot then be computed, because the separate influences of the regressors cannot be distinguished.
• Imperfect (near) multicollinearity: the regressors are highly, but not exactly, correlated. OLS estimates can still be obtained, but they are imprecise.

Causes:
There are several sources of multicollinearity. As Montgomery and Peck note, multicollinearity may be due to the following factors:

1. The data collection method employed, for example, sampling over a limited range of the
values taken by the regressors in the population.

2. Constraints on the model or in the population being sampled. For example, in the regression
of electricity consumption on income (X2) and house size (X3) there is a physical constraint in
the population in that families with higher incomes generally have larger homes than families
with lower incomes.

3. Model specification, for example, adding polynomial terms to a regression model, especially
when the range of the X variable is small.

4. An overdetermined model. This happens when the model has more explanatory variables
than the number of observations. This could happen in medical research where there may be
a small number of patients about whom information is collected on a large number of variables.

An additional reason for multicollinearity, especially in time series data, may be that the
regressors included in the model share a common trend, that is, they all increase or decrease
over time. Thus, in the regression of consumption expenditure on income, wealth, and
population, the regressors income, wealth, and population may all be growing over time at more
or less the same rate, leading to collinearity among these variables.

Consequences:
Even with high (but imperfect) multicollinearity the OLS estimators remain BLUE, but in practice:
• The variances and covariances of the OLS estimators are large, so estimation is imprecise.
• Confidence intervals tend to be wide and t-ratios statistically insignificant.
• R² can be very high even though few, if any, individual coefficients are significant.
• The estimates and their standard errors are very sensitive to small changes in the data.

Detection:
• A high R² combined with few statistically significant t-ratios is the classic symptom.
• High pairwise correlations among the regressors (say, above about 0.8) are suggestive, though neither necessary nor sufficient.
• Auxiliary regressions: regress each explanatory variable on the remaining ones and examine the resulting R² values.
• Variance Inflation Factor: VIFj = 1 / (1 − Rj²); values above roughly 10 are commonly read as indicating serious collinearity (a short code sketch follows).
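A minimal sketch, assuming statsmodels is available, that computes the VIF for two deliberately collinear regressors; the variable names and data are purely illustrative.

```python
# Minimal VIF sketch (illustrative data): wealth is constructed to be
# highly collinear with income.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 100
income = rng.normal(50.0, 10.0, n)
wealth = 5.0 * income + rng.normal(0.0, 5.0, n)     # nearly a multiple of income
exog = sm.add_constant(np.column_stack([income, wealth]))

for i, name in zip([1, 2], ["income", "wealth"]):
    print(name, "VIF =", round(variance_inflation_factor(exog, i), 1))
# VIF values far above 10 here flag serious multicollinearity.
```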

Remedies:
• Do nothing if the purpose is purely prediction, since forecasts may still be reliable.
• Drop one of the collinear variables (at the risk of specification bias).
• Transform the variables, for example using first differences or ratios.
• Obtain additional or new data, which can lessen the severity of collinearity.
• Combine cross-sectional and time-series data, or use techniques such as ridge regression or principal-components regression.

xi. Serial Correlation

Serial correlation, or autocorrelation, is the relationship between a variable and its lagged
version over various time intervals. It assesses how a variable's current value is influenced by
its past values.

Causes of Serial Correlation


Serial correlation arises when the regression errors are correlated across observations, i.e., they are not independent. Common causes include model misspecification (omitted variables or an incorrect functional form), inertia or persistence in economic time series, and data manipulation such as interpolation or smoothing; it is most often observed in time-series data such as stock prices.

Types of Serial Correlation


• Positive Serial Correlation:
• Positive errors increase the likelihood of subsequent positive errors.
• Negative errors increase the likelihood of subsequent negative errors.
• Negative Serial Correlation:
• Positive errors increase the likelihood of subsequent negative errors.
• Negative errors increase the likelihood of subsequent positive errors.

Effects of Serial Correlation on Regression Analysis


While serial correlation doesn't cause bias in regression coefficients, it impacts hypothesis tests
and confidence intervals.
• Positive Serial Correlation:
• Inflates the F-statistic, leading to increased Type I errors.
• Underestimates standard errors, making t-statistics seem more significant.
• Negative Serial Correlation:
• Deflates the F-statistic, increasing Type II errors.
• Overestimates standard errors, potentially missing true significance.

Testing for Serial Correlation


Two common methods for testing serial correlation include plotting residuals against time and
the Durbin-Watson test.
• Durbin-Watson Test:
• Statistic (DW) approximated by DW = 2(1 - r), where r is the sample correlation
between residuals from one period and the previous period.
• Values: DW ≈ 2 indicates no autocorrelation; DW < 2 suggests positive autocorrelation; DW > 2 suggests negative autocorrelation.
• The decision is made by comparing DW with the lower and upper critical bounds (dL and dU): reject the null of no autocorrelation, do not reject it, or treat the test as inconclusive.

Example: The Durbin-Watson Test


Given a DW statistic of 0.654 for a regression with two independent variables, compare it with the critical values to test for significance (a short code sketch follows).
• Result: Significant positive autocorrelation, since DW = 0.654 < 0.95 (the relevant lower critical value).
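A minimal sketch, assuming statsmodels is available; it simulates AR(1) errors with ρ = 0.7 (an illustrative choice), fits OLS, and computes the Durbin–Watson statistic, which should fall well below 2.

```python
# Minimal Durbin-Watson sketch: AR(1) errors with rho = 0.7 (illustrative),
# so the DW statistic should fall well below 2.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 100
x = np.arange(n, dtype=float)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()            # positively autocorrelated errors
y = 1.0 + 0.05 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(res.resid)
print("DW statistic :", round(dw, 3))
print("implied r    :", round(1.0 - dw / 2.0, 3))   # from DW ~ 2(1 - r)
```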

Correcting Autocorrelation
Several methods exist for correcting serial correlation (a Newey–West code sketch follows this list):
• Hansen Method:
• Estimates the degree of autocorrelation, adjusting standard errors accordingly.
• Newey-West Estimator:
• Creates a weighting matrix to adjust for autocorrelation, widely used in
econometrics and finance.
• Modifying Regression Equation:
• Adds lag terms to account for correlations between the dependent variable and
error terms.
• Other Correction Methods:
• Use of instrumental variables and panel data methods.
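A minimal sketch of the Newey–West (HAC) correction, assuming statsmodels is available; the lag length maxlags = 4 is an illustrative choice rather than a rule, and the data are simulated with AR(1) errors.

```python
# Minimal Newey-West (HAC) sketch; the lag length maxlags=4 is illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
x = np.arange(n, dtype=float)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()            # serially correlated errors
y = 1.0 + 0.05 * x + u
exog = sm.add_constant(x)

plain = sm.OLS(y, exog).fit()                                        # conventional SEs
hac = sm.OLS(y, exog).fit(cov_type="HAC", cov_kwds={"maxlags": 4})   # Newey-West SEs
print("conventional SE of slope:", round(plain.bse[1], 4))
print("Newey-West (HAC) SE     :", round(hac.bse[1], 4))
```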

UNIT 6
x. How to select independent variables for a given dependent variable in research (Seema ma'am's notes).
xii. Criteria for the selection of independent variables for a given dependent variable.

###############NUMERICAL####################
Chapters from SP Gupta -> 4, 5, 6, (14), 15, 16

i. AM, GM, HM, Median, Mode, Variance, and Standard Deviation. => SP Gupta

ii. Calculation of Correlation

iii. Regression for forecasting => not required


iv. Chi-square test, t-test and F-test => mid (a short scipy sketch follows)
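A minimal sketch, assuming SciPy is available, showing how the chi-square, t-, and F-tests are run in practice; all numbers are illustrative and not taken from SP Gupta.

```python
# Minimal SciPy sketch of the three tests; all numbers are illustrative.
from scipy import stats

# Chi-square goodness-of-fit: observed vs expected frequencies (totals match).
chi2, chi2_p = stats.chisquare(f_obs=[18, 22, 30, 30], f_exp=[25, 25, 25, 25])
print("chi-square:", round(chi2, 3), " p-value:", round(chi2_p, 3))

# Two-sample t-test for the equality of two means.
t_stat, t_p = stats.ttest_ind([12, 15, 14, 10, 13], [16, 18, 17, 19, 15])
print("t-statistic:", round(t_stat, 3), " p-value:", round(t_p, 3))

# F-test via one-way ANOVA for the equality of several group means.
f_stat, f_p = stats.f_oneway([12, 15, 14, 10], [16, 18, 17, 19], [20, 22, 19, 23])
print("F-statistic:", round(f_stat, 3), " p-value:", round(f_p, 3))
```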
