Linear Regression Assignment Questions and Answers

SUBJECTIVE QUESTIONS


Assignment-based Subjective Questions

1. From your analysis of the categorical variables from the dataset, what could you
infer about their effect on the dependent variable?

Answer: I analyzed the categorical columns using boxplots; the key insights from the
visualizations are listed below (a sketch of the plotting approach follows the list):

❖ The fall season has the highest number of bookings, and in each season, the booking
count increased significantly from 2018 to 2019.
❖ Most bookings occurred during May, June, July, August, September, and October.
The trend increased from the beginning of the year until mid-year and then started
decreasing towards the end of the year.
❖ Clear weather attracted more bookings, which is expected.
❖ Thursdays, Fridays, Saturdays, and Sundays have more bookings compared to the
beginning of the week.
❖ On holidays, bookings are fewer, likely because people prefer to spend time at home
with their families.
❖ Bookings are almost equal on working days and non-working days.
❖ The number of bookings in 2019 was higher than in 2018, indicating good progress in
terms of business.
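
The plotting code itself is not part of this answer, but a minimal sketch of the boxplot
analysis might look like the following (the file name day.csv and the column names are
assumptions based on the standard bike-sharing dataset):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumed: day.csv is the bike-sharing data with the usual column names.
bikes = pd.read_csv("day.csv")

categorical_cols = ["season", "mnth", "weathersit", "weekday",
                    "holiday", "workingday", "yr"]

plt.figure(figsize=(16, 10))
for i, col in enumerate(categorical_cols, start=1):
    plt.subplot(3, 3, i)
    sns.boxplot(x=col, y="cnt", data=bikes)  # bookings per category level
plt.tight_layout()
plt.show()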
2. Why is it important to use drop_first=True during dummy variable creation?
Answer:
Using drop_first=True is important because it drops the redundant first dummy
column, which removes the perfect multicollinearity (the "dummy variable trap")
that would otherwise exist among the dummy variables.

In pandas, the parameter is documented as ‘drop_first : bool, default False’, and it
specifies whether to create k-1 dummies out of k categorical levels by removing the
first level.

For instance, if we have a categorical column with three levels A, B, and C and we
create dummy variables for it, we do not need a dummy variable for the third level:
if an observation is not A and not B, it is implicitly C, so a third dummy column
would be redundant.
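
A minimal illustration with pandas (the column name and levels are made up):

import pandas as pd

df = pd.DataFrame({"season": ["spring", "summer", "winter", "summer"]})

# Without drop_first: k = 3 levels produce 3 dummy columns.
print(pd.get_dummies(df["season"]))

# With drop_first=True: only k - 1 = 2 dummies are created; 'spring' becomes
# the implicit baseline (a row with both dummies equal to 0).
print(pd.get_dummies(df["season"], drop_first=True))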

3. Looking at the pair-plot among the numerical variables, which one has the highest
correlation with the target variable?
Answer:
The ‘temp’ variable has the highest correlation with the target variable.
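
A sketch of how this can be verified (reusing the assumed bikes DataFrame from the
earlier sketch; the numeric column names are assumptions):

import pandas as pd
import seaborn as sns

bikes = pd.read_csv("day.csv")  # assumed file, as in the earlier sketch
numeric_cols = ["temp", "atemp", "hum", "windspeed", "cnt"]

sns.pairplot(bikes[numeric_cols])

# The correlation matrix confirms what the pair-plot shows visually.
print(bikes[numeric_cols].corr()["cnt"].sort_values(ascending=False))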

4. How did you validate the assumptions of Linear Regression after building the model
on the training set?
Answer:
I validated the linear regression model against the following five assumptions
(a sketch of the corresponding checks follows the list):
▪ Normality of Error Terms: The error terms should be normally distributed.
▪ Multicollinearity Check: There should be insignificant multicollinearity
among variables.
▪ Linear Relationship Validation: Linearity should be evident among
variables.
▪ Homoscedasticity: There should be no visible pattern in the residual values.
▪ Independence of Residuals: There should be no autocorrelation.
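
The validation code is not included in the original answer; a minimal sketch of these
checks (on made-up data standing in for the real training set) might look like this:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up training data in place of the real train set.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(100, 3)),
                       columns=["temp", "hum", "windspeed"])
y_train = 2 * X_train["temp"] + rng.normal(size=100)

model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
residuals = model.resid

# 1. Normality of error terms: histogram should look roughly bell-shaped.
sns.histplot(residuals, kde=True)
plt.show()

# 2. Multicollinearity: VIF per predictor should be low (commonly < 5).
X_const = sm.add_constant(X_train)
for i, col in enumerate(X_const.columns[1:], start=1):
    print(col, variance_inflation_factor(X_const.values, i))

# 3 & 4. Linearity and homoscedasticity: residuals vs. fitted values should
# show no visible pattern or funnel shape.
plt.scatter(model.fittedvalues, residuals)
plt.axhline(0, color="red")
plt.show()

# 5. Independence of residuals: Durbin-Watson near 2 suggests no autocorrelation.
print(sm.stats.durbin_watson(residuals))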

5. Based on the final model, which are the top 3 features contributing significantly
towards explaining the demand of the shared bikes?
Answer:
The top three features significantly contributing to explaining the demand for
shared bikes are:
▪ Temperature (temp)
▪ Winter season (winter)
▪ September (september)
General Subjective Questions
1. Explain the linear regression algorithm in detail.
Answer:
Linear regression is a statistical model that analyses the linear relationship
between a dependent variable and a given set of independent variables. A linear
relationship implies that a change in an independent variable produces a constant,
straight-line change in the dependent variable.

Key Components

o Dependent Variable (Y): The variable being predicted.
o Independent Variable (X): The variable used to make predictions.
o Slope (m): Represents the effect of X on Y.
o Intercept (c): The constant value of Y when X is zero.

The relationship is represented mathematically by the equation:

Y = mX + c

Types of Linear Relationships

o Positive Linear Relationship: Both the independent and dependent variables
increase together.
o Negative Linear Relationship: As the independent variable increases, the
dependent variable decreases.
Types of Linear Regression
o Simple Linear Regression: Involves one independent variable.
o Multiple Linear Regression: Involves multiple independent variables.
Assumptions of Linear Regression
a. Multicollinearity: Assumes little or no multicollinearity, meaning
independent variables should not be highly correlated with each other.
b. Autocorrelation: Assumes little or no autocorrelation, meaning residual
errors should not be dependent on each other.
c. Linear Relationship: Assumes a linear relationship between response
and feature variables.
d. Normality of Error Terms: Error terms should be normally distributed.
e. Homoscedasticity: There should be no visible pattern in the residual
values.
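
As a quick, made-up illustration of fitting Y = mX + c with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 3x + 5 plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 5 + rng.normal(scale=2, size=50)

model = LinearRegression().fit(X, y)
print("slope (m):", model.coef_[0])        # close to 3
print("intercept (c):", model.intercept_)  # close to 5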

2. Explain Anscombe’s quartet in detail.

Answer:
Anscombe's quartet is a collection of four datasets that have nearly identical simple
descriptive statistics but look very different when graphed. The datasets were
constructed by the statistician Francis Anscombe in 1973 to illustrate the importance
of graphing data before analyzing it, and to show how strongly outliers and the
distribution of the data can affect statistical measures. All four datasets share
(to two or three decimal places) the same mean of x, mean of y, variance, correlation
coefficient (about 0.816), and fitted regression line (approximately y = 3.00 + 0.50x),
yet their scatter plots reveal four very different structures: one roughly linear, one
clearly curved, one linear except for a single outlier, and one driven entirely by a
single high-leverage point.
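
As a quick illustration, seaborn ships the quartet as a built-in example dataset:

import seaborn as sns

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y

# Nearly identical summary statistics for all four datasets...
print(anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))
print(anscombe.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))

# ...but plotting reveals four very different shapes.
sns.lmplot(x="x", y="y", col="dataset", data=anscombe, col_wrap=2, ci=None)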

3. What is Pearson’s R?
Answer:
Pearson's R, also known as Pearson's correlation coefficient, is a statistical
measure that quantifies the strength and direction of a linear relationship
between two variables. It is denoted by r and ranges from -1 to 1.

Key Features of Pearson’s R

Range:
o r = 1: Perfect positive linear correlation.
o r = -1: Perfect negative linear correlation.
o r = 0: No linear correlation.
Interpretation:
o Positive Values: Indicates a positive relationship where, as one variable
increases, the other variable also increases.
o Negative Values: Indicates a negative relationship where, as one variable
increases, the other variable decreases.
o Magnitude: The closer the value of r is to 1 or -1, the stronger the linear
relationship between the two variables.
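
A minimal illustration with scipy (the data is made up):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # strong positive linear relationship

r, p_value = pearsonr(x, y)
print(f"Pearson's r = {r:.3f}")  # close to +1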

4. What is scaling? Why is scaling performed? What is the difference between
normalized scaling and standardized scaling?
Answer:
Scaling is a strategy for bringing the independent features in a dataset onto a
comparable range. Raw features often vary greatly in magnitude, value range, and
units, and handling this is part of data pre-processing. Rescaling the variables to a
comparable scale is crucial: if the scales are not similar, some coefficients may come
out much larger or smaller than others when fitting a regression model, making them
hard to compare.
▪ Normalized Scaling translates data to a scale of 0 to 1. However, it is
sensitive to outliers: a few extreme values can compress the remaining
data into a narrow band. This method is typically employed when features
vary in magnitude and the data distribution is unknown. Min-Max scaling
is the typical approach for such cases, using the formula below:

X_scaled = (X - X_min) / (X_max - X_min)

where:

• X is the original value.
• X_min is the minimum value of the feature.
• X_max is the maximum value of the feature.
• X_scaled is the scaled value in the range [0, 1].

▪ Standardized scaling translates data to a standard normal distribution
(mean 0, standard deviation 1) using the formula below. This method is
typically utilized when the feature distribution is normal/Gaussian, there
are no extreme outliers, and the transformed data does not need to lie
within specified boundaries.

X_standardized = (X - μ) / σ

where:

• X is the original value.
• μ is the mean of the feature.
• σ is the standard deviation of the feature.
• X_standardized is the standardized value.
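
A minimal sketch using scikit-learn's implementations of both approaches (the feature
values are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])  # a single made-up feature

# Normalized (Min-Max) scaling: every value lands in [0, 1].
print(MinMaxScaler().fit_transform(X).ravel())

# Standardized scaling: the result has mean 0 and standard deviation 1.
print(StandardScaler().fit_transform(X).ravel())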

5. You might have observed that sometimes the value of VIF is infinite. Why does this
happen?
Answer:
The Variance Inflation Factor (VIF) measures the collinearity between predictor
variables in a multiple regression. For each predictor, VIF equals 1/(1 - R²), where
R² comes from regressing that predictor on all of the other predictors; equivalently,
it is the factor by which the variance of that predictor's coefficient is inflated
relative to fitting it alone. A VIF of infinity indicates a perfect linear relationship
between a predictor and the other independent variables: the auxiliary regression then
has an R² of 1, so 1/(1 - R²) is infinite. In other words, an infinite VIF value
indicates that the corresponding variable can be expressed exactly as a linear
combination of the other variables. To address this, one of the variables causing the
perfect multicollinearity needs to be removed from the dataset.
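
A minimal sketch of how an infinite VIF shows up in practice, using made-up data in
which one column is an exact linear combination of two others:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["x3"] = 2 * df["x1"] + 3 * df["x2"]  # perfect linear combination

X = sm.add_constant(df)
for i, col in enumerate(X.columns[1:], start=1):  # skip the constant
    print(col, variance_inflation_factor(X.values, i))
# Each auxiliary regression has R-squared = 1, so 1/(1 - R²) prints as inf
# (or an astronomically large number, depending on floating-point rounding).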

6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear
regression.
Answer:
A Q-Q plot, or Quantile-Quantile plot, is a graphical tool used to compare
the distribution of a dataset to a theoretical distribution, often the normal
distribution. The plot displays the quantiles of the sample data against the
quantiles of the theoretical distribution. If the data follows the theoretical
distribution, the points on the Q-Q plot will approximately lie on a straight line.

Use and Importance of a Q-Q Plot in Linear Regression:

Assessing Normality of Residuals:
▪ In linear regression, one key assumption is that the residuals (the
differences between observed and predicted values) are normally
distributed. A Q-Q plot can help check this assumption. If the
residuals are normally distributed, the points on the Q-Q plot will
fall along a straight line.
Detecting Deviations from Normality:
▪ Deviations from the straight line in a Q-Q plot can indicate
departures from normality. For example, systematic deviations,
such as an S-shaped curve, may indicate skewness, while
deviations at the ends of the plot (tails) can indicate the presence
of outliers or heavy tails.
Identifying Potential Problems:
▪ By visualizing the distribution of residuals, a Q-Q plot helps
identify potential problems with the regression model, such as
non-linearity, heteroscedasticity (non-constant variance), or the
presence of outliers. These issues can affect the validity of the
model's predictions and the reliability of statistical tests.
Model Diagnostics and Improvement:
▪ Analysing the Q-Q plot can guide model diagnostics and improvements.
If the residuals are not normally distributed, it may suggest the need
for transforming variables, adding polynomial terms, or using a
different modeling approach.

In summary, a Q-Q plot is a valuable diagnostic tool in linear regression that
helps assess the normality of residuals, detect deviations from normality, identify
potential problems, and guide improvements to the regression model.
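
A minimal sketch of drawing a Q-Q plot of residuals with statsmodels (the residuals
here are made up; in practice one would pass model.resid from a fitted model):

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
residuals = rng.normal(size=200)  # made-up residuals

# Points hugging the 45-degree reference line indicate approximately
# normally distributed residuals.
sm.qqplot(residuals, line="45")
plt.show()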
