Linear Regression Assignment Questions and Answers
1. From your analysis of the categorical variables from the dataset, what could you
infer about their effect on the dependent variable?
Answer: I have analyzed the categorical columns using boxplots, and here are the
key insights from the visualization:
❖ The fall season has the highest number of bookings, and in each season, the booking
count increased significantly from 2018 to 2019.
❖ Most bookings occurred during May, June, July, August, September, and October.
The trend increased from the beginning of the year until mid-year and then started
decreasing towards the end of the year.
❖ Clear weather attracted more bookings, which is expected.
❖ Thursdays, Fridays, Saturdays, and Sundays have more bookings compared to the
beginning of the week.
❖ On holidays, bookings are fewer, likely because people prefer to spend time at home
with their families.
❖ Bookings are almost equal on working days and non-working days.
❖ The number of bookings in 2019 was higher than in 2018, indicating year-over-year
growth in the business.
2. Why is it important to use drop_first=True during dummy variable creation?
Answer:
Using drop_first=True drops one redundant column during dummy variable creation.
With all k dummies present, the columns always sum to 1 and are therefore perfectly
collinear with the model's intercept (the dummy variable trap). Dropping one column
removes this redundancy.
In pandas.get_dummies the parameter is documented as ‘drop_first: bool, default
False’, which specifies whether to create ‘k-1’ dummies out of ‘k’ categorical levels
by removing the first level.
For instance, if a categorical column has three values A, B, and C, we don't need
a dummy variable for all three. With drop_first=True the dummy for A is dropped: if
an observation is neither B nor C, it is implicitly A, so a third dummy variable is
not needed to identify it.
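The effect of drop_first can be seen on a toy column. This is a minimal sketch: the DataFrame and its column name `level` are made up for illustration and are not from the assignment dataset, and `dtype=int` is passed only so the dummies print as 0/1.

```python
import pandas as pd

# Hypothetical categorical column with k = 3 levels.
df = pd.DataFrame({"level": ["A", "B", "C", "A", "C"]})

# All k dummies: every row sums to 1, which duplicates the intercept.
full = pd.get_dummies(df["level"], dtype=int)

# k - 1 dummies: the first level ("A") becomes the implicit baseline.
reduced = pd.get_dummies(df["level"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(full.sum(axis=1).tolist())
```

In `reduced`, a row of all zeros stands for level A, so no information is lost by dropping that column.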
3. Looking at the pair-plot among the numerical variables, which one has the highest
correlation with the target variable?
Answer:
The ‘temp’ variable has the highest correlation with the target variable.
4. How did you validate the assumptions of Linear Regression after building the model
on the training set?
Answer:
I have validated the Linear Regression Model based on the following five
assumptions:
▪ Normality of Error Terms: The error terms should be normally distributed.
▪ Multicollinearity Check: There should be no significant multicollinearity
among the predictor variables.
▪ Linear Relationship Validation: The relationship between the predictors and
the target variable should be linear.
▪ Homoscedasticity: The residuals should have constant variance, with no
visible pattern across the fitted values.
▪ Independence of Residuals: The residuals should show no autocorrelation.
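Some of these checks can be expressed numerically on the residuals. The sketch below uses synthetic data rather than the assignment dataset, and the specific diagnostics (skewness for normality, the Durbin-Watson statistic for autocorrelation, a correlation between fitted values and absolute residuals for homoscedasticity) are one possible choice, not necessarily the ones used in the original analysis.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3 + 2x + noise.
rng = np.random.default_rng(42)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted

# Normality: residuals should be centred at zero and roughly symmetric.
resid_mean = residuals.mean()
z = (residuals - resid_mean) / residuals.std()
skewness = np.mean(z ** 3)

# Independence: Durbin-Watson lies in [0, 4]; values near 2 suggest
# no first-order autocorrelation.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Homoscedasticity: the spread of residuals should not grow or shrink
# with the fitted values.
spread_corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]

print(resid_mean, skewness, durbin_watson, spread_corr)
```

In practice these numbers are usually read alongside a residual histogram and a residuals-versus-fitted scatter plot.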
5. Based on the final model, which are the top 3 features contributing significantly
towards explaining the demand of the shared bikes?
Answer:
The top three features significantly contributing to explaining the demand for
shared bikes are:
▪ Temperature (temp)
▪ Winter season (winter)
▪ September (september)
General Subjective Questions
1. Explain the linear regression algorithm in detail.
Answer:
Linear regression is a statistical model that analyses the linear relationship
between a dependent variable and a given set of independent variables. A linear
relationship implies that changes in the independent variables result in
proportional changes in the dependent variable.
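Key Components
Y = mX + c
where Y is the dependent variable, X is the independent variable, m is the slope
of the line, and c is the intercept.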
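For a single predictor, the slope and intercept of the line can be estimated by ordinary least squares. The sketch below assumes the standard closed-form solution (slope = covariance of X and Y divided by variance of X); the helper name `fit_line` and the sample numbers are illustrative only.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares estimates (m, c) for the line Y = mX + c."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # slope
    c = y_bar - m * x_bar    # intercept: the line passes through the means
    return m, c

# Made-up observations that lie close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, c = fit_line(xs, ys)
predicted = m * 6.0 + c  # prediction for a new X value
print(m, c, predicted)
```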
3. What is Pearson’s R?
Answer:
Pearson's R, also known as Pearson's correlation coefficient, is a statistical
measure that quantifies the strength and direction of a linear relationship
between two variables. It is denoted by r and ranges from -1 to 1.
Range:
o r = 1: Perfect positive linear correlation.
o r = −1: Perfect negative linear correlation.
o r = 0: No linear correlation.
Interpretation:
o Positive Values: Indicate a positive relationship where, as one variable
increases, the other variable also increases.
o Negative Values: Indicate a negative relationship where, as one variable
increases, the other variable decreases.
o Magnitude: The closer the value of r is to 1 or -1, the stronger the linear
relationship between the two variables.
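The three cases in the range above can be reproduced with a small computation. This sketch assumes the usual definition of r as the covariance of the two variables divided by the product of their standard deviations; `pearson_r` and the sample lists are illustrative names, not part of the assignment.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly decreasing line
r_flat = pearson_r(xs, [1, 2, 3, 2, 1])   # symmetric rise and fall

print(r_up, r_down, r_flat)
```

The last case gives r close to 0 even though the two variables are clearly related, which is a reminder that r only measures linear association.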
Standardization rescales a variable as:
X_standardized = (X − μ) / σ
where μ is the mean of X and σ is its standard deviation.
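The standardization formula can be applied as follows. This is a minimal sketch: the list `temps` is made up, and the population standard deviation is used for σ, which is an assumption since the text does not say which variant is intended.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical temperature readings.
temps = [10.0, 20.0, 30.0, 40.0, 50.0]
z = standardize(temps)

print(z)
print(fmean(z), pstdev(z))
```

After the transformation the values are unitless, so variables measured on different scales become directly comparable.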
5. You might have observed that sometimes the value of VIF is infinite. Why does this
happen?
Answer:
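A sketch of the usual explanation, assuming the standard definition VIF_i = 1 / (1 − R_i²), where R_i² comes from regressing predictor i on all the other predictors. If one predictor is an exact linear combination of the others, R_i² = 1 and the denominator becomes zero. The helper `vif` below is hand-written for illustration and the data are synthetic.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # R^2 = 1 means the column is fully explained by the others.
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2  # exact linear combination of x1 and x2

vif_collinear = vif(np.column_stack([x1, x2, x3]), 2)
vif_independent = vif(np.column_stack([x1, x2]), 0)

print(vif_collinear, vif_independent)
```

With floating-point arithmetic the collinear case may come out as a very large finite number rather than exactly infinity. Keeping all k dummy columns of a categorical variable together with an intercept produces the same kind of exact dependence.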
6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear
regression.
Answer:
A Q-Q plot, or Quantile-Quantile plot, is a graphical tool used to compare
the distribution of a dataset to a theoretical distribution, often the normal
distribution. The plot displays the quantiles of the sample data against the
quantiles of the theoretical distribution. If the data follows the theoretical
distribution, the points on the Q-Q plot will approximately lie on a straight line.
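The points of a normal Q-Q plot can be computed without any plotting library. In the sketch below, the plotting positions (i + 0.5) / n are one common convention and are an assumption, as are the helper names and the synthetic samples. The correlation between the two sets of quantiles is used here only as a rough numeric stand-in for "how straight is the line".

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def qq_points(sample):
    """Pair each sorted sample value with a standard normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Correlation between theoretical and sample quantiles."""
    xs, ys = zip(*points)
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
normal_like = [rng.gauss(0.0, 1.0) for _ in range(500)]  # e.g. well-behaved residuals
skewed = [rng.expovariate(1.0) for _ in range(500)]      # clearly non-normal

r_normal = straightness(qq_points(normal_like))
r_skewed = straightness(qq_points(skewed))

print(r_normal, r_skewed)
```

In a linear regression setting the sample would typically be the model's residuals, so that the plot checks the normality of the error terms.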