
Subjective Questions

The document contains questions related to linear regression analysis and model building. It asks the respondent to summarize inferences made from analyzing categorical variables, explain why drop_first=True is used for dummy variable creation, and identify the feature with highest correlation to the target variable based on a pair plot. It also asks to validate linear regression assumptions on the training set, list the top 3 significant features in the final model, and explain the linear regression algorithm and Anscombe's quartet in detail. General questions about Pearson's R, scaling, VIF values becoming infinite, and Q-Q plots are also included.

Uploaded by

Nitish Gupta

Assignment-based Subjective Questions

1. From your analysis of the categorical variables from the dataset, what
could you infer about their effect on the dependent variable?

Answer: Here are some of the inferences I made from my analysis of the effect of the
categorical variables on the dependent variable (Count):

1. Fall has the highest median, which is expected as weather conditions are the most
favourable for riding a bike, followed by summer.
2. Median bike rentals are increasing year on year: 2019 has a higher median than
2018. This might be because bike rentals are getting more popular and people are
becoming more aware of the environment.
3. The overall spread in the month plot reflects the season plot, as the fall months have
higher medians.
4. People rent more on non-holidays than on holidays. The reason might be that on
holidays they prefer to spend time with family and use a personal vehicle instead of a
rented bike.
5. The median is about the same across all days of the week, but the spread is bigger
for Saturday and Wednesday. For Saturday, this may be because it is a non-working
day and people who have other plans might not rent bikes.
6. Working and non-working days have almost the same median, although the spread is
bigger for non-working days, as people might have plans and not want to rent
bikes because of that.
7. Clear weather is the most favourable for bike rentals, as the temperature is
comfortable and humidity is lower.

2. Why is it important to use drop_first=True during dummy variable
creation?

Answer: A variable with n levels can be represented by n-1 dummy variables, so the data can
still be represented after the first column is removed: if the dummy variables for levels 2
to n are all 0, the observation belongs to the 1st level. Dropping the redundant column also
avoids perfect multicollinearity among the dummy variables, because the n columns always
sum to 1.
Example: For 'Relationship' with three levels, namely 'Single', 'In a Relationship', and
'Married', I would create a dummy table like the following:

Relationship         Single   In a Relationship   Married
Single               1        0                   0
In a Relationship    0        1                   0
Married              0        0                   1

But I can clearly see that there is no need to define three different levels. If I drop a level,
say 'Single', I would still be able to explain the three levels.
Let us drop the dummy variable 'Single' from the columns and see what the table looks like:

Relationship         In a Relationship   Married
Single               0                   0
In a Relationship    1                   0
Married              0                   1

If both the dummy variables, namely 'In a Relationship' and 'Married', are equal to zero,
that means that the person is single. If 'In a Relationship' is one and 'Married' is zero, that
means that the person is in a relationship, and finally, if 'In a Relationship' is zero and
'Married' is one, that means that the person is married.
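
The 'Relationship' example can be sketched with pandas. This is a minimal illustration, not part of the original assignment: the three-row DataFrame is hypothetical, and the category order is set explicitly (an assumption made here) so that 'Single' is the first level and therefore the one that drop_first=True removes. Without an explicit order, pandas sorts string levels alphabetically and would drop 'In a Relationship' instead.

```python
import pandas as pd

# Hypothetical three-row example mirroring the 'Relationship' variable.
# The explicit category order makes 'Single' the first level.
levels = ["Single", "In a Relationship", "Married"]
df = pd.DataFrame({"Relationship": pd.Categorical(levels, categories=levels)})

# All n = 3 dummy columns: every row sums to 1, so the columns are
# perfectly collinear with the intercept term of a regression.
full = pd.get_dummies(df, columns=["Relationship"], dtype=int)

# drop_first=True removes the first level ('Single'), leaving n - 1 columns.
reduced = pd.get_dummies(df, columns=["Relationship"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the reduced table, the row of zeros is the 'Single' observation, which matches the reading of the table described in the answer.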

3. Looking at the pair-plot among the numerical variables, which one has
the highest correlation with the target variable?

Answer: ‘temp’ had the highest correlation coefficient of 0.63.


4. How did you validate the assumptions of Linear Regression after
building the model on the training set?

Answer: By plotting the distribution of the residuals (error terms) on the training set. The
residuals were approximately normally distributed with a mean of 0.

5. Based on the final model, which are the top 3 features contributing
significantly towards explaining the demand of the shared bikes?

Answer: The following are the top 3 features contributing significantly towards explaining
the demand for the shared bikes:
• atemp (0.412)
• yr (0.236)
• weathersit Light rain (-0.275)

General Subjective Questions

1. Explain the linear regression algorithm in detail.

Answer: A linear regression algorithm tries to explain the relationship between the
independent variables and the dependent variable using a straight line. The dependent
variable must be numerical; categorical independent variables have to be encoded first, for
example as dummy variables.
The following steps are performed while doing linear regression:
• The dataset is divided into training and test data.
• The training data is divided into features (independent) and target (dependent) datasets.
• A linear model is fitted using the training dataset. The coefficients of the best-fit line
are the ones that minimise a cost function, typically the residual sum of squares.
Common Python APIs such as scikit-learn's LinearRegression solve this directly with
ordinary least squares; gradient descent is an alternative, iterative way to minimise
the same cost function.
• In the case of multiple features, the fitted model is a hyperplane instead of a line.
The predicted variable takes the following form:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n$

• The predictions are then compared with the test data, and the model assumptions are
checked.
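
As a rough sketch of the fitting step, the snippet below estimates the intercept and slope for a single feature in two ways: with the closed-form least-squares formulas and with gradient descent on the mean squared error. The toy data, the learning rate and the iteration count are all assumptions chosen for illustration; they do not come from the bike-sharing dataset.

```python
# Hypothetical toy data generated from y = 2 + 3x (no noise), so the
# best-fit line is known in advance.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 3.0 * x for x in xs]
n = len(xs)

# Closed-form ordinary least squares for a single feature.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Gradient descent on the mean squared error, starting from zero.
b0, b1 = 0.0, 0.0
learning_rate = 0.05  # assumed value, small enough to be stable on this data
for _ in range(5000):
    errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    grad_b0 = 2 * sum(errors) / n
    grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print("closed form:     ", intercept, slope)
print("gradient descent:", b0, b1)
```

Both approaches minimise the same cost function, so on this data they should agree up to numerical precision.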

2. Explain the Anscombe’s quartet in detail.

Answer: Anscombe’s quartet comprises four data sets that have nearly identical simple
descriptive statistics but look quite different when visualized graphically. The
simple statistics consist of the mean and sample variance of x and y, the correlation
coefficient, the linear regression line and the R-squared value. Anscombe's quartet shows
that multiple data sets with many similar statistical properties can still be vastly different
from one another when graphed.
(Figure: the four scatter plots of Anscombe's quartet. Image source - https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
1. The first plot (top left) appears to be a simple linear relationship.
2. The second plot (top right) shows a clearly nonlinear (curved) relationship, so the
correlation coefficient is not a meaningful summary of it.
3. The third plot (bottom left) is linear but has a different regression line than the
points suggest. This happens because a single outlier pulls the fitted line away.
4. The fourth plot (bottom right) does not show a linear relationship: x is constant for
all points but one, and that single outlier is enough to produce the same statistics.

In a nutshell, it is good practice to visualize the data and deal with outliers before analysing it.
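
The near-identical statistics can be checked numerically. The sketch below uses the published values of Anscombe's four data sets and a small helper, summarize, which is a name introduced here for illustration. It computes the means, the correlation coefficient and the regression line for each set.

```python
from math import sqrt

# Anscombe's quartet: sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return mean of x, mean of y, Pearson's r, slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return mean_x, mean_y, r, slope, intercept


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, [round(value, 3) for value in stats])
```

The four rows of output agree to about two decimal places, even though the scatter plots look nothing alike.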

3. What is Pearson’s R?

Answer: Pearson’s R measures the strength and direction of the linear association between
two variables. It is the covariance of the two variables divided by the product of their
standard deviations. It takes values from -1 to +1.
• A value of 1 means a total positive linear correlation. It means that if one variable
increases then the other will also increase.
• A value of 0 means no linear correlation.
• A value of -1 means a total negative linear correlation. It means that if one variable
increases then the other will decrease.
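
The definition translates directly into a few lines of code. This is a minimal sketch with made-up data; the function name pearson_r is introduced here for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


xs = [-2, -1, 0, 1, 2]
r_positive = pearson_r(xs, [2 * x + 1 for x in xs])  # perfectly increasing line
r_negative = pearson_r(xs, [-3 * x for x in xs])     # perfectly decreasing line
r_zero = pearson_r(xs, [x * x for x in xs])          # symmetric curve, no linear trend

print(r_positive, r_negative, r_zero)
```

The last case shows why a value of 0 only rules out a linear relationship: y is fully determined by x, yet the coefficient is 0.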

4. What is scaling? Why is scaling performed? What is the difference
between normalized scaling and standardized scaling?

Answer: Scaling is a pre-processing step in linear regression analysis that brings the
variables onto a comparable range. The reason we scale a variable is to make gradient
descent converge faster. The step size of gradient descent is generally kept small for
accuracy, so if the data has some small variables (values in the range 0-1) and some big
variables (values in the range 0-1000), then the gradient descent algorithm will take a
very long time to converge.

Normalised scaling                              | Standardized scaling
Also called min-max scaling; scales the         | Values are centred around the mean with
variable such that the range is 0-1             | unit standard deviation
Good for non-Gaussian distributions             | Good for Gaussian distributions
Values are bounded between 0 and 1              | Values are not bounded
Sensitive to outliers, which squeeze the        | Less sensitive to outliers, as values are
remaining values into a narrow range            | not forced into a fixed range
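
Both kinds of scaling are simple arithmetic, as the sketch below shows on a made-up list of values. The function names min_max_scale and standardize are introduced here for illustration, and the standard deviation used is the population one (dividing by n), which is an assumption about the convention.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalised scaling: map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Standardized scaling: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


values = [0, 250, 500, 1000]  # hypothetical "big" variable in the range 0-1000
normalised = min_max_scale(values)
standardized = standardize(values)

print(normalised)
print(standardized)
```

The normalised values always lie between 0 and 1, while the standardized values have mean 0 and standard deviation 1 but no fixed bounds.
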
5. You might have observed that sometimes the value of VIF is infinite.
Why does this happen?

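Answer: The formula for VIF is

$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$

where $R_i^2$ is the R-squared value obtained by regressing feature i on all the other
features. If $R_i^2$ is 1, the denominator is 0 and the VIF becomes infinite. It means that
feature i is perfectly correlated with the other features, that is, it can be written as an
exact linear combination of them.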
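
A small numerical sketch of the formula, restricted to the two-feature case so that R-squared is simply the squared correlation between the two features. The data and the function name vif_two_features are made up for illustration, and the tolerance used to detect a zero denominator is an assumption.

```python
def vif_two_features(xi, xj):
    """VIF of feature xi when the only other feature is xj.

    With one other feature, R^2 of regressing xi on xj equals the
    squared Pearson correlation between them.
    """
    n = len(xi)
    mean_i = sum(xi) / n
    mean_j = sum(xj) / n
    sij = sum((a - mean_i) * (b - mean_j) for a, b in zip(xi, xj))
    sii = sum((a - mean_i) ** 2 for a in xi)
    sjj = sum((b - mean_j) ** 2 for b in xj)
    r_squared = sij ** 2 / (sii * sjj)
    if 1.0 - r_squared < 1e-12:  # assumed tolerance for "R^2 is 1"
        return float("inf")
    return 1.0 / (1.0 - r_squared)


x1 = [1, 2, 3, 4, 5]
x2 = [2 * v + 1 for v in x1]  # exact linear function of x1
x3 = [2, 1, 4, 3, 5]          # related to x1, but not perfectly

vif_perfect = vif_two_features(x2, x1)
vif_partial = vif_two_features(x3, x1)
print(vif_perfect, vif_partial)
```

One common way to end up in the infinite case is to keep all n dummy columns of a categorical variable, since each column is then an exact linear function of the others and the intercept.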

6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in
linear regression.

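Answer: A Q-Q (quantile-quantile) plot is a scatter plot of two sets of quantiles against
each other. Its purpose is to check visually whether the two sets of data came from the same
distribution. If they did, the points fall approximately on a straight line. In linear
regression it is typically used to check the assumption that the residuals are normally
distributed, by plotting the quantiles of the residuals against those of a normal distribution.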
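
The points of a normal Q-Q plot can be computed without any plotting library. The sketch below is illustrative only: the two samples are randomly generated stand-ins for residuals, the plotting positions (i + 0.5) / n are one common convention among several, and the function names are introduced here. The correlation between the two sets of quantiles is used as a rough numeric proxy for how straight the plotted line would look.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, pstdev


def qq_points(sample):
    """Return (theoretical normal quantiles, standardized sample quantiles)."""
    n = len(sample)
    mu, sigma = mean(sample), pstdev(sample)
    sample_q = sorted((v - mu) / sigma for v in sample)
    theoretical_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical_q, sample_q


def pearson_r(xs, ys):
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
normal_like = [rng.gauss(0, 1) for _ in range(200)]  # stand-in for well-behaved residuals
skewed = [rng.expovariate(1.0) for _ in range(200)]  # stand-in for skewed residuals

normal_corr = pearson_r(*qq_points(normal_like))
skewed_corr = pearson_r(*qq_points(skewed))
print(normal_corr, skewed_corr)
```

Plotting the two lists returned by qq_points against each other gives the Q-Q plot itself; the normally distributed sample should hug the diagonal more closely than the skewed one.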
