Subjective Questions
1. From your analysis of the categorical variables from the dataset, what
could you infer about their effect on the dependent variable?
Answer: Here are the inferences I drew from my analysis of the categorical variables
about their effect on the dependent variable (Count):
1. Fall has the highest median count, followed by summer. This is expected, as weather
conditions are most favourable for riding a bike in those seasons.
2. Median bike rentals are increasing year on year: 2019 has a higher median than
2018. This might be because bike rentals are getting more popular and people are
becoming more aware of the environment.
3. The overall spread in the month plot mirrors the season plot, as the fall months have
higher medians.
4. People rent more on non-holidays than on holidays. One possible reason is that on
holidays they prefer to spend time with family and use a personal vehicle instead of a
rented bike.
5. The median is about the same across all days of the week, but the spread is larger for
Saturday and Wednesday. For Saturday, this may be because it is a non-working day and
people who have other plans might not rent bikes.
6. Working and non-working days have almost the same median, although the spread is
larger for non-working days, as people might have other plans and not want to rent
bikes because of that.
7. Clear weather is the most favourable for bike renting, as the temperature is
comfortable and the humidity is lower.
Answer: A variable with n levels can be represented by n-1 dummy variables. So, even if we
remove the first dummy column, the data is still fully represented: if the dummy variables
for levels 2 to n are all 0, the observation belongs to the 1st level.
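Example: For 'Relationship' with three levels, namely 'Single', 'In a Relationship', and
'Married', I would create a dummy table like the following:
Relationship status    Single    In a Relationship    Married
Single                 1         0                    0
In a Relationship      0         1                    0
Married                0         0                    1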
But I can clearly see that there is no need to define three different dummy variables. If I
drop one, say 'Single', I would still be able to explain the three levels.
Let us drop the dummy variable 'Single' from the columns and see what the table looks like:
Relationship status    In a Relationship    Married
Single                 0                    0
In a Relationship      1                    0
Married                0                    1
If both the dummy variables, namely 'In a Relationship' and 'Married', are equal to zero,
that means that the person is single. If 'In a Relationship' is one and 'Married' is zero, that
means that the person is in a relationship, and finally, if 'In a Relationship' is zero and
'Married' is one, that means that the person is married.
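The drop-first encoding described above can be sketched in a few lines of plain Python. This is only an illustration: `make_dummies` and `decode` are hypothetical helper names, not functions from the assignment notebook. In pandas, the equivalent call is `pandas.get_dummies(..., drop_first=True)`, which chooses the dropped level from its own ordering of the levels, so the baseline there may not be 'Single'.

```python
def make_dummies(values, levels, drop_first=False):
    """Encode each value as 0/1 indicator columns, one per kept level."""
    kept = levels[1:] if drop_first else levels
    return [{level: int(value == level) for level in kept} for value in values]


def decode(row, levels):
    """Recover the level from a drop-first row: all zeros means the baseline."""
    for level, flag in row.items():
        if flag == 1:
            return level
    return levels[0]


# The level order is an assumption made here so that 'Single' is dropped.
levels = ["Single", "In a Relationship", "Married"]
people = ["Single", "In a Relationship", "Married"]

full = make_dummies(people, levels)                      # three columns
reduced = make_dummies(people, levels, drop_first=True)  # two columns

for person, row in zip(people, reduced):
    print(person, row)

# No information is lost: every level can be recovered from two columns.
recovered = [decode(row, levels) for row in reduced]
```

Dropping one column also avoids perfect collinearity between the dummy columns and the model intercept, since the full set of dummies always sums to one.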
3. Looking at the pair-plot among the numerical variables, which one has
the highest correlation with the target variable?
Answer: By plotting the distribution of the residuals. It came out to be approximately
normal with a mean value of 0.
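A minimal sketch of this residual check, using synthetic data rather than the bike-sharing dataset (the true coefficients 1.5 and 0.8, the noise level, and the seed are all assumptions made for illustration). It fits a simple line, computes the residuals, and summarises how they are distributed around zero.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic data: y = 1.5 + 0.8 * x plus normally distributed noise.
x = [i / 10 for i in range(100)]
y = [1.5 + 0.8 * xi + rng.gauss(0, 0.5) for xi in x]

# Closed-form least-squares fit of a straight line.
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
intercept = mean_y - slope * mean_x

# Residual = observed value minus fitted value.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
resid_mean = statistics.mean(residuals)
resid_sd = statistics.stdev(residuals)

# For roughly normal residuals, about 68% lie within one standard
# deviation of the mean and about 95% within two.
within_1sd = sum(abs(r) <= resid_sd for r in residuals) / len(residuals)
within_2sd = sum(abs(r) <= 2 * resid_sd for r in residuals) / len(residuals)

print(f"mean={resid_mean:.2e}, sd={resid_sd:.3f}")
print(f"within 1 sd: {within_1sd:.2f}, within 2 sd: {within_2sd:.2f}")
```

On training data, a least-squares fit with an intercept always gives residuals that average to zero, so the shape of the distribution is the informative part of this check.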
5. Based on the final model, which are the top 3 features contributing
significantly towards explaining the demand of the shared bikes?
Answer: The following are the top 3 features contributing significantly towards explaining
the demand for the shared bikes:
• atemp (0.412)
• yr (0.236)
• weathersit Light rain (-0.275)
General Subjective Questions
1. Explain the linear regression algorithm in detail.
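Answer: A linear regression algorithm tries to explain the relationship between the
independent variables and the dependent variable using a straight line. The dependent
variable must be numerical, and categorical independent variables have to be encoded
numerically first, for example as dummy variables.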
The following steps are performed while doing linear regression:
• The dataset is divided into training and test data.
• The training data is divided into features (independent variables) and target
(dependent variable).
• A linear model is fitted using the training dataset. The coefficients of the best-fit line
are found by minimising a cost function, typically the residual sum of squares.
This can be done iteratively with the gradient descent algorithm or in closed form
(ordinary least squares). Common Python APIs such as scikit-learn's LinearRegression
and statsmodels' OLS solve the least-squares problem directly.
• In case of multiple features, the fitted model is a hyperplane instead of a line.
The predicted variable takes the following form:
Y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ⋯ + βₙxₙ
• The predictions are then compared with the test data, and the model assumptions are
checked.
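The steps above can be sketched with a single feature in plain Python. This is an illustrative sketch on synthetic data, not the assignment's model: the true coefficients (3.0 and 2.0), the 70/30 split, the learning rate, and the function names are all assumptions. It fits the same line twice, once in closed form and once by gradient descent on the mean squared error, to show that both minimise the same cost function.

```python
import random

rng = random.Random(0)

# Synthetic data: y = 3.0 + 2.0 * x plus a little noise.
xs = [i / 20 for i in range(100)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]

# Step 1: split into training and test data (70/30).
pairs = list(zip(xs, ys))
rng.shuffle(pairs)
train, test = pairs[:70], pairs[70:]


def fit_closed_form(data):
    """Ordinary least squares for one feature: returns (b0, b1)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def fit_gradient_descent(data, lr=0.05, epochs=5000):
    """Minimise the mean squared error by repeated gradient steps."""
    b0 = b1 = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_b0 = 2 / n * sum(b0 + b1 * x - y for x, y in data)
        grad_b1 = 2 / n * sum((b0 + b1 * x - y) * x for x, y in data)
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


# Step 2: fit on the training data with both methods.
cf_b0, cf_b1 = fit_closed_form(train)
gd_b0, gd_b1 = fit_gradient_descent(train)

# Step 3: compare predictions with the held-out test data.
test_rmse = (sum((cf_b0 + cf_b1 * x - y) ** 2 for x, y in test) / len(test)) ** 0.5

print(f"closed form:      b0={cf_b0:.4f}, b1={cf_b1:.4f}")
print(f"gradient descent: b0={gd_b0:.4f}, b1={gd_b1:.4f}")
print(f"test RMSE: {test_rmse:.4f}")
```

The closed form is exact and needs no tuning. Gradient descent needs a suitable learning rate but extends to models where no closed-form solution exists.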
Answer: Anscombe’s quartet comprises four data sets that have nearly identical simple
descriptive statistics but have quite different distributions when visualized graphically. The
simple statistics consist of the mean and sample variance of x and y, the correlation
coefficient, the linear regression line and the R-squared value. The quartet shows that
data sets with many similar statistical properties can still be vastly different from one
another when graphed. The graphs are shown below:
Image source - https://en.wikipedia.org/wiki/Anscombe%27s_quartet
1. The first plot (top left) appears to show a simple linear relationship.
2. The second plot (top right) shows a clearly nonlinear relationship, so the correlation
coefficient, which measures linear association, is not a meaningful summary of it.
3. The third plot (bottom left) is linear, but the fitted regression line differs from the line
that most of the points follow. This happens because of a single outlier in the data.
4. The fourth plot (bottom right) does not show a linear relationship: x is the same for all
but one point, and that single outlier is enough to produce the high correlation.
In a nutshell, it is better practice to visualize the data and deal with outliers before analysing it.
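The near-identical statistics can be verified directly. The sketch below uses the published values of the four data sets and computes the statistics listed above with plain Python. The `summarize` helper is a name introduced here for illustration.

```python
from statistics import mean

# Anscombe's quartet: data sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the simple descriptive statistics for one data set."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),  # sample variance
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": r,
        "r_squared": r * r,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}

for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

All four rows print nearly the same numbers, which is why the differences only become visible when the data is plotted.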
3. What is Pearson’s R?
6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in
linear regression.
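Answer: A Q-Q (quantile-quantile) plot is a scatter plot of two sets of quantiles against each
other. Its purpose is to check whether the two sets of data came from the same distribution.
It is a visual check of the data: if both sets come from the same distribution, the points fall
roughly along a straight line. In linear regression, it is typically used to compare the
residuals with a normal distribution, which checks the assumption that the error terms are
normally distributed.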
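As a rough sketch of how the points of a normal Q-Q plot are built, the code below pairs the sorted sample values with the matching quantiles of a standard normal distribution. The sample data, the seed, the plotting positions (i + 0.5) / n, and the helper names are assumptions made for illustration. Instead of drawing the plot, it measures how straight the points are with a correlation coefficient.

```python
import random
from statistics import NormalDist, mean


def normal_qq_points(sample):
    """Pair theoretical standard-normal quantiles with the sorted sample."""
    n = len(sample)
    ordered = sorted(sample)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def straightness(points):
    """Pearson correlation of the Q-Q points; near 1 means a straight line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in points)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
normal_sample = [rng.gauss(0, 1) for _ in range(200)]       # e.g. well-behaved residuals
skewed_sample = [rng.expovariate(1.0) for _ in range(200)]  # strongly right-skewed

r_normal = straightness(normal_qq_points(normal_sample))
r_skewed = straightness(normal_qq_points(skewed_sample))

print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

Plotting the pairs returned by `normal_qq_points` gives the usual Q-Q plot. The skewed sample bends away from the straight line at the upper end, which shows up here as a lower correlation.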