Exploratory Data Analysis and Regression Modeling: by Isha Arora Nehalkumar Jesadiya Rohit Nanawati Shah Razak Mohammad

This document summarizes exploratory data analysis and regression modeling that was performed on bike rental data. It includes statistical summaries of the target variable distribution, outlier detection using boxplots, correlation analysis, and comparisons of different regression models including linear regression, lasso regression, ridge regression, and random forest regression. The random forest model with 300 estimators performed best with the highest test and train scores and lowest mean squared error.

Uploaded by

Rohit N
EXPLORATORY DATA ANALYSIS AND REGRESSION MODELING
By
Isha Arora
NehalKumar Jesadiya
Rohit Nanawati
Shah Razak Mohammad
STATISTICAL SUMMARY
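• The describe() method computes a statistical summary of the dataset.
• It gives the count, mean, std, min, quartiles (25%, 50% and 75%) and max. The 50% value is the median, and the IQR is the difference between the 75% and 25% values.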
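
A minimal sketch of such a summary, assuming describe() refers to the pandas method. The series name rented_bike_count and its values are made up for illustration and are not the project's data.

```python
import pandas as pd

# Hypothetical rental counts; not taken from the project's dataset.
rented_bike_count = pd.Series(
    [0, 50, 120, 300, 450, 800, 1000], name="rented_bike_count"
)

# describe() returns count, mean, std, min, 25%, 50%, 75% and max.
summary = rented_bike_count.describe()

# The median is the 50% entry; the IQR is derived from the quartiles.
median = summary["50%"]
iqr = summary["75%"] - summary["25%"]

print(summary)
print("median:", median, "IQR:", iqr)
```

Note that the IQR is not a row of the describe() output, so it is computed from the 25% and 75% entries.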
DATA DISTRIBUTION OF TARGET DATAPOINT WITH STATISTICAL SUMMARY

• Statistical summary includes min, 1st quartile, median, 3rd quartile and max

• The figure shows count of Rented Bike ranging from 0 to maximum
DATA DISTRIBUTION OF TARGET DATAPOINT WITH STATISTICAL SUMMARY

Count of Rented Bike ranging from 0 to 1000


DATA DISTRIBUTION OF TARGET DATAPOINT WITH STATISTICAL SUMMARY

Count of Rented Bike less than 300


DATA DISTRIBUTION
The bar graph below shows the count of rented bikes during each hour of the day, on both Holiday and No Holiday days.

Count of Rented Bikes at every hour of a day
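
The table behind such a bar graph could be built roughly as follows. This is a sketch using pandas, with hypothetical column names (hour, holiday, rented_bike_count) and made-up values. Summing the counts per hour is an assumption; an average per hour would work the same way.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
rentals = pd.DataFrame({
    "hour": [8, 8, 18, 18, 8, 18],
    "holiday": ["No Holiday", "Holiday", "No Holiday",
                "Holiday", "No Holiday", "No Holiday"],
    "rented_bike_count": [900, 200, 1500, 400, 1100, 1300],
})

# Total rentals per hour, with one column per holiday category.
hourly = (
    rentals.groupby(["hour", "holiday"])["rented_bike_count"]
    .sum()
    .unstack("holiday")
)
print(hourly)

# hourly.plot(kind="bar") would draw the grouped bars (requires matplotlib).
```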


DETECTING OUTLIERS WITH BOXPLOT
• A boxplot is a compact way to show the median, first quartile, third quartile and outliers.
• The box spans the Inter-Quartile Range (IQR), from the first quartile to the third quartile.
• The vertical lines extending from the box, capped by horizontal lines at their ends, are called whiskers.
• The values beyond the whiskers are the outliers in our dataset.
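
The same outliers can be listed numerically. The sketch below assumes the common convention that whiskers reach at most 1.5 times the IQR beyond the box; the slides do not state which whisker rule was used, and the values are made up.

```python
import pandas as pd

# Hypothetical values with one obviously extreme point.
counts = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Assumed convention: fences sit 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences would be drawn beyond the whiskers.
outliers = counts[(counts < lower_fence) | (counts > upper_fence)]
print(outliers.tolist())
```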
RENTED BIKE COUNT ACCORDING TO SEASONS

Cross tab for proportion of bike count
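
A cross tab of proportions could be produced along these lines with pandas. The column names, the choice of holiday as the second dimension, and the values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical totals per season and holiday category.
rentals = pd.DataFrame({
    "season": ["Winter", "Winter", "Summer", "Summer"],
    "holiday": ["Holiday", "No Holiday", "Holiday", "No Holiday"],
    "rented_bike_count": [50, 150, 200, 600],
})

# Sum the counts in each cell, then divide by the grand total.
proportions = pd.crosstab(
    index=rentals["season"],
    columns=rentals["holiday"],
    values=rentals["rented_bike_count"],
    aggfunc="sum",
    normalize=True,
)
print(proportions)
```

Passing normalize="index" instead would give the share within each season rather than the share of all rentals.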


CORRELATION MATRIX
• A correlation matrix is a table showing the correlation coefficient between each pair of variables, used to uncover patterns.

• It indicates how one variable tends to change as the value of another variable changes.

• There are 2 types of Correlation:

1) Positive Correlation: The value of one variable increases as the value of the other increases.
2) Negative Correlation: The value of one variable decreases as the value of the other increases, or vice versa.
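
Both kinds of correlation show up in a small sketch like the one below. It uses the pandas corr() method, which computes Pearson correlation by default; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical observations; not the project's data.
weather = pd.DataFrame({
    "temperature": [5, 10, 15, 20, 25],
    "humidity": [88, 82, 69, 62, 48],
    "rented_bike_count": [110, 190, 320, 390, 510],
})

# Pairwise Pearson correlation between all numeric columns.
corr = weather.corr()
print(corr.round(2))

# A positive entry means the two columns rise together,
# a negative entry means one falls as the other rises.
temp_vs_count = corr.loc["temperature", "rented_bike_count"]
humidity_vs_count = corr.loc["humidity", "rented_bike_count"]
```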
VISUALIZATION OF COUNT OF RENTED BIKE ACCORDING TO THE DATES
REGRESSION MODELLING

Linear Regression
• Regression is a technique that describes how a response variable Y changes as an explanatory variable X changes.
• Regression line can be used to predict the value of Y for a given value of X.
• Regression Equation is given by: ŷ = b0 + b1x

Lasso Regression
• Lasso regression analysis is a shrinkage and variable selection method for linear regression models.
• The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable.
• Offers automatic feature selection because it can completely remove some features.

Ridge Regression
• It is a technique for analyzing multiple regression data that suffer from multicollinearity.
• By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
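
The difference between the three models is easiest to see in their fitted coefficients. The following is a sketch assuming scikit-learn is the library in use (the slides do not name it), with synthetic data and illustrative alpha values.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first two features drive y, the third is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)   # ordinary least squares
lasso = Lasso(alpha=0.5).fit(X, y)      # L1 penalty, can zero coefficients
ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty, shrinks coefficients

for name, model in [("linear", linear), ("lasso", lasso), ("ridge", ridge)]:
    print(name, np.round(model.intercept_, 3), np.round(model.coef_, 3))
```

On data like this, the lasso coefficient of the irrelevant third feature is driven to exactly zero, while ridge only shrinks the coefficients slightly towards zero.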
RANDOM FOREST
• Random forest is a tree-based algorithm that builds several decision trees and combines their outputs to improve the model's ability to generalize.

• It is an ensemble technique: a supervised learning method that combines the predictions of multiple smaller models to improve predictive power.

• Our project uses the Bootstrap Aggregation, or Bagging, method. Each tree is trained independently on a bootstrap sample, and the outputs are aggregated at the end.

• It operates by constructing a multitude of decision trees at training time and outputting the mean of the individual trees' predictions for regression (or the mode of the predicted classes for classification).
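
The aggregation step can be checked directly. This sketch assumes scikit-learn's RandomForestRegressor and uses synthetic data; only the 300 estimators come from the slides, everything else is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: a noisy daily cycle over a hypothetical "hour" feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(200, 1))
y = 500 + 400 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(scale=30, size=200)

# bootstrap=True trains each tree on a resampled copy of the data (bagging).
forest = RandomForestRegressor(n_estimators=300, bootstrap=True, random_state=0)
forest.fit(X, y)

# Averaging the individual trees reproduces the forest's own prediction.
X_new = np.array([[8.0], [18.0]])
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
print(averaged, forest.predict(X_new))
```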
ANALYSIS RESULTS

The table shows the train_score, Test_score, Regression_score and Mean Squared Error for all the regression techniques that we performed in our project.
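
A table of this kind could be assembled roughly as follows. This is a sketch assuming scikit-learn, where score() returns R² for regressors. The data is synthetic, the 100-estimator forest is a hypothetical stand-in for the second Random Forest setting, the Mean Squared Error is assumed to be measured on the test split, and Regression_score is left out because the slides do not define it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, partly non-linear data; not the project's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * np.sin(2 * np.pi * X[:, 0]) + 50 * X[:, 1] + rng.normal(scale=5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=1.0),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest (100 estimators)": RandomForestRegressor(n_estimators=100, random_state=0),
    "Random Forest (300 estimators)": RandomForestRegressor(n_estimators=300, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_score": model.score(X_train, y_train),  # R^2 on training data
        "test_score": model.score(X_test, y_test),     # R^2 on held-out data
        "mse": mean_squared_error(y_test, model.predict(X_test)),
    }

for name, row in results.items():
    print(f"{name}: train={row['train_score']:.3f} "
          f"test={row['test_score']:.3f} mse={row['mse']:.1f}")
```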
ANALYSIS RESULTS
CONCLUSION

• The results show that the Random Forest Regressor with 300 estimators performs best in terms of Mean Squared Error, Test and Train Score and Regression score.
• Lasso Regression, Ridge Regression and Linear Regression produce nearly identical values.
• There is only a slight difference between the two Random Forest Regressors. However, the one with 300 estimators gives the best values and can be used to predict the Bike Rental Scenario.
Thank You
