Yulu Case Study
About Yulu
Yulu is India’s leading micro-mobility service provider, which offers unique vehicles for the daily commute. Starting off as a mission to eliminate traffic congestion in India, Yulu provides the safest commute
solution through a user-friendly mobile app to enable shared, solo and sustainable commuting.
Yulu zones are located at all the appropriate locations (including metro stations, bus stands, office spaces, residential areas, corporate offices, etc) to make those first and last miles smooth, affordable, and
convenient!
Yulu has recently suffered considerable dips in its revenues. They have contracted a consulting company to understand the factors on which the demand for these shared electric cycles depends. Specifically,
they want to understand the factors affecting the demand for these shared electric cycles in the Indian market.
Business Problem
The company wants to know:
1. Which variables are significant in predicting the demand for shared electric cycles in the Indian market?
2. How well do those variables describe the demand for these electric cycles?
Attribute Information
datetime: date and hour of the record
season: 1: spring, 2: summer, 3: fall, 4: winter
holiday: whether the day is a holiday (1) or not (0)
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weather:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: temperature in Celsius
atemp: "feels like" temperature in Celsius
humidity: relative humidity
windspeed: wind speed
casual: count of casual (unregistered) users
registered: count of registered users
count: count of total rental bikes, including both casual and registered
Problem Statement
1. Define Problem Statement and perform Exploratory Data Analysis.
2. Hypothesis Testing:
a. 2-Sample T-Test to check if Working Day has an effect on the number of electric cycles rented
b. ANOVA to check if the number of cycles rented is similar or different across 1. weather 2. season
c. Chi-square test to check if Weather is dependent on the season
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, ttest_1samp, ttest_rel, chi2_contingency, f_oneway, chisquare, levene, shapiro, boxcox
from statsmodels.graphics.gofplots import qqplot
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
In [144]: df = pd.read_csv("C:\\Users\\vidya\\Downloads\\bike_sharing.txt")
In [145]: df
Out[145]: datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
... ... ... ... ... ... ... ... ... ... ... ... ...
In [146]: df.shape
Out[146]: (10886, 12)
In [147]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null object
1 season 10886 non-null int64
2 holiday 10886 non-null int64
3 workingday 10886 non-null int64
4 weather 10886 non-null int64
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int64
8 windspeed 10886 non-null float64
9 casual 10886 non-null int64
10 registered 10886 non-null int64
11 count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
In [148]: df.isna().sum()
Out[148]:
datetime 0
season 0
holiday 0
workingday 0
weather 0
temp 0
atemp 0
humidity 0
windspeed 0
casual 0
registered 0
count 0
dtype: int64
In [149]: df[df.duplicated()]
Out[149]: datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
(empty — the dataset contains no duplicate rows)
Datatype Validation
In [150]: df.dtypes
Out[150]:
datetime object
season int64
holiday int64
workingday int64
weather int64
temp float64
atemp float64
humidity int64
windspeed float64
casual int64
registered int64
count int64
dtype: object
The "datetime" column is of object dtype, not a datetime dtype, so it needs to be converted.
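The conversion cell itself is not visible in this export; a minimal sketch of the conversion using `pd.to_datetime`, on a small stand-in frame:

```python
import pandas as pd

# Small stand-in frame; in the notebook this is the bike_sharing dataset
df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})
print(df["datetime"].dtype)   # object

# Convert the string column to a proper datetime dtype
df["datetime"] = pd.to_datetime(df["datetime"])
print(df["datetime"].dtype)   # datetime64[ns]
```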
In [152]: df.describe()
Out[152]: season holiday workingday weather temp atemp humidity windspeed casual registered count
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000
The mean temperature is 20.23 °C and the median is 20.50 °C, so the temperature distribution is roughly symmetric.
About 68% of the records fall on working days, which makes sense as a lot of people use shared transport to commute on working days.
75% of the records have a temperature at or below 26.24 °C.
In [153]: df.nunique()
Out[153]:
datetime 10886
season 4
holiday 2
workingday 2
weather 4
temp 49
atemp 60
humidity 89
windspeed 28
casual 309
registered 731
count 822
dtype: int64
season, holiday, workingday and weather are categorical variables, so their dtypes are updated accordingly.
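The conversion cell is likewise not visible in the export; a sketch of the cast, assuming the standard `astype("category")` call, on a stand-in frame:

```python
import pandas as pd

# Stand-in frame with the four label-coded columns from the dataset
df = pd.DataFrame({"season": [1, 2, 3, 4], "holiday": [0, 0, 1, 0],
                   "workingday": [1, 1, 0, 1], "weather": [1, 2, 1, 3]})

# Cast each label-coded column to the pandas category dtype
for col in ["season", "holiday", "workingday", "weather"]:
    df[col] = df[col].astype("category")

print(df.dtypes)  # all four columns now show dtype 'category'
```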
In [155]: df.dtypes
Out[155]:
datetime datetime64[ns]
season category
holiday category
workingday category
weather category
temp float64
atemp float64
humidity int64
windspeed float64
casual int64
registered int64
count int64
dtype: object
Derived Column
In [156]: bins = [0, 40, 100, 200, 300, 500, 700, 900, 1000]
group = ['Low', 'Average', 'medium', 'H1', 'H2', 'H3', 'H4', 'Very high']
df['Rent_count'] = pd.cut(df['count'], bins, labels=group)  # Create new categorical column
In [157]: df
Out[157]: datetime season holiday workingday weather temp atemp humidity windspeed casual registered count Rent_count
... ... ... ... ... ... ... ... ... ... ... ... ... ...
In [158]: df.season.value_counts()
Out[158]:
4 2734
2 2733
3 2733
1 2686
Name: season, dtype: int64
In [159]: df.weather.value_counts()
Out[159]:
1 7192
2 2834
3 859
4 1
Name: weather, dtype: int64
In [160]: df.workingday.nunique()
Out[160]: 2
In [161]: df.humidity.nunique()
Out[161]: 89
There are 4 seasons in the Yulu dataset, and demand is almost equal across all of them.
Weather 1 (Clear, Few clouds, Partly cloudy) has far higher demand for shared electric cycles than the other weather conditions.
workingday has only two unique values: weekend/holiday and weekday.
Univariate Analysis
In [162]: col_category = ["season", "holiday", "workingday", "weather"]
fig, axis = plt.subplots(2, 2, figsize=(14, 10))
for index, col in enumerate(col_category):
    sns.countplot(x=df[col], ax=axis[index // 2, index % 2])
plt.show()
1. All seasons have almost the same count; the differences are negligible.
2. Holiday vs. working day is highly imbalanced, because far fewer people use the vehicles on holidays.
3. For weather, weather 1 (clear) has the maximum demand; demand decreases as the weather changes to mist and then light snow, and is almost negligible in heavy rain, since riding a bike in such weather is risky.
4. One more categorical variable was created to bin the rental counts into Low, Medium, High, etc.; it shows a log-normal-like shape, with Low occurring most often and the higher bins progressively less.
5. The data looks as expected: a roughly equal number of days in each season, more working days, and weather that is mostly Clear, Few clouds, Partly cloudy.
In [163]: col_numerical = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']
plt.figure(figsize=(12, 8))
sns.histplot(df[col_numerical[-1]], kde=True, color="r")
plt.show()
1. casual, registered and count look roughly log-normally distributed.
2. temp, atemp and humidity appear to follow a normal distribution.
1. Regardless of season, weather has a strong impact: demand is highest in clear weather, lower in mist, lower still in light snow, and lowest in heavy rain, as the plot above shows.
2. Demand for Yulu bikes is higher on working days, since they are used as transport to commute to offices.
3. On weekdays as well as holidays/weekends, demand for Yulu bikes is high when the weather is clear or has few clouds.
1. In the spring season the total count of rental bikes is higher than in other seasons.
2. Whenever there is rain, thunderstorm, snow or fog, fewer bikes were rented.
Bivariate Analysis
In [166]: col_numerical = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']
fig, axis = plt.subplots(nrows=3, ncols=2, figsize=(16, 14))
index = 0
for row in range(3):
    for col in range(2):
        sns.boxplot(x=df[col_numerical[index]], ax=axis[row, col], color="aqua")
        index += 1
plt.show()
sns.boxplot(x=df[col_numerical[-1]])
plt.show()
In [167]: sns.pairplot(df, kind='reg', hue="weather")
Out[167]: <seaborn.axisgrid.PairGrid at 0x29d29fd5310>
In [168]: plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), annot=True, cmap="Blues", linewidth=.5)
Out[168]: <AxesSubplot:>
To verify the ANOVA assumptions we plot QQ plots of all the numerical attributes and can observe:
1. casual, registered and count look roughly log-normal; their points are not aligned to the red reference line.
2. temp, atemp and humidity appear close to normal; their points are aligned to the red reference line.
3. windspeed is right-skewed; its points are not aligned to the red reference line.
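The QQ-plot cell itself is not shown in the export. A sketch of what such plots look like, using scipy's `probplot` on synthetic stand-ins for the columns (the notebook imports statsmodels' `qqplot`, which produces equivalent plots):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
normal_like = rng.normal(20, 8, 1000)    # stand-in for temp/atemp/humidity
skewed_like = rng.lognormal(4, 1, 1000)  # stand-in for casual/registered/count

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(normal_like, dist="norm", plot=axes[0])
axes[0].set_title("normal-like: points hug the reference line")
stats.probplot(skewed_like, dist="norm", plot=axes[1])
axes[1].set_title("skewed: points bend away from the line")
plt.show()
```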
2-Sample T-Test
Performing a 2-sample t-test on working-day and non-working-day counts.
Null hypothesis H0: the mean rental count on non-working days equals the mean rental count on working days.
Alternate hypothesis Ha: the mean rental count on non-working days does not equal the mean rental count on working days.
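The cells that build the two groups are not visible in the export; the names `df_workingday_count` and `df_non_workingday_count` below are taken from the later test cells, and the data here is a synthetic stand-in:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-in for the bike-sharing frame
df = pd.DataFrame({"workingday": rng.integers(0, 2, 500),
                   "count": rng.poisson(190, 500)})

# Split rental counts by working-day status, as the later cells assume
df_workingday_count = df.loc[df["workingday"] == 1, "count"]
df_non_workingday_count = df.loc[df["workingday"] == 0, "count"]

t_stat, p_value = ttest_ind(df_workingday_count, df_non_workingday_count)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```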
In [170]: df.loc[df['workingday']==1]['count'].plot(kind='kde')
Out[170]: <AxesSubplot:ylabel='Density'>
p_value = [0.22644804]
Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
So we can say that working-day status has no effect on the rental counts.
Hypothesis Testing
1. 2-Sample T-Test to check if Working Day has an effect on the number of electric cycles rented.
2. ANOVA to check if the number of cycles rented is similar or different across different weather conditions and seasons.
3. Chi-square test to check if Weather is dependent on the season.
Out[173]: Ttest_indResult(statistic=109.95076974934595, pvalue=0.0)
Out[174]: 191.57413191254824
1. Working Day has an effect on the number of electric cycles rented
2. Number of cycles rented is similar or different in different seasons
3. Number of cycles rented is similar or different in different weather
4. Weather is dependent on season (a check between 2 categorical variables)
The first 3 statements involve one numerical variable (count) and one categorical variable (workingday, season or weather), so for these we use a t-test or ANOVA (numeric vs. categorical). The 4th involves two categorical variables, so we use the chisquare or chi2_contingency test.
Out[175]: 191.57413191254824
Out[176]: 193.01187263896384
Out[177]: 188.50662061024755
Using ANOVA
In [178]: #H0 = Working day does not have any effect on the number of cycles rented.
#HA = Working day has an effect on the number of cycles rented.
#Test statistic and p_value
#We consider alpha = 0.05, i.e. 95% confidence
alpha = 0.05
f_stat, p_value = f_oneway(df_workingday_count, df_non_workingday_count)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")
Using ttest
In [179]: #H0 = Working day does not have any effect on the number of cycles rented.
#HA = Working day has a positive effect on the number of cycles rented, i.e. mu1 > mu2 (right-tailed).
#Test statistic and p_value
#We consider alpha = 0.01, i.e. 99% confidence
alpha = 0.01
t_stat, p_value = ttest_ind(df_workingday_count, df_non_workingday_count, alternative="greater")
print(f"Test statistic = {t_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")
In [182]: #We take samples from each group to pass to shapiro, since the Shapiro-Wilk
#normality test works best with moderate sample sizes (roughly 50 to 200 values),
#so a subset of 100 values was drawn from each group.
Out[183]: 1.0147116860043298e-118
In all of the above 4 tests the p-value is almost 0 (on the order of 1e-6 or smaller), which is less than alpha, so we reject the null hypothesis that these samples come from a normal distribution.
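The sampling-then-Shapiro step described above can be sketched as follows, on synthetic right-skewed data standing in for one group's rental counts:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
# Stand-in for one group's rental counts (right-skewed, like `count`)
counts = rng.lognormal(mean=4.5, sigma=1.0, size=3000)

# Shapiro-Wilk works best on moderate samples, so test a subset of 100 values
sample = rng.choice(counts, size=100, replace=False)
stat, p_value = shapiro(sample)
print(f"W = {stat:.3f}, p = {p_value:.2e}")
if p_value < 0.01:
    print("Reject normality for this group")
```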
In [188]: #H0 = Season does not have any effect on the number of cycles rented.
#HA = At least one season out of four (1: spring, 2: summer, 3: fall, 4: winter) has a different mean number of cycles rented.
#Test statistic and p_value
#We consider alpha = 0.01, i.e. 99% confidence
alpha = 0.01
f_stat, p_value = f_oneway(df_season1_spring, df_season2_summer, df_season3_fall, df_season4_winter)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")
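The cells that build the four season groups are not shown in the export; the names in the `f_oneway` call above suggest splits like the following, sketched here on synthetic data:

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Synthetic stand-in: season codes 1-4 with Poisson rental counts
df = pd.DataFrame({"season": rng.integers(1, 5, 800),
                   "count": rng.poisson(190, 800)})

# One Series of rental counts per season, as the ANOVA cell assumes
df_season1_spring = df.loc[df["season"] == 1, "count"]
df_season2_summer = df.loc[df["season"] == 2, "count"]
df_season3_fall = df.loc[df["season"] == 3, "count"]
df_season4_winter = df.loc[df["season"] == 4, "count"]

f_stat, p_value = f_oneway(df_season1_spring, df_season2_summer,
                           df_season3_fall, df_season4_winter)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```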
In [189]: #As we have 4 different weather conditions, a two-sample t-test will not work here; we need ANOVA.
Out[190]: 205.23679087875416
Out[191]: 178.95553987297106
Out[192]: 118.84633294528521
Out[193]: 164.0
Out[195]: 3.504937946833238e-35
Out[197]: ShapiroResult(statistic=0.8909225463867188, pvalue=0.0)
In [198]: shapiro(df_weather2_Mist)
Out[198]: ShapiroResult(statistic=0.8767688274383545, pvalue=9.781063280987223e-43)
In [199]: shapiro(df_weather3_LightSnow)
Out[199]: ShapiroResult(statistic=0.7674333453178406, pvalue=3.876134581802921e-33)
Using ANOVA
In [200]: #H0 = Weather does not have any effect on the number of cycles rented.
#HA = At least one weather condition out of four (1: clear, 2: mist, 3: light snow, 4: heavy rain) has a different mean number of cycles rented.
#Test statistic and p_value
#We consider alpha = 0.01, i.e. 99% confidence
alpha = 0.01
f_stat, p_value = f_oneway(df_weather1_clear, df_weather2_Mist, df_weather3_LightSnow, df_weather4_HeavyRain)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")
In [201]: #H0 = Weather does not have any effect on the number of cycles rented.
#HA = At least one weather condition out of three (1: clear, 2: mist, 3: light snow) has a different mean number of cycles rented.
#Weather 4 is excluded here since it has only a single observation.
#We consider alpha = 0.01, i.e. 99% confidence
alpha = 0.01
f_stat, p_value = f_oneway(df_weather1_clear, df_weather2_Mist, df_weather3_LightSnow)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")
As we can see, the p-value is extremely low, so we reject the null hypothesis: weather 4 has a negligible rent count while clear weather (and, to a lesser extent, mist and light snow) sees a good number of bikes rented. So weather does impact demand; the groups are not all similar.
Using the chi-square test
In [202]: val = pd.crosstab(index=df["weather"], columns=df["season"])
print(val)
chisquare(val)
season 1 2 3 4
weather
1 1759 1801 1930 1702
2 715 708 604 807
3 211 224 199 225
4 1 0 0 0
Out[202]: Power_divergenceResult(statistic=array([2749.33581534, 2821.39590194, 3310.63995609, 2531.07388442]), pvalue=array([0., 0., 0., 0.]))
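Note that `chisquare(val)` runs a separate goodness-of-fit test per season column; for testing independence of two categorical variables, `chi2_contingency` on the crosstab is the appropriate call. A sketch using the contingency table printed above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Weather x season contingency table from the crosstab printed above
val = np.array([[1759, 1801, 1930, 1702],
                [715, 708, 604, 807],
                [211, 224, 199, 225],
                [1, 0, 0, 0]])

chi2, p_value, dof, expected = chi2_contingency(val)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3e}")
if p_value < 0.01:
    print("Reject H0: weather is dependent on season")
```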
We reject the null hypothesis that weather is independent of season at significance 0.01: the p-value comes out very low, so these two attributes are strongly dependent on each other.
Insights
A 2-sample t-test on working and non-working days with respect to count implies that the mean population counts of the two categories are the same.
An ANOVA test on different seasons with respect to count implies that the population count means under different seasons are not the same, i.e. there is a difference in the usage of Yulu bikes across seasons.
An ANOVA test on the different weather conditions (excluding 4) with respect to count implies that the population count means under different weather conditions are not the same, i.e. there is a difference in the usage of Yulu bikes across weather conditions.
A chi-square test on season and weather (categorical variables) implies that weather is dependent on season.
The maximum number of holidays is seen during the fall and winter seasons.
Counts are higher when the weather is clear with few clouds, as supported by the ANOVA hypothesis test.
Recommendations:
As casual users are very few, Yulu should focus on marketing strategies to bring in more customers, e.g. first-time-user discounts, friends-and-family discounts, referral bonuses, etc.
On non-working days, when the count is very low, Yulu can consider promotional activities such as city-exploration competitions or health campaigns.
In heavy rain, when the rent count is very low, Yulu could introduce a different vehicle type with shade or protection from the rain.