
Business Case: Yulu - Hypothesis Testing


About Yulu
Yulu is India's leading micro-mobility service provider, offering unique vehicles for the daily commute. Founded with a mission to eliminate traffic congestion in India, Yulu provides a safe commute solution through a user-friendly mobile app that enables shared, solo and sustainable commuting.

Yulu zones are located at all the appropriate locations (including metro stations, bus stands, office spaces, residential areas and corporate offices) to make those first and last miles smooth, affordable and convenient.

Yulu has recently suffered considerable dips in its revenue. It has contracted a consulting company to understand the factors on which the demand for these shared electric cycles depends; specifically, it wants to understand the factors affecting demand for shared electric cycles in the Indian market.


Business Problem
The company wants to know:

1. Which variables are significant in predicting the demand for shared electric cycles in the Indian market?
2. How well do those variables describe the demand for electric cycles?


Attribute Information
datetime: datetime

season: season (1: spring, 2: summer, 3: fall, 4: winter)

holiday: whether day is a holiday or not

workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0

weather:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp: temperature in Celsius

atemp: feeling temperature in Celsius

humidity: relative humidity

windspeed: wind speed

casual: count of casual users

registered: count of registered users

count: count of total rental bikes including both casual and registered


Problem Statement
1. Define the problem statement and perform exploratory data analysis.

2. Univariate and bivariate analysis.

3. Hypothesis testing:
a. 2-sample t-test to check if working day has an effect on the number of electric cycles rented
b. ANOVA to check if the number of cycles rented is similar or different across (1) weather and (2) season
c. Chi-square test to check if weather is dependent on season


Import Libraries and Data.


In [143]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_ind, ttest_1samp, ttest_rel, chi2_contingency, f_oneway, chisquare, levene, shapiro, boxcox
%matplotlib inline
from statsmodels.graphics.gofplots import qqplot

import warnings
warnings.filterwarnings('ignore')

In [144]: df = pd.read_csv("C:\\Users\\vidya\\Downloads\\bike_sharing.txt")

In [145]: df

Out[145]: datetime season holiday workingday weather temp atemp humidity windspeed casual registered count

0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0000 3 13 16

1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0000 8 32 40

2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0000 5 27 32

3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0000 3 10 13

4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0000 0 1 1

... ... ... ... ... ... ... ... ... ... ... ... ...

10881 2012-12-19 19:00:00 4 0 1 1 15.58 19.695 50 26.0027 7 329 336

10882 2012-12-19 20:00:00 4 0 1 1 14.76 17.425 57 15.0013 10 231 241

10883 2012-12-19 21:00:00 4 0 1 1 13.94 15.910 61 15.0013 4 164 168

10884 2012-12-19 22:00:00 4 0 1 1 13.94 17.425 61 6.0032 12 117 129

10885 2012-12-19 23:00:00 4 0 1 1 13.12 16.665 66 8.9981 4 84 88

10886 rows × 12 columns

In [146]: df.shape

Out[146]: (10886, 12)

In [147]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null object
1 season 10886 non-null int64
2 holiday 10886 non-null int64
3 workingday 10886 non-null int64
4 weather 10886 non-null int64
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int64
8 windspeed 10886 non-null float64
9 casual 10886 non-null int64
10 registered 10886 non-null int64
11 count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB

In [148]: df.isna().sum()

Out[148]: datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

There are no missing values present in the dataframe.

In [149]: df[df.duplicated()]

Out[149]: datetime season holiday workingday weather temp atemp humidity windspeed casual registered count

There are no duplicate rows present in the Yulu data.


Datatype Validation
In [150]: df.dtypes

Out[150]: datetime      object
season        int64
holiday       int64
workingday    int64
weather       int64
temp          float64
atemp         float64
humidity      int64
windspeed     float64
casual        int64
registered    int64
count         int64
dtype: object

column "datetime" dtype is not datetime dtype. conversion of dtype of "datetime" column.

In [151]: df['datetime'] = pd.to_datetime(df['datetime'])

In [152]: df.describe()

Out[152]: season holiday workingday weather temp atemp humidity windspeed casual registered count

count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000

mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132

std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454

min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000

25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000

50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000

75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000

max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000

The average temperature is 20.23 degrees Celsius, and the median temperature is 20.5 degrees Celsius.

About 68% of the data points were collected on working days, which makes sense as a lot of people use public transportation on working days.

75% of the records have a temperature of at most 26.24 degrees Celsius.
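Since workingday is a 0/1 column, its mean is exactly the fraction of working-day records, so the 68% figure can be verified directly; a minimal check (output not shown in the original notebook):

In [ ]: # Mean of a 0/1 column = fraction of records on working days (about 0.68)
df['workingday'].mean()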


In [153]: df.nunique()

Out[153]: datetime      10886
season            4
holiday           2
workingday        2
weather           4
temp             49
atemp            60
humidity         89
windspeed        28
casual          309
registered      731
count           822
dtype: int64

season, holiday, workingday and weather are categorical variables, so we update their dtypes accordingly.

In [154]: # Changing these columns from int64 dtype to category

df["season"] = df["season"].astype("category")
df["holiday"] = df["holiday"].astype("category")
df["workingday"] = df["workingday"].astype("category")
df["weather"] = df["weather"].astype("category")

In [155]: df.dtypes

Out[155]: datetime      datetime64[ns]
season        category
holiday       category
workingday    category
weather       category
temp          float64
atemp         float64
humidity      int64
windspeed     float64
casual        int64
registered    int64
count         int64
dtype: object


Derived Column
In [156]: bins = [0, 40, 100, 200, 300, 500, 700, 900, 1000]
group = ['Low', 'Average', 'medium', 'H1', 'H2', 'H3', 'H4', 'Very high']
df['Rent_count'] = pd.cut(df['count'], bins, labels=group)  # Create a new categorical column
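A quick sanity check on the new bins is to count how many hours fall into each label; a minimal sketch (this check is not part of the original notebook):

In [ ]: # Frequency of each Rent_count bin, in bin order
df['Rent_count'].value_counts().sort_index()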

In [157]: df

Out[157]: datetime season holiday workingday weather temp atemp humidity windspeed casual registered count Rent_count

0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0000 3 13 16 Low

1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0000 8 32 40 Low

2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0000 5 27 32 Low

3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0000 3 10 13 Low

4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0000 0 1 1 Low

... ... ... ... ... ... ... ... ... ... ... ... ... ...

10881 2012-12-19 19:00:00 4 0 1 1 15.58 19.695 50 26.0027 7 329 336 H2

10882 2012-12-19 20:00:00 4 0 1 1 14.76 17.425 57 15.0013 10 231 241 H1

10883 2012-12-19 21:00:00 4 0 1 1 13.94 15.910 61 15.0013 4 164 168 medium

10884 2012-12-19 22:00:00 4 0 1 1 13.94 17.425 61 6.0032 12 117 129 medium

10885 2012-12-19 23:00:00 4 0 1 1 13.12 16.665 66 8.9981 4 84 88 Average

10886 rows × 13 columns


Value counts and unique values


In [158]: df.season.value_counts()

Out[158]: 4    2734
2    2733
3    2733
1    2686
Name: season, dtype: int64


In [159]: df.weather.value_counts()

Out[159]: 1    7192
2    2834
3     859
4       1
Name: weather, dtype: int64

In [160]: df.workingday.nunique()

Out[160]: 2

In [161]: df.humidity.nunique()

Out[161]: 89

There are 4 seasons in the Yulu dataset, and the number of records is almost equal across all of them.

Weather 1 (Clear, Few clouds, Partly cloudy) shows much higher demand for shared electric cycles than the other weather conditions.

There are only two unique values for working day: 1 for weekdays and 0 for weekends or holidays.


Univariate Analysis
In [162]: col_category = ["season", "holiday", "workingday", "weather"]

fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(14, 12))

index = 0
for row in range(2):
    for col in range(2):
        cp = sns.countplot(df[col_category[index]], ax=axis[row, col], palette="cubehelix")
        cp.bar_label(cp.containers[0])
        index += 1

plt.show()

1. Almost all seasons have the same record count; the differences are negligible.

2. The holiday and working-day plots are highly imbalanced: most records fall on non-holidays, and working days outnumber non-working days.

3. For weather, weather 1 (clear) has the maximum demand for bikes; demand decreases as the weather changes to mist and then light snow, and is almost negligible in heavy rain, since riding a bike in such conditions is much riskier.

4. One more categorical variable was created to bin the number of bicycles rented into Low, medium, high, etc.; it shows a log-normal-like shape, with Low occurring most often and the higher bins progressively less often.

5. The data looks as expected: an equal number of days in each season, more working days than non-working days, and weather that is mostly clear, few clouds or partly cloudy.


In [163]: # Define the numerical columns here (in the original notebook this list
# was only defined later, in In [166], so this cell would fail if run in order)
col_numerical = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']

fig, axis = plt.subplots(nrows=3, ncols=2, figsize=(16, 20))

index = 0
for row in range(3):
    for col in range(2):
        sns.histplot(df[col_numerical[index]], ax=axis[row, col], kde=True, color="green")
        index += 1
plt.show()

plt.figure(figsize=(12,8))
sns.histplot(df[col_numerical[-1]], kde=True, color="r")
plt.show()

1. casual, registered and count look roughly log-normally distributed.

2. temp, atemp and humidity look approximately normally distributed.

3. windspeed is right-skewed rather than normal.
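If count is really log-normal-like, its logarithm should look roughly Gaussian. A minimal sketch of that check (not part of the original notebook; np.log is safe here because the minimum count is 1):

In [ ]: # If count is ~ log-normal, log(count) should look approximately normal
log_count = np.log(df['count'])
plt.figure(figsize=(10, 6))
sns.histplot(log_count, kde=True, color="green")
plt.show()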


In [164]: fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(18, 12))


sns.countplot(data=df, x='season', ax=axs[0,0],hue=df.weather,palette="Accent")
sns.countplot(data=df, x='holiday', ax=axs[0,1],hue=df.weather,palette="twilight_shifted")
sns.countplot(data=df, x='workingday', ax=axs[1,0],hue=df.weather,palette="mako")
sns.countplot(data=df, x='weather', ax=axs[1,1],hue=df.workingday,palette="cubehelix")
plt.show()

1. Whatever the season, weather has a strong impact: clear weather shows the most demand, then mist, then light snow, with heavy rain showing the least demand in the plot above.

2. Demand for Yulu bikes is higher on working days, as they are used as transport for commuting to offices.

3. On weekdays, holidays and weekends alike, demand for Yulu bikes is high when the weather is clear or slightly cloudy.


In [165]: fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(18, 15))


sns.countplot(data=df, x='Rent_count', ax=axs[0,0],hue=df.weather,palette="Accent")
sns.countplot(data=df, x='Rent_count', ax=axs[0,1],hue=df.workingday,palette="twilight_shifted")
sns.countplot(data=df, x='Rent_count', ax=axs[1,0],hue=df.season,palette="cubehelix")
sns.countplot(data=df, x='Rent_count', ax=axs[1,1],hue=df.holiday,palette="cubehelix")
plt.show()

1. In spring, hours with low rental counts occur more often than in the other seasons.

2. Whenever there is rain, thunderstorm, snow or fog, fewer bikes were rented.


Bivariate analysis
In [166]: col_numerical = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']
fig, axis = plt.subplots(nrows=3, ncols=2, figsize=(16, 14))
index = 0
for row in range(3):
    for col in range(2):
        sns.boxplot(df[col_numerical[index]], ax=axis[row, col], color="aqua")
        index += 1

plt.show()
sns.boxplot(df[col_numerical[-1]])
plt.show()


In [167]: sns.pairplot(df, kind='reg', hue="weather")

Out[167]: <seaborn.axisgrid.PairGrid at 0x29d29fd5310>


In [168]: plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), annot=True, cmap="Blues", linewidth=.5)

Out[168]: <AxesSubplot:>


In [169]: fig, axis = plt.subplots(nrows=2, ncols=3, figsize=(16, 12))

index = 0
for row in range(2):
    for col in range(3):
        qqplot(df[col_numerical[index]], line="s", ax=axis[row, col])
        index += 1

qqplot(df[col_numerical[-1]], line="s")

plt.show()

To verify the ANOVA assumptions we have plotted QQ plots of all the numerical attributes and can observe:

1. casual, registered and count look roughly log-normal and are not aligned with the red "s" line.

2. temp, atemp and humidity look approximately normal and are aligned with the red "s" line.

3. windspeed is right-skewed and is not aligned with the red "s" line.


2-sample t-test
Performing a 2-sample t-test on working-day and non-working-day counts.

Taking the significance level (alpha) as 0.05 for all tests.

Considering: Null hypothesis H0 = the mean count of bikes on non-working days is equal to the mean count of bikes on working days.

Alternate hypothesis Ha = the mean count of bikes on non-working days is not equal to the mean count of bikes on working days.


In [170]: df.loc[df['workingday']==1]['count'].plot(kind='kde')

Out[170]: <AxesSubplot:ylabel='Density'>

In [171]: # The distribution does not follow a normal distribution

df1 = df.loc[df['workingday']==1]['count'].reset_index()
df1.drop(['index'], axis=1, inplace=True)
df2 = df.loc[df['workingday']==0]['count'].reset_index()
df2.drop(['index'], axis=1, inplace=True)
ttest, p_value = ttest_ind(df1, df2)
print("p_value = ", p_value)

p_value = [0.22644804]

Since the p-value is greater than 0.05, we fail to reject the null hypothesis.

So we can say that working-day status has no significant effect on bike counts.


Hypothesis Testing
2-sample t-test to check if working day has an effect on the number of electric cycles rented. ANOVA to check if the number of cycles rented is similar or different across (1) weather and (2) season. Chi-square test to check if weather is dependent on season.

1. Working Day has an effect on the number of electric cycles rented.


In [172]: # Note: Levene's test should compare 'count' across the two working-day
# groups rather than against the 0/1 indicator itself; the grouped comparison
# is done below.
alpha = 0.05
t_stat, p_value = levene(df["count"], df["workingday"])
p_value

In [173]: # This compares 'count' against the 0/1 workingday column itself, which is
# not a meaningful test of the working-day effect; the proper two-group t-test
# follows below.
ttest_ind(df["count"], df["workingday"])

Out[173]: Ttest_indResult(statistic=109.95076974934595, pvalue=0.0)

In [174]: population_mean_count = df["count"].mean()
population_mean_count

Out[174]: 191.57413191254824

Select an appropriate test to check whether:

1. Working day has an effect on the number of electric cycles rented
2. No. of cycles rented is similar or different in different seasons
3. No. of cycles rented is similar or different in different weather
4. Weather is dependent on season (check between 2 predictor variables)

The first 3 statements each involve one numerical variable (count) and one categorical variable (working day, season or weather), so for these questions we use a t-test or ANOVA (numeric vs. categorical). The 4th involves two categorical variables, so we use the chisquare or chi2_contingency test.


In [175]: # 1. Working day has an effect on the number of electric cycles rented

population_mean_count = df["count"].mean()
population_mean_count

Out[175]: 191.57413191254824

In [176]: df_workingday_count = df[df["workingday"] == 1]["count"]
df_workingday_count.mean()

Out[176]: 193.01187263896384

In [177]: df_non_workingday_count = df[df["workingday"] == 0]["count"]
df_non_workingday_count.mean()

Out[177]: 188.50662061024755


Using ANOVA
In [178]: # H0 = Working day does not have any effect on the number of cycles rented.
# HA = Working day has an effect on the number of cycles rented (the group means differ).
# Test statistic and p_value
# We will consider alpha as 0.05 significance value, i.e. 95% confidence
alpha = 0.05
f_stat, p_value = f_oneway(df_workingday_count, df_non_workingday_count)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

Test statistic = 1.4631992635777575 pvalue = 0.22644804226428558
Fail to reject Null Hypothesis


Using t-test
In [179]: # H0 = Working day does not have any effect on the number of cycles rented.
# HA = Working day has a positive effect on the number of cycles rented, i.e. mu1 > mu2
# We consider it to be right-tailed.
# Test statistic and p_value
# We will consider alpha as 0.01 significance value, i.e. 99% confidence
alpha = 0.01
t_stat, p_value = ttest_ind(df_workingday_count, df_non_workingday_count, alternative="greater")
print(f"Test statistic = {t_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

Test statistic = 1.2096277376026694 pvalue = 0.11322402113180674
Fail to reject Null Hypothesis
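Since both tests fail to reject the null hypothesis, it can help to quantify how small the working-day effect actually is. A hedged sketch computing Cohen's d, an effect-size measure not used in the original notebook (values near 0 indicate a negligible difference):

In [ ]: # Cohen's d: standardized mean difference between the two groups
n1, n2 = len(df_workingday_count), len(df_non_workingday_count)
pooled_var = ((n1 - 1) * df_workingday_count.var() +
              (n2 - 1) * df_non_workingday_count.var()) / (n1 + n2 - 2)
cohens_d = (df_workingday_count.mean() - df_non_workingday_count.mean()) / np.sqrt(pooled_var)
print(f"Cohen's d = {cohens_d:.4f}")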


2. No. of cycles rented similar or different in different seasons

In [180]: # As we have 4 different seasons, a t-test will not work here; we need to use ANOVA

In [181]: df_season1_spring = df[df["season"] == 1]["count"]
df_season1_spring_subset = df_season1_spring.sample(100)

df_season2_summer = df[df["season"] == 2]["count"]
df_season2_summer_subset = df_season2_summer.sample(100)

df_season3_fall = df[df["season"] == 3]["count"]
df_season3_fall_subset = df_season3_fall.sample(100)

df_season4_winter = df[df["season"] == 4]["count"]
df_season4_winter_subset = df_season4_winter.sample(100)

In [182]: # We have taken samples of each dataframe to pass to the Shapiro test

Checking the assumptions:

In [183]: # Levene's Test

# H0 = All samples have equal variance
# HA = At least one sample will have different variance
t_stat, p_value = levene(df_season1_spring, df_season2_summer, df_season3_fall, df_season4_winter)
p_value

Out[183]: 1.0147116860043298e-118

Shapiro: test for normality. We take samples of the available data, as the test works well with 50 to 200 values, so we have created subsets of 100 values each.

In [184]: # H0 = Sample is drawn from a Normal Distribution
# HA = Sample is not from a Normal Distribution
# Here we are considering alpha (significance value) as 0.05
t_stat, pvalue = shapiro(df_season1_spring_subset)
if pvalue < 0.05:
    print("Reject H0 Data is not Gaussian")
else:
    print("Fail to reject Data is Gaussian")

Reject H0 Data is not Gaussian

In [185]: t_stat, pvalue = shapiro(df_season2_summer_subset)
if pvalue < 0.05:
    print("Reject H0 Data is not Gaussian")
else:
    print("Fail to reject Data is Gaussian")

Reject H0 Data is not Gaussian

In [186]: t_stat, pvalue = shapiro(df_season3_fall_subset)
if pvalue < 0.05:
    print("Reject H0 Data is not Gaussian")
else:
    print("Fail to reject Data is Gaussian")

Reject H0 Data is not Gaussian

In [187]: t_stat, pvalue = shapiro(df_season4_winter_subset)
if pvalue < 0.05:
    print("Reject H0 Data is not Gaussian")
else:
    print("Fail to reject Data is Gaussian")

Reject H0 Data is not Gaussian

In all four tests above the p-value is almost 0 (on the order of 10^-6 or smaller), which is less than alpha, so we reject the null hypothesis that these samples come from a normal distribution.
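Because the normality and equal-variance assumptions are rejected, a non-parametric alternative such as the Kruskal-Wallis test can be used to corroborate the ANOVA result below (the same check applies to the weather groups later). A minimal sketch, importing kruskal, which is not among the original imports:

In [ ]: # Kruskal-Wallis: rank-based test that does not assume normality or equal variances
from scipy.stats import kruskal
h_stat, p_value = kruskal(df_season1_spring, df_season2_summer, df_season3_fall, df_season4_winter)
print(f"H statistic = {h_stat} pvalue = {p_value}")
if p_value < 0.01:
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")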


In [188]: # H0 = Season does not have any effect on the number of cycles rented.
# HA = At least one season out of four (1: spring, 2: summer, 3: fall, 4: winter) has a different mean number of cycles rented.
# Test statistic and p_value
# We will consider alpha as 0.01 significance value, i.e. 99% confidence
alpha = 0.01
f_stat, p_value = f_oneway(df_season1_spring, df_season2_summer, df_season3_fall, df_season4_winter)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

Test statistic = 236.94671081032106 pvalue = 6.164843386499654e-149
Reject Null Hypothesis


3. No. of cycles rented similar or different in different weather



In [189]: # As we have 4 different weather conditions, a t-test will not work here; we need to use ANOVA

In [190]: df_weather1_clear = df[df["weather"] == 1]["count"]
df_weather1_clear.mean()

Out[190]: 205.23679087875416

In [191]: df_weather2_Mist = df[df["weather"] == 2]["count"]
df_weather2_Mist.mean()

Out[191]: 178.95553987297106

In [192]: df_weather3_LightSnow = df[df["weather"] == 3]["count"]
df_weather3_LightSnow.mean()

Out[192]: 118.84633294528521

In [193]: df_weather4_HeavyRain = df[df["weather"] == 4]["count"]
df_weather4_HeavyRain.mean()

Out[193]: 164.0

In [194]: # Levene's test checks for equality of variance

In [195]: # H0 = All samples have equal variance
# HA = At least one sample will have different variance
t_stat, p_value = levene(df_weather1_clear, df_weather2_Mist, df_weather3_LightSnow, df_weather4_HeavyRain)
p_value

Out[195]: 3.504937946833238e-35

In [196]: # Shapiro: test for normality

In [197]: # H0 = Sample is drawn from a Normal Distribution
# HA = Sample is not from a Normal Distribution
# Here we are considering alpha (significance value) as 0.05
shapiro(df_weather1_clear)

Out[197]: ShapiroResult(statistic=0.8909225463867188, pvalue=0.0)

In [198]: shapiro(df_weather2_Mist)

Out[198]: ShapiroResult(statistic=0.8767688274383545, pvalue=9.781063280987223e-43)

In [199]: shapiro(df_weather3_LightSnow)

Out[199]: ShapiroResult(statistic=0.7674333453178406, pvalue=3.876134581802921e-33)


Using ANOVA
In [200]: # H0 = Weather does not have any effect on the number of cycles rented.
# HA = At least one weather condition out of four (1: clear, 2: mist, 3: light snow, 4: heavy rain) has a different mean number of cycles rented.
# Test statistic and p_value
# We will consider alpha as 0.01 significance value, i.e. 99% confidence
alpha = 0.01
f_stat, p_value = f_oneway(df_weather1_clear, df_weather2_Mist, df_weather3_LightSnow, df_weather4_HeavyRain)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

Test statistic = 65.53024112793271 pvalue = 5.482069475935669e-42
Reject Null Hypothesis

In [201]: # Same test as above, but excluding weather 4 (which has only a single record)
# H0 = Weather does not have any effect on the number of cycles rented.
# HA = At least one weather condition out of three has a different mean number of cycles rented.
alpha = 0.01
f_stat, p_value = f_oneway(df_weather1_clear, df_weather2_Mist, df_weather3_LightSnow)
print(f"Test statistic = {f_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

Test statistic = 98.28356881946706 pvalue = 4.976448509904196e-43
Reject Null Hypothesis

As we can see, the p-value is extremely low, so we reject the null hypothesis: weather 4 has a negligible rent count, while the other conditions see substantial and clearly different rental numbers. So weather does have an impact, and rentals are not similar across weather conditions.


4. Weather is dependent on season (check between 2 predictor variables)

Using chisquare_test
In [202]: val = pd.crosstab(index=df["weather"], columns=df["season"])
print(val)
# Note: chisquare() runs a column-wise goodness-of-fit test against a uniform
# distribution; the independence test is chi2_contingency, used below.
chisquare(val)

season     1     2     3     4
weather
1       1759  1801  1930  1702
2        715   708   604   807
3        211   224   199   225
4          1     0     0     0

Out[202]: Power_divergenceResult(statistic=array([2749.33581534, 2821.39590194, 3310.63995609, 2531.07388442]), pvalue=array([0., 0., 0., 0.]))



Using chi2_contingency test


In [204]: # H0 = Weather is not dependent on (is independent of) season.
# HA = Weather is dependent on season.
# Test statistic and p_value
# We will consider alpha as 0.01 significance value, i.e. 99% confidence
alpha = 0.01
val = pd.crosstab(index=df["weather"], columns=df["season"])
# 'dof' (degrees of freedom) is deliberately not named 'df', to avoid shadowing the dataframe
chi_stat, p_value, dof, expected = chi2_contingency(val)
print(f"Test statistic = {chi_stat} pvalue = {p_value}")
if (p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

Test statistic = 49.15865559689363 pvalue = 1.5499250736864862e-07
Reject Null Hypothesis

We reject the null hypothesis that weather is independent of season: at significance 0.01 the p-value comes out very low, so these two attributes are dependent on each other.
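The chi-square p-value establishes dependence but says nothing about its strength; Cramér's V (not computed in the original notebook) is one common way to put the association on a 0-to-1 scale:

In [ ]: # Cramér's V for the weather x season contingency table
# (0 = no association, 1 = perfect association)
n = val.values.sum()
min_dim = min(val.shape) - 1
cramers_v = np.sqrt(chi_stat / (n * min_dim))
print(f"Cramér's V = {cramers_v:.4f}")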


Insights
A 2-sample t-test on working and non-working days with respect to count suggests that the population mean counts of the two categories are statistically indistinguishable (we fail to reject equality).

An ANOVA test on the different seasons with respect to count implies that the population count means under different seasons are not the same, meaning there is a difference in the usage of Yulu bikes across seasons.

An ANOVA test on the different weather conditions (excluding 4) with respect to count implies that the population count means under different weather conditions are not the same, meaning there is a difference in the usage of Yulu bikes across weather conditions.

A chi-square test on season and weather (two categorical variables) implies that weather is dependent on season.

The maximum number of holidays can be seen during the fall and winter seasons.

There is a positive correlation between counts and temperature.

There is a negative correlation between counts and humidity.

Counts are higher when the weather is clear with few clouds, as supported by the ANOVA hypothesis test.


Recommendations:
As casual users are very few, Yulu should focus on marketing strategies to bring in more customers, e.g. first-time-user discounts, friends-and-family discounts, referral bonuses, etc.

As the count on non-working days is low, Yulu can consider promotional activities on those days, such as city-exploration competitions and health campaigns.

As the rent count in heavy rain is very low, Yulu could introduce vehicles with shade or other protection from the rain.
