Data analysis and visualization in R - Final Paper: Bike Sharing Dataset Analysis
Anna Martin
3 February 2016

Dataset Description
The dataset was obtained from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

# import data set

hour <- read.csv("~/Dropbox/Project/DataSets/Bike-Sharing-Dataset/hour.csv")

Number of columns:

ncol(hour)

## [1] 17

Number of rows:

nrow(hour)

## [1] 17379

The data come from a two-year historical log, covering 2011 and 2012, of the Capital
Bikeshare system in Washington D.C., USA.

Bike sharing systems are a new generation of traditional bike rentals in which the whole process,
from membership to rental and return, has become automatic. Through these systems, a user can
easily rent a bike at one location and return it at another. Currently, there are over 500
bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is
great interest in these systems because of their important role in traffic, environmental and
health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of
the data they generate make them attractive for research. Unlike other transport services such as
bus or subway, these systems explicitly record the duration of travel and the departure and arrival
positions. This feature turns a bike sharing system into a virtual sensor network that can be used
for sensing mobility in the city. Hence, it is expected that most important events in the city
could be detected by monitoring these data.

Attribute Information:
instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
hr: hour (0 to 23)
holiday: whether the day is a holiday or not (extracted from [Web Link])
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min),
t_min=-8, t_max=+39 (only in hourly scale)
atemp: Normalized feeling temperature in Celsius. The values are derived via
(t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered

Questions
1. How do temperature values change over the seasons? What are the mean, standard deviation and
median of temperatures for each season?

2. For which weather condition is the total number of bike rentals lowest/highest?

3. Is there a correlation between total number of rentals and season? What are the mean, median
and standard deviation of the total number of rentals (count) per season? Which season is the
most popular for bike rentals?

4. Is the correlation between felt air temperature (atemp) and number of bike rentals significant?
Is there a difference between the correlations for the two years (2011 and 2012)?

5. Is weather condition correlated with the number of bike rentals? What are the minimum, maximum,
mean, median, standard deviation and number of occurrences for each weather condition? How does
weather condition influence the distribution of bike rentals?

6. Is there a significant difference between total bike rentals on holidays and working days?

Data preparation
1. Recode season values from 1-4 to Spring-Winter.

### TASK 1

names(hour)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"

# define a function for recoding values:

recodev <- function(original.vector,
                    old.values,
                    new.values) {
  new.vector <- original.vector
  # seq_along() is safe even if old.values is empty
  for (i in seq_along(old.values)) {
    # positions matching the i-th old value, ignoring missing values
    change.log <- original.vector == old.values[i] &
      !is.na(original.vector)
    new.vector[change.log] <- new.values[i]
  }
  return(new.vector)
}
# apply the function for recoding season values
hour$season <- recodev(original.vector = hour$season,
                       old.values = c(1:4),
                       new.values = c("spring", "summer", "fall",
                                      "winter"))

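The same recoding can also be sketched with base R's factor(), which maps codes to labels in a single call. This is an alternative to recodev, not the method used in this paper; season.codes is a made-up toy vector standing in for the original numeric hour$season.

```r
# Toy vector standing in for the original numeric season codes (assumption)
season.codes <- c(1, 2, 3, 4, 2, NA)

# Map codes 1-4 to labels; missing values stay missing
season.labels <- as.character(
  factor(season.codes,
         levels = 1:4,
         labels = c("spring", "summer", "fall", "winter"))
)
season.labels
```

Keeping the result as a factor (dropping as.character()) would additionally fix the order of the seasons in plots and tables.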
2. Rename columns “yr” and “mnth” to “year” and “month” and recode year
values (0: 2011, 1: 2012).

### TASK 1

# rename columns
names(hour)[4:5] <- c("year","month")
# recode year values
hour$year <- recodev(original.vector = hour$year,
old.values = c(0,1),
new.values = c(2011,2012))
# check column names
names(hour)

## [1] "instant" "dteday" "season" "year" "month"

## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"

3. Rename “hum” to “humidity” and “cnt” to “count”.

### TASK 1

# rename columns
names(hour)[names(hour)=="hum"] <- "humidity"
names(hour)[names(hour)=="cnt"] <- "count"
names(hour)
## [1] "instant" "dteday" "season" "year" "month"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "humidity" "windspeed" "casual"
## [16] "registered" "count"

4. Denormalise “temp” and “atemp” with a custom function.

### TASKS 10, 1

# create a function for denormalisation
# (inverts the normalisation (t - t_min) / (t_max - t_min))

tconvert <- function(min, max, vector){
  result <- vector * (max - min) + min
  return (result)
}

# apply the function and denormalise the temperature values

hour$temp <- tconvert(-8, 39, hour$temp)
hour$atemp <- tconvert(-16, 50, hour$atemp)
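
As a quick sanity check on the formula, the sketch below normalises a few Celsius values with (t - t_min)/(t_max - t_min) and converts them back. The helper tnormalise and the sample temperatures are illustrative assumptions, not part of the original analysis; tconvert is repeated so that the snippet runs on its own.

```r
# Denormalisation as defined in the paper
tconvert <- function(min, max, vector) {
  vector * (max - min) + min
}

# Hypothetical inverse helper: Celsius -> normalised [0, 1] scale
tnormalise <- function(min, max, vector) {
  (vector - min) / (max - min)
}

celsius    <- c(-8, 0, 15.5, 39)             # made-up sample temperatures
normalised <- tnormalise(-8, 39, celsius)    # t_min = -8, t_max = 39 as for temp
restored   <- tconvert(-8, 39, normalised)

# TRUE if the round trip reproduces the input
all.equal(restored, celsius)
```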

Analysis
1. How do temperature values change over the seasons? What are the mean,
standard deviation and median of temperatures for each season?

### TASKS 2, 9

# calculate min, max, median, st.dev and mean for each season

# by aggregation with dplyr library
library(dplyr)
hour.agg <- hour %>%
  group_by(season) %>%
  summarise(
    temp.min = min(temp),
    temp.max = max(temp),
    temp.med = median(temp),
    temp.stdev = sd(temp),
    temp.mean = mean(temp),
    count = n())
hour.agg

## Source: local data frame [4 x 7]

##
## season temp.min temp.max temp.med temp.stdev temp.mean count
## 1 fall 9.86 39.00 24.90 4.413428 25.201277 4496
## 2 spring -7.06 25.84 5.16 5.580120 6.059892 4242
## 3 summer -0.48 36.18 18.32 6.543958 17.599170 4409
## 4 winter -1.42 27.72 11.74 5.741867 11.887486 4232
### TASK 8

# create a boxplot for temperature by season

boxplot(temp ~ season,
data = hour,
xlab = "Season",
ylab = "Temperature",
main = "Temperature by Season",
col = "skyblue")

# check seasons and respective months

# fall months
unique(hour$month[hour$season=="fall"])

## [1] 6 7 8 9

# winter months
unique(hour$month[hour$season=="winter"])

## [1] 9 10 11 12

# spring months
unique(hour$month[hour$season=="spring"])

## [1] 1 2 3 12
# summer months
unique(hour$month[hour$season=="summer"])

## [1] 3 4 5 6

As can be seen from the analysis above, spring has both the lowest minimum temperature and the
lowest mean temperature (-7.06°C and 6.06°C respectively), while fall has both the highest maximum
and the highest mean (39.00°C and 25.20°C respectively). The boxplot clearly shows that the lowest
temperatures are typical for spring, followed by winter, while the highest temperatures belong to
fall, followed by summer. Such atypical temperature values can be explained by the shift of months
in the dataset: as the listings above show, the season labelled spring covers December to March,
and the one labelled fall covers June to September.

2. For which weather condition is the total number of bike rentals
lowest/highest?

### TASK 8

# create a beanplot for the number of bike rentals per weather condition
library("beanplot")
library("RColorBrewer")
bean.cols <- lapply(brewer.pal(6, "Set3"),
                    function(x){return(c(x, "black", "gray", "red"))})
beanplot(count ~ weathersit,
         data = hour,
         main = "Bike Rents by Weather Condition",
         xlab = "Weather Condition",
         ylab = "Number of rentals",
         col = bean.cols,
         lwd = 1,
         what = c(1,1,1,0),
         log = ""
)
The beanplot shows that rentals are lowest under the 4th weather type (heavy rain, thunderstorm
etc.), while the highest mean number of rentals occurs under the 1st weather type
(clear, partly cloudy etc.).

3. Is there a correlation between total number of rentals and season? What are
the mean, median and standard deviation of the total number of rentals (count)
per season? Which season is the most popular for bike rentals?

### TASK 11

# create a data frame

df <- data.frame(spring = rep(NA, 3),
                 winter = rep(NA, 3),
                 summer = rep(NA, 3),
                 fall = rep(NA, 3))
row.names(df) <- c("mean", "median", "sd")

# fill the data frame with corresponding mean, median and sd values
vec <- c("mean", "median", "sd")
for (n in vec){
  for (i in unique(hour$season)) {
    # look up the function (mean, median or sd) by its name
    my.fun <- get(n)
    res <- my.fun(hour$count[hour$season == i])
    df[n,i] <- res
  }
}
df
## spring winter summer fall
## mean 111.1146 198.8689 208.3441 236.0162
## median 76.0000 155.5000 165.0000 199.0000
## sd 119.2240 182.9680 188.3625 197.7116

From the numbers above we can see that fall has the highest mean, median and standard deviation
of total bike rentals (236.0162, 199 and 197.7116 respectively), while spring has the
lowest values (111.1146, 76 and 119.224 respectively).
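
The nested loop with get() can be written more compactly with split() and sapply(). The sketch below is one possible alternative, shown on a small invented data frame toy in place of hour; the numbers are not from the dataset.

```r
# Invented stand-in for the hour data frame (assumption)
toy <- data.frame(
  season = c("spring", "spring", "fall", "fall", "fall"),
  count  = c(10, 30, 100, 200, 300)
)

# One column per season, one row per statistic
season.stats <- sapply(
  split(toy$count, toy$season),
  function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)
season.stats
```

The result is a numeric matrix, so a single value can be read off by name, for example season.stats["mean", "fall"].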

# statistics (analysis of variance model)

summary(aov(count ~ season, data = hour))

## Df Sum Sq Mean Sq F value Pr(>F)

## season 3 37729358 12576453 409.2 <2e-16 ***
## Residuals 17375 534032233 30736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance shows that the mean number of rentals differs significantly between
seasons (F(3, 17375) = 409.2, p-value < 2e-16).
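
A significant F test does not say how large the seasonal effect is. One common way to gauge this is eta squared, the share of the total sum of squares attributed to the factor. The sketch below computes it from the sums of squares printed in the ANOVA table above; the measure is added here for illustration and is not part of the original analysis.

```r
# Sums of squares copied from the ANOVA table above
ss.season   <- 37729358
ss.residual <- 534032233

# Eta squared: between-group sum of squares over total sum of squares
eta.squared <- ss.season / (ss.season + ss.residual)
round(eta.squared, 3)
```

The value is roughly 0.066, so season accounts for under 7% of the variance in hourly counts even though the test is highly significant with more than 17,000 observations.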

# pairwise comparison of means for seasons

# in order to identify any difference between two means that is greater than
# the expected standard error
TukeyHSD(aov(count ~ season, data = hour))

## Tukey multiple comparisons of means

## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ season, data = hour)
##
## $season
## diff lwr upr p adj
## spring-fall -124.901668 -134.54307 -115.2602613 0.0000000
## summer-fall -27.672168 -37.21916 -18.1251741 0.0000000
## winter-fall -37.147380 -46.79465 -27.5001142 0.0000000
## summer-spring 97.229500 87.54202 106.9169764 0.0000000
## winter-spring 87.754288 77.96798 97.5405970 0.0000000
## winter-summer -9.475213 -19.16852 0.2180949 0.0581801

The pairwise comparison of means shows that the largest difference in the total number of bike
rentals is between spring and fall (-124.9), while the smallest, between winter and summer (-9.5),
is not significant at the 5% level (adjusted p = 0.058). Mean rentals are therefore similar for
winter and summer but differ markedly between spring and fall.
### TASK 8

# create a boxplot for count~season in order to reveal the most popular season
# for bike rentals

boxplot(count ~ season,
data = hour,
xlab = "Season",
ylab = "Count",
main = "Count by Season",
col = "yellow3")

The boxplots show that the most popular seasons for renting a bike are fall and summer, while the
least popular one is spring.

4. Is the correlation between felt air temperature (atemp) and number of bike
rentals significant? Is there a difference between the correlations for the two years
(2011 and 2012)?

### TASK 4

# correlation test for count~atemp

t1 <- cor.test(hour$atemp[hour$year == 2011],
hour$count[hour$year == 2011])
t1
##
## Pearson's product-moment correlation
##
## data: hour$atemp[hour$year == 2011] and hour$count[hour$year == 2011]
## t = 46.4598, df = 8643, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4300004 0.4637388
## sample estimates:
## cor
## 0.4470285

t2 <- cor.test(hour$atemp[hour$year == 2012],
               hour$count[hour$year == 2012])
t2

##
## Pearson's product-moment correlation
##
## data: hour$atemp[hour$year == 2012] and hour$count[hour$year == 2012]
## t = 40.3462, df = 8732, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3785679 0.4139248
## sample estimates:
## cor
## 0.3963933

# apa format
library("yarrr")
apa(t1)

## [1] "r = 0.45, t(8643) = 46.46, p < 0.01 (2-tailed)"

apa(t2)

## [1] "r = 0.4, t(8732) = 40.35, p < 0.01 (2-tailed)"

The correlation test shows a significant correlation between the felt air temperature and the
number of bike rentals for both years (p-value < 0.01 in both cases), although the correlation
coefficients differ, being higher for 2011 (0.45) than for 2012 (0.4).
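
Whether the two coefficients differ by more than chance can be checked with a Fisher z-test for two independent correlations. This test is not part of the original analysis; the sketch below is an illustration that takes r and the degrees of freedom from the output above (n = df + 2) and assumes that the two years are independent samples and that hourly observations are independent, which is a simplification for time-series data.

```r
# Correlations and sample sizes taken from the cor.test output above
r1 <- 0.4470285; n1 <- 8643 + 2   # 2011
r2 <- 0.3963933; n2 <- 8732 + 2   # 2012

# Fisher z-transformation of each correlation
z1 <- atanh(r1)
z2 <- atanh(r2)

# Standard error of the difference between the transformed values
se.diff <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

# Test statistic and two-sided p-value from the standard normal distribution
z.stat  <- (z1 - z2) / se.diff
p.value <- 2 * pnorm(-abs(z.stat))
c(z = z.stat, p = p.value)
```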
### TASKS 5, 6

# plotting the results in a scatterplot with regression lines

# blank plot (type = "n" sets up the axes without drawing the dummy point)
plot(x = 1,
     type = "n",
     xlab = "Temperature",
     ylab = "Number of Rents",
     xlim = c(-25,50),
     ylim = c(0,1000),
     main = "Temperature vs. Count")

# draw points for 2011; the colour matches the 2011 line and legend entry,
# and transparency keeps the regression lines visible
points(x = hour$atemp[hour$year == 2011],
       y = hour$count[hour$year == 2011],
       pch = 16,
       col = adjustcolor("darkgreen", alpha.f = 0.3),
       cex = 0.5
)
# draw points for 2012
points(x = hour$atemp[hour$year == 2012],
       y = hour$count[hour$year == 2012],
       pch = 16,
       col = adjustcolor("red", alpha.f = 0.3),
       cex = 0.5
)

# add regression lines for the two years

abline(lm(count~atemp, hour, subset = year == 2011),
       col = "darkgreen",
       lwd = 3)

abline(lm(count~atemp, hour, subset = year == 2012),
       col = "red",
       lwd = 3)

# add legend
legend("topleft",
legend = c(2011, 2012),
col = c("darkgreen","red"),
pch = c(16, 16),
bg = "white",
cex = 1
)
The scatterplot with the regression lines for both years again illustrates the difference
between the correlations for 2011 and 2012. The slope of the regression lines shows that the
influence of the temperature for 2011 is more significant than for 2012.

5. Is weather condition correlated with the number of bike rentals? What are the
minimum, maximum, mean, median, standard deviation and number of
occurrences for each weather condition? How does weather condition influence
the distribution of bike rentals?

### TASK 5

# summary on linear model fitting

summary(lm(count~weathersit, hour))
##
## Call:
## lm(formula = count ~ weathersit, data = hour)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205.65 -139.65 -45.65 89.35 790.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 247.054 3.328 74.24 <2e-16 ***
## weathersit -40.407 2.130 -18.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 179.5 on 17377 degrees of freedom
## Multiple R-squared: 0.02029, Adjusted R-squared: 0.02023
## F-statistic: 359.8 on 1 and 17377 DF, p-value: < 2.2e-16

summary(aov(count~weathersit, hour))

## Df Sum Sq Mean Sq F value Pr(>F)

## weathersit 1 11598301 11598301 359.8 <2e-16 ***
## Residuals 17377 560163290 32236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, there is a significant association between weather condition and the number of bike rentals
(p-value < 2e-16), although the linear model explains only about 2% of the variance
(R-squared = 0.02).
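
Because weathersit is stored as a number, lm() and aov() above fit a single slope across the codes 1 to 4, which is why the term has 1 degree of freedom. Treating it as a categorical variable instead compares the four group means. The sketch below shows the difference on an invented data frame toy; the values are not from the dataset.

```r
# Invented stand-in for the hour data frame (assumption)
toy <- data.frame(
  weathersit = rep(1:4, each = 3),
  count = c(200, 220, 210, 180, 170, 190, 100, 120, 110, 30, 40, 35)
)

# Numeric predictor: one slope, 1 degree of freedom
fit.numeric <- aov(count ~ weathersit, data = toy)

# Categorical predictor: one mean per condition, 3 degrees of freedom
fit.factor <- aov(count ~ factor(weathersit), data = toy)

summary(fit.numeric)[[1]]$Df
summary(fit.factor)[[1]]$Df
```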

### TASK 9

# calculate min, max, mean, st.dev and median of temperature
# for each weather condition by aggregation with dplyr library

w.agg <- hour %>%
  group_by(weathersit) %>%
  summarise(
    temp.min = min(temp),
    temp.max = max(temp),
    temp.mean = mean(temp),
    temp.stdev = sd(temp),
    temp.med = median(temp),
    count = n())
w.agg

## Source: local data frame [4 x 7]

##
## weathersit temp.min temp.max temp.mean temp.stdev temp.med count
## 1 1 -7.06 39.00 16.0195409 9.436434 16.44 11413
## 2 2 -7.06 37.12 14.2989349 8.268867 13.62 4544
## 3 3 -4.24 35.24 13.4643270 7.543913 13.62 1419
## 4 4 -1.42 2.34 0.7733333 1.956766 1.40 3
The aggregation results reveal that the minimum temperatures for the clear/partly cloudy and
mist/cloudy weather conditions are the same (-7.06°C). The highest maximum temperature occurs under
the 1st weather condition, clear/partly cloudy (39°C). The heavy rain/thunderstorm condition has
the lowest maximum, mean and median temperatures and is the rarest weather condition in the whole
dataset (only 3 records), while the most frequent one is the 1st (11413 records).
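
The question also asks for summary statistics of the rentals themselves, whereas the table above summarises temperature. A base-R sketch of the same kind of aggregation for count follows; toy is an invented stand-in for hour, so the printed numbers are illustrative only.

```r
# Invented stand-in for the hour data frame (assumption)
toy <- data.frame(
  weathersit = c(1, 1, 1, 2, 2, 3),
  count      = c(100, 200, 300, 80, 120, 20)
)

# One row of summary statistics per weather condition
count.by.weather <- do.call(
  rbind,
  lapply(split(toy$count, toy$weathersit), function(x) {
    data.frame(min = min(x), max = max(x), mean = mean(x),
               median = median(x), sd = sd(x), n = length(x))
  })
)
count.by.weather
```

A condition with a single record gets NA for the standard deviation, which is worth keeping in mind for rare conditions such as the 4th weather type.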

### TASKS 7, 11

# create histograms for each weather condition

# to explore distribution of the bike rentals by
# weather condition

# create a vector for histograms titles

vec <- c("Clear Weather", "Cloudy Weather", "Rainy Weather", "Thunderstorm Weather")

# parameters for plots combining

par(mfrow = c(2, 2))

# create 4 histograms with a loop

for (i in c(1:4)){
  name.i <- vec[i]
  hist(hour$count[hour$weathersit == i],
       main = name.i,
       xlab = "Number of Rents",
       ylab = "Frequency",
       breaks = 10,
       col = "yellow3",
       border = "black")

  # the line indicating median value

  abline(v = median(hour$count[hour$weathersit == i]),
         col = "black",
         lwd = 3,
         lty = 2)

  # the line indicating mean value

  abline(v = mean(hour$count[hour$weathersit == i]),
         col = "blue",
         lwd = 3,
         lty = 2)
}
The histograms show that the distribution of bike rentals is quite similar for Clear and Cloudy
weather (although the frequencies are much higher in the first case), while it differs noticeably
for Rainy weather and drastically for Thunderstorm weather, where, as already pointed out, the
number of records is extremely low.

6. Is there a significant difference between total bike rentals on holidays and
working days?

### TASK 3

# Welch two-sample t-test; the result is named tt so that it does not mask base::t()
tt <- t.test(hour$count[hour$holiday == 0],
             hour$count[hour$holiday == 1])

# apa format
apa(tt)

## [1] "mean difference = -33.56, t(539.61) = 4.69, p < 0.01 (2-tailed)"

# TASK 8

beanplot(count ~ holiday,
data = hour,
main = "Bike Rents by Type of a Day",
xlab = "Type of Day",
ylab = "Number of rents",
col = bean.cols,
lwd = 1,
what = c(1,1,1,0),
log = ""
)
According to the t-test, there is a significant difference between total bike rentals on
holidays and on non-holidays (holiday == 0) (p-value < 0.01). The beanplot shows that the maximum
number of rentals is considerably higher on non-holidays than on holidays, while moderate numbers
of bike rentals (200-600 rentals) are relatively more frequent on holidays.
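
If the comparison of interest is strictly holidays against working days, the workingday column (1 for days that are neither weekend nor holiday) can define the first group instead of holiday == 0, which also includes weekends. The sketch below illustrates this on an invented data frame toy; it is not a re-analysis of the dataset.

```r
# Invented stand-in for the hour data frame (assumption)
toy <- data.frame(
  holiday    = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  workingday = c(1, 1, 1, 1, 0, 0, 0, 0, 0),
  count      = c(210, 190, 205, 195, 150, 160, 120, 140, 130)
)

# Working days only (weekends excluded) versus holidays
working  <- toy$count[toy$workingday == 1]
holidays <- toy$count[toy$holiday == 1]

# Welch two-sample t-test, the default of t.test()
res <- t.test(working, holidays)
res$estimate
res$p.value
```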

Conclusion
Running a range of statistical tests and plotting the data from the two-year historical log of
bike rentals in Washington D.C. leads to the following conclusions:

Mean temperatures vary considerably over the seasons.

Total bike rentals vary with the season, and the seasonal means differ. The largest pairwise
difference in means is between spring and fall, while the smallest, between winter and summer,
is not significant at the 5% level.

There is a significant, moderate positive correlation between felt air temperature and the total
number of bike rentals (r = 0.45 for 2011 and r = 0.40 for 2012), and it differs between the two
years.

Weather condition and the total number of bike rentals are also significantly associated.
The two most popular weather conditions for bike rentals are Clear and Cloudy weather.

There is a significant difference in the total number of bike rentals between holidays and
non-holidays.
