DT_QA2_.pdf
2023-06-19
library(tidyverse)
library(readr)
(a) For our analysis, the subjects are not the cricketers themselves, but each batting innings they
participated in. In order to make the data tidy:
Rearrange the data into a long format so that there is a row for each batter in each innings.
## # A tibble: 310 x 5
## batter team role test_innings performance
## <chr> <chr> <chr> <chr> <chr>
## 1 Ali Eng allrounder Test 1_Innings_1 Batting at number 8 scored ~
## 2 Anderson England bowl Test 1_Innings_1 Batting at number 11 scored~
## 3 Archer England bowl Test 1_Innings_1 Batting at number NA scored~
## 4 Bairstow England wicketkeeper Test 1_Innings_1 Batting at number 7 scored ~
## 5 Bancroft Aus bat Test 1_Innings_1 Batting at number 1 scored ~
## 6 Broad England bowler Test 1_Innings_1 Batting at number 10 scored~
## 7 Burns England bat Test 1_Innings_1 Batting at number 1 scored ~
## 8 Buttler England bat Test 1_Innings_1 Batting at number 5 scored ~
## 9 Cummins Australia bowler Test 1_Innings_1 Batting at number 9 scored ~
## 10 Curran England bowl Test 1_Innings_1 Batting at number NA scored~
## # i 300 more rows
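The reshaping step can be sketched with pivot_longer(). This is a minimal sketch, not the code used to produce the output above: the wide data frame ashes_wide and its placeholder cell values are hypothetical, and only the column names are taken from the output shown.

```r
library(tidyverse)

# Hypothetical wide data: one row per batter, one column per innings.
# The performance strings are placeholders, not real data.
ashes_wide <- tribble(
  ~batter,    ~team,     ~role, ~`Test 1_Innings_1`, ~`Test 1_Innings_2`,
  "Burns",    "England", "bat", "performance A",     "performance B",
  "Bancroft", "Aus",     "bat", "performance C",     "performance D"
)

# Stack every innings column into a (test_innings, performance) pair,
# giving one row per batter per innings.
ashes_long <- ashes_wide %>%
  pivot_longer(
    cols = starts_with("Test"),
    names_to = "test_innings",
    values_to = "performance"
  )

ashes_long
```

Selecting the innings columns with starts_with("Test") assumes that no identifier column shares that prefix.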
Use str_match() to create new columns for each of the following for each player innings:
balls=str_match(performance,"(\\d+) ball")[,2],
fours=str_match(performance,"(\\d+) four")[,2],
sixes=str_match(performance,"(\\d+) six")[,2])
ashes_lg
## # A tibble: 310 x 12
## batter team role test_innings performance test innings bat_num score balls
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Ali Eng allr~ Test 1_Inni~ Batting at~ 1 1 8 0 5
## 2 Ander~ Engl~ bowl Test 1_Inni~ Batting at~ 1 1 11 3 19
## 3 Archer Engl~ bowl Test 1_Inni~ Batting at~ 1 1 <NA> <NA> <NA>
## 4 Bairs~ Engl~ wick~ Test 1_Inni~ Batting at~ 1 1 7 8 35
## 5 Bancr~ Aus bat Test 1_Inni~ Batting at~ 1 1 1 8 25
## 6 Broad Engl~ bowl~ Test 1_Inni~ Batting at~ 1 1 10 29 67
## 7 Burns Engl~ bat Test 1_Inni~ Batting at~ 1 1 1 133 312
## 8 Buttl~ Engl~ bat Test 1_Inni~ Batting at~ 1 1 5 5 10
## 9 Cummi~ Aust~ bowl~ Test 1_Inni~ Batting at~ 1 1 9 5 10
## 10 Curran Engl~ bowl Test 1_Inni~ Batting at~ 1 1 <NA> <NA> <NA>
## # i 300 more rows
## # i 2 more variables: fours <chr>, sixes <chr>
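As an illustration of how str_match() pulls a number out of the performance text, here is a minimal sketch. The full wording of the performance strings is truncated in the output above, so the example strings and the patterns for bat_num and score are assumptions; only the "(\\d+) ball" pattern is taken from the code shown.

```r
library(tidyverse)

# Hypothetical performance strings (the real wording is truncated above).
perf <- tibble(performance = c(
  "Batting at number 8 scored 0 runs from 5 balls including 0 fours and 0 sixes",
  "Batting at number NA scored NA runs from NA balls including NA fours and NA sixes"
))

# str_match() returns a character matrix: column 1 is the full match,
# column 2 is the first capture group. Rows with no match give NA.
parsed <- perf %>%
  mutate(
    bat_num = str_match(performance, "number (\\d+)")[, 2],  # assumed pattern
    score   = str_match(performance, "scored (\\d+)")[, 2],  # assumed pattern
    balls   = str_match(performance, "(\\d+) ball")[, 2]     # pattern from the source
  )

parsed
```

Because "NA" contains no digits, the patterns do not match for players who did not bat, which is one way the <NA> entries above could arise.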
• Ensure all categorical variables with a small number of levels are coded as factors,
• Ensure all categorical variables with a large number of levels are coded as characters, and
• Ensure all quantitative variables are coded as integers or numeric, as appropriate.
ashes_lg
## # A tibble: 310 x 12
## batter team role test_innings performance test innings bat_num score balls
## <chr> <fct> <fct> <chr> <chr> <fct> <fct> <int> <int> <int>
## 1 Ali Eng allr~ Test 1_Inni~ Batting at~ 1 1 8 0 5
## 2 Ander~ Engl~ bowl Test 1_Inni~ Batting at~ 1 1 11 3 19
## 3 Archer Engl~ bowl Test 1_Inni~ Batting at~ 1 1 NA NA NA
## 4 Bairs~ Engl~ wick~ Test 1_Inni~ Batting at~ 1 1 7 8 35
## 5 Bancr~ Aus bat Test 1_Inni~ Batting at~ 1 1 1 8 25
## 6 Broad Engl~ bowl~ Test 1_Inni~ Batting at~ 1 1 10 29 67
## 7 Burns Engl~ bat Test 1_Inni~ Batting at~ 1 1 1 133 312
## 8 Buttl~ Engl~ bat Test 1_Inni~ Batting at~ 1 1 5 5 10
## 9 Cummi~ Aust~ bowl~ Test 1_Inni~ Batting at~ 1 1 9 5 10
## 10 Curran Engl~ bowl Test 1_Inni~ Batting at~ 1 1 NA NA NA
## # i 300 more rows
## # i 2 more variables: fours <int>, sixes <int>
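One way to carry out the three type conversions listed above is with mutate() and across(). This is a sketch on a toy stand-in for ashes_lg (the toy tibble is hypothetical), not necessarily the code that produced the output shown.

```r
library(tidyverse)

# Toy stand-in for ashes_lg with every column stored as character.
toy <- tibble(
  batter  = c("Ali", "Archer"),
  team    = c("Eng", "England"),
  role    = c("allrounder", "bowl"),
  test    = c("1", "1"),
  innings = c("1", "1"),
  score   = c("0", NA),
  balls   = c("5", NA)
)

toy_typed <- toy %>%
  mutate(
    # Few levels: code as factors.
    across(c(team, role, test, innings), as.factor),
    # Counts: code as integers (NA stays NA).
    across(c(score, balls), as.integer)
  )
# batter has many levels, so it is left as character.

toy_typed
```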
(c) Clean the data; recode the factors using fct_recode() such that there are no typographical errors in
the team names and player roles.
## # A tibble: 2 x 2
## team n
## <fct> <int>
## 1 Australia 160
## 2 England 150
## # A tibble: 4 x 2
## role n
## <fct> <int>
## 1 allrounder 70
## 2 bat 110
## 3 bowl 110
## 4 wicketkeeper 20
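The recoding can be sketched as follows. The misspellings handled here ("Eng", "Aus", "bowler") are the ones visible in the first rows of the output earlier in this document; whether the full data contain others is not known, so treat the mapping as an assumption. The toy tibble is hypothetical.

```r
library(tidyverse)

# Toy factors containing the variants visible in the output above.
toy <- tibble(
  team = factor(c("Eng", "England", "Aus", "Australia")),
  role = factor(c("allrounder", "bowl", "bat", "bowler"))
)

# fct_recode() takes new_level = old_level pairs and merges levels
# that end up with the same name.
toy_clean <- toy %>%
  mutate(
    team = fct_recode(team, "England" = "Eng", "Australia" = "Aus"),
    role = fct_recode(role, "bowl" = "bowler")
  )

toy_clean %>% count(team)
toy_clean %>% count(role)
```

Counting each factor afterwards, as in the tables above, is a quick check that no stray levels remain.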
ashes_lg %>%
ggplot(aes(score))+
geom_histogram(col="black")
[Figure: histogram of score, with count on the y-axis.]
(b) Describe the distribution of scores, considering shape, location, spread and outliers.
summary(ashes_lg$score)
(c) Produce a bar chart of the teams participating in the series, with different colours for each team.
Noting that each player is represented by 10 rows in the data frame, how many players were used by
each team in the series?
ashes_lg %>%
select(batter,team) %>%
distinct() %>%
ggplot(aes(team))+
geom_bar(aes(fill=team),col="black")
[Figure: bar chart of count by team (Australia, England), with bars filled by team.]
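The number of players per team can also be computed directly rather than read off the chart. Below is a minimal sketch on a hypothetical toy tibble; applied to ashes_lg, the same pipeline should agree with dividing the row counts reported earlier (160 and 150) by the 10 rows per player.

```r
library(tidyverse)

# Toy data: each player appears on several rows (one per innings).
toy <- tibble(
  batter = rep(c("Burns", "Root", "Smith"), each = 2),
  team   = rep(c("England", "England", "Australia"), each = 2)
)

# Keep one row per player, then count players within each team.
players_per_team <- toy %>%
  distinct(batter, team) %>%
  count(team, name = "players")

players_per_team
```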
(a) Using ggplot, produce histograms of scores during the series, faceted by team.
ashes_lg %>%
ggplot(aes(score))+
geom_histogram(aes(fill=team), col="black")+
facet_wrap(~team)
[Figure: histograms of score, with count on the y-axis, faceted and filled by team (Australia, England).]
(b) Produce side-by-side boxplots of scores by each team during the series.
ashes_lg %>%
ggplot(aes(team,score))+
geom_boxplot(aes(fill=team))
[Figure: side-by-side boxplots of score by team (Australia, England).]
(c) Compare the distributions of scores by each team during the series, considering shape, location, spread
and outliers, and referencing the relevant plots. Which team looks to have had a higher variability of
scores?
Shape: The distributions are right-skewed and unimodal for both teams (see histograms).
Location: Australia and England have similar medians (see boxplots).
Spread: Australia has a larger interquartile range than England (see boxplots).
Outliers: Australia has eight outliers, ranging between 80 and 220; England also has eight outliers,
ranging between 70 and 135.
Australia has the higher variability of scores.
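The visual comparison of location and spread can be backed by numerical summaries per team. This is a sketch on hypothetical toy scores, not the series data; the column names med, iqr and sd_score are illustrative.

```r
library(tidyverse)

# Toy scores; NA stands for an innings in which the player did not bat.
toy <- tibble(
  team  = rep(c("Australia", "England"), each = 5),
  score = c(0, 10, 40, 90, 200, 5, 10, 20, 30, NA)
)

# Median for location; IQR and standard deviation for spread.
spread_by_team <- toy %>%
  group_by(team) %>%
  summarise(
    med      = median(score, na.rm = TRUE),
    iqr      = IQR(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE)
  )

spread_by_team
```

The IQR is the more robust of the two spread measures here, since the standard deviation is pulled upwards by a few very high scores.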
ashes_lg %>%
ggplot(aes(balls,score))+
geom_point()+
geom_smooth()
## Warning: Removed 103 rows containing non-finite values (`stat_smooth()`).
[Figure: scatterplot of score (y-axis) against balls (x-axis), with a smoothed trend line.]
(b) Describe the relationship between score and number of balls. Are players who face more balls likely to
score more runs?
cor(ashes_lg$balls, ashes_lg$score, use = "complete.obs")
## [1] 0.9425244
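When the two variables have missing values, the correlation should be computed on rows where both are observed, so that each score stays paired with its own ball count. The sketch below uses hypothetical toy vectors to show that use = "complete.obs" is equivalent to filtering the pairs by hand.

```r
# Toy vectors with missing values in different positions.
score <- c(10, NA, 30, 40, 12)
balls <- c(20, 25, NA, 80, 30)

# Let cor() drop incomplete pairs itself.
r_pairwise <- cor(balls, score, use = "complete.obs")

# Same thing by hand: keep only rows where both values are present.
keep <- complete.cases(balls, score)
r_manual <- cor(balls[keep], score[keep])

all.equal(r_pairwise, r_manual)
```

Applying na.omit() to each vector separately only gives the same answer when the missing values fall on exactly the same rows in both variables.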
(c) Compute a new variable, scoring_rate, defined as the number of runs divided by the number of balls.
Produce a scatterplot of scoring_rate against number of balls.
ashes_lg %>%
mutate(scoring_rate=score/balls) %>%
ggplot(aes(balls,scoring_rate))+
geom_point()+
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
[Figure: scatterplot of scoring_rate (y-axis) against balls (x-axis), with a loess smooth.]
(d) Is there a relationship between scoring rate and number of balls? Are players who face more balls
likely to score runs more quickly?
There is no strong relationship between scoring rate and the number of balls faced.
Players who face more balls are not likely to score runs more quickly.
(a) Produce a bar chart of the number of players on each team participating in the series, with the segments
coloured by the players’ roles.
ashes_lg %>%
select(batter,team,role) %>%
distinct() %>%
ggplot(aes(team))+
geom_bar(aes(fill=role),col="black")
[Figure: stacked bar chart of count by team (Australia, England), with segments filled by role (allrounder, bat, bowl, wicketkeeper).]
(b) Produce a contingency table of the proportion of players from each team who play in each particular
role.
role_tb %>%
mutate(total=allrounder+bat+bowl+wicketkeeper,
allrounder_ppt=allrounder/total,
batter_ppt=bat/total,
bowler_ppt=bowl/total,
wicketkeeper_ppt=wicketkeeper/total
) %>%
select(team,allrounder_ppt,batter_ppt,bowler_ppt,wicketkeeper_ppt) %>%
knitr::kable(digits = 3)
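The code above starts from role_tb, whose construction is not shown. As an alternative sketch that does not rely on it, the row proportions can be built from the long data with count() and pivot_wider(). The toy tibble and the name role_props are hypothetical.

```r
library(tidyverse)

# Toy data: one row per player with team and role.
toy <- tibble(
  batter = c("A", "B", "C", "D", "E"),
  team   = c("Australia", "Australia", "England", "England", "England"),
  role   = c("bat", "bowl", "bat", "allrounder", "allrounder")
)

role_props <- toy %>%
  distinct(batter, team, role) %>%       # one row per player
  count(team, role) %>%                  # players per team and role
  group_by(team) %>%
  mutate(prop = n / sum(n)) %>%          # proportion within each team
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = role, values_from = prop, values_fill = 0)

role_props
```

Each row of the result sums to 1, so the proportions are comparable between teams even if the squads differ in size.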
(c) Using these two figures, state which team is made up of a larger proportion of batters, and which team
contains a larger proportion of all-rounders.
The Australian team is made up of a larger proportion of batters.
The England team contains a larger proportion of all-rounders.