DT_QA2_.pdf

The document outlines a data analysis assignment focused on cricket performance data, specifically analyzing batting innings from a dataset. It includes steps for data cleaning, transformation, univariate analysis, and visualizations using R's tidyverse package. Key tasks involve reshaping data, creating new variables, and generating various plots to explore player performance and team statistics.

Uploaded by

锐 李

Assignment2

2023-06-19

library(tidyverse)
library(readr)

ashes <- read_csv("ashes.csv")

Q1: Reading and Cleaning

(a) For our analysis, the subjects are not the cricketers themselves, but each batting innings they participated in. In order to make the data tidy:
Rearrange the data into a long format so that there is a row for each batter in each innings.

# Stack the ten innings columns (columns 4 to 13) into key/value pairs
ashes_lg <- ashes %>%
  gather(key = "test_innings", value = "performance", 4:13)
ashes_lg

## # A tibble: 310 x 5
## batter team role test_innings performance
## <chr> <chr> <chr> <chr> <chr>
## 1 Ali Eng allrounder Test 1_Innings_1 Batting at number 8 scored ~
## 2 Anderson England bowl Test 1_Innings_1 Batting at number 11 scored~
## 3 Archer England bowl Test 1_Innings_1 Batting at number NA scored~
## 4 Bairstow England wicketkeeper Test 1_Innings_1 Batting at number 7 scored ~
## 5 Bancroft Aus bat Test 1_Innings_1 Batting at number 1 scored ~
## 6 Broad England bowler Test 1_Innings_1 Batting at number 10 scored~
## 7 Burns England bat Test 1_Innings_1 Batting at number 1 scored ~
## 8 Buttler England bat Test 1_Innings_1 Batting at number 5 scored ~
## 9 Cummins Australia bowler Test 1_Innings_1 Batting at number 9 scored ~
## 10 Curran England bowl Test 1_Innings_1 Batting at number NA scored~
## # i 300 more rows
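The same reshaping can also be written with `pivot_longer()`, which tidyr recommends over `gather()` for new code. The following is only a sketch on a made-up two-row stand-in for `ashes.csv` (the `toy` data and its cell values are assumptions, not the real file). Note that `pivot_longer()` orders the result player by player, whereas `gather()` stacks one innings column after another, so the printed row order would differ from the output above.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for ashes.csv: three identifier columns plus two innings columns
toy <- tibble(
  batter = c("Ali", "Burns"),
  team = c("Eng", "England"),
  role = c("allrounder", "bat"),
  `Test 1_Innings_1` = c("a", "b"),
  `Test 1_Innings_2` = c("c", "d")
)

# Every column except the identifiers becomes a (test_innings, performance) pair
toy_lg <- pivot_longer(
  toy,
  cols = -c(batter, team, role),
  names_to = "test_innings",
  values_to = "performance"
)
toy_lg
```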

Use str_match() to create new columns for each of the following for each player innings:

• the player’s batting number,
• their score, and
• the number of balls they faced.

# str_match() returns a matrix: column 1 is the full match, column 2 the first capture group
ashes_lg <- ashes_lg %>%
  mutate(test = str_match(test_innings, "Test (\\d+)_")[, 2],
         innings = str_match(test_innings, "Innings_(\\d+)")[, 2],
         bat_num = str_match(performance, "Batting at number (\\d+) scored")[, 2],
         score = str_match(performance, "scored (\\d+)")[, 2],
         balls = str_match(performance, "(\\d+) ball")[, 2],
         fours = str_match(performance, "(\\d+) four")[, 2],
         sixes = str_match(performance, "(\\d+) six")[, 2])

ashes_lg

## # A tibble: 310 x 12
## batter team role test_innings performance test innings bat_num score balls
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Ali Eng allr~ Test 1_Inni~ Batting at~ 1 1 8 0 5
## 2 Ander~ Engl~ bowl Test 1_Inni~ Batting at~ 1 1 11 3 19
## 3 Archer Engl~ bowl Test 1_Inni~ Batting at~ 1 1 <NA> <NA> <NA>
## 4 Bairs~ Engl~ wick~ Test 1_Inni~ Batting at~ 1 1 7 8 35
## 5 Bancr~ Aus bat Test 1_Inni~ Batting at~ 1 1 1 8 25
## 6 Broad Engl~ bowl~ Test 1_Inni~ Batting at~ 1 1 10 29 67
## 7 Burns Engl~ bat Test 1_Inni~ Batting at~ 1 1 1 133 312
## 8 Buttl~ Engl~ bat Test 1_Inni~ Batting at~ 1 1 5 5 10
## 9 Cummi~ Aust~ bowl~ Test 1_Inni~ Batting at~ 1 1 9 5 10
## 10 Curran Engl~ bowl Test 1_Inni~ Batting at~ 1 1 <NA> <NA> <NA>
## # i 300 more rows
## # i 2 more variables: fours <chr>, sixes <chr>
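To see why the `[, 2]` index is needed, it may help to look at what `str_match()` returns for a single pattern with two capture groups. The sketch below uses made-up strings: the full wording of the `performance` column is truncated in the output above, so the text after "scored" in `x` is an assumption.

```r
library(stringr)

# Hypothetical performance strings; only the "Batting at number N scored" prefix is known from the data
x <- c(
  "Batting at number 8 scored 0 runs from 5 balls",
  "Batting at number NA scored NA runs"
)

# One row per input string; column 1 is the whole match, columns 2 and 3 are the capture groups
m <- str_match(x, "Batting at number (\\d+) scored (\\d+)")
m

# A string that does not match yields NA in every column, which is how non-batting innings become NA
bat_num <- as.integer(m[, 2])
score <- as.integer(m[, 3])
```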

(b) Recode the data to make it ‘tame’, that is,

• Ensure all categorical variables with a small number of levels are coded as factors,
• Ensure all categorical variables with a large number of levels are coded as characters, and
• Ensure all quantitative variables are coded as integers or numeric, as appropriate.

ashes_lg$bat_num <- as.integer(ashes_lg$bat_num)
ashes_lg$score <- as.integer(ashes_lg$score)
ashes_lg$balls <- as.integer(ashes_lg$balls)
ashes_lg$fours <- as.integer(ashes_lg$fours)
ashes_lg$sixes <- as.integer(ashes_lg$sixes)
ashes_lg$team <- as.factor(ashes_lg$team)
ashes_lg$role <- as.factor(ashes_lg$role)
ashes_lg$test <- as.factor(ashes_lg$test)
ashes_lg$innings <- as.factor(ashes_lg$innings)

ashes_lg

## # A tibble: 310 x 12
## batter team role test_innings performance test innings bat_num score balls
## <chr> <fct> <fct> <chr> <chr> <fct> <fct> <int> <int> <int>
## 1 Ali Eng allr~ Test 1_Inni~ Batting at~ 1 1 8 0 5
## 2 Ander~ Engl~ bowl Test 1_Inni~ Batting at~ 1 1 11 3 19
## 3 Archer Engl~ bowl Test 1_Inni~ Batting at~ 1 1 NA NA NA
## 4 Bairs~ Engl~ wick~ Test 1_Inni~ Batting at~ 1 1 7 8 35
## 5 Bancr~ Aus bat Test 1_Inni~ Batting at~ 1 1 1 8 25
## 6 Broad Engl~ bowl~ Test 1_Inni~ Batting at~ 1 1 10 29 67
## 7 Burns Engl~ bat Test 1_Inni~ Batting at~ 1 1 1 133 312
## 8 Buttl~ Engl~ bat Test 1_Inni~ Batting at~ 1 1 5 5 10
## 9 Cummi~ Aust~ bowl~ Test 1_Inni~ Batting at~ 1 1 9 5 10
## 10 Curran Engl~ bowl Test 1_Inni~ Batting at~ 1 1 NA NA NA
## # i 300 more rows
## # i 2 more variables: fours <int>, sixes <int>
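The nine separate assignments can be condensed with `mutate()` and `across()`. This is a sketch of an equivalent approach on a small made-up tibble (`toy` and its values are assumptions), not the code used to produce the output above.

```r
library(dplyr)

# Hypothetical stand-in for ashes_lg with everything still stored as character
toy <- tibble(
  team = c("Aus", "England"),
  role = c("bat", "bowl"),
  score = c("8", NA),
  balls = c("25", NA)
)

tamed <- toy %>%
  mutate(
    across(c(score, balls), as.integer), # quantitative variables
    across(c(team, role), as.factor)     # categorical variables with few levels
  )
tamed
```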

(c) Clean the data; recode the factors using fct_recode( ) such that there are no typographical errors in
the team names and player roles.

# Inspect the raw levels first with: ashes_lg %>% count(team)
ashes_lg$team <- ashes_lg$team %>%
  fct_recode(Australia = "Aus",
             England = "Eng")
ashes_lg %>% count(team)

## # A tibble: 2 x 2
## team n
## <fct> <int>
## 1 Australia 160
## 2 England 150

# Inspect the raw levels first with: ashes_lg %>% count(role)
ashes_lg$role <- ashes_lg$role %>%
  fct_recode(allrounder = "all rounder",
             allrounder = "all-rounder",
             bat = "batsman",
             bat = "batting",
             bowl = "bowler",
             bowl = "bowling")
ashes_lg %>% count(role)

## # A tibble: 4 x 2
## role n
## <fct> <int>
## 1 allrounder 70
## 2 bat 110
## 3 bowl 110
## 4 wicketkeeper 20
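When several misspellings map to one level, `fct_collapse()` from forcats expresses the same cleaning with one argument per target level. The question asks for `fct_recode()`, so this is only an alternative sketch on a made-up factor (`role` below is an assumption, not the real column).

```r
library(forcats)

# Hypothetical messy roles
role <- factor(c("bat", "batsman", "batting", "bowl", "bowler"))

# Each argument names the clean level and lists the raw levels merged into it
role_clean <- fct_collapse(
  role,
  bat = c("bat", "batsman", "batting"),
  bowl = c("bowl", "bowler")
)
table(role_clean)
```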

Q2: Univariate Analysis

(a) Produce a histogram of all scores during the series.

ashes_lg %>%
ggplot(aes(score))+
geom_histogram(col="black")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 103 rows containing non-finite values (`stat_bin()`).

[Figure: histogram of score (x-axis: score, y-axis: count).]
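The `stat_bin()` message asks for an explicit bin width. A minimal sketch on made-up scores, assuming a width of 10 runs is a sensible choice (the assignment does not specify one); `boundary = 0` starts the first bin at zero and `na.rm = TRUE` drops the missing innings without a warning.

```r
library(ggplot2)

# Hypothetical scores, including one innings in which the player did not bat
toy <- data.frame(score = c(0, 4, 12, 30, 133, NA))

p <- ggplot(toy, aes(score)) +
  geom_histogram(binwidth = 10, boundary = 0, col = "black", na.rm = TRUE)
p
```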

(b) Describe the distribution of scores, considering shape, location, spread and outliers.

summary(ashes_lg$score)

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 4.00 12.00 23.94 30.50 211.00 103

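Shape: The distribution is right-skewed (positively skewed) and unimodal; the mean (23.94) is well above the median (12), and the long tail extends towards high scores.
Location: The median is 12.
Spread: The interquartile range is 30.5 − 4.0 = 26.5.
Outliers: The maximum score of 211 is a potential outlier.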
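One common way to back up the outlier claim is the 1.5 × IQR rule that boxplots use. The sketch below applies it to the quartiles reported by `summary()` above; the choice of this rule is an assumption, since the assignment does not name an outlier criterion.

```r
# Quartiles taken from the summary() output above
q1 <- 4.0
q3 <- 30.5

iqr <- q3 - q1                  # 26.5
upper_fence <- q3 + 1.5 * iqr   # 70.25
lower_fence <- q1 - 1.5 * iqr   # -35.75, below the minimum possible score of 0

c(iqr = iqr, lower = lower_fence, upper = upper_fence)
```

Under this rule any score above 70.25 would be flagged, so 211 is far beyond the fence, and no low score can be flagged because the lower fence is negative.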

(c) Produce a bar chart of the teams participating in the series, with different colours for each team.
Noting that each player is represented by 10 rows in the data frame, how many players were used by
each team in the series?

ashes_lg %>%
select(batter,team) %>%
distinct() %>%
ggplot(aes(team))+
geom_bar(aes(fill=team),col="black")

[Figure: bar chart of the number of players per team, filled by team (x-axis: team, y-axis: count).]

There are 16 Australian players and 15 England players (160 and 150 rows respectively, each divided by 10).
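The player counts can also be read off as a table rather than from the bar heights. A sketch on a made-up long data frame with two rows per player (`toy` is an assumption; the real data has ten rows per player):

```r
library(dplyr)

# Hypothetical long-format data: each batter appears once per innings
toy <- tibble(
  batter = rep(c("Ali", "Burns", "Bancroft"), each = 2),
  team = rep(c("England", "England", "Australia"), each = 2)
)

# Keep one row per player, then count players per team
players_per_team <- toy %>%
  distinct(batter, team) %>%
  count(team)
players_per_team
```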

Q3: Scores for each team

(a) Using ggplot, produce histograms of scores during the series, faceted by team.

ashes_lg %>%
ggplot(aes(score))+
geom_histogram(aes(fill=team), col="black")+
facet_wrap(~team)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 103 rows containing non-finite values (`stat_bin()`).

[Figure: histograms of score faceted by team, with panels for Australia and England, filled by team (x-axis: score, y-axis: count).]

(b) Produce side-by-side boxplots of scores by each team during the series.

ashes_lg %>%
ggplot(aes(team,score))+
geom_boxplot(aes(fill=team))

## Warning: Removed 103 rows containing non-finite values (`stat_boxplot()`).

[Figure: side-by-side boxplots of score by team, filled by team (x-axis: team, y-axis: score).]

(c) Compare the distributions of scores by each team during the series, considering shape, location, spread
and outliers, and referencing the relevant plots. Which team looks to have had a higher variability of
scores?

Shape: Both teams' distributions are right-skewed and unimodal (see histograms).
Location: Australia and England have similar medians (see boxplots).
Spread: Australia has a larger interquartile range than England (see boxplots).
Outliers: Australia has eight outliers, ranging between 80 and 220; England also has eight outliers, ranging between 70 and 135.
Australia has the higher variability of scores.
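The visual comparison can be supported with per-team summary statistics. A sketch on made-up scores (`toy` and its values are assumptions, not the Ashes data), using the median for location and the IQR and standard deviation for spread:

```r
library(dplyr)

# Hypothetical scores; NA marks an innings in which the player did not bat
toy <- tibble(
  team = c("Australia", "Australia", "Australia", "England", "England", "England"),
  score = c(0, 10, 200, 5, 10, NA)
)

team_summary <- toy %>%
  group_by(team) %>%
  summarise(
    med_score = median(score, na.rm = TRUE),
    iqr_score = IQR(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE)
  )
team_summary
```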

Q4: Scoring rates

(a) Produce a scatterplot of scores against number of balls

ashes_lg %>%
ggplot(aes(balls,score))+
geom_point()+
geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 103 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 103 rows containing missing values (`geom_point()`).

[Figure: scatterplot of score against balls with a loess smooth (x-axis: balls, y-axis: score).]

(b) Describe the relationship between score and number of balls. Are players who face more balls likely to
score more runs?

# Drop incomplete (balls, score) pairs together so the two vectors stay aligned
cor(ashes_lg$balls, ashes_lg$score, use = "complete.obs")

## [1] 0.9425244

There is a strong positive linear relationship between score and number of balls faced (r ≈ 0.94).
Players who face more balls are likely to score more runs.
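When computing a correlation with missing values, the incomplete observations should be dropped as pairs. Removing NAs from each vector separately only gives the right answer if both vectors are missing in exactly the same rows. A sketch on made-up numbers where the missing rows differ:

```r
# Hypothetical data: balls is missing in row 3, score is missing in row 4
balls <- c(5, 19, NA, 35, 25)
score <- c(0, 3, 8, NA, 8)

# Pairwise-complete correlation uses only rows 1, 2 and 5
r_complete <- cor(balls, score, use = "complete.obs")

# Dropping NAs separately pairs 35 balls with the score from a different row
r_misaligned <- cor(na.omit(balls), na.omit(score))

c(complete = r_complete, misaligned = r_misaligned)
```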

(c) Compute a new variable, scoring_rate, defined as the number of runs divided by the number of balls.
Produce a scatterplot of scoring_rate against number of balls.

ashes_lg <- ashes_lg %>%
  mutate(scoring_rate = score / balls)

ashes_lg %>%
ggplot(aes(balls,scoring_rate))+
geom_point()+
geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 103 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 103 rows containing missing values (`geom_point()`).

[Figure: scatterplot of scoring_rate against balls with a loess smooth (x-axis: balls, y-axis: scoring_rate).]

(d) Is there a relationship between scoring rate and number of balls? Are players who face more balls
likely to score runs more quickly?

There is no strong relationship between scoring rate and the number of balls faced.
Players who face more balls are not likely to score runs more quickly.

Q5: Teams’ roles

(a) Produce a bar chart of the number of players on each team participating in the series, with the segments
coloured by the players’ roles.

ashes_lg %>%
  select(batter, team, role) %>%
  distinct() %>%
  # fill carries the role colours; the outline colour is fixed, so no col = role mapping is needed
  ggplot(aes(team)) +
  geom_bar(aes(fill = role), col = "black")

[Figure: stacked bar chart of the number of players per team, with segments filled by role (allrounder, bat, bowl, wicketkeeper); x-axis: team, y-axis: count.]

(b) Produce a contingency table of the proportion of players from each team who play in each particular
role.

# Counts are per row (10 rows per player), which leaves the within-team proportions unchanged
role_tb <- ashes_lg %>%
  count(role, team) %>%
  spread(role, n)

role_tb %>%
mutate(total=allrounder+bat+bowl+wicketkeeper,
allrounder_ppt=allrounder/total,
batter_ppt=bat/total,
bowler_ppt=bowl/total,
wicketkeeper_ppt=wicketkeeper/total
) %>%
select(team,allrounder_ppt,batter_ppt,bowler_ppt,wicketkeeper_ppt) %>%
knitr::kable(digits = 3)

team       allrounder_ppt  batter_ppt  bowler_ppt  wicketkeeper_ppt
Australia           0.125       0.438       0.375             0.062
England             0.333       0.267       0.333             0.067
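Row proportions like these can also be obtained from base R's `prop.table()` without naming each role by hand. A sketch on a made-up data frame with one row per player (`players` and its values are assumptions); `margin = 1` divides each count by its row (team) total.

```r
# Hypothetical one-row-per-player data
players <- data.frame(
  batter = c("A", "B", "C", "D", "E"),
  team = c("Australia", "Australia", "England", "England", "England"),
  role = c("bat", "bowl", "bat", "allrounder", "allrounder")
)

# Cross-tabulate team by role, then convert counts to within-team proportions
tab <- with(players, table(team, role))
role_prop <- prop.table(tab, margin = 1)
round(role_prop, 3)
```

Each row of `role_prop` sums to 1, which is a quick check that the proportions are taken within teams rather than over the whole series.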

(c) Using these two figures, state which team is made up of a larger proportion of batters, and which team
contains a larger proportion of all-rounders.

The Australian team is made up of a larger proportion of batters (0.438 compared with 0.267).
The England team contains a larger proportion of all-rounders (0.333 compared with 0.125).
