05_Data_Transformation_Exploration_Visualization

The document provides an introduction to data transformation, exploration, and visualization using R, specifically in the context of biostatistics. It covers essential techniques such as rounding variables, converting numeric variables to factors, computing new variables, and sorting datasets. Additionally, it emphasizes the importance of data exploration and visualization for understanding the dataset.


Biostatistics I: Introduction to R

Data transformation, exploration and visualization

Eleni-Rosalina Andrinopoulou
Department of Biostatistics, Erasmus Medical Center

@erandrinopoulou
In this Section

▶ Data transformation
▶ Data exploration
▶ Data visualization
▶ A lot of practice

Data Transformation

You will never receive the perfect data set!


▶ Round continuous variables
▶ Convert numeric variables to factors
▶ Compute new variables
▶ Transform variables
▶ Sort the data set
▶ Convert data sets between wide and long format (wide ⇐⇒ long)

Data Transformation
▶ Round continuous variables

pbc[1:3, c("time", "age", "bili", "chol")]

  time      age bili chol
1  400 58.76523 14.5  261
2 4500 56.44627  1.1  302
3 1012 70.07255  1.4  176

round(pbc[1:3, c("time", "age", "bili", "chol")], digits = 2)

  time   age bili chol
1  400 58.77 14.5  261
2 4500 56.45  1.1  302
3 1012 70.07  1.4  176

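As a small illustration of the rounding step, the sketch below rounds a single column instead of the whole data frame. The data frame `dat` is typed in by hand from the three rows printed above, and the column names `age_1dp` and `age_floor` are made up for this example. Keep in mind that calling round() on a whole data frame, as on the slide, only works when every selected column is numeric.

```r
# First three rows of pbc as printed above (typed in by hand for illustration)
dat <- data.frame(
  time = c(400, 4500, 1012),
  age  = c(58.76523, 56.44627, 70.07255),
  bili = c(14.5, 1.1, 1.4),
  chol = c(261, 302, 176)
)

# Round one column to one decimal place and store it under a new (made-up) name
dat$age_1dp <- round(dat$age, digits = 1)

# floor() drops the decimals, giving the age in completed years
dat$age_floor <- floor(dat$age)

print(dat)
```

Storing the rounded values in a new column keeps the original precision available for later calculations.
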
Data Transformation

▶ Convert numeric variables to factors

DF <- pbc[, c("id", "time", "status", "trt", "age",
              "sex", "bili", "chol")]
head(DF)

  id time status trt      age sex bili chol
1  1  400      2   1 58.76523   f 14.5  261
2  2 4500      0   1 56.44627   f  1.1  302
3  3 1012      2   1 70.07255   m  1.4  176
4  4 1925      2   1 54.74059   f  1.8  244
5  5 1504      1   2 38.10541   f  3.4  279
6  6 2503      2   2 66.25873   f  0.8  248

Data Transformation
▶ Convert numeric variables to factors

DF <- pbc[, c("id", "time", "status", "trt", "age",
              "sex", "bili", "chol")]
DF$trt <- factor(x = DF$trt, levels = c(1, 2),
                 labels = c("D-penicillamine", "placebo"))
head(DF)

  id time status             trt      age sex bili chol
1  1  400      2 D-penicillamine 58.76523   f 14.5  261
2  2 4500      0 D-penicillamine 56.44627   f  1.1  302
3  3 1012      2 D-penicillamine 70.07255   m  1.4  176
4  4 1925      2 D-penicillamine 54.74059   f  1.8  244
5  5 1504      1         placebo 38.10541   f  3.4  279
6  6 2503      2         placebo 66.25873   f  0.8  248

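A quick way to check a conversion like the one above is to tabulate the new factor. The sketch below uses the six treatment codes printed on the slide, typed in by hand; the vector `trt_bad` with the extra code 3 is a made-up example that shows what happens to values not listed in `levels`.

```r
# Treatment codes of the first six patients, as printed above
trt <- c(1, 1, 1, 1, 2, 2)

# Same conversion as on the slide
trt_f <- factor(x = trt, levels = c(1, 2),
                labels = c("D-penicillamine", "placebo"))

levels(trt_f)   # the labels, in the order of `levels`
table(trt_f)    # frequency of each label

# A code that is not listed in `levels` (here the made-up value 3) becomes NA
trt_bad <- factor(x = c(1, 2, 3), levels = c(1, 2),
                  labels = c("D-penicillamine", "placebo"))
is.na(trt_bad)

# useNA = "ifany" makes such missing values visible in the frequency table
table(trt_bad, useNA = "ifany")
```

Comparing the table before and after the conversion helps to catch mistyped levels early.
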
Data Transformation
▶ Compute new variables
▶ Transform variables

DF <- pbc[, c("id", "time", "status", "trt", "age",
              "sex", "bili", "chol")]
# time is recorded in days: convert it to months (using 30-day months)
DF$time <- DF$time/30
# convert the months to years
DF$time_years <- DF$time/12
head(DF)

  id      time status trt      age sex bili chol time_years
1  1  13.33333      2   1 58.76523   f 14.5  261   1.111111
2  2 150.00000      0   1 56.44627   f  1.1  302  12.500000
3  3  33.73333      2   1 70.07255   m  1.4  176   2.811111
4  4  64.16667      2   1 54.74059   f  1.8  244   5.347222
5  5  50.13333      1   2 38.10541   f  3.4  279   4.177778
6  6  83.43333      2   2 66.25873   f  0.8  248   6.952778

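The slide overwrites `time` with its value in months. A possible variation, sketched below under stated assumptions, keeps the original column and stores each derived quantity under its own name. The data are the six rows printed above, typed in by hand. The column names `time_months`, `time_years` and `log_bili` are made up, the factor 365.25 days per year is an assumption of this sketch (dividing by 30 and then by 12, as on the slide, amounts to 360 days per year), and the log transformation is only one example of transforming a variable.

```r
# First six rows of pbc as printed above (typed in by hand for illustration)
DF <- data.frame(
  id   = 1:6,
  time = c(400, 4500, 1012, 1925, 1504, 2503),  # follow-up time in days
  bili = c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8)
)

# Compute new variables without overwriting the original one
DF$time_months <- DF$time / 30       # 30-day months, as on the slide
DF$time_years  <- DF$time / 365.25   # assumption: average year length in days

# Transform a variable: natural logarithm of serum bilirubin
DF$log_bili <- log(DF$bili)

head(DF)
```

Keeping `time` untouched means the code can be run repeatedly without dividing the same column twice.
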
Data Transformation

▶ Sort the data set in either ascending or descending order


▶ The variable by which we sort can be a numeric, string or factor

head(sort(pbc$bili))

[1] 0.3 0.3 0.3 0.4 0.4 0.4

Data Transformation
▶ Sort the data set in either ascending or descending order
▶ The variable by which we sort can be a numeric, string or factor

head(pbc[order(pbc$bili), ])

     id time status trt      age sex ascites hepato spiders edema bili chol
8     8 2466      2   2 53.05681   f       0      0       0     0  0.3  280
36   36 3611      0   2 56.41068   f       0      0       0     0  0.3  172
163 163 2055      2   1 53.49760   f       0      0       0     0  0.3  233
84   84 4032      0   2 55.83025   f       0      0       0     0  0.4  263
108 108 2583      2   1 50.35729   f       0      0       0     0  0.4  127
135 135 3150      0   1 42.96783   f       0      0       0     0  0.4  263
    albumin copper alk.phos    ast trig platelet protime stage
8      4.00     52   4651.2  28.38  189      373    11.0     3
36     3.39     18    558.0  71.30   96      311    10.6     2
163    4.08     20    622.0  66.65   68      358     9.9     3
84     3.76     29   1345.0 137.95   74      181    11.2     3
108    3.50     14   1062.0  49.60   84      334    10.3     2
135    3.57    123    836.0  74.40  121      445    11.0     2

Data Transformation
▶ Sort the data set in either ascending or descending order
▶ The variable by which we sort can be a numeric, string or factor

head(pbc[order(pbc$bili, pbc$age), ])

     id time status trt      age sex ascites hepato spiders edema bili chol
8     8 2466      2   2 53.05681   f       0      0       0   0.0  0.3  280
163 163 2055      2   1 53.49760   f       0      0       0   0.0  0.3  233
36   36 3611      0   2 56.41068   f       0      0       0   0.0  0.3  172
135 135 3150      0   1 42.96783   f       0      0       0   0.0  0.4  263
320 320 2403      0  NA 44.00000   f      NA     NA      NA   0.5  0.4   NA
168 168 2713      0   2 47.75359   f       0      1       0   0.0  0.4  257
    albumin copper alk.phos    ast trig platelet protime stage
8      4.00     52   4651.2  28.38  189      373    11.0     3
163    4.08     20    622.0  66.65   68      358     9.9     3
36     3.39     18    558.0  71.30   96      311    10.6     2
135    3.57    123    836.0  74.40  121      445    11.0     2
320    3.81     NA       NA     NA   NA      226    10.5     3
168    3.80     44    842.0  97.65  110       NA     9.2     2

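The slides above show ascending sorts only. The sketch below, which uses six rows typed in by hand from the earlier slides rather than the full pbc data set, illustrates the difference between sort() and order(), a descending sort, and a sort on two keys with opposite directions. The object names `DF_asc`, `DF_desc` and `DF_sex_age` are made up for this example.

```r
# First six rows of pbc as printed on the earlier slides (typed in by hand)
DF <- data.frame(
  id   = 1:6,
  age  = c(58.76523, 56.44627, 70.07255, 54.74059, 38.10541, 66.25873),
  sex  = c("f", "f", "m", "f", "f", "f"),
  bili = c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8)
)

# sort() returns the sorted values only; order() returns row positions,
# which is what we need to rearrange a whole data frame
sort(DF$bili, decreasing = TRUE)

# Ascending and descending by bili
DF_asc  <- DF[order(DF$bili), ]
DF_desc <- DF[order(DF$bili, decreasing = TRUE), ]

# Sort by sex (ascending), then by age from oldest to youngest;
# the minus sign reverses the direction of a numeric key
DF_sex_age <- DF[order(DF$sex, -DF$age), ]

DF_sex_age$id
```

The minus-sign trick only works for numeric keys; for other types, `decreasing = TRUE` reverses all keys at once.
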
Data Transformation

▶ Data sets in wide ⇐⇒ long format: pbc is in wide format (one row per patient)

head(pbc[, c("id", "time", "status", "trt", "age",
             "sex", "bili", "chol")])

  id time status trt      age sex bili chol
1  1  400      2   1 58.76523   f 14.5  261
2  2 4500      0   1 56.44627   f  1.1  302
3  3 1012      2   1 70.07255   m  1.4  176
4  4 1925      2   1 54.74059   f  1.8  244
5  5 1504      1   2 38.10541   f  3.4  279
6  6 2503      2   2 66.25873   f  0.8  248

Data Transformation

▶ Data sets in wide ⇐⇒ long format: pbcseq is in long format (one row per patient visit)

head(pbcseq[, c("id", "futime", "status", "trt", "age", "day",
                "sex", "bili", "chol")])

  id futime status trt      age day sex bili chol
1  1    400      2   1 58.76523   0   f 14.5  261
2  1    400      2   1 58.76523 192   f 21.3   NA
3  2   5169      0   1 56.44627   0   f  1.1  302
4  2   5169      0   1 56.44627 182   f  0.8   NA
5  2   5169      0   1 56.44627 365   f  1.0   NA
6  2   5169      0   1 56.44627 768   f  1.9   NA

Data Transformation

▶ Convert between wide and long format with reshape(); see the help page

?reshape

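As a minimal sketch of what reshape() does, the example below turns a small long data set into wide format. The bilirubin values are the first measurements of patients 1 and 2 printed from pbcseq above, typed in by hand. The column `visit` is a made-up visit counter added for this illustration, because the measurement days in pbcseq differ between patients and would not line up as column names.

```r
# Long format: one row per patient visit (values typed in from the pbcseq output above)
long <- data.frame(
  id    = c(1, 1, 2, 2, 2),
  visit = c(1, 2, 1, 2, 3),            # made-up visit counter
  bili  = c(14.5, 21.3, 1.1, 0.8, 1.0)
)

# Wide format: one row per patient, one bili column per visit
wide <- reshape(long, idvar = "id", timevar = "visit", direction = "wide")

wide
```

Patient 1 has only two visits, so the third bilirubin column is filled with NA for that patient.
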
Data Exploration

▶ Common questions for the pbc data set


▶ What is the mean and standard deviation for age?
▶ What is the mean and standard deviation for time?
▶ What is the median and interquartile range for age?
▶ What is the percentage of placebo patients?
▶ What is the percentage of females?
▶ What is the mean and standard deviation for age in males?
▶ What is the mean and standard deviation for baseline serum bilirubin?
▶ What is the percentage of missing values in serum bilirubin?

All these questions can be answered using R!

Data Exploration

▶ Hints

▶ Check functions: mean(...), sd(...), percent(...), median(...), IQR(...), table(...)

What is the mean value for age?

mean(pbc$age)

[1] 50.74155

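The other questions can be approached in the same way. The sketch below runs on six rows typed in by hand from the earlier slides, so the numbers differ from those of the full pbc data set. It uses prop.table(table(...)) for percentages as a base R alternative, since percent() is not part of base R and presumably comes from an add-on package. The vector `chol_example` is made up to show how missing values are handled.

```r
# First six rows of pbc as printed on the earlier slides (typed in by hand)
DF <- data.frame(
  age  = c(58.76523, 56.44627, 70.07255, 54.74059, 38.10541, 66.25873),
  sex  = factor(c("f", "f", "m", "f", "f", "f")),
  trt  = factor(c(1, 1, 1, 1, 2, 2), levels = c(1, 2),
                labels = c("D-penicillamine", "placebo")),
  bili = c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8)
)

# Mean, standard deviation, median and interquartile range for age
age_mean   <- mean(DF$age)
age_sd     <- sd(DF$age)
age_median <- median(DF$age)
age_iqr    <- IQR(DF$age)

# Percentages per category
pct_trt <- 100 * prop.table(table(DF$trt))
pct_sex <- 100 * prop.table(table(DF$sex))

# Mean age in males only
age_mean_m <- mean(DF$age[DF$sex == "m"])

# Percentage of missing values in serum bilirubin
pct_missing_bili <- 100 * mean(is.na(DF$bili))

# With missing values, mean() returns NA unless na.rm = TRUE is set
chol_example <- c(261, NA, 302)   # made-up vector for illustration
mean(chol_example)
mean(chol_example, na.rm = TRUE)
```

sd(), median() and IQR() accept the same `na.rm = TRUE` argument.
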
Data Visualization

▶ It is important to investigate each variable in our data set using plots


▶ Descriptive statistics for continuous and categorical variables
▶ Distributions of variables
▶ Distributions of variables per group
▶ Extreme values
▶ Linear/nonlinear evolutions

Data Visualization
Take care!
[Figure: pie chart of serum bilirubin, with one labelled slice per observed value]

Data Visualization
Take care!

[Figure: two side-by-side panels of serum bilirubin (y-axis from 0 to 20) for First and Last]

Data Visualization
Take care!
[Figure: serum bilirubin for First and Last]

Data Visualization
Take care!

[Figure: serum bilirubin for First and Last, y-axis from 0.60 to 0.80]

Data Visualization
Take care!

[Figure: serum bilirubin for First and Last, y-axis from 10 to 25]

Data Visualization

▶ R has very powerful graphics capabilities


▶ Some good references are
▶ Murrell, P. (2005) R Graphics. Boca Raton: Chapman & Hall/CRC.
▶ Sarkar, D. (2008) Lattice: Multivariate Data Visualization with R. New York: Springer-Verlag.

Data Visualization

▶ Traditional graphics system
  ▶ package graphics
▶ Trellis graphics system
  ▶ package lattice (which is based on package grid)
▶ Grammar of Graphics implementation (see Wilkinson, L. (1999) The Grammar of Graphics. New York: Springer-Verlag)
  ▶ packages ggplot & ggplot2

Data Visualization

Important basic plotting functions

▶ plot(): scatter plot (and others)
▶ barplot(): bar plots
▶ boxplot(): box-and-whisker plots
▶ hist(): histograms
▶ dotchart(): dot plots
▶ pie(): pie charts
▶ qqnorm(), qqline(), qqplot(): distribution plots
▶ pairs(): for multivariate data
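
Two of the functions listed above can be tried as follows. This is only a sketch on six bilirubin values typed in by hand from the earlier slides; the class limits passed to `breaks` and the plot titles are made up for the illustration.

```r
# Serum bilirubin of the first six patients (typed in by hand)
bili <- c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8)

# Histogram with explicit (made-up) class limits;
# hist() also returns an object that holds the counts per class
h <- hist(bili, breaks = c(0, 5, 10, 15),
          main = "Histogram of bili", xlab = "bili")
h$counts

# Normal Q-Q plot of log(bili) with a reference line
qqnorm(log(bili), main = "Normal Q-Q plot of log(bili)")
qqline(log(bili))
```

With the full data set, the default `breaks` are usually a reasonable starting point.
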

Data Visualization
Continuous variables

plot(x = pbc$age, y = pbc$bili)

[Figure: scatter plot of pbc$bili (y-axis, 0 to 25) against pbc$age (x-axis, 30 to 80)]

Data Visualization
Continuous variables

plot(x = pbc$age, y = pbc$bili, xlab = "age", ylab = "bili",
     cex.lab = 1.9, cex.axis = 1.5)

[Figure: scatter plot of bili (y-axis, 0 to 25) against age (x-axis, 30 to 80) with enlarged axis labels]

Data Visualization
Continuous variables

plot(x = pbc$age, y = pbc$bili, xlab = "age", ylab = "bili",
     cex.lab = 1.9, cex.axis = 1.5, col = "red")

[Figure: scatter plot of bili (y-axis, 0 to 25) against age (x-axis, 30 to 80), points drawn in red]

Data Visualization

▶ For more options check

?plot

Data Visualization
Continuous variables per group

plot(x = pbc$age, y = pbc$bili, xlab = "age", ylab = "bili",
     cex.lab = 1.9, cex.axis = 1.5, col = pbc$sex)

[Figure: scatter plot of bili (y-axis, 0 to 25) against age (x-axis, 30 to 80), points coloured by sex]

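Passing a factor to `col` works because its integer codes are used to pick colours from the current palette, which leaves the choice of colours implicit and the plot without a legend. The sketch below shows one possible alternative with explicit colours and a legend. It uses six rows typed in by hand from the earlier slides; the colour choices and the names `group_cols` and `point_cols` are assumptions made for this example.

```r
# First six rows of pbc as printed on the earlier slides (typed in by hand)
DF <- data.frame(
  age  = c(58.76523, 56.44627, 70.07255, 54.74059, 38.10541, 66.25873),
  bili = c(14.5, 1.1, 1.4, 1.8, 3.4, 0.8),
  sex  = factor(c("f", "f", "m", "f", "f", "f"), levels = c("m", "f"))
)

# One explicit colour per group (made-up choice)
group_cols <- c(m = "blue", f = "red")

# Look up the colour of each point by the name of its group
point_cols <- group_cols[as.character(DF$sex)]

plot(x = DF$age, y = DF$bili, xlab = "age", ylab = "bili",
     col = point_cols, pch = 16)

# The legend documents which colour belongs to which group
legend("topright", legend = names(group_cols), col = group_cols, pch = 16)
```

Indexing the colour vector by name keeps the mapping correct even if the order of the factor levels changes.
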
Data Visualization
Continuous variables per group

boxplot(formula = pbc$age ~ pbc$sex, xlab = "sex", ylab = "age",
        cex.lab = 1.9, cex.axis = 1.5)

[Figure: box-and-whisker plots of age (y-axis, 30 to 80) by sex (m, f)]

Data Visualization

Continuous variables per group

pbc_male_bili <- pbc$bili[pbc$sex == "m"]
pbc_female_bili <- pbc$bili[pbc$sex == "f"]
plot(density(x = pbc_male_bili), col = rgb(0,0,1,0.5),
     main = "Density plots", xlab = "bili", ylab = "")
lines(density(x = pbc_female_bili), col = rgb(1,0,0,0.5))
legend(x = 8, y = 0.2, legend = c("male", "female"),
       col = c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)), lty = 1)

Data Visualization

Continuous variables per group


[Figure: "Density plots", density curves of bili (x-axis from −2 to 12, density from 0.00 to 0.25) for male and female]

Data Visualization

Continuous variables per group

pbc_male_bili <- pbc$bili[pbc$sex == "m"]
pbc_female_bili <- pbc$bili[pbc$sex == "f"]
plot(density(x = pbc_male_bili), col = rgb(0,0,1,0.5),
     main = "Density plots", xlab = "bili", ylab = "")
polygon(density(x = pbc_male_bili), col = rgb(0,0,1,0.5),
        border = "blue")
lines(density(x = pbc_female_bili), col = rgb(1,0,0,0.5))
polygon(density(x = pbc_female_bili), col = rgb(1,0,0,0.5),
        border = "red")
legend(x = 8, y = 0.2, legend = c("male", "female"),
       col = c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)), lty = 1)

Data Visualization

Continuous variables per group


[Figure: "Density plots", filled density curves of bili (x-axis from −2 to 12, density from 0.00 to 0.25) for male and female]

Data Visualization
Categorical variables

pbc$status <- factor(x = pbc$status, levels = c(0, 1, 2),
                     labels = c("censored", "transplant", "dead"))
pie(table(pbc$status), col = c("green", "blue", "red"), cex = 2)

[Figure: pie chart of pbc$status with slices censored, transplant and dead]

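A bar plot of the same frequency table is a common alternative to the pie chart, since bar heights are usually easier to compare than angles. The sketch below uses the status codes of the first six patients printed on the earlier slides, typed in by hand; the axis label and the object names `counts` and `pct` are made up for this example. Working on a separate vector also avoids a pitfall of the code above: running the factor() line on pbc$status a second time would turn every value into NA, because the column no longer contains the codes 0, 1 and 2.

```r
# Status codes of the first six patients (typed in by hand)
status <- factor(x = c(2, 0, 2, 2, 1, 2), levels = c(0, 1, 2),
                 labels = c("censored", "transplant", "dead"))

# Frequencies and percentages per category
counts <- table(status)
pct    <- round(100 * prop.table(counts), 1)

# Bar plot of the counts, using the same colours as the pie chart
barplot(counts, col = c("green", "blue", "red"),
        ylab = "Number of patients")

pct
```

The percentages can be added to the plot or reported next to it.
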
Summary

Transformation    Exploration          Visualization

▶ round()         ▶ mean(), sd()       ▶ plot(), legend()
▶ factor()        ▶ median(), IQR()    ▶ hist()
▶ order()         ▶ table()            ▶ barchart()
▶ reshape()                            ▶ boxplot()
                                       ▶ xyplot(), ggplot()
                                       ▶ par()
