K22 Data Structures
1 Attributes
2 Vectors
3 Factors
4 Matrices
5 Arrays
6 Lists
7 data.frames
In the last chapter, we learned about the six basic data types in R. Now we
take a look at data structures. In particular, we are interested in:
how to generate the various data structures,
what they are used for,
what their characteristics are,
how to subset them.
Before turning to the data structures themselves, we first have to look at
how R stores metadata on its objects: attributes.
Attributes
Special attributes I
Special attributes II
Special attributes IV
dim(x)
## [1] 2 3
names(x)
## NULL
dimnames(x)
class(x)
Functions changing the output type or length usually lead to the loss
of attributes:
attributes(sum(x))
## NULL
If both objects of a binary operation are of equal length, the attributes of
both are copied to the result. Where both carry the same attribute, the one
of the first object is preserved:
y <- 2:7
attr(y, "description") <- "vector of the numbers 2 to 7"
y + x
## [1] 3 5 7 9 11 13
## attr(,"description")
## [1] "vector of the numbers 2 to 7"
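A minimal sketch of how attributes are copied and lost. It assumes that x is a plain integer vector; the attribute name "unit" and its value are made up for illustration.

```r
# Assumption: x is a plain integer vector; "unit" is an illustrative attribute.
x <- 1:6
attr(x, "description") <- "vector of the numbers 1 to 6"
attr(x, "unit") <- "cm"

y <- 2:7
attr(y, "description") <- "vector of the numbers 2 to 7"

z <- y + x
attr(z, "description")   # taken from y, the first operand
attr(z, "unit")          # only x carries it, so it is copied from x

# Changing the length of the result drops the attributes.
attributes(sum(x))       # NULL
attributes(x[1:3])       # NULL
```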
Vectors
What is a vector?
The type of an atomic vector is one of the basic data types, i.e. double,
integer, character, logical, complex or raw. Atomic vectors can
only contain values of a single data type. Combining multiple data
types in a single vector leads to (undesired) coercion, see Paradigm 3.
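The coercion mentioned above can be observed directly. A small sketch with made-up values:

```r
# Mixing types in c() coerces everything to the most flexible type involved:
# logical < integer < double < character.
typeof(c(TRUE, 1L))      # "integer"
typeof(c(1L, 2.5))       # "double"
typeof(c(1, "a"))        # "character"
mixed <- c(TRUE, 2, "three")
mixed                    # "TRUE" "2" "three"
```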
Generating a vector I
Generating a vector II
The names attribute can be set directly when generating the vector:
c(one = 1, two = 2, three = 3)
## one two three
## 1 2 3
Nested vectors are flattened to a single level:
c(c(1, 2), c(3, 4))
## [1] 1 2 3 4
When using a binary function with input vectors of unequal length, the
shorter vector is replicated to match the length of the longer vector.
This is called recycling.
1:2 + 3:6
## [1] 4 6 6 8
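A short sketch of what recycling amounts to. The variable names are illustrative; rep_len() is used only to make the replication explicit.

```r
short <- 1:2
long <- 3:6

# Recycling is equivalent to repeating the shorter vector explicitly.
identical(short + long, rep_len(short, length(long)) + long)  # TRUE

# If the longer length is not a multiple of the shorter one, R still
# recycles but issues a warning. Here the warning message is captured.
res <- tryCatch(1:2 + 1:3, warning = function(w) conditionMessage(w))
res
```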
Paradigm 4
Many functions in (base) R are vectorized.
For vector inputs of different length, the shorter vector will be replicated to
the length of the longer vector.
## [1] TRUE
Subsetting - Properties I
Subsetting - Properties II
What is the result of this call and how can it be explained?
vec[c(TRUE, NA)]
## one <NA> three
## 1 NA 3
What are the results of these calls and how can they be explained?
vec[NA]
## <NA> <NA> <NA>
## NA NA NA
vec[NA_integer_]
## <NA>
## NA
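The results differ because of the type of the index. A short sketch, assuming vec is the named vector c(one = 1, two = 2, three = 3):

```r
vec <- c(one = 1, two = 2, three = 3)  # assumed definition of vec

typeof(NA)            # "logical": a logical index is recycled to length 3
typeof(NA_integer_)   # "integer": a single (unknown) position

length(vec[NA])           # 3
length(vec[NA_integer_])  # 1

# c(TRUE, NA) is recycled to c(TRUE, NA, TRUE).
identical(vec[c(TRUE, NA)], vec[c(TRUE, NA, TRUE)])  # TRUE
```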
In the case of identical names, only the first element with the
respective name is returned.
Subsetting - Properties IV
Subsetting with a positive index that does not exist always yields an NA.
And what happens here and why?
vec[-4]
## one two three
## 1 2 3
Dropping an index that does not exist drops nothing, so the vector is
returned unchanged.
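Both cases side by side, again assuming vec is c(one = 1, two = 2, three = 3):

```r
vec <- c(one = 1, two = 2, three = 3)  # assumed definition of vec

vec[4]                   # NA with name <NA>: position 4 does not exist
vec[c(1, 4)]             # existing positions are returned as usual
identical(vec[-4], vec)  # TRUE: there is nothing to drop
```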
Subsetting - Summary I
Subsetting - Summary II
Logical values
A logical index vector selects all positions at which it is TRUE.
A logical index vector should have the same length as the vector to be
subsetted.
If it is shorter, it is recycled to the required length.
If it is longer, the vector being subsetted is treated as if it were extended
with NAs, so every surplus TRUE returns an NA.
Nothing
A blank index returns all elements.
Zero
Subsetting with zero returns a vector of length 0.
names attribute
Elements with a specific name can be subsetted.
Just like with integers, vector components can be selected repeatedly.
The index vector can be of arbitrary length.
Names must be specified exactly.
In the case of equal names, only the first element with the respective
name is returned:
x <- c(a = 1, a = 2)
x[c("a", "a")]
## a a
## 1 1
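The logical, blank and zero cases of the summary can be checked with a small sketch. The vector v and its names are made up for illustration.

```r
v <- c(a = 1, b = 2, c = 3)  # illustrative vector

v[c(TRUE, FALSE)]        # index recycled to TRUE FALSE TRUE: elements a and c
too_long <- v[rep(TRUE, 4)]
too_long                 # 1 2 3 NA: the surplus TRUE yields an NA
identical(v[], v)        # TRUE: a blank index returns everything
length(v[0])             # 0
```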
Factors
Matrices
Matrices are always stored by column - even if the values are inserted
by row.
Since matrices are just vectors with a dim attribute, they can only contain
data of a single data type as well.
Single rows and/or columns of a matrix can be named.
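A short sketch of the column-wise storage and of naming only one dimension. The row names "r1" and "r2" are made up for illustration.

```r
# Values are inserted by row, but only the rows are named.
m <- matrix(1:6, nrow = 2, byrow = TRUE,
            dimnames = list(c("r1", "r2"), NULL))
m
as.vector(m)   # 1 4 2 5 3 6: the underlying vector is stored by column
attributes(m)  # dim and dimnames are all that distinguishes m from a vector
```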
After all, the data length clearly does not fit the specified dimensions.
But since a multiple of the data length matches the dimensions, no
warning is issued.
If one dimension is left unspecified, it is chosen so that all entries fit
into the matrix once. If the data does not fill the matrix exactly, the
remaining cells are filled by recycling and a warning is issued:
matrix(1:5, nrow = 2)
## Warning in matrix(1:5, nrow = 2): data length [5] is not a sub-multiple or multiple of the number of rows [2]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 1
However, recycling does not work when the dim attribute is assigned, since
no new object is generated. Instead, this is an attempt to modify an already
existing object:
x <- 1:5
dim(x) <- c(2, 3)
## Error in dim(x) <- c(2, 3): dims [product 6] do not match the length of object [5]
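One possible way around this error, sketched under the assumption that recycling is actually wanted, is to recycle explicitly before setting the dimensions:

```r
x <- 1:5
# Recycle explicitly to the required length, then set the dimensions.
x <- rep_len(x, 2 * 3)
dim(x) <- c(2, 3)
x   # same layout as matrix(1:5, nrow = 2), but without a warning
```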
If the data length is larger than the number of matrix entries, only the
first data points are used - without issuing a warning!
matrix(1:10, nrow = 2, ncol = 2)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Calculations over columns are much more efficient than calculations over
rows:
library(microbenchmark)
set.seed(1908)
rnd <- rnorm(1e3 * 1e3)
f1 <- function() matrix(rnd, nrow = 1e3)
f2 <- function() matrix(rnd, nrow = 1e3, byrow = TRUE)
microbenchmark(f1(), f2())
## Unit: milliseconds
## expr min lq mean median uq max neval
## f1() 1.613101 1.775902 2.765082 1.961400 2.342901 14.4933 100
## f2() 4.868101 5.195851 6.582729 5.477101 6.774001 16.3230 100
## Unit: microseconds
## expr min lq mean median uq max neval
## f3() 851.001 902.501 986.250 946.3005 1012.901 1772.602 100
## f4() 1965.001 2019.501 2163.112 2091.8015 2227.451 2914.700 100
Subsetting matrices I
Subsetting matrices II
Subsetting matrices IV
Subsetting matrices V
Arrays
Lists
Generating lists I
Generating lists II
Compare the vector() command for atomic vectors: it is the same function.
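A brief sketch of vector() used for both lists and atomic vectors:

```r
vector("list", 2)        # a list of two NULL elements
vector("numeric", 2)     # 0 0
vector("character", 2)   # "" ""
is.vector(list(1, "a"))  # TRUE: lists are vectors, too
```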
Subsetting lists
## $b
##           [,1]       [,2]
## [1,]  1.711441 -0.4721664
## [2,] -0.602908 -0.6353713
##
## $c
##            [,1]       [,2]
## [1,] -0.2857736  1.2276303
## [2,]  0.1381082 -0.8017795
## $c
##            [,1]       [,2]
## [1,] -0.2857736  1.2276303
## [2,]  0.1381082 -0.8017795
## named list()
# Nothing / Blank
li[]
li[[2]]
li[["b"]]
li[[-1]]
li[[0]]
li$c
##            [,1]       [,2]
## [1,] -0.2857736  1.2276303
## [2,]  0.1381082 -0.8017795
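The difference between the three subsetting operators can be sketched as follows. The list below is a placeholder with the same element names a, b and c; its values are not those used on the slides.

```r
li <- list(a = 1:3, b = "text", c = matrix(1:4, nrow = 2))  # placeholder list

class(li["b"])              # "list": single brackets always return a list
class(li[["b"]])            # "character": double brackets return the element
identical(li$c, li[["c"]])  # TRUE
names(li[-1])               # "b" "c"
length(li[0])               # 0: an empty (named) list
```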
Partial matching I
Partial matching means that using the unique beginning of a name suffices
for subsetting:
li2 <- list(abc = 2, def = 7, deg = 10)
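A short sketch of how partial matching behaves for li2 with the $ operator and with double square brackets:

```r
li2 <- list(abc = 2, def = 7, deg = 10)

li2$a                      # 2: "a" uniquely identifies "abc"
li2$de                     # NULL: "de" is ambiguous (def, deg)
li2[["a"]]                 # NULL: [[ matches exactly by default
li2[["a", exact = FALSE]]  # 2: partial matching switched on
```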
Partial matching II
All three subsetting variants can also be used to append new elements to
an already existing list:
With the $ operator and the double square brackets, new elements
can be appended by assigning them to unused names or indices:
li3 <- list(a = 5, b = 4)
li3$d <- "Hello"
li3[[5]] <- FALSE
str(li3)
## List of 5
## $ a: num 5
## $ b: num 4
## $ d: chr "Hello"
## $ : NULL
## $ : logi FALSE
When using the single square brackets, the new elements must be supplied
as a list:
li3[c("e", "f")] <- list(0, 1)
str(li3)
## List of 7
## $ a: num 5
## $ b: num 4
## $ d: chr "Hello"
## $ : NULL
## $ : logi FALSE
## $ e: num 0
## $ f: num 1
The three subsetting methods can also be used to delete list elements. In
order to do that, the respective list element must be set to NULL:
li3$f <- NULL
li3[[6]] <- NULL
li3[c("b", "d")] <- NULL
li3
## $a
## [1] 5
##
## [[2]]
## NULL
##
## [[3]]
## [1] FALSE
## [[1]]
## [[1]]$a
## [1] 1
##
##
## [[2]]
## [[2]]$a
## [1] "a"
## $a
## [1] 1
##
## $a
## [1] "a"
Side note: As we see here, names don’t have to be unique for lists either!
Subsetting with the non-unique name always returns the first element:
c(li5, li6)$a
## [1] 1
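A sketch of the difference between list() and c() for combining lists. The definitions of li5 and li6 are an assumption inferred from the printed output.

```r
# Assumed definitions, inferred from the printed output.
li5 <- list(a = 1)
li6 <- list(a = "a")

length(list(li5, li6))  # 2: a list that contains two lists
names(c(li5, li6))      # "a" "a": one flat list with duplicated names
c(li5, li6)$a           # 1: the first match
c(li5, li6)[[2]]        # "a": the second element is reachable by position
```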
Since lists are vectors, we can do some 'interesting' things with them. For
example, we can set a dim attribute:
li7 <- list(a = matrix(rnorm(4), nrow = 2), b = matrix(rnorm(4), nrow = 2),
c = matrix(rnorm(4), nrow = 2), d = matrix(rnorm(4), nrow = 2))
dim(li7) <- c(2, 2)
li7
## [,1] [,2]
## [1,] numeric,4 numeric,4
## [2,] numeric,4 numeric,4
## [,1] [,2]
## [1,] "Hello" FALSE
## [2,] 5 3+5i
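A list with a dim attribute can be indexed like a matrix. A minimal sketch; the name lmat and the way the list is built are assumptions chosen to match the second output above.

```r
# Assumed construction; any mix of types works because this is a list.
lmat <- list("Hello", 5, FALSE, 3 + 5i)
dim(lmat) <- c(2, 2)

lmat[[1, 2]]     # FALSE: the element in row 1, column 2
lmat[[2, 2]]     # 3+5i
typeof(lmat)     # "list"
is.matrix(lmat)  # TRUE
```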
Beware!
Just because we can play around like this with lists, it does not mean
that we should use such data structures!
Many functions in R are not designed to work with such complex data
structures.
Besides leading to incomprehensible code, the use of such data
structures can also result in errors that are very hard to find!
data.frames
class(iris)
## [1] "data.frame"
typeof(iris)
## [1] "list"
Examples:
## x y z
## 1 1 2 1
## 2 2 3 1, 2
## 3 3 4 1, 2, 3
Beware! Just like when we were playing around with lists, caution is
advised here as well! Just because something technically works, does not
mean you should use it!
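A data.frame with a list column, like the one printed above, could be built as follows. This is a sketch using I(); it is not necessarily the call used on the slides.

```r
# I() keeps the list as a single column instead of splitting it up.
df <- data.frame(x = 1:3, y = 2:4, z = I(list(1, 1:2, 1:3)))
df
sapply(df, typeof)  # x and y are integer, z is a list
df$z[[3]]           # 1 2 3
```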
Daniel Horn & Sheila Görz, Advanced R, Summer Semester 2022
Further data structures
Aside from the data structures already mentioned, there are further, less
frequently used data structures. They are not discussed here in detail.
Two examples:
The contingency table:
tab <- table(sample(10, 100, replace = TRUE))
tab
##
## 1 2 3 4 5 6 7 8 9 10
## 11 7 12 13 9 11 6 10 14 7
class(tab)
## [1] "table"
Generally, it is possible to build your own data structures. But chances are
they will be based on the data structures presented here.
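As a minimal sketch of such a self-built data structure: the class name "interval" and the constructor new_interval() are hypothetical and serve only to show that the result is still one of the structures presented here.

```r
# Hypothetical constructor for a small S3 structure built on a list.
new_interval <- function(lower, upper) {
  stopifnot(is.numeric(lower), is.numeric(upper), lower <= upper)
  structure(list(lower = lower, upper = upper), class = "interval")
}

iv <- new_interval(0, 1)
class(iv)   # "interval"
typeof(iv)  # "list": the underlying structure is still a list

# A contingency table is likewise an integer array with a class attribute.
typeof(table(c("a", "b", "a")))  # "integer"
```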