Unit 5 - DS - 1st year
Unit - 5
Subsetting R objects:
There are three operators that can be used to extract subsets of R objects.
● The [ operator always returns an object of the same class as the original. It can be used to
select multiple elements of an object.
● The [[ operator is used to extract elements of a list or a data frame. It can only be used to
extract a single element and the class of the returned object will not necessarily be a list
or data frame.
● The $ operator is used to extract elements of a list or data frame by literal name. Its
semantics are similar to those of [[.
Vectors are basic objects in R and they can be subsetted using the [ operator.
> x <- c("a", "b", "c", "c", "d", "a")
> x[1] ## Extract the first element
[1] "a"
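The difference between [, [[ and $ is easiest to see on a list. The following is a minimal sketch; the list person and its contents are hypothetical and serve only as an illustration.

```r
# Hypothetical list used only for illustration
person <- list(name = "Asha", scores = c(81, 92, 77))

sub_list <- person["scores"]    # [ keeps the container: a list of length 1
scores   <- person[["scores"]]  # [[ extracts the element itself: a numeric vector
by_name  <- person$name         # $ extracts an element by its literal name

field    <- "scores"
computed <- person[[field]]     # [[ accepts a computed name; $ does not

print(class(sub_list))
print(class(scores))
print(by_name)
```

Note that person$field would look for an element literally called field, which is why [[ is the operator to use when the name is stored in a variable.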
Matrices can be subsetted in the usual way with (i,j) type indices. Here, we create a simple 2×3
matrix with the matrix function.
> x <- matrix(1:6, 2, 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
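As a short sketch of (i,j) indexing on the same 2×3 matrix: leaving an index empty selects the whole row or column. The variable names on the left are illustrative.

```r
x <- matrix(1:6, 2, 3)

elem <- x[1, 2]   # element in row 1, column 2
row1 <- x[1, ]    # the whole first row, returned as a vector
col2 <- x[, 2]    # the whole second column, returned as a vector

# drop = FALSE keeps the result as a matrix instead of a vector
row1_mat <- x[1, , drop = FALSE]

print(elem)
print(row1)
print(col2)
print(dim(row1_mat))
```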
Vectorised Operations:
Many operations in R are vectorized, meaning that operations occur in parallel in certain R
objects. This allows you to write code that is efficient, concise, and easier to read than in non-
vectorized languages.
The simplest example is when adding two vectors together.
> x <- 1:4
> y <- 6:9
> z <- x + y
> z
[1] 7 9 11 13
Matrix operations are also vectorized, making for nicely compact notation. This way, we can do
element-by-element operations on matrices without having to loop over every element.
> x <- matrix(1:4, 2, 2)
> y <- matrix(rep(10, 4), 2, 2)
> ## element-wise multiplication
> x * y
     [,1] [,2]
[1,]   10   30
[2,]   20   40
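To see what vectorization saves, the sketch below computes the same vector sum twice: once with an explicit loop and once with the vectorized + operator. The names z_loop and z_vec are illustrative.

```r
x <- 1:4
y <- 6:9

# Non-vectorized version: visit each element in turn
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: one expression, no explicit loop
z_vec <- x + y

print(z_loop)
print(z_vec)
```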
2. filter: Filter rows with filter(). It extracts a subset of rows from a data frame based on
logical conditions.
library(dplyr)  # filter() here is the dplyr verb, not stats::filter()
x <- filter(iris, Sepal.Length > 5.843)
head(x)
summary(x)
3. arrange: Arrange rows with arrange(). It helps to reorder rows of a data frame.
x <- arrange(mtcars, cyl)
head(select(x, mpg, cyl), 10)
tail(select(x, mpg, cyl), 10)
5. mutate: Add new columns with mutate(). It helps to add new variables/columns or
transform existing variables.
sleep_data <- mutate(sleep, extra_derived = extra - mean(extra, na.rm = TRUE))
str(sleep_data)
head(sleep_data)
7. %>%: the “pipe” operator is used to connect multiple verb actions together into a
pipeline.
swiss %>% select(Examination, Education) %>% head
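The verbs above can be combined in a single pipeline. The following is a minimal sketch, assuming the dplyr package is installed; the column name mpg_centered and the choice of the built-in mtcars data set are illustrative.

```r
library(dplyr)  # assumes the dplyr package is installed

result <- mtcars %>%
  filter(cyl == 4) %>%                      # keep only 4-cylinder cars
  arrange(desc(mpg)) %>%                    # highest mpg first
  mutate(mpg_centered = mpg - mean(mpg)) %>% # new column derived from mpg
  select(mpg, cyl, mpg_centered)            # keep three columns

print(head(result, 3))
```

Each verb receives the data frame produced by the previous step, so the pipeline reads top to bottom in the order the operations are applied.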
Control Statements in R:
If condition:
The statement inside an if block is carried out only if the test expression evaluates to
TRUE. R has no then keyword: the statement to run follows the condition directly.
Syntax:
if (test_expression) {
  statement
}
if-else Condition
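An if…else statement contains the same elements as an if statement (see the preceding
section), with some extra elements: the keyword else, placed after the first block, and a
second block that is carried out only if the condition is FALSE.
Syntax:
if (test_expression) {
  statement
} else {
  statement
}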
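A small worked sketch of the if…else form follows. The function name classify_temperature and the threshold of 30 are hypothetical.

```r
# Hypothetical helper: label a temperature given in degrees Celsius
classify_temperature <- function(temp_c) {
  if (temp_c > 30) {
    label <- "hot"
  } else {
    label <- "not hot"
  }
  label
}

print(classify_temperature(35))
print(classify_temperature(12))
```

At the top level of a script or the console, else has to appear on the same line as the closing brace of the if block; otherwise R treats the if statement as already complete.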
for Loop in R
A loop is a sequence of instructions that is repeated until a certain condition is
reached. The keywords for, while and repeat, together with the additional clauses break
and next, are used to construct loops.
values <- c(1, 2, 3, 4, 5)
for (id in 1:5) {
  print(values[id])
}
Nested Loop
A loop placed inside the body of another loop is a nested loop. The foreach package offers
a looping construct similar to the standard for loop, which makes it easy to convert a
“for” loop to a foreach loop. Unlike many parallel programming packages for R, foreach
doesn’t require the body of the “for” loop to be turned into a function. Its %:% operator
is called the nesting operator because it is used to create nested foreach loops.
Example:
print(mat[id1, id2])
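The fragment above uses the names mat, id1 and id2 but the surrounding loops are missing. One plausible shape for such a nested loop, written with plain for loops, is sketched here; the matrix contents and the visited vector are assumptions made for illustration.

```r
mat <- matrix(1:6, nrow = 2, ncol = 3)  # hypothetical matrix
visited <- integer(0)                   # records the order of the visits

for (id1 in seq_len(nrow(mat))) {       # outer loop: rows
  for (id2 in seq_len(ncol(mat))) {     # inner loop: columns
    print(mat[id1, id2])
    visited <- c(visited, mat[id1, id2])
  }
}
```

The inner loop runs to completion for every single pass of the outer loop, so the elements are visited row by row.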
while Loop
The format is while(cond) expr, where cond is the condition to test and expr is an
expression.
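The condition must evaluate to TRUE or FALSE before the loop starts. If cond uses a
variable that has not been assigned yet (for example, a variable called response that
holds a user's answer), R complains that it does not know the variable, so assign it
before the loop. If the condition is FALSE at the first check (for example, the right
answer is given at the first attempt), the loop body is not executed at all.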
Example:
val = 2.987
print(c(val,val-2,val-1))
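A minimal sketch of a complete while loop follows; the variable names count, total and runs are hypothetical. The second loop shows that the body is skipped entirely when the condition is FALSE at the first check.

```r
count <- 1   # the variable used in the condition is assigned before the loop
total <- 0

while (count <= 5) {
  total <- total + count
  count <- count + 1   # without this update the loop would never end
}
print(total)

runs <- 0
while (count < 0) {    # FALSE at the first check, so the body never runs
  runs <- runs + 1
}
print(runs)
```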
break Statement
We use a break statement inside a loop (repeat, for, while) to stop the iterations and
pass control to the first statement after the loop. In a nested loop, where one loop sits
inside another, break exits only the innermost loop that is being evaluated.
repeat Loop
A repeat loop is used to iterate over a block of code multiple times. There
is no condition check in a repeat loop to exit the loop. We must put a condition
explicitly inside the body of the loop and use the break statement to exit the loop.
Failing to do so will result in an infinite loop.
Syntax:
repeat {
  statement
  if (condition) {
    break
  }
}
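As a short sketch of the repeat pattern with an explicit exit condition (the counter i and the limit of 3 are illustrative):

```r
i <- 0
repeat {
  i <- i + 1
  print(i)
  if (i >= 3) {   # the exit condition lives inside the body
    break
  }
}
```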
next Statement
next jumps to the next cycle without completing the current iteration. In fact, it jumps
to the evaluation of the condition that controls the current loop. The next statement
enables you to skip the current iteration of a loop without terminating the loop.
Example:
x <- 1:4
for (i in x) {
  if (i == 2) {
    next
  }
  print(i)
}
return Statement:
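Many times, we need a function to do some processing and return the
result. This is accomplished with the return() statement in R.
Syntax:
return(expression)
Example (the body of a function that takes x; the branch values are illustrative
because the original bodies are missing):
if (x > 0) {
  result <- "Positive"
} else if (x < 0) {
  result <- "Negative"
} else {
  result <- "Zero"
}
return(result)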
R - Functions
A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions. The function in turn
performs its task and returns control to the interpreter as well as any result which may be stored
in other objects.
Function Components
The different parts of a function are −
● Function name
● Arguments
● Function body
● Return value
R has many in-built functions which can be directly called in the program without defining them
first. We can also create and use our own functions referred to as user defined functions.
Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc. They
are directly called by user written programs. You can refer to the most widely used R functions.
# Create a sequence of numbers from 32 to 44.
print(seq(32, 44))
# Find the mean of the numbers from 25 to 82.
print(mean(25:82))
# Find the sum of the numbers from 41 to 68.
print(sum(41:68))
User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and
once created they can be used like the built-in functions. Below is an example of how a
function is created and used.
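# Create a function to print squares of numbers in sequence.
# (The function name is illustrative; the original definition line is missing.)
print_squares <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}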
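Once defined, a user-defined function is called exactly like a built-in one. The sketch below defines a variant that returns the squares instead of printing them, then calls it; the name squares_up_to is hypothetical.

```r
# Hypothetical function: return the squares of 1, 2, ..., a as a vector
squares_up_to <- function(a) {
  squares <- seq_len(a)^2   # vectorized, so no explicit loop is needed
  return(squares)
}

# Call the function with 6 as its argument
result <- squares_up_to(6)
print(result)
```

Returning the values rather than printing them lets the caller store the result and reuse it in later computations.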
Scoping rules of R:
The scoping rules for R are the main feature that makes it different from the original S
language. The scoping rules of a language determine how a value is associated with a
free variable in a function. R uses lexical scoping or static scoping. An alternative to
lexical scoping is dynamic scoping which is implemented by some languages. Lexical
scoping turns out to be particularly useful for simplifying statistical computations.
Free variables are variables that are neither formal arguments nor local variables
(assigned inside the function body).
Lexical scoping in R means that the values of free variables are searched for in the
environment in which the function was defined.
Typically, a function is defined in the global environment, so that the values of free
variables are just found in the user’s workspace. This behavior is logical for most people
and is usually the “right thing” to do. However, in R you can have functions defined
inside other functions (languages like C don’t let you do this). Now things get interesting
—in this case the environment in which a function is defined is the body of another
function!
Here is an example of a function that returns another function as its return value.
Remember, in R functions are treated like any other object and so this is perfectly valid.
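> ## The outer function's name is assumed; its definition line is missing here.
> make.power <- function(n) {
+   pow <- function(x) {
+     x^n
+   }
+   pow
+ }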
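The effect of lexical scoping can be seen by calling such a function. In this sketch the outer function is called make.power (an assumed name); inside pow, n is a free variable whose value is found in the environment where pow was defined.

```r
make.power <- function(n) {
  pow <- function(x) {
    x^n          # n is a free variable inside pow
  }
  pow
}

cube   <- make.power(3)
square <- make.power(2)

print(cube(3))
print(square(3))

# Each returned function carries its own defining environment
print(ls(environment(cube)))
print(get("n", environment(cube)))
```

cube and square share the same code but remember different values of n, because each call to make.power creates a new environment.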
There are numerous other languages that support lexical scoping, including
● Scheme
● Perl
● Python
● Common Lisp (all languages converge to Lisp, right?)
Lexical scoping in R has consequences beyond how free variables are looked up. In
particular, it’s the reason that all objects must be stored in memory in R. This is because
all functions must carry a pointer to their respective defining environments, which could
be anywhere. In the S language (R’s close cousin), free variables are always looked up
in the global workspace, so everything can be stored on the disk because the “defining
environment” of all functions is the same.
Naming conventions
R has no naming conventions that are generally agreed upon. As a newcomer to R it’s
useful to decide which naming convention to adopt.
Syntax
● Place spaces around all infix operators (=, +, -, <-, etc.).
● Use <-, not =, for assignment.
● Use comments to mark off sections of code.
● Comment your code with care. Comments should explain the why, not the what.
● Each line of a comment should begin with the comment symbol and a single space.
● An opening curly brace should never go on its own line and should always be followed by a
new line; a closing curly brace should always go on its own line, unless followed by else.
● Always indent the code inside the curly braces.
● Keep your lines less than 80 characters. This is the amount that will fit comfortably on a
printed page at a reasonable size. If you find you are running out of room, this is probably an
indication that you should encapsulate some of the work in a separate function.
Writing for and while loops is useful when programming but not particularly easy when
working interactively on the command line. Multi-line expressions with curly braces are
just not that easy to sort through when working on the command line. R has some
functions which implement looping in a compact form to make your life easier.
lapply()
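The lapply() function does the following simple series of operations:
● it loops over a list, iterating over each element in that list
● it applies a function to each element of the list (a function that you specify)
● and returns a list (the l is for “list”).
For example, applying a summary function such as mean() to a list with elements a, b, c
and d returns a list with one result per element:
$a
[1] 2.5

$b
[1] 0.248845

$c
[1] 0.9935285

$d
[1] 5.051388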
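A reproducible sketch of lapply() on a small list follows. The list contents are hypothetical and fixed, so the results do not depend on random numbers.

```r
# Hypothetical list with fixed values
x <- list(a = 1:4, b = c(2, 4, 6), c = c(10, 20))

# Apply mean() to each element; the result is always a list
means <- lapply(x, mean)
str(means)
```

The names of the input list are carried over to the result, so means$a holds the mean of x$a.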
sapply()
The sapply() function behaves similarly to lapply(); the only real difference is in the
return value. sapply() will try to simplify the result of lapply() if possible. Essentially,
sapply() calls lapply() on its input and then applies the following algorithm:
● If the result is a list where every element is length 1, then a vector is returned.
● If the result is a list where every element is a vector of the same length (> 1), a
matrix is returned.
● If it can’t figure things out, a list is returned.
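The three simplification cases can be sketched on one small list. The list x and the result names v, m and l are hypothetical.

```r
x <- list(a = 1:4, b = c(2, 4, 6), c = c(10, 20))

# Every result has length 1, so a named vector is returned
v <- sapply(x, mean)

# Every result has length 2, so a matrix with one column per element is returned
m <- sapply(x, range)

# Results have different lengths, so a list is returned
l <- sapply(x, function(e) e[e > 3])

print(v)
print(m)
print(class(l))
```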
split()
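The split() function takes a vector or other object and splits it into groups determined
by a factor or list of factors.
> str(split)
function (x, f, drop = FALSE, ...)
Splitting a vector by a factor with levels 1, 2 and 3 and then summarising each group
with lapply() gives a list with one element per level:
$`1`
[1] 0.07478098

$`2`
[1] 0.5266905

$`3`
[1] 1.458703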
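A deterministic sketch of the split-then-apply pattern follows. The data values and group labels are hypothetical.

```r
values <- c(5, 7, 9, 10, 20, 30)
groups <- factor(c("low", "low", "low", "high", "high", "high"))

# split() returns a list with one element per factor level
pieces <- split(values, groups)
print(pieces)

# lapply() then summarises each group separately
group_means <- lapply(pieces, mean)
print(group_means)
```

The elements of the result are ordered by factor level, which is alphabetical here ("high" before "low").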
tapply
tapply() is used to apply a function over subsets of a vector. It can be thought of as a
combination of split() and sapply() for vectors only. I’ve been told that the “t” in
tapply() refers to “table”, but that is unconfirmed.
> str(tapply)
function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
> f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
1 2 3
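A small sketch of tapply() with fixed data, so the group means can be checked by hand. The vector x and the two-level factor f are hypothetical; gl() is used to build the factor.

```r
x <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)   # factor with levels 1 and 2, each repeated 3 times

# Mean of x within each level of f
group_means <- tapply(x, f, mean)
print(group_means)

# The same result obtained with split() followed by sapply()
same_means <- sapply(split(x, f), mean)
print(same_means)
```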
apply()
The apply() function is used to evaluate a function (often an anonymous one) over the
margins of an array. It is most often used to apply a function to the rows or columns of a
matrix (which is just a 2-dimensional array). However, it can be used with general
arrays, for example, to take the average of an array of matrices. Using apply() is not
really faster than writing a loop, but it works in one line and is highly compact.
> str(apply)
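As a minimal sketch of apply() on the margins of a matrix: the second argument selects the margin, 1 for rows and 2 for columns. The matrix m is hypothetical.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums  <- apply(m, 1, sum)    # margin 1: one result per row
col_means <- apply(m, 2, mean)   # margin 2: one result per column

print(row_sums)
print(col_means)
```

For these particular summaries, the built-in functions rowSums() and colMeans() give the same values and are usually faster.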
mapply()
The mapply() function is a multivariate apply of sorts which applies a function in parallel
over a set of arguments. Recall that lapply() and friends only iterate over a single R
object. What if you want to iterate over multiple R objects in parallel? This is what
mapply() is for.
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4
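Iterating over two objects in parallel can be sketched as follows. The anonymous function and the input vectors are illustrative; the second call pairs each value with a different repeat count.

```r
# Pair up the elements: (1, 4), (2, 5), (3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
print(sums)

# Equivalent to list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
reps <- mapply(rep, 1:4, 4:1)
print(lengths(reps))
```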
Debugging
R has a number of ways to indicate to you that something’s not right. There are different
levels of indication that can be used, ranging from mere notification to fatal error.
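Executing any function in R may result in the following conditions.
● message: a generic notification produced by the message() function; execution of the
function continues.
● warning: an indication that something is wrong but not necessarily fatal, produced by
the warning() function; execution of the function continues.
● error: a fatal problem produced by the stop() function; execution of the function stops.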
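The range from mere notification to fatal error can be sketched with the base functions message(), warning() and stop(), caught here with tryCatch(). The function check_value and its messages are hypothetical.

```r
# Hypothetical function that signals each kind of condition
check_value <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")     # error: execution stops here
  }
  if (x < 0) {
    warning("x is negative")      # warning: execution would normally continue
  }
  message("checked x")            # message: a plain notification
  x
}

# Catch the error and turn it into a string
outcome <- tryCatch(
  check_value("a"),
  error = function(e) paste("error:", conditionMessage(e))
)

# Catch the warning in the same way
warned <- tryCatch(
  check_value(-1),
  warning = function(w) paste("warning:", conditionMessage(w))
)

print(outcome)
print(warned)
```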
Simulation
Generating Random Numbers
Simulation is an important (and big) topic for both statistics and for a variety of other
areas where there is a need to introduce randomness. Sometimes you want to
implement a statistical procedure that requires random number generation or sampling
(e.g. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes
you want to simulate a system and random number generators can be used to model
random inputs.
R comes with a set of pseudo-random number generators that allow you to simulate
from well-known probability distributions like the Normal, Poisson, and binomial. Some
example functions for probability distributions in R:
● rnorm: generate random Normal variates with a given mean and standard
deviation.
● dnorm: evaluate the Normal probability density (with a given mean/SD) at a point
(or vector of points).
● pnorm: evaluate the cumulative distribution function for a Normal distribution.
● rpois: generate random Poisson variates with a given rate.
For each probability distribution there are typically four functions available, whose names
start with “r”, “d”, “p”, and “q”. The “r” function is the one that actually simulates random
numbers from that distribution. The four prefixes are
● d for density
● r for random number generation
● p for cumulative distribution
● q for quantile function (inverse cumulative distribution)
If you’re only interested in simulating random numbers, then you will likely only need the
“r” functions and not the others. However, if you intend to simulate from arbitrary
probability distributions using something like rejection sampling, then you will need the
other functions too.
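The four prefixes can be sketched for the Normal distribution, together with rpois for the Poisson. The use of set.seed() to make the draws reproducible is an addition not discussed above, and the parameter values are illustrative.

```r
# Fixing the seed makes the pseudo-random draws reproducible
set.seed(1)
draws <- rnorm(5, mean = 10, sd = 2)

set.seed(1)
draws_again <- rnorm(5, mean = 10, sd = 2)   # same seed, same numbers

density_at_zero <- dnorm(0)     # density of the standard Normal at 0
prob_below_zero <- pnorm(0)     # P(X <= 0) for the standard Normal
median_value    <- qnorm(0.5)   # quantile function, the inverse of pnorm

counts <- rpois(5, lambda = 3)  # five Poisson variates with rate 3

print(identical(draws, draws_again))
print(c(density_at_zero, prob_below_zero, median_value))
print(counts)
```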