Unit 5 - DS - 1st year

Unit 5 covers subsetting R objects using different operators, vectorized operations, and managing data frames with the dplyr package. It also discusses control statements, R functions, scoping rules, naming conventions, syntax guidelines, and looping on the command line. Key functions like lapply, sapply, and tapply are introduced for efficient data manipulation.
Unit 5

Prepared by: Varun Rao (Dean, Data Science & AI)


For: Data Science - 1st years

Unit - 5

Subsetting R objects:

There are three operators that can be used to extract subsets of R objects.
● The [ operator always returns an object of the same class as the original. It can be used to
select multiple elements of an object
● The [[ operator is used to extract elements of a list or a data frame. It can only be used to
extract a single element and the class of the returned object will not necessarily be a list
or data frame.
● The $ operator is used to extract elements of a list or data frame by literal name. Its
semantics are similar to that of [[ ]].
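
As a rough illustration of how the three operators differ on a list, consider the sketch below. The list and its element names (foo, bar, baz) are made up for this example.

```r
# Illustrative list; the element names are hypothetical
x <- list(foo = 1:4, bar = 0.6, baz = "hello")

sub <- x[c("foo", "bar")]   # [ keeps the class: the result is still a list
class(sub)                  # "list"

one <- x[["foo"]]           # [[ extracts the element itself
class(one)                  # "integer"

x$bar                       # $ extracts by literal name

name <- "baz"
x[[name]]                   # [[ accepts a computed name: "hello"
x$name                      # $ looks for an element literally called "name": NULL
```

Because $ only works with literal names, [[ is the one to use when the element name is held in a variable.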

Vectors are basic objects in R and they can be subsetted using the [ operator.
> x <- c("a", "b", "c", "c", "d", "a")
> x[1] ## Extract the first element
[1] "a"

Matrices can be subsetted in the usual way with (i,j) type indices. Here, we create a simple 2×3
matrix with the matrix function.
> x <- matrix(1:6, 2, 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
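
A short sketch of (i,j) indexing on the same matrix follows. One point worth noting: when a single element, row, or column is extracted, R drops the matrix dimensions by default, and the drop = FALSE argument keeps them.

```r
x <- matrix(1:6, 2, 3)

x[1, 2]                 # element in row 1, column 2: 3
x[2, 1]                 # element in row 2, column 1: 2
x[1, ]                  # whole first row: 1 3 5
x[, 2]                  # whole second column: 3 4

# By default the result above is a plain vector, not a matrix.
# drop = FALSE keeps the result as a 1 x 1 matrix.
y <- x[1, 2, drop = FALSE]
dim(y)                  # 1 1
```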

Vectorised Operations:

Many operations in R are vectorized, meaning that operations occur in parallel in certain R
objects. This allows you to write code that is efficient, concise, and easier to read than in non-
vectorized languages.
The simplest example is when adding two vectors together.
> x <- 1:4
> y <- 6:9
> z <- x + y
> z
[1] 7 9 11 13

Matrix operations are also vectorized, making for nicely compact notation. This way, we can do
element-by-element operations on matrices without having to loop over every element.
> x <- matrix(1:4, 2, 2)
> y <- matrix(rep(10, 4), 2, 2)
> ## element-wise multiplication
> x * y
     [,1] [,2]
[1,]   10   30
[2,]   20   40
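
A few more vectorized operations, sketched with the same kind of data. Note that * on matrices is element-wise; true matrix multiplication uses the %*% operator.

```r
x <- 1:4
y <- 6:9

x > 2        # logical comparison, element by element: FALSE FALSE TRUE TRUE
x * y        # element-wise product: 6 14 24 36
x / y        # element-wise division

m <- matrix(1:4, 2, 2)
n <- matrix(rep(10, 4), 2, 2)

p <- m %*% n # true matrix multiplication, not element-wise
p            # rows are (40, 40) and (60, 60)
```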

Managing Data Frames with dplyr:

Some of the key functions provided by the dplyr package are listed below. Load the package with library(dplyr) before running the examples.


1. select: Select columns with select(). It returns a subset of the columns of a data frame.
x <- select(iris, c(Species, Sepal.Length))
head(x)

2. filter: Filter rows with filter(). It extracts a subset of rows from a data frame based on
logical conditions.
x <- filter(iris, Sepal.Length > 5.843)
head(x)
summary(x)

3. arrange: Arrange rows with arrange(). It helps to reorder rows of a data frame.
x <- arrange(mtcars, cyl)
head(select(x, mpg, cyl), 10)
tail(select(x, mpg, cyl), 10)

4. rename: Rename variables in a data frame with rename(). Each argument has the form new_name = old_name.
mydata <- rename(trees, "Tree Diameter in Inches" = Girth, "Height in ft" = Height, "Volume of Timber" = Volume)
head(mydata)

5. mutate: Add new columns with mutate(). It helps to add new variables/columns or
transform existing variables.
sleep_data <- mutate(sleep, extra_derived = extra - mean(extra, na.rm = TRUE))
str(sleep_data)
head(sleep_data)

6. summarize: Summarize values with summarize(). This function generates summary
statistics of different variables in the data frame. summarise() and summarize() are
synonyms with the same usage:

summarise(.data, ..., .groups = NULL)
summarize(.data, ..., .groups = NULL)
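
A minimal sketch of summarise() in use, assuming the dplyr package is installed. The column names mean_mpg, max_hp, and n_cars are made up for this example, and group_by() is a further dplyr function not covered in the list above.

```r
library(dplyr)  # assumes dplyr is installed

# One row of summary statistics for the whole data frame
stats <- summarise(mtcars,
                   mean_mpg = mean(mpg),
                   max_hp   = max(hp),
                   n_cars   = n())
stats

# One row per group when the data frame is grouped first
by_cyl <- summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg))
by_cyl
```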

7. %>%: the “pipe” operator is used to connect multiple verb actions together into a
pipeline.
swiss %>% select(Examination, Education) %>% head

Control Statements in R:

In R programming, there are 8 types of control statements as follows:


● if condition
● if-else condition
● for loop
● nested loops
● while loop
● repeat and break statement
● return statement
● next statement

If condition:
The code block runs only if the condition evaluates to TRUE. Unlike some languages, R
has no then keyword: the block to run follows the condition directly.

Syntax:

if (test_expression) {
  statement
}

if-else Condition

An if…else statement contains the same elements as an if statement (see the preceding
section), with some extra elements:

● The keyword else, placed after the first code block.


● The second block of code, contained within braces, has to be carried out,
only if the result of the condition in the if() statement is FALSE.

Syntax:
if (test_expression) {
  statement
} else {
  statement
}
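
The two syntax templates above can be filled in as follows. The variable names x and parity are chosen only for this sketch.

```r
x <- 7

# Plain if: the block runs only when the condition is TRUE
if (x > 5) {
  print("x is greater than 5")
}

# if-else: exactly one of the two blocks runs
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}
print(parity)   # "odd"
```

When typing at the top level of the console, else has to appear on the same line as the closing brace of the if block; otherwise R treats the if statement as already complete.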

for Loop in R
A loop is a sequence of instructions that is repeated until a certain condition is
reached. for, while and repeat, with the additional clauses break and next are used to
construct loops.
values <- c(1, 2, 3, 4, 5)

# Print each element of the vector in turn
for (id in 1:5) {
  print(values[id])
}

Nested Loop

A nested loop is a loop placed inside the body of another loop: for every iteration of
the outer loop, the inner loop runs to completion. The foreach package offers a similar
looping construct. Unlike many parallel programming packages for R, foreach doesn’t
require the body of the loop to be turned into a function, and it provides a nesting
operator, %:%, for creating nested foreach loops.

Example:

mat <- matrix(1:10, 2)

# The outer loop walks over the rows, the inner loop over the columns
for (id1 in seq(nrow(mat))) {
  for (id2 in seq(ncol(mat))) {
    print(mat[id1, id2])
  }
}
while Loop

The format is while(cond) expr, where cond is the condition to test and expr is an
expression.

The condition must evaluate to a single TRUE or FALSE, so every variable it uses has to
be defined before the loop starts; otherwise R complains that it does not know the
variable. Because the condition is tested before each iteration, the body is not
executed at all if the condition is already FALSE the first time.

Example:

val <- 2.987

# Keep adding 0.987 until val exceeds 4.987
while (val <= 4.987) {
  val <- val + 0.987
  print(c(val, val - 2, val - 1))
}

repeat and break Statement :

We use break statements inside a loop (repeat, for, while) to stop the iterations and
transfer control to the code after the loop. In a nested loop, where there is a loop
inside another loop, break exits only the innermost loop that is being evaluated.

A repeat loop is used to iterate over a block of code multiple times. There is no
condition check in a repeat loop to exit the loop. We must put a condition explicitly
inside the body of the loop and use the break statement to exit the loop. Failing to
do so will result in an infinite loop.
Syntax:

repeat {
  # simulations; generate some value, have an expectation,
  # and if the value is within some range, then exit the loop
  if ((value - expectation) <= threshold) {
    break
  }
}
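
The syntax above is only a template. A small runnable sketch of the same pattern is shown below; the variable names and the stopping threshold of 10 are arbitrary choices for illustration.

```r
count <- 0
total <- 0

repeat {
  count <- count + 1
  total <- total + count   # running sum 1, 3, 6, 10, ...

  # Explicit exit condition; without it the loop would never end
  if (total >= 10) {
    break
  }
}

print(count)   # 4
print(total)   # 10
```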

next Statement

The next statement skips the rest of the current iteration of a loop without
terminating the loop. Control jumps back to the evaluation of the loop condition, and
the loop continues with the next cycle.

Example:

x <- 1:4

for (i in x) {
  # Skip the value 2; all other values are printed
  if (i == 2) {
    next
  }
  print(i)
}
return Statement:

Often we need a function to do some processing and hand the result back to the
caller. This is accomplished with the return() statement in R.

Syntax:

return(expression)

Example:

check <- function(x) {
  if (x > 0) {
    result <- "Positive"
  } else if (x < 0) {
    result <- "Negative"
  } else {
    result <- "Zero"
  }
  return(result)
}

R - Functions
A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions. When a function is
called, the interpreter passes control to it along with any arguments; the function performs its
task and then returns control to the interpreter, together with any result, which may be stored
in another object.

An R function is created by using the keyword function. The basic syntax of an R function
definition is as follows:

function_name <- function(arg_1, arg_2, ...) {
  Function body
}

Function Components
The different parts of a function are −

● Function Name − This is the actual name of the function. It is stored in
the R environment as an object with this name.
● Arguments − An argument is a placeholder. When a function is invoked,
you pass a value to the argument. Arguments are optional; that is, a
function may contain no arguments. Also arguments can have default
values.
● Function Body − The function body contains a collection of statements
that defines what the function does.
● Return Value − The return value of a function is the last expression in
the function body to be evaluated.
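
The components listed above can be seen together in a small sketch. The function name power_sum and its arguments are hypothetical, chosen only to show a default argument value and an implicit return value.

```r
# Hypothetical function: sums the elements of x raised to the power p
power_sum <- function(x, p = 2) {   # p has a default value of 2
  sum(x^p)                          # last evaluated expression is the return value
}

power_sum(1:3)          # uses the default p = 2: 1 + 4 + 9 = 14
power_sum(1:3, p = 3)   # overrides the default: 1 + 8 + 27 = 36
```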

R has many in-built functions which can be directly called in the program without defining them
first. We can also create and use our own functions referred to as user defined functions.

Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum() and paste(). They
are called directly by user-written programs, as in the examples below.
# Create a sequence of numbers from 32 to 44.

print(seq(32,44))

# Find the mean of numbers from 25 to 82.

print(mean(25:82))

# Find the sum of numbers from 41 to 68.

print(sum(41:68))

User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and
once created they can be used like the built-in functions. Below is an example of how a
function is created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}

# Call the function, supplying 3 as the argument.
new.function(3)

Scoping rules of R:

The scoping rules for R are the main feature that makes it different from the original S
language. The scoping rules of a language determine how a value is associated with a
free variable in a function. Free variables are variables that are neither formal
arguments nor local variables (assigned inside the function body). R uses lexical
scoping, also called static scoping. An alternative to lexical scoping is dynamic
scoping, which is implemented by some languages. Lexical scoping turns out to be
particularly useful for simplifying statistical computations.

Lexical scoping in R means that the values of free variables are searched for in the
environment in which the function was defined.
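
A minimal sketch of what this means in practice, using made-up names f, g, and y. The variable y is free inside f, so its value is taken from the environment where f was defined, not from the function that happens to call f.

```r
y <- 10

f <- function(x) {
  x + y        # y is a free variable: not an argument, not assigned in f
}

g <- function(x) {
  y <- 100     # local to g; invisible to f under lexical scoping
  f(x)
}

g(1)           # 11, because f finds y = 10 where f was defined
```

Under dynamic scoping the call would instead look up y in the calling function g and give 101.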

Lexical Scoping: Why Does It Matter?

Typically, a function is defined in the global environment, so that the values of free
variables are just found in the user’s workspace. This behavior is logical for most people
and is usually the “right thing” to do. However, in R you can have functions defined
inside other functions (languages like C don’t let you do this). Now things get interesting
—in this case the environment in which a function is defined is the body of another
function!

Here is an example of a function that returns another function as its return value.
Remember, in R functions are treated like any other object and so this is perfectly valid.

> make.power <- function(n) {
+   pow <- function(x) {
+     x^n
+   }
+   pow
+ }
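
To see the effect, the function can be called with different values of n. In the sketch below, the names cube and square are chosen for illustration; environment(), ls(), and get() are base R functions used here to peek at the defining environment of the returned function.

```r
make.power <- function(n) {
  pow <- function(x) {
    x^n          # n is a free variable inside pow
  }
  pow
}

cube <- make.power(3)
square <- make.power(2)

cube(3)      # 27
square(3)    # 9

# Each returned function carries the environment in which it was defined
ls(environment(cube))           # "n" "pow"
get("n", environment(cube))     # 3
get("n", environment(square))   # 2
```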

There are numerous other languages that support lexical scoping, including

● Scheme
● Perl
● Python
● Common Lisp (all languages converge to Lisp, right?)
Lexical scoping in R has consequences beyond how free variables are looked up. In
particular, it’s the reason that all objects must be stored in memory in R. This is because
all functions must carry a pointer to their respective defining environments, which could
be anywhere. In the S language (R’s close cousin), free variables are always looked up
in the global workspace, so everything can be stored on the disk because the “defining
environment” of all functions is the same.

Naming conventions
R has no naming conventions that are generally agreed upon. As a newcomer to R it’s
useful to decide which naming convention to adopt.

● There are 5 naming conventions to choose from:


○ alllowercase: e.g. adjustcolor
○ period.separated: e.g. plot.new
○ underscore_separated: e.g. numeric_version
○ lowerCamelCase: e.g. addTaskCallback
○ UpperCamelCase: e.g. SignatureMethod
● Strive for names that are concise and meaningful
● Generally, variable names should be nouns and function names should be verbs.
● Non-exported and helper functions always start with “.”
● Local variables and functions are all in small letters and in “.” syntax
(do.something, get.xyyy). It makes it easy to distinguish local vs global and
therefore leads to a cleaner code.
● File names should be meaningful and end in .R.
● Pick one naming convention and stick to it. My suggestion:
○ Files: underscore_separated, all lower case: e.g. numeric_version
○ Functions: period.separated, all lower case: e.g. my.function

Syntax
● Place spaces around all infix operators (=, +, -, <-, etc.).
● Use <-, not =, for assignment.
● Use comments to mark off sections of code.
● Comment your code with care. Comments should explain the why, not the what
● Each line of a comment should begin with the comment symbol and a single space
● An opening curly brace should never go on its own line and should always be followed by a
new line; a closing curly brace should always go on its own line, unless followed by else.
● Always indent the code inside the curly braces.
● Keep your lines less than 80 characters. This is the amount that will fit comfortably on a
printed page at a reasonable size. If you find you are running out of room, this is probably an
indication that you should encapsulate some of the work in a separate function.
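
As a rough illustration, the short function below tries to follow the guidelines in this unit. The function name, its arguments, and the data are invented for the example, and the period.separated function name follows the suggestion made above.

```r
# Summary helpers -----------------------------------------------------

summarise.scores <- function(scores, trim = 0) {
  # An empty input has no meaningful mean, so report a missing value
  if (length(scores) == 0) {
    return(NA_real_)
  } else {
    mean(scores, trim = trim)
  }
}

class_average <- summarise.scores(c(70, 85, 90))
class_average
```

Spaces surround every infix operator, <- is used for assignment, the opening braces end their lines, else follows its closing brace, and the comments explain why rather than what.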

Looping on the Command Line

Writing for and while loops is useful when programming but not particularly easy when
working interactively on the command line. Multi-line expressions with curly braces are
just not that easy to sort through when working on the command line. R has some
functions which implement looping in a compact form to make your life easier.

● lapply(): Loop over a list and evaluate a function on each element
● sapply(): Same as lapply but try to simplify the result
● apply(): Apply a function over the margins of an array
● tapply(): Apply a function over subsets of a vector
● mapply(): Multivariate version of lapply

lapply()
The lapply() function does the following simple series of operations:

1. it loops over a list, iterating over each element in that list
2. it applies a function to each element of the list (a function that you specify)
3. and returns a list (the l is for “list”).

> x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
> lapply(x, mean)
$a
[1] 2.5

$b
[1] 0.248845

$c
[1] 0.9935285

$d
[1] 5.051388

sapply()
The sapply() function behaves similarly to lapply(); the only real difference is in the
return value. sapply() will try to simplify the result of lapply() if possible. Essentially,
sapply() calls lapply() on its input and then applies the following algorithm:

● If the result is a list where every element is length 1, then a vector is returned
● If the result is a list where every element is a vector of the same length (> 1), a
matrix is returned.
● If it can’t figure things out, a list is returned
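
The three cases above can be sketched with a small list of fixed values (chosen here so the results are easy to check by hand).

```r
x <- list(a = 1:4, b = 5:8, c = 9:12)

# Every result has length 1, so a named vector is returned
means <- sapply(x, mean)
means                      # a = 2.5, b = 6.5, c = 10.5

# Every result has the same length (2), so a 2 x 3 matrix is returned
ranges <- sapply(x, range)
ranges

# Results have different lengths, so a list is returned
doubled <- sapply(list(1:2, 1:3), function(v) v * 2)
class(doubled)             # "list"
```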

split()
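The split() function takes a vector or other object and splits it into groups determined
by a factor or list of factors.

The arguments to split() are

> str(split)
function (x, f, drop = FALSE, ...)

● x is a vector (or list) or data frame
● f is a factor (or coerced to one) or a list of factors
● drop indicates whether empty factor levels should be dropped

split() is commonly combined with lapply() or sapply(). In the call below, x is a numeric
vector and f is a factor with three levels that assigns each element of x to a group.

> lapply(split(x, f), mean)
$`1`
[1] 0.07478098

$`2`
[1] 0.5266905

$`3`
[1] 1.458703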
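
Since the output above depends on data not shown here, a self-contained sketch with fixed values may help. The numbers are arbitrary; gl(3, 3) builds a factor with three levels, each repeated three times.

```r
x <- c(1, 2, 3, 10, 20, 30, 100, 200, 300)
f <- gl(3, 3)              # factor: 1 1 1 2 2 2 3 3 3

groups <- split(x, f)      # a list with one numeric vector per level
groups

group_means <- sapply(groups, mean)
group_means                # 2, 20, 200 for levels 1, 2, 3
```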

tapply
tapply() is used to apply a function over subsets of a vector. It can be thought of as a
combination of split() and sapply() for vectors only. I’ve been told that the “t” in
tapply() refers to “table”, but that is unconfirmed.

> str(tapply)

function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

The arguments to tapply() are as follows:

● X is a vector
● INDEX is a factor or a list of factors (or else they are coerced to factors)
● FUN is a function to be applied
● ... contains other arguments to be passed to FUN
● simplify indicates whether the result should be simplified

> ## x must be a numeric vector with one value per element of f (length 30)
> f <- gl(3, 10)
> f
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
        1         2         3
0.1896235 0.5336667 0.9568236

apply()
The apply() function is used to evaluate a function (often an anonymous one) over the
margins of an array. It is most often used to apply a function to the rows or columns of a
matrix (which is just a 2-dimensional array). However, it can be used with general
arrays, for example, to take the average of an array of matrices. Using apply() is not
really faster than writing a loop, but it works in one line and is highly compact.

> str(apply)

function (X, MARGIN, FUN, ...)

The arguments to apply() are

● X is an array
● MARGIN is an integer vector indicating which margins should be “retained”.
● FUN is a function to be applied
● ... is for other arguments to be passed to FUN

> x <- matrix(rnorm(200), 20, 10)
> apply(x, 2, mean) ## Take the mean of each column
 [1]  0.02218266 -0.15932850  0.09021391  0.14723035 -0.22431309 -0.49657847
 [7]  0.30095015  0.07703985 -0.20818099  0.06809774

mapply()
The mapply() function is a multivariate apply of sorts which applies a function in parallel
over a set of arguments. Recall that lapply() and friends only iterate over a single R
object. What if you want to iterate over multiple R objects in parallel? This is what
mapply() is for.

> str(mapply)

function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

The arguments to mapply() are

● FUN is a function to apply
● ... contains R objects to apply over
● MoreArgs is a list of other arguments to FUN.
● SIMPLIFY indicates whether the result should be simplified

> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

Debugging

R has a number of ways to indicate to you that something’s not right. There are different
levels of indication that can be used, ranging from mere notification to fatal error.
Executing any function in R may result in the following conditions.

● message: A generic notification/diagnostic message produced by the message()
function; execution of the function continues
● warning: An indication that something is wrong but not necessarily fatal;
execution of the function continues. Warnings are generated by the warning()
function
● error: An indication that a fatal problem has occurred and execution of the
function stops. Errors are produced by the stop() function.
● condition: A generic concept for indicating that something unexpected has
occurred; programmers can create their own custom conditions if they want.
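
The sketch below shows the three main conditions being raised from one function. The function log.value and its messages are invented for this example; tryCatch() is a base R function, not discussed above, that is used here only to capture the condition messages so they can be inspected.

```r
# Hypothetical function that signals each kind of condition
log.value <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")                # error: execution stops here
  }
  if (x < 0) {
    warning("x is negative; returning NA")   # warning: not fatal
    return(NA_real_)
  }
  message("taking the log of ", x)           # message: just a notification
  log(x)
}

ok <- log.value(1)   # prints the message and returns 0

# Capture the error and the warning instead of letting them reach the console
err <- tryCatch(log.value("a"), error = function(e) conditionMessage(e))
wrn <- tryCatch(log.value(-1), warning = function(w) conditionMessage(w))
err   # "x must be numeric"
wrn   # "x is negative; returning NA"
```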

Simulation
Generating Random Numbers

Simulation is an important (and big) topic both for statistics and for a variety of other
areas where there is a need to introduce randomness. Sometimes you want to
implement a statistical procedure that requires random number generation or sampling
(e.g. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes
you want to simulate a system, in which case random number generators can be used to
model random inputs.

R comes with a set of pseudo-random number generators that allow you to simulate
from well-known probability distributions like the Normal, Poisson, and binomial. Some
example functions for probability distributions in R:

● rnorm: generate random Normal variates with a given mean and standard
deviation.
● dnorm: evaluate the Normal probability density (with a given mean/SD) at a point
(or vector of points)
● pnorm: evaluate the cumulative distribution function for a Normal distribution
● rpois: generate random Poisson variates with a given rate

For each probability distribution there are typically four functions available, whose
names start with “r”, “d”, “p”, and “q”. The “r” function is the one that actually
simulates random numbers from that distribution. The prefixes stand for:

● d for density
● r for random number generation
● p for cumulative distribution
● q for quantile function (inverse cumulative distribution)

If you’re only interested in simulating random numbers, then you will likely only need the
“r” functions and not the others. However, if you intend to simulate from arbitrary
probability distributions using something like rejection sampling, then you will need the
other functions too.
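
A brief sketch of the four prefixes for the Normal distribution, plus rpois for the Poisson. The seed value, sample sizes, and parameters are arbitrary; set.seed() is a base R function that makes the pseudo-random draws reproducible.

```r
set.seed(1)                            # arbitrary seed, for reproducibility

x <- rnorm(5, mean = 10, sd = 2)       # five random Normal variates
x

d <- dnorm(0)                          # standard Normal density at 0, about 0.3989
p <- pnorm(1.96)                       # P(Z <= 1.96), about 0.975
q <- qnorm(0.975)                      # inverse of pnorm, about 1.96

counts <- rpois(5, lambda = 3)         # five random Poisson counts with rate 3
counts

# Resetting the seed reproduces exactly the same draws
set.seed(1)
x_again <- rnorm(5, mean = 10, sd = 2)
identical(x, x_again)                  # TRUE
```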
