R Lab

1. Vectors can be created using the c() function by concatenating values or using the vector() function and specifying the type and length. 2. Lists allow storing different data types and can be created using the list() function. 3. Common operations on vectors and lists include accessing elements using [[ ]] or [ ], adding or removing elements, and applying functions to each element.

MGM’s

Jawaharlal Nehru Engineering College


Aurangabad-431003

Department of Information Technology

Lab Manual

Programming Lab (R Programming)


For
Third Year IT


Vision IT Department

To develop expertise of budding technocrats by imparting technical knowledge and value-based education.

Mission IT Department

Equipping the students with technical skills, soft skills and professional attitude.

Providing the state of art facilities to the students to excel as competent professionals,
entrepreneurs and researchers.


List of Experiments
i. Study of data analysis using MS-Excel(Prerequisite)

1. Study of basic Syntaxes in R

2. Implementation of vector data objects operations

3. Implementation of matrix, array and factors and perform various operations in R

4. Implementation and use of data frames in R

5. Create Sample (Dummy) Data in R and perform data manipulation with R

6. Study and implementation of various control structures in R

7. Data Manipulation with dplyr package

8. Data Manipulation with data.table package

9. Study and implementation of Data Visualization with ggplot2

10. Study and implementation of data transpose operations in R


Experiment No: 1
Aim: To perform the basic mathematical operations in R programming

Theory:

In R, the fundamental unit of shareable code is the package. A package bundles
together code, data, documentation, and tests and provides an easy method to share with
others. As of May 2017 there were over 10,000 packages available on CRAN. This huge
variety of packages is one of the reasons that R is so successful: chances are that someone has
already solved a problem that you’re working on, and you can benefit from their work by
downloading their package.

Installing Packages
The most common place to get packages from is CRAN. To install packages from CRAN you
use install.packages("packagename"). For instance, if you want to install the ggplot2
package, which is a very popular visualization package you would type the following in the
console:
# install package from CRAN
install.packages("ggplot2")

Loading Packages
Once the package is downloaded to your computer you can access the functions and
resources provided by the package in two different ways:
# load the package to use in the current R session
library(packagename)
# use a particular function from a package without loading the package
packagename::functionname

Getting Help on Packages


For more direct help on packages that are installed on your computer you can use the help
and vignette functions. Here we can get help on the ggplot2 package with the following:
help(package = "ggplot2") # provides details regarding contents of a package
vignette(package = "ggplot2") # list vignettes available for a specific package
vignette("ggplot2-specs") # view specific vignette
vignette() # view all vignettes on your computer

Assignment
The first operator you’ll run into is the assignment operator. The assignment operator is used
to assign a value. For instance we can assign the value 3 to the variable x using the <-
assignment operator.
# assignment
x <- 3
Interestingly, R actually allows for five assignment operators:
# leftward assignment
x <- value
x = value
x <<- value
# rightward assignment
value -> x
value ->> x
The original assignment operator in R was <- and it has remained the preferred one among R
users. The = assignment operator was added in 2001, primarily because it is the accepted
assignment operator in many other languages and beginners coming to R from those
languages were prone to use it.
The <<- and ->> operators are normally only used inside functions, and we will not go into their details here.
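
As a rough illustration of how <<- differs from <- inside a function, the sketch below uses a made-up counter. The names counter, bump_local and bump_global are hypothetical and exist only for this example.

```r
counter <- 0

bump_local <- function() {
  counter <- counter + 1    # creates a new local variable; the outer counter is untouched
  counter
}

bump_global <- function() {
  counter <<- counter + 1   # assigns to counter in the enclosing environment
  counter
}

local_result  <- bump_local()    # returns 1
after_local   <- counter         # outer counter is still 0
global_result <- bump_global()   # returns 1
after_global  <- counter         # outer counter is now 1
```

Because <<- changes objects outside the function, it is usually best kept for cases where that side effect is really intended.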

Evaluation
We can then evaluate the variable by simply typing x at the command line, which will return
the value of x. Note that prior to the value returned you’ll see ## [1] in the command line.
The [1] is the index of the first element printed on that line; here the output has a single
element. Note that you can add comments to your code by preceding them with the hash (#)
symbol. Any values, symbols, and text following # on a line will not be evaluated.
# evaluation
x
## [1] 3

Case Sensitivity
Lastly, note that R is a case-sensitive programming language, meaning all variables,
functions, and objects must be called by their exact spelling:

x <- 1
y <- 3
z <- 4
x*y*z
## [1] 12
x*Y*z
## Error in eval(expr, envir, enclos): object 'Y' not found

Basic Arithmetic
At its most basic, R can be used as a calculator. When applying basic arithmetic, the
PEMDAS order of operations applies: parentheses first, followed by exponentiation,
multiplication and division, and finally addition and subtraction.

8+9/5^2
## [1] 8.36

8 + 9 / (5 ^ 2)
## [1] 8.36
8 + (9 / 5) ^ 2
## [1] 11.24
(8 + 9) / 5 ^ 2
## [1] 0.68
By default R will display seven significant digits, but this can be changed using options().
1/7
## [1] 0.1428571
options(digits = 3)
1/7
## [1] 0.143
pi
## [1] 3.14
options(digits = 22)
pi
## [1] 3.141592653589793115998
We can also perform integer divide (%/%) and modulo (%%) functions. The integer divide
function will give the integer part of a fraction while the modulo will provide the remainder.
42 / 4 # regular division
## [1] 10.5
42 %/% 4 # integer division
## [1] 10
42 %% 4 # modulo (remainder)
## [1] 2
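
A small sketch of how the two operators fit together: the integer quotient times the divisor, plus the remainder, gives back the original number. The variable names are chosen here for illustration only.

```r
a <- 42
b <- 4

quotient  <- a %/% b                   # 10
remainder <- a %% b                    # 2
rebuilt   <- quotient * b + remainder  # 42, the original value

# With a negative dividend, the remainder takes the sign of the divisor
neg_quotient  <- (-7) %/% 2            # -4
neg_remainder <- (-7) %% 2             # 1
```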

Miscellaneous Mathematical Functions


There are many built-in functions to be aware of. These include but are not limited to the
following. Go ahead and run this code in your console.
x <- 10
abs(x) # absolute value
sqrt(x) # square root
exp(x) # exponential transformation
log(x) # logarithmic transformation
cos(x) # cosine and other trigonometric functions

Infinite, and NaN Numbers:


When performing undefined calculations, R will produce Inf (infinity) and NaN (not a
number) outputs.
1/0 # infinity
## [1] Inf
Inf - Inf # infinity minus infinity
## [1] NaN
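
To check for these special values in your own data, base R provides test functions. The following is a minimal sketch; the vector values is made up for the example.

```r
values <- c(1, 1/0, -1/0, 0/0, NA)

finite_flags   <- is.finite(values)    # TRUE  FALSE FALSE FALSE FALSE
infinite_flags <- is.infinite(values)  # FALSE TRUE  TRUE  FALSE FALSE
nan_flags      <- is.nan(values)       # FALSE FALSE FALSE TRUE  FALSE
na_flags       <- is.na(values)        # FALSE FALSE FALSE TRUE  TRUE
```

Note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN.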

The workspace environment will also list your user defined objects such as vectors, matrices,
data frames, lists, and functions. For example, if you type the following in your console:
x <- 2
y <- 3
You will now see x and y listed in your workspace environment. To identify or remove the
objects (i.e. vectors, data frames, user defined functions, etc.) in your current R environment:

# list all objects
ls()

# identify if an R object with a given name is present
exists("x")

# remove defined object from the environment
rm(x)

# you can remove multiple objects
rm(x, y)

# basically removes everything in the working environment -- use with caution!
rm(list = ls())

Conclusion: In this way we have understood the basics of R programming.


Experiment No: 2
Aim: Implementation of vector and List data objects operations

Theory:

With R, it’s important to understand that there is a difference between the actual
R object and the manner in which that R object is printed to the console. Often, the printed
output may have additional bells and whistles to make the output more friendly to the users.
However, these bells and whistles are not inherently part of the object.
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The most basic type of R object is a vector. Empty vectors can be created with the
vector() function. There is really only one rule about vectors in R, which is that a vector can
only contain objects of the same class. But of course, like any good rule, there is an
exception, which is a list, which we will get to a bit later. A list is represented as a vector but
can contain objects of different classes. Indeed, that’s usually why we use them.
There is also a class for “raw” objects, but they are not commonly used directly in data
analysis.
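
One consequence of the "same class" rule is that R silently converts mixed values to a common class when they are combined with c(). The sketch below shows this with a few made-up vectors.

```r
class_numeric   <- class(c(1.5, 2))      # "numeric"
class_integer   <- class(c(1L, 2L))      # "integer"
class_logical   <- class(c(TRUE, 2))     # "numeric": TRUE is converted to 1
class_character <- class(c(1.5, "a"))    # "character": the number becomes "1.5"

mixed <- c(1.5, "a", TRUE)               # "1.5" "a" "TRUE"
```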

Creating Vectors

The c() function can be used to create vectors of objects by concatenating things together.

> x <- c(0.5, 0.6) ## numeric

> x <- c(TRUE, FALSE) ## logical

> x <- c(T, F) ## logical

> x <- c("a", "b", "c") ## character

> x <- 9:29 ## integer

> x <- c(1+0i, 2+4i) ## complex


Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE.
However, in general one should try to use the explicit TRUE and FALSE values when
indicating logical values. The T and F values are primarily there for when you’re feeling lazy.

You can also use the vector() function to initialize vectors.

> x <- vector("numeric", length = 10)

> x

[1] 0 0 0 0 0 0 0 0 0 0

A vector is an object that contains a set of values called its elements.

Numeric vector

x <- c(1,2,3,4,5,6)

The operator <- is equivalent to the "=" sign.

Character vector

State <- c("DL", "MU", "NY", "DL", "NY", "MU")

To calculate the frequency of each value in the State vector, you can use the table function.

To calculate the mean of a numeric vector, you can use the mean function.

If the vector contains an NA (not available) value, the mean function returns NA.

To calculate the mean of a vector excluding NA values, you can include the na.rm = TRUE
parameter in the mean function.
You can use subscripts to refer to elements of a vector.
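
The operations described above can be sketched as follows. The State vector is the one defined earlier; the scores vector is a hypothetical example containing one missing value.

```r
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
state_counts <- table(State)                 # each of DL, MU, NY occurs twice

scores <- c(10, 20, NA, 40)                  # hypothetical data with one NA
mean_with_na    <- mean(scores)              # NA
mean_without_na <- mean(scores, na.rm = TRUE)  # (10 + 20 + 40) / 3

second    <- scores[2]                       # 20
first_and_last <- scores[c(1, 4)]            # 10 40
all_but_first  <- scores[-1]                 # 20 NA 40
```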

Convert a column "x" to numeric

data$x = as.numeric(data$x)

Some useful vectors can be created quickly with R. The colon operator is used to generate integer sequences:


> 1:10

[1] 1 2 3 4 5 6 7 8 9 10

> -3:4

[1] -3 -2 -1 0 1 2 3 4

> 9:5

[1] 9 8 7 6 5

More generally, the function seq() can generate any arithmetic progression.

> seq(from=2, to=6, by=0.4)

[1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0

> seq(from=-1, to=1, length=6)

[1] -1.0 -0.6 -0.2 0.2 0.6 1.0

Sometimes it’s necessary to have repeated values, for which we use rep()

> rep(5,3)

[1] 5 5 5

> rep(2:5,each=3)

[1] 2 2 2 3 3 3 4 4 4 5 5 5

> rep(-1:3, length.out=10)

[1] -1 0 1 2 3 -1 0 1 2 3
We can also use R’s vectorization to create more interesting sequences:

> 2^(0:10)

[1] 1 2 4 8 16 32 64 128 256 512 1024

> 1:3 + rep(seq(from=0,by=10,to=30), each=3)

[1] 1 2 3 11 12 13 21 22 23 31 32 33
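
Since the colon operator counts downwards when the end is smaller than the start (as in 9:5 above), 1:n gives an unexpected result when n is 0. A short sketch of the safer base R helpers seq_len() and seq_along(), using made-up inputs:

```r
n <- 0

colon_length   <- length(1:n)          # 2, because 1:0 is c(1, 0)
seq_len_length <- length(seq_len(n))   # 0, an empty sequence

positions <- seq_along(c("a", "b", "c"))  # 1 2 3
```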

Lists:
A list allows you to store a variety of objects.

You can use subscripts to select the specific component of the list.
> x <- list(1:3, TRUE, "Hello", list(1:2, 5))

Here x has 4 elements: a numeric vector, a logical, a string and another list.

We can select an entry of x with double square brackets:

> x[[3]]

[1] "Hello"

To get a sub-list, use single brackets:

> x[c(1,3)]

[[1]]

[1] 1 2 3

[[2]]

[1] "Hello"

Notice the difference between x[[3]] and x[3].
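
To make that difference concrete, the sketch below inspects both results for the list x defined above: double brackets return the element itself, single brackets return a list containing it.

```r
x <- list(1:3, TRUE, "Hello", list(1:2, 5))

element_class <- class(x[[3]])   # "character": the string itself
sublist_class <- class(x[3])     # "list": a list of length 1 holding the string
sublist_len   <- length(x[3])    # 1

n_chars <- nchar(x[[3]])         # 5; works because x[[3]] is a character value
nested  <- x[[4]][[1]]           # 1 2; double brackets can be chained into nested lists
```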

We can also name some or all of the entries in our list, by supplying argument names to list():

> x <- list(y=1:3, TRUE, z="Hello")

> x

$y

[1] 1 2 3

[[2]]

[1] TRUE

$z

[1] "Hello"
Notice that the [[1]] has been replaced by $y, which gives us a clue as to how we can recover the entries by their name. We can still use the numeric position if we prefer:

> x$y

[1] 1 2 3

> x[[1]]
[1] 1 2 3

The function names() can be used to obtain a character vector of all the names of objects in a list.

> names(x)

[1] "y" "" "z"
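
Elements can also be added to or removed from a list, and a function can be applied to every element. The following is a minimal sketch; the element name w is an assumption made for the example.

```r
x <- list(y = 1:3, TRUE, z = "Hello")

x$w <- c(2.5, 3.5)       # add a new named element
x[["z"]] <- NULL         # remove the element named z

remaining_names <- names(x)            # "y" "" "w"
element_lengths <- sapply(x, length)   # 3 1 2
```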

Conclusion:
Experiment No. 3
Aim: Implementation of various operations on matrix, array and factors in R

Theory:
Matrices are much used in statistics, and so play an important role in R. To create a matrix
use the function matrix(), specifying elements by column first:

> matrix(1:12, nrow=3, ncol=4)

[,1] [,2] [,3] [,4]

[1,] 1 4 7 10

[2,] 2 5 8 11

[3,] 3 6 9 12

This is called column-major order. Of course, we need only give one of the dimensions:

> matrix(1:12, nrow=3)

unless we want vector recycling to help us:

> matrix(1:3, nrow=3, ncol=4)

[,1] [,2] [,3] [,4]

[1,] 1 1 1 1

[2,] 2 2 2 2

[3,] 3 3 3 3

Sometimes it’s useful to specify the elements by row first

> matrix(1:12, nrow=3, byrow=TRUE)

There are special functions for constructing certain matrices:

> diag(3)

[,1] [,2] [,3]

[1,] 1 0 0

[2,] 0 1 0

[3,] 0 0 1
> diag(1:3)

[,1] [,2] [,3]

[1,] 1 0 0

[2,] 0 2 0

[3,] 0 0 3

> 1:5 %o% 1:5

[,1] [,2] [,3] [,4] [,5]

[1,] 1 2 3 4 5

[2,] 2 4 6 8 10

[3,] 3 6 9 12 15

[4,] 4 8 12 16 20

[5,] 5 10 15 20 25

The last operator performs an outer product, so it creates a matrix with (i, j)-th entry $x_i y_j$.
The function outer() generalizes this to any function f on two arguments, to create a matrix
with entries $f(x_i, y_j)$. (More on functions later.)

> outer(1:3, 1:4, "+")

[,1] [,2] [,3] [,4]

[1,] 2 3 4 5

[2,] 3 4 5 6

[3,] 4 5 6 7

Matrix multiplication is performed using the operator %*%, which is quite distinct from element-wise multiplication *.

> A <- matrix(c(1:8,10), 3, 3)

> x <- c(1,2,3)

> A %*% x # matrix multiplication

[,1]
[1,] 30

[2,] 36

[3,] 45

> A*x # NOT matrix multiplication

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 4 10 16

[3,] 9 18 30

Standard functions exist for common mathematical operations on matrices

> t(A) # transpose

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 10

> det(A) # determinant

[1] -3

> diag(A) # diagonal

[1] 1 5 10

> solve(A) # inverse

[,1] [,2] [,3]

[1,] -0.6667 -0.6667 1

[2,] -1.3333 3.6667 -2

[3,] 1.0000 -2.0000 1
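
As a quick sanity check on the inverse, multiplying A by solve(A) should give the identity matrix up to rounding error. The sketch below also shows solve() with a second argument, which solves a linear system directly; the right-hand side b is made up for the example.

```r
A <- matrix(c(1:8, 10), 3, 3)

A_inv <- solve(A)
is_identity <- isTRUE(all.equal(A %*% A_inv, diag(3)))   # TRUE within tolerance

b <- c(1, 2, 3)                      # hypothetical right-hand side
solution <- solve(A, b)              # solves A %*% solution = b
is_solved <- isTRUE(all.equal(as.vector(A %*% solution), b))
```

all.equal() is used instead of == because floating point results are rarely exactly equal.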

Array:

Of course, if we have a data set consisting of more than two pieces of categorical information
about each subject, then a matrix is not sufficient. The generalization of matrices to higher
dimensions is the array. Arrays are defined much like matrices, with a call to the array()
command. Here is a 2 × 3 × 3 array:

> arr = array(1:18, dim=c(2,3,3))

> arr

,,1

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

,,2

[,1] [,2] [,3]

[1,] 7 9 11

[2,] 8 10 12

,,3

[,1] [,2] [,3]

[1,] 13 15 17

[2,] 14 16 18

Each 2-dimensional slice defined by the last co-ordinate of the array is shown as a 2 × 3
matrix. Note that we no longer specify the number of rows and columns separately, but use a
single vector dim whose length is the number of dimensions. You can recover this vector
with the dim() function.

> dim(arr)

[1] 2 3 3

Note that a 2-dimensional array is identical to a matrix. Arrays can be subsetted and modified in exactly the same way as a matrix, only using the appropriate number of co-ordinates:

> arr[1,2,3]

[1] 15

> arr[,2,]
[,1] [,2] [,3]

[1,] 3 9 15

[2,] 4 10 16

> arr[1,1,] = c(0,-1,-2) # change some values

> arr[,,1,drop=FALSE]

,,1

[,1] [,2] [,3]

[1,] 0 3 5

[2,] 2 4 6
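
Functions can be applied along any dimension of an array with apply(). A small sketch on a fresh copy of the original 2 × 3 × 3 array (before the values were changed above):

```r
arr <- array(1:18, dim = c(2, 3, 3))

slice_sums <- apply(arr, 3, sum)   # one sum per 2 x 3 slice: 21 57 93
row_sums   <- apply(arr, 1, sum)   # one sum per first index: 81 90
```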

Factors

R has a special data structure to store categorical variables. It tells R that a variable is
nominal or ordinal by making it a factor.

Simplest form of the factor function :

Ideal form of the factor function :

The factor function has three parameters:


1. Vector Name
2. Values (Optional)
3. Value labels (Optional)
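
The three parameters listed above correspond to the x, levels and labels arguments of factor(). The following is a sketch with a hypothetical ratings vector, not necessarily the example the manual originally showed.

```r
ratings <- c(2, 1, 3, 2, 1)          # hypothetical coded values

# Simplest form: only the vector
f_simple <- factor(ratings)
simple_levels <- levels(f_simple)    # "1" "2" "3"

# Fuller form: values (levels) and value labels
f_labelled <- factor(ratings,
                     levels = c(1, 2, 3),
                     labels = c("low", "medium", "high"))
labelled_values <- as.character(f_labelled)  # "medium" "low" "high" "medium" "low"
label_counts <- table(f_labelled)            # low 2, medium 2, high 1
```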
Convert a column "x" to factor

data$x = as.factor(data$x)
Experiment No. 4
Aim: Implementation and perform the various operations on data frames in R
Theory:

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

• Data frames are tabular data objects.

• A data frame is a list of vectors of equal length.

• Data frame in R is used for storing data tables.

Characteristics of a data frame:

1. The column names should be non-empty.

2. The row names should be unique.

3. The data stored in a data frame can be of numeric, factor or character type.

Create Data Frame

# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)

When we execute the above code, it produces the following result –


  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27
Get the Structure of the Data Frame

The structure of the data frame can be seen by using str() function.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)

When we execute the above code, it produces the following result –

'data.frame': 5 obs. of 4 variables:


$ emp_id : int 1 2 3 4 5
$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
$ salary : num 623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...

Summary of Data in Data Frame

The statistical summary and nature of the data can be obtained by applying summary()
function.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date
Min. :1 Length:5 Min. :515.2 Min. :2012-01-01
1st Qu.:2 Class :character 1st Qu.:611.0 1st Qu.:2013-09-23
Median :3 Mode :character Median :623.3 Median :2014-05-11
Mean :3 Mean :664.4 Mean :2014-01-14
3rd Qu.:4 3rd Qu.:729.0 3rd Qu.:2014-11-15
Max. :5 Max. :843.2 Max. :2015-03-27

Extract Data from Data Frame:

# Extract Specific columns.

result <- data.frame(emp.data$emp_name,emp.data$salary)

print(result)

When we execute the above code, it produces the following result −


emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25

# Extract first two rows.

result <- emp.data[1:2,]

print(result)

When we execute the above code, it produces the following result −


emp_id emp_name salary start_date
1 1 Rick 623.3 2012-01-01
2 2 Dan 515.2 2013-09-23

# Extract 3rd and 5th row with 2nd and 4th column.

result <- emp.data[c(3,5),c(2,4)]

print(result)

When we execute the above code, it produces the following result −


emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
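
Besides row and column positions, rows can be selected with a logical condition. The sketch below rebuilds a smaller version of the employee data under the assumed name emp, and the 620 salary threshold is arbitrary.

```r
emp <- data.frame(
  emp_id   = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary   = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# Rows where salary exceeds 620, keeping two columns selected by name
high_paid <- emp[emp$salary > 620, c("emp_name", "salary")]

# The same selection written with subset()
high_paid_subset <- subset(emp, salary > 620, select = c(emp_name, salary))
```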

Expand Data Frame

A data frame can be expanded by adding columns and rows.

1. Add Column

Just add the column vector using a new column name.


# Add the "dept" column.

emp.data$dept <- c("IT","Operations","IT","HR","Finance")

v <- emp.data

print(v)

When we execute the above code, it produces the following result −


emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance

2. Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows
in the same structure as the existing data frame and use the rbind() function.

In the example below we create a data frame with new rows and merge it with the existing
data frame to create the final data frame.

# Create the second data frame


emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)

# Bind the two data frames.


emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)

Conclusion:
Experiment No. 5
Aim: To Create Sample (Dummy) Data in R and perform data manipulation with R

Theory:

This covers how to execute most frequently used data manipulation tasks with R. It includes
various examples with datasets and code. It gives you a quick look at several functions used
in R.

Drop data frame columns by name:

DF <- data.frame( x=1:10, y=10:1, z=rep(5,10), a=11:20 )

# for multiple

> drops <- c("x","z")

> DF[ , !(names(DF) %in% drops)]

# OR

> keeps <- c("y", "a")

> DF[keeps]

> DF

Order function for sort:

d3=data.frame(roll=c(2,4,6,3,1,5),

name=c('a','b','c','d','e','e'),

marks=c(44,55,22,33,66,77))

> d3

d3[order(d3$roll),]

OR

d3[with(d3,order(roll)),]

Subsets:
roll=c(1:5)
names=c(letters[1:5])
marks=c(12,33,44,55,66)
d4=data.frame(roll,names,marks)
sub1=subset(d4,marks>33 & roll>4)
sub1
sub1=subset(d4,marks>33 & roll>4,select = c(roll,names))
sub1

Drop factor levels in a subsetted data frame:

df <- data.frame(letters=letters[1:5], numbers=1:5, stringsAsFactors=TRUE)  # letters must be a factor for levels() to apply


df
levels(df$letters)
sub2=subset(df,numbers>3)
sub2
levels(sub2$letters)
sub2$letters=factor(sub2$letters)
levels(sub2$letters)
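
Base R also provides droplevels(), which removes unused levels from every factor column of a data frame in one call. A minimal sketch using the same kind of data as above:

```r
df <- data.frame(letters = factor(letters[1:5]), numbers = 1:5)

sub2 <- subset(df, numbers > 3)
levels_before <- levels(sub2$letters)   # still "a" "b" "c" "d" "e"

sub2 <- droplevels(sub2)                # drop levels that no longer occur
levels_after <- levels(sub2$letters)    # "d" "e"
```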

Rename Columns in R
# d is assumed to be a data frame with a "roll" column, such as d3 above
colnames(d)[colnames(d)=="roll"] = "ID"

Use of factor in data frame


# add column of class

d$class=c(1,2,1,2,1,2)

cls=factor(d$class,levels = c(1,2),labels = c("f","s"))

table(cls)

# for factor(), levels and labels are optional

Replacing / Recoding values


'Recoding' means replacing existing value(s) with new value(s).

Create Dummy Data


mydata = data.frame(State = ifelse(sign(rnorm(25))==-1,'Delhi','Goa'), Q1= sample(1:25))
In this example, we are replacing 1 with 6 in Q1 variable
mydata$Q1[mydata$Q1==1] <- 6

In this example, we are replacing "Delhi" with "Mumbai" in State variable. If State is stored
as a factor (the default for data.frame() before R 4.0.0), we first need to convert it to character.
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'

In this example, we are replacing 2 and 3 with NA values in whole dataset.


mydata[mydata == 2 | mydata == 3] <- NA

Another method
You have to first install the car package.
# Install the car package
install.packages("car")

# Load the car package


library("car")
# Recode 1 to 6
mydata$Q1 <- recode(mydata$Q1, "1=6")

Recoding to a new column


Create a new column called Ques1
mydata$Ques1<- recode(mydata$Q1, "1:4=0; 5:6=1")

Sorting
Sorting is one of the most common data manipulation tasks. It is generally used when
we want to see the top 5 highest / lowest values of a variable.

Sorting a vector
x= sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() is used for sorting a one-dimensional vector. It cannot be used on objects
with more than one dimension, such as a data frame.
Sorting a data frame
mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT=
sample(1:25))
Sort gender variable in ascending order
mydata.sorted <- mydata[order(mydata$Gender),]
Sort gender variable in ascending order and then SAT in descending order
mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT),]

Note : "-" sign before mydata$SAT tells R to sort SAT variable in descending order.

Value labelling
Use factor() for nominal data (this example assumes Gender is coded as 1 and 2)
mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male",
"female"))
Use ordered() for ordinal data
mydata$var2 <- ordered(mydata$var2, levels = c(1,2,3,4), labels = c("Strongly agree",
"Somewhat agree", "Somewhat disagree", "Strongly disagree"))

Dealing with missing data


Number of missing values in a variable
colSums(is.na(mydata))
Number of missing values in a row
rowSums(is.na(mydata))
List rows of data that have missing values
mydata[!complete.cases(mydata),]
Creating a new dataset without missing data
mydata1 <- na.omit(mydata)
Convert a value to missing
mydata$Q1[which(mydata$Q1 == 999)] <- NA
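
The missing-data functions above can be tried on a small self-contained example. The survey data frame and the 999 missing-value code below are made up for illustration.

```r
survey <- data.frame(
  Q1 = c(1, 999, 3, NA),
  Q2 = c("a", NA, "c", "d"),
  stringsAsFactors = FALSE
)

# Treat the code 999 as missing; which() skips rows where Q1 is already NA
survey$Q1[which(survey$Q1 == 999)] <- NA

missing_per_column <- colSums(is.na(survey))   # Q1: 2, Q2: 1
missing_per_row    <- rowSums(is.na(survey))   # 0 2 0 1

incomplete <- survey[!complete.cases(survey), ]  # rows 2 and 4
clean      <- na.omit(survey)                    # rows 1 and 3
```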
Intersect, merge in R – intersection of data frames:
data1=data.frame(roll=c(1,2,3,4,5),
name=c('sachin','rahul','vijay','kapil','saurav'))
data1
data2=data.frame(roll=c(1,2,3,5),
marks=c(20,25,43,60))
data2
result=intersect(data1$roll,data2$roll)
result
result=merge(data1,data2,all=FALSE)
result
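
With all=FALSE, merge() keeps only the rolls present in both data frames. As a sketch of the alternative, the all.x argument keeps every row of the first data frame and fills the unmatched columns with NA. The names inner_join and left_join are chosen here for illustration.

```r
data1 <- data.frame(roll = c(1, 2, 3, 4, 5),
                    name = c('sachin', 'rahul', 'vijay', 'kapil', 'saurav'))
data2 <- data.frame(roll = c(1, 2, 3, 5),
                    marks = c(20, 25, 43, 60))

inner_join <- merge(data1, data2, by = "roll")                # 4 rows, roll 4 is dropped
left_join  <- merge(data1, data2, by = "roll", all.x = TRUE)  # 5 rows, marks is NA for roll 4
```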
Conclusion:
Experiment No: 6
Aim: Study and implementation of various control structures in R

Theory:
A loop helps you repeat a similar operation on different variables, columns, or datasets.
For example, suppose you want to multiply each variable by 5. Instead of multiplying
each variable one by one, you can perform this task in a loop. Its main benefit is to reduce
duplication in your code, which makes later changes easier.

If-Else and Nested If-Else in R

If-Else statements are an important part of R programming. In this tutorial, we will see various
ways to apply conditional statements (If..Else, nested IF) in R. In R, there are a lot of powerful
packages for data manipulation. In the later part of this tutorial, we will see how IF ELSE
statements are used in popular packages.

Sample Data

Let's create sample data to show how to use the IF ELSE function. This data frame will be
used in the examples that follow.
x1 x2 x3
1 129 A
3 178 B
5 140 C
7 186 D
9 191 E
11 104 F
13 150 G
15 183 H
17 151 I
19 142 J

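Run the program below to generate a table like the one above in R. The exact x2 values depend
on the random number sampler of your R version, so they may differ from those shown.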

set.seed(123)
mydata = data.frame(x1 = seq(1,20,by=2),
x2 = sample(100:200,10,FALSE),
x3 = LETTERS[1:10])
x1 = seq(1,20,by=2) : The variable 'x1' contains the odd numbers from 1 to 19. In total,
these are 10 numeric values.
x2 = sample(100:200,10,FALSE) : The variable 'x2' consists of 10 non-repeating random
numbers between 100 and 200.

x3 = LETTERS[1:10] : The variable 'x3' contains the first 10 uppercase letters, A to J.

Syntax of ifelse() function :

The ifelse() function in R works similar to MS Excel IF function. See the syntax below -

ifelse(condition, value if condition is true, value if condition is false)


Example 1 : Simple IF ELSE Statement

Suppose you are asked to create a binary variable - 1 or 0 based on the variable 'x2'. If value of a
variable 'x2' is greater than 150, assign 1 else 0.

mydata$x4 = ifelse(mydata$x2>150,1,0)
In this case, it creates a variable x4 in the same data frame 'mydata'.

Create variable in a new data frame

Suppose you need to add the above created binary variable in a new data frame. You can do it by
using the code below -
x = ifelse(mydata$x2>150,1,0)
newdata = cbind(x,mydata)
The cbind() is used to combine two vectors, matrices or data frames by columns.
Apply ifelse() on Character Variables

If variable 'x3' contains character values - 'A', 'D', the variable 'x1' should be multiplied by 2.
Otherwise it should be multiplied by 3.

mydata$y = ifelse(mydata$x3 %in% c("A","D") ,mydata$x1*2,mydata$x1*3)


The output is shown in the table below

x1 x2 x3 y
1 129 A 2
3 178 B 9
5 140 C 15
7 186 D 14
9 191 E 27
11 104 F 33
13 150 G 39
15 183 H 45
17 151 I 51
19 142 J 57

Example 2 : Nested If ELSE Statement in R

Multiple If Else statements can be written similarly to excel's If function. In this case, we are
telling R to multiply variable x1 by 2 if variable x3 contains values 'A' 'B'. If values are 'C' 'D',
multiply it by 3. Else multiply it by 4.
mydata$y = ifelse(mydata$x3 %in% c("A","B") ,mydata$x1*2,
ifelse(mydata$x3 %in% c("C","D"), mydata$x1*3,
mydata$x1*4))
Do you hate specifying data frame multiple times with each variable?

You can use with() function to avoid mentioning data frame each time. It makes writing R code
faster.

mydata$y = with(mydata, ifelse(x3 %in% c("A","B") , x1*2,


ifelse(x3 %in% c("C","D"), x1*3, x1*4)))

Special Topics related to IF ELSE

In this section, we will cover the following topics -

1. How to treat missing (NA) values in IF ELSE.


2. How to use OR and AND operators in IF ELSE
3. Aggregate or Summary Functions and IF ELSE Statement

Handle Missing Values

Incorrect Method

x = NA
ifelse(x==NA,1,0)
Result : NA
We wanted 1, but any comparison with NA evaluates to NA, so the condition is never TRUE.

Correct Method

x = NA
ifelse(is.na(x),1,0)
Result : 1
The is.na() function tests whether a value is NA or not.
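
The same idea extends to whole vectors. The sketch below uses a made-up vector x to flag missing values and to replace them with 0.

```r
x <- c(5, NA, 12)

comparison <- x == NA                      # NA NA NA: comparing with NA never gives TRUE

missing_flag <- ifelse(is.na(x), 1, 0)     # 0 1 0
filled       <- ifelse(is.na(x), 0, x)     # 5 0 12
```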

Use OR and AND Operators

The & symbol is used to perform AND conditions

ifelse(mydata$x1<10 & mydata$x2>150,1,0)


Result : 0 1 0 1 1 0 0 0 0 0

The | symbol is used to perform OR conditions

ifelse(mydata$x1<10 | mydata$x2>150,1,0)
Result : 1 1 1 1 1 0 0 1 1 0

Count cases where the condition is met

In this example, we count the number of records where the condition is met.

sum(ifelse(mydata$x1<10 | mydata$x2>150,1,0))
Result : 7

If Else Statement : Another Style

There is one more way to define if..else statement in R. This style of writing If Else is mostly
used when we use conditional statements in loop and R functions. In other words, it is used when
we need to perform various actions based on a condition.

Syntax -

if(condition) yes else no


k = 99
if(k > 100) 1 else 0
Result : 0

If..Else If..Else Statements

k = 100
if(k > 100){
print("Greater than 100")
} else if (k < 100){
print("Less than 100")
} else {
print ("Equal to 100")
}
Result : "Equal to 100"
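
Unlike ifelse(), the if statement expects a single TRUE or FALSE, so it is usually wrapped in a function and applied to one value at a time. The following is a minimal sketch; the function name classify is an assumption made for the example.

```r
classify <- function(k) {
  if (k > 100) {
    "Greater than 100"
  } else if (k < 100) {
    "Less than 100"
  } else {
    "Equal to 100"
  }
}

values <- c(99, 100, 101)

# Apply the scalar function to each value; character(1) states the expected result type
result <- vapply(values, classify, character(1))
# "Less than 100" "Equal to 100" "Greater than 100"
```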

If Else in Popular Packages

1. dplyr package

if_else(condition, value if condition is true, value if condition is false, value if NA)

The following program checks whether a value is a multiple of 2

library(dplyr)
x=c(1,NA,2,3)
if_else(x%%2==0, "Multiple of 2", "Not a multiple of 2", "Missing")
Result : "Not a multiple of 2" "Missing" "Multiple of 2" "Not a multiple of 2"

The %% operator returns the remainder after a value is divided by the divisor. In this case, the first
element 1 is divided by 2, leaving a remainder of 1.

2. sqldf package
We can write SQL query in R using sqldf package. In SQL, If Else statement is defined in CASE
WHEN.

df=data.frame(k=c(2,NA,3,4,5))
library(sqldf)
sqldf(
"SELECT *,
CASE WHEN (k%2)=0 THEN 'Multiple of 2'
WHEN k is NULL THEN 'Missing'
ELSE 'Not a multiple of 2'
END AS T
FROM df"
)
Output

k   T
2   Multiple of 2
NA  Missing
3   Not a multiple of 2
4   Multiple of 2
5   Not a multiple of 2

What is Loop?

A loop helps you repeat a similar operation on different variables, columns, or datasets.
For example, suppose you want to multiply each variable by 5. Instead of multiplying
each variable one by one, you can perform this task in a loop. Its main benefit is to reduce
duplication in your code, which makes later changes easier.

Ways to Write Loop in R


1. For Loop
2. While Loop
3. Apply Family of Functions such as Apply, Lapply, Sapply etc

Apply Family of Functions

They are the hidden loops in R. They make loops easier to read and write. But these concepts are
very new to the programming world as compared to For Loop and While Loop.

1. Apply Function

It is used when we want to apply a function to the rows or columns of a matrix or data frame. It
cannot be applied on lists or vectors.

Create a sample data set


dat <- data.frame(x = c(1:5,NA),
z = c(1, 1, 0, 0, NA,0),
y = 5*c(1:6))

Example 1 : Find Maximum value of each row

apply(dat, 1, max, na.rm= TRUE)


Output : 5 10 15 20 25 30

In the second parameter of apply function, 1 denotes the function to be applied at row level.

Example 2 : Find Maximum value of each column

apply(dat, 2, max, na.rm= TRUE)


The output is shown in the table below -
x z y
5 1 30
In the second parameter of apply function, 2 denotes the function to be applied at column
level.
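
The role of the second argument can be seen on a small matrix. This sketch uses a made-up 2 × 3 matrix m and compares the results with the built-in rowSums() and colSums().

```r
m <- matrix(1:6, nrow = 2)   # row 1 is 1 3 5, row 2 is 2 4 6

by_row    <- apply(m, 1, sum)   # 9 12: one result per row
by_column <- apply(m, 2, sum)   # 3 7 11: one result per column

same_as_rowSums <- isTRUE(all.equal(by_row, rowSums(m)))
same_as_colSums <- isTRUE(all.equal(by_column, colSums(m)))
```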

2. Lapply Function

lapply applies a function to each element of a data structure and returns a list.

Example 1 : Calculate Median of each of the variables


lapply(dat, function(x) median(x, na.rm = TRUE))
The function(x) is used to define the function we want to apply. The na.rm=TRUE is used to
ignore missing values and median would now be calculated on non-missing values.

Example 2 : Apply a custom function


lapply(dat, function(x) x + 1)
In this case, we are adding 1 to each variable. The final output is a list with one element per
variable.

3. Sapply Function

Sapply is a user-friendly version of Lapply: it applies a function to each element of a data
structure and simplifies the result to a vector when possible.

Example 1 : Number of Missing Values in each Variable

sapply(dat, function(x) sum(is.na(x)))


The above function returns 1,1,0 for variables x,z,y in data frame 'dat'.
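
To make the difference in return type concrete, the sketch below runs the same function through lapply and sapply on a two-column copy of the sample data (dat_demo is a name introduced here for illustration).

```r
dat_demo <- data.frame(x = c(1:5, NA), z = c(1, 1, 0, 0, NA, 0))

# lapply keeps one list element per column
as_list <- lapply(dat_demo, function(v) sum(is.na(v)))

# sapply simplifies the same result to a named vector
as_vector <- sapply(dat_demo, function(v) sum(is.na(v)))

print(class(as_list))
print(class(as_vector))
print(as_vector)
```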

Example 2 : Extract names of all numeric variables in IRIS dataset

colnames(iris)[which(sapply(iris,is.numeric))]
In this example, sapply(iris,is.numeric) returns TRUE/FALSE against each variable. If the
variable is numeric, it returns TRUE, otherwise FALSE. The which function then returns the
column positions of the numeric variables. Try running only this portion of the
code: which(sapply(iris,is.numeric)). Wrapping it in the colnames function returns the actual
names of the numeric variables.
Lapply and Sapply Together

In this example, we would show you how both lapply and sapply are used simultaneously to
solve the problem.

Create a sample data


dat <- data.frame(x = c(1:5,NA),
z = c(1, 1, 0, 0, NA,0),
y = factor(5*c(1:6)))

Converting Factor Variables to Numeric

The following code converts all the factor variables of data frame 'dat' to numeric
variables.
index <- sapply(dat, is.factor)
dat[index] <- lapply(dat[index], function(x) as.numeric(as.character(x)))
Explanation :
1. index would return TRUE / FALSE whether the variable is factor or not
2. Converting only those variables wherein index=TRUE.

4. For Loop

Like the apply family of functions, a For Loop is used to repeat the same task on multiple data
elements or datasets. It is similar to the FOR LOOP in other languages such as VB and Python.
The concept is not new and has been used in programming for many years.

Example 1 : Maximum value of each column

x = NULL
for (i in 1:ncol(dat)){
x[i]= max(dat[i], na.rm = TRUE)}
x
Before starting the loop, we create an empty vector with x=NULL. The next step is to define the
number of columns over which the loop is executed, which is done with the ncol function. The
length function could also be used, because the length of a data frame is its number of
columns.

The above FOR LOOP program can be written like the code below -

x = vector("double", ncol(dat))
for (i in seq_along(dat)){
x[i]= max(dat[i], na.rm = TRUE)}
x
The vector function creates a vector of a given type and length in advance. The seq_along
function generates the sequence of positions to loop over.
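
One practical reason to prefer seq_along over 1:ncol() is the zero-column case. The sketch below, using a hypothetical empty data frame, counts how many times each style of loop runs.

```r
empty <- data.frame()   # a data frame with zero columns

# 1:ncol(empty) is 1:0, which counts down through 1 and 0
runs_colon <- 0
for (i in 1:ncol(empty)) runs_colon <- runs_colon + 1

# seq_along(empty) is an empty sequence, so the loop body never runs
runs_seq <- 0
for (i in seq_along(empty)) runs_seq <- runs_seq + 1

print(c(colon = runs_colon, seq_along = runs_seq))
```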

Example 2 : Split IRIS data based on unique values in "species" variable

The program below creates multiple data frames based on the number of unique values in
variable Species in IRIS dataset.

for (i in 1:length(unique(iris$Species))) {
require(dplyr)
assign(paste("iris",i, sep = "."), filter(iris, Species == as.character(unique(iris$Species)[i])))
}
It returns three data frames named iris.1 iris.2 iris.3.

Combine / Append Data within LOOP

In the example below, we are combining / appending rows in iterative process. It is same as
PROC APPEND in SAS.

Method 1 : Use do.call with rbind

do.call() calls a function with the elements of a list as its arguments. When it is used with
rbind, it binds all the list elements by row. In other words, it converts a list of data frames
into a single data frame with multiple rows.

temp =list()
for (i in 1:length(unique(iris$Species))) {
series= data.frame(Species =as.character(unique(iris$Species))[i])
temp[[i]] =series
}
output = do.call(rbind, temp)
output
Method 2 : Use Standard Looping Technique

In this case, we are first creating an empty table (data frame). Later we are appending data to
empty data frame.
dummydt=data.frame(matrix(ncol=0,nrow=0))
for (i in 1:length(unique(iris$Species))) {
series= data.frame(Species =as.character(unique(iris$Species))[i])
if (i==1) {output = rbind(dummydt,series)} else {output = rbind(output,series)}
}
output
If we need to wrap the above code in a function, we need to make some changes to the code. For
example, data$variable won't work inside the function. Instead we should use data[[variable]].
See the code below -
dummydt=data.frame(matrix(ncol=0,nrow=0))
temp = function(data, var) {
for (i in 1:length(unique(data[[var]]))) {
series= data.frame(Species = as.character(unique(data[[var]]))[i])
if (i==1) {output = rbind(dummydt,series)} else {output = rbind(output,series)}
}
return(output)}
temp(iris, "Species")

For Loop and Sapply Together

Suppose you are asked to impute missing values with the median in each of the numeric variables
in a data frame. It becomes a daunting task if you don't know how to write a loop. Otherwise,
it's a straightforward task.

In the program below, which(sapply(dat, is.numeric)) makes sure loop runs only on numeric
variables.

for (i in which(sapply(dat, is.numeric))) {
  dat[is.na(dat[, i]), i] <- median(dat[, i], na.rm = TRUE)
}

Create new columns in Loop

Suppose you need to standardise multiple variables. To accomplish this task, we need to execute
the following steps -

1. Identify numeric variables


2. Calculate Z-score i.e. subtracting mean from original values and then divide it by
standard deviation of the raw variable.
3. Run Step2 for all the numeric variables
4. Make names of variables based on original names. For example x1_scaled.

Create dummy data


mydata = data.frame(x1=sample(1:100,100), x2=sample(letters,100, replace=TRUE),
x3=rnorm(100))
Standardize Variables

lst=list()
# Loop over the names of the numeric columns so each result is stored under its own label
for (v in names(mydata)[sapply(mydata, is.numeric)]) {
x.scaled = (mydata[[v]] - mean(mydata[[v]])) /sd(mydata[[v]])
lst[[paste(v, "_scaled", sep="")]] = x.scaled
}
mydata.scaled= data.frame(do.call(cbind, lst))
In this case, do.call with the cbind function turns the list into a matrix, which data.frame then converts to a data frame.

5. While Loop in R

A while loop is more general than a for loop: any for loop can be rewritten as a while loop,
but not vice versa.

In the example below, we check whether each number is odd or even.

i=1
while(i<7)
{
if(i%%2==0)
print(paste(i, "is an Even number"))
else if(i%%2>0)
print(paste(i, "is an Odd number"))
i=i+1
}
The double percent sign (%%) is the modulo operator. Read i%%2 as mod(i,2). The iteration runs
from 1 to 6 and stops as soon as the condition i<7 is no longer true.

Output:
[1] "1 is an Odd number"
[1] "2 is an Even number"
[1] "3 is an Odd number"
[1] "4 is an Even number"
[1] "5 is an Odd number"
[1] "6 is an Even number"
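
As a minimal sketch of rewriting a for loop as a while loop, the two fragments below both sum the numbers 1 to 5. In the while version the counter is initialised, tested and advanced by hand. The variable names are chosen here for illustration.

```r
# for version: the sequence controls the iteration
total_for <- 0
for (i in 1:5) {
  total_for <- total_for + i
}

# while version: the same iteration written out explicitly
total_while <- 0
i <- 1
while (i <= 5) {
  total_while <- total_while + i
  i <- i + 1
}

print(c(total_for, total_while))
```

Forgetting the line i <- i + 1 would make the while loop run forever, which is why a for loop is usually preferred when the number of iterations is known in advance.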

Loop Concepts : Break and Next

Break Keyword

When a loop encounters 'break' it stops the iteration and breaks out of loop.

for (i in 1:3) {
for (j in 3:1) {
if ((i+j) > 4) {
break } else {
print(paste("i=", i, "j=", j))
}
}
}
Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"

In this case, as soon as the condition i+j > 4 is met, break exits the inner loop (over j). The outer loop then continues with the next value of i.

Next Keyword

When a loop encounters 'next', it terminates the current iteration and moves to next iteration.
for (i in 1:3) {
for (j in 3:1) {
if ((i+j) > 4) {
next
} else {
print(paste("i=", i, "j=", j))
}
}
}

Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"
[1] "i= 2 j= 2"
[1] "i= 2 j= 1"
[1] "i= 3 j= 1"

If you get confused between 'break' and 'next', compare the output of both and see the difference.
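
The difference is easier to see in a single loop. The sketch below (with variable names chosen for illustration) records which values of i are reached when the same condition triggers break and when it triggers next.

```r
# break: the loop ends the first time i equals 3
seen_break <- c()
for (i in 1:5) {
  if (i == 3) break
  seen_break <- c(seen_break, i)
}

# next: only the iteration where i equals 3 is skipped
seen_next <- c()
for (i in 1:5) {
  if (i == 3) next
  seen_next <- c(seen_next, i)
}

print(seen_break)
print(seen_next)
```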

Conclusion:
Experiment No. 7
Aim: Data Manipulation with dplyr package

Theory:
The dplyr package is one of the most powerful and popular packages in R. It was written by
Hadley Wickham, who has also written many other useful R packages such as ggplot2 and tidyr.
This experiment includes several examples and tips on how to use the dplyr package for cleaning
and transforming data. It is a complete tutorial on data manipulation and data wrangling with R.

What is dplyr?
dplyr is a powerful R package to manipulate, clean and summarize data held in data frames. In
short, it makes data exploration and data manipulation easy and fast in R.

What's special about dplyr?


The package "dplyr" comprises many functions that perform the most commonly used data
manipulation operations, such as filtering rows, selecting specific columns, sorting data,
adding or deleting columns and aggregating data. Another important advantage of this package is
that its functions are easy to learn, use and recall. For example, filter() is used to filter
rows.

dplyr vs. Base R Functions


dplyr functions are often faster than their base R equivalents because they were written in a
computationally efficient manner. They also have a more consistent syntax and are designed to
work on data frames rather than on individual vectors.
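
As an illustration of the syntax difference (not a speed benchmark), the sketch below performs the same task in base R and in dplyr on the built-in mtcars dataset: keep the 4-cylinder cars, keep two columns and sort by mpg. The object names are chosen here for illustration.

```r
library(dplyr)

# Base R: index rows and columns, then reorder
base_res <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]
base_res <- base_res[order(base_res$mpg), ]

# dplyr: the same steps expressed as named verbs
dplyr_res <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  arrange(mpg)

# The sorted mpg values agree between the two approaches
print(all.equal(base_res$mpg, dplyr_res$mpg))
```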

SQL Queries vs. dplyr

People have been using SQL to analyze data for decades. Modern data analysis tools such as
Python, R and SAS support SQL commands. But SQL was never designed to perform data analysis; it
was designed for querying and managing data. There are many data analysis operations where SQL
fails or makes simple things difficult, for example calculating the median of multiple
variables or converting wide format data to long format. The dplyr package, in contrast, was
designed for data analysis.
The names of dplyr functions are similar to SQL commands: select() for selecting variables,
group_by() for grouping data by a grouping variable, and the join functions such as
inner_join() and left_join() for joining two data sets. Chaining dplyr functions also covers
the kind of work for which SQL sub-queries are popular.
How to install and load dplyr package

To install the dplyr package, type the following command.

install.packages("dplyr")
To load dplyr package, type the command below
library(dplyr)
Important dplyr Functions to remember

dplyr Function                  Description                      Equivalent SQL
select()                        Selecting columns (variables)    SELECT
filter()                        Filter (subset) rows             WHERE
group_by()                      Group the data                   GROUP BY
summarise()                     Summarise (or aggregate) data    -
arrange()                       Sort the data                    ORDER BY
inner_join(), left_join() etc   Joining data frames (tables)     JOIN
mutate()                        Creating new variables           COLUMN ALIAS
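
The mapping in the table can be sketched on the built-in mtcars dataset. The SQL statement in the comment is shown only for comparison and is not executed. The column name avg_mpg is an assumption made for this example.

```r
library(dplyr)

# SQL shown for comparison only:
#   SELECT cyl, AVG(mpg) AS avg_mpg
#   FROM mtcars
#   WHERE hp > 100
#   GROUP BY cyl
#   ORDER BY avg_mpg DESC
result <- mtcars %>%
  filter(hp > 100) %>%                    # WHERE
  group_by(cyl) %>%                       # GROUP BY
  summarise(avg_mpg = mean(mpg)) %>%      # aggregate function
  arrange(desc(avg_mpg))                  # ORDER BY ... DESC

print(result)
```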

Data : Income Data by States

In this tutorial, we are using the following data, which contains income generated by states
from year 2002 to 2015. Note : This data does not contain actual income figures of the states.

This dataset contains 51 observations (rows) and 16 variables (columns). A snapshot of the
first 6 rows of the dataset is shown below.

  Index State        Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008   Y2009
1 A     Alabama    1296530 1317711 1118631 1492583 1107408 1440134 1945229 1944173
2 A     Alaska     1170302 1960378 1818085 1447852 1861639 1465841 1551826 1436541
3 A     Arizona    1742027 1968140 1377583 1782199 1102568 1109382 1752886 1554330
4 A     Arkansas   1485531 1994927 1119299 1947979 1669191 1801213 1188104 1628980
5 C     California 1685349 1675807 1889570 1480280 1735069 1812546 1487315 1663809
6 C     Colorado   1343824 1878473 1886149 1236697 1871471 1814218 1875146 1752387

    Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
1 1237582 1440756 1186741 1852841 1558906 1916661
2 1629616 1230866 1512804 1985302 1580394 1979143
3 1300521 1130709 1907284 1363279 1525866 1647724
4 1669295 1928238 1216675 1591896 1360959 1329341
5 1624509 1639670 1921845 1156536 1388461 1644607
6 1913275 1665877 1491604 1178355 1383978 1330736


How to load Data

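Submit the following code. Change the file path in the code below. The stringsAsFactors = TRUE
argument reads the text columns (Index, State) as factors, which several later examples assume;
since R 4.0.0, read.csv reads them as character by default.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv", stringsAsFactors = TRUE)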
Example 1 : Selecting Random N Rows

The sample_n function selects random rows from a data frame (or table). The second parameter
of the function tells R the number of rows to select.

sample_n(mydata,3)

   Index State      Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008   Y2009
2  A     Alaska   1170302 1960378 1818085 1447852 1861639 1465841 1551826 1436541
8  D     Delaware 1330403 1268673 1706751 1403759 1441351 1300836 1762096 1553585
33 N     New York 1395149 1611371 1170675 1446810 1426941 1463171 1732098 1426216

     Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
2  1629616 1230866 1512804 1985302 1580394 1979143
8  1370984 1318669 1984027 1671279 1803169 1627508
33 1604531 1683687 1500089 1718837 1619033 1367705

Example 2 : Selecting Random Fraction of Rows

The sample_frac function returns a random fraction of the rows. In the example below, it
returns a random 10% of the rows.

sample_frac(mydata,0.1)

Example 3 : Remove Duplicate Rows based on all the variables (Complete Row)

The distinct function is used to eliminate duplicates.

x1 = distinct(mydata)
In this dataset, there is not a single duplicate row, so it returns the same number of rows as
mydata.

Example 4 : Remove Duplicate Rows based on a variable

The .keep_all argument is used to retain all other variables in the output data frame.

x2 = distinct(mydata, Index, .keep_all= TRUE)

Example 5 : Remove Duplicate Rows based on multiple variables

In the example below, we are using two variables - Index, Y2010 to determine uniqueness.

x2 = distinct(mydata, Index, Y2010, .keep_all= TRUE)

select( ) Function

It is used to select only desired variables.

select() syntax : select(data , ....)


data : Data Frame
.... : Variables by name or by function

Example 6 : Selecting Variables (or Columns)

Suppose you are asked to select only a few variables. The code below selects variables "Index",
columns from "State" to "Y2008".
mydata2 = select(mydata, Index, State:Y2008)

Example 7 : Dropping Variables

The minus sign before a variable tells R to drop the variable.

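mydata_drop = select(mydata, -Index, -State)

The result is stored under a new name (mydata_drop) so that mydata keeps the Index and State
columns used in later examples. The above code can also be written like :

mydata_drop = select(mydata, -c(Index,State))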

Example 8 : Selecting or Dropping Variables that Start with 'Y'

The starts_with() function is used to select variables whose names start with a given prefix.

mydata3 = select(mydata, starts_with("Y"))


Adding a negative sign before starts_with() drops the variables whose names start with 'Y'

mydata33 = select(mydata, -starts_with("Y"))


The following functions help you to select variables based on their names.

Helpers          Description
starts_with()    Starts with a prefix
ends_with()      Ends with a suffix
contains()       Contains a literal string
matches()        Matches a regular expression
num_range()      Numerical range like x01, x02, x03
one_of()         Variables in a character vector
everything()     All variables
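
A few of these helpers can be tried on a small data frame. The sketch below assumes a hypothetical data frame named toy whose column names were chosen only to exercise each helper.

```r
library(dplyr)

# Hypothetical columns, chosen to exercise each helper
toy <- data.frame(id = 1:2, x01 = 0, x02 = 0, x10 = 0,
                  sales_q1 = 0, sales_q2 = 0)

by_suffix <- names(select(toy, ends_with("q1")))                  # suffix match
by_regex  <- names(select(toy, matches("^x[0-9]+$")))             # regular expression
by_range  <- names(select(toy, num_range("x", 1:2, width = 2)))   # x01 and x02
by_vector <- names(select(toy, one_of(c("id", "x10"))))           # names held in a vector

print(by_suffix)
print(by_regex)
print(by_range)
print(by_vector)
```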

Example 9 : Selecting Variables that Contain 'I' in their Names


mydata4 = select(mydata, contains("I"))

Example 10 : Reorder Variables

The code below keeps variable 'State' in the front and the remaining variables follow that.

mydata5 = select(mydata, State, everything())


The new order of variables is displayed below -

[1] "State" "Index" "Y2002" "Y2003" "Y2004" "Y2005" "Y2006" "Y2007" "Y2008" "Y2009"

[11] "Y2010" "Y2011" "Y2012" "Y2013" "Y2014" "Y2015"

rename( ) Function

It is used to change variable name.

rename() syntax : rename(data , new_name = old_name)


data : Data Frame
new_name : New variable name you want to keep
old_name : Existing Variable Name

Example 11 : Rename Variables

The rename function can be used to rename variables.

In the following code, we are renaming 'Index' variable to 'Index1'.

mydata6 = rename(mydata, Index1=Index)


filter( ) Function

It is used to subset data with matching logical conditions.

filter() syntax : filter(data , ....)


data : Data Frame
.... : Logical Condition

Example 12 : Filter Rows

Suppose you need to subset data. You want to filter rows and retain only those values in which
Index is equal to A.
mydata7 = filter(mydata, Index == "A")
  Index State      Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008   Y2009
1 A     Alabama  1296530 1317711 1118631 1492583 1107408 1440134 1945229 1944173
2 A     Alaska   1170302 1960378 1818085 1447852 1861639 1465841 1551826 1436541
3 A     Arizona  1742027 1968140 1377583 1782199 1102568 1109382 1752886 1554330
4 A     Arkansas 1485531 1994927 1119299 1947979 1669191 1801213 1188104 1628980

    Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
1 1237582 1440756 1186741 1852841 1558906 1916661
2 1629616 1230866 1512804 1985302 1580394 1979143
3 1300521 1130709 1907284 1363279 1525866 1647724
4 1669295 1928238 1216675 1591896 1360959 1329341

Example 13 : Multiple Selection Criteria

The %in% operator can be used to select multiple items. In the following program, we are
telling R to select rows against 'A' and 'C' in column 'Index'.

mydata7 = filter(mydata, Index %in% c("A", "C"))

Example 14 : 'AND' Condition in Selection Criteria

Suppose you need to apply an 'AND' condition. In this case, we are picking data for 'A' and 'C'
in the column 'Index' with income greater than or equal to 1.3 million in Year 2002.

mydata8 = filter(mydata, Index %in% c("A", "C") & Y2002 >= 1300000 )

Example 15 : 'OR' Condition in Selection Criteria

The '|' denotes OR in the logical condition. It means either of the two conditions.
mydata9 = filter(mydata, Index %in% c("A", "C") | Y2002 >= 1300000)
Example 16 : NOT Condition
The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata, !Index %in% c("A", "C"))

Example 17 : CONTAINS Condition

The grepl function is used for pattern matching. In the following code, we are looking for
records in which the column State contains 'Ar'.

mydata10 = filter(mydata, grepl("Ar", State))

summarise( ) Function

It is used to summarize data.

summarise() syntax : summarise(data , ....)


data : Data Frame
..... : Summary Functions such as mean, median etc

Example 18 : Summarize selected variables

In the example below, we are calculating mean and median for the variable Y2015.

summarise(mydata, Y2015_mean = mean(Y2015), Y2015_med=median(Y2015))


Example 19 : Summarize Multiple Variables

In the following example, we are calculating number of records, mean and median for variables
Y2005 and Y2006. The summarise_at function allows us to select multiple variables by their
names.

summarise_at(mydata, vars(Y2005, Y2006), funs(n(), mean, median))
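
Note that funs() has been deprecated since dplyr 0.8.0, and the scoped verbs such as summarise_at() have been superseded by across() since dplyr 1.0.0. As a hedged sketch of the newer style, the code below computes the same kind of summary on a made-up stand-in for the income data (the data frame inc and its values are assumptions for illustration).

```r
library(dplyr)

# Toy stand-in for the income data; the values are made up
inc <- data.frame(Y2005 = c(10, 20, 30), Y2006 = c(1, 2, 6))

res <- summarise(inc,
                 n = n(),
                 across(c(Y2005, Y2006),
                        list(mean = mean, median = median)))

# Result columns follow the pattern <column>_<function>
print(res)
```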

Example 20 : Summarize with Custom Functions
We can also use custom functions in the summarise function. In this case, we are computing the
number of records, number of missing values, mean and median for variables Y2011 and Y2012.
The dot (.) denotes each variable specified in the second argument of the function.
summarise_at(mydata, vars(Y2011, Y2012),
funs(n(), missing = sum(is.na(.)), mean(., na.rm = TRUE), median(.,na.rm = TRUE)))


How to apply Non-Standard Functions

Suppose you want to subtract mean from its original value and then calculate variance of it.

set.seed(222)
# Stored as mydata_x so that the income data in mydata is not overwritten
mydata_x <- data.frame(X1=sample(1:100,100), X2=runif(100))
summarise_at(mydata_x,vars(X1,X2), function(x) var(x - mean(x)))

        X1         X2
1 841.6667 0.08142161

Example 21 : Summarize all Numeric Variables

The summarise_if function allows you to summarise conditionally.

summarise_if(mydata, is.numeric, funs(n(),mean,median))


Alternative Method :

First, store data for all the numeric variables


numdata = mydata[sapply(mydata,is.numeric)]
Second, the summarise_all function calculates summary statistics for all the columns in a data
frame
summarise_all(numdata, funs(n(),mean,median))

Example 22 : Summarize Factor Variable

We are checking the number of levels/categories and count of missing observations in a
categorical (factor) variable.
summarise_all(mydata["Index"], funs(nlevels(.), nmiss=sum(is.na(.))))

  nlevels nmiss
1      19     0

arrange() function :

Use : Sort data

Syntax
arrange(data_frame, variable(s)_to_sort)
or
data_frame %>% arrange(variable(s)_to_sort)
To sort a variable in descending order, use desc(x).

Example 23 : Sort Data by Multiple Variables

The default sorting order of arrange() function is ascending. In this example, we are sorting
data by multiple variables.
arrange(mydata, Index, Y2011)
Suppose you need to sort one variable in descending order and another variable in ascending order.
arrange(mydata, desc(Index), Y2011)
Pipe Operator %>%

It is important to understand the pipe (%>%) operator before learning the other functions of
the dplyr package. dplyr uses the pipe operator from another package (magrittr).
It lets you chain operations in the way sub-queries are written in SQL.
Note : All the functions in the dplyr package can be used without the pipe operator. So why use
the pipe operator %>%? Because it lets you chain multiple functions together, passing the
result of one step as the first argument of the next.

Syntax :
filter(data_frame, variable == value)
or
data_frame %>% filter(variable == value)
The %>% is NOT restricted to filter function. It can be used with any function.

Example :
The code below demonstrates the usage of pipe %>% operator. In this example, we are selecting
10 random observations of two variables "Index" "State" from the data frame "mydata".
dt = sample_n(select(mydata, Index, State),10)
or
dt = mydata %>% select(Index, State) %>% sample_n(10)


group_by() function :

Use : Group data by categorical variable

Syntax :
group_by(data, variables)
or
data %>% group_by(variables)

Example 24 : Summarise Data by Categorical Variable

We are calculating count and mean of variables Y2011 and Y2012 by variable Index.
t = summarise_at(group_by(mydata, Index), vars(Y2011, Y2012), funs(n(), mean(., na.rm =
TRUE)))
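The same summary can be written with the pipe operator. The version below covers the wider range Y2011 to Y2015, which is what the output (truncated to the first two mean columns) shows.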
t = mydata %>% group_by(Index) %>%
summarise_at(vars(Y2011:Y2015), funs(n(), mean(., na.rm = TRUE)))

Index Y2011_n Y2012_n Y2013_n Y2014_n Y2015_n Y2011_mean Y2012_mean
A           4       4       4       4       4    1432642    1455876
C           3       3       3       3       3    1750357    1547326
D           2       2       2       2       2    1336059    1981868
F           1       1       1       1       1    1497051    1131928
G           1       1       1       1       1    1851245    1850111
H           1       1       1       1       1    1902816    1695126
I           4       4       4       4       4    1690171    1687056
K           2       2       2       2       2    1489353    1899773
L           1       1       1       1       1    1210385    1234234
M           8       8       8       8       8    1582714    1586091
N           8       8       8       8       8    1448351    1470316
O           3       3       3       3       3    1882111    1602463
P           1       1       1       1       1    1483292    1290329
R           1       1       1       1       1    1781016    1909119
S           2       2       2       2       2    1381724    1671744
T           2       2       2       2       2    1724080    1865787
U           1       1       1       1       1    1288285    1108281
V           2       2       2       2       2    1482143    1488651
W           4       4       4       4       4    1711341    1660192

do() function :

Use : Compute within groups


Syntax :
do(data_frame, expressions_to_apply_to_each_group)
Note : The dot (.) is required to refer to a data frame.

Example 25: Filter Data within a Categorical Variable

Suppose you need to pull top 2 rows from 'A', 'C' and 'I' categories of variable Index.
t = mydata %>% filter(Index %in% c("A", "C","I")) %>% group_by(Index) %>%
do(head( . , 2))


Example 26 : Selecting 3rd Maximum Value by Categorical Variable

We are calculating third maximum value of variable Y2015 by variable Index. The following
code first selects only two variables Index and Y2015. Then it filters the variable Index with 'A',
'C' and 'I' and then it groups the same variable and sorts the variable Y2015 in descending order.
At last, it selects the third row.

t = mydata %>% select(Index, Y2015) %>%
  filter(Index %in% c("A", "C","I")) %>%
  group_by(Index) %>%
  do(arrange(.,desc(Y2015))) %>% slice(3)
The slice() function is used to select rows by position.

Using Window Functions

Like SQL, dplyr provides window functions, which take a vector of values and return a vector of
the same length. They can be used to subset data within a group. For instance, the min_rank()
function calculates ranks, so the preceding example can be written as follows:

t = mydata %>% select(Index, Y2015) %>%
  filter(Index %in% c("A", "C","I")) %>%
  group_by(Index) %>%
  filter(min_rank(desc(Y2015)) == 3)

  Index   Y2015
1 A     1647724
2 C     1330736
3 I     1583516
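
min_rank() is one of several ranking functions, and they differ in how ties are handled. The sketch below compares three of them on a short made-up vector v in which the largest value occurs twice.

```r
library(dplyr)

v <- c(50, 20, 50, 10)   # made-up values with a tie at the top

r_min   <- min_rank(desc(v))     # ties share the lowest rank, then a gap follows
r_dense <- dense_rank(desc(v))   # ties share a rank, no gap
r_row   <- row_number(desc(v))   # ties broken by position, every rank unique

print(rbind(v, r_min, r_dense, r_row))
```

With ties present, a filter such as min_rank(desc(x)) == 3 may return no row for a group, whereas row_number() always has exactly one row at each position.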

Example 27 : Summarize, Group and Sort Together

In this case, we are computing the mean of variables Y2014 and Y2015 by variable Index, and
then sorting the result by the calculated mean of Y2015 in descending order.
t = mydata %>%
group_by(Index)%>%
summarise(Mean_2014 = mean(Y2014, na.rm=TRUE),
Mean_2015 = mean(Y2015, na.rm=TRUE)) %>%
arrange(desc(Mean_2015))

mutate() function :

Use : Creates new variables

Syntax :
mutate(data_frame, expression(s) )
or
data_frame %>% mutate(expression(s))
Example 28 : Create a new variable

The following code divides Y2015 by Y2014 and names the result "change".
mydata1 = mutate(mydata, change=Y2015/Y2014)
Example 29 : Multiply all the variables by 1000
It creates new variables and names them with the suffix "_new".
mydata11 = mutate_all(mydata, funs("new" = .* 1000))


Note - The above code returns the following warning messages -

Warning messages:
1: In Ops.factor(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, :
‘*’ not meaningful for factors
2: In Ops.factor(1:51, 1000) : ‘*’ not meaningful for factors

It implies you are multiplying string (character) values, which are stored as factor variables,
by 1000. These variables are 'Index' and 'State'. It does not make sense to apply a
multiplication operation to character variables. For these two variables, the newly created
variables contain only NA.

Solution : See Example 45 - Apply multiplication on only numeric variables

Example 30 : Calculate Rank for Variables

Suppose you need to calculate rank for variables Y2008 to Y2010.

mydata12 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(.)))


Output
By default, min_rank() assigns 1 to the smallest value and the highest rank to the largest
value. If you need to assign rank 1 to the largest value of a variable, use min_rank(desc(.))

mydata13 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(desc(.))))


Example 31 : Select the State that generated the highest income within each 'Index' group
out = mydata %>% group_by(Index) %>% filter(min_rank(desc(Y2015)) == 1) %>%
select(Index, State, Y2015)

   Index State           Y2015
1  A     Alaska        1979143
2  C     Connecticut   1718072
3  D     Delaware      1627508
4  F     Florida       1170389
5  G     Georgia       1725470
6  H     Hawaii        1150882
7  I     Idaho         1757171
8  K     Kentucky      1913350
9  L     Louisiana     1403857
10 M     Missouri      1996005
11 N     New Hampshire 1963313
12 O     Oregon        1893515
13 P     Pennsylvania  1668232
14 R     Rhode Island  1611730
15 S     South Dakota  1136443
16 T     Texas         1705322
17 U     Utah          1729273
18 V     Virginia      1850394
19 W     Wyoming       1853858

Example 32 : Cumulative Income of 'Index' variable

The cumsum function calculates cumulative sum of a variable. With mutate function, we
insert a new variable called 'Total' which contains values of cumulative income of variable
Index.

out2 = mydata %>% group_by(Index) %>% mutate(Total=cumsum(Y2015)) %>%
  select(Index, Y2015, Total)

join() function :

Use : Join two datasets

Syntax :
inner_join(x, y, by = )
left_join(x, y, by = )
right_join(x, y, by = )
full_join(x, y, by = )
semi_join(x, y, by = )
anti_join(x, y, by = )
x, y - datasets (or tables) to merge / join
by - common variable (primary key) to join by.

Example 33 : Common rows in both the tables

df1 = data.frame(ID = c(1, 2, 3, 4, 5),
                 w = c('a', 'b', 'c', 'd', 'e'),
                 x = c(1, 1, 0, 0, 1),
                 y = rnorm(5),
                 z = letters[1:5])

df2 = data.frame(ID = c(1, 7, 3, 6, 8),
                 a = c('z', 'b', 'k', 'd', 'l'),
                 b = c(1, 2, 3, 0, 4),
                 c = rnorm(5),
                 d = letters[2:6])

INNER JOIN returns rows when there is a match in both tables. In this example, we are
merging df1 and df2 with ID as common variable (primary key).

df3 = inner_join(df1, df2, by = "ID")



If the primary key does not have same name in both the tables, try the following way:
inner_join(df1, df2, by = c("ID"="ID1"))
Example 34 : Applying LEFT JOIN

LEFT JOIN : It returns all rows from the left table, even if there are no matches in the right
table.

left_join(df1, df2, by = "ID")

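
The remaining join types listed in the syntax can be sketched in the same way. The two small tables below (tbl_left and tbl_right) are hypothetical and are used only to show which rows each join keeps.

```r
library(dplyr)

# Hypothetical tables sharing the key column ID
tbl_left  <- data.frame(ID = c(1, 2, 3), w = c("a", "b", "c"))
tbl_right <- data.frame(ID = c(1, 3, 6), a = c("z", "k", "d"))

full <- full_join(tbl_left, tbl_right, by = "ID")  # every ID from either table
semi <- semi_join(tbl_left, tbl_right, by = "ID")  # left rows that have a match, left columns only
anti <- anti_join(tbl_left, tbl_right, by = "ID")  # left rows that have no match

print(full)
print(semi)
print(anti)
```

semi_join() and anti_join() are filtering joins: they never add columns from the second table.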

Combine Data Vertically


intersect(x, y)
Rows that appear in both x and y.

union(x, y)
Rows that appear in either or both x and y.

setdiff(x, y)
Rows that appear in x but not y.

Example 35 : Applying INTERSECT

Prepare Sample Data for Demonstration

mtcars$model <- rownames(mtcars)


first <- mtcars[1:20, ]
second <- mtcars[10:32, ]
INTERSECT selects unique rows that are common to both the data frames.

intersect(first, second)

Example 36 : Applying UNION

UNION displays all rows from both the tables and removes duplicate records from the combined
dataset. The union_all function, in contrast, keeps duplicate rows in the combined dataset.

x=data.frame(ID = 1:6, ID1= 1:6)


y=data.frame(ID = 1:6, ID1 = 1:6)
union(x,y)
union_all(x,y)

Example 37 : Rows that appear in one table but not in the other


setdiff(first, second)

Example 38 : IF ELSE Statement

Syntax :

if_else(condition, true, false, missing = NULL)


true : Value if the condition is met
false : Value if the condition is not met
missing : Value used where the condition is missing (NA). It replaces missing values (Default : NULL)
df <- c(-10,2, NA)
if_else(df < 0, "negative", "positive", missing = "missing value")
Create a new variable with IF_ELSE

If a value is less than 5, add 1 to it; if it is greater than or equal to 5, add 2 to it. If it is missing, return 0.

df =data.frame(x = c(1,5,6,NA))
df %>% mutate(newvar=if_else(x<5, x+1, x+2,0))


Nested IF ELSE

Multiple IF ELSE statements can be nested using the if_else() function. See the example below -
mydf =data.frame(x = c(1:5,NA))
mydf %>% mutate(flag= if_else(is.na(x),"I am missing",
if_else(x==1,"I am one",
if_else(x==2,"I am two",
if_else(x==3,"I am three","Others")))))
Output

   x flag
1  1 I am one
2  2 I am two
3  3 I am three
4  4 Others
5  5 Others
6 NA I am missing
SQL-Style CASE WHEN Statement
We can use the case_when() function to write nested if-else logic. Inside case_when(), you can
refer to variables directly. Conditions are checked in order, and the final TRUE acts as the
ELSE statement.

mydf %>% mutate(flag = case_when(is.na(x) ~ "I am missing",
                                 x == 1 ~ "I am one",
                                 x == 2 ~ "I am two",
                                 x == 3 ~ "I am three",
                                 TRUE ~ "Others"))

Important Point

Make sure you put the is.na() condition first in a nested if_else(). Otherwise, it has no
effect, because a missing value already returns NA at the outer condition.
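
The point about ordering can be checked on a short made-up vector. In the sketch below, the same two tests are nested in both orders. The names first and late are chosen here for illustration.

```r
library(dplyr)

x <- c(1, NA, 4)   # made-up values with one missing entry

# is.na() tested first: the missing value gets its own label
first <- if_else(is.na(x), "missing",
                 if_else(x == 1, "one", "other"))

# is.na() tested second: x == 1 is NA for the missing value, so the
# outer if_else() yields NA for that element whatever the inner branch holds
late <- if_else(x == 1, "one",
                if_else(is.na(x), "missing", "other"))

print(first)
print(late)
```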

Example 39 : Apply ROW WISE Operation

Suppose you want to find the maximum value in each row across the variables Y2012, Y2013, Y2014
and Y2015. The rowwise() function allows you to apply functions to rows.
df = mydata %>%
rowwise() %>% mutate(Max= max(Y2012,Y2013,Y2014,Y2015)) %>%
select(Y2012:Y2015,Max)
Example 40 : Combine Data Frames

Suppose you are asked to combine two data frames. Let's first create two sample datasets.
df1=data.frame(ID = 1:6, x=letters[1:6])
df2=data.frame(ID = 7:12, x=letters[7:12])
The bind_rows() function combines two datasets by rows. So the combined dataset would
contain 12 rows (6+6) and 2 columns.

xy = bind_rows(df1,df2)
It is equivalent to base R function rbind.
xy = rbind(df1,df2)
The bind_cols() function combines two datasets by columns. So the combined dataset would
contain 4 columns and 6 rows.

xy = bind_cols(df1,df2)
or
xy = cbind(df1,df2)

Example 41 : Calculate Percentile Values

The quantile() function is used to determine Nth percentile value. In this example, we are
computing percentile values by variable Index.
mydata %>% group_by(Index) %>%
summarise(Percentile_25=quantile(Y2015, probs=0.25),
Percentile_50=quantile(Y2015, probs=0.5),
Percentile_75=quantile(Y2015, probs=0.75),
Percentile_99=quantile(Y2015, probs=0.99))

The ntile() function is used to divide the data into N bins.


x= data.frame(N= 1:10)
x = mutate(x, pos = ntile(N, 5))

Example 42 : Automate Model Building

This example explains the advanced usage of do() function. In this example, we are building
linear regression model for each level of a categorical variable. There are 3 levels in variable cyl
of dataset mtcars.

length(unique(mtcars$cyl))
Result : 3

by_cyl <- group_by(mtcars, cyl)


models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
summarise(models, rsq = summary(mod)$r.squared)
models %>% do(data.frame(
var = names(coef(.$mod)),
coef(summary(.$mod)))
)


if() Family of Functions

It includes functions like select_if, mutate_if and summarise_if. They act only on the columns
for which a logical condition is met. See examples below.

Example 43 : Select only numeric columns

The select_if() function returns only those columns where the logical condition is TRUE.
Here, is.numeric retains only the numeric variables.
mydata2 = select_if(mydata, is.numeric)
Similarly, you can use the following code for selecting factor columns -
mydata3 = select_if(mydata, is.factor)
Example 44 : Number of levels in factor variables

Like the select_if() function, the summarise_if() function summarises only the variables for which
the logical condition holds. (The funs() helper used in older tutorials is deprecated in dplyr.)
summarise_if(mydata, is.factor, nlevels)
It returns 19 levels for variable Index and 51 levels for variable State.

Example 45 : Multiply numeric variables by 1000

mydata11 = mutate_if(mydata, is.numeric, list(new = ~ . * 1000))

Example 46 : Convert value to NA

In this example, we are converting "" to NA using na_if() function.

k <- c("a", "b", "", "d")


na_if(k, "")
Result : "a" "b" NA "d"

Conclusion:
Experiment No. 8
Aim: Data Manipulation with data.table package

Theory:
The data.table R package is widely regarded as one of the fastest packages for data manipulation.
This experiment includes various examples and practice questions to make you familiar with the package.
Analysts often consider R unsuitable for big datasets ( > 10 GB) because it is not memory
efficient and loads everything into RAM. The data.table package was designed to change that
perception while keeping the syntax concise. In past benchmarks comparing dplyr and data.table,
data.table has generally been the faster of the two, and it has also compared favorably with
Python's pandas package. On CRAN, more than 200 packages depend on data.table, which
places it among the top 5 R packages by that measure.

data.table Syntax

The general form of a data.table query is :

DT[ i , j , by]
1. The first parameter, i, refers to rows. It subsets rows and is equivalent to the
WHERE clause in SQL.
2. The second parameter, j, refers to columns. It subsets columns (dropping / keeping)
or computes on them. It is equivalent to the SELECT clause in SQL.
3. The third parameter, by, adds a grouping so that all calculations are done
within each group. It is equivalent to SQL's GROUP BY clause.
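
The three parts can be seen together in a minimal sketch. The table DT and its values are made up for illustration and are not the flights data used later.

```r
library(data.table)

# Toy table (illustrative values, not the flights data)
DT <- data.table(origin = c("JFK", "JFK", "LGA", "EWR", "LGA"),
                 delay  = c(10, 20, 5, 30, 15))

# i: keep rows with delay >= 10; j: compute the mean delay; by: per origin
res <- DT[delay >= 10, .(mean_delay = mean(delay)), by = origin]
print(res)
```

Read it like SQL: WHERE delay >= 10, SELECT mean(delay), GROUP BY origin.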

The data.table syntax is not restricted to these 3 parameters. Other arguments and special
symbols can be used inside DT[ ]. The list is as follows -
1. with, which
2. allow.cartesian
3. roll, rollends
4. .SD, .SDcols
5. on, mult, nomatch
Several of these are explained later in this experiment.
How to Install and load data.table Package

install.packages("data.table")
#load required library
library(data.table)

Read Data

In the data.table package, the fread() function reads data from a file on your computer or
from a web page. It is the counterpart of the read.csv() function of base R.

mydata = fread("https://github.com/arunsrinivasan/satrdays-workshop/raw/master/flights_2014.csv")

Describe Data

This dataset contains 253K observations and 17 columns. It constitutes information about flights'
arrival or departure time, delays, flight cancellation and destination in year 2014.

nrow(mydata)
[1] 253316
ncol(mydata)
[1] 17
names(mydata)
[1] "year" "month" "day" "dep_time" "dep_delay" "arr_time" "arr_delay"
[8] "cancelled" "carrier" "tailnum" "flight" "origin" "dest" "air_time"
[15] "distance" "hour" "min"
head(mydata)
   year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight
1: 2014     1   1      914        14     1238        13         0      AA  N338AA      1
2: 2014     1   1     1157        -3     1523        13         0      AA  N335AA      3
3: 2014     1   1     1902         2     2224         9         0      AA  N327AA     21
4: 2014     1   1      722        -8     1014       -26         0      AA  N3EHAA     29
5: 2014     1   1     1347         2     1706         1         0      AA  N319AA    117
6: 2014     1   1     1824         4     2145         0         0      AA  N3DEAA    119
   origin dest air_time distance hour min
1:    JFK  LAX      359     2475    9  14
2:    JFK  LAX      363     2475   11  57
3:    JFK  LAX      351     2475   19   2
4:    LGA  PBI      157     1035    7  22
5:    JFK  LAX      350     2475   13  47
6:    EWR  LAX      339     2454   18  24
Selecting or Keeping Columns

Suppose you need to select only 'origin' column. You can use the code below -

dat1 = mydata[ , origin] # returns a vector


The above line of code returns a vector, not a data.table.

To get the result as a data.table, run the code below :

dat1 = mydata[ , .(origin)] # returns a data.table


It can also be written in the data.frame style, with the column name as a string :

dat1 = mydata[, c("origin"), with=FALSE]

Keeping a column based on column position

dat2 =mydata[, 2, with=FALSE]


In this code, we are selecting second column from mydata.

Keeping Multiple Columns

The following code tells R to select 'origin', 'year', 'month', 'hour' columns.

dat3 = mydata[, .(origin, year, month, hour)]

Keeping multiple columns based on column position

You can keep second through fourth columns using the code below -

dat4 = mydata[, c(2:4), with=FALSE]

Dropping a Column

Suppose you want to keep all the variables except one column, say 'origin'. This is easily
done by adding the ! sign (which implies negation in R).

dat5 = mydata[, !c("origin"), with=FALSE]

Dropping Multiple Columns


dat6 = mydata[, !c("origin", "year", "month"), with=FALSE]

Keeping variables that contain 'dep'

You can use the %like% operator to match a pattern. It works like base R's grepl() function,
SQL's LIKE operator and SAS's CONTAINS function.

dat7 = mydata[,names(mydata) %like% "dep", with=FALSE]

Rename Variables

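You can rename variables with the setnames() function, which modifies the data.table by
reference. In the following code, we rename the variable 'dest' to 'Destination'.

setnames(mydata, c("dest"), c("Destination"))


To rename multiple variables, list them on both sides. Run this instead of, not after, the
previous command, because 'dest' no longer exists once it has been renamed.

setnames(mydata, c("dest","origin"), c("Destination", "origin.of.flight"))

The later examples use the original column names, so restore them before continuing.

setnames(mydata, c("Destination", "origin.of.flight"), c("dest", "origin"))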

Subsetting Rows / Filtering

Suppose you are asked to find all the flights whose origin is 'JFK'.

# Filter based on one variable


dat8 = mydata[origin == "JFK"]
Select Multiple Values

Filter all the flights whose origin is either 'JFK' or 'LGA'

dat9 = mydata[origin %in% c("JFK", "LGA")]

Apply Logical Operator : NOT

The following program selects all the flights whose origin is neither 'JFK' nor 'LGA'

# Exclude Values
dat10 = mydata[!origin %in% c("JFK", "LGA")]

Filter based on Multiple variables

To select all the flights whose origin is 'JFK' and whose carrier is 'AA', combine the conditions with & :
dat11 = mydata[origin == "JFK" & carrier == "AA"]

Faster Data Manipulation with Indexing

data.table uses binary search algorithm that makes data manipulation faster.

Binary Search Algorithm

Binary search is an efficient algorithm for finding a value in a sorted list of values. It
repeatedly halves the portion of the list that could contain the value until the value is found.
Suppose you have the following values in a variable :

5, 10, 7, 20, 3, 13, 26

You are searching for the value 20 in the above list. The binary search algorithm works as follows -

1. First, we sort the values : 3, 5, 7, 10, 13, 20, 26
2. We find the middle value, i.e. 10.
3. We check whether 20 = 10. No; 20 > 10.
4. Since 20 is greater than 10, it must be somewhere after 10. So we can ignore all
the values that are lower than or equal to 10.
5. We are left with 13, 20, 26. The middle value is 20.
6. We again check whether 20 = 20. Yes, the match is found.

Without this algorithm, we would have to search for 20 by scanning the whole list of seven values.
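
The steps above can be sketched as a small base R function. This is an illustration of the algorithm only, not data.table's internal implementation, and the function name binary_search is hypothetical.

```r
# Binary search on a sorted numeric vector.
# Returns the position of `target`, or NA if it is absent.
binary_search <- function(sorted_values, target) {
  low <- 1L
  high <- length(sorted_values)
  while (low <= high) {
    mid <- (low + high) %/% 2L
    if (sorted_values[mid] == target) {
      return(mid)
    } else if (sorted_values[mid] < target) {
      low <- mid + 1L    # discard the lower half, including mid
    } else {
      high <- mid - 1L   # discard the upper half, including mid
    }
  }
  NA_integer_
}

values <- sort(c(5, 10, 7, 20, 3, 13, 26))  # 3 5 7 10 13 20 26
pos <- binary_search(values, 20)
print(pos)
```

For the value 20 the loop inspects only two elements (10, then 20) before returning position 6 of the sorted vector.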

It is important to set a key in your dataset; this tells the system that the data is sorted by the key column.
For example, suppose you have each employee's name, address, salary, designation, department and employee ID.
We can use 'employee ID' as a key to search for a particular employee.

Set Key

In this case, we are setting 'origin' as a key in the dataset mydata.

# Indexing (Set Keys)


setkey(mydata, origin)
Note : It makes the data table sorted by the column 'origin'.

How to filter when a key is set

You do not need to name the key column when you filter on it.
data12 = mydata[c("JFK", "LGA")]

Performance Comparison

You can compare performance of the filtering process (With or Without KEY).

system.time(mydata[origin %in% c("JFK", "LGA")])


system.time(mydata[c("JFK", "LGA")])

Compare the elapsed times reported by the two calls. In the original comparison, filtering with
the key set was about twice as fast as filtering without it.

Indexing Multiple Columns

We can also set keys to multiple columns like we did below to columns 'origin' and 'dest'. See the
example below.

setkey(mydata, origin, dest)


Filtering while setting keys on Multiple Columns

# First key column 'origin' matches “JFK” and second key column 'dest' matches “MIA”
mydata[.("JFK", "MIA")]
It is equivalent to the following code :

mydata[origin == "JFK" & dest == "MIA"]


To identify the column(s) that are set as keys :

key(mydata)
Result : It returns origin and dest, as these are the columns set as keys.

Sorting Data

We can sort data using the setorder() function. By default, it sorts in ascending order. It reorders the table by reference, so mydata itself is sorted.

mydata01 = setorder(mydata, origin)


Sorting Data in descending order

In this case, we sort the data by the 'origin' variable in descending order.

mydata02 = setorder(mydata, -origin)

Sorting Data based on multiple variables

In this example, we tell R to reorder the data first by 'origin' in ascending order and then by
'carrier' in descending order.

mydata03 = setorder(mydata, origin, -carrier)

Adding Columns (Calculation on rows)

You can add or update a column by reference with the := operator. In this example, we subtract
the 'dep_delay' variable from the 'dep_time' variable to compute the scheduled departure time.

mydata[, dep_sch:=dep_time - dep_delay]

Adding Multiple Columns

mydata002 = mydata[, c("dep_sch","arr_sch"):=list(dep_time - dep_delay, arr_time - arr_delay)]

IF THEN ELSE

The 'IF THEN ELSE' conditions are very popular for recoding values. In data.table package, it
can be done with the following methods :

Method I : mydata[, flag:= 1*(min < 50)]


Method II : mydata[, flag:= ifelse(min < 50, 1,0)]

It means to set flag= 1 if min is less than 50. Otherwise, set flag =0.
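
As a further option, recent versions of data.table provide fifelse(), a faster and type-strict counterpart of base ifelse(). The sketch below applies it to a toy table; the values are illustrative.

```r
library(data.table)

# Illustrative minutes, not taken from the flights data
DT <- data.table(min = c(10L, 50L, 55L, 0L))

# Same recoding as Method II, using data.table's fifelse()
DT[, flag := fifelse(min < 50, 1, 0)]
print(DT)
```

Unlike ifelse(), fifelse() requires the yes and no values to have the same type, which catches accidental type mixing early.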

How to write Sub Queries (like SQL)

We can use the format DT[ ] [ ] [ ] to build a chain in data.table. It works like sub-queries
in SQL : each bracket operates on the result of the previous one.

mydata[, dep_sch:=dep_time - dep_delay][,.(dep_time,dep_delay,dep_sch)]


First, we are computing scheduled departure time and then selecting only relevant columns.

Summarize or Aggregate Columns

Like SAS PROC MEANS procedure, we can generate summary statistics of specific variables.
In this case, we are calculating mean, median, minimum and maximum value of variable
arr_delay.

mydata[, .(mean = mean(arr_delay, na.rm = TRUE),
           median = median(arr_delay, na.rm = TRUE),
           min = min(arr_delay, na.rm = TRUE),
           max = max(arr_delay, na.rm = TRUE))]

Summarize with data.table package


Summarize Multiple Columns

To summarize multiple variables, we can simply list all the summary statistics functions inside
.( ). See the command below -

mydata[, .(mean(arr_delay), mean(dep_delay))]


If you need to calculate summary statistics for a larger list of variables, you can use .SD and
.SDcols operators. The .SD operator implies 'Subset of Data'.

mydata[, lapply(.SD, mean), .SDcols = c("arr_delay", "dep_delay")]


In this case, we are calculating mean of two variables - arr_delay and dep_delay.

Summarize all numeric Columns

By default, .SD contains all columns except the grouping variables. Non-numeric columns such as 'carrier' therefore return NA with a warning when mean() is applied to them.

mydata[, lapply(.SD, mean)]

Summarize with multiple statistics

mydata[, sapply(.SD, function(x) c(mean=mean(x), median=median(x)))]
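
To restrict the summary to numeric columns, one option is to collect their names first and pass them to .SDcols. The sketch below uses a toy table; the helper name num_cols is illustrative.

```r
library(data.table)

# Toy table with one character column and two numeric columns
DT <- data.table(carrier   = c("AA", "AA", "DL"),
                 arr_delay = c(10, 20, 30),
                 dep_delay = c(1, 2, 6))

# Names of the numeric columns only
num_cols <- names(DT)[sapply(DT, is.numeric)]

# Mean of every numeric column; 'carrier' is excluded via .SDcols
res <- DT[, lapply(.SD, mean, na.rm = TRUE), .SDcols = num_cols]
print(res)
```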


GROUP BY (Within Group Calculation)

Summarize by group 'origin'

mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = origin]

Summary by group

Use key column in a by operation

Instead of 'by', you can use the keyby= argument. It also sorts the result by the grouping column and sets that column as the key of the result.

mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), keyby = origin]

Summarize multiple variables by group 'origin'

mydata[, .(mean(arr_delay, na.rm = TRUE), mean(dep_delay, na.rm = TRUE)), by = origin]


Or it can be written like below -
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay", "dep_delay"), by = origin]

Remove Duplicates

You can remove duplicate cases with the unique() function. Suppose you want to
eliminate duplicates based on one variable, say carrier. In current versions of data.table,
unique() compares all columns by default, so name the column in the by argument.

unique(mydata, by = "carrier")

To remove duplicates based on all the variables, call unique() without by -
unique(mydata)
Note : Versions of data.table before 1.9.8 used the key columns by default, which is why older
tutorials call setkey() before unique().

Extract values within a group


The following command selects the first two rows within each level of the categorical variable carrier.

mydata[, .SD[1:2], by=carrier]

Select LAST value from a group

mydata[, .SD[.N], by=carrier]

SQL's RANK OVER PARTITION

In SQL, Window functions are very useful for solving complex data problems. RANK OVER
PARTITION is the most popular window function. It can be easily translated in data.table with
the help of frank() function. frank() is similar to base R's rank() function but much faster. See
the code below.

dt = mydata[, rank:=frank(-distance,ties.method = "min"), by=carrier]


In this case, we are calculating rank of variable 'distance' by 'carrier'. We are assigning rank 1 to
the highest value of 'distance' within unique values of 'carrier'.
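
A small worked example shows how ties are handled. The table below is made up for illustration, and the column name rnk is hypothetical.

```r
library(data.table)

DT <- data.table(carrier  = c("AA", "AA", "AA", "DL", "DL"),
                 distance = c(100, 300, 300, 50, 70))

# Rank 1 goes to the largest distance within each carrier;
# with ties.method = "min", tied values share the lowest rank
DT[, rnk := frank(-distance, ties.method = "min"), by = carrier]
print(DT)
```

For carrier AA the two flights of distance 300 both receive rank 1 and the next flight receives rank 3, which matches the behaviour of RANK() in SQL.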

Cumulative SUM by GROUP

We can calculate cumulative sum by using cumsum() function.

dat = mydata[, cum:=cumsum(distance), by=carrier]

Lag and Lead

The lag and lead of a variable can be calculated with shift() function. The syntax of shift()
function is as follows - shift(variable_name, number_of_lags, type=c("lag", "lead"))

DT <- data.table(A=1:5)
DT[ , X := shift(A, 1, type="lag")]
DT[ , Y := shift(A, 1, type="lead")]
Lag and Lead Function
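
Two common variations are sketched below on a toy table: computing the lag within groups, so that it restarts for each group, and replacing the leading NA with a fill value. The column names g, prev and prev0 are illustrative.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "a", "b", "b"), A = 1:5)

# Lag within each group: the first row of every group has no previous value
DT[, prev := shift(A, 1, type = "lag"), by = g]

# Lag over the whole table, filling the missing first value with 0
DT[, prev0 := shift(A, 1, fill = 0L, type = "lag")]
print(DT)
```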

Between and LIKE Operator

We can use %between% operator to define a range. It is inclusive of the values of both the ends.

DT = data.table(x=6:10)
DT[x %between% c(7,9)]
The %like% operator is mainly used to find all the values that match a pattern.

DT = data.table(Name=c("dep_time","dep_delay","arrival"), ID=c(2,3,4))
DT[Name %like% "dep"]

Merging / Joins

Merging in data.table is very similar to the base R merge() function. The main difference is
that, by default, data.table merges on the key columns the two tables share, whereas
data.frame merges on the variable names the two datasets have in common.

Sample Data

(dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))


(dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
Inner Join

It returns all the matching observations in both the datasets.

merge(dt1, dt2, by="A")

Left Join

It returns all observations from the left dataset and the matched observations from the right
dataset.

merge(dt1, dt2, by="A", all.x = TRUE)

Right Join

It returns all observations from the right dataset and the matched observations from the left
dataset.

merge(dt1, dt2, by="A", all.y = TRUE)


Full Join

It returns all rows from both datasets, matched where possible and filled with NA elsewhere.

merge(dt1, dt2, all=TRUE)
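
Joins can also be written inside the square brackets with the on and nomatch arguments listed earlier. The sketch below uses two toy tables (left and right are illustrative names) with unique ids so the result is easy to follow; it assumes a data.table version that accepts nomatch = NULL.

```r
library(data.table)

left  <- data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c"))
right <- data.table(id = c(2L, 3L, 4L), y = c(20, 30, 40))

# Inner join: only ids present in both tables
inner <- left[right, on = "id", nomatch = NULL]

# For every row of 'right', look up the matching row of 'left' (NA if none)
right_join <- left[right, on = "id"]

print(inner)
print(right_join)
```

In left[right, on = "id"], the table inside the brackets drives the result, so every row of right is kept.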

Convert a data.table to data.frame

You can use setDF() function to accomplish this task.

setDF(mydata)
Similarly, you can use setDT() function to convert data frame to data table.

set.seed(123)
X = data.frame(A=sample(3, 10, TRUE),
               B=sample(letters[1:3], 10, TRUE))
setDT(X, key = "A")

Other Useful Functions

Reshape Data

data.table includes several useful functions that make data cleaning easy and smooth. To reshape or
transpose data, you can use the melt() and dcast() functions, whose data.table methods are
melt.data.table() and dcast.data.table(). These functions originate from the reshape2 package;
data.table reimplements them more efficiently and adds some new features.
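
A minimal round trip between wide and long form might look like this. The table and the column names month and delay are made up for illustration.

```r
library(data.table)

wide <- data.table(id = 1:2, jan = c(10, 20), feb = c(30, 40))

# Wide -> long: one row per (id, month) pair
long <- melt(wide, id.vars = "id", measure.vars = c("jan", "feb"),
             variable.name = "month", value.name = "delay")

# Long -> wide again: one column per month
wide_again <- dcast(long, id ~ month, value.var = "delay")

print(long)
print(wide_again)
```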

Rolling Joins

data.table supports rolling joins, which are commonly used for analyzing time series data. Very
few R packages support this kind of join.
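
A rolling join matches each row to the most recent earlier observation when there is no exact match. The sketch below uses two made-up tables, quotes and trades, to show the idea.

```r
library(data.table)

# Illustrative data: prices quoted at times 1, 5 and 10
quotes <- data.table(time = c(1, 5, 10), price = c(100, 101, 102))
trades <- data.table(time = c(2, 6, 12))

# roll = TRUE carries the last quote forward to each trade time
res <- quotes[trades, on = "time", roll = TRUE]
print(res)
```

The trade at time 2 picks up the quote from time 1, the trade at time 6 the quote from time 5, and the trade at time 12 the last quote from time 10.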

Examples for Practise

Q1. Calculate total number of rows by month and then sort on descending order

mydata[, .N, by = month] [order(-N)]


The .N operator is used to find count.
Q2. Find top 3 months with high mean arrival delay

mydata[, .(mean_arr_delay = mean(arr_delay, na.rm = TRUE)), by = month][order(-mean_arr_delay)][1:3]

Q3. Find the origins of flights whose average total delay (arrival plus departure) is greater than 20 minutes

mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("arr_delay", "dep_delay"), by = origin][(arr_delay + dep_delay) > 20]

Q4. Extract average of arrival and departure delays for carrier == 'DL' by 'origin' and 'dest'
variables

mydata[carrier == "DL",
lapply(.SD, mean, na.rm = TRUE),
by = .(origin, dest),
.SDcols = c("arr_delay", "dep_delay")]

Q5. Pull first value of 'air_time' by 'origin' and then sum the returned values when it is greater
than 300

mydata[, .SD[1], .SDcols="air_time", by=origin][air_time > 300, sum(air_time)]


Experiment No. 9
Aim: Study and implementation of Data Visualization with ggplot2
Theory
For the purpose of data visualization, R offers various methods through its inbuilt graphics and
through powerful packages such as ggplot2. The former helps in creating simple graphs, while the
latter assists in creating customized professional graphs. In this experiment we will learn how
various graphs can be made and altered using the ggplot2 package.

What is ggplot2?
ggplot2 is a robust and versatile R package, developed by the well known R developer
Hadley Wickham, for generating aesthetic plots and charts.

The 'gg' in ggplot2 stands for "Grammar of Graphics", which rests on the principle that a plot
can be split into the following basic parts -
Plot = data + Aesthetics + Geometry
1. data refers to a data frame (dataset).
2. Aesthetics indicates x and y variables. It is also used to tell R how data are
displayed in a plot, e.g. color, size and shape of points etc.
3. Geometry refers to the type of graphics (bar chart, histogram, box plot, line plot,
density plot, dot plot etc.)
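
The three parts map directly onto code. The sketch below, which assumes ggplot2 is installed, builds a scatter plot of the built-in iris data; the object name p is illustrative.

```r
library(ggplot2)

p <- ggplot(data = iris,                                 # data
            aes(x = Sepal.Length, y = Sepal.Width)) +    # aesthetics
  geom_point()                                           # geometry
print(p)
```

Swapping geom_point() for another geometry changes the chart type while the data and aesthetics stay the same.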

ggplot2 Standard Syntax

Apart from the above three parts, there are other important parts of a plot -
1. Faceting means the same type of graph is drawn for each subset of the data.
For example, for the variable gender, it creates 2 graphs, one for male and one for female.
2. Annotation lets you add text to the plot.
3. Summary Statistics lets you add descriptive statistics to a plot.
4. Scales are used to control the x and y axis limits.
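
A short sketch of faceting, annotation and scales on the iris data follows. The label text and its coordinates are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_wrap(~ Species) +                              # faceting: one panel per species
  annotate("text", x = 7, y = 4.3, label = "iris") +   # annotation: text on the plot
  xlim(4, 8)                                           # scale: x axis limits
print(p)
```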
Why ggplot2 is better?

• Excellent themes can be applied with a single command.
• Its default colors are more pleasing than those of base graphics.
• It is easy to visualize data with multiple variables.
• It provides a platform to create simple graphs that convey a plethora of information.

The table below shows common charts along with various important functions used in these
charts.
Important Plots   Important Functions

Scatter Plot      geom_point(), geom_smooth(), stat_smooth()
Bar Chart         geom_bar(), geom_errorbar()
Histogram         geom_histogram(), stat_bin(), position_identity(), position_stack(), position_dodge()
Box Plot          geom_boxplot(), stat_boxplot(), stat_summary()
Line Plot         geom_line(), geom_step(), geom_path(), geom_errorbar()
Pie Chart         coord_polar()
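
ggplot2 has no dedicated pie geometry, so a pie chart is usually built as a single stacked bar drawn in polar coordinates. A minimal sketch with the mpg data (the object name pie is illustrative):

```r
library(ggplot2)

# A pie chart is a stacked bar chart drawn in polar coordinates
pie <- ggplot(mpg, aes(x = "", fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")
print(pie)
```

Each slice's angle is proportional to the number of cars in that class.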

Datasets

In this experiment, we will use three datasets - 'iris', 'mpg' and 'mtcars'. The 'iris' and 'mtcars'
datasets are available in R, and 'mpg' ships with the ggplot2 package.

1. The 'iris' data comprises 150 observations of 5 variables. There are 3 species of flowers
(setosa, versicolor and virginica), and for each flower the sepal length and width and the petal
length and width are provided.

2. The 'mtcars' data consists of fuel consumption (mpg) and 10 aspects of automobile design
and performance for 32 automobiles. In other words, we have 32 observations and 11 different
variables:

1. mpg Miles/(US) gallon
2. cyl Number of cylinders
3. disp Displacement (cu.in.)
4. hp Gross horsepower
5. drat Rear axle ratio
6. wt Weight (1000 lbs)
7. qsec 1/4 mile time
8. vs V/S
9. am Transmission (0 = automatic, 1 = manual)
10. gear Number of forward gears
11. carb Number of carburetors

3. The 'mpg' data consists of 234 observations and 11 variables.

Install and Load Package

First we need to install the package in R using the command install.packages( ). Once installation
is completed, we load the package with library( ) so that we can use the functions available in
the ggplot2 package.

#installing and loading the package
install.packages("ggplot2")
library(ggplot2)

Histogram, Density plots and Box plots are used for visualizing a continuous variable.

Creating Histogram:
Firstly we consider the iris data to create histogram and scatter plot.

# Considering the iris data.


# Creating a histogram
ggplot(data = iris, aes( x = Sepal.Length)) + geom_histogram( )
Here we call the ggplot( ) function; its first argument is the dataset to be used.

1. In aes( ), i.e. the aesthetics, we define which variable is represented on the x axis;
here we use 'Sepal.Length'.
2. geom_histogram( ) denotes that we want to plot a histogram.
Histogram in R

To change the width of bin in the histograms we can use binwidth in geom_histogram( )
ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram(binwidth=1)

One can also specify the number of bins wanted; the binwidth is then adjusted
automatically.

ggplot(data = iris , aes(x=Sepal.Length)) + geom_histogram(color="black", fill="white", bins = 10)

Using color = "black" and fill = "white" we are denoting the boundary colors and the inside
color of the bins respectively.

How to visualize various groups in histogram


ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_histogram(fill="white", binwidth = 1)
Histogram depicting various species

Creating Density Plot


Density plot is also used to present the distribution of a continuous variable.
ggplot(iris, aes( x = Sepal.Length)) + geom_density( )
geom_density( ) function is for displaying density plot.
Density Plot

How to show various groups in density plot


ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_density( )
Density Plot by group

Creating Bar and Column Charts :


Bar and column charts are probably the most common chart types. They are best used to compare
different values.

Now mpg data will be used for creating the following graphics.

ggplot(mpg, aes(x= class)) + geom_bar()


Here we are trying to create a bar plot for number of cars in each class using geom_bar( ).
Column Chart using ggplot2

Using coord_flip( ) one can inter-change x and y axis.


ggplot(mpg, aes(x= class)) + geom_bar() + coord_flip()
Bar Chart

How to add or modify Main Title and Axis Labels


The following functions can be used to add or alter main title and axis labels.
1. ggtitle("Main title"): Adds a main title above the plot
2. xlab("X axis label"): Changes the X axis label
3. ylab("Y axis label"): Changes the Y axis label
4. labs(title = "Main title", x = "X axis label", y = "Y axis label"): Changes main
title and axis labels
p = ggplot(mpg, aes(x= class)) + geom_bar()
p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")

Title and Axis Labels

How to add data labels


p = ggplot(mpg, aes(x= class)) + geom_bar()
p = p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")
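p + geom_text(stat='count', aes(label=after_stat(count)), vjust=-0.25)
geom_text() is used to add text directly to the plot. after_stat(count) refers to the counts computed
by the stat and replaces the older ..count.. notation. vjust adjusts the vertical position of the data
labels relative to the bars.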
Add Data Labels in Bar

How to reorder Bars


Using stat="identity" we can plot our own derived values instead of the counts computed by geom_bar( ).
library(dplyr)
count(mpg,class) %>% arrange(-n) %>%
  mutate(class = factor(class,levels= class)) %>%
  ggplot(aes(x=class, y=n)) + geom_bar(stat="identity")
The above command first creates a frequency distribution of the type of car and arranges it in
descending order using arrange(-n). Then mutate( ) converts the 'class' column to a factor whose
levels follow that sorted order, and geom_bar( ) draws the bars in the order of the levels.
Change order of bars

Here, bar of SUV appears first as it has maximum number of cars. Now bars are ordered based
on frequency count.

Showing Mean of Continuous Variable by Categorical Variable


df = mpg %>% group_by(class) %>% summarise(mean = mean(displ)) %>%
  arrange(-mean) %>% mutate(class = factor(class,levels= class))

p = ggplot(df, aes(x=class, y=mean)) + geom_bar(stat="identity")
p + geom_text(aes(label = sprintf("%0.2f", round(mean, digits = 2))),
              vjust=1.6, color="white", fontface = "bold", size=4)

Using the dplyr library we create a new data frame 'df' and plot it.
group_by groups the data by type of car, and summarise computes a statistic (here the mean of
the 'displ' variable) for each group. To add data labels (with 2 decimal places) we use
geom_text( ).
Customized BarPlot

Creating Stacked Bar Chart

p <- ggplot(data=mpg, aes(x=class, y=displ, fill=drv))
p + geom_bar(stat = "identity")
Stacked BarPlot

p + geom_bar(stat="identity", position=position_dodge())

Stacked - Position_dodge
Creating BoxPlot

Using geom_boxplot( ) one can create a boxplot.

To create different boxplots for 'disp' for different levels of x we can define aes(x = cyl, y = disp)

mtcars$cyl = factor(mtcars$cyl)
ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot()

We can see one outlier for 6 cylinders.

To create a notched boxplot we write notch = TRUE

ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot(notch = TRUE)


Notched Boxplot

Scatter Plot
A scatterplot is used to graphically represent the relationship between two continuous variables.
# Creating a scatter plot denoting various species.
ggplot(data = iris, aes( x = Sepal.Length, y = Sepal.Width,shape = Species, color = Species)) +
geom_point()
We plot the points using geom_point( ). In the aesthetics we define that the x axis denotes sepal
length and the y axis denotes sepal width; shape = Species and color = Species specify that a
different shape and a different color should be used for each species of flower.
Scatter Plot

Scatter plots are constructed using geom_point( )

# Creating scatter plot for automatic cars denoting different cylinders.


ggplot(data = subset(mtcars,am == 0),aes(x = mpg,y = disp,colour = factor(cyl))) +
geom_point()
Scatter plot denotingvarious levels of cyl

We use the subset( ) function to select only those cars which have am = 0; in other words, we
consider only the cars with automatic transmission. We plot displacement against mileage and
use a different color for each number of cylinders. factor(cyl) treats the numeric variable
cyl as a factor, so each value gets its own discrete color.

# Seeing the patterns with the help of geom_smooth.


ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp)) + geom_point() + geom_smooth()
In the above command we try to plot mileage (mpg) and displacement (disp) and variation in
colors denote the varying horsepower(hp) . geom_smooth( ) is used to determine what kind of
pattern is exhibited by the points.
In a similar way we can use geom_line( ) to plot another line on the graph:

# Plotting the horsepower using geom_line


ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp)) + geom_point(size = 2.5) +
geom_line(aes(y = hp))

Here in geom_point we have added an optional argument size = 2.5 denoting the size of the
points. geom_line( ) creates a line. Note that we have not provided any aesthetics for x axis in
geom_line, it means that it will plot the horsepower(hp) corresponding to mileage(mpg) only.
Modifying the axis labels and appending the title and subtitle
#Adding title or changing the labels
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + labs(title = "Scatter plot")
#Alternatively
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot")
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot",
subtitle = "mtcars data in R")

Adding title and subtitle to plots

Here labs( ) sets the plot title (it can also change the axis and legend titles), while ggtitle( )
assigns a title to the graph directly. To add both a title and a subtitle we can use ggtitle( ),
where the first argument (label) is the main title and the second argument is the subtitle.
a <- ggplot(mtcars,aes(x = mpg, y = disp, color = factor(cyl))) + geom_point()
a
#Changing the axis labels.
a + labs(color = "Cylinders")
a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement")
We first save our plot to 'a' and then make the alterations.
Note that in the labs command we use color = "Cylinders", which changes the title of our
legend.
Using the xlab and ylab commands we can change the x and y axis labels respectively. Here our
x axis label is 'Mileage' and our y axis label is 'Displacement'.
#Combining it all
a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement") + ggtitle(label =
"Scatter plot", subtitle = "mtcars data in R")

In the above plot we can see that the labels on x axis,y axis and legend have changed; the title
and subtitle have been added and the points are colored, distinguishing the number of cylinders.

Playing with themes


Themes can be used in ggplot2 to change the backgrounds, text colors, legend colors and axis
texts.
First we save our plot to 'b' and then create the visualizations by modifying 'b'. Note that in the
aesthetics we have written mpg, disp without names, which maps mpg to the x axis and disp to the y axis by position.
#Changing the themes.
b <- ggplot(mtcars,aes(mpg,disp)) + geom_point() + labs(title = "Scatter Plot")
#Changing the size and color of the Title and the background color.
b + theme(plot.title = element_text(color = "blue",size = 17),plot.background =
element_rect("orange"))
Plot background color changed.

We use theme( ) to modify the plot title and background. plot.title takes an element_text( )
object, in which we have specified the color and size of our title. plot.background takes an
element_rect( ) object, whose first argument (fill) sets the color of our background.
ggplot2 also offers complete built-in themes that change the panel background and grid design
in one step. Some of them are theme_gray, theme_minimal, theme_dark etc.
b + theme_minimal( )
We can observe horizontal and vertical lines behind the points. What if we don't need them? This
can be achieved via:
#Removing the lines from the background.
b + theme(panel.background = element_blank())
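Setting panel.background = element_blank( ) removes the grey panel background; the default grid
lines are white, so they also disappear from view. To remove the grid lines themselves, set
panel.grid = element_blank( ).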
#Removing the text from x and y axis.
b + theme(axis.text = element_blank())
b + theme(axis.text.x = element_blank())
b + theme(axis.text.y = element_blank())
To remove the text from both the axis we can use axis.text = element_blank( ). If we want to
remove the text only from particular axis then we need to specify it.
Now we save our plot to c and then make the changes.
#Changing the legend position
c <- ggplot(mtcars,aes(x = mpg, y = disp, color = hp)) +labs(title = "Scatter Plot") +
geom_point()
c + theme(legend.position = "top")
If we want to move the legend then we can specify legend.position as "top" or "bottom" or "left"
or "right".
Finally, combining everything we have learnt about themes, we create a plot in which the legend is
placed at the bottom, the plot title is in forest green, the background is yellow and no text is
displayed on either axis.

#Combining everything.
c + theme(legend.position = "bottom", axis.text = element_blank()) +
theme(plot.title = element_text(color = "Forest Green",size = 17),plot.background =
element_rect("Yellow"))
Scatter Plot

Changing the color scales in the legend


In ggplot2, by default the color scale runs from dark blue to light blue. We may wish to
customize the scale by changing the colors or adding new colors. This can be done
via the scale_color_gradient function.

c + scale_color_gradient(low = "yellow",high = "red")


Suppose we want the colors to vary from yellow to red, with yellow denoting the lowest value and
red denoting the highest value; we set low = "yellow" and high = "red". Note that the legend
labels fall on round break values chosen by ggplot2 rather than on the minimum and maximum of the series.
What if we want 3 colors?

c + scale_color_gradient2(low = "red",mid = "green",high = "blue")


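To have 3 colors in the legend we use scale_color_gradient2 with low = "red", mid = "green" and
high = "blue". This builds a diverging scale: values below the midpoint are shaded from red
towards green, and values above it from green towards blue. The midpoint defaults to 0, so for a
variable such as hp, whose values are all positive, only the green-to-blue half is used unless the
midpoint argument is set to a value inside the data range (for example, midpoint =
median(mtcars$hp)).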
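A minimal sketch of a three-color scale centred on the data rather than on 0. It assumes ggplot2 is installed; the names base_plot, hp_mid and p3, and the choice of the median of hp as the midpoint, are illustrative assumptions and not part of the original example.

```r
library(ggplot2)

# Same scatter plot as 'c', rebuilt here so the block runs on its own
base_plot <- ggplot(mtcars, aes(x = mpg, y = disp, color = hp)) +
  labs(title = "Scatter Plot") +
  geom_point()

# midpoint defaults to 0; centring it on the median of hp lets
# low, mid and high all appear within the observed range
hp_mid <- median(mtcars$hp)

p3 <- base_plot +
  scale_color_gradient2(low = "red", mid = "green", high = "blue",
                        midpoint = hp_mid)
print(p3)
```

Any value inside the range of hp could serve as the midpoint; the median simply puts half of the cars on each side of the green tone.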

c + theme(legend.position = "bottom") +
scale_color_gradientn(colours = c("red","forestgreen","white","blue"))
If we want more than 3 colors to be represented in our legend, we can use the
scale_color_gradientn( ) function. Its colours argument is a vector of colors that are spread
evenly across the range of the variable: the 1st element is the color at the lowest value, the last
element is the color at the highest value, and the remaining colors fall in between in the order given.
Changing the breaks in the legend.
By default ggplot2 chooses the legend breaks for a continuous variable automatically, at round
values within the range of the data.
Suppose we want the breaks to be: 50,125,200,275 and 350, we use seq(50,350,75) where 50
denotes the least number, 350 is the maximum number in the sequence and 75 is the difference
between 2 consecutive numbers.
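The sequence described above can be checked in the console, together with the labels that paste( ) builds from it, before either is passed to a scale. This is a small base R sketch; the names legend_breaks and legend_labels are only illustrative.

```r
# Break positions: from 50 to 350 in steps of 75
legend_breaks <- seq(50, 350, 75)
print(legend_breaks)   # 50 125 200 275 350

# paste() joins each break with "hp", separated by a single space
legend_labels <- paste(legend_breaks, "hp")
print(legend_labels)   # "50 hp" "125 hp" "200 hp" "275 hp" "350 hp"
```

The breaks and labels vectors must have the same length, because each label is attached to the break in the same position.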
#Changing the breaks in the legend
c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), labels =
paste(seq(50,350,75),"hp"))
In scale_color_continuous we set the breaks to our desired sequence and can change the labels
if we want. Using the paste function, each value in the sequence is followed by the word "hp",
and name = "horsepower" changes the title of the legend.

Changing the break points and color scale of the legend together.
Let us try changing the break points and the colors in the legend together by trial and error.

#Trial 1 : This one is wrong


c + scale_color_continuous( breaks = seq(50,350,75)) +
scale_color_gradient(low = "blue",high = "red")
We can refer to trial1 image for the above code which can be found below. Notice that the color
scale is blue to red as desired but the breaks have not changed.
#Trial 2: Next one is wrong.
c + scale_color_gradient(low = "blue",high = "red") +
scale_color_continuous( breaks = seq(50,350,75))
trial2 image is the output for the above code. Here the color scale has not changed but the breaks
have been created.
trial1

trial2

What is happening? The reason is that a plot can have only one scale for the color aesthetic.
If several scale_color_ functions are added, ggplot2 keeps only the last one it receives and
replaces the earlier ones.
In trial 1, scale_color_gradient replaces the previous scale_color_continuous command.
Similarly, in trial 2, scale_color_continuous replaces the previous scale_color_gradient
command.

The correct way is to define all the arguments in a single function.

c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), low = "red", high = "black") +
theme(panel.background = element_rect("green"), plot.background = element_rect("orange"))
Here low = "red" and high = "black" are defined in scale_color_continuous function along with
the breaks.
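The same idea can be sketched with scale_color_gradient, which also accepts name and breaks, so the colors and the break points again live in one scale. The plot is rebuilt under the illustrative name base_plot so that the block runs on its own; p_single is likewise just a name chosen for this sketch.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = mpg, y = disp, color = hp)) +
  geom_point()

# A single color scale carries the legend title, the breaks and the colors
p_single <- base_plot +
  scale_color_gradient(name = "horsepower",
                       breaks = seq(50, 350, 75),
                       low = "blue", high = "red")
print(p_single)
```

Because only one color scale is added, nothing is replaced and both the blue-to-red gradient and the custom breaks show up in the legend.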

Changing the axis cut points

We save our initial plot to 'd'.


d <- ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point(aes(color = factor(am))) +
xlab("Mileage") + ylab("Displacement") +
theme(panel.background = element_rect("black") , plot.background = element_rect("pink"))
To change the axis cut points we use scale_(axisname)_continuous.

d + scale_x_continuous(limits = c(10,35)) + scale_y_continuous(limits = c(50,500))

To set the x axis limits to 10 and 35, we use scale_x_continuous, where 'limits' is a vector
giving the lower and upper limits of the axis. Likewise, scale_y_continuous sets the lowest cut
off point of the y axis to 50 and the highest to 500. These limits cover the full range of mpg
(10.4 to 33.9) and disp (71.1 to 472) in mtcars; observations that fall outside the limits are
dropped from the plot with a warning.

d + scale_x_continuous(limits = c(10,35),breaks = seq(10,35,5)) +
scale_y_continuous(limits = c(50,500),breaks = seq(50,500,50))
We can also add another parameter 'breaks', which takes a vector specifying all the cut off
points of the axis. Here we create the sequence 10,15,20,...,35 for the x axis and
50,100,150,...,500 for the y axis.
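Limits set through a scale remove the observations outside them, which matters when a statistic or smoother is drawn from the remaining points. As a hedged sketch, the block below contrasts this with coord_cartesian( ), which only zooms the view. The range 15 to 25 and the names p, p_limits, p_zoom and n_outside are assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg, y = disp)) + geom_point()

# Scale limits: rows with mpg outside [15, 25] are removed before drawing
p_limits <- p + scale_x_continuous(limits = c(15, 25))

# Coordinate limits: every row is kept, the view is only zoomed
p_zoom <- p + coord_cartesian(xlim = c(15, 25))

# Number of observations that fall outside the chosen x range
n_outside <- sum(mtcars$mpg < 15 | mtcars$mpg > 25)
print(n_outside)
```

Printing p_limits produces a warning about removed rows, while p_zoom does not, since no data is discarded.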

Faceting.
Faceting is a technique which is used to plot the graphs for the data corresponding to various
categories of a particular variable. Let us try to understand it via an illustration:

The facet_wrap function is used for faceting; after the tilde (~) sign we define the variables
by which we want to split the data.
Faceting for carb

We see that there are 6 categories of "carb". Faceting creates 6 plots between mpg and disp;
where the points correspond to the categories.
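A minimal sketch of the call that would produce such a plot, together with a count of the cars in each 'carb' category. It follows the same pattern as the facet_wrap examples in this section; the names carb_counts and p_facet are illustrative.

```r
library(ggplot2)

# Number of observations in each 'carb' category
carb_counts <- table(mtcars$carb)
print(carb_counts)

# One panel per category of 'carb'
p_facet <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  facet_wrap(~carb)
print(p_facet)
```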
We can mention the number of rows we need for faceting.
# Control the number of rows and columns with nrow and ncol
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb,nrow = 3)
Here the additional parameter nrow = 3 specifies that the panels should be arranged in 3
rows.

Faceting using multiple variables.


Faceting can be done for various combinations of carb and am.
# You can facet by multiple variables
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb + am)
#Alternatively
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(c("carb","am"))
There are 6 unique 'carb' values and 2 unique 'am' values, so there could be 12 possible
combinations, but we get only 9 graphs because the remaining 3 combinations have no
observations.
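The panel count can be verified in base R before plotting. This sketch counts the combinations of 'carb' and 'am' that actually occur in mtcars; the variable names are illustrative.

```r
# Distinct (carb, am) pairs that actually occur in mtcars
observed <- unique(mtcars[, c("carb", "am")])
n_observed <- nrow(observed)

# All pairs that could occur in principle
n_possible <- length(unique(mtcars$carb)) * length(unique(mtcars$am))

print(c(possible = n_possible, observed = n_observed))

# Cross-tabulation: cells equal to 0 are the combinations without a panel
print(table(carb = mtcars$carb, am = mtcars$am))
```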
It can be hard to tell which level of carb and am each panel represents when the labels are not
provided. Accordingly, we can label the variables.
# Use the `labeller` option to control how labels are printed:
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb + am, labeller =
"label_both")

facet_wrap in multiple variables.

ggplot2 provides the facet_grid( ) function, which can be used to facet in two dimensions.


z <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
We store our basic plot in 'z' and thus we can make the additions:

z + facet_grid(. ~ cyl) #col


z + facet_grid(cyl ~ .) #row
z + facet_grid(gear ~ cyl,labeller = "label_both") #row and col
using facet_grid( )

In facet_grid(.~cyl), it facets the data by 'cyl' and the cylinders are represented in columns. If we
want to represent 'cyl' in rows, we write facet_grid(cyl~.). If we want to facet according to 2
variables we write facet_grid(gear~cyl) where gears are represented in rows and 'cyl' are
illustrated in columns.

Adding text to the points.


Using ggplot2 we can display a value or label next to each point. This can be
accomplished by using geom_text( )
#Adding texts to the points
ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
geom_text(aes(label = am))
In geom_text we provide aes(label = am), which specifies that the corresponding level of "am"
should be shown for every point.
In the graph it can be seen that the labels of 'am' overlap with the points. When there are many
points, the labels may become difficult to read. To avoid this we use the geom_text_repel
function from the 'ggrepel' library.
require(ggrepel)
ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
geom_text_repel(aes(label = am))
We load the library ggrepel using require( ) function. If we don't want the text to overlap we
use geom_text_repel( ) instead of geom_text( ) of ggplot2 , keeping the argument aes(label =
am).
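When ggrepel is not available, a rougher alternative can be sketched with ggplot2 alone: shift the labels off the points with vjust and let geom_text( ) skip labels that would collide by setting check_overlap = TRUE. Labelling by car model and the names mt_named and p_text are assumptions for this illustration; unlike geom_text_repel( ), this approach hides some labels instead of moving them.

```r
library(ggplot2)

# Row names of mtcars hold the car models; copy them into a column
mt_named <- mtcars
mt_named$model <- rownames(mtcars)

p_text <- ggplot(mt_named, aes(x = mpg, y = disp)) +
  geom_point() +
  # vjust moves each label above its point; check_overlap = TRUE skips
  # labels that would be drawn on top of an earlier label
  geom_text(aes(label = model), vjust = -0.6, size = 3, check_overlap = TRUE)
print(p_text)
```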

geom_text_repel

Conclusion:
Experiment No: 10
Aim: Study and implementation of data transpose operations in R

Theory:
In R, we can transpose our data very easily. There are packages such as tidyr and
reshape2 that make this easy. Both packages were written by Hadley Wickham, one of the
best-known R developers.

Sample Data

The code below would create a sample data that would be used for demonstration.

data <- read.table(text="X Y Z


ID12 2012-06 566
ID1 2012-06 10239
ID6 2012-06 524
ID12 2012-07 2360
ID1 2012-07 13853
ID6 2012-07 2352
ID12 2012-08 3950
ID1 2012-08 14738
ID6 2012-08 4104",header=TRUE)

Convert Long to Wide Format

Suppose you have data containing three variables X, Y and Z. The variable 'X' contains
IDs, the variable 'Y' contains dates and the variable 'Z' contains income. The data is
structured in long format and you need to convert it to wide format so that the dates become
column names and the income values appear under these dates. The snapshot of data and
desired output is shown below -
R : Convert Long to Wide Format

In the reshape2 package, there are two functions for transforming long-format data to wide
format: "dcast" and "acast". The only difference between them is the type of output:
1. dcast function returns a data frame as output.
2. acast function returns a vector, matrix or array as output.

Install reshape2 package if not installed already


if (!require(reshape2)){
install.packages('reshape2')
library(reshape2)
}
R Code : Transform Long to Wide Format

mydt = dcast(data,X~Y,value.var = "Z")

How dcast function works


1. The first parameter of the dcast function is a data frame.
2. The left hand side of the casting formula gives the ID variables.
3. The right hand side gives the variable whose values become column names.
4. value.var is the name of the variable that stores the values.
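The four points above can be tried on the sample data from this experiment. The sketch below assumes reshape2 is installed and repeats the sample data so that it runs on its own; the names wide_df and wide_mat are illustrative. It also shows acast on the same formula to make the difference in output type visible.

```r
library(reshape2)

# Long-format sample data: one row per ID and month
data <- read.table(text = "X Y Z
ID12 2012-06 566
ID1 2012-06 10239
ID6 2012-06 524
ID12 2012-07 2360
ID1 2012-07 13853
ID6 2012-07 2352
ID12 2012-08 3950
ID1 2012-08 14738
ID6 2012-08 4104", header = TRUE)

# dcast: one row per X, one column per distinct value of Y, cells filled from Z
wide_df <- dcast(data, X ~ Y, value.var = "Z")
print(wide_df)

# acast: same reshaping, but the result is a matrix with X as row names
wide_mat <- acast(data, X ~ Y, value.var = "Z")
print(wide_mat)
```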

Example 2 : More than 1 ID Variable

Let's see another example wherein we have more than 1 ID variable. It contains information
about Income generated from 2 products - Product A and B reported semi-annually.

Example of Transforming Data


library(reshape2)
xx=dcast(data, Year + SemiYear ~ Product, value.var = "Income")
In the above code, "Year + SemiYear" are the 2 ID variables. We want "Product" variable to be
moved to columns.

The output is shown below -

Output

If you want the final output to be reported at year level

It seems to be VERY EASY (just remove the additional ID variable 'SemiYear'). But it's a little
tricky. See the explanation below -
dcast(data, Year ~ Product, value.var = "Income")
Warning : Aggregation function missing: defaulting to length

Year ProductA ProductB
1 2 2
2 2 2

The values in the above table are incorrect: they are counts of rows per cell, not incomes.

We need to define the statistic used to aggregate income at year level. Let's sum the income to
report the annual total.
dcast(data, Year ~ Product, fun.aggregate = sum, value.var = "Income")

Year ProductA ProductB
1 27446 23176
2 22324 24881
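The data set for this example is not reproduced in the text, so the sketch below builds a hypothetical one with the same columns (Year, SemiYear, Product, Income). The individual half-year figures are made up for illustration and chosen only so that the yearly totals agree with the table above; the name income_data is likewise an assumption.

```r
library(reshape2)

# Hypothetical half-year incomes; only the yearly totals come from the text
income_data <- read.table(text = "Year SemiYear Product Income
1 1 ProductA 13377
1 2 ProductA 14069
1 1 ProductB 11426
1 2 ProductB 11750
2 1 ProductA 11122
2 2 ProductA 11202
2 1 ProductB 14712
2 2 ProductB 10169", header = TRUE)

# Two ID variables: every Year/SemiYear/Product cell holds exactly one value
semi <- dcast(income_data, Year + SemiYear ~ Product, value.var = "Income")
print(semi)

# One ID variable: two values fall into each cell, so they must be aggregated
yearly <- dcast(income_data, Year ~ Product,
                fun.aggregate = sum, value.var = "Income")
print(yearly)
```

Any function that reduces a vector to a single value, such as mean or max, can be passed as fun.aggregate in place of sum.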

Convert Wide Format Data to Long Format

Suppose you have data containing information of species and their sepal length. The data of
sepal length of species are in columns.

Wide to Long Format

Create Sample Data

mydata = read.table(text= "ID setosa versicolor virginica


1 5.1 NA NA
2 4.9 NA NA
3 NA 7 NA
4 NA 6.4 NA
5 NA NA 6.3
6 NA NA 5.8
", header=TRUE)
The following program would reshape data from wide to long format.

library(reshape2)
x = colnames(mydata[,-1])
t = melt(mydata,id.vars = "ID",measure.vars = x , variable.name="Species",
value.name="Sepal.Length",na.rm = TRUE)
How melt function works :

1. id.vars - ID variables to keep in the final output.
2. measure.vars - variables to be transformed
3. variable.name - name of variable used to store measured variable names
4. value.name - name of variable used to store values
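As a hedged illustration of these arguments, the sketch below melts the sample data and then casts the result back to wide format, which shows that melt and dcast are inverse operations here. It assumes reshape2 is installed; the names long and wide_again are illustrative, and measure.vars is left out so that melt uses every column other than the ID.

```r
library(reshape2)

mydata <- read.table(text = "ID setosa versicolor virginica
1 5.1 NA NA
2 4.9 NA NA
3 NA 7 NA
4 NA 6.4 NA
5 NA NA 6.3
6 NA NA 5.8", header = TRUE)

# Wide to long: one row per non-missing measurement
long <- melt(mydata, id.vars = "ID", variable.name = "Species",
             value.name = "Sepal.Length", na.rm = TRUE)
print(long)

# Long back to wide: ID/Species combinations without a value become NA
wide_again <- dcast(long, ID ~ Species, value.var = "Sepal.Length")
print(wide_again)
```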

Sample Data

data = read.table(text="
X Y Z
6 5 0
6 3 NA
6 1 5
8 5 3
1 NA 1
8 7 2
2 0 2", header=TRUE)

Apply Function

The apply function is used when we want to apply a function to the rows or columns of a matrix
or data frame. It cannot be applied to lists or vectors.
apply arguments
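The call has the form apply(X, MARGIN, FUN, ...), where X is the matrix or data frame, MARGIN is 1 for rows or 2 for columns, and FUN is the function to apply. A small base R sketch on an illustrative matrix m:

```r
# A small 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)
print(m)

# MARGIN = 1: apply the function to each row
row_sums <- apply(m, 1, sum)
print(row_sums)   # 9 12

# MARGIN = 2: apply the function to each column
col_sums <- apply(m, 2, sum)
print(col_sums)   # 3 7 11
```

Extra arguments after FUN, such as na.rm = TRUE, are passed on to the function for every row or column.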

Calculate maximum value across row

apply(data, 1, max)
It returns NA if NAs exist in a row. To ignore NAs, you can use the following line of code.

apply(data, 1, max, na.rm = TRUE)


Calculate mean value across row

apply(data, 1, mean)
apply(data, 1, mean, na.rm = TRUE)
Calculate number of 0s in each row

apply(data == 0, 1, sum, na.rm= TRUE)


Calculate number of values greater than 5 in each row

apply(data > 5, 1, sum, na.rm= TRUE)


Select all rows having mean value greater than or equal to 4

df = data[apply(data, 1, mean, na.rm = TRUE)>=4,]


Remove rows having NAs

helper = apply(data, 1, function(x){any(is.na(x))})


df2 = data[!helper,]
The same result can be obtained more simply with df2 = na.omit(data).
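Another base R route is complete.cases( ), which returns TRUE for every row without missing values. The sketch below repeats the sample data of this section so that it runs on its own and checks that the flag agrees with the apply-based helper; the names has_na and keep are illustrative.

```r
data <- read.table(text = "X Y Z
6 5 0
6 3 NA
6 1 5
8 5 3
1 NA 1
8 7 2
2 0 2", header = TRUE)

# Row-wise check with apply: TRUE where a row contains at least one NA
has_na <- apply(data, 1, function(x) any(is.na(x)))

# complete.cases() gives the opposite flag without an explicit apply
keep <- complete.cases(data)

df2 <- data[keep, ]
print(df2)
```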

Count unique values across row

df3 = apply(data,1, function(x) length(unique(na.omit(x))))

Conclusion:
