R Lab
Lab Manual
Vision IT Department
Mission IT Department
Equipping the students with technical skills, soft skills and professional attitude.
Providing the state of art facilities to the students to excel as competent professionals,
entrepreneurs and researchers.
List of Experiments
i. Study of data analysis using MS-Excel(Prerequisite)
Experiment No: 1
Aim: To perform the basic mathematical operations in R programming
Theory:
Installing Packages
The most common place to get packages from is CRAN. To install packages from CRAN you
use install.packages("packagename"). For instance, if you want to install the ggplot2
package, which is a very popular visualization package you would type the following in the
console:
# install package from CRAN
install.packages("ggplot2")
Loading Packages
Once the package is downloaded to your computer you can access the functions and
resources provided by the package in two different ways:
# load the package to use in the current R session
library(packagename)
# use a particular function within a package without loading the package
packagename::functionname
Assignment
The first operator you’ll run into is the assignment operator. The assignment operator is used
to assign a value. For instance we can assign the value 3 to the variable x using the <-
assignment operator.
# assignment
x <- 3
Interestingly, R actually allows for five assignment operators:
# leftward assignment
x <- value
x = value
x <<- value
# rightward assignment
value -> x
value ->> x
The original assignment operator in R was <-, and it has remained the preferred operator
among R users. The = assignment operator was added in 2001, primarily because it is the
accepted assignment operator in many other languages and beginners coming to R from those
languages were prone to use it.
The <<- operator (and its rightward form ->>) is normally used only inside functions, which
we will not cover in detail here.
Evaluation
We can then evaluate the variable by simply typing x at the command line, which will return
the value of x. Note that prior to the value returned you’ll see ## [1]. The [1] indicates
that the line starts with the first element of the output (the ## prefix is how output lines
are marked in this manual). Note that you can type comments in your code by preceding the
comment with the hash (#) symbol. Any values, symbols, and text following # on that line
will not be evaluated.
# evaluation
x
## [1] 3
Case Sensitivity
Lastly, note that R is a case sensitive programming language. Meaning all variables,
functions, and objects must be called by their exact spelling:
x <- 1
y <- 3
z <- 4
x*y*z
## [1] 12
x*Y*z
## Error in eval(expr, envir, enclos): object 'Y' not found
Basic Arithmetic
At its most basic, R can be used as a calculator. When applying basic arithmetic, the
PEMDAS order of operations applies: parentheses first, followed by exponentiation,
multiplication and division, and finally addition and subtraction.
8+9/5^2
## [1] 8.36
8 + 9 / (5 ^ 2)
## [1] 8.36
8 + (9 / 5) ^ 2
## [1] 11.24
(8 + 9) / 5 ^ 2
## [1] 0.68
By default R will display seven digits, but this can be changed using options().
1/7
## [1] 0.1428571
options(digits = 3)
1/7
## [1] 0.143
pi
## [1] 3.14
options(digits = 22)
pi
## [1] 3.141592653589793115998
We can also perform integer divide (%/%) and modulo (%%) functions. The integer divide
function will give the integer part of a fraction while the modulo will provide the remainder.
42 / 4 # regular division
## [1] 10.5
42 %/% 4 # integer division
## [1] 10
42 %% 4 # modulo (remainder)
## [1] 2
The workspace environment will also list your user defined objects such as vectors, matrices,
data frames, lists, and functions. For example, if you type the following in your console:
x <- 2
y <- 3
You will now see x and y listed in your workspace environment. To identify or remove the
objects (i.e. vectors, data frames, user defined functions, etc.) in your current R environment:
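A minimal sketch of the base R calls commonly used for this, assuming the objects x and y created above: ls() lists the objects in the current environment, exists() checks for an object by name, and rm() removes objects.

```r
x <- 2
y <- 3

# list all objects in the current environment
objs <- ls()
print(objs)

# check whether a specific object exists
exists("y")

# remove a single object
rm(x)

# remove everything in the current environment (left commented out on purpose)
# rm(list = ls())
```

The last call is commented out because it clears the whole workspace, which is rarely what you want in the middle of a session.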
Theory:
With R, it’s important to understand that there is a difference between the actual
R object and the manner in which that R object is printed to the console. Often, the printed
output may have additional bells and whistles to make the output more friendly to the users.
However, these bells and whistles are not inherently part of the object.
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The most basic type of R object is a vector. Empty vectors can be created with the
vector() function. There is really only one rule about vectors in R, which is that a vector can
only contain objects of the same class. But of course, like any good rule, there is an
exception, which is a list, which we will get to a bit later. A list is represented as a vector but
can contain objects of different classes. Indeed, that’s usually why we use them.
There is also a class for “raw” objects, but they are not commonly used directly in data
analysis.
Creating Vectors
The c() function can be used to create vectors of objects by concatenating things together.
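The vector() function can also be used to initialize a vector of a given length:
> x <- vector("numeric", length = 10)
> x
[1] 0 0 0 0 0 0 0 0 0 0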
Numeric vector
x <- c(1,2,3,4,5,6)
Character vector
To calculate the frequency of each value in a vector such as State, you can use the table() function.
If a vector contains an NA (not available) value, the mean() function returns NA.
To calculate the mean of a vector excluding NA values, include the na.rm = TRUE
argument in the mean() call.
You can use subscripts to refer to elements of a vector.
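The points above can be sketched as follows. The vectors State and marks and their values are hypothetical, made up only for illustration.

```r
# hypothetical character vector of states
State <- c("Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi")
freq <- table(State)              # frequency of each distinct value
print(freq)

# hypothetical numeric vector with a missing value
marks <- c(10, 20, NA, 40)
mean(marks)                       # NA, because of the missing value
avg <- mean(marks, na.rm = TRUE)  # mean of 10, 20 and 40 only

# subscripts: positions start at 1
marks[2]         # second element
marks[c(1, 4)]   # first and fourth elements
marks[-3]        # everything except the third element
```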
data$x = as.numeric(data$x)
Some useful vectors can be created quickly with R. The colon operator is used to generate
integer sequences:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> -3:4
[1] -3 -2 -1 0 1 2 3 4
> 9:5
[1] 9 8 7 6 5
More generally, the function seq() can generate any arithmetic progression.
> seq(from=2, to=6, by=0.4)
[1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0
Sometimes it’s necessary to have repeated values, for which we use rep()
> rep(5,3)
[1] 5 5 5
> rep(2:5,each=3)
[1] 2 2 2 3 3 3 4 4 4 5 5 5
> rep(-1:3, length.out=10)
[1] -1 0 1 2 3 -1 0 1 2 3
We can also use R’s vectorization to create more interesting sequences:
> 2^(0:10)
[1] 1 2 4 8 16 32 64 128 256 512 1024
> 1:3 + rep(seq(from=0, by=10, to=30), each=3)
[1] 1 2 3 11 12 13 21 22 23 31 32 33
Lists:
A list allows you to store a variety of objects.
You can use subscripts to select the specific component of the list.
> x <- list(1:3, TRUE, "Hello", list(1:2, 5))
Here x has 4 elements: a numeric vector, a logical, a string and another list.
> x[[3]]
[1] "Hello"
> x[c(1,3)]
[[1]]
[1] 1 2 3
[[2]]
[1] "Hello"
We can also name some or all of the entries in our list, by supplying argument names to list():
> x <- list(y=1:3, TRUE, z="Hello")
> x
$y
[1] 1 2 3
[[2]]
[1] TRUE
$z
[1] "Hello"
Notice that the [[1]] has been replaced by $y, which gives us a clue as to
how we can recover the entries by their name. We can still use the numeric
position if we prefer:
> x$y
[1] 1 2 3
> x[[1]]
[1] 1 2 3
The function names() can be used to obtain a character vector of all the
names of the entries in a list:
> names(x)
[1] "y" "" "z"
Conclusion:
Experiment No. 3
Aim: Implementation of various operations on matrix, array and factors in R
Theory:
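Matrices are widely used in statistics, and so play an important role in R. To create a matrix
use the function matrix(), specifying elements by column first:
> matrix(1:12, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
This is called column-major order. Of course, we need only give one of the dimensions,
unless we want vector recycling to fill the matrix from a shorter vector:
> matrix(1:3, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3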
> diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> diag(1:3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> 1:5 %o% 1:5
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25
The last operator performs an outer product, so it creates a matrix with (i, j)-th entry x_i y_j.
The function outer() generalizes this to any function f on two arguments, to create a matrix
with entries f(x_i, y_j). (More on functions later.)
> outer(1:3, 1:4, "+")
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    3    4    5    6
[3,]    4    5    6    7
     [,1]
[1,]   30
[2,]   36
[3,]   45
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    4   10   16
[3,]    9   18   30
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8   10
[1] -3
[1]  1  5 10
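The five outputs above are consistent with the following operations on a 3 × 3 matrix. The object names A and x, and the commands themselves, are assumptions reconstructed from the printed results rather than taken from the original text.

```r
# assumed inputs: a 3 x 3 matrix filled by column, and a vector of length 3
A <- matrix(c(1:8, 10), nrow = 3, ncol = 3)
x <- c(1, 2, 3)

A %*% x   # matrix multiplication: a 3 x 1 matrix
A * x     # element-wise multiplication (x is recycled down the columns)
t(A)      # transpose
det(A)    # determinant
diag(A)   # diagonal entries
```

Note the difference between %*% (true matrix multiplication) and * (element-wise multiplication with recycling); mixing them up is a common source of errors.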
Array:
Of course, if we have a data set consisting of more than two pieces of categorical information
about each subject, then a matrix is not sufficient. The generalization of matrices to higher
dimensions is the array. Arrays are defined much like matrices, with a call to the array()
command. Here is a 2 × 3 × 3 array:
> arr <- array(1:18, dim=c(2,3,3))
> arr
, , 1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
, , 2
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
, , 3
     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18
Each 2-dimensional slice defined by the last co-ordinate of the array is shown as a 2 × 3
matrix. Note that we no longer specify the number of rows and columns separately, but use a
single vector dim whose length is the number of dimensions. You can recover this vector
with the dim() function.
> dim(arr)
[1] 2 3 3
Arrays can be subsetted and modified in exactly the same way as a matrix, only using the
appropriate number of co-ordinates:
> arr[1,2,3]
[1] 15
> arr[,2,]
[,1] [,2] [,3]
[1,] 3 9 15
[2,] 4 10 16
> arr[1,1,1] <- 0
> arr[,,1,drop=FALSE]
, , 1
     [,1] [,2] [,3]
[1,]    0    3    5
[2,]    2    4    6
Factors
R has a special data structure to store categorical variables. It tells R that a variable is
nominal or ordinal by making it a factor.
data$x = as.factor(data$x)
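A small sketch of nominal and ordinal factors. The vector size and its values are hypothetical, chosen only to show how levels behave.

```r
# hypothetical responses stored as text
size <- c("small", "large", "medium", "small", "large")

# nominal factor: levels are sorted alphabetically by default
f <- factor(size)
levels(f)
table(f)

# ordinal factor: give the levels in their natural order
o <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
o < "large"      # comparisons are meaningful for ordered factors
as.integer(o)    # underlying integer codes follow the level order
```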
Experiment No. 4
Aim: To implement and perform various operations on data frames in R
Theory:
3. The data stored in a data frame can be of numeric, factor or character type.
The structure of the data frame can be seen by using str() function.
The statistical summary and nature of the data can be obtained by applying summary()
function.
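A minimal sketch of these steps, assuming a small employee data frame. The name emp.data appears later in this experiment, but the columns and values here are hypothetical.

```r
# hypothetical employee data
emp.data <- data.frame(
  emp_id   = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary   = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

str(emp.data)       # column types and a preview of the values
summary(emp.data)   # per-column statistical summary

# extract specific columns
result <- data.frame(emp.data$emp_name, emp.data$salary)
print(result)

# extract 3rd and 5th row with 2nd and 3rd column
result <- emp.data[c(3, 5), c(2, 3)]
print(result)
```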
print(result)
print(result)
# Extract 3rd and 5th row with 2nd and 4th column.
print(result)
1. Add Column
v <- emp.data
print(v)
2. Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows
in the same structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing
data frame to create the final data frame.
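One way this could look, again with hypothetical columns and values. The key point is that the new rows carry exactly the same columns as the existing data frame.

```r
# hypothetical existing data frame
emp.data <- data.frame(
  emp_id   = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  salary   = c(623.3, 515.2, 611.0),
  stringsAsFactors = FALSE
)

# add a column: one value per existing row
emp.data$dept <- c("IT", "Operations", "HR")

# new rows, built with the same column names and types
emp.newdata <- data.frame(
  emp_id   = 4:5,
  emp_name = c("Ryan", "Gary"),
  salary   = c(729.0, 843.25),
  dept     = c("Finance", "IT"),
  stringsAsFactors = FALSE
)

# bind the two data frames by row
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
```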
Conclusion:
Experiment No. 5
Aim: To Create Sample (Dummy) Data in R and perform data manipulation with R
Theory:
This experiment covers how to perform the most frequently used data manipulation tasks with R.
It includes various examples with datasets and code, and gives a quick look at several functions
used in R.
# for multiple
# OR
> DF[keeps]
> DF
d3=data.frame(roll=c(2,4,6,3,1,5),
name=c('a','b','c','d','e','e'),
marks=c(44,55,22,33,66,77))
> d3
d3[order(d3$roll),]
OR
d3[with(d3,order(roll)),]
Subsets:
roll=c(1:5)
names=c(letters[1:5])
marks=c(12,33,44,55,66)
d4=data.frame(roll,names,marks)
sub1=subset(d4,marks>33 & roll>4)
sub1
sub1=subset(d4,marks>33 & roll>4,select = c(roll,names))
sub1
Rename Columns in R
colnames(d)[colnames(d)=="roll"]="ID"
d$class=c(1,2,1,2,1,2)
table(d$class)
In this example, we are replacing "Delhi" with "Mumbai" in State variable. We need to
convert the variable from factor to character.
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
Another method
You have to first install the car package.
# Install the car package
install.packages("car")
Sorting
Sorting is one of the most common data manipulation tasks. It is generally used when
we want to see, for example, the 5 highest or lowest values of a variable.
Sorting a vector
x= sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() is used for sorting a one-dimensional vector. It cannot be used for objects
with more than one dimension, such as data frames.
Sorting a data frame
mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT=
sample(1:25))
Sort gender variable in ascending order
mydata.sorted <- mydata[order(mydata$Gender),]
Sort gender variable in ascending order and then SAT in descending order
mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT),]
Note : "-" sign before mydata$SAT tells R to sort SAT variable in descending order.
Value labelling
Use factor() for nominal data
mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male",
"female"))
Use ordered() for ordinal data
mydata$var2 <- ordered(mydata$var2, levels = c(1,2,3,4), labels = c("Strongly agree",
"Somewhat agree", "Somewhat disagree", "Strongly disagree"))
Theory:
If-Else statements are an important part of R programming. In this experiment, we will see various
ways to apply conditional statements (if..else, nested if) in R. R also has many powerful
packages for data manipulation; in the later part of this experiment, we will see how if-else
statements are used in popular packages.
Sample Data
Let's create sample data to show how to use the ifelse() function. This data frame is used
in the examples that follow.
x1 x2 x3
1 129 A
3 178 B
5 140 C
7 186 D
9 191 E
11 104 F
13 150 G
15 183 H
17 151 I
19 142 J
set.seed(123)
mydata = data.frame(x1 = seq(1,20,by=2),
x2 = sample(100:200,10,FALSE),
x3 = LETTERS[1:10])
x1 = seq(1,20,by=2) : The variable 'x1' contains every second number starting from 1 and not
exceeding 20 (1, 3, ..., 19). In total, these are 10 numeric values.
x2 = sample(100:200,10,FALSE) : The variable 'x2' constitutes 10 non-repeating random
numbers ranging between 100 and 200.
The ifelse() function in R works similarly to the IF function in MS Excel. Its syntax is
ifelse(test, yes, no): where test is TRUE the result takes the value from yes, otherwise from no.
Suppose you are asked to create a binary variable - 1 or 0 based on the variable 'x2'. If value of a
variable 'x2' is greater than 150, assign 1 else 0.
mydata$x4 = ifelse(mydata$x2>150,1,0)
In this case, it creates a variable x4 in the same data frame 'mydata'.
Create variable in a new data frame
Suppose you need to add the above created binary variable in a new data frame. You can do it by
using the code below -
x = ifelse(mydata$x2>150,1,0)
newdata = cbind(x,mydata)
The cbind() is used to combine two vectors, matrices or data frames by columns.
Apply ifelse() on Character Variables
If variable 'x3' contains character values - 'A', 'D', the variable 'x1' should be multiplied by 2.
Otherwise it should be multiplied by 3.
x1 x2 x3 y
1 129 A 2
3 178 B 9
5 140 C 15
7 186 D 14
9 191 E 27
11 104 F 33
13 150 G 39
15 183 H 45
17 151 I 51
19 142 J 57
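One way to produce the y column shown above, sketched with only the x1 and x3 columns. The use of %in% mirrors the nested example in this section and is an assumption about how the table was generated.

```r
mydata <- data.frame(x1 = seq(1, 20, by = 2),
                     x3 = LETTERS[1:10],
                     stringsAsFactors = FALSE)

# multiply x1 by 2 where x3 is 'A' or 'D', otherwise by 3
mydata$y <- ifelse(mydata$x3 %in% c("A", "D"), mydata$x1 * 2, mydata$x1 * 3)
print(mydata)
```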
Multiple if-else statements can be nested, similarly to Excel's IF function. In this case, we are
telling R to multiply variable x1 by 2 if variable x3 contains the values 'A' or 'B'. If the values
are 'C' or 'D', multiply it by 3. Otherwise multiply it by 4.
mydata$y = ifelse(mydata$x3 %in% c("A","B") ,mydata$x1*2,
ifelse(mydata$x3 %in% c("C","D"), mydata$x1*3,
mydata$x1*4))
Do you hate specifying data frame multiple times with each variable?
You can use with() function to avoid mentioning data frame each time. It makes writing R code
faster.
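A short sketch comparing the two styles on a cut-down version of the sample data. The names y1 and y2 are hypothetical and used only for the comparison.

```r
mydata <- data.frame(x1 = seq(1, 20, by = 2),
                     x3 = LETTERS[1:10],
                     stringsAsFactors = FALSE)

# without with(): the data frame name is repeated for every column
y1 <- ifelse(mydata$x3 %in% c("A", "B"), mydata$x1 * 2,
             ifelse(mydata$x3 %in% c("C", "D"), mydata$x1 * 3,
                    mydata$x1 * 4))

# with with(): columns are referred to by name only
y2 <- with(mydata, ifelse(x3 %in% c("A", "B"), x1 * 2,
                          ifelse(x3 %in% c("C", "D"), x1 * 3,
                                 x1 * 4)))

identical(y1, y2)   # both styles give the same result
```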
Incorrect Method
x = NA
ifelse(x==NA,1,0)
Result : NA
It should have returned 1.
Correct Method
x = NA
ifelse(is.na(x),1,0)
Result : 1
The is.na() function tests whether a value is NA or not.
ifelse(mydata$x1<10 | mydata$x2>150,1,0)
Result : 1 1 1 1 1 0 0 1 1 0
In this example, we are counting the number of records that meet the condition.
sum(ifelse(mydata$x1<10 | mydata$x2>150,1,0))
Result : 7
There is one more way to define if..else statement in R. This style of writing If Else is mostly
used when we use conditional statements in loop and R functions. In other words, it is used when
we need to perform various actions based on a condition.
Syntax -
k = 100
if(k > 100){
print("Greater than 100")
} else if (k < 100){
print("Less than 100")
} else {
print ("Equal to 100")
}
Result : "Equal to 100"
1. dplyr package
library(dplyr)
x=c(1,NA,2,3)
if_else(x%%2==0, "Multiple of 2", "Not a multiple of 2", "Missing")
Result : "Not a multiple of 2" "Missing" "Multiple of 2" "Not a multiple of 2"
The %% operator returns the remainder after a value is divided by a divisor. In this case, each
element of x is divided by 2; the first element, 1, leaves a remainder of 1, so it is not a
multiple of 2.
2. sqldf package
We can write SQL query in R using sqldf package. In SQL, If Else statement is defined in CASE
WHEN.
df=data.frame(k=c(2,NA,3,4,5))
library(sqldf)
sqldf(
"SELECT *,
CASE WHEN (k%2)=0 THEN 'Multiple of 2'
WHEN k is NULL THEN 'Missing'
ELSE 'Not a multiple of 2'
END AS T
FROM df"
)
Output
k T
2 Multiple of 2
NA Missing
3 Not a multiple of 2
4 Multiple of 2
5 Not a multiple of 2
What is Loop?
A loop helps you to repeat a similar operation on different variables, columns or datasets.
For example, suppose you want to multiply each variable by 5. Instead of multiplying
each variable one by one, you can perform this task in a loop. Its main benefit is to reduce the
duplication in your code, which makes later changes easier.
The apply family of functions are the hidden loops in R. They make loops easier to read and
write, but these concepts are newer to the programming world than the for loop and while loop.
1. Apply Function
It is used when we want to apply a function to the rows or columns of a matrix or data frame. It
cannot be applied to lists or vectors.
Arguments: apply(X, MARGIN, FUN)
In the second argument of the apply function, 1 means the function is applied to each row and
2 means it is applied to each column.
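A minimal sketch of the row and column cases. The matrix m is a hypothetical example.

```r
m <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns, filled by column

row_max  <- apply(m, 1, max)    # MARGIN = 1: one result per row
col_mean <- apply(m, 2, mean)   # MARGIN = 2: one result per column

print(row_max)
print(col_mean)
```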
2. Lapply Function
It is used when we apply a function to each element of a data structure and want the result
returned as a list.
Arguments: lapply(X, FUN)
3. Sapply Function
Sapply is a user friendly version of Lapply as it returns a vector when we apply a function to
each element of a data structure.
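The difference in return type can be seen on a small data frame. The data frame dat and its values are hypothetical.

```r
dat <- data.frame(a = c(1, 5, 3), b = c(10, 20, 60))

as_list   <- lapply(dat, max)   # list with one element per column
as_vector <- sapply(dat, max)   # simplified to a named numeric vector

str(as_list)
print(as_vector)
```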
colnames(iris)[which(sapply(iris,is.numeric))]
In this example, sapply(iris,is.numeric) returns TRUE/FALSE against each variable. If the
variable is numeric, it would return TRUE otherwise FALSE. Later, which function returns the
column position of the numeric variables . Try running only this portion of the
code which(sapply(iris,is.numeric)). Adding colnames function would help to return the actual
names of the numeric variables.
Lapply and Sapply Together
In this example, we would show you how both lapply and sapply are used simultaneously to
solve the problem.
The following code would convert all the factor variables of data frame 'dat' to numeric types
variables.
index <- sapply(dat, is.factor)
dat[index] <- lapply(dat[index], function(x) as.numeric(as.character(x)))
Explanation :
1. index would return TRUE / FALSE whether the variable is factor or not
2. Converting only those variables wherein index=TRUE.
4. For Loop
Like the apply family of functions, a for loop is used to repeat the same task on multiple data
elements or datasets. It is similar to the for loop in other languages such as VB, Python etc. This
concept is not new and has been in the programming field for many years.
x = NULL
for (i in 1:ncol(dat)){
x[i]= max(dat[i], na.rm = TRUE)}
x
Prior to starting a loop, we need to create an empty vector to hold the results; here it is
defined by x = NULL. The next step is to define the number of columns over which the loop will
run, which is done with the ncol function. For a data frame, the length function could also be
used to get the number of columns.
The above FOR LOOP program can be written like the code below -
x = vector("double", ncol(dat))
for (i in seq_along(dat)){
x[i]= max(dat[i], na.rm = TRUE)}
x
The vector function can be used to create an empty vector. The seq_along finds out what to
loop over.
The program below creates multiple data frames based on the number of unique values in
variable Species in IRIS dataset.
for (i in 1:length(unique(iris$Species))) {
require(dplyr)
assign(paste("iris",i, sep = "."), filter(iris, Species == as.character(unique(iris$Species)[i])))
}
It returns three data frames named iris.1 iris.2 iris.3.
In the example below, we are combining / appending rows in iterative process. It is same as
PROC APPEND in SAS.
do.call() calls a given function with the elements of a list as its arguments. When it is used
with rbind, it binds all the list elements together by row. In other words, it converts a list of
data frames into a single data frame of multiple rows.
temp =list()
for (i in 1:length(unique(iris$Species))) {
series= data.frame(Species =as.character(unique(iris$Species))[i])
temp[[i]] =series
}
output = do.call(rbind, temp)
output
Method 2 : Use Standard Looping Technique
In this case, we are first creating an empty table (data frame). Later we are appending data to
empty data frame.
dummydt=data.frame(matrix(ncol=0,nrow=0))
for (i in 1:length(unique(iris$Species))) {
series= data.frame(Species =as.character(unique(iris$Species))[i])
if (i==1) {output = rbind(dummydt,series)} else {output = rbind(output,series)}
}
output
If we need to wrap the above code in function, we need to make some changes in the code. For
example, data$variable won't work inside the code . Instead we should use data[[variable]]. See
the code below -
dummydt=data.frame(matrix(ncol=0,nrow=0))
temp = function(data, var) {
for (i in 1:length(unique(data[[var]]))) {
series= data.frame(Species = as.character(unique(data[[var]]))[i])
if (i==1) {output = rbind(dummydt,series)} else {output = rbind(output,series)}
}
return(output)}
temp(iris, "Species")
Suppose you are asked to impute Missing Values with Median in each of the numeric variable in
a data frame. It's become a daunting task if you don't know how to write a loop. Otherwise, it's a
straightforward task.
In the program below, which(sapply(dat, is.numeric)) makes sure loop runs only on numeric
variables.
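A minimal sketch of such a loop, assuming a small data frame dat with made-up values. Only numeric columns are touched, so the character column keeps its missing value.

```r
# hypothetical data with missing values
dat <- data.frame(x1 = c(1, NA, 3, 10),
                  x2 = c(NA, 4, 6, 8),
                  x3 = c("a", "b", NA, "d"),
                  stringsAsFactors = FALSE)

# loop over the positions of the numeric columns only
for (i in which(sapply(dat, is.numeric))) {
  # replace the NA entries of column i with that column's median
  dat[is.na(dat[, i]), i] <- median(dat[, i], na.rm = TRUE)
}
print(dat)
```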
Suppose you need to standardise multiple variables. To accomplish this task, we need to execute
the following steps -
num.cols = which(sapply(mydata, is.numeric))   # positions of the numeric columns
lst=list()
for (i in num.cols) {
x.scaled = (mydata[,i] - mean(mydata[,i])) /sd(mydata[,i])
lst[[names(mydata)[i]]] = x.scaled   # store under the original column name
}
names(lst) <- paste(names(lst),"_scaled", sep="")
mydata.scaled= data.frame(do.call(cbind, lst))
In this case, do.call with cbind function helps to make data in matrix form from list.
5. While Loop in R
A while loop is more general than a for loop, because you can rewrite any for loop as a while
loop but not vice-versa.
i=1
while(i<7)
{
if(i%%2==0)
print(paste(i, "is an Even number"))
else if(i%%2>0)
print(paste(i, "is an Odd number"))
i=i+1
}
The double percent sign (%%) is the modulo operator. Read i%%2 as mod(i,2). The iteration
runs from 1 to 6 (i.e. while i<7) and stops as soon as the condition i<7 is no longer true.
Output:
[1] "1 is an Odd number"
[1] "2 is an Even number"
[1] "3 is an Odd number"
[1] "4 is an Even number"
[1] "5 is an Odd number"
[1] "6 is an Even number"
Break Keyword
When a loop encounters 'break' it stops the iteration and breaks out of loop.
for (i in 1:3) {
for (j in 3:1) {
if ((i+j) > 4) {
break } else {
print(paste("i=", i, "j=", j))
}
}
}
Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"
Next Keyword
When a loop encounters 'next', it terminates the current iteration and moves to next iteration.
for (i in 1:3) {
for (j in 3:1) {
if ((i+j) > 4) {
next
} else {
print(paste("i=", i, "j=", j))
}
}
}
Output :
[1] "i= 1 j= 3"
[1] "i= 1 j= 2"
[1] "i= 1 j= 1"
[1] "i= 2 j= 2"
[1] "i= 2 j= 1"
[1] "i= 3 j= 1"
If you get confused between 'break' and 'next', compare the output of both and see the difference.
Conclusion:
Experiment No. 7
Aim: Data Manipulation with dplyr package
Theory:
The dplyr package is one of the most powerful and popular packages in R. It was written by
Hadley Wickham, a prominent R programmer who has written many other useful R packages such
as ggplot2 and tidyr. This experiment includes several examples and tips on how to use the
dplyr package for cleaning and transforming data. It serves as a tutorial on data manipulation
and data wrangling with R.
What is dplyr?
The dplyr is a powerful R-package to manipulate, clean and summarize unstructured data. In
short, it makes data exploration and data manipulation easy and fast in R.
People have been using SQL to analyze data for decades. Most modern data analysis tools,
such as Python, R and SAS, can run SQL commands. But SQL was designed for querying and
managing data rather than for data analysis, and there are many analysis operations that are
awkward in SQL, for example calculating the median of multiple variables or converting
wide-format data to long format. The dplyr package, by contrast, was designed for data analysis.
The names of dplyr functions are similar to SQL commands: select() for selecting
variables, group_by() for grouping data by a grouping variable, and the join functions, such as
inner_join() and left_join(), for joining two data sets. Operations can also be chained, much
like the sub-queries for which SQL is popular.
How to install and load dplyr package
install.packages("dplyr")
To load dplyr package, type the command below
library(dplyr)
Important dplyr Functions to remember
In this tutorial, we are using the following data which contains income generated by states from
year 2002 to 2015. Note : This data do not contain actual income figures of the states.
This dataset contains 51 observations (rows) and 16 variables (columns). The snapshot of first 6
rows of the dataset is shown below.
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
Submit the following code. Change the file path in the code below.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")
Example 1 : Selecting Random N Rows
The sample_n function selects random rows from a data frame (or table). The second parameter
of the function tells R the number of rows to select.
sample_n(mydata,3)
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
33 N New York 1395149 1611371 1170675 1446810 1426941 1463171 1732098 1426216
The sample_frac function returns randomly N% of rows. In the example below, it returns
randomly 10% of rows.
sample_frac(mydata,0.1)
Example 3 : Remove Duplicate Rows based on all the variables (Complete Row)
x1 = distinct(mydata)
In this dataset, there is not a single duplicate row so it returned same number of rows as in
mydata.
The .keep_all argument is used to retain all other variables in the output data frame.
In the example below, we are using two variables - Index, Y2010 to determine uniqueness.
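A sketch of both cases on a small stand-in data frame. The data frame df and its values are hypothetical; with the real data you would pass mydata instead.

```r
library(dplyr)

# hypothetical stand-in for mydata
df <- data.frame(Index = c("A", "A", "C", "C"),
                 Y2010 = c(10, 10, 30, 40),
                 Y2011 = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)

# uniqueness judged on Index only; .keep_all = TRUE retains the other columns
x2 <- distinct(df, Index, .keep_all = TRUE)

# uniqueness judged on the combination of Index and Y2010
x3 <- distinct(df, Index, Y2010, .keep_all = TRUE)

print(x2)
print(x3)
```

In both calls the first row found for each unique combination is the one that is kept.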
select( ) Function
Suppose you are asked to select only a few variables. The code below selects variables "Index",
columns from "State" to "Y2008".
mydata2 = select(mydata, Index, State:Y2008)
Helpers Description
starts_with() Starts with a prefix
ends_with() Ends with a suffix
contains() Contains a literal string
matches() Matches a regular expression
num_range() Numerical range like x01, x02, x03.
one_of() Variables in character vector.
everything() All variables.
The code below keeps variable 'State' in the front and the remaining variables follow that.
[1] "State" "Index" "Y2002" "Y2003" "Y2004" "Y2005" "Y2006" "Y2007" "Y2008" "Y2009"
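A reordering like the one shown above can be sketched with the everything() helper. The one-row data frame df is a hypothetical stand-in for mydata.

```r
library(dplyr)

# hypothetical stand-in for mydata
df <- data.frame(Index = "A", State = "Alabama", Y2002 = 1,
                 stringsAsFactors = FALSE)

# State first, then all remaining columns in their original order
reordered <- select(df, State, everything())
names(reordered)
```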
rename( ) Function
Output
filter( ) Function
Suppose you need to subset data. You want to filter rows and retain only those values in which
Index is equal to A.
mydata7 = filter(mydata, Index == "A")
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
The %in% operator can be used to select multiple items. In the following program, we are
telling R to select rows against 'A' and 'C' in column 'Index'.
Suppose you need to apply 'AND' condition. In this case, we are picking data for 'A' and 'C' in
the column 'Index' and income greater than 1.3 million in Year 2002.
mydata8 = filter(mydata6, Index %in% c("A", "C") & Y2002 >= 1300000 )
The '|' denotes OR in the logical condition. It means either of the two conditions.
mydata9 = filter(mydata6, Index %in% c("A", "C") | Y2002 >= 1300000)
Example 16 : NOT Condition
The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata6, !Index %in% c("A", "C"))
The grepl function is used to search for pattern matching. In the following code, we are looking
for records wherein column state contains 'Ar' in their name.
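A minimal sketch of this filter. The data frame df and the result name ar_states are hypothetical. Note that grepl() is case-sensitive by default, so "Delaware" (lower-case "ar") is not matched.

```r
library(dplyr)

# hypothetical stand-in for mydata
df <- data.frame(State = c("Alabama", "Arizona", "Arkansas", "Delaware"),
                 Y2002 = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)

# keep rows whose State contains the pattern "Ar"
ar_states <- filter(df, grepl("Ar", State))
print(ar_states)
```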
summarise( ) Function
In the example below, we are calculating mean and median for the variable Y2015.
In the following example, we are calculating number of records, mean and median for variables
Y2005 and Y2006. The summarise_at function allows us to select multiple variables by their
names.
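Both summaries can be sketched as follows. The data frame df and its values are hypothetical. A named list of functions is passed to summarise_at() here instead of funs(), which is an assumption about the installed dplyr version (funs() is deprecated in recent releases), and length is used to count records.

```r
library(dplyr)

# hypothetical stand-in for mydata
df <- data.frame(Y2005 = c(1, 2, 3, 4),
                 Y2006 = c(10, 20, 30, 100),
                 Y2015 = c(5, 7, 9, NA))

# mean and median of a single variable
s1 <- summarise(df,
                Y2015_mean = mean(Y2015, na.rm = TRUE),
                Y2015_med  = median(Y2015, na.rm = TRUE))

# number of records, mean and median of several variables at once
s2 <- summarise_at(df, vars(Y2005, Y2006),
                   list(n = length, mean = mean, median = median))

print(s1)
print(s2)
```

The columns of s2 are named by combining the variable name and the function name, for example Y2005_mean.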
Example 20 : Summarize with Custom Functions
We can also use custom functions in the summarise function. In this case, we are computing the
number of records, number of missing values, mean and median for variables Y2011 and Y2012.
The dot (.) denotes each variable specified in the second argument of the function.
summarise_at(mydata, vars(Y2011, Y2012),
funs(n(), missing = sum(is.na(.)), mean(., na.rm = TRUE), median(.,na.rm = TRUE)))
Suppose you want to subtract mean from its original value and then calculate variance of it.
set.seed(222)
mydata <- data.frame(X1=sample(1:100,100), X2=runif(100))
summarise_at(mydata,vars(X1,X2), function(x) var(x - mean(x)))
X1 X2
1 841.6667 0.08142161
nlevels nmiss
1 19 0
arrange() function :
Syntax
arrange(data_frame, variable(s)_to_sort)
or
data_frame %>% arrange(variable(s)_to_sort)
To sort a variable in descending order, use desc(x).
The default sorting order of arrange() function is ascending. In this example, we are sorting
data by multiple variables.
arrange(mydata, Index, Y2011)
Suppose you need to sort one variable in descending order and another variable in ascending order.
arrange(mydata, desc(Index), Y2011)
Pipe Operator %>%
It is important to understand the pipe (%>%) operator before learning the other functions of
the dplyr package. dplyr uses the pipe operator from another package (magrittr).
It allows you to write sub-queries like we do in SQL.
Note : All the functions in the dplyr package can be used without the pipe operator. The question
arises, "Why use the pipe operator %>%?" The answer is that it lets you chain multiple functions
together.
Syntax :
filter(data_frame, variable == value)
or
data_frame %>% filter(variable == value)
The %>% is NOT restricted to filter function. It can be used with any function.
Example :
The code below demonstrates the usage of pipe %>% operator. In this example, we are selecting
10 random observations of two variables "Index" "State" from the data frame "mydata".
dt = sample_n(select(mydata, Index, State),10)
or
dt = mydata %>% select(Index, State) %>% sample_n(10)
group_by() function :
Syntax :
group_by(data, variables)
or
data %>% group_by(variables)
We are calculating count and mean of variables Y2011 and Y2012 by variable Index.
t = summarise_at(group_by(mydata, Index), vars(Y2011, Y2012), funs(n(), mean(., na.rm =
TRUE)))
The above code can also be written with the pipe operator; here it is applied to the variables
Y2011 through Y2015:
t = mydata %>% group_by(Index) %>%
summarise_at(vars(Y2011:Y2015), funs(n(), mean(., na.rm = TRUE)))
A 4 4 4 4 4 1432642 1455876
C 3 3 3 3 3 1750357 1547326
D 2 2 2 2 2 1336059 1981868
F 1 1 1 1 1 1497051 1131928
G 1 1 1 1 1 1851245 1850111
H 1 1 1 1 1 1902816 1695126
I 4 4 4 4 4 1690171 1687056
K 2 2 2 2 2 1489353 1899773
L 1 1 1 1 1 1210385 1234234
M 8 8 8 8 8 1582714 1586091
N 8 8 8 8 8 1448351 1470316
O 3 3 3 3 3 1882111 1602463
P 1 1 1 1 1 1483292 1290329
R 1 1 1 1 1 1781016 1909119
S 2 2 2 2 2 1381724 1671744
T 2 2 2 2 2 1724080 1865787
U 1 1 1 1 1 1288285 1108281
V 2 2 2 2 2 1482143 1488651
W 4 4 4 4 4 1711341 1660192
do() function :
Suppose you need to pull top 2 rows from 'A', 'C' and 'I' categories of variable Index.
t = mydata %>% filter(Index %in% c("A", "C","I")) %>% group_by(Index) %>%
do(head( . , 2))
We are calculating third maximum value of variable Y2015 by variable Index. The following
code first selects only two variables Index and Y2015. Then it filters the variable Index with 'A',
'C' and 'I' and then it groups the same variable and sorts the variable Y2015 in descending order.
At last, it selects the third row.
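The steps described above can be sketched as follows. The data frame df and its values are hypothetical, and the sort within groups is done with arrange(.by_group = TRUE) rather than do(), which is a substitution on my part.

```r
library(dplyr)

# hypothetical stand-in for mydata: four rows per Index value
df <- data.frame(Index = rep(c("A", "C", "I", "M"), each = 4),
                 Y2015 = c(10, 40, 30, 20,
                           5, 15, 25, 35,
                           100, 300, 200, 400,
                           1, 2, 3, 4),
                 stringsAsFactors = FALSE)

third_max <- df %>%
  select(Index, Y2015) %>%                       # keep the two variables
  filter(Index %in% c("A", "C", "I")) %>%        # keep three categories
  group_by(Index) %>%                            # work within each category
  arrange(desc(Y2015), .by_group = TRUE) %>%     # largest first in each group
  slice(3)                                       # third row of each group

print(third_max)
```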
Using Window Functions
Like SQL, dplyr provides window functions, which are used to subset data within a group. A
window function returns a vector of values. We could use the min_rank() function, which
calculates ranks, in the preceding example:
Index Y2015
1 A 1647724
2 C 1330736
3 I 1583516
In this case, we are computing mean of variables Y2014 and Y2015 by variable Index. Then sort
the result by calculated mean variable Y2015.
t = mydata %>%
group_by(Index)%>%
summarise(Mean_2014 = mean(Y2014, na.rm=TRUE),
Mean_2015 = mean(Y2015, na.rm=TRUE)) %>%
arrange(desc(Mean_2015))
mutate() function :
Syntax :
mutate(data_frame, expression(s) )
or
data_frame %>% mutate(expression(s))
Example 28 : Create a new variable
The following code divides Y2015 by Y2014 and names the result "change".
mydata1 = mutate(mydata, change=Y2015/Y2014)
Example 29 : Multiply all the variables by 1000
It creates new variables and names them with the suffix "_new".
mydata11 = mutate_all(mydata, funs("new" = .* 1000))
Output
The output shown in the image above is truncated due to high number of variables.
Warning messages:
1: In Ops.factor(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, :
‘*’ not meaningful for factors
2: In Ops.factor(1:51, 1000) : ‘*’ not meaningful for factors
These warnings mean that you are multiplying string (character) values, which are stored as factor variables, by 1000. The variables concerned are 'Index' and 'State'. Multiplication is not meaningful for character variables, so the new variables created for these two contain only NA.
1 A Alaska 1979143
2 C Connecticut 1718072
3 D Delaware 1627508
4 F Florida 1170389
5 G Georgia 1725470
6 H Hawaii 1150882
7 I Idaho 1757171
8 K Kentucky 1913350
9 L Louisiana 1403857
10 M Missouri 1996005
12 O Oregon 1893515
13 P Pennsylvania 1668232
17 U Utah 1729273
18 V Virginia 1850394
19 W Wyoming 1853858
The cumsum function calculates the cumulative sum of a variable. With the mutate function, we insert a new variable called 'Total' which contains the cumulative income within each level of variable Index.
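A minimal sketch of a grouped running total, assuming a made-up data frame (toy) with columns Index and Y2015 in place of mydata:

```r
library(dplyr)

# Hypothetical stand-in for mydata
toy <- data.frame(Index = c("A", "A", "C", "C", "C"),
                  Y2015 = c(100, 200, 10, 20, 30))

# Running total of Y2015, restarted for each Index group
toy_total <- toy %>%
  group_by(Index) %>%
  mutate(Total = cumsum(Y2015)) %>%
  ungroup()
```

Because the data are grouped first, the running total restarts at the first row of each group.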
join() function :
Syntax :
inner_join(x, y, by = )
left_join(x, y, by = )
right_join(x, y, by = )
full_join(x, y, by = )
semi_join(x, y, by = )
anti_join(x, y, by = )
x, y - datasets (or tables) to merge / join
by - common variable (primary key) to join by.
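# Sample data (the listing is incomplete in the source; the ID values here are illustrative)
df1 = data.frame(ID = c(1, 2, 3, 4, 5),
x = c(1, 1, 0, 0, 1),
y = rnorm(5),
z = letters[1:5])
df2 = data.frame(ID = c(1, 7, 3, 6, 8),
b = c(1, 2, 3, 0, 4),
c = rnorm(5),
d = letters[2:6])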
INNER JOIN returns rows when there is a match in both tables. In this example, we are
merging df1 and df2 with ID as common variable (primary key).
LEFT JOIN : It returns all rows from the left table, even if there are no matches in the right
table.
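The difference between the two joins can be sketched with two small hypothetical tables (left_tbl and right_tbl are illustrative names, not part of the original example):

```r
library(dplyr)

# Hypothetical tables sharing the key column ID
left_tbl  <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- data.frame(ID = c(1, 3, 4), y = c(10, 30, 40))

# Inner join: only IDs present in both tables
inner <- inner_join(left_tbl, right_tbl, by = "ID")

# Left join: every row of left_tbl; y is NA where right_tbl has no match
left <- left_join(left_tbl, right_tbl, by = "ID")
```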
union(x, y)
Rows that appear in either or both x and y.
setdiff(x, y)
Rows that appear in x but not y.
intersect(x, y)
Rows that appear in both x and y.
UNION displays all rows from both the tables and removes duplicate records from the combined dataset. The union_all function keeps duplicate rows in the combined dataset.
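A short sketch of the set operations on two hypothetical single-column data frames (first and second are made-up names and values):

```r
library(dplyr)

# Hypothetical data frames with identical columns
first  <- data.frame(id = c(1, 2, 3))
second <- data.frame(id = c(2, 3, 4))

both     <- dplyr::intersect(first, second)  # rows present in both
either   <- dplyr::union(first, second)      # distinct rows from either
all_rows <- dplyr::union_all(first, second)  # every row, duplicates kept
only_x   <- dplyr::setdiff(first, second)    # rows in first but not in second
```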
Syntax :
if_else(condition, true, false, missing = NULL)
If a value is less than 5, add 1 to it; if it is greater than or equal to 5, add 2 to it. If the value is missing (NA), return 0.
df =data.frame(x = c(1,5,6,NA))
df %>% mutate(newvar=if_else(x<5, x+1, x+2,0))
Output
Nested IF ELSE
Multiple IF ELSE statements can be written by nesting the if_else() function. See the example below -
mydf =data.frame(x = c(1:5,NA))
mydf %>% mutate(newvar= if_else(is.na(x),"I am missing",
if_else(x==1,"I am one",
if_else(x==2,"I am two",
if_else(x==3,"I am three","Others")))))
Output
x newvar
1 1 I am one
2 2 I am two
3 3 I am three
4 4 Others
5 5 Others
6 NA I am missing
SQL-Style CASE WHEN Statement
We can use the case_when() function to write nested if-else logic. Inside case_when(), you can refer to variables directly. TRUE acts as the ELSE branch.
mydf %>% mutate(flag = case_when(is.na(x) ~ "I am missing",
x == 1 ~ "I am one",
x == 2 ~ "I am two",
x == 3 ~ "I am three",
TRUE ~ "Others"))
Important Point
Make sure the is.na() condition comes first in a nested if-else. Otherwise, the comparisons evaluated before it return NA for missing values and the is.na() branch is never reached.
Suppose you want to find the maximum value in each row across the variables Y2012, Y2013, Y2014 and Y2015.
The rowwise() function allows you to apply functions to rows.
df = mydata %>%
rowwise() %>% mutate(Max= max(Y2012,Y2013,Y2014,Y2015)) %>%
select(Y2012:Y2015,Max)
Example 40 : Combine Data Frames
Suppose you are asked to combine two data frames. Let's first create two sample datasets.
df1=data.frame(ID = 1:6, x=letters[1:6])
df2=data.frame(ID = 7:12, x=letters[7:12])
Input Datasets
The bind_rows() function combines two datasets by rows. The combined dataset therefore contains 12 rows (6+6) and 2 columns.
xy = bind_rows(df1,df2)
It is equivalent to base R function rbind.
xy = rbind(df1,df2)
The bind_cols() function combines two datasets by columns. The combined dataset therefore contains 4 columns and 6 rows.
xy = bind_cols(df1,df2)
or
xy = cbind(df1,df2)
The output is shown below-
cbind Output
The quantile() function is used to determine Nth percentile value. In this example, we are
computing percentile values by variable Index.
mydata %>% group_by(Index) %>%
summarise(Percentile_25=quantile(Y2015, probs=0.25),
Percentile_50=quantile(Y2015, probs=0.5),
Percentile_75=quantile(Y2015, probs=0.75),
Percentile_99=quantile(Y2015, probs=0.99))
This example explains the advanced usage of do() function. In this example, we are building
linear regression model for each level of a categorical variable. There are 3 levels in variable cyl
of dataset mtcars.
length(unique(mtcars$cyl))
Result : 3
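A minimal sketch of fitting one model per group with do(), using the built-in mtcars data. The choice of mpg ~ disp as the model formula is an assumption for illustration. Note that do() is superseded in recent dplyr releases, although it still works.

```r
library(dplyr)

# Fit mpg ~ disp separately for each level of cyl;
# the fitted models are stored in the list column 'mod'
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ disp, data = .))

# One R-squared value per fitted model
rsq <- models %>% summarise(rsq = summary(mod)$r.squared)
```

Each row of models holds one lm object, so any function that accepts a fitted model can be applied row by row.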
dplyr also includes conditional variants such as select_if, mutate_if and summarise_if. They act only on the columns for which a logical condition holds. See the examples below.
The select_if() function returns only those columns where the logical condition is TRUE.
Passing is.numeric retains only the numeric variables.
mydata2 = select_if(mydata, is.numeric)
Similarly, you can use the following code for selecting factor columns -
mydata3 = select_if(mydata, is.factor)
Example 44 : Number of levels in factor variables
Like the select_if() function, the summarise_if() function lets you summarise only those variables where a logical condition holds.
summarise_if(mydata, is.factor, funs(nlevels(.)))
It returns 19 levels for variable Index and 51 levels for variable State.
Conclusion:
Experiment No. 8
Aim: Data Manipulation with data.table package
Theory:
The data.table R package is widely regarded as one of the fastest packages for data manipulation. This tutorial includes various examples and practice questions to make you familiar with the package.
Analysts often consider R unsuitable for big datasets ( > 10 GB) because it loads everything into RAM and is not memory efficient. The 'data.table' package was written to change that perception. It was designed to be concise and painless. Many benchmarks have compared dplyr with data.table, and data.table usually comes out ahead. Its efficiency has also been compared with Python's pandas package, again in data.table's favour. On CRAN, more than 200 packages depend on data.table, which places it among the top 5 R packages by that measure.
data.table Syntax
DT[ i , j , by]
1. The first parameter of data.table i refers to rows. It implies subsetting rows. It is
equivalent to WHERE clause in SQL
2. The second parameter of data.table j refers to columns. It implies subsetting
columns (dropping / keeping). It is equivalent to SELECT clause in SQL.
3. The third parameter of data.table by refers to adding a group so that all
calculations would be done within a group. Equivalent to SQL's GROUP BY clause.
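The three parts can be seen together in a short sketch. The table DT and its values are hypothetical and serve only to illustrate the DT[i, j, by] form.

```r
library(data.table)

# Hypothetical flight-like table
DT <- data.table(origin = c("JFK", "JFK", "LGA", "LGA", "EWR"),
                 delay  = c(10, 20, 5, 15, 30))

# i: keep rows with delay > 5
# j: compute the mean delay
# by: one result row per origin
res <- DT[delay > 5, .(mean_delay = mean(delay)), by = origin]
```

In SQL terms this corresponds roughly to SELECT origin, AVG(delay) ... WHERE delay > 5 GROUP BY origin.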
The data.table syntax is NOT RESTRICTED to only 3 parameters. There are other
arguments that can be added to data.table syntax. The list is as follows -
1. with, which
2. allow.cartesian
3. roll, rollends
4. .SD, .SDcols
5. on, mult, nomatch
The above arguments would be explained in the latter part of the post.
How to Install and load data.table Package
install.packages("data.table")
#load required library
library(data.table)
Read Data
In data.table package, fread() function is available to read or get data from your computer or
from a web page. It is equivalent to read.csv() function of base R.
mydata = fread("https://github.com/arunsrinivasan/satrdays-workshop/raw/master/flights_2014.csv")
Describe Data
This dataset contains 253K observations and 17 columns. It constitutes information about flights'
arrival or departure time, delays, flight cancellation and destination in year 2014.
nrow(mydata)
[1] 253316
ncol(mydata)
[1] 17
names(mydata)
[1] "year" "month" "day" "dep_time" "dep_delay" "arr_time" "arr_delay"
[8] "cancelled" "carrier" "tailnum" "flight" "origin" "dest" "air_time"
[15] "distance" "hour" "min"
head(mydata)
year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight
1: 2014 1 1 914 14 1238 13 0 AA N338AA 1
2: 2014 1 1 1157 -3 1523 13 0 AA N335AA 3
3: 2014 1 1 1902 2 2224 9 0 AA N327AA 21
4: 2014 1 1 722 -8 1014 -26 0 AA N3EHAA 29
5: 2014 1 1 1347 2 1706 1 0 AA N319AA 117
6: 2014 1 1 1824 4 2145 0 0 AA N3DEAA 119
origin dest air_time distance hour min
1: JFK LAX 359 2475 9 14
2: JFK LAX 363 2475 11 57
3: JFK LAX 351 2475 19 2
4: LGA PBI 157 1035 7 22
5: JFK LAX 350 2475 13 47
6: EWR LAX 339 2454 18 24
Selecting or Keeping Columns
Suppose you need to select only 'origin' column. You can use the code below -
The following code tells R to select 'origin', 'year', 'month', 'hour' columns.
You can keep second through fourth columns using the code below -
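The selections described above can be sketched as follows, on a small hypothetical table (DT) that stands in for mydata:

```r
library(data.table)

# Hypothetical stand-in for mydata
DT <- data.table(year = 2014, month = 1:3, hour = c(9, 11, 19),
                 origin = c("JFK", "LGA", "EWR"))

v    <- DT[, origin]                        # a character vector
one  <- DT[, .(origin)]                     # a one-column data.table
some <- DT[, .(origin, year, month, hour)]  # several columns by name
pos  <- DT[, 2:4, with = FALSE]             # second through fourth columns
```

Wrapping the column names in .() keeps the result a data.table, while a bare column name returns a plain vector.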
Dropping a Column
Suppose you want to include all the variables except one column, say 'origin'. It can easily be done by adding the ! sign (which implies negation in R).
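A minimal sketch of dropping a column by negation, again on a hypothetical table rather than mydata:

```r
library(data.table)

# Hypothetical table
DT <- data.table(year = 2014, month = 1:3,
                 origin = c("JFK", "LGA", "EWR"))

# All columns except origin; with = FALSE treats the names as strings
rest <- DT[, !c("origin"), with = FALSE]
```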
You can use %like% operator to find pattern. It is same as base R's grepl() function, SQL's
LIKE operator and SAS's CONTAINS function.
Rename Variables
You can rename variables with setnames() function. In the following code, we are renaming a
variable 'dest' to 'destination'.
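As a sketch, renaming on a hypothetical one-row table looks like this:

```r
library(data.table)

# Hypothetical table
DT <- data.table(origin = "JFK", dest = "LAX")

# setnames() renames in place; no copy of DT is made
setnames(DT, "dest", "destination")
```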
Suppose you are asked to find all the flights whose origin is 'JFK'.
The following program selects all the flights whose origin is not equal to 'JFK' and 'LGA'
# Exclude Values
dat10 = mydata[!origin %in% c("JFK", "LGA")]
If you need to select all the flights whose origin is equal to 'JFK' and carrier = 'AA'
dat11 = mydata[origin == "JFK" & carrier == "AA"]
data.table uses binary search algorithm that makes data manipulation faster.
Binary search is an efficient algorithm for finding a value in a sorted list of values. It repeatedly halves the portion of the list that could contain the value, until the value is found.
Suppose you have the following values in a variable :
Without this algorithm, we would have to scan the whole list of seven values to find the value 5.
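To make the idea concrete, here is a minimal sketch of binary search in plain R. The function binary_search and the seven values are assumptions for illustration; data.table's internal implementation is not shown here.

```r
# Binary search over a sorted numeric vector.
# Returns the position of target, or NA if it is absent.
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L          # middle of the remaining range
    if (sorted[mid] == target) return(mid)
    if (sorted[mid] < target) {
      lo <- mid + 1L                 # discard the lower half
    } else {
      hi <- mid - 1L                 # discard the upper half
    }
  }
  NA_integer_
}

# Illustrative values; binary search requires them to be sorted first
values <- sort(c(5, 10, 7, 20, 3, 13, 26))
pos <- binary_search(values, 5)
```

Each pass discards half of the remaining range, so seven values need at most three comparisons instead of up to seven.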
It is important to set a key in your dataset, which tells the system that the data is sorted by the key column. For example, suppose you have each employee's name, address, salary, designation, department and employee ID. We can use 'employee ID' as a key to search for a particular employee.
Set Key
You don't need to refer to the key column when you apply a filter.
data12 = mydata[c("JFK", "LGA")]
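A minimal sketch of setting a key and filtering on it, using a hypothetical four-row table in place of mydata:

```r
library(data.table)

# Hypothetical table
DT <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 flight = c(1, 2, 3, 4))

setkey(DT, origin)            # sorts DT by origin and marks it as the key
keyed <- DT[c("JFK", "LGA")]  # keyed lookup; the column name is not needed
```

After setkey(), the rows are physically sorted by origin, which is what allows the lookup to use binary search.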
Performance Comparison
You can compare performance of the filtering process (With or Without KEY).
We can also set keys to multiple columns like we did below to columns 'origin' and 'dest'. See the
example below.
# First key column 'origin' matches “JFK” and second key column 'dest' matches “MIA”
mydata[.("JFK", "MIA")]
It is equivalent to the following code :
key(mydata)
Result : It returns origin and dest as these are columns that are set keys.
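The two-column key can be sketched on a hypothetical table as follows; by_scan shows an ordinary logical filter that should return the same rows as the keyed lookup.

```r
library(data.table)

# Hypothetical table
DT <- data.table(origin = c("JFK", "JFK", "LGA"),
                 dest   = c("MIA", "LAX", "MIA"),
                 flight = 1:3)

setkey(DT, origin, dest)                        # two key columns
by_key  <- DT[.("JFK", "MIA")]                  # lookup on both keys
by_scan <- DT[origin == "JFK" & dest == "MIA"]  # same rows via a logical filter
keys    <- key(DT)                              # the current key columns
```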
Sorting Data
We can sort data using the setorder() function. By default, it sorts data in ascending order.
In this example, we tell R to reorder the data first by 'origin' in ascending order and then by 'carrier' in descending order.
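A short sketch of both sorts on a hypothetical table:

```r
library(data.table)

# Hypothetical table
DT <- data.table(origin  = c("LGA", "JFK", "JFK"),
                 carrier = c("AA", "AA", "DL"))

setorder(DT, origin)            # ascending by origin
setorder(DT, origin, -carrier)  # origin ascending, carrier descending
```

setorder() reorders the table in place, so no assignment is needed.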
You can add a new column, or modify an existing one, with the := operator. In this example, we are subtracting the 'dep_delay' variable from the 'dep_time' variable to compute the scheduled departure time.
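As a sketch, with two hypothetical rows and dep_sch as an assumed name for the new column:

```r
library(data.table)

# Hypothetical rows
DT <- data.table(dep_time = c(914, 1157), dep_delay = c(14, -3))

# := adds the column dep_sch to DT in place
DT[, dep_sch := dep_time - dep_delay]
```

Note that this is plain arithmetic on HHMM-style numbers, so results that cross an hour boundary are not valid clock times.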
IF THEN ELSE
The 'IF THEN ELSE' conditions are very popular for recoding values. In data.table package, it
can be done with the following methods :
It means to set flag= 1 if min is less than 50. Otherwise, set flag =0.
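Two possible ways to write this recode are sketched below on a hypothetical column of minutes; flag2 is an illustrative name for the second variant.

```r
library(data.table)

# Hypothetical minutes
DT <- data.table(min = c(14, 57, 2))

DT[, flag := ifelse(min < 50, 1, 0)]  # method 1: ifelse()
DT[, flag2 := 1 * (min < 50)]         # method 2: logical converted to numeric
```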
We can use this format - DT[ ] [ ] [ ] to build a chain in data.table. This is similar to sub-queries in SQL.
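A minimal sketch of a chain, on a hypothetical table: the first pair of brackets aggregates and the second sorts the aggregated result.

```r
library(data.table)

# Hypothetical table
DT <- data.table(carrier = c("AA", "AA", "DL", "DL"),
                 delay   = c(10, 30, 5, 15))

# First [] computes the mean per carrier, second [] sorts in descending order
res <- DT[, .(mean_delay = mean(delay)), by = carrier][order(-mean_delay)]
```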
Like SAS PROC MEANS procedure, we can generate summary statistics of specific variables.
In this case, we are calculating mean, median, minimum and maximum value of variable
arr_delay.
To summarize multiple variables, we can simply write all the summary statistics function in a
bracket. See the command below-
Summary by group
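The summaries described above, including the grouped variant, can be sketched on a hypothetical table as follows. The column names mirror the flights data, but the values are made up.

```r
library(data.table)

# Hypothetical table
DT <- data.table(origin    = c("JFK", "JFK", "LGA"),
                 arr_delay = c(10, 20, 6),
                 dep_delay = c(4, 8, NA))

# Several statistics of one variable
stats <- DT[, .(mean = mean(arr_delay), median = median(arr_delay),
                min = min(arr_delay), max = max(arr_delay))]

# The same function over several variables via .SD
means <- DT[, lapply(.SD, mean, na.rm = TRUE),
            .SDcols = c("arr_delay", "dep_delay")]

# Summary by group
by_origin <- DT[, .(mean_arr_delay = mean(arr_delay)), by = origin]
```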
Remove Duplicates
You can remove non-unique / duplicate cases with the unique() function. Suppose you want to eliminate duplicates based on a variable, say carrier.
unique(mydata, by = "carrier")
Suppose you want to remove duplicates based on all the variables. You can use the command below -
unique(mydata)
Note : Since data.table 1.9.8, unique() compares all columns by default and no longer uses the key, so the column must be named in the by argument.
In SQL, Window functions are very useful for solving complex data problems. RANK OVER
PARTITION is the most popular window function. It can be easily translated in data.table with
the help of frank() function. frank() is similar to base R's rank() function but much faster. See
the code below.
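A minimal sketch of a rank within groups, on a hypothetical table. Ranking by descending distance and the ties.method value are assumptions for illustration.

```r
library(data.table)

# Hypothetical table
DT <- data.table(carrier  = c("AA", "AA", "AA", "DL", "DL"),
                 distance = c(100, 300, 300, 50, 70))

# Rank by descending distance within each carrier; tied values share the lowest rank
DT[, rank := frank(-distance, ties.method = "min"), by = carrier]
```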
The lag and lead of a variable can be calculated with shift() function. The syntax of shift()
function is as follows - shift(variable_name, number_of_lags, type=c("lag", "lead"))
DT <- data.table(A=1:5)
DT[ , X := shift(A, 1, type="lag")]
DT[ , Y := shift(A, 1, type="lead")]
Lag and Lead Function
We can use %between% operator to define a range. It is inclusive of the values of both the ends.
DT = data.table(x=6:10)
DT[x %between% c(7,9)]
The %like% operator is mainly used to find all the values that match a pattern.
DT = data.table(Name=c("dep_time","dep_delay","arrival"), ID=c(2,3,4))
DT[Name %like% "dep"]
Merging / Joins
The merging in data.table is very similar to base R merge() function. The only difference is
data.table by default takes common key variable as a primary key to merge two datasets.
Whereas, data.frame takes common variable name as a primary key to merge the datasets.
Sample Data
Left Join
It returns all observations from the left dataset and the matched observations from the right
dataset.
Right Join
It returns all observations from the right dataset and the matched observations from the left
dataset.
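The join types can be sketched with merge() on two small hypothetical tables (dt1 and dt2 are illustrative and not the original sample data):

```r
library(data.table)

dt1 <- data.table(A = c("a", "b", "c"), X = 1:3)  # hypothetical left table
dt2 <- data.table(A = c("b", "c", "d"), Y = 4:6)  # hypothetical right table

inner_dt <- merge(dt1, dt2, by = "A")                # keys b, c
left_dt  <- merge(dt1, dt2, by = "A", all.x = TRUE)  # keys a, b, c
right_dt <- merge(dt1, dt2, by = "A", all.y = TRUE)  # keys b, c, d
full_dt  <- merge(dt1, dt2, by = "A", all = TRUE)    # keys a, b, c, d
```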
You can use the setDF() function to convert a data.table to a data.frame.
setDF(mydata)
Similarly, you can use the setDT() function to convert a data frame to a data table.
set.seed(123)
X = data.frame(A=sample(3, 10, TRUE),
B=sample(letters[1:3], 10, TRUE))
setDT(X, key = "A")
Reshape Data
data.table includes several useful functions that make data cleaning easy and smooth. To reshape or transpose data, you can use the dcast.data.table() and melt.data.table() functions. These functions are adapted from the reshape2 package and made more efficient, with some new features added.
Rolling Joins
data.table supports rolling joins. They are commonly used for analyzing time series data. Very few R packages support this kind of join.
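A minimal sketch of a rolling join, assuming two hypothetical tables: quotes holds prices at certain times, and trades holds times that fall between them. With roll = TRUE, each trade takes the most recent earlier quote.

```r
library(data.table)

quotes <- data.table(time = c(1, 5, 10), price = c(100, 101, 102))  # hypothetical
trades <- data.table(time = c(2, 6, 11))                            # hypothetical

# For each trade time, roll the last known price forward
res <- quotes[trades, on = "time", roll = TRUE]
```

Without roll = TRUE, none of the trade times would match exactly and price would be NA for every row.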
Q1. Calculate the total number of rows by month and then sort in descending order
Q3. Find the origins of flights whose average total delay is greater than 20 minutes
Q4. Extract average of arrival and departure delays for carrier == 'DL' by 'origin' and 'dest'
variables
mydata[carrier == "DL",
lapply(.SD, mean, na.rm = TRUE),
by = .(origin, dest),
.SDcols = c("arr_delay", "dep_delay")]
Q5. Pull first value of 'air_time' by 'origin' and then sum the returned values when it is greater
than 300
What is ggplot2?
ggplot2 is a robust and versatile R package, developed by the well-known R developer Hadley Wickham, for generating aesthetic plots and charts.
The "gg" in ggplot2 stands for "Grammar of Graphics", which is based on the principle that a plot can be split into the following basic parts -
Plot = data + Aesthetics + Geometry
1. data refers to a data frame (dataset).
2. Aesthetics indicates x and y variables. It is also used to tell R how data are
displayed in a plot, e.g. color, size and shape of points etc.
3. Geometry refers to the type of graphics (bar chart, histogram, box plot, line plot,
density plot, dot plot etc.)
Apart from the above three parts, there are other important parts of plot -
1. Faceting implies the same type of graph can be applied to each subset of the data.
For example, for variable gender, creating 2 graphs for male and female.
2. Annotation lets you add text to the plot.
3. Summary Statistics allows you to add descriptive statistics on a plot.
4. Scales are used to control x and y axis limits
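The parts listed above can be seen together in one short sketch on the built-in mtcars data. The choice of variables is an assumption for illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars,               # data
            aes(x = wt, y = mpg)) +      # aesthetics
  geom_point() +                         # geometry
  facet_wrap(~cyl) +                     # faceting: one panel per cyl value
  labs(title = "Mileage vs weight")      # annotation
```

Each part is added with +, so a plot can be built up and modified one layer at a time.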
Why ggplot2 is better?
The table below shows common charts along with various important functions used in these
charts.
Important Plots          Important Functions
Datasets
In this experiment, we will use three datasets - 'iris', 'mpg' and 'mtcars'. The 'iris' and 'mtcars' datasets are available in R, and 'mpg' ships with the ggplot2 package.
1. The 'iris' data comprises 150 observations of 5 variables. It covers 3 species of flowers (setosa, versicolor and virginica), and for each flower the sepal length and width and the petal length and width are provided.
2. The 'mtcars' data consists of fuel consumption (mpg) and 10 aspects of automobile design and performance for 32 automobiles. In other words, we have 32 observations and 11 different variables.
Histogram, Density plots and Box plots are used for visualizing a continuous variable.
Creating Histogram:
First we consider the iris data to create a histogram and a scatter plot.
1. In aes( ), i.e. aesthetics, we define which variable is represented on the x-axis; here we use 'Sepal.Length'.
2. geom_histogram( ) denotes that we want to plot a histogram.
Histogram in R
To change the width of bin in the histograms we can use binwidth in geom_histogram( )
ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram(binwidth=1)
One can also define the desired number of bins with the bins argument; the bin width is then adjusted automatically.
Using color = "black" and fill = "white" we set the border color and the inside color of the bins respectively.
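The histogram variants described above can be sketched as follows. The value bins = 10 is an assumption for illustration.

```r
library(ggplot2)

# Default histogram of Sepal.Length
h1 <- ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram()

# Fix the number of bins; the bin width is derived from the data range
h2 <- ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram(bins = 10)

# Black bin borders with a white fill
h3 <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(color = "black", fill = "white", bins = 10)
```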
Now mpg data will be used for creating the following graphics.
Here, bar of SUV appears first as it has maximum number of cars. Now bars are ordered based
on frequency count.
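A minimal sketch of a bar chart of car classes, first in default order and then ordered by frequency. Using reorder() with a counting function is one possible way to get the ordering and is an assumption about the original code.

```r
library(ggplot2)

# Bars in the default (alphabetical) order of class
b1 <- ggplot(mpg, aes(x = class)) + geom_bar()

# Bars ordered by decreasing frequency: reorder() sorts the levels of
# class by -length(x), so the largest group comes first
b2 <- ggplot(mpg, aes(x = reorder(class, class, function(x) -length(x)))) +
  geom_bar() +
  xlab("class")

# The level order that b2 uses on the x axis
ordered_levels <- levels(reorder(mpg$class, mpg$class, function(x) -length(x)))
```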
Now using dplyr library we create a new dataframe 'df' and try to plot it.
Using group_by we group the data according to various types of cars and summarise enables us
to find the statistics (here mean for 'displ' variable) for each group. To add data labels (with 2
decimal places) we use geom_text( )
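The steps above can be sketched as follows. The label position (vjust) and the label color are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Mean engine displacement per class of car
df <- mpg %>%
  group_by(class) %>%
  summarise(mean = mean(displ))

# Bars drawn from the precomputed means, labelled to two decimal places
p <- ggplot(df, aes(x = class, y = mean)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%0.2f", mean)), vjust = 1.6, color = "white")
```

stat = "identity" tells geom_bar() to use the values in df as bar heights instead of counting rows.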
Customized BarPlot
p + geom_bar(stat="identity", position=position_dodge())
Stacked - Position_dodge
Creating BoxPlot
To create different boxplots for 'disp' for different levels of x we can define aes(x = cyl, y = disp)
mtcars$cyl = factor(mtcars$cyl)
ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot()
Scatter Plot
A scatterplot is used to graphically represent the relationship between two continuous variables.
# Creating a scatter plot denoting various species.
ggplot(data = iris, aes( x = Sepal.Length, y = Sepal.Width,shape = Species, color = Species)) +
geom_point()
We plot the points using geom_point( ). In the aesthetics we define that the x axis denotes sepal length and the y axis denotes sepal width; shape = Species and color = Species mean that a different shape and a different color are used for each species of flower.
Scatter Plot
We use the subset( ) function to select only those cars which have am = 0, i.e. only the cars with automatic transmission. We plot displacement against mileage and use a different color for each number of cylinders. factor(cyl) converts the numeric variable cyl to a factor so that it gets discrete colors.
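A minimal sketch of the plot described above, on the built-in mtcars data:

```r
library(ggplot2)

# Only automatic cars (am == 0); cyl is converted to a factor for discrete colors
s <- ggplot(data = subset(mtcars, am == 0),
            aes(x = mpg, y = disp, colour = factor(cyl))) +
  geom_point()
```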
Here in geom_point we have added an optional argument size = 2.5 denoting the size of the points. geom_line( ) creates a line. Note that we have not provided an x aesthetic in geom_line, so it inherits x from the main plot and draws horsepower (hp) against mileage (mpg).
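A sketch of such a plot is given below. Mapping colour to hp in the main aesthetics is an assumption, since the original code is not shown.

```r
library(ggplot2)

# Points show disp against mpg. The line layer overrides only y,
# so it inherits x = mpg and draws hp against mpg.
g <- ggplot(data = mtcars, aes(x = mpg, y = disp, colour = hp)) +
  geom_point(size = 2.5) +
  geom_line(aes(y = hp))
```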
Modifying the axis labels and appending the title and subtitle
#Adding title or changing the labels
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + labs(title = "Scatter plot")
#Alternatively
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot")
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot",
subtitle = "mtcars data in R")
Here labs( ) sets the plot title (it can also change the legend title), and ggtitle( ) likewise assigns a title to the graph. To add both a title and a subtitle we can use ggtitle( ), where the first argument is the main title and the second argument is the subtitle.
a <- ggplot(mtcars,aes(x = mpg, y = disp, color = factor(cyl))) + geom_point()
a
#Changing the axis labels.
a + labs(color = "Cylinders")
a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement")
We first save our plot to 'a' and then make the alterations.
Note that in the labs command we are using color = "Cylinders" which changes the title of our
legend.
Using the xlab and ylab commands we can change the x and y axis labels respectively. Here our
x axis label is 'mileage' and y axis label is 'displacement'
#Combining it all
a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement") + ggtitle(label =
"Scatter plot", subtitle = "mtcars data in R")
In the above plot we can see that the labels on x axis,y axis and legend have changed; the title
and subtitle have been added and the points are colored, distinguishing the number of cylinders.
We use theme( ) to modify the plot title and background. plot.title is an element_text( ) object in which we specify the color and size of the title. With plot.background, which is an element_rect( ) object, we specify the color of the background.
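A minimal sketch of such a theme change. The base plot b and the colors are assumptions for illustration, since the original definition of b is not shown.

```r
library(ggplot2)

# b is a hypothetical base plot used for the theme examples
b <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(title = "Scatter Plot")

b_themed <- b + theme(
  plot.title      = element_text(color = "blue", size = 17),  # title color and size
  plot.background = element_rect(fill = "orange")             # background fill
)
```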
ggplot2 offers built-in themes that change the background and panel design automatically. Some of them are theme_gray, theme_minimal, theme_dark etc.
b + theme_minimal( )
We can observe horizontal and vertical lines behind the points. What if we don't need them? This
can be achieved via:
#Removing the lines from the background.
b + theme(panel.background = element_blank())
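Setting panel.background = element_blank( ) removes the panel's background color. The grid lines of the default theme are white, so they are no longer visible either. To remove the grid lines explicitly, panel.grid = element_blank( ) can be used.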
#Removing the text from x and y axis.
b + theme(axis.text = element_blank())
b + theme(axis.text.x = element_blank())
b + theme(axis.text.y = element_blank())
To remove the text from both axes we can use axis.text = element_blank( ). If we want to remove the text from only one axis then we need to specify it.
Now we save our plot to c and then make the changes.
#Changing the legend position
c <- ggplot(mtcars,aes(x = mpg, y = disp, color = hp)) +labs(title = "Scatter Plot") +
geom_point()
c + theme(legend.position = "top")
If we want to move the legend then we can specify legend.position as "top" or "bottom" or "left"
or "right".
Finally, combining everything we have learnt about themes, the following code creates a plot where the legend is placed at the bottom, the plot title is in forest green, the background is yellow and no text is displayed on either axis.
#Combining everything.
c + theme(legend.position = "bottom", axis.text = element_blank()) +
theme(plot.title = element_text(color = "Forest Green",size = 17),plot.background =
element_rect("Yellow"))
Scatter Plot
Changing the break points and color scale of the legend together.
Let us try changing the break points and the colors in the legend together by trial and error.
trial2
What is happening? The reason for this is that we cannot have 2 scale_color functions for a
single graph. If there are multiple scale_color_ functions then R overwrites the other
scale_color_ functions by the last scale_color_ command it has received.
In trial 1, scale_color_gradient overwrites the previous scale_color_continuous command.
Similarly in trial 2, scale_color_continuous overwrites the previous scale_color_gradient
command.
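One way around this, sketched below, is to pass both the break points and the colors to a single scale function. The break values and colors are assumptions for illustration.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = mpg, y = disp, color = hp)) + geom_point()

# One scale call carrying both the break points and the colors
combined <- base +
  scale_color_gradient(low = "yellow", high = "red",
                       breaks = seq(50, 350, by = 100))
```

Since only one colour scale is added, nothing is overwritten.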
Faceting.
Faceting is a technique which is used to plot the graphs for the data corresponding to various
categories of a particular variable. Let us try to understand it via an illustration:
The facet_wrap function is used for faceting; after the tilde (~) sign we specify the variable on which we want to split the data.
Faceting for carb
We see that there are 6 categories of "carb". Faceting creates 6 plots between mpg and disp;
where the points correspond to the categories.
We can mention the number of rows we need for faceting.
# Control the number of rows and columns with nrow and ncol
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb,nrow = 3)
Here an additional parameter nrow = 3 depicts that in total all the graphs should be adjusted in 3
rows.
In facet_grid(.~cyl), it facets the data by 'cyl' and the cylinders are represented in columns. If we
want to represent 'cyl' in rows, we write facet_grid(cyl~.). If we want to facet according to 2
variables we write facet_grid(gear~cyl) where gears are represented in rows and 'cyl' are
illustrated in columns.
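The three facet_grid() forms can be sketched on the mtcars data as follows:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(mpg, disp)) + geom_point()

f_cols <- base + facet_grid(. ~ cyl)     # cyl across columns
f_rows <- base + facet_grid(cyl ~ .)     # cyl down rows
f_both <- base + facet_grid(gear ~ cyl)  # gear in rows, cyl in columns
```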
geom_text_repel
Conclusion:
Experiment No: 10
Aim: Study and implementation of data transpose operations in R
Theory:
In R, we can transpose our data very easily. Packages such as tidyr and reshape2 help to make it easy. Both packages were written by the popular R expert Hadley Wickham.
Sample Data
The code below would create a sample data that would be used for demonstration.
Suppose you have data containing three variables, X, Y and Z. The variable 'X' contains IDs, the variable 'Y' contains dates and the variable 'Z' contains income. The data is structured in a long format and you need to convert it to wide format so that the dates become column names and the income values appear under these dates. The snapshot of data and desired output is shown below -
R : Convert Long to Wide Format
In the reshape2 package, there are two functions for transforming long-format data to wide format: "dcast" and "acast". The only difference between them is as follows :
1. dcast function returns a data frame as output.
2. acast function returns a vector, matrix or array as output.
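A minimal sketch of the long-to-wide conversion with dcast(). The data frame long and its values are made up; only the column roles (X = ID, Y = date, Z = income) follow the description above.

```r
library(reshape2)

# Hypothetical long-format data: X = ID, Y = date, Z = income
long <- data.frame(X = c("ID1", "ID1", "ID2", "ID2"),
                   Y = c("2012-06", "2012-07", "2012-06", "2012-07"),
                   Z = c(566, 823, 235, 769))

# One row per X, one column per value of Y, cells filled from Z
wide <- dcast(long, X ~ Y, value.var = "Z")
```

The left side of the formula names the ID variable that stays in rows, and the right side names the variable whose values become columns.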
Let's see another example wherein we have more than 1 ID variable. It contains information
about Income generated from 2 products - Product A and B reported semi-annually.
Output
It seems to be VERY EASY (just remove the additional ID variable 'SemiYear'). But it's a little
tricky. See the explanation below -
dcast(data, Year ~ Product, value.var = "Income")
Warning : Aggregation function missing: defaulting to length
We need to define the statistics to aggregate income at year level. Let's sum the income to
report annual score.
dcast(data, Year ~ Product, fun.aggregate = sum, value.var = "Income")
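The effect of the aggregation function can be sketched on a small made-up dataset (sales) with the same column names:

```r
library(reshape2)

# Hypothetical data: income per product, reported twice a year
sales <- data.frame(Year     = c(1, 1, 1, 1, 2, 2, 2, 2),
                    SemiYear = c(1, 2, 1, 2, 1, 2, 1, 2),
                    Product  = c("A", "A", "B", "B", "A", "A", "B", "B"),
                    Income   = c(10, 20, 30, 40, 50, 60, 70, 80))

# Keep both ID variables: one row per Year and SemiYear
semi <- dcast(sales, Year + SemiYear ~ Product, value.var = "Income")

# Collapse to Year: two rows map to each cell, so they are summed
annual <- dcast(sales, Year ~ Product, fun.aggregate = sum, value.var = "Income")
```

Without fun.aggregate, the second call would fall back to counting rows, which is what the warning above reports.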
Suppose you have data containing information of species and their sepal length. The data of
sepal length of species are in columns.
library(reshape2)
x = colnames(mydata[,-1])
t = melt(mydata,id.vars = "ID",measure.vars = x , variable.name="Species",
value.name="Sepal.Length",na.rm = TRUE)
How melt function works :
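A minimal sketch of melt() on a small hypothetical wide table; the column names and values are assumptions for illustration.

```r
library(reshape2)

# Hypothetical wide data: one sepal-length column per species
wide <- data.frame(ID = 1:2,
                   setosa = c(5.1, NA),
                   versicolor = c(7.0, 6.4))

# Stack the species columns into (Species, Sepal.Length) pairs, dropping NAs
long <- melt(wide, id.vars = "ID", variable.name = "Species",
             value.name = "Sepal.Length", na.rm = TRUE)
```

Each non-missing cell of the wide table becomes one row of the long table.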
Sample Data
data = read.table(text="
X Y Z
6 5 0
6 3 NA
6 1 5
8 5 3
1 NA 1
8 7 2
2 0 2", header=TRUE)
Apply Function
The apply() function is used when we want to apply a function to the rows or columns of a matrix or data frame. It cannot be applied to lists or vectors.
apply arguments
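Calculate maximum value across each row
apply(data, 1, max)
It returns NA if NAs exist in a row. To ignore NAs, you can use the following line of code.
apply(data, 1, max, na.rm = TRUE)
Calculate mean value across each row
apply(data, 1, mean)
apply(data, 1, mean, na.rm = TRUE)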
Calculate number of 0s in each row
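One possible way to count the zeros is sketched below. The sample values are rebuilt as a data frame, assuming the columns are X, Y and Z with one value per column in each row.

```r
# Sample values as a data frame (assumed layout: three columns X, Y, Z)
data <- data.frame(X = c(6, 6, 6, 8, 1, 8, 2),
                   Y = c(5, 3, 1, 5, NA, 7, 0),
                   Z = c(0, NA, 5, 3, 1, 2, 2))

# data == 0 is a logical matrix; summing each row counts the TRUE values
zeros <- apply(data == 0, 1, sum, na.rm = TRUE)
```

na.rm = TRUE makes missing cells count as "not zero" instead of turning the whole row's result into NA.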
Conclusion: