RESEARCH METHODOLOGY FOR COMMERCE LAB

BCOM 113
R PROGRAMMING
IN PRACTICAL FULFILLMENT OF
BACHELORS OF COMMERCE [ BCOM.(H) ]
[2022-2025]

GUIDED BY: DR. LAXMI SUBMITTED BY: ROMIL


02251488822

FAIRFIELD INSTITUTE OF MANAGEMENT AND TECHNOLOGY


(KAPASHERA, NEW DELHI)

AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY


(DWARKA, NEW DELHI)
OBJECTIVE

The objective of this lab is to understand various aspects of research, including the identification and use of
various statistical tests with the software tools available to a researcher. Labs are structured to give
students experience with conducting experiments, analyzing data, thinking critically about
theory and data, and communicating their results and analysis in writing and oral presentation.
TABLE OF CONTENTS

Introduction To R Programming
Introduction To Functions
Assignment No. 1 (Descriptive Statistics)
Assignment No. 2 (Implementation Of Various Operations On Matrix, Array And Factors In R)
Assignment No. 3 (Implementation Of Vector And List Data Objects Operations)
Assignment No. 4 (To Create Sample (Dummy) Data In R And Perform Data Manipulation With R)
Assignment No. 5 (Study And Implementation Of Various Control Structures In R)
Assignment No. 6 (Data Manipulation With Dplyr Package)

INTRODUCTION TO R PROGRAMMING
R operates in two modes: interactive and batch. The one typically used is interactive mode. In this mode,
you type in commands, R displays results, you type in more commands, and so on. On the other hand,
batch mode does not require interaction with the user. It’s useful for production jobs, such as when a
program must be run periodically, say once per day, because you can automate the process.
INTERACTIVE MODE
On a Linux or Mac system, start an R session by typing R on the command line in a terminal window. On a
Windows machine, start R by clicking the R icon. The result is a greeting and the R prompt, which is the >
sign. The screen will look something like this:
Type 'demo()' for some demos,
'help()' for on-line help, or 'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
You can then execute R commands. The window in which all this appears is called the R console. As a
quick example, consider a standard normal distribution—that is, with mean 0 and variance 1. If a random
variable X has that distribution,
then its values are centered around 0, some negative, some positive, averaging in the end to 0. Now form a
new random variable Y = |X|. Since we’ve taken the absolute value, the values of Y will not be centered
around 0, and the mean of Y will be positive. Let’s find the mean of Y. Our approach is based on a
simulated example of N(0,1) variates.
>mean(abs(rnorm(100)))
[1] 0.7194236
This code generates the 100 random variates, finds their absolute values, and then finds the mean of the
absolute values.
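As a rough check on the simulated value, the mean of |X| for a standard normal X is known to be sqrt(2/pi), which is about 0.798. The sketch below compares a simulated mean with that value. The sample size of 100,000 and the seed are assumptions made for this illustration; they are not part of the original example.

```r
set.seed(1)                      # arbitrary seed so the run is repeatable
y <- abs(rnorm(100000))          # 100,000 draws of Y = |X|, with X ~ N(0,1)
sim_mean <- mean(y)              # simulated estimate of the mean of Y
theory <- sqrt(2 / pi)           # exact mean of |X| for a standard normal X
print(c(simulated = sim_mean, theoretical = theory))
```

With only 100 draws, as in the example above, the estimate can easily differ from the theoretical value in the first or second decimal place.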
The [1] you see means that the first item in this line of output is item 1. In this case, our output consists of
only one line (and one item), so this is redundant. This notation becomes helpful when you need to read
voluminous output that consists of a lot of items spread over many lines. For example, if there were two
rows of output with six items per row, the second row would be labeled [7].
>rnorm(10)
[1] -0.6427784 -1.0416696 -1.4020476 -0.6718250 -0.9590894 -0.8684650
[7] -0.5974668 0.6877001 1.3577618 -2.2794378
Here, there are 10 values in the output, and the label [7] in the second row lets you quickly see that
0.6877001, for instance, is the eighth output item.
You can also store R commands in a file. By convention, R code files have the suffix .R or .r. If you create
a code file called z.R, you can execute the contents of that file by issuing the following command:
>source("z.R")
BATCH MODE
Sometimes it’s convenient to automate R sessions. For example, you may wish to run an R script that
generates a graph without needing to bother with manually launching R and executing the script yourself.
Here you would run R in batch mode.
As an example, let’s put our graph-making code into a file named z.R with the following contents:
pdf("xh.pdf") # set graphical output file
hist(rnorm(100)) # generate 100 N(0,1) variates and plot their histogram
dev.off() # close the graphical output file
The items marked with # are comments. They’re ignored by the R interpreter. Comments serve as notes to
remind us and others what the code is doing, in a human-readable format.
Here’s a step-by-step breakdown of what we’re doing in the preceding code:

• We call the pdf() function to inform R that we want the graph we create to be saved in the PDF file xh.pdf.
• We call rnorm() (for random normal) to generate 100 N(0,1) random variates.
• We call hist() on those variates to draw a histogram of these values.
• We call dev.off() to close the graphical “device” we are using, which is the file xh.pdf in this case. This is
the mechanism that actually causes the file to be written to disk.
We could run this code automatically, without entering R’s interactive mode, by invoking R with an
operating system shell command (such as at the $ prompt commonly used in Linux systems):

$ R CMD BATCH z.R

You can confirm that this worked by using your PDF viewer to display the saved histogram. (It will just be
a plain-vanilla histogram, but R is capable of producing quite sophisticated variations.)
INTRODUCTION TO FUNCTIONS
As in most programming languages, the heart of R programming consists of writing functions. A function is
a group of instructions that takes inputs, uses them to compute other values, and returns a result.
As a simple introduction, let’s define a function named oddcount(), whose purpose is to count the odd
numbers in a vector of integers. Normally, we would compose the function code using a text editor and
save it in a file, but in this quick-and-dirty example, we’ll enter it line by line in R’s interactive mode. We’ll
then call the function on a couple of test cases.
# counts the number of odd integers in x
>oddcount <- function(x) {
+ k <- 0 # assign 0 to k
+ for (n in x) {
+ if (n %% 2 == 1) k <- k+1 # %% is the modulo operator
+}
+ return(k)
+}
>oddcount(c(1,3,5))
[1] 3
>oddcount(c(1,2,3,7,9))
[1] 4
First, we told R that we wanted to define a function named oddcount with one argument, x. The left brace
demarcates the start of the body of the function. We wrote one R statement per line.
Until the body of the function is finished, R reminds you that you’re still in the definition by using + as its
prompt, instead of the usual >. (Actually, + is a line-continuation character, not a prompt for a new input.) R
resumes the > prompt after you finally enter a right brace to conclude the function body.
After defining the function, we evaluated two calls to oddcount(). Since there are three odd numbers in the
vector (1,3,5), the call oddcount(c(1,3,5)) returns the value 3. There are four odd numbers in (1,2,3,7,9), so
the second call returns 4.
Notice that the modulo operator for remainder arithmetic is %% in R, as indicated by the comment. For
example, 38 divided by 7 leaves a remainder of 3:
>38 %% 7
[1] 3
Let’s look more closely at what happens in the loop:
for (n in x) {
if (n %% 2 == 1) k <- k+1
}
First, it sets n to x[1], and then it tests that value for being odd or even. If the value is odd, which is the case
here, the count variable k is incremented. Then n is set to x[2], tested for being odd or even, and so on.
By the way, C/C++ programmers might be tempted to write the preceding loop like this:
for (i in 1:length(x)) {
if (x[i] %% 2 == 1) k <- k+1
}
Here, length(x) is the number of elements in x. Suppose there are 25 elements. Then 1:length(x) means
1:25, which in turn means 1,2,3,...,25. This code would also work (unless x were to have length 0), but one
of the major themes of R programming is to avoid loops if possible; if not, keep loops simple. Look again
at our original formulation:
for (n in x) {
if (n %% 2 == 1) k <- k+1
}
It’s simpler and cleaner, as we do not need to resort to using the length() function and array indexing.
At the end of the code, we use the return statement:
return(k)
This has the function return the computed value of k to the code that called it. However, simply writing the
following also works:
k
R functions will return the last value computed if there is no explicit return() call. However, this approach
must be used with care.
In programming language terminology, x is the formal argument (or formal parameter) of the function
oddcount(). In the first function call in the preceding example, c(1,3,5) is referred to as the actual argument.
These terms allude to the fact that x in the function definition is just a placeholder, whereas c(1,3,5) is the
value actually used in the computation. Similarly, in the second function call, c(1,2,3,7,9) is the actual
argument.
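In keeping with the theme of avoiding loops where possible, the same count can be sketched without a loop by relying on R's vectorized operators. The name oddcount_vec is an assumption made for this illustration; it is not part of the original example.

```r
# vectorized version: x %% 2 == 1 gives a logical vector, and sum() counts the TRUEs
oddcount_vec <- function(x) {
  sum(x %% 2 == 1)
}

print(oddcount_vec(c(1, 3, 5)))
print(oddcount_vec(c(1, 2, 3, 7, 9)))
print(oddcount_vec(integer(0)))   # an empty vector needs no special handling
```

If an index-based loop is really needed, seq_along(x) is a safer choice than 1:length(x), because it yields an empty sequence when x has length 0.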
ASSIGNMENT NO. 1
Aim: Descriptive statistics

Let’s make a simple data set (in R parlance, a vector) consisting of the numbers 1, 2, and 4, and name it x:
>x <- c(1,2,4)
The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does not work
in some special situations. Note that there are no fixed types associated with variables. Here, we’ve
assigned a vector to x, but later we might assign something of a different type to it. Vectors and the other
types are covered in the later assignments.
The c stands for concatenate. Here, we are concatenating the numbers 1, 2, and 4. More precisely, we are
concatenating three one-element vectors that consist of those numbers. This is because any number is also
considered to be a one-element vector.
Now we can also do the following:
>q <- c(x,x,8)

which sets q to (1,2,4,1,2,4,8) (yes, including the duplicates).
Now let’s confirm that the data is really in x. To print the vector to the screen, simply type its name. If you
type any variable name (or, more generally, any expression) while in interactive mode, R will print out the
value of that variable (or expression). Programmers familiar with other languages such as Python will find
this feature familiar. For our example, enter this:
>x
[1] 1 2 4
Sure enough, x consists of the numbers 1, 2, and 4. Individual elements of a vector are accessed via [ ].
Here’s how we can print out the third element of x:
>x[3]
[1] 4
As in other languages, the selector (here, 3) is called the index or subscript. Those familiar with ALGOL-family
languages, such as C and C++, should note that elements of R vectors are indexed starting from 1,
not 0. Subsetting is a very important operation on vectors. Here’s an example:
>x <- c(1,2,4)
>x[2:3]
[1] 2 4
The expression x[2:3] refers to the subvector of x consisting of elements 2 through 3, which are 2 and 4
here. We can easily find the mean and standard deviation of our data set, as follows:
>mean(x)
[1] 2.333333
>sd(x)
[1] 1.527525
This again demonstrates typing an expression at the prompt in order to print it. In the first line, our
expression is the function call mean(x). The return value from that call is printed automatically, without
requiring a call to R’s print() function.
If we want to save the computed mean in a variable instead of just printing it to the screen, we could
execute this code:
>y <- mean(x)

Again, let’s confirm that y really does contain the mean of x:


>y
[1] 2.333333
As noted earlier, we use # to write comments, like this:
>y # print out y
[1] 2.333333
Comments are especially valuable for documenting program code, but they are useful in interactive
sessions, too, since R records the command history. If you save your session
and resume it later, the comments can help you remember what you were doing.
Finally, let’s do something with one of R’s internal data sets (these are used for demos). You can get a list
of these data sets by typing the following:
>data()
One of the data sets is called Nile and contains data on the flow of the Nile River. Let’s find the mean and
standard deviation of this data set:
>mean(Nile)
[1] 919.35
>sd(Nile)
[1] 169.2275
We can also plot a histogram of the data:
>hist(Nile)

A window pops up with the histogram in it, as shown in Figure 1-1. This graph is bare-bones simple, but R
has all kinds of optional bells and whistles for plotting. For instance, you can change the number of bins by
specifying the breaks argument. The call hist(z,breaks=12) would draw a histogram of the data set z with 12
bins. You can also create nicer labels, make use of color, and make many other changes to create a more
informative and eye-appealing graph. When you become more familiar with R, you’ll be able to construct
complex, rich color graphics of striking beauty.

Well, that’s the end of our first, five-minute introduction to R. Quit R by calling the q() function (or
alternatively by pressing CTRL-D in Linux or CMD-D on a Mac):

>q()
Save workspace image? [y/n/c]: n

That last prompt asks whether you want to save your variables so that you can resume work later. If you
answer y, then all those objects will be loaded automatically the next time you run R. This is a very
important feature, especially when working with large or numerous data sets. Answering y here also saves
the session’s command history. We’ll talk more about saving your workspace.
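Since the aim of this assignment is descriptive statistics, the measures used above can be gathered in one place. The following is a minimal sketch on the built-in Nile data set; the variable names nile and nile_stats are assumptions made for this illustration.

```r
nile <- as.numeric(Nile)          # Nile ships with R as a time series; keep the plain values
nile_stats <- c(
  n      = length(nile),          # number of observations
  mean   = mean(nile),
  median = median(nile),
  sd     = sd(nile),
  min    = min(nile),
  max    = max(nile)
)
print(round(nile_stats, 2))
print(quantile(nile, c(0.25, 0.75)))   # lower and upper quartiles
print(summary(nile))                    # minimum, quartiles, median, mean and maximum
```

The summary() call is often the quickest first look at a numeric variable, while the named vector is convenient when the values are needed later in a report.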
Assignment No. 2

Aim: Implementation Of Various Operations On Matrix, Array And Factors In R

Theory: Matrices are much used in statistics, and so play an important role in R. To create a matrix,
use the function matrix(), specifying elements by column first:

>matrix(1:12, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

This is called column-major order. Of course, we need only give one of the dimensions:

>matrix(1:12, nrow=3)

unless we want vector recycling to help us:

>matrix(1:3, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3

Sometimes it’s useful to specify the elements by row first:

>matrix(1:12, nrow=3, byrow=TRUE)

There are special functions for constructing certain matrices:

>diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
>diag(1:3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
>1:5 %o% 1:5
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25

The last operator performs an outer product, so it creates a matrix with (i, j)-th entry x_i y_j. The function
outer() generalizes this to any function f on two arguments, to create a matrix with entries f(x_i, y_j). (More
on functions later.)
>outer(1:3, 1:4, "+")
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    3    4    5    6
[3,]    4    5    6    7

Matrix multiplication is performed using the operator %*%, which is quite distinct from elementwise
multiplication *.

>A <- matrix(c(1:8,10), 3, 3)
>x <- c(1,2,3)
>A %*% x # matrix multiplication
     [,1]
[1,]   30
[2,]   36
[3,]   45

>A*x # NOT matrix multiplication
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    4   10   16
[3,]    9   18   30

Standard functions exist for common mathematical operations on matrices:

>t(A) # transpose
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8   10

>det(A) # determinant
[1] -3
>diag(A) # diagonal
[1] 1 5 10
>solve(A) # inverse
        [,1]    [,2] [,3]
[1,] -0.6667 -0.6667    1
[2,] -1.3333  3.6667   -2
[3,]  1.0000 -2.0000    1
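One way to check an inverse is to multiply it back by the original matrix. The sketch below does that for the matrix A used above and also shows the two-argument form of solve(), which solves a linear system directly. The names A_inv, identity_check, b and x_sol are assumptions made for this illustration.

```r
A <- matrix(c(1:8, 10), 3, 3)
A_inv <- solve(A)                 # inverse of A
identity_check <- A_inv %*% A     # should be the 3 x 3 identity, up to rounding error
print(round(identity_check, 10))

b <- c(30, 36, 45)                # the product A %*% c(1,2,3) computed earlier
x_sol <- solve(A, b)              # solves A %*% x = b without forming the inverse
print(x_sol)
```

When only the solution of a system is needed, solve(A, b) is generally preferable to computing solve(A) and multiplying, since it does less work and is less affected by rounding.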
ARRAY:
Of course, if we have a data set consisting of more than two pieces of categorical information about each
subject, then a matrix is not sufficient. The generalization of matrices to higher
dimensions is the array. Arrays are defined much like matrices, with a call to the array() command. Here is
a 2 × 3 × 3 array:

>arr = array(1:18, dim=c(2,3,3))


>arr
,,1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
,,2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
,,3
[,1] [,2] [,3]
[1,] 13 15 17
[2,] 14 16 18

Each 2-dimensional slice defined by the last co-ordinate of the array is shown as a 2 × 3 matrix. Note that
we no longer specify the number of rows and columns separately, but use a single vector dim whose length
is the number of dimensions. You can recover this vector with the dim() function.
>dim(arr)
[1] 2 3 3

Note that a 2-dimensional array is identical to a matrix. Arrays can be subsetted and modified in exactly the
same way as a matrix, only using the appropriate number of co-ordinates:
>arr[1,2,3]
[1] 15
>arr[,2,]
     [,1] [,2] [,3]
[1,]    3    9   15
[2,]    4   10   16

>arr[1,1,] = c(0,-1,-2) # change some values
>arr[,,1,drop=FALSE]
, , 1

     [,1] [,2] [,3]
[1,]    0    3    5
[2,]    2    4    6
Factors
R has a special data structure for storing categorical variables. Making a variable a factor tells R that it is
nominal or ordinal.

The factor function has three parameters:

1. Vector name
2. Values (optional)
3. Value labels (optional)

The simplest form of the factor function supplies only the vector name; the ideal form also supplies the
values and the value labels.

To convert a column "x" to a factor:
data$x = as.factor(data$x)
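The three parameters described above correspond to the first argument, levels and labels of R's factor() function. The following is a minimal sketch; the vector codes and its labels are hypothetical values chosen for this illustration.

```r
codes <- c(1, 2, 2, 1, 3)                          # hypothetical coded responses

# vector name, values (levels) and value labels (labels)
grp <- factor(codes,
              levels = c(1, 2, 3),
              labels = c("low", "medium", "high"))
print(grp)
print(levels(grp))
print(table(grp))                                  # frequency of each category

# ordered = TRUE marks the variable as ordinal, so comparisons are allowed
grp_ord <- factor(codes, levels = c(1, 2, 3),
                  labels = c("low", "medium", "high"), ordered = TRUE)
print(grp_ord[1] < grp_ord[2])
```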
Assignment No. 3
Aim: Implementation of vector and List data objects operations
THEORY: With R, it’s important to understand that there is a difference between the actual R object
and the manner in which that R object is printed to the console. Often, the printed output may have
additional bells and whistles to make the output more friendly to the users. However, these bells and
whistles are not inherently part of the object.
R has five basic or “atomic” classes of objects:

• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The most basic type of R object is a vector. Empty vectors can be created with the vector() function. There
is really only one rule about vectors in R, which is that a vector can only contain objects of the same class.
But of course, like any good rule, there is an exception, which is a list, which we will get to a bit later. A list
is represented as a vector but can contain objects of different classes. Indeed, that’s usually why we use
them.
There is also a class for “raw” objects, but they are not commonly used directly in data analysis.
CREATING VECTORS
The c() function can be used to create vectors of objects by concatenating things together.

>x <- c(0.5, 0.6) ## numeric
>x <- c(TRUE, FALSE) ## logical
>x <- c(T, F) ## logical
>x <- c("a", "b", "c") ## character
>x <- 9:29 ## integer
>x <- c(1+0i, 2+4i) ## complex

Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE. However, in
general one should try to use the explicit TRUE and FALSE values when indicating logical values. The T
and F values are primarily there for when you’re feeling lazy.
You can also use the vector() function to initialize vectors.
>x <- vector("numeric", length = 10)
>x
[1] 0 0 0 0 0 0 0 0 0 0
A vector is an object that contains a set of values called its elements.

NUMERIC VECTOR
x <- c(1,2,3,4,5,6)
The operator <- is equivalent to the "=" sign.

CHARACTER VECTOR
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
To calculate the frequency of each value in the State vector, you can use the table function.

To calculate the mean of a vector, you can use the mean function.

If a vector contains an NA (not available) value, the mean function returns NA.

To calculate the mean of a vector excluding NA values, you can include the na.rm = TRUE parameter in
the mean function.
You can use subscripts to refer to elements of a vector.
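The operations just described can be sketched as follows. The State vector is the one defined above; the vector scores, with one missing value, is a hypothetical example added for this illustration.

```r
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
state_freq <- table(State)                 # frequency of each state code
print(state_freq)

scores <- c(1, 2, 3, 4, 5, NA)             # hypothetical vector with one missing value
mean_with_na <- mean(scores)               # NA, because one element is missing
mean_no_na <- mean(scores, na.rm = TRUE)   # mean of the five non-missing values
print(mean_with_na)
print(mean_no_na)

print(scores[2])                           # second element
print(scores[c(1, 3)])                     # first and third elements
```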

To convert a column "x" to numeric:
data$x = as.numeric(data$x)


Some useful vectors can be created quickly with R. The colon operator is used to generate integer sequences:

>1:10
[1] 1 2 3 4 5 6 7 8 9 10
>-3:4
[1] -3 -2 -1 0 1 2 3 4
>9:5
[1] 9 8 7 6 5

More generally, the function seq() can generate any arithmetic progression.

>seq(from=2, to=6, by=0.4)
[1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0
>seq(from=-1, to=1, length=6)
[1] -1.0 -0.6 -0.2 0.2 0.6 1.0

Sometimes it’s necessary to have repeated values, for which we use rep():

>rep(5,3)
[1] 5 5 5
>rep(2:5,each=3)
[1] 2 2 2 3 3 3 4 4 4 5 5 5
>rep(-1:3, length.out=10)
[1] -1 0 1 2 3 -1 0 1 2 3

We can also use R’s vectorization to create more interesting sequences:

>2^(0:10)
[1] 1 2 4 8 16 32 64 128 256 512 1024
>1:3 + rep(seq(from=0,by=10,to=30), each=3)
[1] 1 2 3 11 12 13 21 22 23 31 32 33

>x <- list(1:3, TRUE, "Hello", list(1:2, 5))

Here x has 4 elements: a numeric vector, a logical, a string and another list. We can select an entry of x with
double square brackets:
>x[[3]]

[1] "Hello"

To get a sub-list, use single brackets:

>x[c(1,3)]

[[1]]

[1] 1 2 3

[[2]]

[1] "Hello"

Notice the difference between x[[3]] and x[3].


We can also name some or all of the entries in our list, by supplying argument names to list():

>x <- list(y=1:3, TRUE, z="Hello")


>x
$y

[1] 1 2 3
[[2]]
[1] TRUE
$z

[1] "Hello"
Notice that the [[1]] has been replaced by $y, which gives us a clue as to how we can recover the entries by their
name. We can still use the numeric position if we prefer:
>x$y
[1] 1 2 3
>x[[1]]
[1] 1 2 3

The function names() can be used to obtain a character vector of all the names of objects in a list.
>names(x)
[1] "y" "" "z"
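The difference between single and double brackets, and a few other common list operations, can be sketched as follows. The entry w and the variable lens are assumptions made for this illustration.

```r
x <- list(y = 1:3, TRUE, z = "Hello")

print(class(x[1]))     # "list": single brackets return a sub-list
print(class(x[[1]]))   # "integer": double brackets return the element itself

x$w <- c(2.5, 3.5)     # add a new named entry
x[["z"]] <- NULL       # remove the entry named z

lens <- sapply(x, length)   # length of every remaining entry
print(names(x))
print(lens)
```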

ASSIGNMENT 4

Aim: To Create Sample (Dummy) Data in R and perform data manipulation with R

THEORY: This section covers how to carry out the most frequently used data manipulation tasks in R. It
includes various examples with datasets and code, and gives a quick look at several functions used in R.

• DROP DATA FRAME COLUMNS BY NAME:

>DF <- data.frame(x=1:10, y=10:1, z=rep(5,10), a=11:20)
# drop multiple columns by name
>drops <- c("x","z")
>DF[ , !(names(DF) %in% drops)]
# OR keep only the wanted columns
>keeps <- c("y", "a")
>DF[keeps]
>DF

• ORDER FUNCTION FOR SORT:

>d3=data.frame(roll=c(2,4,6,3,1,5), name=c('a','b','c','d','e','e'), marks=c(44,55,22,33,66,77))
>d3
>d3[order(d3$roll),]
# OR
>d3[with(d3,order(roll)),]

• SUBSETS:

roll=c(1:5)
names=c(letters[1:5])
marks=c(12,33,44,55,66)
d4=data.frame(roll,names,marks)
sub1=subset(d4,marks>33 & roll>4)
sub1
sub1=subset(d4,marks>33 & roll>4,select = c(roll,names))
sub1

• DROP FACTOR LEVELS IN A SUBSETTED DATA FRAME:

# stringsAsFactors=TRUE makes letters a factor (since R 4.0 the default is FALSE)
df <- data.frame(letters=letters[1:5], numbers=1:5, stringsAsFactors=TRUE)
df
levels(df$letters)
sub2=subset(df,numbers>3)
sub2
levels(sub2$letters) # unused levels are still listed
sub2$letters=factor(sub2$letters) # re-create the factor to drop them
levels(sub2$letters)

• RENAME COLUMNS IN R
# d is assumed to be an existing data frame with a "roll" column, such as d3 above
colnames(d)[colnames(d)=="roll"]="ID"

• USE OF FACTOR IN DATA FRAME

# add a column named class (d is assumed to have six rows)
d$class=c(1,2,1,2,1,2)
d
cls=factor(d$class,levels = c(1,2),labels = c("f","s"))
table(cls)
# for factor(), levels and labels are optional

• REPLACING / RECODING VALUES

Recoding means replacing existing value(s) with new value(s).
Create dummy data:
mydata = data.frame(State = ifelse(sign(rnorm(25))==-1,'Delhi','Goa'), Q1= sample(1:25))

In this example, we replace 1 with 6 in the Q1 variable:
mydata$Q1[mydata$Q1==1] <- 6

In this example, we replace "Delhi" with "Mumbai" in the State variable. If State is stored as a factor, it
must first be converted to character:
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'

In this example, we replace 2 and 3 with NA values in the whole dataset:
mydata[mydata == 2 | mydata == 3] <- NA

• ANOTHER METHOD
You first have to install the car package.
# Install the car package
install.packages("car")
# Load the car package
library("car")
# Recode 1 to 6
mydata$Q1 <- recode(mydata$Q1, "1=6")

• RECODING TO A NEW COLUMN

Create a new column called Ques1:
mydata$Ques1<- recode(mydata$Q1, "1:4=0; 5:6=1")

• SORTING
Sorting is one of the most common data manipulation tasks. It is generally used when we want to see the
top 5 highest or lowest values of a variable.

• SORTING A VECTOR
x= sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() is used for sorting a one-dimensional vector. It cannot be used for data with more than
one dimension.

• SORTING A DATA FRAME
mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT= sample(1:25))
Sort by the Gender variable in ascending order:
mydata.sorted <- mydata[order(mydata$Gender),]
Sort by Gender in ascending order and then by SAT in descending order:
mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT),]

Note : the "-" sign before mydata$SAT tells R to sort the SAT variable in descending order.

• VALUE LABELLING
Use factor() for nominal data (this assumes Gender is coded as 1 and 2):
mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male", "female"))
Use ordered() for ordinal data:
mydata$var2 <- ordered(mydata$var2, levels = c(1,2,3,4), labels = c("Strongly agree", "Somewhat agree",
"Somewhat disagree", "Strongly disagree"))

• DEALING WITH MISSING DATA

Number of missing values in each variable:
colSums(is.na(mydata))
Number of missing values in each row:
rowSums(is.na(mydata))
List the rows of data that have missing values:
mydata[!complete.cases(mydata),]
Create a new dataset without missing data:
mydata1 <- na.omit(mydata)
Convert a value to missing:
mydata[mydata$Q1==999,"Q1"] <- NA
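The missing-data steps above can be tried end to end on a small example. The data frame survey and the use of 999 as a "no answer" code are hypothetical, chosen for this illustration. The sketch wraps the comparison in which(), because assigning into a data frame with a logical row index that itself contains NA raises an error in R.

```r
# hypothetical survey data; 999 is assumed to be the code for "no answer"
survey <- data.frame(Q1 = c(3, 999, 5, NA),
                     Q2 = c(NA, 2, 4, 1))

# which() drops the NA comparison, so this also works when Q1 already has NAs
survey[which(survey$Q1 == 999), "Q1"] <- NA

na_per_column <- colSums(is.na(survey))          # missing values in each variable
na_per_row <- rowSums(is.na(survey))             # missing values in each row
incomplete <- survey[!complete.cases(survey), ]  # rows with at least one NA
clean <- na.omit(survey)                         # rows with no NA at all

print(na_per_column)
print(na_per_row)
print(clean)
```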

Intersect, merge in R – intersection of data frames:

data1=data.frame(roll=c(1,2,3,4,5),name=c('sachin','rahul','vijay','kapil','saurav'))
data1
data2=data.frame(roll=c(1,2,3,5), marks=c(20,25,43,60))
data2
# roll numbers present in both data frames
result=intersect(data1$roll,data2$roll)
result
# inner join: keep only the rows whose roll appears in both data frames
result=merge(data1,data2,all=FALSE)
result
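The all, all.x and all.y arguments of merge() control which unmatched rows are kept. The sketch below contrasts the inner join used above with a left join on the same two data frames; the names inner and left are assumptions made for this illustration.

```r
data1 <- data.frame(roll = c(1, 2, 3, 4, 5),
                    name = c('sachin', 'rahul', 'vijay', 'kapil', 'saurav'))
data2 <- data.frame(roll = c(1, 2, 3, 5), marks = c(20, 25, 43, 60))

inner <- merge(data1, data2, by = "roll")                 # only rolls found in both
left  <- merge(data1, data2, by = "roll", all.x = TRUE)   # every row of data1 is kept
print(inner)
print(left)   # roll 4 has no match in data2, so its marks value is NA
```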
ASSIGNMENT NO. 5

Aim: Study and implementation of various control structures in R


THEORY:
A loop helps you repeat a similar operation on different variables, columns or datasets. For example,
suppose you want to multiply each variable by 5. Instead of multiplying each variable one by
one, you can perform this task in a loop. Its main benefit is that it reduces duplication in your code, which
makes the code easier to change later.
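The basic control structures can be sketched as follows, starting with the multiply-by-5 task just described. The data frame df and the other variable names are hypothetical, chosen for this illustration.

```r
df <- data.frame(a = 1:3, b = c(10, 20, 30))   # hypothetical data

# for loop: multiply every column by 5
for (col in names(df)) {
  df[[col]] <- df[[col]] * 5
}
print(df)

# while loop: keep doubling until the value reaches at least 100
value <- 1
steps <- 0
while (value < 100) {
  value <- value * 2
  steps <- steps + 1
}
print(c(value = value, steps = steps))

# if / else: branch on a single condition
size <- if (nrow(df) > 2) "more than two rows" else "two rows or fewer"
print(size)
```

For this particular task a loop is not strictly needed, since df * 5 multiplies every numeric column at once; the loop form is useful when each column needs its own treatment.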

• IF-ELSE AND NESTED IF-ELSE IN R

If-else statements are an important part of R programming. In this tutorial, we will see various ways to
apply conditional statements (if-else and nested if-else) in R. In R, there are a lot of powerful packages for
data manipulation. In the later part of this tutorial, we will see how if-else statements are used in popular
packages.

• SAMPLE DATA
Let's create sample data to show how to use the ifelse() function. This data frame is used
in the examples that follow.

x1   x2  x3
 1  129   A
 3  178   B
 5  140   C
 7  186   D
 9  191   E
11  104   F
13  150   G
15  183   H
17  151   I
19  142   J
Run the program below to generate the above table in R:
set.seed(123)
mydata = data.frame(x1 = seq(1,20,by=2),
x2 = sample(100:200,10,FALSE),
x3 = LETTERS[1:10])

x1 = seq(1,20,by=2) : The variable 'x1' contains every second number from 1 to 19. In total,
there are 10 numeric values.

x2 = sample(100:200,10,FALSE) : The variable 'x2' consists of 10 non-repeating random numbers between
100 and 200.

x3 = LETTERS[1:10] : The variable 'x3' contains the first 10 letters of the alphabet, A to J.

• SYNTAX OF IFELSE() FUNCTION :

The ifelse() function in R works like the IF function in MS Excel. See the syntax below –

ifelse(condition, value if condition is true, value if condition is false)

• EXAMPLE 1 : SIMPLE IF ELSE STATEMENT

Suppose you are asked to create a binary variable (1 or 0) based on the variable 'x2'. If the value of
'x2' is greater than 150, assign 1, else 0.
mydata$x4 = ifelse(mydata$x2>150,1,0)
In this case, it creates a variable x4 in the same data frame 'mydata'.

• CREATE VARIABLE IN A NEW DATA FRAME

Suppose you need to add the binary variable created above to a new data frame. You can do it by using the
code below –
x = ifelse(mydata$x2>150,1,0)
newdata = cbind(x,mydata)
The cbind() function combines vectors, matrices or data frames by columns.

• APPLY IFELSE() ON CHARACTER VARIABLES

If the variable 'x3' contains the character values 'A' or 'D', the variable 'x1' should be multiplied by 2.
Otherwise it should be multiplied by 3.
mydata$y = ifelse(mydata$x3 %in% c("A","D") ,mydata$x1*2,mydata$x1*3)

• THE OUTPUT IS SHOWN IN THE TABLE BELOW

x1   x2  x3   y
 1  129   A   2
 3  178   B   9
 5  140   C  15
 7  186   D  14
 9  191   E  27
11  104   F  33
13  150   G  39
15  183   H  45
17  151   I  51
19  142   J  57
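A nested if-else can be written by placing one ifelse() call inside another. The sketch below classifies the x2 values from the table above into three bands; the cut-offs 175 and 140 and the band names are hypothetical, chosen for this illustration.

```r
# x2 values copied from the table above
x2 <- c(129, 178, 140, 186, 191, 104, 150, 183, 151, 142)

# the inner ifelse() supplies the value wherever the outer condition is FALSE
band <- ifelse(x2 > 175, "high",
               ifelse(x2 > 140, "medium", "low"))
print(data.frame(x2, band))
print(table(band))
```

Note that 140 falls in the "low" band because the comparison is strictly greater than 140.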
ASSIGNMENT NO. 6

Aim: Data Manipulation with dplyr package


• THEORY:
The dplyr package is one of the most powerful and popular packages in R. It was written by
Hadley Wickham, a well-known R programmer who has written many useful R packages such as ggplot2 and
tidyr. This section includes several examples and tips on how to use the dplyr package for cleaning and
transforming data.

• WHAT IS DPLYR?
dplyr is a powerful R package for manipulating, cleaning and summarizing data. In short, it makes
data exploration and data manipulation easy and fast in R.

• WHAT'S SPECIAL ABOUT DPLYR?

The "dplyr" package comprises many functions that perform the most commonly used data manipulation
operations, such as applying filters, selecting specific columns, sorting data, adding or deleting columns and
aggregating data. Another important advantage of this package is that its functions are easy to learn, use
and recall. For example, filter() is used to filter rows.

• DPLYR VS. BASE R FUNCTIONS

dplyr functions are often faster than the equivalent base R functions, because they were written in a
computationally efficient manner. Their syntax is also more consistent, and they are designed to work with
data frames rather than vectors.

• SQL QUERIES VS. DPLYR

People have been using SQL to analyze data for decades. Most modern data analysis tools, such
as Python, R and SAS, support SQL commands. But SQL was never designed to perform data analysis; it
was designed for querying and managing data. There are many data analysis operations where SQL
fails or makes simple things difficult, for example calculating the median for multiple variables or
converting wide format data to long format. The dplyr package, by contrast, was designed for data analysis.
The names of dplyr functions are similar to SQL commands: select() for selecting variables,
group_by() for grouping data by a grouping variable, and the join functions, such as inner_join() and
left_join(), for joining two data sets. It also supports the kind of subqueries for which SQL is popular.

• HOW TO INSTALL AND LOAD DPLYR PACKAGE

To install the dplyr package, type the following command:
install.packages("dplyr")
To load the dplyr package, type the command below:
library(dplyr)
• IMPORTANT DPLYR FUNCTIONS TO REMEMBER
dplyr Function                    Description                      Equivalent SQL
select()                          Selecting columns (variables)    SELECT
filter()                          Filter (subset) rows             WHERE
group_by()                        Group the data                   GROUP BY
summarise()                       Summarise (or aggregate) data    -
arrange()                         Sort the data                    ORDER BY
inner_join(), left_join() etc.    Joining data frames (tables)     JOIN
mutate()                          Creating new variables           COLUMN ALIAS
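To show how these functions fit together, the following is a minimal sketch on a small, made-up data frame. The data frame income, its four rows and all of its figures are hypothetical and are not taken from the income dataset described in this assignment; the sketch also assumes the dplyr package is already installed.

```r
library(dplyr)

# hypothetical data: two year columns for four states
income <- data.frame(
  Index = c("A", "A", "C", "C"),
  State = c("Alabama", "Alaska", "California", "Colorado"),
  Y2014 = c(100, 200, 300, 500),
  Y2015 = c(110, 190, 330, 450)
)

# select columns, add a new variable, keep rows that grew, sort by the growth
changes <- income %>%
  select(State, Y2014, Y2015) %>%
  mutate(change = Y2015 - Y2014) %>%
  filter(change > 0) %>%
  arrange(desc(change))
print(changes)

# group by the first letter and summarise each group
by_index <- income %>%
  group_by(Index) %>%
  summarise(mean_2015 = mean(Y2015), n_states = n())
print(by_index)
```

The %>% operator passes the result of one step to the next, so each pipeline reads top to bottom in the order the operations are applied.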

Data : Income Data by States


In this tutorial, we are using the following data which contains income generated by states from year 2002
to 2015. Note : This data do not contain actual income figures of the states.
This dataset contains 51 observations (rows) and 16 variables (columns). The snapshot of first 6 rows of the
dataset is shown below.
  Index State     Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008   Y2009
1 A     Alabama 1296530 1317711 1118631 1492583 1107408 1440134 1945229 1944173
2 A     Alaska  1170302 1960378 1818085 1447852 1861639 1465841 1551826 1436541
3 A     Arizona 1742027 1968140 1377583 1782199 1102568 1109382 1752886 1554330

    Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
1 1237582 1440756 1186741 1852841 1558906 1916661
2 1629616 1230866 1512804 1985302 1580394 1979143
3 1300521 1130709 1907284 1363279 1525866 1647724
4 1669295 1928238 1216675 1591896 1360959 1329341
5 1624509 1639670 1921845 1156536 1388461 1644607
6 1913275 1665877 1491604 1178355 1383978 1330736



How to load Data
Run the following code after changing the file path to the location of the file on your computer.
mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")

EXAMPLE 1 : SELECTING RANDOM N ROWS


The sample_n function selects random rows from a data frame (or table). The second parameter of the
function tells R the number of rows to select.
sample_n(mydata,3)
   Index State      Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008   Y2009
2  A     Alaska   1170302 1960378 1818085 1447852 1861639 1465841 1551826 1436541
8  D     Delaware 1330403 1268673 1706751 1403759 1441351 1300836 1762096 1553585
33 N     New York 1395149 1611371 1170675 1446810 1426941 1463171 1732098 1426216

     Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
2  1629616 1230866 1512804 1985302 1580394 1979143
8  1370984 1318669 1984027 1671279 1803169 1627508

EXAMPLE 2 : Selecting Random Fraction Of Rows


The sample_frac function returns a random fraction of the rows. In the example below, it returns a random
10% of the rows.
sample_frac(mydata,0.1)
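
Because both functions draw rows at random, the output changes on every run. The sketch below shows one way to make the draw repeatable with set.seed(). It uses slice_sample(), which superseded sample_n() and sample_frac() in dplyr 1.0.0. The data frame toy and the seed value 42 are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with 20 rows
toy <- data.frame(id = 1:20, value = (1:20) * 10)

set.seed(42)                                  # fix the seed so the draw is repeatable
three_rows  <- slice_sample(toy, n = 3)       # counterpart of sample_n(toy, 3)
ten_percent <- slice_sample(toy, prop = 0.1)  # counterpart of sample_frac(toy, 0.1)
```

With 20 rows, a proportion of 0.1 returns 2 rows.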

EXAMPLE 3 : Remove Duplicate Rows Based On All The Variables (Complete Row)
The distinct function is used to eliminate duplicates.
x1 = distinct(mydata)
In this dataset, there is not a single duplicate row, so distinct() returns the same number of rows as in
mydata.

EXAMPLE 4 : Remove Duplicate Rows Based On A Variable


The .keep_all = TRUE argument is used to retain all other variables in the output data frame.
x2 = distinct(mydata, Index, .keep_all= TRUE)

EXAMPLE 5 : Remove Duplicate Rows Based On Multiple Variables


In the example below, we are using two variables - Index, Y2010 to determine uniqueness.
x2 = distinct(mydata, Index, Y2010, .keep_all= TRUE)
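
Since the income data has no duplicate rows, the effect of distinct() is easier to see on a small made-up data frame. The sketch below uses a hypothetical data frame toy (columns grp, yr and val) and shows the three variants from Examples 3 to 5.

```r
library(dplyr)

# Hypothetical toy data; rows 1 and 2 are exact duplicates
toy <- data.frame(
  grp = c("A", "A", "B", "B"),
  yr  = c(1, 1, 1, 2),
  val = c(10, 10, 30, 40)
)

all_cols <- distinct(toy)                             # duplicates judged on the complete row
by_grp   <- distinct(toy, grp, .keep_all = TRUE)      # first row kept for each value of grp
by_both  <- distinct(toy, grp, yr, .keep_all = TRUE)  # uniqueness judged on grp and yr together
```

Without .keep_all = TRUE, the last two calls would return only the columns named in the call.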
 SELECT( ) FUNCTION
It is used to select only desired variables.

EXAMPLE 6 : Selecting Variables (Or Columns)


Suppose you are asked to select only a few variables. The code below selects the variable "Index" and the
columns from "State" to "Y2008".
mydata2 = select(mydata, Index, State:Y2008)

EXAMPLE 7 : Dropping Variables


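The minus sign before a variable tells R to drop the variable. The result is stored under a new name so
that mydata keeps the Index and State columns used in the later examples.
mydata_dropped = select(mydata, -Index, -State)
The above code can also be written as:
mydata_dropped = select(mydata, -c(Index,State))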

EXAMPLE 8 : Selecting Or Dropping Variables Starts With 'Y'


The starts_with() function is used to select variables whose names start with a given prefix.
mydata3 = select(mydata, starts_with("Y"))
Adding a negative sign before starts_with() drops the variables whose names start with 'Y'.
mydata33 = select(mydata, -starts_with("Y"))

The following functions help you to select variables based on their names.
Helpers          Description
starts_with()    Starts with a prefix
ends_with()      Ends with a suffix
contains()       Contains a literal string
matches()        Matches a regular expression
num_range()      Numerical range like x01, x02, x03
one_of()         Variables in a character vector
everything()     All variables
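
The helpers in the table can be tried out on a small made-up data frame. The sketch below assumes a hypothetical data frame toy with columns id, x01, x02 and total_x. It uses any_of(), the newer alternative to one_of() in current dplyr versions, which ignores names that are not present in the data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(id = 1:2, x01 = 3:4, x02 = 5:6, total_x = 7:8)

ends  <- names(select(toy, ends_with("_x")))                 # names ending in "_x"
regex <- names(select(toy, matches("^x[0-9]+$")))            # "x" followed only by digits
rng   <- names(select(toy, num_range("x", 1:2, width = 2)))  # x01 and x02
chr   <- names(select(toy, any_of(c("id", "missing_col"))))  # names given as a character vector
```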

EXAMPLE 9 : Selecting Variables Contain 'I' In Their Names


mydata4 = select(mydata, contains("I"))

EXAMPLE 10 : Reorder Variables


The code below moves the variable 'State' to the front; the remaining variables follow it.
mydata5 = select(mydata, State, everything())
The new order of the variables is displayed below:
[1] "State" "Index" "Y2002" "Y2003" "Y2004" "Y2005" "Y2006" "Y2007" "Y2008" "Y2009"
[11] "Y2010" "Y2011" "Y2012" "Y2013" "Y2014" "Y2015"

 RENAME( ) FUNCTION
It is used to change variable names.
rename() syntax : rename(data , new_name = old_name)
data : Data Frame
new_name : New variable name you want to keep
old_name : Existing variable name

EXAMPLE 11 : Rename Variables


The rename function can be used to rename variables.
In the following code, we are renaming 'Index' variable to 'Index1'.
mydata6 = rename(mydata, Index1=Index)
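
As a small sketch, rename() can also change several names in one call, and it returns a new data frame without altering the original. The data frame toy and the new names below are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same kind of column names as the lab data
toy <- data.frame(Index = c("A", "A"), State = c("Alabama", "Alaska"), Y2002 = c(1, 2))

# Each pair follows the pattern new_name = old_name
renamed <- rename(toy, Index1 = Index, Year2002 = Y2002)

names(renamed)  # column names after renaming
names(toy)      # the original data frame keeps its names
```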

 FILTER( ) FUNCTION
It is used to subset data with matching logical conditions.
filter() syntax : filter(data , ...)
data : Data Frame
... : Logical conditions that the rows must satisfy

EXAMPLE 12 : Filter Rows


Suppose you need to subset the data and retain only those rows in which Index is equal to A.
mydata7 = filter(mydata, Index == "A")
  Index State      Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008   Y2009
1 A     Alabama  1296530 1317711 1118631 1492583 1107408 1440134 1945229 1944173
2 A     Alaska   1170302 1960378 1818085 1447852 1861639 1465841 1551826 1436541
3 A     Arizona  1742027 1968140 1377583 1782199 1102568 1109382 1752886 1554330
4 A     Arkansas 1485531 1994927 1119299 1947979 1669191 1801213 1188104 1628980

    Y2010   Y2011   Y2012   Y2013   Y2014   Y2015
1 1237582 1440756 1186741 1852841 1558906 1916661
2 1629616 1230866 1512804 1985302 1580394 1979143
3 1300521 1130709 1907284 1363279 1525866 1647724
4 1669295 1928238 1216675 1591896 1360959 1329341

EXAMPLE 13 : Multiple Selection Criteria


The %in% operator can be used to select multiple items. In the following program, we tell R to select the
rows in which the column 'Index' is 'A' or 'C'.
mydata7 = filter(mydata, Index %in% c("A", "C"))

EXAMPLE 14 : 'And' Condition In Selection Criteria


Suppose you need to apply an 'AND' condition. In this case, we pick the rows with 'A' or 'C' in the column
'Index' and an income greater than or equal to 1.3 million in the year 2002.
mydata8 = filter(mydata, Index %in% c("A", "C") & Y2002 >= 1300000 )

EXAMPLE 15 : 'Or' Condition In Selection Criteria


The '|' symbol denotes OR in the logical condition. It means that either of the two conditions may hold.
mydata9 = filter(mydata, Index %in% c("A", "C") | Y2002 >= 1300000)

EXAMPLE 16 : Not Condition


The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata, !Index %in% c("A", "C"))

EXAMPLE 17 : Contains Condition


The grepl function is used to search for a pattern. In the following code, we look for records in which
the value of the column State contains 'Ar'.
mydata10 = filter(mydata, grepl("Ar", State))
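
The filter conditions from Examples 13 to 17 can be compared side by side on a small made-up data frame. The sketch below assumes a hypothetical data frame toy with columns Index, State and Income; the figures are invented.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  Index  = c("A", "A", "C", "D"),
  State  = c("Alabama", "Arizona", "Colorado", "Delaware"),
  Income = c(1200000, 1700000, 1400000, 1300000)
)

in_ac  <- filter(toy, Index %in% c("A", "C"))                     # multiple values
both   <- filter(toy, Index %in% c("A", "C") & Income >= 1300000) # AND: both must hold
either <- filter(toy, Index %in% c("A", "C") | Income >= 1300000) # OR: at least one holds
not_ac <- filter(toy, !Index %in% c("A", "C"))                    # NOT: reverse the condition
has_ar <- filter(toy, grepl("Ar", State))                         # pattern match, case-sensitive
```

Note that grepl() is case-sensitive by default, so "Delaware" does not match the pattern "Ar" even though it contains "ar".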
 SUMMARISE( ) FUNCTION
It is used to summarize data.
summarise() syntax : summarise(data , ...)
data : Data Frame
... : Name-value pairs of summary functions, such as Y2015_mean = mean(Y2015)

EXAMPLE 18 : Summarize Selected Variables


In the example below, we are calculating mean and median for the variable Y2015.
summarise(mydata, Y2015_mean = mean(Y2015), Y2015_med=median(Y2015))

EXAMPLE 19 : Summarize Multiple Variables


In the following example, we calculate the number of records, mean and median for the variables Y2005 and
Y2006. The summarise_at function allows us to select multiple variables by their names.
summarise_at(mydata, vars(Y2005, Y2006), list(n = ~n(), mean = mean, median = median))

EXAMPLE 20 : Summarize With Custom Functions


We can also use custom functions in the summarise function. In this case, we compute the number of
records, number of missing values, mean and median for the variables Y2011 and Y2012. The dot (.) denotes
each variable specified in the second argument of the function.
summarise_at(mydata, vars(Y2011, Y2012),
             list(n = ~n(), missing = ~sum(is.na(.)), mean = ~mean(., na.rm = TRUE),
                  median = ~median(., na.rm = TRUE)))

 HOW TO APPLY NON-STANDARD FUNCTIONS


EXAMPLE 21 : Summarize All Numeric Variables
The summarise_if function allows you to summarise conditionally.
summarise_if(mydata, is.numeric, list(n = ~n(), mean = mean, median = median))

 ALTERNATIVE METHOD :
First, store the data for all the numeric variables.
numdata = mydata[sapply(mydata,is.numeric)]
Second, use the summarise_all function to calculate summary statistics for all the columns in a data frame.
summarise_all(numdata, list(n = ~n(), mean = mean, median = median))
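
In dplyr 1.0.0 and later, the same kind of summary can also be written with across() inside summarise(). The sketch below is one possible form, using a made-up data frame toy with one missing value; the column names and figures are hypothetical. The record count is computed after across() so that where(is.numeric) does not pick up the new column n.

```r
library(dplyr)

# Hypothetical toy data with one missing value in y1
toy <- data.frame(
  grp = c("a", "b", "c"),
  y1  = c(1, 2, NA),
  y2  = c(10, 20, 60)
)

stats <- summarise(
  toy,
  across(where(is.numeric),
         list(missing = ~sum(is.na(.x)),            # number of missing values
              mean    = ~mean(.x, na.rm = TRUE),    # mean, ignoring missing values
              median  = ~median(.x, na.rm = TRUE))),
  n = n()                                           # number of records, computed last
)
```

The result has one column per variable and statistic, named in the pattern y1_missing, y1_mean, y1_median and so on.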

EXAMPLE 22 : Summarize Factor Variable


PIPE OPERATOR %>%
It is important to understand the pipe operator (%>%) before learning the other functions of the dplyr
package. dplyr uses the pipe operator from another package (magrittr).
It lets you chain operations, much as sub-queries are nested in SQL.
Note : All the functions in the dplyr package can be used without the pipe operator. So why use the pipe
operator %>%? Because it lets you chain multiple functions together, passing the result of each step on
to the next.

SYNTAX :
filter(data_frame, variable == value)
or, equivalently, with the pipe:
data_frame %>% filter(variable == value)
The %>% operator is NOT restricted to the filter function. It can be used with any function.

EXAMPLE 23: The code below demonstrates the usage of the pipe operator %>%. In this example, we select 10
random observations of two variables, "Index" and "State", from the data frame mydata.
dt = sample_n(select(mydata, Index, State),10)
The same operation written with the pipe:
dt = mydata %>% select(Index, State) %>% sample_n(10)
GROUP_BY() FUNCTION :
Use : Group data by categorical variable

SYNTAX :
group_by(data, variables)
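
group_by() is usually followed by summarise(), so that each statistic is computed once per group. The sketch below chains the two with the pipe on a small made-up data frame; toy and its figures are hypothetical, and the column names only mimic those of the lab data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  Index = c("A", "A", "C", "C", "C"),
  Y2015 = c(100, 300, 50, 150, 250)
)

by_index <- toy %>%
  group_by(Index) %>%                      # one group per value of Index
  summarise(records     = n(),             # number of rows in the group
            mean_2015   = mean(Y2015),
            median_2015 = median(Y2015))
```

The result has one row per value of Index, here one for 'A' and one for 'C'.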
