Research Methodology for Commerce Lab
BCOM 113
R PROGRAMMING
IN PARTIAL FULFILLMENT OF
BACHELOR OF COMMERCE [ BCOM.(H) ]
[2022-2025]
The objective of this lab is to understand various aspects of research, including the identification and use of
various statistical tests with the software tools available to a researcher. Labs are structured to give
students experience with conducting experiments, analyzing data, thinking critically about
theory and data, and communicating their results and analysis in writing and oral presentation.
We call the pdf() function to inform R that we want the graph we create to be saved in the PDF file
xh.pdf.
We call rnorm() (for random normal) to generate 100 N(0,1) random variates.
We call hist() on those variates to draw a histogram of these values.
We call dev.off() to close the graphical “device” we are using, which is
the file xh.pdf in this case. This is the mechanism that actually causes the file to be written to disk.
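The four calls described above can be put together as the following minimal sketch. The file name xh.pdf comes from the text; the variable name x is an assumption made for illustration.

```r
# Open a PDF graphics device; plots now go to xh.pdf instead of the screen.
pdf("xh.pdf")

# Draw 100 standard normal, N(0,1), random variates (x is an illustrative name).
x <- rnorm(100)

# Plot their histogram on the current device.
hist(x)

# Close the device; this is the step that writes the file to disk.
dev.off()
```

Because the variates are random, the histogram will look slightly different on each run.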
We could run this code automatically, without entering R’s interactive mode, by invoking R with an
operating system shell command (such as at the $ prompt commonly used in Linux systems):
You can confirm that this worked by using your PDF viewer to display the saved histogram. (It will just be
a plain-vanilla histogram, but R is capable of producing quite sophisticated variations.)
INTRODUCTION TO FUNCTIONS
As in most programming languages, the heart of R programming consists of writing functions. A function is
a group of instructions that takes inputs, uses them to compute other values, and returns a result.
As a simple introduction, let’s define a function named oddcount(), whose purpose is to count the odd
numbers in a vector of integers. Normally, we would compose the function code using a text editor and
save it in a file, but in this quick-and-dirty example, we’ll enter it line by line in R’s interactive mode. We’ll
then call the function on a couple of test cases.
# counts the number of odd integers in x
>oddcount <- function(x) {
+    k <- 0  # assign 0 to k
+    for (n in x) {
+       if (n %% 2 == 1) k <- k+1  # %% is the modulo operator
+    }
+    return(k)
+ }
>oddcount(c(1,3,5))
[1] 3
>oddcount(c(1,2,3,7,9))
[1] 4
First, we told R that we wanted to define a function named oddcount with one argument, x. The left brace
demarcates the start of the body of the function. We wrote one R statement per line.
Until the body of the function is finished, R reminds you that you’re still in the definition by using + as its
prompt, instead of the usual >. (Actually, + is a line-continuation character, not a prompt for a new input.) R
resumes the > prompt after you finally enter a right brace to conclude the function body.
After defining the function, we evaluated two calls to oddcount(). Since there are three odd numbers in the
vector (1,3,5), the call oddcount(c(1,3,5)) returns the value 3. There are four odd numbers in (1,2,3,7,9), so
the second call returns 4.
Notice that the modulo operator for remainder arithmetic is %% in R, as indicated by the comment. For
example, 38 divided by 7 leaves a remainder of 3:
>38 %% 7
[1] 3
For instance, let’s see what happens with the following code:
for (n in x) {
if (n %% 2 == 1) k <- k+1
}
First, it sets n to x[1], and then it tests that value for being odd or even. If the value is odd, which is the case
here, the count variable k is incremented. Then n is set to x[2], tested for being odd or even, and so on.
By the way, C/C++ programmers might be tempted to write the preceding loop like this:
for (i in 1:length(x)) {
if (x[i] %% 2 == 1) k <- k+1
}
Here, length(x) is the number of elements in x. Suppose there are 25 elements. Then 1:length(x) means
1:25, which in turn means 1,2,3,...,25. This code would also work (unless x were to have length 0), but one
of the major themes of R programming is to avoid loops if possible; if not, keep loops simple. Look again
at our original formulation:
for (n in x) {
if (n %% 2 == 1) k <- k+1
}
It’s simpler and cleaner, as we do not need to resort to using the length() function and array indexing.
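In keeping with the advice to avoid loops where possible, the same count can be obtained without any explicit loop. The sketch below is one vectorized alternative; the name oddcount_vec is hypothetical and does not appear in the original.

```r
# Vectorized alternative (hypothetical name): x %% 2 == 1 yields a logical
# vector, and sum() counts its TRUE values.
oddcount_vec <- function(x) {
  sum(x %% 2 == 1)
}

oddcount_vec(c(1, 3, 5))        # 3
oddcount_vec(c(1, 2, 3, 7, 9))  # 4
oddcount_vec(integer(0))        # 0, so an empty vector is handled as well
```

Unlike the 1:length(x) version, this form needs no special care when x has length 0.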
At the end of the code, we use the return statement:
return(k)
This has the function return the computed value of k to the code that called it. However, simply writing the
following also works:
k
R functions will return the last value computed if there is no explicit return() call. However, this approach
must be used with care.
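As an illustration of the implicit return value, the sketch below repeats the oddcount logic without a return() call. The name oddcount_implicit is hypothetical, chosen only to keep it apart from the original function.

```r
# Same logic as oddcount(), but relying on the implicit return value.
oddcount_implicit <- function(x) {
  k <- 0
  for (n in x) {
    if (n %% 2 == 1) k <- k + 1
  }
  k  # the last evaluated expression becomes the function's value
}

oddcount_implicit(c(1, 2, 3, 7, 9))  # 4
```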
In programming language terminology, x is the formal argument (or formal parameter ) of the function
oddcount(). In the first function call in the preceding example, c(1,3,5) is referred to as the actual argument.
These terms allude to the fact that x in the function definition is just a placeholder, whereas c(1,3,5) is the
value actually used in the computation. Similarly, in
the second function call, c(1,2,3,7,9) is the actual argument.
ASSIGNMENT NO. 1
Aim: Descriptive statistics
Let’s make a simple data set (in R parlance, a vector ) consisting of the numbers 1, 2, and 4, and name it x:
>x <- c(1,2,4)
The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does not work
in some special situations. Note that there are no fixed types associated with variables. Here, we’ve
assigned a vector to x, but later we might assign something of a different type to it. We’ll look at vectors
and the other types in Section 1.4.
The c stands for concatenate. Here, we are concatenating the numbers 1, 2, and 4. More precisely, we are
concatenating three one-element vectors that consist of those numbers. This is because any number is also
considered to be a one-element vector. Now we can also do the following:
>q <- c(x,x,8)
which sets q to (1,2,4,1,2,4,8) (yes, including the duplicates). Now let’s confirm that the data is really in x.
To print the vector to the screen, simply type its name. If you type any variable name (or, more
generally, any expression) while in interactive mode, R will print out the value of that
variable (or expression). Programmers familiar with other languages such as Python will find this feature
familiar. For our example, enter this:
>x
[1] 1 2 4
Sure enough, x consists of the numbers 1, 2, and 4. Individual elements of a vector are accessed via [ ].
Here’s how we can print out the third element of x:
>x[3]
[1] 4
As in other languages, the selector (here, 3) is called the index or subscript. Those familiar with ALGOL-
family languages, such as C and C++, should note that elements of R vectors are indexed starting from 1,
not 0. Subsetting is a very important operation on vectors. Here’s an example:
>x <- c(1,2,4)
>x[2:3]
[1] 2 4
The expression x[2:3] refers to the subvector of x consisting of elements 2 through 3, which are 2 and 4
here. We can easily find the mean and standard deviation of our data set, as follows:
>mean(x)
[1] 2.333333
>sd(x)
[1] 1.527525
This again demonstrates typing an expression at the prompt in order to print it. In the first line, our
expression is the function call mean(x). The return value from that call is printed automatically, without
requiring a call to R’s print() function.
If we want to save the computed mean in a variable instead of just printing it to the screen, we could
execute this code:
>y <- mean(x)
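Since the aim of this assignment is descriptive statistics, a few further summaries of the same vector may be useful. This is a small sketch using base R functions; the variable names med, v and r are assumptions made for illustration.

```r
x <- c(1, 2, 4)

y   <- mean(x)    # arithmetic mean, stored rather than printed
med <- median(x)  # middle value of the sorted data
v   <- var(x)     # sample variance, equal to sd(x)^2
r   <- range(x)   # smallest and largest values

summary(x)        # minimum, quartiles, mean and maximum in one call
```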
A window pops up with the histogram in it, as shown in Figure 1-1. This graph is bare-bones simple, but R
has all kinds of optional bells and whistles for plotting. For instance, you can change the number of bins by
specifying the breaks variable. The call hist(z,breaks=12) would draw a histogram of the data set z with 12
bins. You can also create nicer labels, make use of color, and make many other changes to create a more
informative and eye appealing graph. When you become more familiar with R, you’ll be able to construct
complex, rich color graphics of striking beauty.
Well, that’s the end of our first, five-minute introduction to R. Quit R by calling the q() function (or
alternatively by pressing CTRL-D in Linux or CMD-D on a Mac):
>q()
Save workspace image? [y/n/c]: n
That last prompt asks whether you want to save your variables so that you can resume work later. If you
answer y, then all those objects will be loaded automatically the next time you run R. This is a very
important feature, especially when working with large or numerous data sets. Answering y here also saves
the session’s command history. We’ll talk more about saving your workspace.
ASSIGNMENT NO. 2
THEORY: Matrices are much used in statistics, and so play an important role in R. To create a matrix,
use the function matrix(), specifying elements by column first:
>matrix(1:12, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
This is called column-major order. Of course, we need only give one of the dimensions:
>matrix(1:12, nrow=3)
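When fewer values than cells are supplied, R recycles them. For example, matrix(1:3, nrow=3, ncol=4) gives:
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3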
The %o% operator performs an outer product, so x %o% y creates a matrix with (i, j)-th entry x_i y_j. The function
outer() generalizes this to any function f of two arguments, creating a matrix with entries f(x_i, y_j). (More
on functions later.)
>outer(1:3, 1:4, "+")
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    3    4    5    6
[3,]    4    5    6    7
Matrix multiplication is performed using the operator %*%, which is quite distinct from scalar
multiplication *.
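The difference between the two operators can be seen on a pair of small matrices. The matrices A and B below are hypothetical examples, not taken from the original.

```r
A <- matrix(1:4, nrow = 2)  # columns are (1,2) and (3,4)
B <- matrix(5:8, nrow = 2)  # columns are (5,6) and (7,8)

# Element-by-element product: entry (i, j) is A[i, j] * B[i, j].
elementwise <- A * B

# Matrix product: entry (i, j) is the sum over k of A[i, k] * B[k, j].
matprod <- A %*% B

elementwise
matprod
```

The * operator requires both matrices to have the same dimensions, whereas %*% requires the number of columns of the first to equal the number of rows of the second.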
Each 2-dimensional slice defined by the last co-ordinate of the array is shown as a 2 × 3 matrix. Note that
we no longer specify the number of rows and columns separately, but use a single vector dim whose length
is the number of dimensions. You can recover this vector with the dim() function.
>dim(arr)
[1] 2 3 3
Note that a 2-dimensional array is identical to a matrix. Arrays can be subsetted and modified in exactly the
same way as a matrix, only using the appropriate number of co-ordinates:
>arr[1,2,3]
[1] 15
>arr[,2,]
     [,1] [,2] [,3]
[1,]    3    9   15
[2,]    4   10   16
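The definition of arr is not shown in this section. The sketch below assumes it was created with array(1:18, dim = c(2, 3, 3)), which is consistent with the dimensions and the subsetting results printed above.

```r
# Assumed definition: the values 1..18 fill a 2 x 3 x 3 array in
# column-major order (first index varies fastest).
arr <- array(1:18, dim = c(2, 3, 3))

dim(arr)      # the dimension vector
arr[1, 2, 3]  # a single element, addressed with three co-ordinates
arr[, 2, ]    # fixing the second co-ordinate leaves a 2 x 3 matrix
```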
Factors
R has a special data structure, the factor, for storing categorical variables. Making a variable a factor tells R
that it is nominal or ordinal.
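A brief sketch of both kinds of factor follows. The variable names and values are hypothetical and serve only to illustrate the idea.

```r
# Nominal variable: categories with no inherent order.
gender <- factor(c("M", "F", "F", "M"))
levels(gender)  # levels are sorted alphabetically by default
table(gender)   # frequency of each category

# Ordinal variable: the order of the levels is stated explicitly.
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
size[1] < size[2]  # ordered factors support comparisons
```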
R has five basic (atomic) classes of objects:
character
numeric (real numbers)
integer
complex
logical (TRUE/FALSE)
The most basic type of R object is a vector. Empty vectors can be created with the vector() function. There
is really only one rule about vectors in R, which is that a vector can only contain objects of the same class.
But of course, like any good rule, there is an exception, which is a list, which we will get to a bit later. A list
is represented as a vector but can contain objects of different classes. Indeed, that’s usually why we use
them.
There is also a class for “raw” objects, but they are not commonly used directly in data analysis.
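The same-class rule and its list exception can be checked directly. In this sketch the object names v, mixed and lst are assumptions made for illustration.

```r
# An empty numeric vector of length 3 (filled with zeros).
v <- vector("numeric", length = 3)

# Mixing classes in c() does not fail; R coerces everything to one class.
mixed <- c(1.5, "a", TRUE)
class(mixed)  # "character"

# A list keeps each element's own class.
lst <- list(1.5, "a", TRUE)
sapply(lst, class)  # "numeric" "character" "logical"
```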
CREATING VECTORS
The c() function can be used to create vectors of objects by concatenating things together.
NUMERIC VECTOR
x <- c(1,2,3,4,5,6)
The operator <- assigns a value to a name; at the top level it is equivalent to the "=" sign.
CHARACTER VECTOR
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
To calculate frequency for State vector, you can use table function.
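A minimal sketch of that frequency count, using the State vector defined above (the name freq is an assumption):

```r
State <- c("DL", "MU", "NY", "DL", "NY", "MU")

# table() counts how often each distinct value occurs.
freq <- table(State)
freq
```

Here each of the three states occurs twice.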
If a vector contains an NA (not available) value, the mean function returns NA.
To calculate the mean of a vector excluding NA values, include the na.rm = TRUE parameter in the mean
function.
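The effect of na.rm can be seen on a small vector. The values below are hypothetical, since the vector discussed in the text is not shown in this section.

```r
# Hypothetical vector with one missing value.
x <- c(1, 2, NA, 4)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # mean of the remaining values 1, 2 and 4
```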
You can use subscripts to refer to elements of a vector.
>1:10
[1] 1 2 3 4 5 6 7 8 9 10
>-3:4
[1] -3 -2 -1 0 1 2 3 4
>9:5
[1] 9 8 7 6 5
More generally, the function seq() can generate any arithmetic progression.
>seq(from=2, to=6, by=0.4)
[1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0
>seq(from=-1, to=1, length=6)
[1] -1.0 -0.6 -0.2 0.2 0.6 1.0
Sometimes it’s necessary to have repeated values, for which we use rep():
>rep(5,3)
[1] 5 5 5
>rep(2:5,each=3)
[1] 2 2 2 3 3 3 4 4 4 5 5 5
>2^(0:10)
[1] 1 2 4 8 16 32 64 128 256 512 1024
Here x has 4 elements: a numeric vector, a logical, a string and another list. We can select an entry of x with
double square brackets:
>x[[3]]
[1] "Hello"
>x[c(1,3)]
[[1]]
[1] 1 2 3
[[2]]
[1] "Hello"
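The list x is not defined in this section. The sketch below assumes a construction that matches the description (a numeric vector, a logical, a string and another list); the contents of the nested list are hypothetical.

```r
# Assumed construction; only the first three elements are confirmed by the
# output shown in the text, the nested list is illustrative.
x <- list(1:3, TRUE, "Hello", list(1:2, 5))

length(x)   # number of top-level elements
x[[3]]      # double brackets extract the element itself
x[c(1, 3)]  # single brackets return a (shorter) list
```

The distinction matters in practice: x[[3]] is a character string, while x[3] is a list of length one containing that string.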
List entries can also be given names:
>x <- list(y=1:3, TRUE, z="Hello")
>x
$y
[1] 1 2 3
[[2]]
[1] TRUE
$z
[1] "Hello"
Notice that the [[1]] has been replaced by $y, which gives us a clue as to how we can recover the entries by their
name. We can still use the numeric position if we prefer:
>x$y
[1] 1 2 3
>x[[1]]
[1] 1 2 3
The function names() can be used to obtain a character vector of all the names of objects in a list.
>names(x)
[1] "y" "" "z"
ASSIGNMENT NO. 4
Aim: To Create Sample (Dummy) Data in R and perform data manipulation with R
THEORY: This covers how to execute most frequently used data manipulation tasks with R. It includes
various examples with datasets and code. It gives you a quick look at several functions used in R.
# OR
>DF[keeps]
>DF
>d3
>d3[order(d3$roll),]
# OR
>d3[with(d3,order(roll)),]
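The fragments above keep a subset of columns and sort rows by roll. A self-contained sketch follows; the data frame contents and the columns name and marks are hypothetical, as the original data are not shown.

```r
# Hypothetical data frame standing in for the one used in the text.
d3 <- data.frame(roll  = c(3, 1, 2),
                 name  = c("C", "A", "B"),
                 marks = c(70, 90, 80))

# Keep only the columns named in a character vector.
keeps <- c("roll", "marks")
d3[keeps]

# Two equivalent ways of sorting the rows by roll.
sorted1 <- d3[order(d3$roll), ]
sorted2 <- d3[with(d3, order(roll)), ]
```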
SUBSETS:
RENAME COLUMNS IN R
colnames(d)[colnames(d)=="roll"] = "ID"
In this example, we replace "Delhi" with "Mumbai" in the State variable. We first need to convert the variable
from factor to character.
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
In this example, we replace 2 and 3 with NA values in the whole dataset.
mydata[mydata == 2 | mydata == 3] <- NA
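Both replacements can be tried on a small data frame. The one below is hypothetical (the columns Q1 and Q2 and all values are assumptions), since mydata is not defined at this point in the text.

```r
# Hypothetical data; State starts out as a factor.
mydata <- data.frame(State = factor(c("Delhi", "Goa", "Delhi")),
                     Q1 = c(1, 2, 3),
                     Q2 = c(3, 4, 5))

# Convert factor to character, then replace matching values.
mydata$State <- as.character(mydata$State)
mydata$State[mydata$State == "Delhi"] <- "Mumbai"

# Replace every 2 or 3 anywhere in the data frame with NA.
mydata[mydata == 2 | mydata == 3] <- NA
mydata
```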
ANOTHER METHOD
You first have to install the car package.
# Install the car package
install.packages("car")
# Load the car package
library("car")
# Recode 1 to 6
mydata$Q1 <- recode(mydata$Q1, "1=6")
SORTING
Sorting is one of the most common data manipulation tasks. It is generally used when we want to see the top
5 highest or lowest values of a variable.
SORTING A VECTOR
x = sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() sorts a one-dimensional vector. It is not suitable for sorting the rows of objects with
more than one dimension, such as data frames.
SORTING A DATA FRAME
mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT= sample(1:25))
Sort the Gender variable in ascending order:
mydata.sorted <- mydata[order(mydata$Gender),]
Sort the Gender variable in ascending order and then SAT in descending order:
mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT),]
Note : The "-" sign before mydata$SAT tells R to sort the SAT variable in descending order.
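The sorting steps above can be run end to end as follows. The seed value is arbitrary and the name top5 is an assumption; it illustrates the "top 5 highest values" use case mentioned earlier.

```r
set.seed(1)  # arbitrary seed, only so that the run is reproducible

mydata <- data.frame(Gender = ifelse(sign(rnorm(25)) == -1, "F", "M"),
                     SAT = sample(1:25))

# Gender ascending, then SAT descending within each gender.
mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT), ]

# The five rows with the highest SAT values.
top5 <- head(mydata[order(-mydata$SAT), ], 5)
top5
```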
VALUE LABELLING
Use factor() for nominal data
mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male", "female"))
Use ordered() for ordinal data
mydata$var2 <- ordered(mydata$var2, levels = c(1,2,3,4), labels = c("Strongly agree", "Somewhat agree",
"Somewhat disagree", "Strongly disagree"))
CONCLUSION:
data2
result = intersect(data1$roll, data2$roll)
result
result = merge(data1, data2, all=FALSE)
result
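The two operations above can be illustrated with a pair of small data frames. The contents of data1 and data2 below are hypothetical, as the originals are not shown; the name common is also an assumption.

```r
# Hypothetical inputs sharing a roll column.
data1 <- data.frame(roll = c(1, 2, 3), name  = c("A", "B", "C"))
data2 <- data.frame(roll = c(2, 3, 4), marks = c(80, 90, 70))

# Roll numbers present in both data frames.
common <- intersect(data1$roll, data2$roll)
common

# Inner join: all = FALSE keeps only rows whose roll appears in both.
result <- merge(data1, data2, all = FALSE)
result
```

With all = TRUE, merge() would instead keep every roll number from either data frame and fill the gaps with NA.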
ASSIGNMENT NO. 5
SAMPLE DATA
Let's create sample data to show how to use the IF ELSE function. This data frame will be used
in the examples that follow.
x1   x2    x3
1    129   A
3    178   B
5    140   C
7    186   D
9    191   E
11   104   F
13   150   G
15   183   H
17   151   I
19   142   J
Run the program below to generate the above table in R
set.seed(123)
mydata = data.frame(x1 = seq(1,20,by=2),
x2 = sample(100:200,10,FALSE),
x3 = LETTERS[1:10])
x1 = seq(1,20,by=2) : The variable 'x1' contains alternate numbers from 1 to 20. In total,
there are 10 numeric values.
x1   x2    x3   y
1    129   A    2
3    178   B    9
5    140   C    15
7    186   D    14
9    191   E    27
11   104   F    33
13   150   G    39
15   183   H    45
17   151   I    51
19   142   J    57
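The code that produced the y column is not shown in this section. The sketch below uses an assumed rule, x1 * 2 when x3 is "A" or "D" and x1 * 3 otherwise, which reproduces the y values in the table above; the actual rule used in the lab may have been different.

```r
mydata <- data.frame(x1 = seq(1, 20, by = 2),
                     x3 = LETTERS[1:10])

# Assumed rule (not confirmed by the text): double x1 for categories A and D,
# triple it for every other category.
mydata$y <- ifelse(mydata$x3 %in% c("A", "D"), mydata$x1 * 2, mydata$x1 * 3)
mydata
```

ifelse() is vectorized: it evaluates the condition for every row at once and picks the corresponding value from the second or third argument.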
ASSIGNMENT NO. 6
WHAT IS DPLYR?
dplyr is a powerful R package for manipulating, cleaning and summarizing data. In short, it makes
data exploration and data manipulation easy and fast in R.
EXAMPLE 3 : Remove Duplicate Rows Based On All The Variables (Complete Row)
The distinct function is used to eliminate duplicates.
x1 = distinct(mydata)
This dataset does not contain a single duplicate row, so distinct() returns the same number of rows as mydata.
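To see distinct() actually drop a row, it helps to use data that does contain a duplicate. The data frame below is hypothetical and deliberately repeats one row; the dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical data in which the first two rows are identical.
mydata <- data.frame(Index = c("A", "A", "B"),
                     Y2014 = c(10, 10, 20))

x1 <- distinct(mydata)
nrow(x1)  # one fewer row than mydata
```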
The following functions help you select variables based on their names.
Helpers         Description
starts_with()   Starts with a prefix
ends_with()     Ends with a suffix
contains()      Contains a literal string
matches()       Matches a regular expression
num_range()     Numerical range like x01, x02, x03
one_of()        Variables in a character vector
everything()    All variables
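A short sketch of several of these helpers inside select() follows. The data frame and its column names are hypothetical, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical data with two year columns.
mydata <- data.frame(Index = c("A", "B"),
                     State = c("Alabama", "Alaska"),
                     Y2014 = c(1, 2),
                     Y2015 = c(3, 4))

s1 <- names(select(mydata, starts_with("Y")))              # prefix "Y"
s2 <- names(select(mydata, ends_with("5")))                # suffix "5"
s3 <- names(select(mydata, contains("tat")))               # literal substring
s4 <- names(select(mydata, matches("^Y[0-9]+$")))          # regular expression
s5 <- names(select(mydata, num_range("Y", 2014:2015)))     # Y2014, Y2015
s6 <- names(select(mydata, State, everything()))           # State first, then the rest
```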
RENAME( ) FUNCTION
It is used to change a variable name.
rename() syntax : rename(data , new_name = old_name)
data : Data Frame
new_name : New variable name you want to keep
old_name : Existing variable name
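A minimal sketch of rename() is given below. The data frame and the new name Index1 are hypothetical, and the dplyr package is assumed to be installed.

```r
library(dplyr)

mydata <- data.frame(Index = c("A", "B"), Y2014 = c(1, 2))

# The new name goes on the left of "=", the existing name on the right.
mydata2 <- rename(mydata, Index1 = Index)
names(mydata2)
```

rename() returns a modified copy, so mydata itself keeps its original column names.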
FILTER( ) FUNCTION
It is used to subset data with matching logical conditions.
filter() syntax : filter(data , conditions)
data : Data Frame
conditions : Logical conditions that the rows must satisfy
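The following sketch filters on a single condition and then on a combination of two. The data are hypothetical, and the dplyr package is assumed to be installed.

```r
library(dplyr)

mydata <- data.frame(Index = c("A", "C", "I"),
                     Y2014 = c(100, 200, 300))

# Rows where Index equals "A".
f1 <- filter(mydata, Index == "A")

# Rows where Index is A or C AND Y2014 is at least 150.
f2 <- filter(mydata, Index %in% c("A", "C") & Y2014 >= 150)
f2
```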
ALTERNATIVE METHOD :
First, store data for all the numeric variables
numdata = mydata[sapply(mydata,is.numeric)]
Second, the summarise_all function calculates summary statistics for all the columns in a data frame
summarise_all(numdata, funs(n(),mean,median))
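The funs() helper has been deprecated in more recent dplyr releases. As a hedged alternative, the sketch below computes the mean and median of every numeric column with across(), assuming dplyr 1.0.0 or later is installed; the data frame is hypothetical.

```r
library(dplyr)

mydata <- data.frame(Index = c("A", "B", "C", "D"),
                     Y2014 = c(10, 20, 30, 40),
                     Y2015 = c(1, 3, 5, 7))

# Apply a named list of functions to each numeric column; the result has
# one column per (variable, function) pair, named like Y2014_mean.
result <- mydata %>%
  summarise(across(where(is.numeric), list(mean = mean, median = median)))
result
```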
SYNTAX :
filter(data_frame, variable == value)
The %>% operator is NOT restricted to the filter function. It can be used with any function.
EXAMPLE 23: The code below demonstrates the usage of the pipe %>% operator. In this example, we
select 10 random observations of two variables, "Index" and "State", from the data frame.
dt = sample_n(select(mydata, Index, State),10)
# the same operation written with the pipe
dt = mydata %>% select(Index, State) %>% sample_n(10)
GROUP_BY() FUNCTION :
Use : Group data by categorical variable
SYNTAX :
group_by(data, variables)
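group_by() is typically followed by summarise(), which then produces one row per group. The sketch below uses hypothetical data and column names and assumes the dplyr package is installed.

```r
library(dplyr)

mydata <- data.frame(Index = c("A", "A", "C"),
                     Y2014 = c(10, 30, 50))

# One output row per value of Index, with the group mean and group size.
result <- mydata %>%
  group_by(Index) %>%
  summarise(mean_2014 = mean(Y2014), n = n())
result
```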